Beyond Accuracy: Behavioral Testing of NLP Models with CheckList

ACL 2020 Best Paper

0. Summary

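Problem addressed

Datasets are biased relative to real-world data, so accuracy alone cannot fully describe how good an NLP model is. A general and comprehensive evaluation framework for NLP models is therefore needed.

Main contributions

Following the testing principles of software engineering, the paper proposes the CheckList method^[1], which evaluates NLP models along two dimensions: capabilities and test types.
It also provides a tool for quickly generating large numbers of test cases.

Main content

The paper explains the CheckList method and lists the minimum set of items to test along the capabilities and test types dimensions.
It runs tests on three tasks: sentiment analysis (Sentiment), duplicate question detection (QQP), and machine comprehension (MC). Under the traditional metric (accuracy), existing models perform on par with humans, yet CheckList exposes many defects in them.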

1. Background

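Generalization is one of the main goals of NLP models. However, traditional evaluation methods struggle to measure it:

The standard validation approach is a train-test split of a dataset, with generalization estimated from metrics on the held-out test set. However, the dataset's own distribution is often biased relative to the real world. Test-set performance therefore only reflects performance on the dataset's distribution and tends to overestimate real-world performance. Moreover, this approach only measures overall performance; it cannot tell us in which situations the model fails, so it offers little guidance for further improvement.
Aggregate statistics such as accuracy and F1 make it hard to understand where a model's weaknesses lie.
Additional evaluation approaches, such as testing robustness to noise and adversarial attacks, fairness, or logical consistency, each assess only one specific task or capability and are not comprehensive. CheckList offers a comprehensive and systematic way to test a range of capabilities. However, CheckList does not cover non-behavioral issues such as data versioning problems, labeling errors, annotator biases, worst-case security issues, or lack of interpretability.
Challenge datasets, i.e. datasets containing extremely hard examples. These mainly target Natural Language Inference, are small, and deviate considerably from the real world. CheckList can compensate for these shortcomings. Template-based generation gives fine control over the semantics of test cases, while the perturbation-based INV and DIR tests can be run on unlabeled real-world examples. MFTs additionally focus on how the model handles simple cases, which helps users locate problems in the model.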

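The first point in particular explains why models that reach high accuracy on large benchmarks such as GLUE, sometimes even approaching human level, are still found to have many defects.^[2]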

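For example:

Shortcuts/right for wrong reasons. The model has not truly understood the input; it has learned certain shortcuts instead.
Semantically equivalent adversaries. The model is not robust to adversarial inputs that preserve the meaning.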
Lack of consistency.

To address these problems, the paper proposes CheckList, a general and comprehensive framework for evaluating NLP models.

2. Method

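In software engineering, we check that every smallest unit works correctly. Similarly, CheckList tests each capability of a model, on the premise that the capabilities of a good model are task-independent. The approach also treats the model as a black box, so it can evaluate commercial models whose internals are not public. For each capability, CheckList designs several test types. In the matrix below, each row corresponds to one capability, each column to one test type, and each cell holds the model's failure rate on that test.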

Capabilities	MFT	INV	DIR
Vocabulary+POS	Pos/Neg: 15%, Neutral: 7.6%
NER
Negation	Easy: 49.2%
…

2.1 Capabilities

CheckList tests a language model's universal linguistic capabilities separately, including but not limited to:

Vocabulary+POS (important words or word types for the task)
Taxonomy (synonyms, antonyms, etc.)
Robustness (to typos, irrelevant changes, etc.)
NER (appropriately understanding named entities)
Fairness (not biasing towards certain gender/race groups)
Temporal (understanding order of events)
Negation (understand the negation words)
Coreference (resolve ambiguous pronouns, etc.)
Semantic Role Labeling (understanding roles such as agent, object, etc.)
Logic (ability to handle symmetry, consistency, and conjunctions)

2.2 Test Types

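Users are advised to use at least the following three test types:

Minimum Functionality test (MFT). Create simple, basic, labeled sentences to evaluate the model. Mainly used to check whether the model has really learned the capability or has only found shortcuts.
Invariance test (INV). Perturb existing sentences in ways that should leave the label unchanged (the original label does not need to be known).
Directional Expectation test (DIR). Perturb existing sentences in ways that should change the prediction in a known direction (the original label does not need to be known).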
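The three test types differ mainly in how a failure is decided. The following is a minimal plain-Python sketch of those decision rules, not the CheckList library's API. The keyword-counting `toy_model`, the word lists, and the 0.1 tolerance default are assumptions made only so the example runs.

```python
from typing import Callable, Sequence, Tuple

Model = Callable[[str], float]  # returns P(positive) in [0, 1]

# Hypothetical keyword-counting "model", used only so the sketch runs.
POSITIVE_WORDS = {"great", "enjoyed"}
NEGATIVE_WORDS = {"hated", "lame", "dread"}


def toy_model(text: str) -> float:
    words = [w.strip(".,!?").lower() for w in text.split()]
    score = 0.5
    score += 0.2 * sum(w in POSITIVE_WORDS for w in words)
    score -= 0.2 * sum(w in NEGATIVE_WORDS for w in words)
    return min(1.0, max(0.0, score))


def mft_failure_rate(model: Model, cases: Sequence[Tuple[str, bool]]) -> float:
    """MFT: labeled cases; a case fails when the predicted polarity differs from the label."""
    failures = sum((model(text) > 0.5) != is_positive for text, is_positive in cases)
    return failures / len(cases)


def inv_failure_rate(model: Model, pairs: Sequence[Tuple[str, str]], tol: float = 0.1) -> float:
    """INV: an (original, perturbed) pair fails when the prediction moves by more than tol."""
    failures = sum(abs(model(a) - model(b)) > tol for a, b in pairs)
    return failures / len(pairs)


def dir_failure_rate(model: Model, pairs: Sequence[Tuple[str, str]], tol: float = 0.1) -> float:
    """DIR (negative text appended): a pair fails when P(positive) rises by more than tol."""
    failures = sum(model(b) - model(a) > tol for a, b in pairs)
    return failures / len(pairs)


mft_cases = [
    ("This was a great flight.", True),
    ("I hated this seat.", False),
    ("The cabin crew was not great.", False),
]
inv_pairs = [("We got on a different flight to Chicago.", "We got on a different flight to Dallas.")]
dir_pairs = [("Service wasn't great.", "Service wasn't great. You are lame.")]

print(mft_failure_rate(toy_model, mft_cases))
print(inv_failure_rate(toy_model, inv_pairs))
print(dir_failure_rate(toy_model, dir_pairs))
```

Note that only the MFT needs labels; INV and DIR compare the model's predictions before and after the perturbation, which is why they can run on unlabeled text. The toy model fails the "not great" case because it ignores negation.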

2.3 Examples

The following examples illustrate how the CheckList matrix works:

Vocab/Pos + MFT
- Pos/Neg test
  
  This was a great flight. (positive)
  
  I hated this seat. (negative)
- Neutral test
  
  This is a commercial flight. (neutral)
  
  I flew to Indiana yesterday. (neutral)
Negation + MFT

The cabin crew was not great. (negative)

I can’t say I enjoyed the food. (negative)
NER + INV

In the Sentiment task, changing a location name should not change the model's prediction.

@AmericanAir thank you we got on a different flight to ~~Chicago~~ Dallas.

@VirginAmerica I can’t lose my luggage, moving to ~~Brazil~~ Turkey soon.
Vocab/Pos + DIR

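In the Sentiment task, appending a simple negative sentence should make the predicted sentiment decrease monotonically, i.e. it should never become more positive.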

@AmericanAir service wasn’t great. You are lame.

@JetBlue why won’t YOU help them?! Ugh. I dread you.
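The INV and DIR examples above are produced by perturbing existing tweets. Below is a small sketch of two such perturbations in plain Python. The hand-written `KNOWN_LOCATIONS` list and both function names are hypothetical; the real tool is not assumed to work this way.

```python
import re
from typing import Optional

# Hypothetical hand-written location list, used here only for illustration.
KNOWN_LOCATIONS = ["Chicago", "Dallas", "Brazil", "Turkey"]


def swap_location(text: str, new_location: str) -> Optional[str]:
    """INV perturbation: replace one known location found in the text."""
    for location in KNOWN_LOCATIONS:
        pattern = rf"\b{re.escape(location)}\b"
        if location != new_location and re.search(pattern, text):
            return re.sub(pattern, new_location, text, count=1)
    return None  # nothing to perturb, so the sentence is skipped


def append_negative(text: str, phrase: str = "You are lame.") -> str:
    """DIR perturbation: append a clearly negative sentence."""
    return f"{text.rstrip()} {phrase}"


original = "@AmericanAir thank you we got on a different flight to Chicago."
print(swap_location(original, "Dallas"))
print(append_negative("@AmericanAir service wasn't great."))
```

Each (original, perturbed) pair is then fed to the model: for INV the prediction should stay the same, for DIR it should not become more positive.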

2.4 CheckList as a tool

Alongside the evaluation framework, CheckList open-sources a tool on GitHub for generating test data at scale.

2.4.1 Templates

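CheckList uses templates to generate test data. For example:

This is a good book

can be abstracted into the following template

This is a {POS} {THING}

Here, {POS} is a positive adjective, such as good, great, terrific.

{THING} can be any item, such as book, film, movie.

Taking the Cartesian product of the slot values then yields a large number of test cases.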

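from checklist.editor import Editor
editor = Editor()  # reused by the later snippets in this section
ret = editor.template('This is {a:adj} movie.', adj=['good', 'great', 'awesome', 'excellent'])  # {a:adj} picks "a" or "an" to match the adjective
ret.data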

['This is a good movie.',
 'This is a great movie.',
 'This is an awesome movie.',
 'This is an excellent movie.']
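To make the Cartesian-product idea concrete, here is a minimal sketch that expands the `{POS} {THING}` template with the standard library only. `expand_template` is a hypothetical helper, not part of CheckList, and it does not handle the a/an article logic that `{a:adj}` provides.

```python
import re
from itertools import product
from typing import List, Sequence


def expand_template(template: str, **slots: Sequence[str]) -> List[str]:
    """Fill every {SLOT} with every combination of the supplied values."""
    # Slot names in order of first appearance, without repeats.
    names = list(dict.fromkeys(re.findall(r"\{(\w+)\}", template)))
    combos = product(*(slots[name] for name in names))
    return [template.format(**dict(zip(names, combo))) for combo in combos]


samples = expand_template(
    "This is a {POS} {THING}",
    POS=["good", "great", "terrific"],
    THING=["book", "film", "movie"],
)
print(len(samples))
print(samples[:3])
```

Three adjectives and three nouns give nine sentences; adding a third slot multiplies the count again, which is how a handful of word lists turns into a large test set.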

2.4.2 Lexicons

In the example above, the replacement values for {adj} have to be specified by hand. CheckList also ships with some built-in lexicons, including:

First, last names: by race, sex
Countries, nationalities: by income, continent
US cities: by population
Religions: both nouns (Christianity) and adjs (Christian)
Sexuality adjs: gay, straight, bisexual, etc


ret = editor.template('{male} is not friends with {female}')
ret.data[0:4]

['Michael is not friends with Jennifer',
 'Michael is not friends with Jessica',
 'Michael is not friends with Ashley',
 'Michael is not friends with Sarah']

Custom lexicons can also be added:

editor.add_lexicon('adj', ['good', 'bad', 'great', 'terrible'])
ret = editor.template('{adj} is not the same as {adj2}', remove_duplicates=True)
ret.data[:4]

['good is not the same as bad',
 'good is not the same as great',
 'good is not the same as terrible',
 'bad is not the same as good']
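The output above is consistent with taking every ordered pair of distinct lexicon entries. A plain-Python sketch of that reading follows; it is an assumption about the effect of `remove_duplicates=True` inferred from the listed output, not the library's implementation.

```python
from itertools import permutations

adjectives = ["good", "bad", "great", "terrible"]

# Ordered pairs of distinct entries: (good, bad), (good, great), ...
pairs = [
    f"{first} is not the same as {second}"
    for first, second in permutations(adjectives, 2)
]
print(len(pairs))
print(pairs[:4])
```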

2.4.3 Masked Language Model Suggestion

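Besides filling template slots by hand, a masked language model such as RoBERTa can fill the masked parts of a template, producing a more diverse dataset. For example

This is a good book

can be abstracted into the following template

This is a {POS} {MASK}

where the {MASK} part is filled in automatically by the language model.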

ret = editor.template('This is {a:adj} {mask} {mask}.', remove_duplicates=True)
ret.data[:5]

['This is a good history lesson.',
 'This is a good chess move.',
 'This is a good news story.',
 'This is a good programming language.',
 'This is a good data set.']

2.4.4 Wordnet

You can also ask for context-specific suggestions based on WordNet categories (synonyms, antonyms, hypernyms, hyponyms):

editor.synonyms('My drink is hot.', 'hot')

['spicy', 'raging']

editor.hyponyms('My animal eats other animals.', 'animal')[:5]

['dog', 'pet', 'bird', 'prey', 'baby']

3. Experiments

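The authors tested commercial models from several companies (Microsoft, Google, and Amazon) as well as the open-source BERT and RoBERTa (RoB). The tests exposed defects in many of them.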

3.1 Sentiment

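The tests show that some models have clear deficiencies in certain capabilities.

In the Vocab/POS + DIR test, the authors append a negative sentence to the original text. A prediction counts as a failure if the predicted sentiment rises by more than 0.1. Google's model has a failure rate of 34.6% on this test.

In the Robustness + INV test, the authors append random URLs to the end of tweets; the prediction should stay unchanged. Amazon's model has a failure rate as high as 24.8% on this test.

In the Temporal + MFT test, the authors use descriptions in which sentiment changes over time and check whether the model captures the current sentiment. All commercial models have high failure rates (above 35%).

In the Negation + MFT tests, the authors test several forms of negation, and all models fail in this category. In particular, when the negation comes at the end of the sentence (e.g. I thought the plane would be awful, but it wasn’t.), Microsoft and Amazon have a 100% failure rate. BERT and RoBERTa were trained on SST-2; even though 18% of the SST-2 validation set contains negation and both models exceed 90% accuracy there, they still do not pass the authors' negation tests.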
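A negation-at-the-end MFT can be sketched as a template whose expected label is a set rather than a single value. The slot values, the `keyword_baseline` stand-in classifier, and the choice of positive or neutral as the acceptable labels are assumptions for illustration; the paper's exact word lists are not reproduced here.

```python
from itertools import product
from typing import Callable, List, Sequence, Set, Tuple

# Hypothetical slot values.
THINGS = ["the plane", "the food", "the flight"]
NEG_ADJS = ["awful", "terrible"]

Case = Tuple[str, Set[str]]


def negation_at_end_cases() -> List[Case]:
    """Build (sentence, acceptable labels) pairs with the negation at the end."""
    return [
        (f"I thought {thing} would be {adj}, but it wasn't.", {"positive", "neutral"})
        for thing, adj in product(THINGS, NEG_ADJS)
    ]


def failure_rate(predict: Callable[[str], str], cases: Sequence[Case]) -> float:
    """A case fails when the predicted label is outside the acceptable set."""
    failures = sum(predict(sentence) not in acceptable for sentence, acceptable in cases)
    return failures / len(cases)


def keyword_baseline(sentence: str) -> str:
    """Stand-in classifier that only looks for negative adjectives."""
    return "negative" if any(adj in sentence for adj in NEG_ADJS) else "neutral"


cases = negation_at_end_cases()
print(cases[0][0])
print(failure_rate(keyword_baseline, cases))
```

A classifier that keys on the negative adjective and ignores the trailing "but it wasn't" fails every case, which is the kind of shortcut this test is meant to surface.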

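In the Fairness + MFT test, the authors use the template “I am a {PROTECTED} {NOUN}.” to check whether a model discriminates against particular groups. The commercial models pass this test, but the open-source models always predict negative for the words black, atheist, gay, and lesbian.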

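3.2 Quora Question Pairs (QQP)

BERT and RoBERTa both exceed human-level performance on the QQP benchmark, but the CheckList tests suggest that these models more likely reach such high accuracy by learning shortcuts.

In the Vocab + MFT test, the authors add an adjective to change the focus of a question, for example changing Is Patrick Thomas a teacher? to Is Patrick Thomas an accredited teacher?. The two questions should be non-duplicates, yet both models have failure rates as high as 78%.

In the NER + DIR test, the authors either keep the named entity and change the rest of the question, or keep the question and change only the named entity, producing non-duplicate questions. BERT and RoB have failure rates of 30.0% and 32.8%, respectively.

In the Temporal + MFT test, the authors swap before and after in a sentence to produce non-duplicate questions. BERT's failure rate reaches 98%.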
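The before/after swap is simple enough to sketch directly. The function name and the example question are hypothetical; the sketch only handles lowercase "before" and "after".

```python
import re
from typing import Optional, Tuple

SWAP = {"before": "after", "after": "before"}


def swap_before_after(question: str) -> Optional[Tuple[str, str]]:
    """Return (original, swapped) if the question contains before/after, else None."""
    swapped = re.sub(r"\b(before|after)\b", lambda m: SWAP[m.group(1)], question)
    return None if swapped == question else (question, swapped)


pair = swap_before_after("Is it unhealthy to eat after 10pm?")
print(pair)
print(swap_before_after("Is Patrick Thomas a teacher?"))
```

Every returned pair is expected to be labeled non-duplicate, so a model that predicts "duplicate" because the two questions share almost all of their words fails the test.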

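In the Semantic Role Labeling + MFT test, the authors convert sentences between active and passive voice, either preserving or reversing the meaning. Both models fail this test as well.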

3.3 Machine comprehension (MC)

A machine comprehension example consists of a context and a question. BERT fails all of the Taxonomy, Fairness, Temporal, Negation, and Coreference tests.

3.4 User Evaluation

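The authors evaluate whether their open-source tool helps users (developers, researchers, customers, etc.) find more defects in models. The experiments show that with the CheckList tool, users complete more tests and find more bugs in the same amount of time.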

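4. Commentary

Overall, CheckList offers a general and comprehensive evaluation method that can uncover defects across different tasks and models. The existing experiments show that both commercial and academic SOTA models still have many defects that CheckList can find, and more undiscovered defects may remain. Causes of these defects include:

Problems in dataset annotation. For example, the annotations themselves may be discriminatory and lack fairness.
Insufficient dataset coverage. In a high-dimensional semantic space, too little data makes it hard for a model to generalize.

CheckList provides a matrix that lets users evaluate and compare models from multiple angles, filling each cell with the model's failure rate on the corresponding test. However, we cannot judge directly from the failure rate whether a model has mastered a capability, for the following reasons:

The failure rate differs across test cases.
How high a failure rate should count as success or failure? This judgment is subjective.
Each case needs its own analysis. For example, if the failure rate is high but the model only errs on rare tokens, the result can be regarded as a success; if the failure rate is low but the model errs on a common token, it should still be regarded as a failure.
Failure does not mean the model lacks the capability. Linguistic capabilities are intertwined, so one should try to isolate the confounding factors further, for example through INV tests.
Success does not mean the model has the capability. Test cases are not comprehensive; they only give more confidence that the basics work.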
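One way to act on the point about rare versus common tokens is to break a cell's failure rate down by the slot value that produced each test case. The sketch below does this with made-up outcomes; the numbers are for illustration only and are not results from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def failure_rate_by_value(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """results: (slot value used in the test case, whether the case failed)."""
    totals: Dict[str, int] = defaultdict(int)
    failures: Dict[str, int] = defaultdict(int)
    for value, failed in results:
        totals[value] += 1
        failures[value] += int(failed)
    return {value: failures[value] / totals[value] for value in totals}


# Made-up outcomes: a common word with a few failures, a rare word that always fails.
results = [("good", False)] * 8 + [("good", True)] * 2 + [("sesquipedalian", True)] * 2

overall = sum(failed for _, failed in results) / len(results)
print(round(overall, 3))
print(failure_rate_by_value(results))
```

Here the aggregate rate of about one third hides that the common word fails 20% of the time while the rare word fails every time, which is exactly the distinction a single number in the matrix cannot express.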

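5. Related works

NLP evaluation methods before CheckList often focused on a single aspect: robustness to noise (Belinkov and Bisk, 2018; Rychalska et al., 2019) or adversarial changes (Ribeiro et al., 2018; Iyyer et al., 2018), fairness (Prabhakaran et al., 2019), logical consistency (Ribeiro et al., 2019), explanations (Ribeiro et al., 2016), diagnostic datasets (Wang et al., 2019b), and interactive error analysis (Wu et al., 2019). CheckList integrates some of these earlier methods and, for the first time, proposes a unified and comprehensive framework and tool for evaluating NLP models, revealing many defects in the SOTA models of the time.

The models tested by CheckList, BERT for example, have on the order of 100M parameters. After 2020, large language models (LLMs) became popular, with parameter counts rising to the 10B-100B range, and NLP evaluation work has largely shifted to evaluating LLMs.^[3] For example:

Beyond the Imitation Game benchmark (BIG-bench). BIG-bench currently contains 204 tasks contributed by 450 authors from 132 institutions. Task topics are diverse, covering problems from linguistics, childhood development, mathematics, common-sense reasoning, biology, physics, social bias, software development, and other fields.^[4]
Holistic Evaluation of Language Models (HELM). HELM measures 7 metrics (accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency) for each of 16 core scenarios. The evaluation covers 30 prominent language models.^[5]
PromptBench. PromptBench proposes adversarial attacks targeting LLMs, generating 4788 adversarial prompts and evaluating them in detail across 8 tasks and 13 datasets. The results show that contemporary LLMs are not robust to adversarial prompts.^[6]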

6. A small CheckList-style experiment on ChatGPT 3.5

Semantic Role Labeling + MFT：

	Microsoft	Google	Amazon	BERT	RoB	ChatGPT (5 samples)
Error Rate (%)	96.8	90.8	81.6	55.4	54.8	20.0

Failure example:

User

Is the following context positive or negative ? You should answer “positive”, “neutral” or “negative”.

Do I think this lie is bad? No.

ChatGPT

Negative

Negation + MFT:

	Microsoft	Google	Amazon	BERT	RoB	ChatGPT (5 samples)
Error Rate (%)	100.0	90.4	100.0	84.8	7.2	20.0

Failure example:

User

Is the following context positive or negative ? You should answer “positive”, “neutral” or “negative”.

I thought I would hate you, but I didn’t.

ChatGPT

Negative

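Overall, because large language models are trained on far more data than ordinary language models, their generalization should in theory be better as well. In this small experiment, ChatGPT still makes mistakes on individual cases, but with only 5 samples per test it appears, on the whole, far better than the language models evaluated in CheckList.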
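For readers who want to repeat the experiment, the check applied to each reply can be sketched as follows. The prompt text and the two failure cases are the ones reported above; the helper names and the choice of positive or neutral as the acceptable labels are assumptions, and no API call is made.

```python
from typing import Set

PROMPT_PREFIX = (
    "Is the following context positive or negative ? "
    "You should answer “positive”, “neutral” or “negative”."
)


def build_prompt(sentence: str) -> str:
    """Prompt sent to the chat model: the instruction, a blank line, the sentence."""
    return f"{PROMPT_PREFIX}\n\n{sentence}"


def is_failure(answer: str, acceptable: Set[str]) -> bool:
    """Normalize the model's reply and compare it with the acceptable labels."""
    return answer.strip().strip(".").lower() not in acceptable


# The two failure cases reported above; the acceptable labels are an assumption.
reported = [
    ("Do I think this lie is bad? No.", "Negative", {"positive", "neutral"}),
    ("I thought I would hate you, but I didn’t.", "Negative", {"positive", "neutral"}),
]
outcomes = [is_failure(answer, acceptable) for _, answer, acceptable in reported]
print(outcomes)

# One failure among 5 samples gives the 20.0% shown in the tables.
error_rate = 100 * 1 / 5
print(error_rate)
```

With 5 samples each failure moves the error rate by 20 points, so the ChatGPT column is a coarse estimate compared with the other columns.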

References

[1] Ribeiro, Marco Tulio, et al. Beyond accuracy: Behavioral testing of NLP models with CheckList. arXiv preprint arXiv:2005.04118 (2020).
[2] AITimer-鸽鸽. ACL 2020最佳论文：一种全新的NLP模型测试方法CheckList (ACL 2020 Best Paper: CheckList, a new method for testing NLP models). WeChat Official Accounts Platform, 2021, mp.weixin.qq.com/s/b0S5Fy8W338RfdaI2zZKdg.
[3] Chang, Yupeng, et al. A Survey on Evaluation of Large Language Models. arXiv preprint arXiv:2307.03109 (2023).
[4] Srivastava, Aarohi, et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615 (2022).
[5] Liang, Percy, et al. Holistic evaluation of language models. arXiv preprint arXiv:2211.09110 (2022).
[6] Zhu, Kaijie, et al. PromptBench: Towards evaluating the robustness of large language models on adversarial prompts. arXiv preprint arXiv:2306.04528 (2023).

https://blog.wintertee.top/机器学习/arXiv-2005-04118/

Author: WinterTee

Published: April 20, 2024

Updated: May 6, 2024
