BERT 可解释性

BERT 可解释性

分析可解释性的一些常见方法

Probing tasks：通过设计一些独立性较强的简单任务，来验证模型的不同层再不同Probing task 上的能力，进而得到模型对数据的深入解释
Visualization：一般都是通过可视化 Attention 系列来对模型做出解释。
Adversarial attacks：对抗性攻击通过使用特定干扰信息创建的例子来验证模型的鲁棒性

What does BERT Learn from Multiple-Choice Reading Comprehension Datasets[1]

an un-readable data attack, in which we add keywords to confuse BERT, leading to a significant performance drop
an un-answerable data training, in which we train BERT on partial or shuffled input

1. Un-Readable Data Attack

The un- readable data is mainly obtained by randomly shuf- fling the word order of the input to make it gram- matically wrong and un-readable.

We first fine-tune BERT on the original MCRC data and then test it under adversarial attacks.

2. Un-Answerable Data Training

What does BERT learn about the structure of language[2]

本文通过设计不同粒度的任务探索 BERT 的不同层的向量能够扑捉到什么样的信息。

浅层网络能够捕捉到 phrase-level 信息，而这部分信息在高层被稀释了
中层网络能够捕捉到语法结构信息
顶层网络能够扑捉到语义信息
BERT 需要深层网络才能学习到长距离的依赖

1. Phrasal Syntax

这部分主要验证 BERT 能否扑捉到短语级别的结构信息。对于短语片段$(si,s_j)$，通过 concat 片段在 l 层的第一个向量$h_i$和最后一个向量$s_j$进行，得到该短语的向量表示$S{(s_i, s_j)}, l$。通过对CoNLL 2000 chunking dataset 进行随机抽取，可视化结果如下：

NP (noun phrase) # 名词短语
VP (verb phrase) # 动词短语
PP (prepositional phrase) # 介词短语
ADVP (adverb phrase)  # 副词短语
SBAR (subordinated clause) # 从句
ADJP (adjective phrase) # 形容词短语
PRT (particles) # 
CONJP (conjunction phrase) # 连词
INTJ (interjection)
LST (list marker)
UCP (unlike coordinated phrase)

从上图中可以得到：BERT 浅层能够扑捉到 phrase-level information，而随着层的加深，此特征被稀释了。

2. 探测任务(Probing Tasks)

Probing Tasks 主要是用来评估BERT每层向量对于不同语言信息的扑捉能力。本文的探测任务主要有十个，可以分为三组：

Surface tasks（表层任务）：SentLen（）， WC
Syntactic tasks（句法任务）：BShift， TreeDepth， TopConst
Semantic tasks（语义任务）：Tense， SubjNum，SOMO， CoordInv

3. Subject-Verb Agreement

4. Compositional Structure

Open Sesame: Getting Inside BERT’s Linguistic Knowledge

BERT vs ELMO vs Flair [9]

本文主要探讨这三种上下文向量在 WSD（word sense disambiguation）上的表现，进而判断上下文向量能否解决 WSD 问题。


MFS	54.79	58.95	70.94	48.44

从上表中看出，BERT 要远高于baseline 模型，且要比其他上下文向量表现更佳。

从上图中我们得出如下结论：

The Flair embeddings hardly allow to distinguish any clusters as most senses are scattered across the entire plot.
In the ELMo embedding space, the major senses are slightly more separated in different regions of the point cloud
Only in the BERT embedding space, some senses form clearly separable clusters

BERT 的不足之处

看 Error analysis

BERT vs ELMO vs GPT-2

BERT Attention 学到了什么

https://zhuanlan.zhihu.com/p/148729018

A Primer in BERTology: What we know about how BERT works

Reference

[1] What does BERT Learn from Multiple-Choice Reading Comprehension Datasets? — 2019-5

[2] What does BERT learn about the structure of language? — 2019-6-4

[3] Open Sesame: Getting Inside BERT’s Linguistic Knowledge - 2019-6

[4] What Does BERT Look At? An Analysis of BERT’s Attention — 2019-6-11

[5] A multiscale visualization of attention in the transformer model —2019-6-12

[6] BERT Rediscovers the Classical NLP Pipeline — 2019-8

[7] How Does BERT Answer Questions? A Layer-Wise Analysis of Transformer Representations — 2019-9

[9] Does BERT make any sense? interpretable word sense disambiguation with contextualized embeddings

[10] Contextual Embeddings: When Are They Worth It? — 2020-3-18

[11] A Primer in BERTology: What we know about how BERT works —2020-2-27 综述

How Contextual are Contextualized Word Representations?

Linguistic Knowledge and Transferability of Contextual Representations

https://www.jiqizhixin.com/articles/2019-09-09-6