Earyant's Tech Blog

Welcome to Earyant's tech blog, where I share new technology with you.

Pre-trained Language Models: Interpretability

BERT Interpretability

Common approaches for analyzing interpretability

  • Probing tasks: design simple, relatively independent tasks and test how well each layer of the model performs on them, thereby revealing what information the model extracts from the data.
  • Visualization: explain the model by visualizing its attention weights (a minimal sketch follows this list).
  • Adversarial attacks: verify the model's robustness with examples crafted to contain specific, misleading perturbations.
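
A minimal sketch of the visualization route: pull the per-layer, per-head attention matrices out of a pre-trained BERT and see where each token attends. The Hugging Face transformers API, the model name, and the layer/head picked are my assumptions for illustration, not something the papers below prescribe.

# Sketch: extract BERT attention weights for inspection
# (assumes the Hugging Face `transformers` package).
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)

inputs = tokenizer("The cat sat on the mat.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions: one tensor per layer, each of shape
# (batch, num_heads, seq_len, seq_len)
layer, head = 5, 3  # arbitrary layer/head to inspect
attn = outputs.attentions[layer][0, head]
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for i, tok in enumerate(tokens):
    print(f"{tok:>8} attends most to {tokens[attn[i].argmax().item()]}")

Tools such as BertViz build interactive views on top of exactly these tensors.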

What does BERT Learn from Multiple-Choice Reading Comprehension Datasets? [1]

  • an un-readable data attack, in which we add keywords to confuse BERT, leading to a significant performance drop
  • an un-answerable data training, in which we train BERT on partial or shuffled input

1. Un-Readable Data Attack

The un-readable data is mainly obtained by randomly shuffling the word order of the input, making it grammatically wrong and un-readable.

  • We first fine-tune BERT on the original MCRC data and then test it under adversarial attacks.
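
A concrete sketch of the shuffling attack; the word-level split and the helper below are illustrative assumptions, not the paper's exact preprocessing:

import random

def shuffle_words(text: str, seed: int = 0) -> str:
    """Randomly permute the word order of `text`, destroying its grammar."""
    words = text.split()
    random.Random(seed).shuffle(words)
    return " ".join(words)

passage = "The quick brown fox jumps over the lazy dog."
print(shuffle_words(passage))  # e.g. "over quick dog. jumps the The brown lazy fox"

If performance barely drops on such un-readable input, the model is likely relying on keyword matching rather than actual comprehension.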

2. Un-Answerable Data Training

What does BERT learn about the structure of language? [2]

This paper designs probing tasks at different granularities to explore what information the vectors at different layers of BERT capture.

  • The lower layers capture phrase-level information, which is gradually diluted in the higher layers.
  • The middle layers capture syntactic information.
  • The top layers capture semantic information.
  • BERT needs its deeper layers to model long-distance dependencies.

1. Phrasal Syntax

This part verifies whether BERT captures phrase-level structural information. For a phrase span $(s_i, s_j)$, the span representation $s_{(s_i, s_j), l}$ is built by concatenating the first hidden vector $h_i$ and the last hidden vector $h_j$ of the span at layer $l$ (a code sketch follows at the end of this subsection). Spans are randomly sampled from the CoNLL 2000 chunking dataset and visualized below:

[Figure: visualization of span representations from different BERT layers, labeled by chunk type]

NP (noun phrase)
VP (verb phrase)
PP (prepositional phrase)
ADVP (adverb phrase)
SBAR (subordinated clause)
ADJP (adjective phrase)
PRT (particles)
CONJP (conjunction phrase)
INTJ (interjection)
LST (list marker)
UCP (unlike coordinated phrase)

The figure shows that BERT's lower layers capture phrase-level information, and that this feature is diluted as the layers get deeper.
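
A minimal sketch of the span representation defined above: concatenate the first and last hidden vectors of a span at layer l. The Hugging Face transformers usage and the example indices are my assumptions; note that index 0 of hidden_states is the embedding layer, and token positions count the [CLS] token.

import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)

def span_representation(sentence: str, i: int, j: int, layer: int) -> torch.Tensor:
    """s_{(s_i, s_j), l}: concatenation of h_i and h_j at layer `layer`."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).hidden_states[layer][0]  # (seq_len, 768)
    return torch.cat([hidden[i], hidden[j]])              # (1536,)

vec = span_representation("The quick brown fox jumps .", i=1, j=4, layer=2)
print(vec.shape)  # torch.Size([1536])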

2. Probing Tasks

Probing tasks evaluate how well each layer's vectors capture different kinds of linguistic information. The paper uses ten probing tasks, grouped into three families (a toy probing sketch follows the list):

  • Surface tasks: SentLen (sentence length), WC (word content)
  • Syntactic tasks: BShift (bigram shift), TreeDepth (parse tree depth), TopConst (top-level constituents)
  • Semantic tasks: Tense, SubjNum (subject number), ObjNum (object number), SOMO (semantic odd-man-out), CoordInv (coordination inversion)
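
A toy sketch of the probing setup: freeze BERT, mean-pool one layer's token vectors into sentence features, and fit a linear probe to predict a property such as sentence length. The pooling choice, the toy data, and the scikit-learn probe are my illustrative assumptions; the paper runs the full SentEval probing suite.

import torch
from sklearn.linear_model import LogisticRegression
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)

def layer_features(sentences, layer):
    """Mean-pooled token vectors from one frozen BERT layer."""
    feats = []
    for s in sentences:
        inputs = tokenizer(s, return_tensors="pt")
        with torch.no_grad():
            h = model(**inputs).hidden_states[layer][0]
        feats.append(h.mean(dim=0).numpy())
    return feats

sents = ["A dog barks .", "The very old man slowly walked his dog home ."]
labels = [0, 1]  # e.g. short vs. long sentence-length bucket
probe = LogisticRegression().fit(layer_features(sents, layer=3), labels)

The probe's held-out accuracy, computed layer by layer, measures how much of that property each layer encodes.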

3. Subject-Verb Agreement

4. Compositional Structure

Open Sesame: Getting Inside BERT’s Linguistic Knowledge

BERT vs ELMo vs Flair [9]

This paper examines how these three kinds of contextual embeddings perform on WSD (word sense disambiguation), in order to judge whether contextual embeddings can solve the WSD problem.
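
The classification idea in [9] is nearest-neighbour lookup over contextualized vectors of the target word; below is a heavily simplified sketch. The helper name, the one-example-per-sense inventory, and taking a single wordpiece's vector are illustrative assumptions.

import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

def word_vector(sentence: str, word: str) -> torch.Tensor:
    """Contextual embedding of `word` (assumed to be a single wordpiece)."""
    inputs = tokenizer(sentence, return_tensors="pt")
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    idx = tokens.index(word)
    with torch.no_grad():
        return model(**inputs).last_hidden_state[0, idx]

# Tiny sense inventory for "bank": one annotated context per sense.
senses = {
    "river": word_vector("we sat on the bank of the river .", "bank"),
    "money": word_vector("she deposited cash at the bank .", "bank"),
}
query = word_vector("the boat drifted toward the bank .", "bank")
best = max(senses, key=lambda s: torch.cosine_similarity(senses[s], query, dim=0).item())
print(best)  # expected: "river"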

[Table: WSD results per dataset; the MFS (most frequent sense) baseline row reads 54.79 / 58.95 / 70.94 / 48.44]

The table shows that BERT scores far above the MFS baseline and also outperforms the other contextual embeddings.

[Figure: 2-D visualization of contextualized embeddings of an ambiguous word, colored by sense, for Flair, ELMo, and BERT]

From the figure we can draw the following conclusions:

  • The Flair embeddings hardly allow us to distinguish any clusters, as most senses are scattered across the entire plot.
  • In the ELMo embedding space, the major senses are slightly better separated into different regions of the point cloud.
  • Only in the BERT embedding space do some senses form clearly separable clusters.

Where BERT Falls Short

See the error analyses in the papers above.

BERT vs ELMo vs GPT-2

What does BERT's attention learn?

https://zhuanlan.zhihu.com/p/148729018

A Primer in BERTology: What we know about how BERT works

References

[1] What does BERT Learn from Multiple-Choice Reading Comprehension Datasets? — 2019-5

[2] What does BERT learn about the structure of language? — 2019-6-4

[3] Open Sesame: Getting Inside BERT's Linguistic Knowledge — 2019-6

[4] What Does BERT Look At? An Analysis of BERT’s Attention — 2019-6-11

[5] A Multiscale Visualization of Attention in the Transformer Model — 2019-6-12

[6] BERT Rediscovers the Classical NLP Pipeline — 2019-8

[7] How Does BERT Answer Questions? A Layer-Wise Analysis of Transformer Representations — 2019-9

[9] Does BERT Make Any Sense? Interpretable Word Sense Disambiguation with Contextualized Embeddings

[10] Contextual Embeddings: When Are They Worth It? — 2020-3-18

[11] A Primer in BERTology: What We Know About How BERT Works — 2020-2-27 (survey)

How Contextual are Contextualized Word Representations?

Linguistic Knowledge and Transferability of Contextual Representations

https://www.jiqizhixin.com/articles/2019-09-09-6


https://zhuanlan.zhihu.com/p/74515580

https://lsc417.com/2020/06/19/paper-reading3/#prevalence-of-unseen-words

https://zhuanlan.zhihu.com/p/145695511

https://zhuanlan.zhihu.com/p/110085059

You are welcome to follow my other publishing channels.