
Paper Reading List

Step1: NMT, Attention and Transformer

Reference Blogs

TODO: 2014-Google-Sequence to Sequence

Sequence to Sequence Learning with Neural Networks


Neural Machine Translation by Jointly Learning to Align and Translate

This paper was the first to propose the concept of attention. It opens by defining neural machine translation (NMT) as an approach that "attempts to build and train a single, large neural network that reads a sentence and outputs a correct translation."

It then points out that most NMT models of the time were encoder-decoder networks: "An encoder neural network reads and encodes a source sentence into a fixed-length vector. A decoder then outputs a translation from the encoded vector." Under this framework, "a neural network needs to be able to compress all necessary information of a source sentence into a fixed-length vector." This creates a potential problem: when the source sentence is too long, the fixed-length vector cannot represent all of its information, so a translation that relies on this vector alone is unlikely to perform well, as shown in Figure 1 below:


Next comes the description of the proposed method: "The most important distinguishing feature of this approach from the basic encoder–decoder is that it does not attempt to encode a whole input sentence into a single fixed-length vector. Instead, it encodes the input sentence into a sequence of vectors and chooses a subset of these vectors adaptively while decoding the translation. This frees a neural translation model from having to squash all the information of a source sentence, regardless of its length, into a fixed-length vector."

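The figure above (taken from the paper) adds to this explanation. The encoder is a bidirectional RNN, and the decoder is still an RNN. The difference is that the full content of the source sentence is no longer forced into a single fixed-length vector. Instead, an alignment model is added: each time \(y_t\) is predicted, it aggregates information from the encoder's hidden states (the annotations, formed by concatenating the forward and backward RNN states) and feeds the result to the decoder as input.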
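
To make the aggregation step concrete, it can be written roughly as follows. This is a sketch in the \(y_t\) indexing used above (the paper itself indexes target positions with \(i\)), where \(s_{t-1}\) is the previous decoder hidden state, \(h_j\) is the annotation of the \(j\)-th source word, \(T_x\) is the source length, and \(a\) is the alignment model:

\[
e_{tj} = a(s_{t-1}, h_j), \qquad
\alpha_{tj} = \frac{\exp(e_{tj})}{\sum_{k=1}^{T_x} \exp(e_{tk})}, \qquad
c_t = \sum_{j=1}^{T_x} \alpha_{tj} \, h_j
\]

The context vector \(c_t\) is a weighted average of the annotations, and the weights \(\alpha_{tj}\) are recomputed for every target position instead of being fixed once for the whole sentence.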

The paper then explains why this approach works: "Intuitively, this implements a mechanism of attention in the decoder. The decoder decides parts of the source sentence to pay attention to. By letting the decoder have an attention mechanism, we relieve the encoder from the burden of having to encode all information in the source sentence into a fixed-length vector. With this new approach the information can be spread throughout the sequence of annotations, which can be selectively retrieved by the decoder accordingly."

On how the alignment model is implemented: "We parametrize the alignment model as a feedforward neural network which is jointly trained with all the other components of the proposed system."
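
A minimal NumPy sketch of such a feedforward alignment model is given below. It assumes a single-hidden-layer scorer of the form \(v_a^\top \tanh(W_a s + U_a h_j)\); the function name `additive_attention`, the variable names, and the dimensions are illustrative assumptions, and the weights are random rather than trained, so only the shapes and the normalization are meaningful.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    shifted = x - np.max(x)
    exps = np.exp(shifted)
    return exps / exps.sum()


def additive_attention(s_prev, H, W_a, U_a, v_a):
    """Score every annotation against the previous decoder state.

    s_prev: (n,)     previous decoder hidden state
    H:      (T, 2m)  annotations, one row per source position
    W_a:    (d, n)   projection of the decoder state (hypothetical name)
    U_a:    (d, 2m)  projection of the annotations (hypothetical name)
    v_a:    (d,)     output weights of the scorer (hypothetical name)
    """
    # One hidden layer with tanh, then a linear map to a scalar score per position.
    scores = np.tanh(W_a @ s_prev + H @ U_a.T) @ v_a   # shape (T,)
    weights = softmax(scores)                          # attention weights, sum to 1
    context = weights @ H                              # weighted sum of annotations, shape (2m,)
    return context, weights


# Toy dimensions, chosen only for illustration.
rng = np.random.default_rng(0)
T, n, m, d = 5, 4, 3, 8
H = rng.normal(size=(T, 2 * m))
s_prev = rng.normal(size=n)
W_a = rng.normal(size=(d, n))
U_a = rng.normal(size=(d, 2 * m))
v_a = rng.normal(size=d)

context, weights = additive_attention(s_prev, H, W_a, U_a, v_a)
print("attention weights:", weights)
print("context vector shape:", context.shape)
```

In a real system these parameters would be learned jointly with the encoder and decoder by backpropagation, which is what "jointly trained" refers to in the quote above.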

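In summary, attention here means that the model keeps "attending to" the source sentence (the input) throughout prediction in order to predict better. The work of carrying out and coordinating this attending is delegated to the implementation of the attention mechanism, which in this paper is the alignment model.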

TODO: 2017-Transformers

Attention Is All You Need

Step2: GPT Series


Improving Language Understanding by Generative Pre-Training


BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding


Language Models are Unsupervised Multitask Learners


Language Models are Few-Shot Learners

Step3: Engineering in LLM



