In December 2020 I attended COLING 2020 (the 28th International Conference on Computational Linguistics) to present “Contextual BERT” at the TextGraphs-14 workshop and to learn about the latest advancements in natural language processing (NLP). This year the conference was not held in Barcelona but remotely, which shifted the focus away from poster sessions and hallway chats towards plain reading. After reading all 613 titles of the accepted papers, around 80 abstracts, and more than 20 papers, I came up with the following list of favorites. This is a very subjective choice, which largely depends on my research interests and prior knowledge. Others might still benefit from the filtering I have done, so here is the compilation:
Is MAP Decoding All You Need? The Inadequacy of the Mode in Neural Machine Translation Bryan Eikema | Wilker Aziz
The paper starts off by naming three pathologies of machine translation (MT): models underestimate sentence lengths (which they shouldn’t), a large beam size hurts performance (which it shouldn’t), and the inadequacy of the mode (which I did not fully understand). The authors investigate whether these pathologies are inherent to the training method or caused by the search strategy, which is most commonly beam search. In a statistical analysis that compares translations retrieved through beam search (an approximation of MAP decoding) to translations generated via ancestral sampling (which is unbiased), they find that the latter at least partially resolves the MT pathologies mentioned above. The paper is interesting because it deals with a fundamental aspect of sequence model inference, is very well written, and appears thoroughly researched. COLING recognized the paper by adding it to the list of outstanding papers.
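To make the MAP-vs-sampling distinction concrete, here is a toy sketch with an invented one-step token distribution (not the authors' models or data): the mode is the single most probable token, yet it can carry less than half of the total probability mass, which is the intuition behind questioning MAP decoding.

```python
import random

# Toy illustration (invented numbers, not the paper's models): contrast
# the MAP choice with unbiased ancestral sampling from the same
# distribution. The mode "a" is most probable but holds only 40% of the
# probability mass.
next_token_probs = {"a": 0.4, "b": 0.3, "c": 0.3}

def map_choice(dist):
    # MAP decoding degenerates to argmax for a single step
    return max(dist, key=dist.get)

def ancestral_sample(dist, rng):
    # draw a token in proportion to its probability (inverse CDF)
    r, acc = rng.random(), 0.0
    for token, p in dist.items():
        acc += p
        if r < acc:
            return token
    return token  # guard against floating-point rounding

rng = random.Random(0)
samples = [ancestral_sample(next_token_probs, rng) for _ in range(10_000)]
print(map_choice(next_token_probs))       # "a"
print(samples.count("a") / len(samples))  # ≈ 0.4: the mode is atypical
```

For real sequences the effect compounds over decoding steps, which is why the paper's statistical comparison of the two strategies is interesting.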
Lost in Back-Translation: Emotion Preservation in Neural Machine Translation Enrica Troiano | Roman Klinger | Sebastian Padó
This paper deals with emotion preservation in MT. Consider a sentence translated from a source language (S) to a target language (T). An emotion classifier would probably output different values for the S and T sentences, which is undesirable, because the translation is supposed to preserve the emotion. The authors from the University of Stuttgart find that state-of-the-art MT loses emotion information and propose a way of preserving it by reranking the top-k candidate translations in a post-processing step. The use of back-translation is (despite not being novel) very intriguing to me. The authors use it to solve the problem of not having identical emotion classifiers in different languages: they back-translate S → T → S and compare the emotion of the original and the round-tripped source sentence as a proxy for emotion loss. Besides that, the paper made me appreciate the difficulty of high-quality machine translation even more. Lastly, the amount of research on NLP style transfer surprised me: the related-work section lists more than ten papers on that topic, see e.g. Li et al. (2018), who make Amazon reviews more romantic or humorous.
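The back-translation proxy can be sketched in a few lines. Everything below is invented for illustration: `translate` and `emotion_score` are hypothetical stand-ins for a real MT system and a source-language emotion classifier, not the authors' actual models.

```python
# Hedged sketch of the S -> T -> S' proxy: because an identical emotion
# classifier does not exist in both languages, score only the source
# language and compare the original with the round-tripped sentence.
def translate(sentence, direction):
    # toy "translation" that drops an exclamation mark, losing intensity
    return sentence.rstrip("!") if direction == "en->de" else sentence

def emotion_score(sentence):
    # toy classifier: crudely treats "!" as a signal of higher arousal
    return 0.9 if sentence.endswith("!") else 0.4

source = "What a wonderful day!"
round_trip = translate(translate(source, "en->de"), "de->en")
emotion_loss = abs(emotion_score(source) - emotion_score(round_trip))
print(emotion_loss)  # 0.5 in this toy setup
```

In the paper this comparison drives the reranking of the top-k candidate translations; here it only illustrates why the round trip makes the loss measurable in one language.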
Tiny Word Embeddings Using Globally Informed Reconstruction Sora Ohashi | Mao Isogawa | Tomoyuki Kajiwara | Yuki Arase
When reading the title I noted down: “very small word embeddings, how does that work?” The paper answers the question: there is a good amount of work on compressing word embeddings and subsequently reconstructing the original embedding from the compressed representation combined with sub-words/characters. The reconstruction then serves as a substitute for the regular embedding, e.g., as a model input. The intention is to reduce memory requirements by holding a smaller embedding matrix in RAM than otherwise required. This short paper adds global information to the embedding reconstruction function, which allows compressing more aggressively or approximating the original embedding better. The loss for training the reconstruction model is defined such that the geometry of the embedding space is preserved (Eq. 2 in the paper).
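A geometry-preserving loss of this flavor can be sketched as follows. This is my paraphrase of the idea, not the paper's exact Eq. 2: besides matching each original vector, it penalizes changes in pairwise dot products, so the reconstructed space keeps the original's relative arrangement.

```python
# Sketch (my simplification, not the paper's exact formulation) of a
# reconstruction loss with a geometry-preservation term.
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def recon_loss(originals, reconstructions, alpha=0.5):
    # term 1: per-vector reconstruction error
    pointwise = sum(
        sum((a - b) ** 2 for a, b in zip(o, r))
        for o, r in zip(originals, reconstructions)
    )
    # term 2: preserve pairwise dot products, i.e. the space's geometry
    geometry = sum(
        (dot(originals[i], originals[j])
         - dot(reconstructions[i], reconstructions[j])) ** 2
        for i in range(len(originals))
        for j in range(i + 1, len(originals))
    )
    return pointwise + alpha * geometry

orig = [[1.0, 0.0], [0.0, 1.0]]
print(recon_loss(orig, orig))  # 0.0 for a perfect reconstruction
```

A real implementation would minimize this over reconstructions produced from the compressed embeddings plus sub-word features.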
Picking BERT’s Brain: Probing for Linguistic Dependencies in Contextualized Embeddings Using Representational Similarity Analysis Michael Lepori | R. Thomas McCoy
Contextualized embeddings are superior to fixed embeddings (e.g., word2vec) in that they encode the semantic and syntactic role of a word in a specific sentence. What exactly is captured in the embedding, however, is unknown. The authors from Johns Hopkins University find that linguistic dependencies are being encoded: for example, verbs encode their subjects, pronouns encode their antecedents, and sentence embeddings encode the sentence’s main verb the most. I find this paper interesting because it sheds light on the internal workings of contextualized embeddings, beyond the analysis of attention scores (“What does BERT look at?”). It is also pleasant to follow because the authors provide a tangible explanation of their approach.
Would you describe a leopard as yellow? Evaluating crowd-annotations with justified and informative disagreement Pia Sommerauer | Antske Fokkens | Piek Vossen
In the introduction the paper points out that “[annotated] datasets rarely provide indications about the difficulty and ambiguity on the level of annotated units”. The authors argue that the information provided by multiple annotators is rich and should not be reduced to a single label. I like this direction of thinking. In the fashion domain, ambiguity is omnipresent, and efforts towards modeling it can start with richer data.
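Keeping the disagreement instead of collapsing it can be sketched like this. The vote counts are invented; the point is that a soft label plus an entropy-based ambiguity score preserve information a single majority label throws away.

```python
import math
from collections import Counter

# Sketch: turn raw annotator votes into a soft label and an ambiguity
# score instead of a single majority label. Votes are invented.
def soft_label(votes):
    counts = Counter(votes)
    total = len(votes)
    return {label: c / total for label, c in counts.items()}

def ambiguity(votes):
    # Shannon entropy of the soft label, in bits
    probs = soft_label(votes).values()
    return -sum(p * math.log2(p) for p in probs)

clear = ["yellow"] * 5
contested = ["yellow", "yellow", "brown", "brown", "spotted"]
print(ambiguity(clear))      # 0.0: unanimous
print(ambiguity(contested))  # > 1 bit: justified disagreement
```

Downstream, such per-item scores could flag the "leopard is yellow?" kind of unit as genuinely ambiguous rather than noisily labeled.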
Fine-tuning BERT for Low-Resource Natural Language Understanding via Active Learning Daniel Grießhaber | Johannes Maucher | Ngoc Thang Vu
The paper deals with the scenario where unlabeled in-domain data is readily available but labeled training data is sparse – a highly relevant setting in applied research. The idea is to iteratively pick the unlabeled samples that are most “surprising” to the model (see Equation 1 of the paper) and have them labeled. The newly labeled samples become part of the training dataset, and the process starts anew.
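The selection step can be sketched with entropy as a common stand-in for "surprising" (the paper's exact criterion is its Equation 1; the pool and predictions below are invented):

```python
import math

# Hedged sketch of uncertainty-based sample selection for active
# learning: rank the unlabeled pool by predictive entropy and pick the
# top k for human labeling.
def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_for_labeling(pool, k):
    # pool: list of (sample_id, predicted class probabilities)
    return sorted(pool, key=lambda item: entropy(item[1]), reverse=True)[:k]

pool = [
    ("s1", [0.98, 0.02]),  # confident prediction -> low priority
    ("s2", [0.55, 0.45]),  # uncertain prediction -> high priority
    ("s3", [0.80, 0.20]),
]
picked = select_for_labeling(pool, 1)
print(picked[0][0])  # "s2"
```

After labeling, the picked samples join the training set, the model is fine-tuned again, and the loop repeats.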
Don’t take “nswvtnvakgxpm” for an answer – The surprising vulnerability of automatic content scoring systems to adversarial input Yuning Ding | Brian Riordan | Andrea Horbach | Aoife Cahill | Torsten Zesch
Scoring systems rate whether an answer to a question is correct. An example question: Explain how pandas in China are similar to koalas in Australia and how they both differ from pythons. The paper shows that common scoring systems are susceptible to adversarial attacks. Here, adversarial attacks are inputs that are obviously wrong to a human but scored as correct by the model. This differs from the common definition, where adversarial inputs are created through slight, often imperceptible modifications. It is interesting to me to see adversarial attacks in NLP, because they are more prominent in computer vision. The second paragraph of the related work in Section 2 lists a number of NLP papers on adversarial attacks.
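A toy scorer makes the vulnerability tangible. This is my own illustration of the attack idea, not any of the systems the paper evaluates: a scorer that rewards keyword overlap rates a keyword-stuffed non-answer as highly as a genuine one.

```python
# Invented example of why shallow content scoring is attackable: a
# bag-of-keywords scorer cannot tell a coherent answer from keyword
# stuffing padded with gibberish.
reference_keywords = {"pandas", "koalas", "bamboo", "eucalyptus", "pythons"}

def toy_score(answer):
    tokens = set(answer.lower().split())
    return len(tokens & reference_keywords) / len(reference_keywords)

genuine = "Pandas eat bamboo while koalas eat eucalyptus unlike pythons"
adversarial = "bamboo eucalyptus pandas koalas pythons nswvtnvakgxpm"
print(toy_score(genuine), toy_score(adversarial))  # both score 1.0
```

Real scoring models are more sophisticated than this, but the paper shows they fail in analogous ways on inputs no human would accept.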
Train Once, and Decode As You Like Chao Tian | Yifei Wang | Hao Cheng | Yijiang Lian | Zhihua Zhang
Current MT commonly uses an encoder-decoder setup, where the decoder autoregressively generates the model prediction of length n, requiring O(n) decoding steps. XLNet (Yang et al., 2019) introduced the idea of learning all possible factorization orders, not only the left-to-right one. Based on that, the authors suggest generating parts of the target sequence in parallel to speed up the process. The method offers the flexibility to trade off generation speed against quality. I find it particularly interesting how the authors dynamically select the next position at which to generate a word, by looking at the model’s confidence (the likelihood of its most probable word). Also, the empirical evaluation is very thorough: it compares the authors’ method to a number of other refinement-based, non-autoregressive models, as well as different training strategies.
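The confidence-driven position selection can be sketched as follows. The per-position confidences are a fixed invented table rather than a real model, and this simplified loop fills one position at a time, whereas the actual method generates several in parallel.

```python
# Sketch of confidence-based position selection: at each step, fill the
# yet-unfilled position where the model's best token has the highest
# probability. Confidences are invented for illustration.
def most_confident_position(confidences, filled):
    # confidences: per position, probability of that position's best token
    candidates = [(p, i) for i, p in enumerate(confidences) if i not in filled]
    return max(candidates)[1]

confidences = [0.6, 0.9, 0.7]  # model is surest about position 1
order, filled = [], set()
for _ in range(len(confidences)):
    pos = most_confident_position(confidences, filled)
    order.append(pos)
    filled.add(pos)
print(order)  # [1, 2, 0]: generation order follows confidence, not left-to-right
```

In the paper, generating the k most confident positions per step (instead of one) is what yields the speed/quality trade-off.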
GPolS: A Contextual Graph-Based Language Model for Analyzing Parliamentary Debates and Political Cohesion Ramit Sawhney | Arnav Wadhwa | Shivam Agarwal | Rajiv Ratn Shah
“This esoteric and tedious nature of political debates makes their analysis complex, forming a barrier to ordinary citizen’s insights into political stances and wide-ranging consequences they entail.” – besides being honored as an outstanding COLING paper, this research motivation alone makes the paper worth reading. More technically, the authors train BERT on debate speeches from the UK’s House of Commons and compare the similarities of the resulting speech representations across debates, topics, and speakers. The extracted feature vectors are arranged in a graph, which is processed by a graph network to classify whether a speaker approves of a topic or not.
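The graph-construction step can be sketched as a thresholded similarity graph. This is my reading of the general setup, with invented vectors and threshold, not the paper's exact edge definition:

```python
import math

# Sketch: connect speeches whose representation vectors are similar
# enough, yielding edges a graph network could later classify over.
def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    return num / (math.hypot(*u) * math.hypot(*v))

speeches = {
    "mp1_speech": [0.9, 0.1],
    "mp2_speech": [0.85, 0.2],
    "mp3_speech": [0.1, 0.95],
}

def build_edges(nodes, threshold=0.9):
    names = list(nodes)
    return {
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if cosine(nodes[a], nodes[b]) >= threshold
    }

print(build_edges(speeches))  # only the two similar speeches are linked
```

On such a graph, a graph network can propagate stance information between related speeches, debates, and speakers.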
Generating Diverse Corrections with Local Beam Search for Grammatical Error Correction Kengo Hotate | Masahiro Kaneko | Mamoru Komachi
The short paper introduces a simple penalty score (Equation 1) to make beam search more diverse at specific parts of the sequence where more diversity is desired. This is motivated by the task of grammatical error correction in which one wants to generate alternatives only for grammatically faulty parts of the text sequence. The idea is simple yet interesting and the references to other beam search methods are valuable.
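The flavor of such a position-local penalty can be sketched as follows. This is my simplification, not the paper's exact Equation 1: a candidate's score is penalized only when it repeats a token that a sibling hypothesis already chose at a position flagged as grammatically faulty.

```python
# Sketch of a position-local diversity penalty for beam search (my
# simplification of the idea): discourage beams from agreeing exactly
# where alternative corrections are wanted.
def penalized_score(base_score, token, position, chosen_at, faulty_positions, penalty=0.5):
    # chosen_at: position -> tokens already used by sibling hypotheses
    if position in faulty_positions and token in chosen_at.get(position, set()):
        return base_score - penalty
    return base_score

chosen_at = {2: {"their"}}  # a sibling beam already used "their" at position 2
faulty_positions = {2}      # position 2 was flagged as the error site
print(penalized_score(1.0, "their", 2, chosen_at, faulty_positions))  # 0.5: penalized
print(penalized_score(1.0, "there", 2, chosen_at, faulty_positions))  # 1.0: new alternative
print(penalized_score(1.0, "their", 0, chosen_at, faulty_positions))  # 1.0: correct span, no penalty
```

Outside the faulty span, beams stay free to agree, so the corrections diverge only where the grammar is wrong.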
Do Word Embeddings Capture Spelling Variation? Dong Nguyen | Jack Grieve
The authors investigate whether spelling variations – which are, by the way, very common on platforms like Reddit and Twitter – are captured in word embeddings. Similar to the popular “king – man ≈ queen – woman”, one could check for “cookin – cooking ≈ movin – moving”, where cookin and movin are instances of a spelling variation called g-dropping (or, less formally, g-droppin’). In my opinion, word embedding spaces are generally fascinating, and the idea of looking into them on a non-semantic level, as done here by analyzing spelling, is intriguing. The paper’s source code is publicly available.
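The analogy test itself can be sketched directly. The vectors below are invented; real embeddings would be trained on Reddit/Twitter text where g-dropped forms actually occur.

```python
import math

# Toy sketch of the g-dropping analogy: does vec(cookin) - vec(cooking)
# point in the same direction as vec(movin) - vec(moving)? Embeddings
# are invented stand-ins.
emb = {
    "cooking": [1.0, 1.0], "cookin": [1.0, 0.2],
    "moving":  [2.0, 1.1], "movin":  [2.0, 0.25],
}

def diff(a, b):
    return [x - y for x, y in zip(emb[a], emb[b])]

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    return num / (math.hypot(*u) * math.hypot(*v))

# close to 1 when the "dropping the g" offset is consistent across pairs
print(cosine(diff("cookin", "cooking"), diff("movin", "moving")))
```

A consistently high cosine across many such pairs would suggest the embedding space encodes g-dropping as a systematic direction, analogous to the semantic king/queen offset.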