ACL Reference Age

I reviewed a paper recently and while the ideas were sound and the implementation was thorough, something felt off. I couldn’t put my finger on it until I read the related work section: every reference had been published within the last 2-3 years.

This got me thinking. We joke that nobody cites papers older than five years anymore, but every generation believes that “back in my day, things were better.” How true is it, actually?

This is an empirical question, and thanks to the amazing ACL Anthology, we can answer it. To do this, I used the Semantic Scholar API to gather the title, authors, and references for most ACL papers going back to the beginning (1979). This involved looking at this page for a while, trying to decide which papers to include (e.g., include main-conference long and short papers, but not workshop papers). You can find my code and data here.
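For the curious, here is a minimal sketch of what one such call might look like, using the Semantic Scholar Graph API. The ACL Anthology ID below is just an example, and a real run needs retry and rate-limit handling:

```python
import requests

# One illustrative call to the Semantic Scholar Graph API, asking for the
# title, year, authors, and the publication year of each reference.
url = "https://api.semanticscholar.org/graph/v1/paper/ACL:P17-1001"
resp = requests.get(url, params={"fields": "title,year,authors,references.year"})
resp.raise_for_status()
paper = resp.json()

# Keep only references with a known publication year.
ref_years = [r["year"] for r in paper.get("references", []) if r.get("year")]
print(paper["title"], paper["year"], len(ref_years))
```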

As an aside, this was an instructive process in itself. Did you know that the short paper track was only introduced around 2008?

Anyway, having gathered all this data, there are loads of interesting statistics to calculate.

Number of accepted papers has always been growing, but it has skyrocketed in the last 3 years. Two interesting facts: 1) publication numbers didn’t start regularly exceeding 100 until 2006, and 2) more ACL papers were published in 2014–2021 than in 1979–2013. Imagine having fewer than 100 ACL papers each year. I might actually read some instead of being paralyzed by the avalanche.

The spikes in publication numbers in 1984, 1998, and 2006 occurred because in those years ACL was held jointly with COLING.
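For the sketches below, I’ll assume the dump is a JSON list with one record per paper; the file name and field names (conference_year, reference_years) are my own shorthand here, not necessarily what’s in the released data. Under that assumption, counting papers per year is a one-liner:

```python
import json
from collections import Counter

# Assumed schema: [{"conference_year": 2017, "reference_years": [2013, ...]}, ...]
with open("acl_papers.json") as f:
    papers = json.load(f)

papers_per_year = Counter(p["conference_year"] for p in papers)
for year in sorted(papers_per_year):
    print(year, papers_per_year[year])
```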

Number of references has steadily increased since 2010. This is a fun one. Up until and including 2009, only one page was allowed for references, probably because of printing constraints. You can fit about 20 references on a page (unless you cite Universal Dependencies), which explains why the average hovered around 20 for a while. But in 2010, 2012, and then from 2016 onwards, papers were allowed unlimited references. Interestingly, the number of references has slowly grown since then. Is that just the field righting itself after years of reference starvation, or are we requiring unnecessary thoroughness?
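Average reference list length per year is the same kind of aggregation, under the same assumed schema:

```python
import json
from collections import defaultdict
from statistics import mean

with open("acl_papers.json") as f:
    papers = json.load(f)

# Collect each paper's reference count under its conference year.
refs_by_year = defaultdict(list)
for p in papers:
    refs_by_year[p["conference_year"]].append(len(p["reference_years"]))

for year in sorted(refs_by_year):
    print(year, round(mean(refs_by_year[year]), 1))
```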

Reference age (mean and median) has been decreasing since around 2015 and has reached a low not seen since the early 1990s. To be clear, I calculate the age of a reference by subtracting the reference’s publication year from the conference year. For example, in an ACL 2017 paper, a citation to a 2013 paper has reference age 4. Then, for each conference year, I pool every reference from every paper into one list and calculate statistics over it.
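Concretely, a sketch of that pooled calculation, under the same assumed schema as above:

```python
import json
from collections import defaultdict
from statistics import mean, median

with open("acl_papers.json") as f:
    papers = json.load(f)

# Pool the age of every reference from every paper in each conference year.
ages_by_year = defaultdict(list)
for p in papers:
    for ref_year in p["reference_years"]:
        ages_by_year[p["conference_year"]].append(p["conference_year"] - ref_year)

for year in sorted(ages_by_year):
    ages = ages_by_year[year]
    print(year, round(mean(ages), 2), median(ages))
```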

At first blush, this result seems to confirm my hypothesis: recent papers do tend to cite other recent papers. But much of this behavior can be explained by the dramatic growth in the number of publications per year and the number of references per paper. That is, in a world in which your topic has twice as many publications as before, and there’s no limit on references, it’s considered good practice (in fact, often recommended by reviewers) to cite all of them. This brings the average reference age down, even if all references to relevant “old” papers remain.

But from roughly 2005 to 2014, reference age increased even as publication counts and reference list sizes grew. This suggests that in those years, authors tended not to cite recent work, which may be a sign of stagnation.

If you look at the top 3 papers cited each year (see the list below), you can see what happened. Up until 2014, Statistical Machine Translation references got the most action, and these largely came from the early 2000s (reference age 10+ years). Then in 2015, word vectors burst onto the scene, from publications in 2013 or so (reference age 2+ years).

But even after 2014, increasing publication counts and reference list sizes don’t explain everything! These aggregate statistics decouple references from the papers that cite them. Let’s look at two other angles: median reference age by paper, and the probability that a paper’s oldest reference is less than N years old.

The graph below shows the former. Think about it this way: each paper in an ACL conference year is reduced to a single number, the median age of its references. So if in 2021 you cite only BERT* papers, your median reference age will be about 2 or 3. But if you cite Penn Treebank-style papers, your median reference age will be much higher. Having reduced each year’s papers to a list of numbers, we can calculate percentiles, as in the sketch below.
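A sketch of that reduction, again under the assumed schema; statistics.quantiles with n=20 gives cut points in 5% steps:

```python
import json
from collections import defaultdict
from statistics import median, quantiles

with open("acl_papers.json") as f:
    papers = json.load(f)

# Reduce each paper to one number: the median age of its references.
medians_by_year = defaultdict(list)
for p in papers:
    ages = [p["conference_year"] - r for r in p["reference_years"]]
    if ages:
        medians_by_year[p["conference_year"]].append(median(ages))

# Percentiles across papers within each year; with n=20, index 1 is the
# 10th percentile, index 9 the median, index 18 the 95th percentile.
for year in sorted(medians_by_year):
    qs = quantiles(medians_by_year[year], n=20)
    print(year, "p10:", qs[1], "p50:", qs[9], "p95:", qs[18])
```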

After 2015, every percentile above the 10th showed a steep decline in median reference age. Even papers at the 95th percentile (those citing the oldest papers) had medians that declined.

While this is interesting, it may still be dependent on increasing publication and reference list size.

Now, let’s represent each paper in a conference year by its oldest reference. This (mostly) controls for growing reference list sizes. Then we can calculate the probability that a randomly selected paper has an oldest reference age of less than N years, which controls for increased conference sizes. The graph below has the results!
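Here is a sketch of that calculation under the same assumed schema, with N as the threshold:

```python
import json
from collections import defaultdict

N = 10  # threshold in years

with open("acl_papers.json") as f:
    papers = json.load(f)

# Represent each paper by its single oldest reference, then ask how often
# even that oldest reference is younger than N years.
young_by_year = defaultdict(list)
for p in papers:
    if p["reference_years"]:
        oldest_age = p["conference_year"] - min(p["reference_years"])
        young_by_year[p["conference_year"]].append(oldest_age < N)

for year in sorted(young_by_year):
    flags = young_by_year[year]
    print(year, f"{100 * sum(flags) / len(flags):.1f}%")
```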

In a nascent field, you’d expect all citations to be young but getting older, and that’s exactly what we see back around 1979–1990. The probability of selecting a paper whose oldest reference is young continues to drop until about 2014, then starts to grow! That is, in 2014, if you grab a paper at random, there’s about a 0.7% chance that its oldest reference will be less than 10 years old. Contrast that with 2021, when the probability is closer to 10%.

(In case you’re wondering, there were 2 papers in 2021 that had oldest reference age less than 3.)

Conclusion

To recap, reference ages have been getting younger on average, although this is probably due to the increased number of publications and reference list sizes. That said, we are seeing an increasing proportion of papers that fail to cite older work.

Again, you can find my code and data here. I invite interested readers to do their own analyses.

Caveats

Update: comparing to “On Forgetting to Cite Older Papers”

Marcel Bollmann graciously pointed out that I should probably have at least linked to his paper on a similar topic: On Forgetting to Cite Older Papers: An Analysis of the ACL Anthology, published at ACL 2020. Ironically, this paper is part of my dataset.

I’d encourage you to read their paper (and cite it), but here are my brief observations:

Bonus: Most Cited Papers by Year

I thought this would be fun to put out there. It’s interesting to watch the ebb and flow of topics over the years. In particular, I didn’t realize that machine translation was so dominant in NLP for so many years!

| Year | Most cited papers (top 3) | Published | Citations |
|------|---------------------------|-----------|-----------|
| 1979 | On Overview of KRL, a Knowledge Representation Language | 1976 | 4 |
| | Transition network grammars for natural language analysis | 1970 | 3 |
| | An Experimental Parsing System for Transition Network Grammars | 1973 | 3 |
| 1980 | Transition network grammars for natural language analysis | 1970 | 5 |
| | Innocence: A Second Idealization for Linguistics | 1979 | 3 |
| | Towards a Self-Extending Parser | 1979 | 3 |
| 1981 | A theory of syntactic recognition for natural language | 1979 | 4 |
| | Developing a natural language interface to complex data | 1977 | 3 |
| | Human engineering fcr applied natural language processing | 1977 | 3 |
| 1982 | The representation and use of focus in dialogue understanding. | 1977 | 4 |
| | Linguistic Analysis of Natural Language Communication With Computers | 1980 | 4 |
| | DIAGRAM: a grammar for dialogues | 1986 | 3 |
| 1983 | A theory of syntactic recognition for natural language | 1979 | 5 |
| | Unbounded Dependencies and Coordinate Structure | 1981 | 3 |
| | Phrase Structure Grammar | 1982 | 3 |
| 1984 | A theory of syntactic recognition for natural language | 1979 | 8 |
| | The Mental representation of grammatical relations | 1985 | 5 |
| | Transition network grammars for natural language analysis | 1970 | 5 |
| 1985 | Definite Clause Grammars for Language Analysis - A Survey of the Formalism and a Comparison with Augmented Transition Networks | 1980 | 4 |
| | A theory of syntactic recognition for natural language | 1979 | 3 |
| | The Pragmatics of Referring and the Modality of Communication | 1984 | 3 |
| 1986 | Providing a Unified Account of Definite Noun Phrases in Discourse | 1983 | 5 |
| | Discourse Structure and the Proper Treatment of Interruptions | 1985 | 4 |
| | Towards a computational theory of definite anaphora comprehension in English discourse | 1979 | 3 |
| 1987 | A theory of syntactic recognition for natural language | 1979 | 5 |
| | Word Meaning and Montague Grammar | 1979 | 4 |
| | Generalized Phrase Structure Grammar | 1985 | 4 |
| 1988 | Natural Language Information Processing: A Computer Grammar of English and Its Applications | 1980 | 5 |
| | The Proper Treatment of Quantification in Ordinary English | 1973 | 5 |
| | Attention, Intentions, and the Structure of Discourse | 1986 | 5 |
| 1989 | Attention, Intentions, and the Structure of Discourse | 1986 | 7 |
| | The Proper Treatment of Quantification in Ordinary English | 1973 | 4 |
| | Providing a Unified Account of Definite Noun Phrases in Discourse | 1983 | 4 |
| 1990 | Lectures on Government and Binding | 1981 | 5 |
| | The Mental representation of grammatical relations | 1985 | 5 |
| | A theory of syntactic recognition for natural language | 1979 | 5 |
| 1991 | Information-based syntax and semantics | 1987 | 5 |
| | Attention, Intentions, and the Structure of Discourse | 1986 | 5 |
| | Book Reviews: Lecture on Contemporary Syntactic Theories: An Introduction to Unification-Based Approaches to Grammar | 1987 | 5 |
| 1992 | Attention, Intentions, and the Structure of Discourse | 1986 | 4 |
| | A stochastic parts program and noun phrase parser for unrestricted text | 1989 | 4 |
| | Discourse Relations and Defeasible Knowledge | 1991 | 3 |
| 1993 | Attention, Intentions, and the Structure of Discourse | 1986 | 5 |
| | Information-based syntax and semantics | 1987 | 5 |
| | Class-Based n-gram Models of Natural Language | 1992 | 5 |
| 1994 | Information-based syntax and semantics | 1987 | 6 |
| | Information-Based Syntax and Semantics: Volume 1, Fundamentals | 1987 | 5 |
| | Attention, Intentions, and the Structure of Discourse | 1986 | 4 |
| 1995 | Head-driven phrase structure grammar | 1994 | 6 |
| | Attention, Intentions, and the Structure of Discourse | 1986 | 5 |
| | Word-Sense Disambiguation Using Statistical Models of Roget’s Categories Trained on Large Corpora | 1992 | 4 |
| 1996 | Building a Large Annotated Corpus of English: The Penn Treebank | 1993 | 5 |
| | An efficient context-free parsing algorithm | 1970 | 5 |
| | A stochastic parts program and noun phrase parser for unrestricted text | 1989 | 5 |
| 1997 | Head-driven phrase structure grammar | 1994 | 7 |
| | A stochastic parts program and noun phrase parser for unrestricted text | 1989 | 7 |
| | Attention, Intentions, and the Structure of Discourse | 1986 | 7 |
| 1998 | A stochastic parts program and noun phrase parser for unrestricted text | 1989 | 8 |
| | Building a Large Annotated Corpus of English: The Penn Treebank | 1993 | 8 |
| | Introduction to WordNet: An On-line Lexical Database | 1990 | 8 |
| 1999 | Building a Large Annotated Corpus of English: The Penn Treebank | 1993 | 12 |
| | WordNet : an electronic lexical database | 2000 | 6 |
| | Distributional Clustering of English Words | 1993 | 5 |
| 2000 | Building a Large Annotated Corpus of English: The Penn Treebank | 1993 | 9 |
| | A Maximum Entropy Model for Part-Of-Speech Tagging | 1996 | 6 |
| | The Mathematics of Statistical Machine Translation: Parameter Estimation | 1993 | 5 |
| 2001 | Foundations of statistical natural language processing | 1999 | 7 |
| | Tree-Bank Grammars | 1996 | 6 |
| | The Mathematics of Statistical Machine Translation: Parameter Estimation | 1993 | 6 |
| 2002 | A Maximum-Entropy-Inspired Parser | 2000 | 8 |
| | Estimators for Stochastic "Unification-Based" Grammars | 1999 | 6 |
| | Head-Driven Statistical Models for Natural Language Parsing | 2003 | 6 |
| 2003 | The Mathematics of Statistical Machine Translation: Parameter Estimation | 1993 | 10 |
| | A Maximum-Entropy-Inspired Parser | 2000 | 8 |
| | C4.5: Programs for Machine Learning | 1992 | 8 |
| 2004 | A Maximum-Entropy-Inspired Parser | 2000 | 9 |
| | Head-Driven Statistical Models for Natural Language Parsing | 2003 | 7 |
| | C4.5: Programs for Machine Learning | 1992 | 7 |
| 2005 | A Maximum-Entropy-Inspired Parser | 2000 | 15 |
| | The Mathematics of Statistical Machine Translation: Parameter Estimation | 1993 | 13 |
| | Bleu: a Method for Automatic Evaluation of Machine Translation | 2002 | 13 |
| 2006 | The Mathematics of Statistical Machine Translation: Parameter Estimation | 1993 | 28 |
| | Bleu: a Method for Automatic Evaluation of Machine Translation | 2002 | 23 |
| | Building a Large Annotated Corpus of English: The Penn Treebank | 1993 | 20 |
| 2007 | Bleu: a Method for Automatic Evaluation of Machine Translation | 2002 | 20 |
| | Statistical Phrase-Based Translation | 2003 | 15 |
| | Minimum Error Rate Training in Statistical Machine Translation | 2003 | 15 |
| 2008 | Bleu: a Method for Automatic Evaluation of Machine Translation | 2002 | 18 |
| | Minimum Error Rate Training in Statistical Machine Translation | 2003 | 17 |
| | SRILM - an extensible language modeling toolkit | 2002 | 15 |
| 2009 | Bleu: a Method for Automatic Evaluation of Machine Translation | 2002 | 30 |
| | Minimum Error Rate Training in Statistical Machine Translation | 2003 | 21 |
| | Statistical Phrase-Based Translation | 2003 | 18 |
| 2010 | Bleu: a Method for Automatic Evaluation of Machine Translation | 2002 | 25 |
| | Building a Large Annotated Corpus of English: The Penn Treebank | 1993 | 23 |
| | Statistical Phrase-Based Translation | 2003 | 20 |
| 2011 | Bleu: a Method for Automatic Evaluation of Machine Translation | 2002 | 37 |
| | SRILM - an extensible language modeling toolkit | 2002 | 29 |
| | Statistical Phrase-Based Translation | 2003 | 28 |
| 2012 | Bleu: a Method for Automatic Evaluation of Machine Translation | 2002 | 25 |
| | Statistical Phrase-Based Translation | 2003 | 21 |
| | Minimum Error Rate Training in Statistical Machine Translation | 2003 | 19 |
| 2013 | Bleu: a Method for Automatic Evaluation of Machine Translation | 2002 | 46 |
| | Statistical Phrase-Based Translation | 2003 | 36 |
| | Moses: Open Source Toolkit for Statistical Machine Translation | 2007 | 33 |
| 2014 | Latent Dirichlet Allocation | 2001 | 37 |
| | Bleu: a Method for Automatic Evaluation of Machine Translation | 2002 | 28 |
| | Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms | 2002 | 25 |
| 2015 | Distributed Representations of Words and Phrases and their Compositionality | 2013 | 52 |
| | Efficient Estimation of Word Representations in Vector Space | 2013 | 50 |
| | Natural Language Processing (Almost) from Scratch | 2011 | 43 |
| 2016 | Distributed Representations of Words and Phrases and their Compositionality | 2013 | 75 |
| | Long Short-Term Memory | 1997 | 55 |
| | Efficient Estimation of Word Representations in Vector Space | 2013 | 54 |
| 2017 | Neural Machine Translation by Jointly Learning to Align and Translate | 2014 | 83 |
| | Long Short-Term Memory | 1997 | 73 |
| | Adam: A Method for Stochastic Optimization | 2014 | 73 |
| 2018 | GloVe: Global Vectors for Word Representation | 2014 | 117 |
| | Adam: A Method for Stochastic Optimization | 2014 | 117 |
| | Neural Machine Translation by Jointly Learning to Align and Translate | 2014 | 109 |
| 2019 | Adam: A Method for Stochastic Optimization | 2014 | 237 |
| | GloVe: Global Vectors for Word Representation | 2014 | 218 |
| | BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding | 2019 | 190 |
| 2020 | BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding | 2019 | 418 |
| | Attention is All you Need | 2017 | 290 |
| | Adam: A Method for Stochastic Optimization | 2014 | 243 |
| 2021 | BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding | 2019 | 466 |
| | Attention is All you Need | 2017 | 292 |
| | Adam: A Method for Stochastic Optimization | 2014 | 192 |