ACL Reference Age
03 May 2022

I reviewed a paper recently, and while the ideas were sound and the implementation was thorough, something felt off. I couldn’t put my finger on it until I read the related work section: every reference had been published within the last 2-3 years.
This got me thinking. We joke about how nobody cites papers older than 5 years any more, but every generation believes that “back in my day, things were better”. How true is it actually?
This is an empirical question, and thanks to the amazing ACL Anthology, we can answer it. To do this, I used the Semantic Scholar API to gather the title, authors, and references for most ACL papers going back to the beginning (1979). This involved looking at this page for a while, trying to decide which papers to include (e.g. include main conference long and short, don’t include workshop papers). You can find my code and data here.
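If you’re curious what the gathering step looks like, here’s a minimal sketch (not my actual script) of pulling a paper’s references from the Semantic Scholar Graph API by its ACL Anthology ID; the example ID below is arbitrary.

```python
# Minimal sketch: fetch (title, year) for each reference of an ACL paper
# via the Semantic Scholar Graph API. The ACL ID below is just an example.
import requests

def get_references(acl_id):
    url = f"https://api.semanticscholar.org/graph/v1/paper/ACL:{acl_id}/references"
    resp = requests.get(url, params={"fields": "title,year", "limit": 1000})
    resp.raise_for_status()
    # Each entry wraps the cited paper's requested fields under "citedPaper".
    return [(r["citedPaper"].get("title"), r["citedPaper"].get("year"))
            for r in resp.json().get("data", [])]

print(get_references("P17-1001")[:5])  # hypothetical example ID
```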
As an aside, this was an instructive process in itself. Did you know that the short paper track was only introduced around 2008?
Anyway, having gathered all this data, there are loads of interesting statistics to calculate.
Number of accepted papers has always been growing, but it has skyrocketed in the last 3 years. Two interesting facts: 1) it was only in 2006 that publication numbers started regularly going above 100, and 2) more ACL papers were published in 2014-2021 than in 1979-2013. Imagine having fewer than 100 ACL papers each year. I might actually read some instead of being paralyzed by the avalanche.
The spikes in publication numbers in 1984, 1998, and 2006 are because in those years ACL was held jointly with COLING.
Number of references has steadily increased since 2010. This is a fun one. Up until and including 2009, only one page was allowed for references, probably because of printing constraints. You can fit about 20 references on a page (unless you cite Universal Dependencies), which explains why the average hovered around 20 for a while. But in 2010, 2012, and then from 2016 onwards, papers were allowed unlimited references. Interestingly, the number of references has slowly grown since then. Is that just the field righting itself after years of reference starvation, or are we requiring unnecessary thoroughness?
Reference age (mean and median) has been decreasing since around 2015 and has reached a low not seen since the early 1990s. To be clear, I calculate reference age by subtracting the reference year from the conference year. For example, in an ACL 2017 paper, a citation to a paper from 2013 has reference age 4. Then for that year, I put every reference from every paper into a list and calculate statistics.
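Concretely, the per-year statistics can be computed with something like the sketch below, which assumes the gathered data has been arranged as a dict mapping each conference year to its papers, each carrying the years of its references (that shape is just for illustration, not the format of my released data).

```python
# Sketch: pooled mean/median reference age per conference year.
# papers_by_year: {conf_year: [{"reference_years": [2013, 2009, ...]}, ...]}  (assumed shape)
from statistics import mean, median

def reference_age_stats(papers_by_year):
    stats = {}
    for conf_year, papers in papers_by_year.items():
        # Pool every reference from every paper published in that conference year.
        ages = [conf_year - ref_year
                for paper in papers
                for ref_year in paper["reference_years"]
                if ref_year is not None]
        stats[conf_year] = {"mean": mean(ages), "median": median(ages)}
    return stats
```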
At first blush, this result seems to confirm my hypothesis: recent papers do tend to cite other recent papers. But much of this behavior can be explained by the dramatic growth in the number of publications per year and the number of references per paper. That is, in a world in which your topic has twice as many publications as before, and there’s no limit on references, it’s considered good practice (in fact, often recommended by reviewers) to cite all of them. This brings the average reference age down, even if all references to relevant “old” papers remain.
But looking from 2005 to 2014 or so, reference age increased even as publication and reference list sizes increased. This suggests that in those years, authors tended to not cite recent work, which may be a sign of stagnation.
If you look at the top 3 papers cited each year (see list below), you can see what happened. Up until 2014, Statistical Machine Translation references got the most action, and these largely came from the early 2000s (reference age 10+ years). Then in 2015, word vectors burst onto the scene, from publications in 2013 or so (reference age 2+ years).
But even after 2014, increasing publication/reference list sizes don’t explain everything! These aggregate statistics decouple references from the papers that cited them. Let’s look at two other angles: median reference age by paper, and the probability of a paper’s oldest reference being less than N years old.
The graph below shows the former. Think about it this way: each paper in an ACL conference year is reduced to a single number, the median reference age. So if in 2021 you cite only BERT* papers, your median reference age will be about 2 or 3. But if you cite Penn Treebank-style papers, your median reference age will be much higher. Having reduced each year’s papers to a list of numbers, we can calculate percentiles.
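Here’s a rough sketch of that calculation, using the same illustrative data shape as before.

```python
# Sketch: per-paper median reference age, summarized by percentiles per year.
import numpy as np

def median_age_percentiles(papers_by_year, percentiles=(10, 25, 50, 75, 95)):
    out = {}
    for conf_year, papers in papers_by_year.items():
        # Reduce each paper to a single number: the median age of its references.
        per_paper_medians = [
            np.median([conf_year - y for y in paper["reference_years"] if y is not None])
            for paper in papers
            if any(y is not None for y in paper["reference_years"])
        ]
        # Then summarize the distribution of those medians for the year.
        out[conf_year] = np.percentile(per_paper_medians, percentiles)
    return out
```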
After 2015, every percentile greater than 10 showed a steep decline in median reference age. Even papers at the 95th percentile (those that cite the oldest papers) had medians that declined.
While this is interesting, it may still be dependent on increasing publication and reference list size.
Now, let’s represent each paper in a conference year by its oldest reference. This (mostly) controls for growing reference list sizes. Then let’s calculate the probability that a randomly selected paper has an oldest reference age of less than N years, which controls for increased conference sizes. The graph below has the results!
In a nascent field, you’d expect all citations to be young but getting older, and that’s exactly what we see back around 1979-1990. The probability of selecting a paper whose oldest reference is young continues to drop until about 2014, then starts to grow! That is, in 2014, if you grab a paper at random, there’s about a 0.7% chance that its oldest reference will be less than 10 years old. Contrast that with 2021, when the probability is closer to 10%.
(In case you’re wondering, there were 2 papers in 2021 that had oldest reference age less than 3.)
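If you want to play with this yourself, the probability can be computed roughly as follows (same assumed data shape as in the earlier sketches).

```python
# Sketch: P(oldest reference age < n_years) for a randomly selected paper, per year.
def prob_oldest_ref_under(papers_by_year, n_years=10):
    probs = {}
    for conf_year, papers in papers_by_year.items():
        oldest_ages = []
        for paper in papers:
            ref_years = [y for y in paper["reference_years"] if y is not None]
            if ref_years:
                # The oldest reference is the one with the smallest publication year.
                oldest_ages.append(conf_year - min(ref_years))
        probs[conf_year] = sum(age < n_years for age in oldest_ages) / len(oldest_ages)
    return probs
```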
Conclusion
To recap, references have been getting younger on average, although this is probably due to the increased number of publications and reference list sizes. That said, we are seeing an increasing proportion of papers that fail to cite older work.
Again, you can find my code and data here. I invite interested readers to do their own analyses.
Caveats
- This analysis was entirely over ACL publications! It may be that EMNLP or NAACL have different patterns, although I doubt it.
- I personally only started reading and paying attention to NLP papers in 2012 or so. Some of my readers have been around longer, and probably have nuances or corrections for me. For example, thanks to Burr Settles for pointing out the unlimited references thing in 2010.
Update: comparing to “On Forgetting to Cite Older Papers”
Marcel Bollmann graciously pointed out that I should probably have at least linked to his paper on a similar topic: On Forgetting to Cite Older Papers: An Analysis of the ACL Anthology, published in ACL 2020. Ironically, this paper is part of my dataset.
I’d encourage you to read their paper (and cite it), but here are my brief observations:
- On the whole, we have pretty similar conclusions! Old papers are still being cited, but the influx of new papers makes it harder to see that.
- My hypothesis about different conferences having similar behavior was largely correct, although it was very interesting to read about TACL and CL having different patterns.
- I was wrong about unlimited references starting in 2010! ACL allowed unlimited references in 2010, 2012, and then from 2016 onwards.
- Their analysis goes back to 2010, but my analysis goes all the way back to 1979, which surfaces a few interesting patterns.
- We had different methods for gathering references: they parsed references themselves, while I used Semantic Scholar. It’s hard to say whether one is better than the other, but given that we reached similar conclusions, the methods are probably comparable.
- We both included tables of Most Cited Papers by year – great minds, etc. etc.
- I had wondered if this kind of thing was publishable, and I’m happy to see the answer is Yes!
Bonus: Most Cited Papers by Year
I thought this would be fun to put out there. It’s interesting to watch the ebb and flow of topics over the years. In particular, I didn’t realize that machine translation was so dominant in NLP for so many years!