Universal NER -- The next steps

30 Jul 2024

The Universal NER project hit the ground running with a data release and a NAACL paper in 2024. The initial release had 19 datasets spanning 13 languages. Nice prime numbers. Since we have modeled ourselves after the Universal Dependencies project, we also plan to keep growing, with a data release roughly once per year.

For 2024, we’ve set a target of adding 7 new languages, bringing us to 20 unique languages in the dataset.

While UNER is open to annotations on any UD dataset in any new language, this year we are focusing specifically on the Parallel Universal Dependencies (PUD) datasets. Because they are parallel, analysis and quality control are much easier across languages, and each dataset is a manageable size for annotation.

There are 24 PUD datasets (23 unique languages, since 1 is a variant of the Japanese dataset), and 6 of them are already annotated. The 17 remaining languages, in no particular order, are: Hindi, Turkish, Thai, Spanish, Polish, Korean, Japanese, Italian, Indonesian, Icelandic, Galician, French, Finnish, Czech, Arabic, Magahi, Bengali.

Would you consider helping out? If you’re interested, join our Discord.

Stephen Mayhew
