Universal NER -- The next steps

The Universal NER project hit the ground running with a data release and a NAACL paper in 2024. The initial release had 19 datasets spanning 13 languages. Nice prime numbers. Because we have modeled ourselves on the Universal Dependencies project, we plan to keep growing, with a data release roughly once per year.

For 2024, we’ve set a target of adding 7 new languages, bringing us to 20 unique languages in the dataset.

While UNER is open to annotations on any UD dataset in any new language, this year we are focusing specifically on the Parallel Universal Dependencies (PUD) datasets. Because they are parallel, annotations of the same sentences can be compared across languages, which makes analysis and quality control much easier. The datasets are also a manageable size for annotation.

There are 24 PUD datasets covering 23 unique languages (one dataset is a variant of the Japanese one), and 6 of them are already annotated. The remaining 17 languages, in no particular order, are: Hindi, Turkish, Thai, Spanish, Polish, Korean, Japanese, Italian, Indonesian, Icelandic, Galician, French, Finnish, Czech, Arabic, Magahi, Bengali.

Would you consider helping out? If you’re interested, join our Discord.