Workshop on Balto-Slavic NLP

This year, the 6th Workshop on Balto-Slavic NLP was held, co-located with EACL 2017, in Valencia, Spain. The associated shared task was formally called Multilingual Named Entity Recognition, but in fact consisted of three subtasks: named entity recognition, name normalization (lemmatization), and cross-lingual entity matching.

(I didn’t participate, but it’s related to my research.) For now, I will only discuss the first subtask: named entity recognition.

The evaluation for the NER subtask is slightly different from standard NER. It is done at the document level (I like to call it “bag-of-entities”): a system returns only which entities are present in each document, without reporting spans. The evaluation is further broken down into two variants: relaxed and strict. In the relaxed eval, a system is considered correct if it reports an entity in any form (including nicknames). In the strict eval, no slack is given for such laziness: if Beata Szydło shows up in different forms, all forms need to be discovered.
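
To make the scoring concrete, here is a minimal sketch of how a document-level (“bag-of-entities”) score might be computed, assuming gold and predicted entities are just sets of strings per document. This is my own illustration, not the organizers’ scorer; the relaxed variant would additionally group all surface forms of an entity and give credit if any one of them is reported.

```python
def bag_of_entities_f1(gold_docs, pred_docs):
    """Micro-averaged F1 where each document is a *set* of entity
    strings (no spans). Treating every surface form as its own item
    corresponds to the strict, all-forms-required reading of the eval."""
    tp = fp = fn = 0
    for doc_id, gold in gold_docs.items():
        pred = pred_docs.get(doc_id, set())
        tp += len(gold & pred)
        fp += len(pred - gold)
        fn += len(gold - pred)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

# Finding "Beata Szydło" but missing the bare "Szydło" costs recall.
gold = {"doc1": {"Beata Szydło", "Szydło", "Donald Trump"}}
pred = {"doc1": {"Beata Szydło", "Donald Trump"}}
print(bag_of_entities_f1(gold, pred))  # ≈ 0.8
```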

This makes the task somewhat easier. Obviously, people use NER in many ways, but if the use case is coarse document discovery (“Find me documents about Fidel Castro”) or coarse understanding (“This document contains the entities: Oliver North, Nicaragua, Sandinista National Liberation Front”), then this evaluation makes a lot of sense.

The seven Slavic languages included were Croatian, Czech, Polish, Russian, Slovak, Slovene, and Ukrainian. See the shared task paper for more details.

An interesting aspect of this task is how the organizers gathered data. They chose a search query, searched Google for that string, and downloaded the first 100 pages returned. These were post-processed to be as close to plain text as possible, although some HTML noise leaks through. This gives fresh data that mimics how such a system might be used in the wild: on recent data with new entities, instead of the immortal Pierre Vinken.
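
The post-processing is essentially “strip a web page down to its text.” Below is a rough sketch of what that step might look like, assuming you fetch the pages yourself with requests and BeautifulSoup; the organizers’ actual pipeline isn’t described here, and the top_100_results helper is purely hypothetical.

```python
import requests
from bs4 import BeautifulSoup

def page_to_text(url: str) -> str:
    """Fetch a page and reduce it to near-plain text: drop scripts,
    styles, and markup, keep the visible text. Some boilerplate
    (menus, captions) will still leak through, as in the task data."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()
    lines = (line.strip() for line in soup.get_text("\n").splitlines())
    return "\n".join(line for line in lines if line)

# urls = top_100_results("Donald Trump")   # hypothetical search helper
# corpus = [page_to_text(u) for u in urls]
```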

The evaluation data was created from two search terms: Donald Trump and European Commission. A small amount of extra-cleaned data was annotated by native speakers (and near-native speakers) and used as the test set. The nature of the data means that domain adaptation will be an issue (unless you somehow already have annotated corpora about Donald Trump).

Crucially, they didn’t provide any training data, although they did provide “trial datasets” built from separate search terms, presumably so participants could get a feel for what the data looks like. The task is therefore to train on found or self-created data.

So how did teams get around the lack of training data? There weren’t many entries in the shared task.

(It’s unclear what the value of a shared task is if everyone uses such vastly different resources. You can’t compare a system built on large-scale supervised data with one that relies on a transfer model.)

The eval data (along with a scorer, thankfully!) is available for download.

The best scores reported for each language hovered around 50%, with the highest around 60% (Slovene) and the lowest at 27% (Ukrainian). There’s no clear overall winner. The three systems that reported scores on multiple languages (Sharoff, JHU, LexiFlexi) have averages within 1 or 2 points of each other, around 40 F1, which seems low given how forgiving the evaluation is.

What makes this task hard? Having no training data is a setback, obviously, but in similar tasks teams have achieved higher scores with fewer resources. Several of these languages have complex morphology, which I suspect is a much more serious problem than it gets credit for. I wonder if the temporal domain shift is also a factor: perhaps teams found and used data from several years prior, while this eval data lives in the domain of 2017. The test data is also delivered without tokenization; if teams didn’t use a strong tokenizer, they may have lost some points.
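
To see why morphology bites, consider the strict eval on Polish text: the same person shows up in several inflected forms, and exact string matching treats each one as a separate item to be discovered. A toy illustration (the extra forms below are the genitive and instrumental cases), not a real system:

```python
# The same entity, three inflected surface forms.
gold_forms = {"Donald Trump", "Donalda Trumpa", "Donaldem Trumpem"}
predicted  = {"Donald Trump"}

missed = gold_forms - predicted
print(sorted(missed))  # ['Donalda Trumpa', 'Donaldem Trumpem'] -> lost recall

# A real system needs a lemmatizer, a morphological analyzer, or at
# least fuzzy matching to map all three back to the same name.
```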

Then there are the structural issues. Teams were not given any kind of feedback on their submissions until after the deadline, so there may have been confusion about labeling strategies. The time span between release of the data and the submission deadline was about two and a half months, which should be plenty of time to throw something together. In a new task, with new data, it’s always reasonable to ask about the quality of the eval data. Can it be trusted as a gold standard? (A quick perusal suggests it is reasonable.)

On the whole, the task is interesting. There is talk of running it again next year, with more languages, and (hopefully) more participation.