Vocabulary in Allennlp

I’ve been using allennlp a lot lately, and I have found it to be immensely valuable. As with any library, there are several tricks you need to know, and I came across something recently that I wanted to make a note of (perhaps mainly for myself).

Long story short, the question is: should you include your test data in your vocabulary? In principle, the answer is no, because test data should be completely unseen. Ideally, if a test token is present in your pre-trained embeddings, its embedding should be retrieved; otherwise, it should become an UNK token and get the UNK embedding.

In an allennlp config file (for example, see here), you are required to give paths for train_data_path and validation_data_path, and you may give a path for test_data_path. All datasets given in the config are used when creating the vocabulary. For example, in this file the test data is included in the vocabulary.
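
To make this concrete, here is a minimal sketch of the relevant keys in a config; the paths are placeholders, and everything else (dataset reader, model, trainer, and so on) is omitted:

```jsonnet
{
  // Datasets that are always read and used to build the vocabulary.
  "train_data_path": "data/train.txt",       // placeholder path
  "validation_data_path": "data/dev.txt",    // placeholder path

  // Optional: if this key is present, the test set is also read and
  // its tokens are included when the vocabulary is built.
  // (dataset_reader, model, iterator, trainer, etc. omitted.)
  "test_data_path": "data/test.txt"          // placeholder path
}
```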

But here’s the thing: in allennlp, if you do not include the test_data_path key in your config, then every word not in the vocabulary maps to UNK, whether or not it is present in the pre-trained embeddings. For example, a relatively frequent word such as conversation (for which you have an embedding) may appear in test but not in train or dev. That word will be mapped to UNK, which is counter-intuitive. You can specify pre-trained files under a vocabulary key in the config, but this alone doesn’t solve the issue.
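
For reference, this is roughly what that vocabulary key looks like; the namespace and the path are illustrative assumptions rather than anything from the original config:

```jsonnet
{
  "vocabulary": {
    // Map each vocabulary namespace to a pre-trained embedding file.
    // "tokens" is the usual namespace for words; the path is a placeholder.
    "pretrained_files": {
      "tokens": "embeddings/glove.6B.100d.txt"
    }
    // On its own, this does not keep embeddings for words that never
    // appear in the datasets listed in the config.
  }
}
```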

Since I posted this originally, I have learned a little more, thanks to comments on Twitter. This pull request shows that you can add a parameter, min_pretrained_embeddings, that forces a certain number of embeddings from the pre-trained file to be kept in the vocabulary, potentially increasing the model size. So at train time you can set this number arbitrarily large, and all pre-trained embeddings will be included.
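
Based on that pull request, here is a sketch of what this might look like in the config; the number is arbitrary and the path is a placeholder:

```jsonnet
{
  "vocabulary": {
    "pretrained_files": {
      "tokens": "embeddings/glove.6B.100d.txt"   // placeholder path
    },
    // Keep at least this many entries from the pre-trained file in the
    // vocabulary, even for words that never appear in train or dev.
    // A number larger than the file forces every pre-trained word in,
    // at the cost of a larger model.
    "min_pretrained_embeddings": {
      "tokens": 400000
    }
  }
}
```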

If all you want to do is evaluate, then you can use the --extend-vocab flag. But if you’re trying to use allennlp as a library to actually tag documents, then you need to use the predict command, and --extend-vocab isn’t available.

So, if you want test tokens to be properly retrieved from the pre-trained embeddings, I believe these are your options:

- include test_data_path in the config, so that test tokens make it into the vocabulary (which goes against the principle of keeping test data unseen);
- set min_pretrained_embeddings to a large value at train time, so that all pre-trained words are kept in the vocabulary;
- if you only need to evaluate (not predict), pass --extend-vocab.

Note: if you fine-tune embeddings at train time, it’s not clear what to do at test time. For a word not present in train, it may not make sense to select from the untuned lookup table. Then again, maybe this is better than mapping to UNK.

The badge below will take you to a minimal code sample you can play with.

Binder