Illinois Transliteration
30 Dec 2015This code is in our group’s private gitlab repository, but is available upon request. This post is meant as a quick place for a tutorial on the code.
First, look at Runner.java
, and set the trainlang
and testlang
variables. These are languages for test and train. Typically, these should be the same. In the most common use case, these are names of languages as found in Wikipedia (and specifically, which correspond to files I have downloaded, and which are in the directory variable called wikidata
). For example, Hindi
and Hindi
.
Next, set the method
variable. In the most common use case, this is the string "wikidata"
. Notice also that NUMTRAIN
and NUMTEST
variables should be set. These should add up to less than the size of the data file. For example, if the Hindi training file has 1000 pairs, you can have NUMTEST
and NUMTRAIN
at 600 and 400 respectively.
With method wikidata
, this calls the TrainAndTest
method, which trains and tests the model in a loop. This is to smooth out any randomness and get the average, because we shuffle the training data and results can vary.
If you want to create a model for use at another time, this happens with the model.WriteProbs(...)
code. This will overwrite at each iteration, so you just need a single iteration (i.e. num=1
). This model is a text file called probs-*.txt
where the *
represents the training language. It is a set of weighted productions from English to the target language.
You can read this model file using the SPModel(file)
constructor. For an example of this, see the LoadAndTest
method in Runner.java
.
When you want to generate candidates in a trained language, look at the code in TestGenerate
method. Essentially, you give it a string, and it returns a ranked list of candidates. If you only need one, just get the top element.
To run, you can just use the runner.sh
script in the scripts/
directory.
> ./scripts/runner.sh