Joshua's language packs are models that have been trained and tuned for particular language pairs. Starting with version 3, language packs have the following capabilities:

  • No dependencies. The language pack runs with no external dependencies other than Java 8. You simply download, unpack, and run the included shell script to start Joshua in command-line or server mode.
  • Docker support. To keep language packs free of external dependencies, they use BerkeleyLM by default. With Docker, you can easily build KenLM and use the language pack with the included KenLM configuration file and language model files. KenLM is faster, produces better translations, and requires fewer resources.

A Version 3 language pack is distinguished by the following files:

  • joshua — an entry point shell script
  • joshua.config — the default Joshua config
  • joshua.config.kenlm — the KenLM config file
  • lp.conf — a language pack configuration, defining the version number and the git commit for the Joshua tarball
  • example.{SOURCE,TARGET} — an example input file and human reference
  • prepare.sh — a script that tokenizes and normalizes input data
  • model/* — the model files referenced in the Joshua configuration files
  • target/*.jar — the Joshua jar file (with dependencies)
  • README
  • BENCHMARKS — a file indicating performance on various test sets
  • CREDITS — details on how the language pack was built

Creating a Language Pack

You can easily create your own language pack from a tuned model using $JOSHUA/scripts/language-pack/build_lp.sh. This script gathers everything into a bundle, including the Joshua runtime jar files. The bundle can then be packed up with tar and released.
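For instance, packing the resulting bundle for release might look like the sketch below. The directory name is illustrative; build_lp.sh names the bundle after your language pair and the build date (see "The result" section below).

```shell
# Sketch: pack a generated bundle for release.
# The directory name is illustrative; build_lp.sh creates it under releases/.
dir=apache-joshua-es-en-2016-11-18
mkdir -p "releases/$dir"          # stand-in for the bundle build_lp.sh produced
tar -C releases -czf "$dir.tgz" "$dir"
tar -tzf "$dir.tgz"               # list the tarball contents to verify
```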

The script takes six arguments, which you can see if you run it with none.

$ cd $JOSHUA
$ ./scripts/language-pack/build_lp.sh
Usage: ./scripts/language-pack/build_lp.sh langpair config mem credits-file benchmark-file example
where
  langpair is the language pair, (e.g., es-en)
  config is the tuned Joshua config, (1/tune/joshua.config.final)
  mem is the amount of memory the decoder needs
  credits-file is a file describing how the model was built (1/CREDITS)
  benchmark-file is a file describing model performance on test sets (1/BENCHMARK)
  example is a path prefix to a pair of small (~10 lines) example files

In detail:

  1. The language pair, using the ISO 639-1 two-character code. This should correspond to what you used when you ran the pipeline.
  2. The tuned Joshua config file. It is best if this contains model file paths that are absolute instead of relative.
  3. The amount of memory Joshua will use when running. The default is 4 GB. To estimate the amount needed, sum the file sizes of all model files (language models and grammars) and round to nearest 2 GB. For example, if your language model is 2.1 GB, and your packed grammar is 0.8 GB, 4 GB should be fine.
  4. The credits file contains information about who built the language pack and what data sources were used to do so.
  5. The benchmark file should contain information about how well the language pack performs on a range of standard test sets for the language.
  6. The example should be two small files. These will be referenced in the README file that is created, and provide a quick way for a user to test the language pack.
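The memory estimate in step 3 can be sketched as follows. The file sizes here are the hypothetical 2.1 GB language model and 0.8 GB grammar from the example above; in practice you would read real sizes with du.

```shell
# Sum the model file sizes (in GB) and round up to the nearest 2 GB.
# The sizes below are illustrative, not read from real files.
lm_gb=2.1
grammar_gb=0.8
mem=$(awk -v a="$lm_gb" -v b="$grammar_gb" \
  'BEGIN { t = a + b; print int((t + 1.999) / 2) * 2 }')
echo "${mem}g"    # the value to pass as the mem argument
```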

Example Benchmark and Credits files

There is no particular prescribed format for these files. They should be human-readable files that give someone enough guidance to evaluate the models' performance on a range of popular test sets, and to gather the training data for themselves, should they wish to build a similar model.

Here is an example Benchmark file (used in our 2016-11-18 Turkish–English model):

These benchmarks are the results of a phrase-based model using the last 2,500 lines of bitext (held-out) from each OPUS training source.

4936999 parallel sentences were used to build the tr-en model.

Single-reference four-gram BLEU scores are reported for each test set.

KDE4               0.1571
OpenSubtitles2016  0.1377
SETIMES2           0.2204
Tanzil             0.1288
Tatoeba            0.2977
TED2013            0.1505
Wikipedia          0.3072

And here is an example Credits file, from the same language pack:

Language Pack (tr to en) created by Paul McNamee (mcnamee@jhu.edu) on 10/20/16.

The following corpora were used to train the model:
    bible-literal GlobalVoices KDE4 OpenSubtitles2016 SETIMES2 Tanzil Tatoeba TED2013 Ubuntu Wikipedia

Except the Bible, these corpora are available from the OPUS portal at:
    http://opus.lingfil.uu.se/

The OPUS corpora were downloaded from the website on 10/4/16.
The last 5,000 lines of each bitext are used for tuning (1st 2,500 lines)
and testing (2nd 2,500 lines). Up to the first 3 million lines of each
training file are used in building the model.

The target (English) side of the bitext was used in addition to a 2% sample
of English Gigaword 5th (LDC2011T07) to construct the language model.

The result

After running the script, you will find a directory structure that looks like the following:

releases/apache-joshua-<LANGPAIR>-YYYY-MM-DD/
    README
    BENCHMARK
    CREDITS
    example.<SRC>
    example.<TRG>
    joshua
    joshua.config
    model/
    prepare.sh
    scripts/
        preprocess.sh
        normalize.pl
        tokenize.pl
        nonbreaking_prefixes/
        detokenize.pl
    target/
        joshua-6.1-jar-with-dependencies.jar
    web/
        index.html
        README.md
        ...

Some notes:

  • README, BENCHMARK, and CREDITS are created automatically. They will have those names regardless of the names of the files you passed in.
  • Names above enclosed in <> are parameters.
  • joshua is the entry script for running the decoder
  • prepare.sh automatically prepares input data prior to sending it to the decoder. See the README file for usage. It basically lowercases, normalizes, and tokenizes the input.

Adding Docker support

The Dockerized version of Joshua requires you to manually add a "joshua.config.kenlm" file, which should be placed in the root directory of the language pack, along with the associated KenLM files, which should be referenced from the config file and placed under model/.
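As a sketch, joshua.config.kenlm typically differs from joshua.config only in its language-model feature line. The exact line depends on how your model was tuned, and the file name model/lm.kenlm below is an assumption for illustration:

```
# joshua.config.kenlm (sketch): identical to joshua.config except for the LM
# feature line, which points at a KenLM-compiled model placed under model/
feature-function = StateMinimizingLanguageModel -lm_order 4 -lm_file model/lm.kenlm
```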

Examples

Here is an example of packing a model that was built using the Joshua pipeline and resides in the directory 5/.

$JOSHUA/scripts/language-pack/build_lp.sh \
	es-en \
	5/tune/joshua.config.final.berkeleylm \
	8g \
	5/CREDITS \
	5/BENCHMARKS \
	example

Note something important here: this model was tuned with KenLM, which is the default. Because KenLM has to be compiled, it introduces an external dependency. So I created a version of the final tuned model file that uses BerkeleyLM instead of KenLM, by changing the line:

feature-function = StateMinimizingLanguageModel -lm_order 4 -lm_file /export/projects/mpost/language-packs/es-en/5/tune/model/lm.kenlm

to

feature-function = LanguageModel -lm_type berkeleylm -lm_order 4 -lm_file /export/projects/mpost/language-packs/es-en/5/lm.berkeleylm
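This edit can also be scripted. Here is a minimal sed sketch; the feature line and file names are illustrative (abbreviated to a relative path), not taken from a real config:

```shell
# Rewrite the KenLM feature line as a BerkeleyLM one.
# Both the line and the file names here are illustrative.
kenlm_line='feature-function = StateMinimizingLanguageModel -lm_order 4 -lm_file model/lm.kenlm'
berkeley_line=$(printf '%s\n' "$kenlm_line" \
  | sed -e 's/StateMinimizingLanguageModel/LanguageModel -lm_type berkeleylm/' \
        -e 's/lm\.kenlm/lm.berkeleylm/')
printf '%s\n' "$berkeley_line"
```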

I manually created the BerkeleyLM file with the following command:

$JOSHUA/scripts/lm/compile_berkeley.py lm.gz lm.berkeleylm

Notes on Language Pack Creation

Most of the many language packs for Joshua were built using a very generic phrase-based approach, from freely available datasets downloadable from OPUS. Here are a few things that may be useful to know when using them.

  1. Output Quality. The output quality will not be perfect, particularly if you translate data that looks very different from the type of data used to train the models. The models are very simple phrase-based translation models with fixed distortion limits. For details, see the CREDITS file inside each language pack.
  2. Help Us Improve Models. If you have interest in improving the results for a particular language pair, we have lots of ideas. Please contact us at dev@joshua.apache.org and we can help you out. If your models have better results on our test sets, we would be happy to replace the currently distributed model with yours!

 
