Statistical Machine Translation with Apache Joshua (Incubating)

Introduction

This page provides detail on how to use Apache Joshua (Incubating) to undertake statistical machine translation (SMT) via the Tika.translate API. This work is the result of development that has taken place both through TIKA-1343 (Create a Tika Translator implementation that uses JoshuaDecoder) and via close work within the Joshua community.

The benefits of using this approach for achieving language translation through Tika are as follows:

  • It's free! As opposed to several other translation services currently available via Tika, SMT via Joshua is free. You build the language models, you set up and manage the infrastructure, and you have 100% control over the resulting translation.
  • You are not restricted by a usage ceiling. As there is no paid service, you can use this method completely unrestricted.
  • The language model generation and quality are completely transparent. A major issue nowadays is that statistical models (or, more generally, any models utilized within learning processes) are typically not shared, making it difficult to fully quantify or justify the results you get. For example, if we were to use Google Translate, we would have absolutely no insight into how the translations are undertaken, what accuracy they achieve, and so on. The method and work proposed here address this concern entirely. Everything is 100% transparent.

The downsides of using this approach are as follows:

  • Joshua, the underlying SMT toolkit, is quite a complex piece of software. This should by no means be a surprise... after all, SMT is an extremely difficult and active research area. Some of the world's largest companies, e.g. Google, Yahoo!, Bing and IBM, are investing large sums of money and significant resources in trying to address the issues. The fact that we have SMT available via Tika is a huge step towards building the SMT open source community.
  • Depending upon your translation requirements, you may be required to build your own language models. This, however, depends on which models are available via the Joshua community. If you do need to build your own models/language packs, this is not exactly a trivial process; however, you can find plenty of help on this topic over on the Joshua mailing lists.
  • Depending on the availability of good hardware, you may encounter performance issues. Loading large language models, SMT tasks generally, and building new language packs all tend to benefit from powerful machines with lots of RAM. If such hardware is not available, you may encounter issues.

With the above in mind, let us continue with configuring and provisioning Tika for SMT with Joshua.

Step 1: Retrieve the Joshua Language Pack

In this example, we will be using a Spanish-to-English n-gram language model pack which was generated on October 6th, 2016 and built using BerkeleyLM. For more detail on the language pack itself and how it was produced, see the Language Pack Details.

...

Step 3: Configure and Provision Apache Tika

So now let us grab the Tika source, then configure, compile and deploy it such that we can utilize Joshua's SMT functionality.
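Once Tika has been compiled and deployed with Joshua support, translations can be driven programmatically through the Tika.translate API. Below is a minimal Java sketch; it assumes the JoshuaNetworkTranslator class that resulted from TIKA-1343 (in the tika-translate module) and a running Joshua server loaded with the es-en language pack. The exact mechanism for pointing the translator at your server is deployment-specific, so treat the details as illustrative rather than definitive.

No Format
import org.apache.tika.language.translate.JoshuaNetworkTranslator;

public class JoshuaTranslateExample {

    public static void main(String[] args) throws Exception {
        // JoshuaNetworkTranslator posts source text to a running Joshua
        // server; the server location is read from the translator's
        // configuration (deployment-specific).
        JoshuaNetworkTranslator translator = new JoshuaNetworkTranslator();

        if (translator.isAvailable()) {
            // Translate Spanish source text to English using the es-en
            // language pack served by Joshua. The sample sentence is
            // purely illustrative.
            String english = translator.translate(
                    "el libro está sobre la mesa", "es", "en");
            System.out.println(english);
        }
    }
}

If the translator reports itself unavailable, check that the tika-translate dependencies are on the classpath and that the Joshua server is reachable.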

...

Benchmark

Test set         | BLEU score | Meteor score | Description
-----------------|------------|--------------|----------------------
Newstest2013     | 27.46      | 33.39        | News
Fisher dev2      | 37.13      | 36.75        | Conversational speech
Callhome evltest | 32.44      | 33.39        | Conversational speech
Global Voices    | 36.33      | 36.40        | News

Configuration

No Format
# MERT optimized configuration
# decoder /export/projects/mpost/language-packs/es-en/5/tune/model/run-joshua.sh
# BLEU 0.2655 on dev /export/projects/mpost/language-packs/es-en/5/data/tune/corpus.es
# We were before running iteration 4
# finished Tue Sep 27 12:03:42 EDT 2016
# This file is a template for the Joshua pipeline; variables enclosed
# in <angle-brackets> are substituted by the pipeline script as
# appropriate.  This file also serves to document Joshua's many
# parameters.

# These are the grammar file specifications.  Joshua supports an
# arbitrary number of grammar files, each specified on its own line
# using the following format:
#
#   tm = TYPE OWNER LIMIT FILE
# 
# TYPE is "packed", "thrax", or "samt".  The latter denotes the format
# used in Zollmann and Venugopal's SAMT decoder
# (http://www.cs.cmu.edu/~zollmann/samt/).
# 
# OWNER is the "owner" of the rules in the grammar; this is used to
# determine which set of phrasal features apply to the grammar's
# rules.  Having different owners allows different features to be
# applied to different grammars, and for grammars to share features
# across files.
#
# LIMIT is the maximum input span permitted for the application of
# grammar rules found in the grammar file.  A value of -1 implies no limit.
#
# FILE is the grammar file (or directory when using packed grammars).
# The file can be compressed with gzip, which is determined by the
# presence or absence of a ".gz" file extension.
#
# By a convention defined by Chiang (2007), the grammars are split
# into two files: the main translation grammar containing all the
# learned translation rules, and a glue grammar which supports
# monotonic concatenation of hierarchical phrases. The glue grammar's
# main distinction from the regular grammar is that the span limit
# does not apply to it.  

tm = phrase -owner pt -maxspan 0 -path model/grammar.gz.packed

# This symbol is used over unknown words in the source language

default-non-terminal = X

# This is the goal nonterminal, used to determine when a complete
# parse is found.  It should correspond to the root-level rules in the
# glue grammar.

goal-symbol = GOAL

# Language model config.
#
# Multiple language models are supported.  For each language model,
# create one of the following lines:
#
# feature-function = LanguageModel -lm_type TYPE -lm_order ORDER -lm_file FILE
# feature-function = StateMinimizingLanguageModel -lm_order ORDER -lm_file FILE
#
# - TYPE is one of "kenlm" or "berkeleylm"
# - ORDER is the order of the language model (default 5)
# - FILE is the path to the LM file. This can be binarized if appropriate to the type
#   (e.g., KenLM has a compiled format)
#
# A state-minimizing LM collapses left-state. Currently only KenLM supports this.
#
# For each LM, add a weight lm_INDEX below, where indexing starts from 0.
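#
# In this language pack a single BerkeleyLM model is activated; the
# corresponding line appears further down in this file, where the
# pipeline placed it:
#
#   feature-function = LanguageModel -lm_order 4 -lm_file model/lm.gz -lm_type berkeleylm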



# The suffix _OOV is appended to unknown source-language words if this
# is set to true.

mark-oovs = false

# The search algorithm: "cky" for hierarchical decoding, 
# "stack" for phrase-based decoding
search = stack

# The pop-limit for decoding.  This determines how many hypotheses are
# considered over each span of the input.

pop-limit = 100

# How many hypotheses to output

top-n = 1

# Whether those hypotheses should be distinct strings

use-unique-nbest = true

# This is the default format of the output printed to STDOUT.  The variables that can be
# substituted are:
#
# %i: the sentence number (0-indexed)
# %s: the translated sentence
# %S: the translated sentence, with basic output denormalization applied
# %t: the derivation tree
# %f: the feature string
# %c: the model cost

output-format = %S

# When printing the trees (%t in 'output-format'), this controls whether the alignments
# are also printed.

include-align-index = false

# And these are the feature functions to activate.
feature-function = OOVPenalty
feature-function = WordPenalty

## Model weights #####################################################

# For each language model line listed above, create a weight in the
# following format: the keyword "lm", a 0-based index, and the weight.
# lm_INDEX WEIGHT
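#
# The tuned weight for this pack's single language model (lm_0) appears
# near the bottom of this file, alongside the other feature weights.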


# The phrasal weights correspond to weights stored with each of the
# grammar rules.  The format is
#
#   tm_OWNER_COLUMN WEIGHT
#
# where COLUMN denotes the 0-based order of the parameter in the
# grammar file and WEIGHT is the corresponding weight.  In the future,
# we plan to add a sparse feature representation which will simplify
# this.

# The wordpenalty feature counts the number of words in each hypothesis.


# This feature counts the number of unknown words in the hypothesis.


# This feature weights paths through an input lattice.  It is only activated
# when decoding lattices; no such feature is activated in this configuration.

feature-function = LanguageModel -lm_order 4 -lm_file model/lm.gz -lm_type berkeleylm
feature-function = Distortion
feature-function = PhrasePenalty



lowercase = -project-case

lm_0 0.242132004556722
WordPenalty -0.111308832033767
OOVPenalty 0.0101534888932218
tm_pt_2 0.0241130425384253
PhrasePenalty -0.0240605834315291
tm_pt_0 0.0262269656358665
tm_pt_1 0.0535319307753204
Distortion 0.121100027216756
tm_pt_3 0.191275104853519
tm_pt_4 0.119489193983075
tm_pt_5 0.076608826081799

...