

The GeoTopicParser combines a Gazetteer (a lookup dictionary that maps place names to latitudes and longitudes) with a Named Entity Recognition (NER) model that identifies place names in text. Together they provide a way to geotag documents and text, i.e., to identify places in the text and then look up the latitude/longitude pairs for those places.

GeoTopicParser uses Geonames.org, Apache Lucene and Apache OpenNLP to provide its capabilities.
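
To make the two-step idea concrete, here is a minimal Python sketch of it. It is not the GeoTopicParser implementation: the tiny `TOY_GAZETTEER`, its rough coordinates, and the helper names are all hypothetical, and a plain word match stands in for the OpenNLP model.

```python
import re

# Hypothetical toy gazetteer: place name -> (latitude, longitude).
# The coordinates are rough, illustrative values, not Geonames.org data.
TOY_GAZETTEER = {
    "Pasadena": (34.15, -118.14),
    "Texas": (31.25, -99.25),
}


def find_place_names(text, known_names):
    """Stand-in for the NER step: return known names in order of appearance.

    A real NER model predicts place names from context; this toy version
    only matches whole words against a fixed list of names.
    """
    if not known_names:
        return []
    pattern = r"\b(" + "|".join(re.escape(n) for n in known_names) + r")\b"
    return re.findall(pattern, text)


def geotag(text, gazetteer):
    """Return (name, latitude, longitude) for each place mention in the text."""
    return [(name, *gazetteer[name]) for name in find_place_names(text, gazetteer)]


if __name__ == "__main__":
    for tag in geotag("We drove from Pasadena to Texas.", TOY_GAZETTEER):
        print(tag)
```

The real parser replaces the word match with a trained model and the dictionary with a Lucene index built from Geonames.org, but the flow (find names, then look up coordinates) is the same.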

Installing the Lucene Gazetteer

First, download and install the Lucene Geo Gazetteer project:

$ cd $HOME/src
$ git clone https://github.com/chrismattmann/lucene-geo-gazetteer.git
$ cd lucene-geo-gazetteer
$ mvn install
$ export PATH=$HOME/src/lucene-geo-gazetteer/src/main/bin:$PATH

Once done, you can verify that the installation worked by running the following command:

$ lucene-geo-gazetteer --help
usage: lucene-geo-gazetteer
 -b,--build <gazetteer file>           The Path to the Geonames
 -h,--help                             Print this message.
 -i,--index <directoryPath>            The path to the Lucene index
                                       directory to either create or read
 -s,--search <set of location names>   Location names to search the
                                       Gazetteer for

You will now need to build a Gazetteer from the Geonames.org dataset, as shown below. Note that you will need at least 1.2 GB of disk space to build the Lucene index for the Gazetteer.

$ cd $HOME/src/lucene-geo-gazetteer
$ curl -O http://download.geonames.org/export/dump/allCountries.zip
$ unzip allCountries.zip
$ java -cp target/lucene-geo-gazetteer-<version>-jar-with-dependencies.jar edu.usc.ir.geo.gazetteer.GeoNameResolver -i geoIndex -b allCountries.txt

You can verify that the Gazetteer build worked by searching for, e.g., Pasadena and Texas:

$ lucene-geo-gazetteer -s Pasadena Texas
{"Texas" : [
{"Pasadena" : [

Note that we used the convenience script lucene-geo-gazetteer, which assumes that you created an index named geoIndex in the $HOME/src/lucene-geo-gazetteer/geoIndex directory. We could also have used the pure Java command line to search. The Gazetteer returns a JSON list of JSON objects, each of which maps a key to a JSON list. The key is the location name given, and the list holds the closest match (by edit distance) in the Gazetteer for that name, followed by the latitude and longitude of that location.
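
If you want to consume that return value from a script, the following Python sketch shows one way to do it. The exact JSON layout is an assumption reconstructed from the description above (name first, then latitude, then longitude), and `SAMPLE_RESPONSE` holds hypothetical, rounded values rather than real Gazetteer output, so check it against what your build actually prints.

```python
import json

# Hypothetical response, shaped as the text describes: a JSON list of
# single-key objects, each mapping the queried name to
# [closest match, latitude, longitude]. Values are illustrative only.
SAMPLE_RESPONSE = """
[
  {"Texas": ["Texas", "31.25", "-99.25"]},
  {"Pasadena": ["Pasadena", "34.15", "-118.14"]}
]
"""


def parse_gazetteer_response(raw):
    """Map each queried name to (matched_name, latitude, longitude)."""
    results = {}
    for entry in json.loads(raw):
        for query, match in entry.items():
            if len(match) < 3:
                continue  # skip queries that came back without a usable match
            # float() accepts both JSON numbers and numeric strings.
            results[query] = (match[0], float(match[1]), float(match[2]))
    return results


if __name__ == "__main__":
    for query, hit in parse_gazetteer_response(SAMPLE_RESPONSE).items():
        print(query, "->", hit)
```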

Installing and downloading an NER model

The next thing you'll need is a Named Entity Recognition model for places. The GeoTopicParser uses Apache OpenNLP, which as of its 1.5 version provides pre-trained models for finding location names in text data. You can download the en-ner-location.bin file pre-trained by the OpenNLP folks. One thing to note is that OpenNLP's default name finder is not accurate, so building your own NER location model is highly recommended. In that case, please follow these instructions.

The model needs to be placed on the classpath for your Tika installation in the following directory:

org/apache/tika/parser/geo/topic

The following instructions show how to download the model and place it on the right path:

$ mkdir $HOME/src/location-ner-model && cd $HOME/src/location-ner-model
$ curl -O http://opennlp.sourceforge.net/models-1.5/en-ner-location.bin
$ mkdir -p org/apache/tika/parser/geo/topic
$ mv en-ner-location.bin org/apache/tika/parser/geo/topic
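
A quick way to confirm that the model ended up where Tika will look for it is to resolve the resource path against each classpath entry yourself. The Python sketch below does that under simple assumptions: `find_on_classpath` is a hypothetical helper, it only inspects directory entries (jar files are skipped), and it assumes the `:` separator used on Unix-like systems.

```python
import os

# Resource path of the model relative to a classpath root, as created above.
MODEL_RESOURCE = "org/apache/tika/parser/geo/topic/en-ner-location.bin"


def find_on_classpath(classpath, resource, sep=":"):
    """Return the first classpath directory that contains resource, or None.

    Jar entries never match here because the check only looks for the
    resource as a regular file underneath a directory.
    """
    for entry in classpath.split(sep):
        if entry and os.path.isfile(os.path.join(entry, *resource.split("/"))):
            return entry
    return None


if __name__ == "__main__":
    model_root = os.path.expandvars("$HOME/src/location-ner-model")
    print(find_on_classpath("tika-app.jar:" + model_root, MODEL_RESOURCE))
```

If this prints None, the directory you pass to -classpath is probably not the parent of the org directory.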

Test out the GeoTopicParser

Now you can run Tika and try out the GeoTopicParser. At the moment it is a Parser and not a Content-Handler (hopefully one will be developed later), so the parser is mapped to the MIME type application/geotopic, which is a sub-class of text/plain. There are therefore two steps to try the parser out now.

  1. Create a .geot file. You can use this sample file from the NSF Polar data contributed to TREC.

  2. Tell Tika about the application/geotopic MIME type. You can download this file and place it on the classpath in the org/apache/tika/mime directory, e.g., by doing:

    $ mkdir $HOME/src/geotopic-mime && cd $HOME/src/geotopic-mime
    $ mkdir -p org/apache/tika/mime
    $ curl -O https://raw.githubusercontent.com/chrismattmann/geotopicparser-utils/master/mime/org/apache/tika/mime/custom-mimetypes.xml
    $ mv custom-mimetypes.xml org/apache/tika/mime

With those files in place, let's run the GeoTopicParser using Tika-App:

$ java -classpath tika-app-1.9-SNAPSHOT.jar:$HOME/src/location-ner-model:$HOME/src/geotopic-mime org.apache.tika.cli.TikaCLI -m polar.geot

This should output:

Content-Length: 881
Content-Type: application/geotopic
Geographic_LATITUDE: 27.33931
Geographic_LONGITUDE: -108.60288
Geographic_NAME: China
Optional_LATITUDE1: 39.76
Optional_LONGITUDE1: -98.5
Optional_NAME1: United States
X-Parsed-By: org.apache.tika.parser.DefaultParser
X-Parsed-By: org.apache.tika.parser.geo.topic.GeoParser
resourceName: polar.geot

The output contains 3-tuples of {Name, Latitude, Longitude}. The best match for the location is the one that occurs most frequently in the text; it is provided as Geographic_NAME, along with its corresponding Geographic_LATITUDE and Geographic_LONGITUDE. Other places that the NER model identified as entities in the provided text are listed as Optional_NAME<N>, e.g., Optional_NAME1 for the first alternative location, with its corresponding Optional_LATITUDE1 and Optional_LONGITUDE1.
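
To work with those tuples programmatically, you could parse the Tika-App metadata output as in the Python sketch below. The helper names are hypothetical, and the sketch assumes the plain "Key: value" layout shown above with alternatives numbered consecutively from 1; `SAMPLE_OUTPUT` is copied from the example output.

```python
SAMPLE_OUTPUT = """\
Content-Length: 881
Content-Type: application/geotopic
Geographic_LATITUDE: 27.33931
Geographic_LONGITUDE: -108.60288
Geographic_NAME: China
Optional_LATITUDE1: 39.76
Optional_LONGITUDE1: -98.5
Optional_NAME1: United States
X-Parsed-By: org.apache.tika.parser.DefaultParser
X-Parsed-By: org.apache.tika.parser.geo.topic.GeoParser
resourceName: polar.geot
"""


def parse_metadata(lines):
    """Parse 'Key: value' lines into a dict, keeping the first value per key."""
    meta = {}
    for line in lines:
        key, sep, value = line.partition(": ")
        if sep:
            meta.setdefault(key.strip(), value.strip())
    return meta


def extract_locations(meta):
    """Return [(name, latitude, longitude), ...] with the best match first."""
    locations = []
    if "Geographic_NAME" in meta:
        locations.append((meta["Geographic_NAME"],
                          float(meta["Geographic_LATITUDE"]),
                          float(meta["Geographic_LONGITUDE"])))
    n = 1
    # Alternatives are assumed to be numbered 1, 2, 3, ... without gaps.
    while f"Optional_NAME{n}" in meta:
        locations.append((meta[f"Optional_NAME{n}"],
                          float(meta[f"Optional_LATITUDE{n}"]),
                          float(meta[f"Optional_LONGITUDE{n}"])))
        n += 1
    return locations


if __name__ == "__main__":
    for location in extract_locations(parse_metadata(SAMPLE_OUTPUT.splitlines())):
        print(location)
```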

Will this work from Tika Server?

It sure will! When you start Tika Server, make sure that the NER model file and the custom MIME type are on your classpath, and that lucene-geo-gazetteer is on the $PATH of the environment where Tika-Server is started. You can then post all the .geot files that you'd like, and Tika-Server will happily call the GeoTopicParser to provide you location information.
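
Because a missing script on the $PATH is easy to overlook, a small pre-flight check can help. The Python sketch below uses the standard library's shutil.which; `gazetteer_script_path` is a hypothetical helper name, and the check only tells you about the environment the Python process runs in, which is assumed to be the same shell that will launch Tika-Server.

```python
import shutil


def gazetteer_script_path(path=None):
    """Return the resolved location of lucene-geo-gazetteer, or None if absent.

    With path=None the current PATH environment variable is searched.
    """
    return shutil.which("lucene-geo-gazetteer", path=path)


if __name__ == "__main__":
    location = gazetteer_script_path()
    if location is None:
        print("lucene-geo-gazetteer is not on the PATH")
    else:
        print("found", location)
```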

First, start up the Tika server with your NER model and .geot MIME type definition on the classpath:

$ java -classpath $HOME/src/geotopicparser-utils/models/polar:$HOME/src/geotopicparser-utils/mime:tika-server/target/tika-server-1.9-SNAPSHOT.jar org.apache.tika.server.TikaServerCli

Then, try calling the /rmeta service to get the returned metadata:

$ curl -T $HOME/src/geotopicparser-utils/geotopics/polar.geot -H "Content-Disposition: attachment; filename=polar.geot" http://localhost:9998/rmeta

The response should include the following. That's it!

      "Geographic_NAME":"United States",
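
If you save the /rmeta response, the Python sketch below shows one way to pull the location name out of it. /rmeta returns a JSON array with one metadata object per parsed document; `SAMPLE_RMETA` is a hypothetical, heavily trimmed response in which only the Geographic_NAME value is taken from the line above, and `geographic_names` is an illustrative helper name.

```python
import json

# Hypothetical, trimmed /rmeta response: a JSON array holding one metadata
# object per parsed document. A real response carries many more keys.
SAMPLE_RMETA = '[{"Geographic_NAME": "United States", "resourceName": "polar.geot"}]'


def geographic_names(rmeta_json):
    """Return the Geographic_NAME of every metadata object that has one."""
    return [doc["Geographic_NAME"]
            for doc in json.loads(rmeta_json)
            if "Geographic_NAME" in doc]


if __name__ == "__main__":
    print(geographic_names(SAMPLE_RMETA))
```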

GeoTopicParser (last edited 2016-03-04 19:34:16 by MadhavSharan)