...


The GeoTopicParser combines a gazetteer (a lookup dictionary that maps place names to latitudes and longitudes) with a Named Entity Recognition (NER) model that identifies place names in text. Together they provide a way to geotag documents and text, i.e., to identify places in the text and then look up the latitude/longitude pairs for those places.

GeoTopicParser uses Geonames.org, Apache Lucene and Apache OpenNLP to provide its capabilities.

The work is based on this paper:

An Automatic Approach for Discovering and Geocoding Locations in Domain-Specific Web Data

In Proceedings of the IEEE International Conference on Information Reuse and Integration, Pittsburgh, Pennsylvania, USA, July 28-30, 2016

Authors: Chris A. Mattmann, Madhav Sharan

Installing the Lucene Gazetteer

...

The next thing you'll need is a Named Entity Recognition model for places. The GeoTopicParser uses Apache OpenNLP, and as of version 1.5 OpenNLP provides pre-trained models for location names in text data. You can download the pre-trained en-ner-location.bin file from the OpenNLP project. Note that OpenNLP's default name finder is not very accurate, so building your own NER location model is highly recommended. In this case, please follow these instructions.

...

$ mkdir $HOME/src/location-ner-model && cd $HOME/src/location-ner-model
$ curl -O http://opennlp.sourceforge.net/models-1.5/en-ner-location.bin
$ mkdir -p org/apache/tika/parser/geo/topic
$ mv en-ner-location.bin org/apache/tika/parser/geo/topic
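
Before moving on, it can help to confirm that the model ended up at the resource path created above, since a failed download can leave an empty or missing file. The following is a minimal sketch, not part of Tika: the function name has_ner_model and the MODEL_DIR override are hypothetical, and the default directory is simply the one used in the commands above.

```shell
#!/bin/sh
# Relative resource path the model was moved to in the steps above.
MODEL_PATH="org/apache/tika/parser/geo/topic/en-ner-location.bin"

# has_ner_model DIR (hypothetical helper):
# succeeds only if DIR holds a non-empty file at MODEL_PATH.
has_ner_model() {
    [ -s "$1/$MODEL_PATH" ]
}

# Default to the directory used above; override by setting MODEL_DIR.
MODEL_DIR="${MODEL_DIR:-$HOME/src/location-ner-model}"
if has_ner_model "$MODEL_DIR"; then
    echo "NER model found under $MODEL_DIR"
else
    echo "NER model missing or empty: expected $MODEL_DIR/$MODEL_PATH"
fi
```

The check uses -s rather than -f so that a zero-byte file is also reported as a problem.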

Test out the GeoTopicParser

Now you can run Tika and try out the GeoTopicParser. At the moment, since it's a Parser and not a Content-Handler (hopefully one will be developed later), the parser is mapped to the MIME type application/geotopic, which is a sub-class of text/plain. So, for now, there are two steps to try the parser out.

  1. Create a .geot file. You can use this sample file from the NSF Polar data contributed to TREC.
  2. Tell Tika about the application/geotopic MIME type. You can download this file and place it on the classpath in the org/apache/tika/mime directory, e.g., by doing:

    $ mkdir $HOME/src/geotopic-mime && cd $HOME/src/geotopic-mime
    $ mkdir -p org/apache/tika/mime
    $ curl -O https://raw.githubusercontent.com/chrismattmann/geotopicparser-utils/master/mime/org/apache/tika/mime/custom-mimetypes.xml
    $ mv custom-mimetypes.xml org/apache/tika/mime
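
As a quick sanity check on the step above, you can confirm that the downloaded file actually mentions the application/geotopic type before putting it on the classpath. This is a minimal sketch under stated assumptions: the helper name declares_geotopic and the MIME_FILE override are hypothetical, and the check only looks for the type string rather than validating the XML.

```shell
#!/bin/sh
# declares_geotopic FILE (hypothetical helper):
# succeeds if FILE exists and mentions the application/geotopic MIME type.
declares_geotopic() {
    grep -q 'application/geotopic' "$1" 2>/dev/null
}

# Default to the location used above; override by setting MIME_FILE.
MIME_FILE="${MIME_FILE:-$HOME/src/geotopic-mime/org/apache/tika/mime/custom-mimetypes.xml}"
if declares_geotopic "$MIME_FILE"; then
    echo "application/geotopic is declared in $MIME_FILE"
else
    echo "application/geotopic not found in $MIME_FILE"
fi
```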
    


With those files in place, let's run the GeoTopicParser using Tika-App:

$ java -classpath tika-app-<LATEST-VERSION>-SNAPSHOT.jar:$HOME/src/location-ner-model:$HOME/src/geotopic-mime org.apache.tika.cli.TikaCLI -m polar.geot

...

It sure will! When you start Tika Server, make sure that the NER model file and the custom MIME type file are on your classpath, and that lucene-geo-gazetteer is on the $PATH in the environment where Tika Server is started. You can then post as many .geot files as you like, and Tika Server will call the GeoTopicParser to provide location information.
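
Posting a .geot file to a running Tika Server might look like the sketch below. It makes several assumptions: the server listens on its default address http://localhost:9998, metadata is requested from the /rmeta endpoint, and the Content-Disposition header is used to pass the .geot file name so that the custom MIME type can be detected. The helper post_geot and the DRY_RUN switch are hypothetical names introduced only for illustration.

```shell
#!/bin/sh
# post_geot FILE [SERVER] (hypothetical helper):
# PUT FILE to the server's /rmeta endpoint, passing the file name along
# so the server can match it against the *.geot MIME type definition.
# Set DRY_RUN=1 to print the curl command instead of running it.
post_geot() {
    file=$1
    server=${2:-http://localhost:9998}   # assumed default Tika Server address
    name=$(basename "$file")
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "curl -T $file -H \"Content-Disposition: attachment; filename=$name\" $server/rmeta"
    else
        curl -T "$file" -H "Content-Disposition: attachment; filename=$name" "$server/rmeta"
    fi
}

# Show the command that would be sent for the sample file, without contacting a server.
DRY_RUN=1 post_geot polar.geot
```

Dropping DRY_RUN=1 sends the request for real; the response should then include whatever location metadata the GeoTopicParser produced for the file.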

...