Grobid Quantities with Tika

Grobid Quantities is a Java library that recognizes expressions of measurement (e.g. pressure, temperature) in textual documents, then parses, normalizes, and converts them to SI units. It can be used on technical and scientific articles (text, XML, and PDF input) and on patents (text and XML input). To use it with Tika, you must install and run the REST server endpoint created for Grobid Quantities, which extracts measurements from the text passed to it.

Installation

Install Grobid Quantities by following the steps on GitHub, and make sure the quantity model is trained as described in those instructions.

After installing and training the model, start the REST server with the following command:

Start Grobid Quantities Server

$ mvn -Dmaven.test.skip=true jetty:run-war

By default the server listens on port 8080 and is reachable at http://127.0.0.1:8080.
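Before configuring Tika, it can help to confirm that the server is actually answering. The following is a minimal sketch, assuming curl is installed; the function name wait_for_server and its retry arguments are hypothetical helpers for illustration and are not part of Grobid Quantities or Tika.

```shell
# Poll a URL until it answers or the attempts run out.
# Arguments: URL, number of attempts (default 10), seconds between attempts (default 2).
wait_for_server() {
    url="$1"
    attempts="${2:-10}"
    delay="${3:-2}"
    i=1
    while [ "$i" -le "$attempts" ]; do
        # --silent hides progress output and --output /dev/null discards the body.
        # curl exits 0 as soon as the server returns any HTTP response.
        if curl --silent --output /dev/null --max-time 5 "$url"; then
            echo "server is answering at $url"
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    echo "no answer from $url after $attempts attempts" >&2
    return 1
}

# Example call against the default address:
# wait_for_server http://127.0.0.1:8080
```

The check only shows that something responds over HTTP at that address; it does not verify that the quantity model was trained or loaded.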

Preparing resources for Grobid Quantities in Tika-App

  1. Activate Named Entity Parser
     To use any of the NamedEntityParser implementations in Tika, the parser responsible for the named entity recognition task must be enabled. This can be done with a Tika config XML file, as follows:
     <?xml version="1.0" encoding="UTF-8"?>
     <properties>
         <parsers>
             <parser class="org.apache.tika.parser.ner.NamedEntityParser">
                 <mime>text/plain</mime>
                 <mime>text/html</mime>
                 <mime>application/xhtml+xml</mime>
             </parser>
         </parsers>
     </properties>
     
    This configuration is passed to Tika in a later step, so save it as 'tika-config.xml'.

    2. Supply GrobidServer.properties file
    Tika needs to know where the grobid-quantities server is running. By default Tika assumes the server runs on port 8080. To specify any other port, you must supply a GrobidServer.properties file. A sample GrobidServer.properties file looks like the following:
    grobid.server.url=http://localhost:8080
    grobid.endpoint.text=/processQuantityText
     

    In a nutshell:
     # Create a directory for keeping the config and properties files.
     export GROBID_QUANTITIES_RES="$HOME/GrobidQuantitiesRest-resources"
     mkdir -p "$GROBID_QUANTITIES_RES"
     cd "$GROBID_QUANTITIES_RES"
     # tika-config.xml must be stored in this directory
     pwd
    
     export PATH_PREFIX="$GROBID_QUANTITIES_RES/org/apache/tika/parser/ner/grobid"
     mkdir -p "$PATH_PREFIX"
     # Create and edit the properties file
     vim "$PATH_PREFIX/GrobidServer.properties"
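
The steps above open an editor for the properties file. As a non-interactive alternative, the sketch below writes both files with here-documents, reusing the same directory layout and file contents shown earlier. The function name write_grobid_resources is a hypothetical helper, and port 9090 in the example call is only an assumed non-default port.

```shell
# Write tika-config.xml and GrobidServer.properties below a resources directory.
# Arguments: resources directory, server port (defaults to 8080).
write_grobid_resources() {
    res_dir="$1"
    port="${2:-8080}"
    props_dir="$res_dir/org/apache/tika/parser/ner/grobid"
    mkdir -p "$props_dir" || return 1

    # Unquoted EOF so that $port is expanded into the URL.
    cat > "$props_dir/GrobidServer.properties" <<EOF
grobid.server.url=http://localhost:$port
grobid.endpoint.text=/processQuantityText
EOF

    # Quoted 'EOF' so that the XML is written literally.
    cat > "$res_dir/tika-config.xml" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<properties>
    <parsers>
        <parser class="org.apache.tika.parser.ner.NamedEntityParser">
            <mime>text/plain</mime>
            <mime>text/html</mime>
            <mime>application/xhtml+xml</mime>
        </parser>
    </parsers>
</properties>
EOF
}

# Hypothetical example: the server was started on port 9090 instead of 8080.
# write_grobid_resources "$HOME/GrobidQuantitiesRest-resources" 9090
```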
     


Running Grobid Quantities with Tika

export TIKA_APP={your/path/to/tika-app}/target/tika-app-1.13-SNAPSHOT.jar

# Set the system property to use the GrobidNERecogniser class
java -Dner.impl.class=org.apache.tika.parser.ner.grobid.GrobidNERecogniser -classpath "$GROBID_QUANTITIES_RES:$TIKA_APP" org.apache.tika.cli.TikaCLI --config="$GROBID_QUANTITIES_RES/tika-config.xml" -m https://en.wikipedia.org/wiki/Time
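
The command puts the resources directory on the classpath and passes tika-config.xml from the same directory, which suggests that GrobidServer.properties is looked up as a classpath resource under org/apache/tika/parser/ner/grobid. Under that assumption, a small pre-flight check such as the sketch below can catch a misplaced file before Tika is started. The function name check_grobid_resources is hypothetical.

```shell
# Return 0 only if both files sit where the layout described above places them.
# Argument: the resources directory (e.g. "$GROBID_QUANTITIES_RES").
check_grobid_resources() {
    res_dir="$1"
    status=0
    for f in \
        "$res_dir/tika-config.xml" \
        "$res_dir/org/apache/tika/parser/ner/grobid/GrobidServer.properties"
    do
        if [ ! -f "$f" ]; then
            # Report every missing file rather than stopping at the first one.
            echo "missing: $f" >&2
            status=1
        fi
    done
    return "$status"
}

# Example: run the check before invoking Tika.
# check_grobid_resources "$GROBID_QUANTITIES_RES" && echo "resources look complete"
```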
