Grobid Quantities is a Grobid module that specialises in recognising expressions of measurement (e.g. pressure, temperature) in textual documents such as PDF publications.
Measurements are parsed, normalised, and converted into SI units.
To use its capabilities with Tika, you must install the Grobid Quantities server, whose endpoint extracts measurements from the text passed to it.
...
Two resource files need to be created and supplied later: tika-config.xml and GrobidServer.properties.
A predefined set of configuration files is available here:
Code Block |
---|
git clone https://github.com/lfoppiano/grobid-quantities-tika-parser-resources.git grobidquantities-parser-resources |
Alternatively, it is possible to create the files manually, as described below.
Manual configuration
Create tika-config.xml
In order to use any of the NamedEntityParser implementations in Tika, the parser responsible for handling the name recognition task needs to be enabled.
This can be done by creating the tika-config.xml file, as follows:
No Format |
---|
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.ner.NamedEntityParser">
      <mime>text/plain</mime>
      <mime>text/html</mime>
      <mime>application/xhtml+xml</mime>
    </parser>
  </parsers>
</properties>
|
...
Create GrobidServer.properties
Tika must know on which host the grobid-quantities server is running. By default, Tika assumes the server runs on port 8060.
To specify any other port, you must supply a GrobidServer.properties file. A sample GrobidServer.properties file looks like the following:
No Format |
---|
grobid.server.url=http://localhost:8060
grobid.endpoint.text=/processQuantityText
|
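The two properties above together determine where the text is sent. The following is a minimal sketch for checking that, assuming the recogniser simply concatenates grobid.server.url and grobid.endpoint.text, and assuming a plain key=value file without line continuations or escapes. The helper names get_prop and grobid_text_url are hypothetical and are not part of Tika or Grobid.

```shell
# Print the value of one key from a simple key=value properties file.
# $1 = properties file, $2 = key
get_prop() {
  awk -F= -v key="$2" '
    /^[ \t]*[#!]/ { next }                 # skip comment lines
    {
      k = $1
      gsub(/^[ \t]+|[ \t]+$/, "", k)       # trim whitespace around the key
      if (k == key && index($0, "=") > 0) {
        v = substr($0, index($0, "=") + 1) # keep any "=" inside the value
        gsub(/^[ \t]+|[ \t\r]+$/, "", v)   # trim whitespace and a trailing CR
        print v
        exit
      }
    }' "$1"
}

# Print the server URL followed by the text endpoint.
# ASSUMPTION: the two values are joined without any separator.
# $1 = properties file
grobid_text_url() {
  printf '%s%s\n' "$(get_prop "$1" grobid.server.url)" \
                  "$(get_prop "$1" grobid.endpoint.text)"
}
```

Calling `grobid_text_url path/to/GrobidServer.properties` prints the combined URL, which makes a mistyped port or a missing leading slash in the endpoint easy to spot before Tika is started.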
...
No Format |
---|
# Create a directory for keeping the config and properties files.
export GROBID_QUANTITIES_RES="$HOME/GrobidQuantitiesRest-resources"
mkdir -p "$GROBID_QUANTITIES_RES"
cd "$GROBID_QUANTITIES_RES"
# tika-config.xml must be stored in this directory
pwd
export PATH_PREFIX="$GROBID_QUANTITIES_RES/org/apache/tika/parser/ner/grobid"
mkdir -p "$PATH_PREFIX"
# Create and edit the properties file
vim "$PATH_PREFIX/GrobidServer.properties"
|
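When the setup needs to be scripted rather than edited by hand, the properties file can be written without opening an editor. The following is a sketch under the same directory layout as the commands above. The function name write_grobid_properties is hypothetical, and the default values are simply the ones from the sample properties file.

```shell
# Write GrobidServer.properties under the package path used above.
# $1 = resources directory
# $2 = server URL (optional, defaults to the sample value)
# $3 = text endpoint (optional, defaults to the sample value)
# Runs in a subshell so its variables do not leak into the caller.
write_grobid_properties() (
  res_dir="$1"
  url="${2:-http://localhost:8060}"
  endpoint="${3:-/processQuantityText}"
  target_dir="$res_dir/org/apache/tika/parser/ner/grobid"

  mkdir -p "$target_dir" || exit 1

  # Overwrites any existing file at this location.
  cat > "$target_dir/GrobidServer.properties" <<EOF
grobid.server.url=$url
grobid.endpoint.text=$endpoint
EOF

  # Print the path of the file that was written.
  printf '%s\n' "$target_dir/GrobidServer.properties"
)
```

For example, `write_grobid_properties "$HOME/GrobidQuantitiesRest-resources" http://localhost:9090` would point Tika at a server listening on port 9090 while keeping the sample endpoint.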
Running Grobid Quantities with Tika
No Format |
---|
# adjust the jar name to your Tika version (e.g. tika-app-2.8.0.jar)
export TIKA_APP={your/path/to/tika-app}/target/tika-app-1.13-SNAPSHOT.jar
# if you cloned the predefined resources instead, use grobidquantities-parser-resources in place of $GROBID_QUANTITIES_RES
#set the system property to use GrobidNERecogniser class
java -Dner.impl.class=org.apache.tika.parser.ner.grobid.GrobidNERecogniser -classpath $GROBID_QUANTITIES_RES:$TIKA_APP org.apache.tika.cli.TikaCLI --config=$GROBID_QUANTITIES_RES/tika-config.xml -m https://en.wikipedia.org/wiki/Time
|
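Because the command depends on several files being in the right place, a quick check before launching Java can save a confusing failure. The following is a minimal sketch, assuming the layout from the manual configuration: tika-config.xml at the top of the resources directory and GrobidServer.properties under the package path. The function name check_grobid_setup is hypothetical.

```shell
# Report which of the expected files are missing.
# $1 = resources directory, $2 = path to the tika-app jar
# Returns 0 when everything is present, 1 otherwise.
check_grobid_setup() (
  res_dir="$1"
  tika_jar="$2"
  missing=0

  for f in \
    "$res_dir/tika-config.xml" \
    "$res_dir/org/apache/tika/parser/ner/grobid/GrobidServer.properties" \
    "$tika_jar"
  do
    if [ ! -f "$f" ]; then
      echo "missing: $f" >&2
      missing=1
    fi
  done

  exit "$missing"
)
```

A typical call would be `check_grobid_setup "$GROBID_QUANTITIES_RES" "$TIKA_APP" && java ...`, so that the Java command only runs when all three files exist. It does not verify that the Grobid Quantities server itself is reachable.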
...