The GrobidJournalParser uses GROBID (GeneRation Of BIbliographic Data), a machine learning framework, to parse PDF files and extract information such as title, abstract, authors, affiliations, and keywords from journal publications. The parser has been integrated into Tika. You can follow this guide to get it working on your system.

Installing GROBID

At present, GROBID must be installed from source. We are working with the GROBID community to get pre-built binaries into Maven Central, which is being tracked in issue #59. For now, a Git checkout of head is recommended, as detailed below.

You should be able to install GROBID from a Git checkout as follows.

  1. cd $HOME/src

  2. git clone https://github.com/kermitt2/grobid.git

    • now wait a while; the download is ~600MB

  3. now build GROBID by typing cd grobid && mvn install

You can verify GROBID works by running its batch runner:

  1. cd $HOME/src/grobid

  2. mkdir papers && mkdir out and put some PDF paper files in papers.

  3. java -Xmx1024m -jar grobid-core/target/grobid-core-0.3.4-SNAPSHOT.one-jar.jar -gH ./grobid-home/ -gP ./grobid-home/config/grobid.properties -dIn ./papers/ -dOut out -exe processFullText

Check the out directory; you should see *.tei.xml files there.
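A quick way to confirm that the batch run produced usable output is to read the title back out of each TEI file. The Python sketch below is an unofficial helper, not part of GROBID or Tika. It assumes the files follow the TEI layout visible in the sample output on this page (a title element under teiHeader/fileDesc/titleStmt in the http://www.tei-c.org/ns/1.0 namespace). The function names tei_title and list_titles are made up for illustration.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

# TEI namespace as it appears in the sample TEI output on this page.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def tei_title(xml_bytes):
    """Return the title found in the TEI header, or None if there is none."""
    root = ET.fromstring(xml_bytes)
    node = root.find(
        "tei:teiHeader/tei:fileDesc/tei:titleStmt/tei:title", TEI_NS
    )
    if node is None:
        return None
    # itertext() also collects text nested inside child elements.
    return "".join(node.itertext()).strip()


def list_titles(out_dir):
    """Map each *.tei.xml file name in out_dir to its extracted title."""
    return {
        path.name: tei_title(path.read_bytes())
        for path in sorted(Path(out_dir).glob("*.tei.xml"))
    }
```

Calling list_titles("out") from $HOME/src/grobid returns a dictionary keyed by file name. A None value suggests that no title was found in that file's header.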

Starting the GROBID Service

To use GROBID with Tika, you need to start the GROBID service. By default the service starts on port 8080; this can be changed in the Jetty properties by editing line 180 of the GROBID service's pom.xml. To start the service, perform the following.

  1. cd $HOME/src/grobid/grobid-service

  2. mvn -Dmaven.test.skip=true jetty:run-war

Once the server is started, you're good to proceed!
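Before moving on, it can help to check that something is actually listening on the service port. The following Python sketch only tests whether a TCP connection succeeds; it does not call any GROBID endpoint, so it cannot tell a GROBID service apart from another process on the same port. The function name service_is_listening is hypothetical, and the defaults assume the port 8080 mentioned above.

```python
import socket


def service_is_listening(host="localhost", port=8080, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection raises OSError (e.g. connection refused, timeout)
        # when nothing accepts the connection.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

If you changed the port in pom.xml, pass it explicitly, e.g. service_is_listening(port=9090) for a hypothetical port 9090.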

Running GROBID using Tika-App

Grab Tika-app version 1.11-SNAPSHOT or later and run GROBID by following the commands below.

First we need to create the GrobidExtractor.properties file that points to the GROBID REST service. My file looks like the following:

grobid.server.url=http://localhost:8080

You can download GrobidExtractor.properties as a sample. Or better yet, you can clone the following GitHub project and then modify its GrobidExtractor.properties file accordingly.

  1. cd $HOME/src && git clone https://github.com/chrismattmann/grobidparser-resources.git

  2. edit $HOME/src/grobidparser-resources/org/apache/tika/parser/journal/GrobidExtractor.properties
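To double-check which server URL the edited file points to, the properties can be read back with a few lines of Python. This is a simplified sketch: it handles only plain key=value lines and comment lines, and ignores the rest of the Java properties syntax (escapes, line continuations, ":" separators). The function name parse_properties is an illustrative assumption, not a Tika API.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # or ! comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line[0] in "#!":
            continue
        # Split on the first "=" only, so URLs in the value stay intact.
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


if __name__ == "__main__":
    sample = "grobid.server.url=http://localhost:8080\n"
    print(parse_properties(sample)["grobid.server.url"])
```

In practice you would pass in the contents of GrobidExtractor.properties rather than the inline sample string.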

Now you can run GROBID via Tika-app with the following command on a sample PDF file.

java -classpath $HOME/src/grobidparser-resources/:tika-app-1.11-SNAPSHOT.jar org.apache.tika.cli.TikaCLI --config=$HOME/src/grobidparser-resources/tika-config.xml -J $HOME/src/grobid/papers/ICSE06.pdf

This should produce output like the following (e.g., when piped to python -mjson.tool for pretty printing):

[
    {
        "Author": "End User Computing Services",
        "Company": "ACM",
        "Content-Length": "200435",
        "Content-Type": "application/pdf",
        "Creation-Date": "2006-02-15T21:13:58Z",
        "Last-Modified": "2006-02-15T21:16:01Z",
        "Last-Save-Date": "2006-02-15T21:16:01Z",
        "SourceModified": "D:20060215211344",
        "X-Parsed-By": [
            "org.apache.tika.parser.CompositeParser",
            "org.apache.tika.parser.journal.JournalParser"
        ],
        "X-TIKA:content": "<html xmlns=\"http://www.w3.org/1999/xhtml\">\n<head>\n<meta name=\"access_permission:extract_for_accessibility\" content=\"true\" />\n<meta name=\"meta:save-date\" content=\"2006-02-15T21:16:01Z\" />\n<meta name=\"grobid:header_Affiliation\" content=\"1 Jet Propulsion Laboratory California Institute of Technology; 2 Computer Science Department University of Southern California\" />\n<meta name=\"Content-Length\" content=\"200435\" />\n<meta name=\"dcterms:created\" content=\"2006-02-15T21:13:58Z\" />\n<meta name=\"Author\" content=\"End User Computing Services\" />\n<meta name=\"date\" content=\"2006-02-15T21:16:01Z\" />\n<meta name=\"access_permission:can_modify\" content=\"true\" />\n<meta name=\"creator\" content=\"End User Computing Services\" />\n<meta name=\"access_permission:modify_annotations\" content=\"true\" />\n<meta name=\"Creation-Date\" content=\"2006-02-15T21:13:58Z\" />\n<meta name=\"grobid:header_Address\" content=\"Pasadena, CA 91109 USA Los Angeles, CA 90089 USA \" />\n<meta name=\"meta:author\" content=\"End User Computing Services\" />\n<meta name=\"created\" content=\"Wed Feb 15 13:13:58 PST 2006\" />\n<meta name=\"access_permission:fill_in_form\" content=\"true\" />\n<meta name=\"grobid:header_FullAffiliations\" content=\"[Affiliation {orgName=Jet Propulsion Laboratory California Institute of Technology , address=Pasadena, CA 91109 USA},Affiliation {orgName=Computer Science Department University of Southern California , address=Los Angeles, CA 90089 USA}[Affiliation {orgName=Jet Propulsion Laboratory California Institute of Technology , address=Pasadena, CA 91109 USA},Affiliation {orgName=Computer Science Department University of Southern California , address=Los Angeles, CA 90089 USA}]\" />\n<meta name=\"grobid:header_Class\" content=\"org.apache.tika.metadata.Metadata\" />\n<meta name=\"dc:format\" content=\"application/pdf; version=1.4\" />\n<meta name=\"access_permission:can_print\" content=\"true\" />\n<meta name=\"Company\" content=\"ACM\" />\n<meta name=\"xmp:CreatorTool\" content=\"Acrobat PDFMaker 6.0 for Word\" />\n<meta name=\"resourceName\" content=\"ICSE06.pdf\" />\n<meta name=\"Last-Save-Date\" content=\..snip",
        "X-TIKA:parse_time_millis": "4302",
        "access_permission:assemble_document": "true",
        "access_permission:can_modify": "true",
        "access_permission:can_print": "true",
        "access_permission:can_print_degraded": "true",
        "access_permission:extract_content": "true",
        "access_permission:extract_for_accessibility": "true",
        "access_permission:fill_in_form": "true",
        "access_permission:modify_annotations": "true",
        "created": "Wed Feb 15 13:13:58 PST 2006",
        "creator": "End User Computing Services",
        "date": "2006-02-15T21:16:01Z",
        "dc:creator": "End User Computing Services",
        "dc:format": "application/pdf; version=1.4",
        "dc:title": "Proceedings Template - WORD",
        "dcterms:created": "2006-02-15T21:13:58Z",
        "dcterms:modified": "2006-02-15T21:16:01Z",
        "grobid:header_Address": "Pasadena, CA 91109 USA Los Angeles, CA 90089 USA ",
        "grobid:header_Affiliation": "1 Jet Propulsion Laboratory California Institute of Technology; 2 Computer Science Department University of Southern California",
        "grobid:header_Authors": "Chris A Mattmann 1,2 Daniel J Crichton 1 Nenad  Medvidovic 2 Steve  Hughes 1 ",
        "grobid:header_Class": "org.apache.tika.metadata.Metadata",
        "grobid:header_FullAffiliations": "[Affiliation {orgName=Jet Propulsion Laboratory California Institute of Technology , address=Pasadena, CA 91109 USA},Affiliation {orgName=Computer Science Department University of Southern California , address=Los Angeles, CA 90089 USA}[Affiliation {orgName=Jet Propulsion Laboratory California Institute of Technology , address=Pasadena, CA 91109 USA},Affiliation {orgName=Computer Science Department University of Southern California , address=Los Angeles, CA 90089 USA}]",
        "grobid:header_Keyword": "\"D2 Software Engineering, D211 Domain Specific Architectures\"",
        "grobid:header_TEIJSONSource": "{\"TEI\":{\"text\":{..snip",
        "grobid:header_TEIXMLSource": "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<?xml-model href=\"file:///Users/mattmann/git/grobid/grobid-home/schemas/rng/Grobid.rng\" schematypens=\"http://relaxng.org/ns/structure/1.0\"?>\n<TEI xmlns=\"http://www.tei-c.org/ns/1.0\">\n\t<teiHeader xml:lang=\"en\">\n\t\t<fileDesc>\n\t\t\t<titleStmt>\n\t\t\t\t<title level=\"a\" type=\"main\">A Software Architecture-Based Framework for Highly Distributed and Data Intensive Scientific Applications</title>\n\t\t\t</titleStmt>\n\t\t\t<publicationStmt>\..snip</TEI>\n",
        "grobid:header_Title": "A Software Architecture-Based Framework for Highly Distributed and Data Intensive Scientific Applications",
        "meta:author": "End User Computing Services",
        "meta:creation-date": "2006-02-15T21:13:58Z",
        "meta:save-date": "2006-02-15T21:16:01Z",
        "modified": "2006-02-15T21:16:01Z",
        "pdf:PDFVersion": "1.4",
        "pdf:encrypted": "false",
        "producer": "Acrobat Distiller 6.0 (Windows)",
        "resourceName": "ICSE06.pdf",
        "title": "Proceedings Template - WORD",
        "xmp:CreatorTool": "Acrobat PDFMaker 6.0 for Word",
        "xmpTPg:NPages": "10"
    }
]
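Most of the keys in this output come from Tika's regular PDF handling; the GROBID-derived values are the ones whose names start with grobid:header_. As a rough illustration, the Python sketch below picks out just those fields from the JSON that Tika-app prints with -J. The function name grobid_fields and the prefix default are assumptions based on the sample output above, not part of Tika.

```python
import json


def grobid_fields(tika_json, prefix="grobid:header_"):
    """For each document in the JSON array, keep only keys starting with prefix.

    The prefix is stripped from the returned keys, so "grobid:header_Title"
    becomes "Title".
    """
    records = json.loads(tika_json)
    return [
        {key[len(prefix):]: value
         for key, value in record.items()
         if key.startswith(prefix)}
        for record in records
    ]


if __name__ == "__main__":
    sample = json.dumps([{
        "Content-Type": "application/pdf",
        "grobid:header_Keyword": "D2 Software Engineering",
    }])
    print(grobid_fields(sample))
```

To use it with real output, redirect the Tika-app command's output to a file and pass the file's contents to grobid_fields. Note that the elided (..snip) sample shown here is not valid JSON as printed, so it cannot be pasted in directly.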

Will this work from Tika Server?

It sure will! When you start Tika Server, use the following command.

java -classpath $HOME/src/grobidparser-resources/:tika-server-1.11-SNAPSHOT.jar org.apache.tika.server.TikaServerCli --config $HOME/src/grobidparser-resources/tika-config.xml

Then, PUT a file to Tika-server like so:

curl -T $HOME/src/grobid/papers/ICSE06.pdf -H "Content-Disposition: attachment;filename=ICSE06.pdf" http://localhost:9998/rmeta

This will output (e.g., when piped through python -mjson.tool):

[
    {
        "Author": "End User Computing Services",
        "Company": "ACM",
        "Content-Type": "application/pdf",
        "Creation-Date": "2006-02-15T21:13:58Z",
        "Last-Modified": "2006-02-15T21:16:01Z",
        "Last-Save-Date": "2006-02-15T21:16:01Z",
        "SourceModified": "D:20060215211344",
        "X-Parsed-By": [
            "org.apache.tika.parser.CompositeParser",
            "org.apache.tika.parser.journal.JournalParser"
        ],
        "X-TIKA:content": "\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nProceedings Template - WORD\n\n\nA Software Architecture-Based Framework for Highly \nDistributed and Data Intensive Scientific Applications \n\n \nChris A. Mattmann1, 2        Daniel J. Crichton1        Nenad Medvidovic2        Steve Hughes1 \n\n \n1Jet Propulsion Laboratory \n\nCalifornia Institute of Technology \nPasadena, CA 91109, USA \n\n{dan.crichton,mattmann,steve.hughes}@jpl.nasa.gov \n\n2Computer Science Department \nUniversity of Southern California  \n\nLos Angeles, CA 90089, USA \n{mattmann,neno}@usc.edu \n\n \nABSTRACT \nModern scientific research is increasingly conducted by virtual \ncommunities of scientists distributed around the world. The data \nvolumes created by these communities are extremely large, and \ngrowing rapidly. The management of the resulting highly \ndistributed, virtual data systems is a complex task, characterized \nby a number of formidable technical challenges, many of which \nare of a software engineering nature.  In this paper we describe \nour experience over the past seven years in constructing and \ndeploying OODT, a software framework that supports large, \ndistributed, virtual scientific communities. We outline the key \nsoftware engineering challenges that we faced, and addressed, \nalong the way. We argue that a major contributor to the success of \nOODT was its explicit focus on software architecture. We \ndescribe several large-scale, real-world deployments of OODT, \nand the manner in which OODT helped us to address the domain-\nspecific challenges induced by each deployment.  \n\nCategories and Subject Descriptors \nD.2 Software Engineering, D.2.11 Domain Specific Architectures \n\nKeywords \nOODT, Data Management, Software Architecture. \n\n1. INTRODUCTION ..snip..",
        "X-TIKA:parse_time_millis": "957",
        "access_permission:assemble_document": "true",
        "access_permission:can_modify": "true",
        "access_permission:can_print": "true",
        "access_permission:can_print_degraded": "true",
        "access_permission:extract_content": "true",
        "access_permission:extract_for_accessibility": "true",
        "access_permission:fill_in_form": "true",
        "access_permission:modify_annotations": "true",
        "created": "Wed Feb 15 13:13:58 PST 2006",
        "creator": "End User Computing Services",
        "date": "2006-02-15T21:16:01Z",
        "dc:creator": "End User Computing Services",
        "dc:format": "application/pdf; version=1.4",
        "dc:title": "Proceedings Template - WORD",
        "dcterms:created": "2006-02-15T21:13:58Z",
        "dcterms:modified": "2006-02-15T21:16:01Z",
        "grobid:header_Address": "Pasadena, CA 91109 USA Los Angeles, CA 90089 USA ",
        "grobid:header_Affiliation": "1 Jet Propulsion Laboratory California Institute of Technology; 2 Computer Science Department University of Southern California",
        "grobid:header_Authors": "Chris A Mattmann 1,2 Daniel J Crichton 1 Nenad  Medvidovic 2 Steve  Hughes 1 ",
        "grobid:header_Class": "org.apache.tika.metadata.Metadata",
        "grobid:header_FullAffiliations": "[Affiliation {orgName=Jet Propulsion Laboratory California Institute of Technology , address=Pasadena, CA 91109 USA},Affiliation {orgName=Computer Science Department University of Southern California , address=Los Angeles, CA 90089 USA}[Affiliation {orgName=Jet Propulsion Laboratory California Institute of Technology , address=Pasadena, CA 91109 USA},Affiliation {orgName=Computer Science Department University of Southern California , address=Los Angeles, CA 90089 USA}]",
        "grobid:header_Keyword": "\"D2 Software Engineering, D211 Domain Specific Architectures\"",
        "grobid:header_TEIJSONSource": "{\"TEI\":{\"text\":{\"xml:lang\":\"en\"},\"teiHeader\": ..snip",
        "grobid:header_TEIXMLSource": "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<?xml-model href=\"file:///Users/mattmann/git/grobid/grobid-home/schemas/rng/Grobid.rng\" schematypens=\"http://relaxng.org/ns/structure/1.0\"?>\n<TEI xmlns=\"http://www.tei-c.org/ns/1.0\">\n\t<teiHeader xml:lang=\"en\">\n\t\t<fileDesc>\n\t\t\t<titleStmt>\n\t\t\t\t<title level=\"a\" type=\"main\">A Software Architecture-Based Framework for Highly Distributed and Data Intensive Scientific Applications</title>..snip..</TEI>\n",
        "grobid:header_Title": "A Software Architecture-Based Framework for Highly Distributed and Data Intensive Scientific Applications",
        "meta:author": "End User Computing Services",
        "meta:creation-date": "2006-02-15T21:13:58Z",
        "meta:save-date": "2006-02-15T21:16:01Z",
        "modified": "2006-02-15T21:16:01Z",
        "pdf:PDFVersion": "1.4",
        "pdf:encrypted": "false",
        "producer": "Acrobat Distiller 6.0 (Windows)",
        "resourceName": "ICSE06.pdf",
        "title": "Proceedings Template - WORD",
        "xmp:CreatorTool": "Acrobat PDFMaker 6.0 for Word",
        "xmpTPg:NPages": "10"
    }
]
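The same PUT request can be issued from a script instead of curl. The sketch below uses only the Python standard library and mirrors the curl call above: it PUTs the file body to /rmeta with the same Content-Disposition header. The function names build_rmeta_request and fetch_rmeta are hypothetical, and the default base URL assumes Tika Server is listening on localhost:9998 as in the curl example.

```python
import json
import os
import urllib.request


def build_rmeta_request(pdf_bytes, filename, base_url="http://localhost:9998"):
    """Build (but do not send) a PUT request for Tika Server's /rmeta."""
    return urllib.request.Request(
        base_url.rstrip("/") + "/rmeta",
        data=pdf_bytes,
        method="PUT",
        headers={"Content-Disposition": f"attachment;filename={filename}"},
    )


def fetch_rmeta(pdf_path, base_url="http://localhost:9998"):
    """Send the file at pdf_path to Tika Server and return the parsed JSON."""
    with open(pdf_path, "rb") as handle:
        request = build_rmeta_request(
            handle.read(), os.path.basename(pdf_path), base_url
        )
    # Requires a running Tika Server; raises URLError if it is unreachable.
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With both GROBID and Tika Server running, fetch_rmeta(os.path.expanduser("~/src/grobid/papers/ICSE06.pdf")) should return a list of metadata dictionaries like the one shown above.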

GrobidJournalParser (last edited 2015-08-19 17:35:44 by NickBurch)