
The GrobidJournalParser uses GROBID (GeneRation Of BIbliographic Data, also written Grobid), a machine learning framework, to parse PDF documents and extract structured information such as title, abstract, authors, affiliations, and keywords from journal publications. The parser is integrated into Tika. Follow this guide to get it working on your system.



Installing GROBID

The recommended approach is to run Grobid via Docker.

...

  1. cd $HOME && git clone https://github.com/chrismattmann/grobidparser-resources.git
  2. Modify the file grobidparser-resources/org/apache/tika/parser/journal/GrobidExtractor.properties


Both tika-server and tika-parser-nlp are required for calling Grobid. 

Running Grobid with Tika Server

...

java -cp grobidparser-resources/:tika-server-standard-2.8.0.jar:tika-parser-nlp-package-2.8.0.jar org.apache.tika.server.core.TikaServerCli --config grobidparser-resources/tika-config.xml
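Once the server is running, a PDF can be sent to it over HTTP. The following is a minimal sketch, not part of the original guide: it assumes the server listens on Tika's default localhost:9998 and uses the /rmeta endpoint, and the function names tika_rmeta_url and tika_extract and the TIKA_HOST and TIKA_PORT variables are hypothetical helpers introduced here for illustration.

```shell
#!/bin/sh
# Hypothetical helpers for querying a running Tika server.
# Assumption: the server listens on localhost:9998 unless
# TIKA_HOST / TIKA_PORT are set in the environment.

# Build the URL of the recursive-metadata (/rmeta) endpoint.
tika_rmeta_url() {
  printf 'http://%s:%s/rmeta' "${TIKA_HOST:-localhost}" "${TIKA_PORT:-9998}"
}

# Upload a PDF with HTTP PUT and print the JSON metadata the server returns.
tika_extract() {
  pdf="$1"
  if [ ! -f "$pdf" ]; then
    echo "no such file: $pdf" >&2
    return 1
  fi
  # -T uploads the file as the request body; -sS hides progress but keeps errors.
  curl -sS -T "$pdf" "$(tika_rmeta_url)"
}
```

After sourcing these functions, calling tika_extract PATH_TO_YOUR_PDF_FILE should print the metadata as JSON, provided the server was started with the Grobid configuration shown above.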

...

java -cp grobidparser-resources/:tika-app-2.8.0.jar:tika-parser-nlp-package-2.8.0.jar org.apache.tika.cli.TikaCLI --config=grobidparser-resources/tika-config.xml -J PATH_TO_YOUR_PDF_FILE
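The JSON printed by -J can be long, so it may help to list only the metadata keys that Grobid contributed. The sketch below is an illustration under a stated assumption: that the Grobid fields appear under keys beginning with the prefix grobid: (the prefix can be overridden as the first argument). The function name list_prefixed_keys is hypothetical, and the text matching is deliberately crude; a real JSON tool would be more robust.

```shell
#!/bin/sh
# Hypothetical helper: read JSON on stdin and print the distinct keys
# that start with a given prefix.
# Assumption: Grobid metadata keys start with "grobid:" (the default prefix).
list_prefixed_keys() {
  prefix="${1:-grobid:}"
  # Match "prefix...": key tokens, strip the trailing colon and the quotes,
  # then sort and de-duplicate.
  grep -o "\"${prefix}[^\"]*\" *:" | sed 's/ *:$//; s/"//g' | sort -u
}
```

For example, the output of the TikaCLI command above could be piped into list_prefixed_keys to see which Grobid fields were extracted for a given PDF.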

...