...
The GrobidJournalParser uses GROBID (GeneRation Of BIbliographic Data), a machine learning framework, to parse PDF documents and extract structured information such as title, abstract, authors, affiliations, and keywords from journal publications. The parser is integrated into Tika. Follow this guide to get it working on your system.
Installing GROBID
The best approach is to run Grobid via Docker.
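Once a GROBID container is up, a quick liveness check can confirm that the service is reachable before Tika is pointed at it. The sketch below assumes GROBID's REST service is exposed on port 8070 of the local machine; `GROBID_URL`, `grobid_alive_url`, and `grobid_is_alive` are illustrative names, not part of GROBID or Tika.

```shell
# Assumption: GROBID's REST service is published on localhost:8070.
GROBID_URL="${GROBID_URL:-http://localhost:8070}"

# Build the liveness endpoint URL, tolerating a trailing slash on the base URL.
grobid_alive_url() {
  printf '%s/api/isalive' "${1%/}"
}

# Return 0 if the service answers the liveness request, non-zero otherwise.
grobid_is_alive() {
  curl -sf "$(grobid_alive_url "$GROBID_URL")" >/dev/null
}

# Usage:
#   grobid_is_alive && echo "GROBID is reachable" || echo "GROBID is not reachable"
```

If the check fails, the container's port mapping is the first thing worth verifying.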
...
cd $HOME && git clone https://github.com/chrismattmann/grobidparser-resources.git
- Modify the file grobidparser-resources/org/apache/tika/parser/journal/GrobidExtractor.properties
Both tika-server and tika-parser-nlp are required for calling Grobid.
Running Grobid with Tika Server
...
java -cp grobidparser-resources/:tika-server-standard-2.8.0.jar:tika-parser-nlp-package-2.8.0.jar org.apache.tika.server.core.TikaServerCli --config grobidparser-resources/tika-config.xml
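With the server started as above, a PDF can be submitted to Tika Server's `/rmeta` endpoint, which returns the extracted metadata as JSON. This is a minimal sketch that assumes the server listens on its default port 9998; `TIKA_URL`, `tika_rmeta_url`, and `tika_parse_pdf` are hypothetical helper names introduced here for illustration.

```shell
# Assumption: Tika Server is listening on its default port 9998.
TIKA_URL="${TIKA_URL:-http://localhost:9998}"

# Build the recursive-metadata endpoint URL, tolerating a trailing slash.
tika_rmeta_url() {
  printf '%s/rmeta' "${1%/}"
}

# PUT a PDF to the server and print the JSON metadata it returns.
tika_parse_pdf() {
  curl -s -T "$1" -H 'Accept: application/json' "$(tika_rmeta_url "$TIKA_URL")"
}

# Usage:
#   tika_parse_pdf PATH_TO_YOUR_PDF_FILE
```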
...
java -cp grobidparser-resources/:tika-app-2.8.0.jar:tika-parser-nlp-package-2.8.0.jar org.apache.tika.cli.TikaCLI --config=grobidparser-resources/tika-config.xml -J PATH_TO_YOUR_PDF_FILE
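To process several PDFs, the same command can be wrapped in a small loop. The sketch below reuses the classpath and options from the command above and writes one JSON file next to each PDF; `RES_DIR`, `TIKA_VERSION`, `tika_classpath`, and `parse_all` are illustrative names, and the jars are assumed to sit in the current directory.

```shell
# Assumptions: the resources checkout and both jars are in the current directory.
RES_DIR="${RES_DIR:-grobidparser-resources}"
TIKA_VERSION="${TIKA_VERSION:-2.8.0}"

# Assemble the classpath: resources directory, tika-app jar, NLP parser package jar.
tika_classpath() {
  printf '%s/:tika-app-%s.jar:tika-parser-nlp-package-%s.jar' "$1" "$2" "$2"
}

# Run TikaCLI with JSON output (-J) on every PDF in a directory,
# saving each result as <name>.json beside the source file.
parse_all() {
  for pdf in "$1"/*.pdf; do
    [ -e "$pdf" ] || continue   # skip the literal pattern when no PDFs match
    java -cp "$(tika_classpath "$RES_DIR" "$TIKA_VERSION")" \
      org.apache.tika.cli.TikaCLI --config="$RES_DIR/tika-config.xml" -J "$pdf" \
      > "${pdf%.pdf}.json"
  done
}

# Usage:
#   parse_all PATH_TO_YOUR_PDF_DIRECTORY
```

Each invocation starts a new JVM, so for large batches the server-based approach is likely to be faster.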
...