...

  1. Check out the source from SVN as detailed on the Apache Tika contributions page, or retrieve the latest code from GitHub.
  2. Build the source using Maven.
  3. Run the Apache Tika JAX-RS server runnable jar.
No Format

git clone https://github.com/apache/tika.git tika-trunk
cd ./tika-trunk/
mvn install
cd ./tika-server/target/
java -jar tika-server-x.x.jar

...

You will then see a message such as the following:

No Format

$ java -jar tika-server-1.24-SNAPSHOT.jar
19-Jan-2015 14:23:36 org.apache.tika.server.TikaServerCli main
INFO: Starting Apache Tika 1.8-SNAPSHOT server
19-Jan-2015 14:23:36 org.apache.cxf.endpoint.ServerImpl initDestination
INFO: Setting the server's publish address to be http://localhost:9998/
19-Jan-2015 14:23:36 org.slf4j.impl.JCLLoggerAdapter info
INFO: jetty-8.y.z-SNAPSHOT
19-Jan-2015 14:23:36 org.slf4j.impl.JCLLoggerAdapter info
INFO: Started SelectChannelConnector@localhost:9998
19-Jan-2015 14:23:36 org.apache.tika.server.TikaServerCli main
INFO: Started

...

You can pass the --host and --port options to change the host name and port number:

No Format

java -jar tika-server-x.x.jar --host=intranet.local --port=12345
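
Once the host and port are changed at startup, every client URL has to change with them. A minimal sketch of a shell helper that builds those URLs in one place follows; tika_url, TIKA_HOST and TIKA_PORT are hypothetical names used only for illustration and are not read by tika-server itself.

```shell
# Hypothetical helper: build a Tika Server URL from host, port and path.
# TIKA_HOST and TIKA_PORT are assumed variable names for this sketch only;
# they default to the localhost:9998 address shown in the startup log.
tika_url() {
  printf 'http://%s:%s%s\n' "${TIKA_HOST:-localhost}" "${TIKA_PORT:-9998}" "${1:-/}"
}
```

For the server started above, a call might look like: curl "$(TIKA_HOST=intranet.local TIKA_PORT=12345 tika_url /)"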

...

Metadata Resource

No Format

/meta

HTTP PUT a document to the /meta service and you get back the metadata as "text/csv".

Some example calls with cURL:

No Format

$ curl -X PUT --data-ascii @zipcode.csv http://localhost:9998/meta --header "Content-Type: text/csv"
$ curl -T price.xls http://localhost:9998/meta

Returns:

No Format

"Content-Encoding","ISO-8859-2"
"Content-Type","text/plain"

Get metadata as JSON:

No Format

$ curl -T test_recursive_embedded.docx http://localhost:9998/meta --header "Accept: application/json"

Or XMP:

No Format

$ curl -T test_recursive_embedded.docx http://localhost:9998/meta --header "Accept: application/rdf+xml"

Get a specific metadata key's value as a simple text string:

No Format

$ curl -T test_recursive_embedded.docx http://localhost:9998/meta/Content-Type --header "Accept: text/plain"

Returns:

No Format

application/vnd.openxmlformats-officedocument.wordprocessingml.document

Get a specific metadata key's value(s) as CSV:

No Format

$ curl -T test_recursive_embedded.docx http://localhost:9998/meta/Content-Type --header "Accept: text/csv"

Or JSON:

No Format

$ curl -T test_recursive_embedded.docx http://localhost:9998/meta/Content-Type --header "Accept: application/json"

Or XMP:

No Format

$ curl -T test_recursive_embedded.docx http://localhost:9998/meta/Content-Type --header "Accept: application/rdf+xml"
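
The calls above differ only in their Accept header. As a sketch, a small shell function can map a short format name to the header value used in these examples; meta_accept and its format names are hypothetical, and the media types are the ones shown above.

```shell
# Hypothetical helper: map a short format name to the Accept header
# used in the /meta examples above.
meta_accept() {
  case "$1" in
    csv)  echo "Accept: text/csv" ;;
    json) echo "Accept: application/json" ;;
    xmp)  echo "Accept: application/rdf+xml" ;;
    text) echo "Accept: text/plain" ;;  # shown above only for a single key, e.g. /meta/Content-Type
    *)    echo "unknown format: $1" >&2; return 1 ;;
  esac
}
```

A call could then read: curl -T test_recursive_embedded.docx http://localhost:9998/meta --header "$(meta_accept json)"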

...

The Metadata Resource also accepts files as multipart/form-data attachments sent with POST. Posting files as multipart attachments may be beneficial when the files are too big to be PUT directly in the request body. Note that the Tika JAX-RS server makes a best effort at storing some of the multipart content on disk while still supporting streaming:

No Format

$ curl -F upload=@price.xls http://localhost:9998/meta/form

Note that the address has an extra "/form" path segment.
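
One way to act on the size advice is to pick the endpoint per file. The sketch below is only an illustration: meta_endpoint_for is a hypothetical name, and the 100 MiB default threshold is an arbitrary assumption, not a limit documented for Tika Server.

```shell
# Hypothetical helper: print /meta/form for files larger than a threshold,
# /meta otherwise. The default threshold (100 MiB) is an arbitrary assumption.
meta_endpoint_for() {
  _limit=${2:-104857600}
  _size=$(wc -c < "$1") || return 1   # fails if the file cannot be read
  if [ "$((_size))" -gt "$_limit" ]; then
    echo /meta/form   # POST with: curl -F upload=@FILE
  else
    echo /meta        # PUT with:  curl -T FILE
  fi
}
```

The caller still has to switch between curl -T (PUT) and curl -F upload=@... (POST) according to the path that is printed.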

Tika Resource

No Format

/tika

HTTP PUT a document to the /tika service and you get back the extracted text in text, html or "body" format (see below). See also the /rmeta endpoint for text and metadata of embedded objects.

...

Get HELLO message back

No Format

$ curl -X GET http://localhost:9998/tika
This is Tika Server. Please PUT

Get the Text of a Document

No Format

$ curl -X PUT --data-binary @GeoSPARQL.pdf http://localhost:9998/tika --header "Content-type: application/pdf"
$ curl -T price.xls http://localhost:9998/tika --header "Accept: text/html"
$ curl -T price.xls http://localhost:9998/tika --header "Accept: text/plain"

Use the Boilerpipe handler (equivalent to tika-app's --text-main) with text output:

No Format

$ curl -T price.xls http://localhost:9998/tika/main --header "Accept: text/plain"

...

The Tika Resource also accepts files as multipart/form-data attachments sent with POST. Posting files as multipart attachments may be beneficial when the files are too big to be PUT directly in the request body. Note that the Tika JAX-RS server makes a best effort at storing some of the multipart content on disk while still supporting streaming:

No Format

$ curl -F upload=@price.xls http://localhost:9998/tika/form

...

Detector Resource

No Format

/detect/stream

HTTP PUT a document and the Default Detector from Tika identifies its MIME/media type. Note that providing a filename hint can improve the quality of detection.

...

PUT an RTF file and get back RTF

No Format

$ curl -X PUT --data-binary @TODO.rtf http://localhost:9998/detect/stream

PUT a CSV file without filename hint and get back text/plain

No Format

$ curl -X PUT --upload-file foo.csv http://localhost:9998/detect/stream

PUT a CSV file with filename hint and get back text/csv

No Format

$ curl -X PUT -H "Content-Disposition: attachment; filename=foo.csv" --upload-file foo.csv http://localhost:9998/detect/stream
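
The filename hint is just a Content-Disposition header carrying the file's base name. A minimal sketch that derives it from a path is shown below; filename_hint is a hypothetical name, and the sketch assumes plain ASCII filenames without spaces or quotes, since it does no escaping.

```shell
# Hypothetical helper: build the Content-Disposition header used as a
# filename hint. Assumes a simple filename; no quoting or encoding is done.
filename_hint() {
  printf 'Content-Disposition: attachment; filename=%s\n' "$(basename "$1")"
}
```

It could be used as: curl -X PUT -H "$(filename_hint /data/in/foo.csv)" --upload-file /data/in/foo.csv http://localhost:9998/detect/stream (the /data/in path is only an example).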

Language Resource

No Format

/language/stream

HTTP PUT or POST a UTF-8 text file to the LanguageIdentifier to identify its language.

...

PUT a TXT file containing the English text "This is English!" and get back en

No Format

$ curl -X PUT --data-binary @foo.txt http://localhost:9998/language/stream
en

PUT a TXT file containing the French text "comme çi comme ça" and get back fr

No Format

$ curl -X PUT --data-binary @foo.txt http://localhost:9998/language/stream
fr


No Format

/language/string

HTTP PUT or POST a text string to the LanguageIdentifier to identify its language.

...

PUT a string containing the English text "This is English!" and get back en

No Format

$ curl -X PUT --data "This is English!" http://localhost:9998/language/string
en

PUT a string containing the French text "comme çi comme ça" and get back fr

No Format

$ curl -X PUT --data "comme çi comme ça" http://localhost:9998/language/string
fr

Translate Resource

No Format

/translate/all/translator/src/dest

...

PUT a TXT file named sentences containing the Spanish text "me falta práctica en Español" and get back the English translation using Lingo24

No Format

$ curl -X PUT --data-binary @sentences http://localhost:9998/translate/all/org.apache.tika.language.translate.Lingo24Translator/es/en
lack of practice in Spanish

PUT a TXT file named sentences containing the Spanish text "me falta práctica en Español" and get back the English translation using Microsoft

No Format

$ curl -X PUT --data-binary @sentences http://localhost:9998/translate/all/org.apache.tika.language.translate.MicrosoftTranslator/es/en
I need practice in Spanish

PUT a TXT file named sentences containing the Spanish text "me falta práctica en Español" and get back the English translation using Google

No Format

$ curl -X PUT --data-binary @sentences http://localhost:9998/translate/all/org.apache.tika.language.translate.GoogleTranslator/es/en
I need practice in Spanish


No Format

/translate/all/translator/dest

HTTP PUT or POST a document to the identified *translator*; the *src* language is auto-detected using LanguageIdentifiers and the text is then translated from *src* to *dest*.

...

PUT a TXT file named sentences2 containing the French text "comme çi comme ça" and get back the English translation using Google, auto-detecting the language

No Format

$ curl -X PUT --data-binary @sentences2 http://localhost:9998/translate/all/org.apache.tika.language.translate.GoogleTranslator/en
so so
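
The translate examples use two URL shapes: translator, source and destination, or translator and destination only when the source language is auto-detected. As a sketch, a helper can build either path from its arguments; translate_path is a hypothetical name, and the translator class names are the ones used in the examples above.

```shell
# Hypothetical helper: build a /translate/all path.
#   translate_path TRANSLATOR DEST      -> source language auto-detected
#   translate_path TRANSLATOR SRC DEST  -> explicit source language
translate_path() {
  case $# in
    2) printf '/translate/all/%s/%s\n' "$1" "$2" ;;
    3) printf '/translate/all/%s/%s/%s\n' "$1" "$2" "$3" ;;
    *) echo "usage: translate_path TRANSLATOR [SRC] DEST" >&2; return 1 ;;
  esac
}
```

For example: curl -X PUT --data-binary @sentences2 "http://localhost:9998$(translate_path org.apache.tika.language.translate.GoogleTranslator en)"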

Recursive Metadata and Content

No Format

/rmeta

Returns a JSONified list of Metadata objects for the container document and all embedded documents. The text that is extracted from each document is stored in the metadata object under "X-TIKA:content".

No Format

$ curl -T test_recursive_embedded.docx http://localhost:9998/rmeta

Returns:

No Format

[
 {"Application-Name":"Microsoft Office Word",
  "Application-Version":"15.0000",
  "X-Parsed-By":["org.apache.tika.parser.DefaultParser","org.apache.tika.parser.microsoft.ooxml.OOXMLParser"],
  "X-TIKA:content":"embed_0 ",
  ...
 },
 {"Content-Encoding":"ISO-8859-1",
  "Content-Length":"8",
  "Content-Type":"text/plain; charset=ISO-8859-1",
  "X-TIKA:content":"embed_1b",
  ...
 }
 ...
]

The default format for "X-TIKA:content" is XML. However, you can select "text only" with

No Format

/rmeta/text

HTML with

No Format

/rmeta/html

and no content (metadata only) with

No Format

/rmeta/ignore
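
The content-format choices above map one-to-one onto path suffixes. A minimal sketch of that mapping follows; rmeta_path and its format names are hypothetical labels, and the paths are the ones listed above.

```shell
# Hypothetical helper: choose the /rmeta path for a given content format.
# With no argument it returns /rmeta, whose default content format is XML.
rmeta_path() {
  case "${1:-xml}" in
    xml)    echo /rmeta ;;
    text)   echo /rmeta/text ;;
    html)   echo /rmeta/html ;;
    ignore) echo /rmeta/ignore ;;  # metadata only, no content
    *)      echo "unknown content format: $1" >&2; return 1 ;;
  esac
}
```

For example: curl -T test_recursive_embedded.docx "http://localhost:9998$(rmeta_path text)"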

Multipart Support

The /rmeta resource also accepts files as multipart/form-data attachments sent with POST. Posting files as multipart attachments may be beneficial when the files are too big to be PUT directly in the request body. Note that the Tika JAX-RS server makes a best effort at storing some of the multipart content on disk while still supporting streaming:

No Format

$ curl -F upload=@test_recursive_embedded.docx http://localhost:9998/rmeta/form

...

See the TikaEval page for more details.  Please open issues on our JIRA if you would like other statistics included or if you'd like to make the calculated statistics configurable.

Unpack Resource

No Format

/unpack

HTTP PUT a document that contains embedded files to the /unpack service and you get back a zip or tar of the raw bytes of the embedded files. Note that this does not operate recursively; it extracts only the child documents of the original file.

...

PUT zip file and get back met file zip

No Format

$ curl -X PUT --data-binary @foo.zip http://localhost:9998/unpack --header "Content-type: application/zip"

PUT doc file and get back met file tar

No Format

$ curl -T Doc1_ole.doc -H "Accept: application/x-tar" http://localhost:9998/unpack > /var/tmp/x.tar

PUT doc file and get back the content and metadata

No Format

$ curl -T Doc1_ole.doc http://localhost:9998/unpack/all > /var/tmp/x.zip

...

Available Endpoints

No Format

/

Hitting the root of the server in your web browser will give a basic report of all the endpoints defined in the server, what URLs they have, etc.

Defined Mime Types

No Format

/mime-types

Lists the mime types, their aliases, their supertype, and the parser that handles them. Available as plain text, JSON or human-readable HTML.

Available Detectors

No Format

/detectors

Lists the top-level Detector to be used, and any child detectors within it. Available as plain text, JSON or human-readable HTML.

Available Parsers

No Format

/parsers

Lists all of the parsers currently available.

No Format

/parsers/details

Lists all the available parsers, along with the mime types they support.

...

In Tika 1.14, we added the capability back, but the user has to acknowledge the security risk by including two command-line arguments:

No Format

$ java -jar tika-server-x.x.jar -enableUnsecureFeatures -enableFileUrl

This allows the user to specify a fileUrl in the header:

No Format

curl -i -H "fileUrl:http://tika.apache.org" -H "Accept:text/plain" -X PUT http://localhost:9998/tika

or

No Format

curl -i -H "fileUrl:file:///C:/data/my_test_doc.pdf" -H "Accept:text/plain" -X PUT http://localhost:9998/tika
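
For local files, the header value is the absolute path prefixed with file://. A sketch of that construction for POSIX-style paths is given below; file_url_header is a hypothetical name, and the sketch does no percent-encoding, so it assumes paths without spaces or special characters.

```shell
# Hypothetical helper: build the fileUrl header for an absolute POSIX path.
# No percent-encoding is applied; Windows paths (file:///C:/...) are not handled.
file_url_header() {
  case "$1" in
    /*) printf 'fileUrl:file://%s\n' "$1" ;;
    *)  echo "expected an absolute path: $1" >&2; return 1 ;;
  esac
}
```

For example: curl -i -H "$(file_url_header /data/my_test_doc.pdf)" -H "Accept:text/plain" -X PUT http://localhost:9998/tika (the /data path is only an example, and the server must have been started with the two flags shown earlier).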

...

NOTE 2: In Tika 1.x, to specify JVM args for the child process, prefix each argument with -J (as in -JXmx4g) and place it after the -jar tika-server-x.x.jar call, as in:

No Format

$ java -Dlog4j.configuration=file:log4j_server.xml -jar tika-server-x.x.jar -spawnChild -JXmx4g -JDlog4j.configuration=file:log4j_child.xml
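
The -J convention replaces each JVM argument's leading dash with -J. A minimal sketch of that rewrite is shown below; child_jvm_args is a hypothetical name, and the sketch assumes arguments that contain no spaces.

```shell
# Hypothetical helper: rewrite ordinary JVM args (-Xmx4g, -Dkey=value)
# into the -J form used for the child process (-JXmx4g, -JDkey=value).
# Assumes arguments contain no whitespace.
child_jvm_args() {
  _out=""
  for _arg in "$@"; do
    _out="$_out${_out:+ }-J${_arg#-}"   # drop the leading dash, prepend -J
  done
  printf '%s\n' "$_out"
}
```

For example, child_jvm_args -Xmx4g -Dlog4j.configuration=file:log4j_child.xml prints the two -J arguments used in the command above.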

...

If you are new to TLS, see our README.txt for how we generated client and server keystores and truststores for our unit tests.

Configuring Parsers at Parse time/per file

See Configuring Parsers At Parse Time in tika-server.

Architecture

Tika Server is a network server based on JSR 311 (JAX-RS). The server package uses the Apache CXF framework, which provides an implementation of JAX-RS for Java.