...
- Check out the source from SVN as detailed on the Apache Tika contributions page, or retrieve the latest code from GitHub.
- Build the source using Maven.
- Run the Apache Tika JAX-RS server runnable jar.
No Format |
---|
git clone https://github.com/apache/tika.git tika-trunk
cd ./tika-trunk/
mvn install
cd ./tika-server/target/
java -jar tika-server-x.x.jar
|
...
You will then see a message such as the following:
No Format |
---|
$ java -jar tika-server-1.24-SNAPSHOT.jar
19-Jan-2015 14:23:36 org.apache.tika.server.TikaServerCli main
INFO: Starting Apache Tika 1.8-SNAPSHOT server
19-Jan-2015 14:23:36 org.apache.cxf.endpoint.ServerImpl initDestination
INFO: Setting the server's publish address to be http://localhost:9998/
19-Jan-2015 14:23:36 org.slf4j.impl.JCLLoggerAdapter info
INFO: jetty-8.y.z-SNAPSHOT
19-Jan-2015 14:23:36 org.slf4j.impl.JCLLoggerAdapter info
INFO: Started SelectChannelConnector@localhost:9998
19-Jan-2015 14:23:36 org.apache.tika.server.TikaServerCli main
INFO: Started
|
...
You can pass additional arguments to change the host name and port number:
No Format |
---|
java -jar tika-server-x.x.jar --host=intranet.local --port=12345
|
...
Metadata Resource
No Format |
---|
/meta
|
HTTP PUT a document to the /meta service to get back its metadata as "text/csv".
Some example calls with cURL:
No Format |
---|
$ curl -X PUT --data-ascii @zipcode.csv http://localhost:9998/meta --header "Content-Type: text/csv"
$ curl -T price.xls http://localhost:9998/meta
|
Returns:
No Format |
---|
"Content-Encoding","ISO-8859-2"
"Content-Type","text/plain"
|
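The CSV body shown above can be turned into a dictionary on the client side. The following is a minimal Python sketch, assuming each row holds a metadata key followed by one or more values; the helper name `parse_meta_csv` is hypothetical and not part of Tika.

```python
import csv
import io
from typing import Dict, List


def parse_meta_csv(body: str) -> Dict[str, List[str]]:
    """Map each metadata key to its list of values.

    Assumption: every CSV row is `key, value[, value...]`, as in the
    sample response above.
    """
    result: Dict[str, List[str]] = {}
    for row in csv.reader(io.StringIO(body)):
        if not row:
            continue  # skip blank lines
        key, values = row[0], row[1:]
        result.setdefault(key, []).extend(values)
    return result


# The sample response from above, used as input.
sample = '"Content-Encoding","ISO-8859-2"\n"Content-Type","text/plain"\n'
metadata = parse_meta_csv(sample)
```

Keeping values in a list means a key that occurs with several values does not silently lose any of them.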
Get metadata as JSON:
No Format |
---|
$ curl -T test_recursive_embedded.docx http://localhost:9998/meta --header "Accept: application/json"
|
Or XMP:
No Format |
---|
$ curl -T test_recursive_embedded.docx http://localhost:9998/meta --header "Accept: application/rdf+xml"
|
Get a specific metadata key's value as a simple text string:
No Format |
---|
$ curl -T test_recursive_embedded.docx http://localhost:9998/meta/Content-Type --header "Accept: text/plain"
|
Returns:
No Format |
---|
application/vnd.openxmlformats-officedocument.wordprocessingml.document
|
Get a specific metadata key's value(s) as CSV:
No Format |
---|
$ curl -T test_recursive_embedded.docx http://localhost:9998/meta/Content-Type --header "Accept: text/csv"
|
Or JSON:
No Format |
---|
$ curl -T test_recursive_embedded.docx http://localhost:9998/meta/Content-Type --header "Accept: application/json"
|
Or XMP:
No Format |
---|
$ curl -T test_recursive_embedded.docx http://localhost:9998/meta/Content-Type --header "Accept: application/rdf+xml"
|
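The same PUT calls can be prepared from a program instead of cURL. Below is a rough Python sketch using only the standard library; `build_meta_request` is a hypothetical helper, and the default base URL is taken from the examples above. It sets `Content-Type: application/octet-stream` explicitly because `urllib` would otherwise add a form-encoded content type to any request with a body; that the server treats this generic type as "no hint" is an assumption.

```python
import urllib.parse
import urllib.request
from typing import Optional


def build_meta_request(
    data: bytes,
    accept: str = "text/csv",
    key: Optional[str] = None,
    base_url: str = "http://localhost:9998",
) -> urllib.request.Request:
    """Prepare (but do not send) a PUT to /meta or /meta/<key>."""
    path = "/meta"
    if key is not None:
        # Percent-encode the key so it stays a single path segment.
        path += "/" + urllib.parse.quote(key, safe="")
    headers = {
        "Accept": accept,
        # Assumption: a generic type is acceptable when the real type is unknown.
        "Content-Type": "application/octet-stream",
    }
    return urllib.request.Request(
        base_url + path, data=data, method="PUT", headers=headers
    )


req = build_meta_request(b"example bytes", accept="application/json", key="Content-Type")
```

Passing the prepared request to `urllib.request.urlopen(req)` would send it to a running server.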
...
The Metadata Resource also accepts files as multipart/form-data attachments via POST. Posting files as multipart attachments may help when the files are too big to PUT directly in the request body. Note that the Tika JAX-RS server makes a best effort to store some of the multipart content on disk while still supporting streaming:
No Format |
---|
curl -F upload=@price.xls http://localhost:9998/meta/form
|
Note that the address has an extra "/form" path segment.
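For clients without a multipart helper, the form body that `curl -F` produces can be assembled by hand. The following is a minimal Python sketch of a single-file multipart/form-data body; the function name `encode_multipart` is hypothetical, and the field name `upload` simply mirrors the cURL example above.

```python
import uuid
from typing import Tuple


def encode_multipart(field: str, filename: str, content: bytes) -> Tuple[bytes, str]:
    """Return (body, content_type) for a one-file multipart/form-data POST."""
    boundary = uuid.uuid4().hex  # random marker unlikely to occur in the file
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n"
        "\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("ascii")
    return head + content + tail, f"multipart/form-data; boundary={boundary}"


body, content_type = encode_multipart("upload", "price.xls", b"abc")
```

The returned `content_type` string goes into the `Content-Type` header of the POST to the `/form` address.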
Tika Resource
No Format |
---|
/tika
|
HTTP PUT a document to the /tika service to get back the extracted text in text, html or "body" format (see below). See also the /rmeta endpoint for text and metadata of embedded objects.
...
Get HELLO message back
No Format |
---|
$ curl -X GET http://localhost:9998/tika
This is Tika Server. Please PUT
|
Get the Text of a Document
No Format |
---|
$ curl -X PUT --data-binary @GeoSPARQL.pdf http://localhost:9998/tika --header "Content-type: application/pdf"
$ curl -T price.xls http://localhost:9998/tika --header "Accept: text/html"
$ curl -T price.xls http://localhost:9998/tika --header "Accept: text/plain"
|
Use the Boilerpipe handler (equivalent to tika-app's --text-main) with text output:
No Format |
---|
$ curl -T price.xls http://localhost:9998/tika/main --header "Accept: text/plain"
|
...
The Tika Resource also accepts files as multipart/form-data attachments via POST. Posting files as multipart attachments may help when the files are too big to PUT directly in the request body. Note that the Tika JAX-RS server makes a best effort to store some of the multipart content on disk while still supporting streaming:
No Format |
---|
curl -F upload=@price.xls http://localhost:9998/tika/form
|
...
Detector Resource
No Format |
---|
/detect/stream
|
HTTP PUT a document to have Tika's Default Detector identify its MIME/media type. Providing a filename hint can improve the quality of detection.
...
PUT an RTF file and get back RTF
No Format |
---|
$ curl -X PUT --data-binary @TODO.rtf http://localhost:9998/detect/stream
|
PUT a CSV file without filename hint and get back text/plain
No Format |
---|
$ curl -X PUT --upload-file foo.csv http://localhost:9998/detect/stream
|
PUT a CSV file with filename hint and get back text/csv
No Format |
---|
$ curl -X PUT -H "Content-Disposition: attachment; filename=foo.csv" --upload-file foo.csv http://localhost:9998/detect/stream
|
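The filename hint in the last call is just a `Content-Disposition` header. A small Python sketch of preparing such a request follows; `build_detect_request` is a hypothetical helper, and the explicit `application/octet-stream` content type is an assumption made to avoid `urllib`'s form-encoded default.

```python
import urllib.request
from typing import Optional


def build_detect_request(
    data: bytes,
    filename: Optional[str] = None,
    base_url: str = "http://localhost:9998",
) -> urllib.request.Request:
    """Prepare (but do not send) a PUT to /detect/stream."""
    # Assumption: a generic content type is fine for detection requests.
    headers = {"Content-Type": "application/octet-stream"}
    if filename is not None:
        # Same header value as in the cURL example above.
        headers["Content-Disposition"] = f"attachment; filename={filename}"
    return urllib.request.Request(
        base_url + "/detect/stream", data=data, method="PUT", headers=headers
    )


with_hint = build_detect_request(b"a,b\n1,2\n", filename="foo.csv")
without_hint = build_detect_request(b"a,b\n1,2\n")
```

Note that `urllib` stores header names in capitalized form (`Content-disposition`); HTTP header names are case-insensitive, so this does not change what the server sees.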
Language Resource
No Format |
---|
/language/stream
|
HTTP PUT or POST a UTF-8 text file to have the LanguageIdentifier identify its language.
...
PUT a TXT file with the English text "This is English!" and get back en
No Format |
---|
$ curl -X PUT --data-binary @foo.txt http://localhost:9998/language/stream
en
|
PUT a TXT file with the French text "comme çi comme ça" and get back fr
No Format |
---|
$ curl -X PUT --data-binary @foo.txt http://localhost:9998/language/stream
fr
|
No Format |
---|
/language/string
|
HTTP PUT or POST a text string to have the LanguageIdentifier identify its language.
...
PUT a string with the English text "This is English!" and get back en
No Format |
---|
$ curl -X PUT --data "This is English!" http://localhost:9998/language/string
en
|
PUT a string with the French text "comme çi comme ça" and get back fr
No Format |
---|
$ curl -X PUT --data "comme çi comme ça" http://localhost:9998/language/string
fr
|
Translate Resource
No Format |
---|
/translate/all/translator/src/dest
|
...
PUT a TXT file named sentences with the Spanish text "me falta práctica en Español" and get back the English translation using Lingo24
No Format |
---|
$ curl -X PUT --data-binary @sentences http://localhost:9998/translate/all/org.apache.tika.language.translate.Lingo24Translator/es/en
lack of practice in Spanish
|
PUT a TXT file named sentences with the Spanish text "me falta práctica en Español" and get back the English translation using Microsoft
No Format |
---|
$ curl -X PUT --data-binary @sentences http://localhost:9998/translate/all/org.apache.tika.language.translate.MicrosoftTranslator/es/en
I need practice in Spanish
|
PUT a TXT file named sentences with the Spanish text "me falta práctica en Español" and get back the English translation using Google
No Format |
---|
$ curl -X PUT --data-binary @sentences http://localhost:9998/translate/all/org.apache.tika.language.translate.GoogleTranslator/es/en
I need practice in Spanish
|
No Format |
---|
/translate/all/translator/dest
|
HTTP PUT or POST a document to the identified *translator*, which auto-detects the *src* language using LanguageIdentifiers and then translates from *src* to *dest*.
...
PUT a TXT file named sentences2 with the French text "comme çi comme ça" and get back the English translation using Google, auto-detecting the language
No Format |
---|
$ curl -X PUT --data-binary @sentences2 http://localhost:9998/translate/all/org.apache.tika.language.translate.GoogleTranslator/en
so so
|
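The translate calls above differ only in whether a source language segment is present in the path. A short Python sketch of building either path form is shown here; `translate_path` is a hypothetical helper, and the path layout is inferred from the cURL examples above rather than from the server source.

```python
from typing import Optional
from urllib.parse import quote


def translate_path(translator: str, dest: str, src: Optional[str] = None) -> str:
    """Build /translate/all/<translator>[/<src>]/<dest>.

    Omitting `src` yields the auto-detecting form used in the last example.
    """
    parts = ["translate", "all", translator]
    if src is not None:
        parts.append(src)
    parts.append(dest)
    # Percent-encode each segment so it cannot introduce extra slashes.
    return "/" + "/".join(quote(part, safe="") for part in parts)


google = "org.apache.tika.language.translate.GoogleTranslator"
explicit = translate_path(google, "en", src="es")
auto = translate_path(google, "en")
```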
Recursive Metadata and Content
No Format |
---|
/rmeta
|
Returns a JSONified list of Metadata objects for the container document and all embedded documents. The text that is extracted from each document is stored in the metadata object under "X-TIKA:content".
No Format |
---|
$ curl -T test_recursive_embedded.docx http://localhost:9998/rmeta
|
Returns:
No Format |
---|
[
{"Application-Name":"Microsoft Office Word",
"Application-Version":"15.0000",
"X-Parsed-By":["org.apache.tika.parser.DefaultParser","org.apache.tika.parser.microsoft.ooxml.OOXMLParser"],
"X-TIKA:content":"embed_0 "
...
},
{"Content-Encoding":"ISO-8859-1",
"Content-Length":"8",
"Content-Type":"text/plain; charset=ISO-8859-1",
"X-TIKA:content":"embed_1b",
...
}
...
]
|
The default format for "X-TIKA:content" is XML. However, you can select "text only" with
No Format |
---|
/rmeta/text
|
HTML with
No Format |
---|
/rmeta/html
|
and no content (metadata only) with
No Format |
---|
/rmeta/ignore
|
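Because /rmeta returns one JSON object per document, pulling out the extracted content is a matter of reading the "X-TIKA:content" key from each entry. The following Python sketch illustrates this; `contents_by_document` is a hypothetical helper, and the sample input is a cut-down, made-up response in the shape described above.

```python
import json
from typing import List, Optional


def contents_by_document(rmeta_json: str) -> List[Optional[str]]:
    """Return the "X-TIKA:content" value of each metadata object, in order.

    Entries without content (for example from /rmeta/ignore) yield None.
    """
    documents = json.loads(rmeta_json)
    return [doc.get("X-TIKA:content") for doc in documents]


# Illustrative input only: container document, one embedded text file,
# and one entry that carries no content.
sample = json.dumps([
    {"Application-Name": "Microsoft Office Word", "X-TIKA:content": "embed_0 "},
    {"Content-Type": "text/plain; charset=ISO-8859-1", "X-TIKA:content": "embed_1b"},
    {"Content-Type": "image/png"},
])
contents = contents_by_document(sample)
```

The first entry is the container document; the remaining entries are the embedded documents.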
Multipart Support
The /rmeta endpoint also accepts files as multipart/form-data attachments via POST. Posting files as multipart attachments may help when the files are too big to PUT directly in the request body. Note that the Tika JAX-RS server makes a best effort to store some of the multipart content on disk while still supporting streaming:
No Format |
---|
curl -F upload=@test_recursive_embedded.docx http://localhost:9998/rmeta/form
|
...
See the TikaEval page for more details. Please open issues on our JIRA if you would like other statistics included or if you'd like to make the calculated statistics configurable.
Unpack Resource
No Format |
---|
/unpack
|
HTTP PUT a document that contains embedded files to the /unpack service to get back a zip or tar of the raw bytes of the embedded files. Note that this does not operate recursively; it extracts only the child documents of the original file.
...
PUT zip file and get back met file zip
No Format |
---|
$ curl -X PUT --data-binary @foo.zip http://localhost:9998/unpack --header "Content-type: application/zip"
|
PUT doc file and get back met file tar
No Format |
---|
$ curl -T Doc1_ole.doc -H "Accept: application/x-tar" http://localhost:9998/unpack > /var/tmp/x.tar
|
PUT doc file and get back the content and metadata
No Format |
---|
$ curl -T Doc1_ole.doc http://localhost:9998/unpack/all > /var/tmp/x.zip
|
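Once the zip response has been saved or read into memory, its entries can be inspected without writing them to disk. Here is a minimal Python sketch; `list_unpacked` is a hypothetical helper, and the entry names in the usage example are invented for illustration, not names the server is known to produce.

```python
import io
import zipfile
from typing import Dict


def list_unpacked(zip_bytes: bytes) -> Dict[str, int]:
    """Map each entry name in an /unpack zip response to its size in bytes."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return {info.filename: info.file_size for info in archive.infolist()}


# Build a small in-memory zip to stand in for a server response.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("image1.png", b"1234")
    archive.writestr("embedded.txt", b"hello")
entries = list_unpacked(buffer.getvalue())
```

For the tar variant, the standard `tarfile` module could be used in the same way.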
...
Available Endpoints
No Format |
---|
/
|
Hitting the root of the server in your web browser gives a basic report of all the endpoints defined in the server, including the URL of each.
Defined Mime Types
No Format |
---|
/mime-types
|
Lists the defined MIME types, their aliases, their supertype, and the parser. Available as plain text, JSON or human-readable HTML.
Available Detectors
No Format |
---|
/detectors
|
Shows the top-level Detector to be used and any child detectors within it. Available as plain text, JSON or human-readable HTML.
Available Parsers
No Format |
---|
/parsers
|
Lists all of the parsers currently available.
No Format |
---|
/parsers/details
|
Lists all the available parsers, along with the MIME types they support.
...
In Tika 1.14, we added the capability back, but the user has to acknowledge the security risk by including two command-line arguments:
No Format |
---|
$ java -jar tika-server-x.x.jar -enableUnsecureFeatures -enableFileUrl
|
This allows the user to specify a fileUrl in the header:
No Format |
---|
curl -i -H "fileUrl:http://tika.apache.org" -H "Accept:text/plain" -X PUT http://localhost:9998/tika
|
or
No Format |
---|
curl -i -H "fileUrl:file:///C:/data/my_test_doc.pdf" -H "Accept:text/plain" -X PUT http://localhost:9998/tika
|
...
NOTE 2: In Tika 1.x, to specify JVM args for the child process, prepend each argument with -J (as in -JXmx4g) and place it after the -jar tika-server-x.x.jar call, as in:
No Format |
---|
$ java -Dlog4j.configuration=file:log4j_server.xml -jar tika-server-x.x.jar -spawnChild -JXmx4g -JDlog4j.configuration=file:log4j_child.xml
|
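The -J convention is mechanical, so a launcher script can derive the child arguments from ordinary JVM flags. Below is a rough Python sketch that reproduces the command line above; `server_command` is a hypothetical helper, and it assumes every child flag starts with a single dash.

```python
from typing import List, Sequence


def server_command(
    jar: str,
    child_jvm_args: Sequence[str],
    parent_jvm_args: Sequence[str] = (),
) -> List[str]:
    """Build the argument list for starting the server with a spawned child."""
    command = ["java", *parent_jvm_args, "-jar", jar, "-spawnChild"]
    for arg in child_jvm_args:
        if not arg.startswith("-") or arg.startswith("--"):
            raise ValueError(f"expected a single-dash JVM flag, got {arg!r}")
        # "-Xmx4g" becomes "-JXmx4g": drop the dash, prepend "-J".
        command.append("-J" + arg[1:])
    return command


command = server_command(
    "tika-server-x.x.jar",
    child_jvm_args=["-Xmx4g", "-Dlog4j.configuration=file:log4j_child.xml"],
    parent_jvm_args=["-Dlog4j.configuration=file:log4j_server.xml"],
)
```

The resulting list can be handed to `subprocess.run(command)` once the jar name is replaced with a real file.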
...
If you are new to TLS, see our README.txt for how we generated client and server keystores and truststores for our unit tests.
Configuring Parsers at Parse time/per file
See Configuring Parsers At Parse Time in tika-server.
Architecture
Tika Server is a network server based on JSR 311 (JAX-RS). The server package uses the Apache CXF framework, which provides an implementation of JAX-RS for Java.