...
Key | Notes
---|---
Content-Type | The file's MIME type as identified by Tika, e.g. application/pdf
X-TIKA:digest:MD5 | If you've configured digests, they are returned with a key of the form X-TIKA:digest:ALGORITHM.
resourceName | File name
Content-Length | When available, the number of bytes in the stream
X-TIKA:content | The text extracted from the file
X-TIKA:content_handler | The content handler that was used for handling the text (e.g. Text, XHTML)
X-TIKA:embedded_resource_path | 
X-TIKA:embedded_depth | 
X-TIKA:encrypted | If a parser throws an EncryptedDocumentException, it also sets this value to true in the metadata.
tika:file_ext | File extension
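As an illustration, these keys appear in the JSON that Tika Server returns (the /rmeta endpoint emits one metadata object per document, container first and embedded files after it). A minimal sketch over an illustrative, hand-written response; the file names and values are made up:

```python
import json

# Illustrative /rmeta-style response: a list of metadata objects, one per
# document (the container, then each embedded file). Values are samples.
response = json.loads("""
[
  {
    "Content-Type": "application/pdf",
    "Content-Length": "12345",
    "resourceName": "report.pdf",
    "X-TIKA:content": "Extracted text of the PDF...",
    "X-TIKA:embedded_depth": "0"
  },
  {
    "Content-Type": "image/png",
    "resourceName": "logo.png",
    "X-TIKA:embedded_resource_path": "/logo.png",
    "X-TIKA:embedded_depth": "1"
  }
]
""")

for doc in response:
    mime = doc.get("Content-Type")
    depth = int(doc.get("X-TIKA:embedded_depth", "0"))
    text = doc.get("X-TIKA:content", "")
    # Indent by embedded depth to show the container/attachment structure.
    print(f"{'  ' * depth}{doc.get('resourceName')} ({mime}), {len(text)} chars of text")
```

Note that all values arrive as strings, so numeric keys such as Content-Length and X-TIKA:embedded_depth need to be parsed by the client.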
...
Format Specific Metadata
PDF Metadata
PDF metadata is typically stored via two mechanisms: the "native" PDF docinfo metadata object and XMP. Where the same key, e.g. "created", appears in both the docinfo and the XMP, Tika reports the value from the XMP; in this case, the created date from the XMP would be reported as dcterms:created.
Some users want to extract the literal docinfo information (irrespective of the XMP), and for that Tika prefixes keys with pdf:docinfo.
Note that XMP metadata may have custom keys, and some PDFs store custom metadata in the docinfo.
PDF is a "page-based" file format, and the number of pages is stored in xmpTPg:NPages.
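The precedence rule above can be sketched as a small lookup helper over Tika's flattened metadata: prefer the XMP-backed key and fall back to the literal pdf:docinfo form when the XMP value is absent. The helper name and the sample dict are hypothetical; Tika itself applies this mapping during parsing:

```python
# Hypothetical helper: given a flattened metadata dict, prefer the
# XMP-backed key (dcterms:created) and fall back to the literal
# docinfo key (pdf:docinfo:created) when no XMP value is present.
def created_date(metadata):
    return metadata.get("dcterms:created") or metadata.get("pdf:docinfo:created")

# Sample values for illustration only.
meta = {
    "pdf:docinfo:created": "2020-01-01T00:00:00Z",
    "dcterms:created": "2021-06-15T09:30:00Z",
}
print(created_date(meta))  # the XMP value wins when both are present
```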
Key | Notes
---|---
access_permission:assemble_document | 
access_permission:can_modify | 
access_permission:can_print | 
access_permission:can_print_degraded | 
access_permission:extract_content | 
access_permission:extract_for_accessibility | 
access_permission:fill_in_form | 
access_permission:modify_annotations | 
pdf:actionTrigger | 
pdf:annotationSubtypes | 
pdf:annotationTypes | 
pdf:charsPerPage | 
pdf:docinfo:custom:* | Custom metadata stored in the docinfo dictionary, e.g. pdf:docinfo:custom:_dlc_policyId
pdf:docinfo:created | 
pdf:docinfo:custom:Company | 
pdf:docinfo:custom:SourceModified | 
pdf:docinfo:creator | 
pdf:docinfo:creator_tool | 
pdf:docinfo:keywords | 
pdf:docinfo:modified | 
pdf:docinfo:producer | 
pdf:docinfo:title | 
pdf:docinfo:trapped | 
pdf:has3D | 
pdf:hasAcroFormFields | 
pdf:hasCollection | 
pdf:hasMarkedContent | 
pdf:hasXFA | 
pdf:hasXMP | 
pdf:PDFExtensionVersion | 
pdf:PDFVersion | 
pdf:producer | 
pdf:unmappedUnicodeCharsPerPage | 
pdfa:PDFVersion | 
pdfaid:conformance | 
pdfaid:part | 
pdfuaid:part | 
pdfvt:modified | 
pdfvt:version | 
pdfx:conformance | 
pdfx:version | 
pdfxid:version | 
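Since custom docinfo entries all share the pdf:docinfo:custom: prefix, they are easy to pull out of a flattened metadata dict. A short sketch; the key names after the prefix are illustrative and vary per document:

```python
PREFIX = "pdf:docinfo:custom:"

# Sample flattened PDF metadata; the custom key names are made up.
meta = {
    "pdf:PDFVersion": "1.7",
    "pdf:docinfo:title": "Quarterly report",
    "pdf:docinfo:custom:_dlc_policyId": "retain-7y",
    "pdf:docinfo:custom:Department": "Finance",
}

# Keep only the custom docinfo entries, stripping the shared prefix.
custom = {k[len(PREFIX):]: v for k, v in meta.items() if k.startswith(PREFIX)}
print(custom)  # {'_dlc_policyId': 'retain-7y', 'Department': 'Finance'}
```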
Microsoft Office Files
Key | Notes
---|---
embeddedRelationshipId | 
RTF Files
Key | Notes
---|---
rtf_meta:emb_app_version | 
rtf_meta:emb_class | 
rtf_meta:thumbnail | 
rtf_pict:* | Metadata around embedded images in RTF. A few examples: rtf_pict:borderLeftColor, rtf_pict:borderRightColor, rtf_pict:borderTopColor, rtf_pict:dhgt, rtf_pict:dxHeightHR, rtf_pict:dxTextLeft, rtf_pict:dxTextRight, rtf_pict:dxWidthHR
Tiff Files
Key | Notes
---|---
tiff:ImageWidth | 
tiff:ImageLength | 
tiff:BitsPerSample | 
Exif Keys
Key | Notes
---|---
Exif SubIFD:Metering Mode | 
Exif SubIFD:White Balance Mode | 
Exif SubIFD:Scene Capture Type | 
Exif SubIFD:Exposure Mode | 
Text/Html-based Files
Tool Specific Metadata
Tika Eval
To get this metadata, you need to have the tika-eval-core jar on your classpath.
Key | Notes
---|---
tika-eval:numTokens | Number of tokens (words) in the extracted text
tika-eval:numUniqueTokens | Number of unique tokens (words); together with numTokens, useful for measuring vocabulary richness/repetition
tika-eval:numAlphaTokens | Number of alphabetic tokens
tika-eval:numUniqueAlphaTokens | Number of unique alphabetic tokens
tika-eval:lang | Language automatically detected by Tika's modified OpenNLP language detector
tika-eval:langConfidence | Confidence of the detected language
tika-eval:oov | Out-of-vocabulary statistic. The tika-eval module has lists of the top 20k most common words for each of 120+ languages. Based on the detected language, the number of "common tokens" is divided by the number of alphabetic tokens; subtracting this value from 1 gives the percentage of words that are not in the top 20k "common words" for the identified language. This is very helpful for junk detection (identifying when text extraction failed) and for comparing the output of two parsers. See Popat's paper.
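The oov calculation described above reduces to simple arithmetic over two token counts. A sketch with made-up counts (the function name is ours, not tika-eval's API):

```python
# tika-eval's oov statistic: the fraction of alphabetic tokens that are
# NOT in the top-20k "common words" list for the detected language.
def oov(num_common_tokens, num_alpha_tokens):
    if num_alpha_tokens == 0:
        return 0.0  # no alphabetic tokens: nothing to measure
    return 1.0 - num_common_tokens / num_alpha_tokens

# Illustrative counts: 800 of 1000 alphabetic tokens were common words,
# so roughly 20% of the words are out of vocabulary.
print(oov(800, 1000))
```

A high oov value for the detected language is a strong hint that extraction produced junk, since real text in that language should be dominated by its common words.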
Siegfried Detector
To extract Siegfried detection information, you must have the Siegfried command-line application installed (and callable as "sf" on the command line), and you need to add the tika-detector-siegfried jar to your classpath.
Key | Notes
---|---
sf:pronom:mime | 
sf:pronom:format | 
sf:pronom:version | 
sf:pronom:id | 
sf:pronom:basis | 
sf:errors | 