**Page Under Construction**
This page was written for Tika 2.6.0 and will be extended over time. Those migrating from Tika 1.x to 2.x should note that there are important changes; see: Migrating to Tika 2.0.0.
Where possible, the Tika project relies on standards such as Dublin Core and maps file-format-specific key names to those standards. However, that is not possible in all cases. Many file formats allow custom metadata, which means that the set of metadata keys one might encounter in the wild is open-ended.
In October 2022, the Tika team counted Metadata keys in Tika extracts from 1 million files in our regression corpus. The output is available here: metadata-keys-1m-20221006.tgz.
In Tika 3.x, we'll try to require that every metadata key has a namespace. We have moved in that direction slowly, but we have not yet achieved that goal.
If you would like Tika to modify metadata key names or metadata values before returning output, see the section on MetadataFilters: ModifyingContentWithHandlersAndMetadataFilters. If you only need a limited set of metadata keys, you can add a MetadataWriteFilter, which prevents Tika from writing unwanted metadata in the first place. MetadataWriteFilters are also covered in ModifyingContentWithHandlersAndMetadataFilters.
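To make the effect of a write filter concrete, here is a minimal sketch in plain Java. It does not use Tika's MetadataWriteFilter API: the class `AllowListSketch`, its methods and the sample keys and values are hypothetical stand-ins that only illustrate the idea of dropping unwanted keys at write time rather than after the parse.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for a metadata store guarded by an allow list.
// This is NOT Tika's MetadataWriteFilter API; it only illustrates the concept.
class AllowListSketch {
    private final Set<String> allowedKeys;
    private final Map<String, String> values = new LinkedHashMap<>();

    AllowListSketch(Set<String> allowedKeys) {
        this.allowedKeys = allowedKeys;
    }

    // Store the value only if the key is on the allow list; otherwise drop it.
    void put(String key, String value) {
        if (allowedKeys.contains(key)) {
            values.put(key, value);
        }
    }

    Map<String, String> asMap() {
        return values;
    }

    public static void main(String[] args) {
        AllowListSketch metadata = new AllowListSketch(Set.of("dc:title", "Content-Type"));
        metadata.put("dc:title", "Annual Report");
        metadata.put("pdf:docinfo:producer", "SomeProducer"); // not on the list, never stored
        System.out.println(metadata.asMap());
    }
}
```

For the real configuration options, refer to ModifyingContentWithHandlersAndMetadataFilters.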
To get the most complete metadata, we recommend using the RecursiveParserWrapper, the /rmeta endpoint or the -J option on tika-app. For tika-server specifically, see: TikaServerEndpointsCompared.
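For tika-server, a request to the /rmeta endpoint can be sketched with the JDK's own HTTP client (Java 11+). The class and method names are hypothetical, and the base URL is an assumption (a local tika-server typically listens on http://localhost:9998). The request is only built here, not sent.

```java
import java.net.URI;
import java.net.http.HttpRequest;

class RmetaRequestSketch {
    // Build a PUT request that uploads the file bytes to the /rmeta endpoint.
    // baseUrl is an assumption, e.g. "http://localhost:9998" for a local tika-server.
    static HttpRequest build(String baseUrl, byte[] fileBytes) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/rmeta"))
                .PUT(HttpRequest.BodyPublishers.ofByteArray(fileBytes))
                .build();
    }
}
```

The request can then be sent with `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())`. The response body should be a JSON list of metadata objects, covering the container file and its embedded files.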
Tika Process
These capture behavior of parsers or other components during the parse.
Key | Notes |
---|---|
X-TIKA:Parsed-By | Which parser parsed a given file |
X-TIKA:Parsed-By-Full-Set | All the parsers that touched a given file and its embedded files. This key is reported in the metadata object of the primary file |
X-TIKA:parse_time_millis | Milliseconds it took to parse a given file and its embedded files. |
X-TIKA:EXCEPTION:container_exception | |
X-TIKA:EXCEPTION:embedded_exception | If there is a parse exception while parsing an embedded file, the stack trace is stored with this key. |
Tika General
Key | Notes |
---|---|
Content-Type | This is the file's mime type as identified by Tika. Example: application/pdf |
X-TIKA:digest:MD5 | If you've configured digests, they are returned with a key of the form X-TIKA:digest:ALGORITHM. |
resourceName | File name |
Content-Length | When available, the number of bytes in a stream |
X-TIKA:content | This is the text that is extracted from the files |
X-TIKA:content_handler | This is the content handler that was used for handling the text (e.g. Text, XHTML, etc.) |
X-TIKA:embedded_resource_path | |
X-TIKA:embedded_depth | |
X-TIKA:encrypted | If a parser throws an EncryptedDocumentException, the parser also sets this value to true in the metadata. |
tika:file_ext | File extension |
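If you want to cross-check a digest that Tika reports, the following sketch builds a key of the form X-TIKA:digest:ALGORITHM and computes a digest independently with the JDK. The class and method names are hypothetical, and lowercase hex encoding is an assumption; the encoding Tika emits depends on how digests are configured.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class DigestKeySketch {
    // Key of the form X-TIKA:digest:ALGORITHM, e.g. X-TIKA:digest:MD5.
    static String digestKey(String algorithm) {
        return "X-TIKA:digest:" + algorithm;
    }

    // Lowercase hex digest of the given bytes.
    // Hex encoding is an assumption about the configured digest encoder.
    static String hexDigest(String algorithm, byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance(algorithm);
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest(data)) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("Unsupported algorithm: " + algorithm, e);
        }
    }
}
```

Comparing `hexDigest("MD5", fileBytes)` with the value stored under `digestKey("MD5")` is one way to confirm that the bytes Tika parsed are the bytes you expected.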
Dublin Core
Key | Notes |
---|---|
dc:creator | |
dcterms:created | |
dcterms:modified | |
dc:rights | |
dc:contributor | |
dc:title | |
dc:relation | |
dc:type | |
dc:identifier | |
dc:publisher | |
dc:description | |
dc:subject | |
dc:language | |
dc:format | |
XMP (eXtensible Metadata Platform)
Key | Notes |
---|---|
xmp:About | |
xmp:CreateDate | |
xmp:CreatorTool | |
xmp:Identifier | |
xmp:Label | |
xmp:MetadataDate | |
xmp:ModifyDate | |
xmp:Rating | |
xmpDM:album | |
xmpDM:albumArtist | |
xmpDM:artist | |
xmpDM:audioChannelType | |
xmpDM:audioCompressor | |
xmpDM:audioSampleRate | |
xmpDM:audioSampleType | |
xmpDM:compilation | |
xmpDM:composer | |
xmpDM:copyright | |
xmpDM:discNumber | |
xmpDM:duration | |
xmpDM:genre | |
xmpDM:logComment | |
xmpDM:releaseDate | |
xmpDM:trackNumber | |
xmpDM:videoCompressor | |
xmpMM:DerivedFrom:DocumentID | |
xmpMM:DerivedFrom:InstanceID | |
xmpMM:DocumentID | |
xmpMM:History:Action | |
xmpMM:History:InstanceID | |
xmpMM:History:SoftwareAgent | |
xmpMM:History:When | |
xmpTPg:NPages | |
Format Specific Metadata
PDF Metadata
PDF metadata is typically stored via two mechanisms: the "native" PDF docinfo metadata object and XMP. Where the same key, e.g. "created", exists in both the docinfo and the XMP, Tika reports the information in the XMP. In this case, the created date in the XMP would be reported as dcterms:created.
Some users want to extract the literal docinfo information (irrespective of the XMP), and for that Tika prefixes keys with pdf:docinfo.
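The relationship between the two sets of keys can be sketched as a small lookup helper. A plain `Map` stands in for Tika's metadata object, and the class name, method name and any sample dates are hypothetical.

```java
import java.util.Map;

class PdfCreatedDateSketch {
    // Return the literal docinfo created date if present. Otherwise fall back to
    // dcterms:created, which carries the XMP value when both XMP and docinfo define it.
    static String literalDocinfoCreated(Map<String, String> metadata) {
        String docinfo = metadata.get("pdf:docinfo:created");
        return docinfo != null ? docinfo : metadata.get("dcterms:created");
    }
}
```

A caller that needs the XMP-preferred value would read dcterms:created directly instead.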
Note that XMP metadata may have custom keys, and some PDFs store custom metadata in the docinfo.
PDF is a "page-based" file format, and the number of pages is stored in xmpTPg:NPages.
Key | Notes |
---|---|
access_permission:assemble_document | |
access_permission:can_modify | |
access_permission:can_print | |
access_permission:can_print_degraded | |
access_permission:extract_content | |
access_permission:extract_for_accessibility | |
access_permission:fill_in_form | |
access_permission:modify_annotations | |
pdf:actionTrigger | |
pdf:annotationSubtypes | |
pdf:annotationTypes | |
pdf:charsPerPage | |
pdf:docinfo:custom:* | Custom metadata stored in the docinfo dictionary, e.g. pdf:docinfo:custom:_dlc_policyId |
pdf:docinfo:created | |
pdf:docinfo:creator | |
pdf:docinfo:creator_tool | |
pdf:docinfo:keywords | |
pdf:docinfo:modified | |
pdf:docinfo:producer | |
pdf:docinfo:title | |
pdf:docinfo:trapped | |
pdf:has3D | |
pdf:hasAcroFormFields | |
pdf:hasCollection | |
pdf:hasMarkedContent | |
pdf:hasXFA | |
pdf:hasXMP | |
pdf:PDFExtensionVersion | |
pdf:PDFVersion | |
pdf:producer | |
pdf:unmappedUnicodeCharsPerPage | |
pdfa:PDFVersion | |
pdfaid:conformance | |
pdfaid:part | |
pdfuaid:part | |
pdfvt:modified | |
pdfvt:version | |
pdfx:conformance | |
pdfx:version | |
pdfxid:version | |
Microsoft Office Files
Key | Notes |
---|---|
embeddedRelationshipId | |
RTF Files
Key | Notes |
---|---|
rtf_meta:emb_app_version | |
rtf_meta:emb_class | |
rtf_meta:thumbnail | |
rtf_pict:* | Metadata about embedded images in RTF. A few examples: rtf_pict:borderLeftColor, rtf_pict:borderRightColor, rtf_pict:borderTopColor, rtf_pict:dhgt, rtf_pict:dxHeightHR, rtf_pict:dxTextLeft, rtf_pict:dxTextRight, rtf_pict:dxWidthHR |
Tiff Files
Key | Notes |
---|---|
tiff:ImageWidth | |
tiff:ImageLength | |
tiff:BitsPerSample | |
Exif Keys
Key | Notes |
---|---|
Exif SubIFD:Metering Mode | |
Exif SubIFD:White Balance Mode | |
Exif SubIFD:Scene Capture Type | |
Exif SubIFD:Exposure Mode | |
Text/Html-based Files
Key | Notes |
---|---|
Content-Encoding | |
Tool Specific Metadata
Tika Eval
To get this metadata, you need to have the tika-eval-core jar on your class path.
Key | Notes |
---|---|
tika-eval:numTokens | Number of tokens (words) in the extracted text. |
tika-eval:numUniqueTokens | Number of unique tokens (words). Used together with numTokens, this is useful for measuring vocabulary richness/repetition. |
tika-eval:numAlphaTokens | Number of alphabetic tokens |
tika-eval:numUniqueAlphaTokens | Number of unique alphabetic tokens |
tika-eval:lang | Language automatically detected by Tika's modified OpenNLP language detector |
tika-eval:langConfidence | Confidence score for the detected language |
tika-eval:oov | Out-of-vocabulary statistic. The tika-eval module has lists of the top 20k most common words for each of 120+ languages. For the detected language, the number of "common tokens" is divided by the number of alphabetic tokens, and that value is subtracted from 1 to give the proportion of words that are not in the top 20k "common words" for the identified language. This is very helpful for junk detection (identifying when text extraction failed) and for comparing the output of two parsers. See Popat's paper. |
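The oov arithmetic described in the table can be written out as a short worked example. This is a sketch of the formula only, not tika-eval's implementation. The class and method names are hypothetical, and rejecting a zero count of alphabetic tokens is an assumption made here to avoid division by zero.

```java
class OovSketch {
    // oov = 1 - (commonTokens / alphaTokens)
    static double oov(int commonTokens, int alphaTokens) {
        if (alphaTokens <= 0) {
            // Assumption for this sketch: no alphabetic tokens means no meaningful statistic.
            throw new IllegalArgumentException("alphaTokens must be positive");
        }
        return 1.0 - ((double) commonTokens / alphaTokens);
    }
}
```

For instance, if 75 of 100 alphabetic tokens are in the common-word list, `oov(75, 100)` gives 0.25. A value close to 1 suggests that most of the extracted text is not made up of common words in the detected language, which can indicate failed extraction.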
Siegfried Detector
To extract Siegfried detection information, you must have the Siegfried command-line application installed (and callable as "sf" on the command line), and you need to add the tika-detector-siegfried jar to your class path.
Key | Notes |
---|---|
sf:pronom:mime | |
sf:pronom:format | |
sf:pronom:version | |
sf:pronom:id | |
sf:pronom:basis | |
sf:errors | |