****Page Under Construction****

This page was written for Tika version 2.6.0 and will be expanded over time.  For those migrating from Tika 1.x to 2.x, there are important changes; see: Migrating to Tika 2.0.0.

The Tika project tries to rely on standards such as Dublin Core, and we map file-format-specific key names to those standards when possible.  However, that is not possible in all cases. Many file formats allow custom metadata, which means that the metadata keys one might encounter in the wild are an open set.

In October 2022, the Tika team counted Metadata keys in Tika extracts from 1 million files in our regression corpus. The output is available here: metadata-keys-1m-20221006.tgz.

In Tika 3.x, we'll try to require that every metadata key has a namespace.  We have moved in that direction slowly, but we have not yet achieved that goal.

If you would like Tika to modify metadata key names or metadata values before returning output, see the section on MetadataFilters: ModifyingContentWithHandlersAndMetadataFilters.  If you need only a limited set of metadata keys, you can add a MetadataWriteFilter, which prevents Tika from writing the metadata you do not want in the first place.  MetadataWriteFilters are also covered in ModifyingContentWithHandlersAndMetadataFilters.
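The effect of restricting output to a limited set of keys can be pictured with plain Java collections. The sketch below does not use Tika's MetadataFilter or MetadataWriteFilter classes; `MetadataAllowList` and `keepOnly` are hypothetical names, and the metadata is modelled as a simple map from key to values. For the real configuration, follow ModifyingContentWithHandlersAndMetadataFilters.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of Tika: keeps only the keys a caller asked for.
final class MetadataAllowList {

    // Returns a new map containing only the entries whose key is in `wanted`.
    static Map<String, List<String>> keepOnly(Map<String, List<String>> metadata,
                                              Set<String> wanted) {
        Map<String, List<String>> kept = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> entry : metadata.entrySet()) {
            if (wanted.contains(entry.getKey())) {
                kept.put(entry.getKey(), entry.getValue());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Illustrative values only; real extracts contain many more keys.
        Map<String, List<String>> metadata = new LinkedHashMap<>();
        metadata.put("Content-Type", List.of("application/pdf"));
        metadata.put("dc:title", List.of("Example"));
        metadata.put("pdf:docinfo:custom:Company", List.of("Example Corp"));

        System.out.println(keepOnly(metadata, Set.of("Content-Type", "dc:title")).keySet());
    }
}
```

Filtering after the fact like this still pays the cost of collecting every key, which is the cost a MetadataWriteFilter is described as avoiding.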

To get the most complete metadata, we recommend using the RecursiveMetadataParser, the /rmeta endpoint, or the -J option on tika-app.  For tika-server specifically, see: TikaServerEndpointsCompared.
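As an illustration of the /rmeta route, the following sketch sends a file to a tika-server instance using only the JDK's HTTP client (Java 11+). It assumes a server is already running on localhost:9998 and that the response is a JSON list with one metadata object for the primary file and one per embedded file; `RmetaRequestSketch` and `buildRequest` are names made up for this example.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;

final class RmetaRequestSketch {

    // Builds a PUT request that sends the raw file bytes to <serverBase>/rmeta
    // and asks for a JSON response.
    static HttpRequest buildRequest(String serverBase, byte[] content) {
        return HttpRequest.newBuilder(URI.create(serverBase + "/rmeta"))
                .header("Accept", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofByteArray(content))
                .build();
    }

    public static void main(String[] args) throws Exception {
        // Assumption: tika-server is listening on localhost:9998.
        byte[] bytes = Files.readAllBytes(Path.of(args[0]));
        HttpRequest request = buildRequest("http://localhost:9998", bytes);

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());

        // Print the raw JSON; parse it with the JSON library of your choice.
        System.out.println(response.body());
    }
}
```

Run it with the path of the file to parse as the only argument.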

Tika Process

These capture behavior of parsers or other components during the parse.

Key | Notes
X-TIKA:Parsed-By | Which parser parsed a given file
X-TIKA:Parsed-By-Full-Set | All the parsers that touched a given file and its embedded files.  This key is reported in the metadata object of the primary file.
X-TIKA:parse_time_millis | Milliseconds it took to parse a given file and its embedded files.
X-TIKA:EXCEPTION:container_exception
X-TIKA:EXCEPTION:embedded_exception | If there is a parse exception while parsing an embedded file, the stack trace is stored with this key.

Tika General

Key | Notes
Content-Type | The file's MIME type as identified by Tika. Example: application/pdf
X-TIKA:digest:MD5 | If you have configured digests, they are returned with a key of the form X-TIKA:digest:ALGORITHM.
resourceName | File name
Content-Length | When available, the number of bytes in a stream
X-TIKA:content | The text extracted from the file
X-TIKA:content_handler | The content handler used for handling the text (e.g. Text, XHTML)
X-TIKA:embedded_resource_path
X-TIKA:embedded_depth
tika:file_ext | File extension
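One use of the digest key is to check a file you hold against the value Tika reported for it. The sketch below builds the key name from the X-TIKA:digest:ALGORITHM pattern and computes a digest locally with the JDK. It assumes the reported value is a lowercase hex string, which depends on how digests were configured; `DigestKeySketch`, `digestKey` and `hexDigest` are illustrative names, not Tika classes or methods.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class DigestKeySketch {

    // Key pattern described in the table above: X-TIKA:digest:ALGORITHM
    static final String DIGEST_PREFIX = "X-TIKA:digest:";

    static String digestKey(String algorithm) {
        return DIGEST_PREFIX + algorithm;
    }

    // Computes the digest of `content` and returns it as lowercase hex.
    // Assumption: the value to compare against uses the same encoding.
    static String hexDigest(String algorithm, byte[] content) {
        try {
            byte[] hash = MessageDigest.getInstance(algorithm).digest(content);
            StringBuilder hex = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("Unsupported digest algorithm: " + algorithm, e);
        }
    }

    public static void main(String[] args) {
        byte[] content = "hello".getBytes(StandardCharsets.UTF_8);
        // Compare this locally computed value with the one found under the same key.
        System.out.println(digestKey("MD5") + " = " + hexDigest("MD5", content));
    }
}
```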

Dublin Core

Key | Notes
dc:creator
dcterms:created
dcterms:modified
dc:rights
dc:contributor
dc:title
dc:relation
dc:type
dc:identifier
dc:publisher
dc:description
dc:subject
dc:language
dc:format


XMP (eXtensible Metadata Platform)

Key | Notes
xmp:About
xmp:CreateDate
xmp:CreatorTool
xmp:Identifier
xmp:Label
xmp:MetadataDate
xmp:ModifyDate
xmp:Rating
xmpDM:album
xmpDM:albumArtist
xmpDM:artist
xmpDM:audioChannelType
xmpDM:audioCompressor
xmpDM:audioSampleRate
xmpDM:audioSampleType
xmpDM:compilation
xmpDM:composer
xmpDM:copyright
xmpDM:discNumber
xmpDM:duration
xmpDM:genre
xmpDM:logComment
xmpDM:releaseDate
xmpDM:trackNumber
xmpDM:videoCompressor
xmpMM:DerivedFrom:DocumentID
xmpMM:DerivedFrom:InstanceID
xmpMM:DocumentID
xmpMM:History:Action
xmpMM:History:InstanceID
xmpMM:History:SoftwareAgent
xmpMM:History:When
xmpTPg:NPages

Format Specific Metadata

PDF Metadata

Key | Notes
pdf:docinfo:created
pdf:docinfo:custom:Company
pdf:docinfo:custom:SourceModified

Microsoft Office Files

Key | Notes
embeddedRelationshipId




TIFF Files

Key | Notes
tiff:ImageWidth
tiff:ImageLength
tiff:BitsPerSample

Exif Keys

Key | Notes
Exif SubIFD:Metering Mode
Exif SubIFD:White Balance Mode
Exif SubIFD:Scene Capture Type
Exif SubIFD:Exposure Mode

Text/HTML-based Files

Key | Notes
Content-Encoding




Tool Specific Metadata

Tika Eval

To get this metadata, you need to have the tika-eval-core jar on your class path.

Key | Notes
tika-eval:numTokens | Number of tokens (words) in the extracted text.
tika-eval:numUniqueTokens | Number of unique tokens (words). Compared with numTokens, this is useful for measuring vocabulary richness or repetition.
tika-eval:numAlphaTokens | Number of alphabetic tokens
tika-eval:numUniqueAlphaTokens | Number of unique alphabetic tokens
tika-eval:lang | Language automatically detected by Tika's modified OpenNLP language detector
tika-eval:langConfidence | Confidence of that language detection
tika-eval:oov | Out-of-vocabulary statistic.  The tika-eval module has lists of the top 20k most common words for each of 120+ languages.  For the detected language, the number of "common tokens" is divided by the number of alphabetic tokens, and this value is subtracted from 1 to give the proportion of words that are not in the top 20k "common words" for that language.  This is very helpful for junk detection (identifying when text extraction failed) and for comparing the output of two parsers.  See Popat's paper.
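The arithmetic behind the oov value can be sketched in a few lines. This is not the tika-eval implementation: tokenization, the definition of an alphabetic token, and the common-word lists are all simplified here, and `OovSketch` with its methods is a hypothetical name. It only shows the formula described above, 1 minus (common tokens / alphabetic tokens).

```java
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class OovSketch {

    // Simplified assumption: a token is "alphabetic" if every code point is a letter.
    static boolean isAlphabetic(String token) {
        return !token.isEmpty() && token.codePoints().allMatch(Character::isLetter);
    }

    // Returns 1 - (common alphabetic tokens / alphabetic tokens),
    // or NaN when there are no alphabetic tokens at all.
    static double oov(List<String> tokens, Set<String> commonWords) {
        int alphaTokens = 0;
        int commonTokens = 0;
        for (String token : tokens) {
            if (isAlphabetic(token)) {
                alphaTokens++;
                if (commonWords.contains(token.toLowerCase(Locale.ROOT))) {
                    commonTokens++;
                }
            }
        }
        if (alphaTokens == 0) {
            return Double.NaN;
        }
        return 1.0 - ((double) commonTokens / alphaTokens);
    }

    public static void main(String[] args) {
        // Toy stand-in for a top-20k common-word list.
        Set<String> common = Set.of("the", "cat");
        List<String> tokens = List.of("the", "cat", "xqzv", "42", "the");
        // 4 alphabetic tokens, 3 of them common.
        System.out.println(oov(tokens, common));
    }
}
```

A value near 1 means almost none of the alphabetic tokens are common words for the detected language, which is the pattern the notes above associate with failed extraction.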

Siegfried Detector

To extract Siegfried detection information, you must have the Siegfried command-line application installed (and callable as "sf" on the command line), and you need to add the tika-detector-siegfried jar to your class path.

Key | Notes
sf:pronom:mime
sf:pronom:format
sf:pronom:version
sf:pronom:id
sf:pronom:basis
sf:errors