**Page Under Construction**
This page was written for Tika 2.6.0 and will be extended over time. If you are migrating from Tika 1.x to 2.x, note that there are important changes; see Migrating to Tika 2.0.0.
Where possible, the Tika project relies on standards such as Dublin Core and maps file-format-specific key names to those standards. That is not possible in all cases. Many file formats allow custom metadata, which means that the metadata keys one might encounter in the wild are an open set.
In October 2022, the Tika team counted Metadata keys in Tika extracts from 1 million files in our regression corpus. The output is available here: metadata-keys-1m-20221006.tgz.
In Tika 3.x, we'll try to require that every metadata key has a namespace. We have moved in that direction slowly, but we have not yet achieved that goal.
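As a rough illustration of the key counting and the namespace goal described above, the following sketch tallies keys across a few metadata records and lists the keys that have no namespace prefix. The `records` list is hypothetical sample data, not taken from the regression corpus, and treating everything before the first `:` as the namespace is an assumption made for this example.

```python
from collections import Counter


def namespace_of(key):
    """Return the text before the first ':' in a key, or None if there is no ':'."""
    prefix, sep, _ = key.partition(":")
    return prefix if sep else None


def count_keys(records):
    """Count how many records each metadata key appears in."""
    counts = Counter()
    for record in records:
        counts.update(record.keys())
    return counts


# Hypothetical extracts, shaped like Tika's JSON metadata output.
records = [
    {
        "Content-Type": "application/pdf",
        "dc:title": "Report",
        "pdf:docinfo:created": "2022-10-06T00:00:00Z",
        "resourceName": "report.pdf",
    },
    {
        "Content-Type": "image/tiff",
        "tiff:ImageWidth": "640",
        "resourceName": "scan.tiff",
    },
]

counts = count_keys(records)
# Keys that would still need a namespace under the Tika 3.x goal.
unnamespaced = sorted(k for k in counts if namespace_of(k) is None)

for key, n in counts.most_common():
    print(n, key)
print("no namespace:", unnamespaced)
```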
If you would like Tika to modify metadata key names or metadata values before returning output, see the section on MetadataFilters in ModifyingContentWithHandlersAndMetadataFilters. If you only need a limited set of metadata keys, you can add a MetadataWriteFilter, which prevents Tika from writing the metadata you do not want in the first place. MetadataWriteFilters are also covered in ModifyingContentWithHandlersAndMetadataFilters.
To extract as much metadata as possible, we recommend using the RecursiveParserWrapper, the /rmeta endpoint, or the -J option on tika-app. For tika-server specifically, see TikaServerEndpointsCompared.
The following keys capture the behavior of parsers or other components during the parse.
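To make the effect of restricting keys concrete, here is a minimal client-side sketch that keeps only an allow-list of keys from a metadata record that Tika has already returned. This is not Tika's MetadataWriteFilter or MetadataFilter API; the function `keep_only` and its parameters are hypothetical, and because it runs after extraction it does not save Tika any work.

```python
def keep_only(metadata, allowed_keys=(), allowed_prefixes=()):
    """Return a copy of `metadata` holding only allow-listed keys.

    A key is kept if it is in `allowed_keys` or starts with one of
    `allowed_prefixes` (for example "dc:" to keep all Dublin Core keys).
    """
    allowed = set(allowed_keys)
    prefixes = tuple(allowed_prefixes)
    return {
        key: value
        for key, value in metadata.items()
        if key in allowed or key.startswith(prefixes)
    }


# Hypothetical record, shaped like Tika's JSON metadata output.
record = {
    "Content-Type": "application/pdf",
    "dc:title": "Report",
    "dc:creator": "A. Author",
    "pdf:docinfo:custom:Company": "Example Corp",
}

print(keep_only(record, allowed_keys=["Content-Type"], allowed_prefixes=["dc:"]))
```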
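A minimal sketch of calling the /rmeta endpoint from a client, using only the Python standard library. It assumes a tika-server instance listening on its default address, `http://localhost:9998`, and uses the `/rmeta/text` variant to request plain-text content. The helper names `build_rmeta_request` and `fetch_rmeta` are hypothetical.

```python
import json
import urllib.request


def build_rmeta_request(data, base_url="http://localhost:9998", handler="text"):
    """Build (but do not send) a PUT request for tika-server's /rmeta endpoint."""
    url = f"{base_url.rstrip('/')}/rmeta/{handler}"
    return urllib.request.Request(
        url,
        data=data,
        method="PUT",
        headers={"Accept": "application/json"},
    )


def fetch_rmeta(path, **kwargs):
    """Send a file to /rmeta and return the parsed JSON response.

    The response is expected to be a list of metadata objects: the first
    for the container file, the rest for its embedded files.
    """
    with open(path, "rb") as f:
        request = build_rmeta_request(f.read(), **kwargs)
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With a server running, `fetch_rmeta("report.pdf")` (a hypothetical file name) would return one metadata object per file, each carrying keys such as those listed in the tables on this page.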
Key | Notes |
---|---|
X-TIKA:Parsed-By | Which parser parsed a given file |
X-TIKA:Parsed-By-Full-Set | All the parsers that touched a given file and its embedded files. This key is reported in the metadata object of the primary file |
X-TIKA:parse_time_millis | Milliseconds it took to parse a given file and its embedded files |
X-TIKA:EXCEPTION:container_exception | |
X-TIKA:EXCEPTION:embedded_exception | If there's a parse exception while parsing an embedded file, the stack trace is stored with this key |
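The exception keys can be used to find files that failed during a parse. The sketch below scans a list of metadata records and reports the first line of each stored stack trace. The sample records are hypothetical, and the reading of `container_exception` as the container-level counterpart of `embedded_exception` is an assumption, since the table gives no note for it.

```python
CONTAINER_EXCEPTION = "X-TIKA:EXCEPTION:container_exception"
EMBEDDED_EXCEPTION = "X-TIKA:EXCEPTION:embedded_exception"


def find_exceptions(records):
    """Return (record index, key, first line of stack trace) for each exception found."""
    problems = []
    for index, record in enumerate(records):
        for key in (CONTAINER_EXCEPTION, EMBEDDED_EXCEPTION):
            if key in record:
                lines = str(record[key]).splitlines()
                problems.append((index, key, lines[0] if lines else ""))
    return problems


# Hypothetical records: a container that parsed cleanly and one failing attachment.
records = [
    {"X-TIKA:Parsed-By": "org.apache.tika.parser.DefaultParser"},
    {EMBEDDED_EXCEPTION: "java.io.IOException: boom\n\tat Foo.bar(Foo.java:1)"},
]

for index, key, summary in find_exceptions(records):
    print(f"record {index}: {key}: {summary}")
```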
Key | Notes |
---|---|
Content-Type | The file's mime type as identified by Tika. Example: application/pdf |
X-TIKA:digest:MD5 | If you've configured digests, they are returned with a key of the form X-TIKA:digest:ALGORITHM |
resourceName | File name |
Content-Length | When available, the number of bytes in a stream |
X-TIKA:content | The text that is extracted from the file |
X-TIKA:content_handler | The content handler that was used for handling the text (e.g. Text, XHTML, etc.) |
X-TIKA:embedded_resource_path | |
X-TIKA:embedded_depth | |
tika:file_ext | File extension |
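As an illustration of how these general keys might be read together, the sketch below summarizes one metadata record: its mime type, embedding depth and path, extracted text length, and any digests found under keys of the form `X-TIKA:digest:ALGORITHM`. The sample records are hypothetical. Treating a missing `X-TIKA:embedded_depth` as 0 and a missing `X-TIKA:embedded_resource_path` as an empty string are assumptions of this example, not documented Tika behavior.

```python
DIGEST_PREFIX = "X-TIKA:digest:"


def summarize(record):
    """Collect a few general-purpose fields from one metadata record."""
    digests = {
        key[len(DIGEST_PREFIX):]: value
        for key, value in record.items()
        if key.startswith(DIGEST_PREFIX)
    }
    return {
        "type": record.get("Content-Type"),
        # Values arrive as strings in JSON output, so convert explicitly.
        "depth": int(record.get("X-TIKA:embedded_depth", 0)),
        "path": record.get("X-TIKA:embedded_resource_path", ""),
        "chars": len(record.get("X-TIKA:content", "")),
        "digests": digests,
    }


# Hypothetical records: a PDF and one image embedded in it.
records = [
    {
        "Content-Type": "application/pdf",
        "resourceName": "report.pdf",
        "X-TIKA:digest:MD5": "d41d8cd98f00b204e9800998ecf8427e",
        "X-TIKA:content": "Quarterly report",
    },
    {
        "Content-Type": "image/png",
        "X-TIKA:embedded_depth": "1",
        "X-TIKA:embedded_resource_path": "/image0.png",
        "X-TIKA:content": "",
    },
]

for record in records:
    print(summarize(record))
```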
Key | Notes |
---|---|
dc:creator | |
dcterms:created | |
dcterms:modified | |
dc:rights | |
dc:contributor | |
dc:title | |
dc:relation | |
dc:type | |
dc:identifier | |
dc:publisher | |
dc:description | |
dc:subject | |
dc:language | |
dc:format | |
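Keys such as `dc:creator` or `dc:subject` can carry more than one value. The sketch below assumes that, in Tika's JSON output, a single value appears as a string and multiple values appear as an array, and normalizes both cases to a list. The helper `values_of` and the sample record are hypothetical.

```python
def values_of(record, key):
    """Return all values stored under `key` as a list of strings (empty if absent)."""
    value = record.get(key)
    if value is None:
        return []
    if isinstance(value, list):
        return [str(v) for v in value]
    return [str(value)]


# Hypothetical record with one single-valued and one multi-valued key.
record = {
    "dc:title": "Report",
    "dc:creator": ["A. Author", "B. Writer"],
}

print(values_of(record, "dc:title"))
print(values_of(record, "dc:creator"))
print(values_of(record, "dc:subject"))
```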
Key | Notes |
---|---|
xmp:About | |
xmp:CreateDate | |
xmp:CreatorTool | |
xmp:Identifier | |
xmp:Label | |
xmp:MetadataDate | |
xmp:ModifyDate | |
xmp:Rating | |
xmpDM:album | |
xmpDM:albumArtist | |
xmpDM:artist | |
xmpDM:audioChannelType | |
xmpDM:audioCompressor | |
xmpDM:audioSampleRate | |
xmpDM:audioSampleType | |
xmpDM:compilation | |
xmpDM:composer | |
xmpDM:copyright | |
xmpDM:discNumber | |
xmpDM:duration | |
xmpDM:genre | |
xmpDM:logComment | |
xmpDM:releaseDate | |
xmpDM:trackNumber | |
xmpDM:videoCompressor | |
xmpMM:DerivedFrom:DocumentID | |
xmpMM:DerivedFrom:InstanceID | |
xmpMM:DocumentID | |
xmpMM:History:Action | |
xmpMM:History:InstanceID | |
xmpMM:History:SoftwareAgent | |
xmpMM:History:When | |
xmpTPg:NPages | |
Key | Notes |
---|---|
pdf:docinfo:created | |
pdf:docinfo:custom:Company | |
pdf:docinfo:custom:SourceModified | |
Key | Notes |
---|---|
embeddedRelationshipId | |
Key | Notes |
---|---|
tiff:ImageWidth | |
tiff:ImageLength | |
tiff:BitsPerSample | |
Key | Notes |
---|---|
Exif SubIFD:Metering Mode | |
Exif SubIFD:White Balance Mode | |
Exif SubIFD:Scene Capture Type | |
Exif SubIFD:Exposure Mode | |
Text/Html-based Files