Running untrusted parsers on untrusted data is inherently risky (see, for example, Kathleen Fisher's LangSec2021 talk). The Apache Tika team does what it can.

In rare cases, Tika can go into an infinite loop or allocate enough memory to trigger an OutOfMemoryError (OOM). If you process enough documents from the wild, you will run into these problems, and you must defend against them.

If you're processing untrusted files at scale, we strongly encourage you not to run Tika in the same JVM as, say, an indexer, a search system, or any other critical code.
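As a minimal sketch of that isolation idea (not one of the Tika-provided mechanisms), the calling code can run the parser in a separate process with a hard timeout and a capped heap, so that a hang or an OOM only kills the child. The helper names `parse_in_child` and `tika_app_command`, the jar path, the 512m heap cap, and the 60-second timeout are all hypothetical choices for illustration; the `--text` option asks tika-app for plain-text output.

```python
import subprocess
import sys


def parse_in_child(cmd, timeout_s=60):
    """Run a parser command in its own process and return (status, output).

    status is "ok", "timeout" (the child was killed after timeout_s seconds),
    or "crash" (non-zero exit code, e.g. the child JVM ran out of memory).
    """
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing keeps spinning.
        return ("timeout", "")
    if proc.returncode != 0:
        return ("crash", proc.stderr)
    return ("ok", proc.stdout)


def tika_app_command(path, jar="tika-app.jar", max_heap="512m"):
    """Build a tika-app command line; the jar location and heap cap are assumptions."""
    return ["java", f"-Xmx{max_heap}", "-jar", jar, "--text", path]


if __name__ == "__main__" and len(sys.argv) > 1:
    status, output = parse_in_child(tika_app_command(sys.argv[1]))
    print(status)
    print(output)
```

Starting a new JVM per file is slow, which is one reason to prefer the options described on this page; the sketch only shows why keeping the parser out of the caller's process contains the damage.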

The Tika project offers some defenses against these denial of service (DoS) vulnerabilities. All of these options spawn a forked process to do the actual parsing:

  1. The ForkParser – this forks a child process and will protect against OOM and infinite loops.
  2. tika-batch – if you are processing files at desktop/vm scale (not cloud scale), you can run tika-batch via tika-app:
    1.  java -jar tika-app.jar -i <input_dir> -o <output_dir>
  3. tika-server – in Tika >= 2.x, the parsing is done in a forked process by default. Clients need to be able to handle tika-server going offline when the forked parsing process has to restart.

  4. tika-pipes – in Tika 2.x, use tika-pipes programmatically, in tika-app with the -a option, or in tika-server with the /async or /pipes endpoints.
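Because tika-server can be briefly unreachable while its forked parsing process restarts, a client usually needs some retry logic. The following is a minimal sketch, assuming a tika-server listening on its default port 9998 and using the `/tika` endpoint with `PUT`; the helper names `with_retries` and `extract_text`, the retry count, and the backoff values are hypothetical choices, not part of Tika.

```python
import time
import urllib.error
import urllib.request


def with_retries(call, attempts=5, delay_s=1.0, sleep=time.sleep):
    """Call `call()` and retry with exponential backoff on connection failures."""
    last_error = None
    for attempt in range(attempts):
        try:
            return call()
        except urllib.error.HTTPError:
            # The server answered with an HTTP error status, so it is up:
            # do not retry here, let the caller decide what to do.
            raise
        except (urllib.error.URLError, ConnectionError) as err:
            # Connection refused or dropped: the server may be restarting.
            last_error = err
            if attempt < attempts - 1:
                sleep(delay_s * (2 ** attempt))
    raise last_error


def extract_text(path, url="http://localhost:9998/tika", timeout_s=120):
    """PUT one file to tika-server and return the extracted plain text."""
    with open(path, "rb") as fh:
        data = fh.read()
    req = urllib.request.Request(
        url, data=data, method="PUT", headers={"Accept": "text/plain"}
    )
    with urllib.request.urlopen(req, timeout=timeout_s) as resp:
        return resp.read().decode("utf-8")
```

A caller would then write something like `with_retries(lambda: extract_text("report.pdf"))`. A file that reliably crashes the forked process will fail on every attempt, so it is worth capping the retries and recording such files rather than retrying indefinitely.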

The Tika project has taken the following steps to identify and fix catastrophic problems:

...