Introduction

This page describes how to translate documents (via the Tika.translate API) using Reader Translator Generator (RTG), a neural machine translation (NMT) toolkit.

Using this approach for machine translation through Tika has the following benefits:

  • It's free! Unlike several other translation services currently available via Tika, NMT via RTG costs nothing.
    • There is no usage ceiling and no monthly payment. No paid service sits behind the scenes, so you can use this method without restriction.
  • You have full control over the whole pipeline. You may either build NMT models or download pretrained ones, set up the server, and manage the backend.
    • Your data and documents are not sent to any service outside your pipeline, so you can keep your data private.

However, keep the following in mind:

  • You may run the model on a CPU for testing, but translation will be very slow. GPUs are highly recommended.
  • NMT models are neither interpretable nor explainable. We cannot explain the translations or guarantee that they are 100% correct. This is not specific to RTG/NMT; it is generally true of all neural machine translation services.


This is a relatively new addition; the following translation models are currently available:


To train models for your desired translation direction, please refer to the documentation at https://isi-nlp.github.io/rtg/#_usage

Integration: Overview

  1. The class org.apache.tika.language.translate.RTGTranslator glues the Tika system to the RTG REST API. By default, it integrates with http://localhost:6060.
    The base URL can be customized with the rtg.base.url property in the translator.rtg.properties file.
  2. Make sure http://localhost:6060, or whatever rtg.base.url you have set, is valid.
  3. To activate   o.a.t.l.t.RTGTranslator,

500 Languages to English Translation

Step 1: Start RTG Translator Service

The 500-to-English model can be obtained as a Docker image, as follows.

The Docker image can be run on a CPU (i.e. without a GPU, for testing):
   docker run --rm -i -p 6060:6060 tgowda/rtg-model:500toEng-v1

Using a GPU (e.g. device 0) is recommended for translating a large number of documents:
   docker run --rm -i -p 6060:6060 --gpus '"device=0"' tgowda/rtg-model:500toEng-v1

Verify that the translator service is actually running by accessing http://localhost:6060/
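
When these steps are scripted, the check above can be automated with a small retry loop, since the service may need some time before it answers requests. The following is a minimal sketch and not part of Tika or RTG: the helper name wait_until and its arguments are hypothetical, and the commented usage line assumes curl is installed.

```shell
#!/bin/sh
# wait_until ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND repeatedly until it succeeds or ATTEMPTS runs out.
# Hypothetical helper for illustration; not part of Tika or RTG.
wait_until() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    # "$@" is the probe command, e.g. a curl call against the service.
    if "$@"; then
      return 0
    fi
    i=$((i + 1))
    # Sleep only if another attempt is still to come.
    if [ "$i" -le "$attempts" ]; then
      sleep "$delay"
    fi
  done
  return 1
}

# Usage (commented out so that sourcing this file has no side effects):
# wait_until 30 2 curl -fsS -o /dev/null http://localhost:6060/ && echo "RTG is up"
```

With the usage line enabled, the script probes http://localhost:6060/ up to 30 times, two seconds apart, and reports success as soon as one request succeeds.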

Step 2: Start Tika Server Jar

Option 1: Obtain prebuilt jar
Note: this is a relatively new feature that is still under development; the prebuilt jars do not yet include it. This option is for future versions.

wget https://www.apache.org/dyn/closer.cgi/tika/tika-server-2.0.0.jar 


Option 2: Build Tika Server from source

$ git clone https://github.com/apache/tika.git
$ cd tika
# if the pull request is not merged yet, pull from this repo
$ git checkout -b TIKA-3329
$ git pull https://github.com/thammegowda/tika.git TIKA-3329

$ mvn clean install -DskipTests 
$ java -jar tika-server/target/tika-server-2.0.0-SNAPSHOT.jar

Step 3: Translate Documents via Tika + RTG


$ printf "Hola señor\nನಮಸ್ಕಾರ\nBonjour monsieur\nПривет\n" > tmp.txt
$ curl http://localhost:9998/translate/all/org.apache.tika.language.translate.RTGTranslator/x/eng -X PUT -T tmp.txt

Hi, sir.
Namaskar
Good morning, sir.
Hi.
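
The request path in the curl command above appears to follow the pattern /translate/all/<translator class>/<source language>/<target language>. The sketch below only assembles that URL from its parts, which can help when scripting requests against a different host or translator. The function name tika_translate_url is hypothetical, and the default http://localhost:9998 is taken from the command above. The source slot is filled with x, as in the original command; treating it as a placeholder is an assumption.

```shell
#!/bin/sh
# tika_translate_url TRANSLATOR_CLASS SRC_LANG DEST_LANG [TIKA_BASE_URL]
# Prints the Tika Server translate URL for the given parts.
# Hypothetical helper for illustration; not part of Tika.
tika_translate_url() {
  translator=$1
  src=$2
  dest=$3
  base=${4:-http://localhost:9998}
  # Strip one trailing slash so the path does not contain "//".
  base=${base%/}
  printf '%s/translate/all/%s/%s/%s\n' "$base" "$translator" "$src" "$dest"
}

# Reproduces the URL used in the curl command above.
tika_translate_url org.apache.tika.language.translate.RTGTranslator x eng
```

The result can be passed straight to curl, for example: curl "$(tika_translate_url org.apache.tika.language.translate.RTGTranslator x eng)" -X PUT -T tmp.txt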


Optional: Change the base URL of RTG translator service

You may deploy the RTG service elsewhere (e.g. on a machine with a GPU) and point Tika to its URL.


Step 1: Create a file named translator.rtg.properties containing the rtg.base.url property

    echo "rtg.base.url=http://<myhost>:<port>/rtg/v1" > translator.rtg.properties 

Step 2: Add the directory containing translator.rtg.properties to the classpath; in this case that is . (i.e. $PWD)

       java -cp '.:tika-server/target/tika-server-2.0.0-SNAPSHOT.jar' org.apache.tika.server.TikaServerCli
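
Before starting the server, it can be useful to check which base URL the properties file would supply. The sketch below is a simplified illustration, not Tika's actual lookup code: the function name rtg_base_url is hypothetical, it only understands plain key=value lines (Java properties files allow more syntax), and the fallback http://localhost:6060 is the default stated in the Integration overview.

```shell
#!/bin/sh
# rtg_base_url [PROPERTIES_FILE]
# Prints the value of rtg.base.url from the file, or the default
# when the file or the key is missing.
# Hypothetical helper for illustration; not Tika's own lookup logic.
rtg_base_url() {
  props=${1:-translator.rtg.properties}
  default_url=http://localhost:6060
  value=
  if [ -f "$props" ]; then
    # Keep the last "rtg.base.url=..." line. Comment lines starting
    # with '#' never match because the key must start the line.
    value=$(sed -n 's/^[[:space:]]*rtg\.base\.url[[:space:]]*=[[:space:]]*//p' "$props" | tail -n 1)
  fi
  printf '%s\n' "${value:-$default_url}"
}

# Reads ./translator.rtg.properties if present, else prints the default.
rtg_base_url
```

If the printed URL is not the one you expect, check that the command is run from the directory that holds translator.rtg.properties, since that directory is what Step 2 places on the classpath.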

Step 3: Interact with Tika Server as usual

 $ curl http://localhost:9998/translate/all/org.apache.tika.language.translate.RTGTranslator/x/eng -X PUT -T tmp.txt




Acknowledgements

If you wish to acknowledge or reference RTG or its 500-eng model in