Introduction
This page describes how to translate documents through Tika's translate API using Reader Translator Generator (RTG), a neural machine translation (NMT) toolkit.
The benefits of using this approach for machine translation through Tika are as follows:
- It's free! Unlike several other translation services currently available via Tika, NMT via RTG costs nothing to use.
- There are no usage ceilings and no monthly fees. Because no paid service runs behind the scenes, you can use this method without restriction.
- You have full control over the whole pipeline: you may build your own NMT models or download pretrained ones, set up the server, and manage the backend.
- Your data and documents are never sent to services outside your pipeline, so you can guarantee the privacy of your data.
However, keep the following in mind:
- You can run the model on a CPU for testing, but translation will be very slow; a GPU is highly recommended.
- NMT models are neither interpretable nor explainable, so we cannot guarantee that translations are 100% correct. This is not specific to RTG; it is true of all neural machine translation systems.
This is a relatively new addition; a 500-languages-to-English translation model is currently available (see below).
To train models for your desired translation direction, please refer to the documentation at https://isi-nlp.github.io/rtg/#_usage
Integration: Overview
The class org.apache.tika.language.translate.RTGTranslator glues the Tika system to the RTG REST API. By default, it interacts with http://localhost:6060. If needed, this base URL can be customized by adding a translator.rtg.properties file containing the rtg.base.url property to the classpath (see the optional section below).

500 Languages to English Translation
Step 1: Start RTG Translator Service
The 500-to-English model can be obtained from a Docker image as follows.
The Docker image can be run on a CPU (i.e., without a GPU; sufficient for testing):
docker run --rm -i -p 6060:6060 tgowda/rtg-model:500toEng-v1
Using a GPU (e.g., device 0) is recommended when translating many documents:
docker run --rm -i -p 6060:6060 --gpus '"device=0"' tgowda/rtg-model:500toEng-v1
Verify that the translator service is actually running by accessing http://localhost:6060/
Step 2: Start Tika Server Jar
Option 1: Obtain prebuilt jar
Note: This option is intended for future versions; the current prebuilt jars do not have this feature integrated. Use Option 2 instead.
wget https://www.apache.org/dyn/closer.cgi/tika/tika-server-2.0.0.jar
Option 2: Build Tika Server from source
$ git clone https://github.com/apache/tika.git
$ cd tika
# If the pull request is not merged yet, pull from this repo:
$ git checkout -b TIKA-3329
$ git pull https://github.com/thammegowda/tika.git TIKA-3329
# Compile and package Tika
$ mvn clean package -DskipTests
# Start Tika server
$ java -jar tika-server/target/tika-server-2.0.0-SNAPSHOT.jar
Step 3: Translate Documents via Tika + RTG
$ printf "Hola señor\nನಮಸ್ಕಾರ\nBonjour monsieur\nПривет\n" > tmp.txt
$ curl http://localhost:9998/translate/all/org.apache.tika.language.translate.RTGTranslator/x/eng -X PUT -T tmp.txt
Hi, sir.
Namaskar
Good morning, sir.
Hi.
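The request above uses Tika Server's generic translate endpoint, which follows the pattern /translate/all/&lt;translator-class&gt;/&lt;source&gt;/&lt;target&gt;. A minimal sketch of how that URL is composed (assumptions: Tika Server is on its default port 9998; the source/target codes x and eng are taken from the curl example above):

```shell
# Compose the Tika translate endpoint used in the example above.
# Assumption: Tika Server is running locally on its default port 9998.
TIKA_HOST="http://localhost:9998"
TRANSLATOR="org.apache.tika.language.translate.RTGTranslator"
SRC="x"      # source-language code from the example above
DEST="eng"   # target language: English
URL="${TIKA_HOST}/translate/all/${TRANSLATOR}/${SRC}/${DEST}"
echo "$URL"
# The document would then be uploaded with:
#   curl "$URL" -X PUT -T tmp.txt
```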
Optional: Change the base URL of RTG translator service
You may deploy the RTG service elsewhere (e.g., on a machine with a GPU) and point Tika to its URL.
Step 1: Create a file named translator.rtg.properties with the rtg.base.url property:
echo "rtg.base.url=http://<myhost>:<port>/rtg/v1" > translator.rtg.properties
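As a sanity check, you can confirm the property file was written correctly before moving on. A minimal sketch (example.com:6060 is a placeholder; substitute your actual RTG host and port):

```shell
# Write the base-URL override; the host below is a placeholder.
echo "rtg.base.url=http://example.com:6060/rtg/v1" > translator.rtg.properties
# Verify the property is present in the file.
grep -q '^rtg.base.url=' translator.rtg.properties && echo "rtg.base.url is set"
```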
Step 2: Add the directory containing translator.rtg.properties to the classpath; in this case, the current directory ($PWD):
java -cp '.:tika-server/target/tika-server-2.0.0-SNAPSHOT.jar' org.apache.tika.server.TikaServerCli
Step 3: Interact with Tika Server as usual
$ curl http://localhost:9998/translate/all/org.apache.tika.language.translate.RTGTranslator/x/eng -X PUT -T tmp.txt
Acknowledgements
If you wish to acknowledge or reference the RTG toolkit or the 500-to-English model, please cite this article: https://arxiv.org/abs/2104.00290
@misc{gowda2021manytoenglish,
title={Many-to-English Machine Translation Tools, Data, and Pretrained Models},
author={Thamme Gowda and Zhao Zhang and Chris A Mattmann and Jonathan May},
year={2021},
eprint={2104.00290},
archivePrefix={arXiv},
primaryClass={cs.CL}
}