    Introduction

    This page provides details on how to translate documents (via the Tika.translate API) using Reader Translator Generator (RTG), a neural machine translation (NMT) toolkit.

    The benefits of using this approach for machine translation through Tika are as follows:

    • It's free! Unlike several other translation services currently available via Tika, NMT via RTG is free.
      • You are not restricted by a usage ceiling, and you don't have to budget for monthly payments. There is no paid service behind the scenes, so you can use this method without restriction.
    • You have full control over the whole pipeline. You may either build NMT models or download pretrained models, set up the server, and manage the backend.
      • Your data and documents are not sent to any service outside your pipeline, so you can keep your data private.

    However, keep the following in mind:

    • Although you may run the model on a CPU for testing, translation will be very slow on CPUs. GPUs are highly recommended.
    • NMT models are not interpretable or explainable. We cannot explain the translations or guarantee that they are 100% correct. This is not specific to RTG/NMT; it is generally true of all neural machine translation services.


    This is a relatively new addition; the following translation models are currently available:


    To train models for your desired translation direction, please refer to the documentation at https://isi-nlp.github.io/rtg/#_usage

    Integration: Overview

    1. The class org.apache.tika.language.translate.RTGTranslator glues the Tika system to the RTG REST API. By default, it interacts with http://localhost:6060.
      1. That URL can be customized using the rtg.base.url property in the translator.rtg.properties file.
      2. Make sure http://localhost:6060, or whatever rtg.base.url you have set, is valid.
    2. To activate o.a.t.l.t.RTGTranslator,

    500 Languages to English Translation

    Step 1: Start RTG Translator Service

    The 500-English model can be obtained as a Docker image, as follows.

    The Docker image can be run on a CPU (i.e. without a GPU, for testing):
       docker run --rm -i -p 6060:6060 tgowda/rtg-model:500toEng-v1

    Using a GPU (e.g. device 0) is recommended when translating many documents:
       docker run --rm -i -p 6060:6060 --gpus '"device=0"' tgowda/rtg-model:500toEng-v1

    Verify that the translator service is actually running by accessing http://localhost:6060/
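The check above can also be scripted, which is handy because the container may take a while to load the model before it answers. The following is a minimal sketch, not part of RTG or Tika: the function name wait_for_service and its attempt and delay defaults are assumptions chosen for illustration, and it only relies on curl being installed.

```shell
# Poll a URL until it answers or the attempts run out.
# Hypothetical helper; not shipped with RTG or Tika.
# Usage: wait_for_service URL [ATTEMPTS] [DELAY_SECONDS]
wait_for_service() {
    local url="$1"
    local attempts="${2:-30}"
    local delay="${3:-2}"
    local i=1
    while [ "$i" -le "$attempts" ]; do
        # -s: silent, -f: non-zero exit on HTTP errors, -o /dev/null: discard the body
        if curl -sf -o /dev/null "$url"; then
            echo "RTG service is up at $url (attempt $i)"
            return 0
        fi
        sleep "$delay"
        i=$((i + 1))
    done
    echo "No response from $url after $attempts attempts" >&2
    return 1
}
```

For example, after starting the container, wait_for_service http://localhost:6060/ 30 2 would retry for roughly a minute before giving up.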

    Step 2: Start Tika Server Jar

    Option 1: Obtain prebuilt jar
    Note: This is a relatively new feature and is currently under development. This option is for future versions; the current prebuilt jars do not have this feature integrated. Go to Option 2.

    wget https://www.apache.org/dyn/closer.cgi/tika/tika-server-2.0.0.jar 


    Option 2: Build Tika Server from source

    $ git clone https://github.com/apache/tika.git
    $ cd tika
    # If the pull request is not merged yet, pull from this repo
    $ git checkout -b TIKA-3329
    $ git pull https://github.com/thammegowda/tika.git TIKA-3329

    # Compile and package Tika
    $ mvn clean package -DskipTests

    # Start Tika server
    $ java -jar tika-server/target/tika-server-2.0.0-SNAPSHOT.jar

    Step 3: Translate Documents via Tika + RTG


    $ printf "Hola señor\nನಮಸ್ಕಾರ\nBonjour monsieur\nПривет\n" > tmp.txt
    $ curl http://localhost:9998/translate/all/org.apache.tika.language.translate.RTGTranslator/x/eng -X PUT -T tmp.txt

    Hi, sir.
    Namaskar
    Good morning, sir.
    Hi.
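When translating more than one file, the request above can be wrapped in a small helper. This is a sketch under stated assumptions: the function names tika_translate_url and tika_translate are hypothetical, the defaults mirror the command shown above (source-language slot x, target eng, Tika Server on http://localhost:9998), and nothing here is part of Tika itself.

```shell
# Build the Tika Server translate endpoint for a translator and language pair.
# Hypothetical helper; the path layout is copied from the curl command above.
# Usage: tika_translate_url BASE_URL TRANSLATOR SRC_LANG DEST_LANG
tika_translate_url() {
    # ${1%/} drops one trailing slash from the base URL, if present
    printf '%s/translate/all/%s/%s/%s\n' "${1%/}" "$2" "$3" "$4"
}

# Send one file to Tika Server and print the translation to stdout.
# Usage: tika_translate FILE [SRC_LANG] [DEST_LANG] [BASE_URL]
tika_translate() {
    local file="$1"
    local src="${2:-x}"
    local dest="${3:-eng}"
    local base="${4:-http://localhost:9998}"
    local translator="org.apache.tika.language.translate.RTGTranslator"
    # -T uploads the file as the request body; -X PUT matches the request above
    curl -sf -X PUT -T "$file" "$(tika_translate_url "$base" "$translator" "$src" "$dest")"
}
```

With both services running, tika_translate tmp.txt sends the same request as the curl command above, and a loop such as for f in docs/*.txt; do tika_translate "$f"; done would cover a directory (docs/ is a placeholder path).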


    Optional: Change the base URL of RTG translator service

    You may deploy the RTG service elsewhere (on a machine with a GPU) and point Tika to its URL.


    Step 1: Create a file named translator.rtg.properties with rtg.base.url property

        echo "rtg.base.url=http://<myhost>:<port>/rtg/v1" > translator.rtg.properties 

    Step 2: Add the directory containing translator.rtg.properties to the classpath; in this case, the current directory (. i.e. $PWD)

           java -cp '.:tika-server/target/tika-server-2.0.0-SNAPSHOT.jar' org.apache.tika.server.TikaServerCli

    Step 3: Interact with Tika Server as usual

     $ curl http://localhost:9998/translate/all/org.apache.tika.language.translate.RTGTranslator/x/eng -X PUT -T tmp.txt
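Steps 1 and 2 above can be sketched as two small shell functions, which makes it easier to keep the properties file in a dedicated directory instead of the current one. The function names write_rtg_properties and tika_classpath are hypothetical, as is the conf directory used in the usage note; only the file name, the rtg.base.url key, and the classpath layout come from the steps above.

```shell
# Write translator.rtg.properties into a directory (default: current directory).
# Hypothetical helper; only the file name and property key come from Tika.
# Usage: write_rtg_properties BASE_URL [DIR]
write_rtg_properties() {
    local base_url="$1"
    local dir="${2:-.}"
    mkdir -p "$dir"
    printf 'rtg.base.url=%s\n' "$base_url" > "$dir/translator.rtg.properties"
}

# Print a classpath that lists the properties directory ahead of the Tika server jar.
# Usage: tika_classpath DIR JAR
tika_classpath() {
    printf '%s:%s\n' "$1" "$2"
}
```

A possible invocation, with <myhost> and <port> filled in for your deployment: write_rtg_properties "http://<myhost>:<port>/rtg/v1" conf, followed by java -cp "$(tika_classpath conf tika-server/target/tika-server-2.0.0-SNAPSHOT.jar)" org.apache.tika.server.TikaServerCli.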




    Acknowledgements

    If you wish to acknowledge or reference RTG or its 500-eng model in