  1. Introduction

    This page describes how to translate documents (via the Tika.translate API) using Reader Translator Generator (RTG), a neural machine translation (NMT) toolkit.

    The benefits of using this approach for machine translation through Tika are as follows:

    • It's free! Unlike several other translation services currently available via Tika, NMT via RTG is free.
      • There is no usage ceiling and no monthly payment. There is no paid service behind the scenes, so you can use this method without restriction.
    • You have full control over the whole pipeline. You may either build your own NMT models or download pretrained ones, set up the server, and manage the backend.
      • Your data and documents are not sent to any service outside your pipeline, so you keep control over the privacy of your data.

    However, keep the following in mind:

    • You may run the model on a CPU for testing, but translation will be very slow. GPUs are highly recommended.
    • NMT models are neither interpretable nor explainable. We cannot explain the translations or guarantee that they are 100% correct. This is not specific to RTG; it is generally true of all neural machine translation services.


    This is a relatively new addition; the following translation models are currently available:


    To train models for your desired translation direction, please refer to the documentation at https://isi-nlp.github.io/rtg/#_usage

    Integration: Overview

    1. The class org.apache.tika.language.translate.RTGTranslator connects Tika to the RTG REST API. By default, it interacts with http://localhost:6060.
      The base URL can be customized, if needed, with the rtg.base.url property in the translator.rtg.properties file.
    2. Make sure that http://localhost:6060, or whatever rtg.base.url you have set, is reachable.
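
The lookup described above (use rtg.base.url from translator.rtg.properties if present, otherwise the default) can be sketched in shell as follows. This is only an illustration for checking your own configuration; rtg_base_url is a hypothetical helper name, and the sketch does not show how RTGTranslator itself loads the file.

```shell
# Hypothetical helper (not part of Tika or RTG): print the RTG base URL.
# Reads rtg.base.url from the given properties file, and falls back to
# the default http://localhost:6060 when the file or the key is missing.
rtg_base_url() {
    props_file="${1:-translator.rtg.properties}"
    default_url="http://localhost:6060"
    url=""
    if [ -f "$props_file" ]; then
        # Keep the value of the last "rtg.base.url=..." line, key stripped.
        url=$(sed -n 's/^[[:space:]]*rtg\.base\.url[[:space:]]*=[[:space:]]*//p' "$props_file" | tail -n 1)
    fi
    printf '%s\n' "${url:-$default_url}"
}
```

Calling rtg_base_url with no argument looks for translator.rtg.properties in the current directory.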

    500 Languages to English Translation

    Step 1: Start RTG Translator Service

    The 500-to-English model can be obtained as a Docker image, as follows.

    The Docker image can be run on a CPU (i.e., without a GPU, for testing):
       docker run --rm -i -p 6060:6060 tgowda/rtg-model:500toEng-v1

    Using a GPU (e.g., device 0) is recommended for translating a large number of documents:
       docker run --rm -i -p 6060:6060 --gpus '"device=0"' tgowda/rtg-model:500toEng-v1

    Verify that the translator service is actually running by accessing http://localhost:6060/
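
If you script the setup, the check above can be automated by polling the URL until it responds. The following is a minimal sketch, assuming curl is installed and that the service root answers with a successful HTTP status once the model is loaded; wait_for_service is a hypothetical helper name, and the attempt count and delay are arbitrary defaults.

```shell
# Hypothetical helper: poll a URL until it answers with a successful HTTP status.
# Usage: wait_for_service URL [ATTEMPTS] [DELAY_SECONDS]
# Returns 0 as soon as the URL responds, 1 if every attempt fails.
wait_for_service() {
    url="$1"
    attempts="${2:-30}"
    delay="${3:-2}"
    i=1
    while [ "$i" -le "$attempts" ]; do
        # --fail makes curl exit non-zero on HTTP error statuses (4xx/5xx).
        if curl --silent --fail --output /dev/null --max-time 5 "$url"; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}
```

For example, wait_for_service http://localhost:6060/ && echo "RTG is up" blocks until the service responds or the attempts run out.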

    Step 2: Start Tika Server Jar

    Option 1: Obtain prebuilt jar
    Note: This option is for future versions. The current prebuilt jars do not have this feature integrated, so go to Option 2.

    wget https://www.apache.org/dyn/closer.cgi/tika/tika-server-2.0.0.jar 


    Option 2: Build Tika Server from source

    $ git clone https://github.com/apache/tika.git
    $ cd tika
    # If the pull request is not merged yet, pull from this repo
    $ git checkout -b TIKA-3329
    $ git pull https://github.com/thammegowda/tika.git TIKA-3329

    # Compile and package Tika
    $ mvn clean package -DskipTests 

    # Start Tika server
    $ java -jar tika-server/target/tika-server-2.0.0-SNAPSHOT.jar

    Step 3: Translate Documents via Tika + RTG


    $ printf "Hola señor\nನಮಸ್ಕಾರ\nBonjour monsieur\nПривет\n" > tmp.txt
    $ curl http://localhost:9998/translate/all/org.apache.tika.language.translate.RTGTranslator/x/eng -X PUT -T tmp.txt

    Hi, sir.
    Namaskar
    Good morning, sir.
    Hi.
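
The request URL in the example above follows the pattern /translate/all/<translator class>/<source>/<target>. When translating many files, it can help to build that URL in one place. The sketch below is an illustration only: tika_translate_url is a hypothetical helper name, and the default Tika server address is taken from the example above.

```shell
# Hypothetical helper: build the Tika translate endpoint URL for RTGTranslator.
# Usage: tika_translate_url SOURCE TARGET [TIKA_BASE_URL]
tika_translate_url() {
    src="$1"
    dest="$2"
    base="${3:-http://localhost:9998}"
    translator="org.apache.tika.language.translate.RTGTranslator"
    # ${base%/} drops one trailing slash so the path is not doubled.
    printf '%s/translate/all/%s/%s/%s\n' "${base%/}" "$translator" "$src" "$dest"
}
```

It can then be used with the same curl options as above: curl -X PUT -T tmp.txt "$(tika_translate_url x eng)"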


    Optional: Change the base URL of RTG translator service

    You may deploy the RTG service elsewhere (e.g., on a machine with a GPU) and point Tika to its URL.


    Step 1: Create a file named translator.rtg.properties that sets the rtg.base.url property

        echo "rtg.base.url=http://<myhost>:<port>/rtg/v1" > translator.rtg.properties 

    Step 2: Add the directory containing translator.rtg.properties to the classpath; in this case it is the current directory (. i.e., $PWD)

           java -cp '.:tika-server/target/tika-server-2.0.0-SNAPSHOT.jar' org.apache.tika.server.TikaServerCli

    Step 3: Interact with Tika Server as usual

     $ curl http://localhost:9998/translate/all/org.apache.tika.language.translate.RTGTranslator/x/eng -X PUT -T tmp.txt




    Acknowledgements

    If you wish to acknowledge or reference, RTG or its 500-eng model in




