TransLaTeX#

pipeline status coverage report latest release pre-commit

Description#

TransLaTeX is a program which aims to translate LaTeX source files (.tex) from one human language to another using automatic translators.

The purpose is to extract the text from LaTeX source files using a tokenization technique without including any commands or tags so that the result can be passed to an automatic translator like DeepL or Google Translate without messing up the LaTeX file. The objective is to have the same LaTeX file with only the text in another specified language which still compiles perfectly and visually intact. A single, simple tool to translate LaTeX.

Apart from the main use case, the marking, tokenization and text extraction can be useful in other contexts such as manual translation by a professional human translator or parsing.

The following is an overview on the inner processing stages of TransLaTeX.

How TransLaTeX processes a given file

TransLaTeX data flow#

Features#

  • Automatic translation for LaTeX

  • Use a popular translation API of your choice

  • Translate to and from a number of languages according to the translator you picked

  • Replace all LaTeX constructs with tokens to be able to restore them later and still have a correct LaTeX file after

  • Math environments and equations are kept intact while translating any text inside

  • User customizable marker and token formats

  • Provided syntax in your source LaTeX file to manually handle certain parts at your will

  • Straight forward design, easy to tinker with the program’s source code to customize its behavior

  • Easy to install and run with simple instructions and CLI

  • Debug option to have a peek at inner workings after the operation

  • The core program apart from the API call is idempotent and preserves the original text from a LaTeX standpoint after any manipulation

  • Correct resulting LaTeX, which compiles and has working references

Unsupported and unmanaged so far#

  • Supports only a subset of LaTeX, doesn’t recognize all constructs.

  • LaTeX escape inside literal blocks isn’t handled and doesn’t get recursed into. They get tokenized as a whole. For example: \begin{lstlisting}[escapeinside={\%*}{*)}] which lets you escape actual LaTeX tags and commands inside a literal lstlisting block.

  • Document header is kept untouched. Nothing in the header (before \begin{document}) gets tokenized or translated. Which means document \date, \newcommand and most importantly babel language specifier doesn’t get translated.

  • Any commands and metadata that can be put before \begin{document} should be put before and thus in the header. Otherwise, they will get tokenized and translated. More concisely, given file must be correct and conform LaTeX.

  • Text inside math environments are kept intact for translation only if they are located one level deep. For example (albeit not the best example), the \textsf command’s contents won’t be translated since it is more than one level deep in a math environment in the following code block.

\begin{document}
    Soit la matrice
    \begin{equation}
        A=\left[\begin{array}{ccc}
                    3  & 3  & 4   \\
                    6  & -2 & -12 \\
                    -2 & 3  & 9
                    \textsf{hello world}
        \end{array}\right].\label{eq:A}
    \end{equation}
\end{document}
  • Text inside math environments are recognized thanks to a list of valid text commands inside the data module named TEXT_COMMANDS. If there are others that we’ve missed, add them to this list so that they also get recognized.

  • This program is built for LaTeX only. Use LaTeX commands almost exclusively and not pure TeX.

  • Here is a list of known unsupported commands and environments:

    • picture

    • tikzpicture

  • For now, only single file documents are supported. No \input.

  • Macros are unsupported.

  • The final full stop at the end of block math environments gets put into tokens instead of being left out. These dots should be kept alongside the text to be translated for it to form a logical sentence giving more consistent results with an automatic translator.

Python versions#

CPython is the Python implementation used and below are the most used versions during development.

Python 3.10.11

Installation#

Install python package#

Using pip#

For use as an end user, the final product

pip install git+https://gitlab.math.unistra.fr/cassandre/translatex.git

Using pip in a virtual environment#

For use as a developer, a dev environment

From project root directory:

git clone git@gitlab.math.unistra.fr:cassandre/translatex.git
cd translatex
python3 -m virtualenv .venv  # create a virtual environment
source .venv/bin/activate  # activate the virtual environment
pip install -e ".[test,doc,dev]"  # install the package in editable mode with optional dependnecies
pre-commit install -t pre-commit -t pre-push  # install git hooks
pre-commit run -a   # run git hooks once for the first time

Note: in editable mode (-e option), the package is installed in a way that it is still possible to edit the source code and have the changes take effect immediately.

Also see the below sections on the unit tests and doc generation to get your dev enviornment fully ready.

Run the unit tests#

Install the development dependencies#

pip install -e ".[test]"

Run the tests#

Run the tests from the projet root directory using the -s:

pytest -sv

Also add the --runapi option to run all the tests including the optional ones that require a Google Translate API key.

See .gitlab-ci.yml for more details.

Build the documentation#

Install the documentation dependencies#

pip install -e ".[doc]"

You also need graphviz for diagram generation. Install it according to its official instructions and your system.

Build and serve the documentation locally#

make -C docs livehtml

Go to http://localhost:8000 and see the changes in docs/source/ and src/ directories take effect immediately.

Usage#

Basic use#

Note

See the CLI Synopsis for an overview of the possible options for TransLaTeX on the command line.

TransLaTeX reads from stdin and writes to stdout by default, but you can also pass in positional arguments specifying the paths to the input and output files. It also writes warnings about missing or altered indicators (generally due to the automatic translation) and extra information when verbose or debug options are specified as logs to stderr. Don’t forget to redirect these via 2> /dev/null or equivalent if you only want the LaTeX output. This behaviour is useful if you want to integrate TransLaTeX into scripts, batch execute it, automate its execution or simply use a pipe (|) syntax.

Note

You can find various small LaTeX files under the translatex/examples/ directory in the source code that serve as examples that you can use to experiment with the program.

For the most basic invocation of TransLaTeX, you need the source and destination language short names and an internet connection. An example is as follows:

translatex -sl en -dl fr

This reads from stdin and outputs the resulting LaTeX file in French to stdout as there were no file names specified for the output. The input LaTeX file is expected to be in English as per the passed option.

  • -sl or --src-lang specifies the source language which is the language of your input file

  • -dl or --dest-lang specifies the destination language which is the language you want for your translated result.

Note

The available languages depend on the chosen translation service and its supported languages.

To learn more about which translation service integrations are available and how you can add your own, see translator.

If you’re not scripting but instead using TransLaTeX yourself, you most likely have already existing LaTeX files to translate.

For a more common use, you need a correct LaTeX file. An example is as follows:

translatex -sl en -dl fr input.tex output.tex
  • input.tex is an existing LaTeX file, written in your previously specified source language, to be processed by TransLaTeX. This can be omitted by passing - instead, if you want to still read from stdin but output to a file.

  • output.tex is the file to write the translated output to. If it doesn’t exist, it is created. If it already exists, it is overwritten.

Lastly for this example, since we didn’t explicitly specify a translation service to use, the default service of Google Translate (no key) is used. This is a free, unlimited use service that produces lesser quality results. It is intended for testing and educational purposes, and it violates Google’s TOS. This is equivalent to explicitly specifying --service "Google Translate (no key)" as an option (read on for more info).

Common options#

Let’s see another example where we use Google’s official translation API and output to a file instead.

translatex -sl en -dl fr --service "Google Translate" input.tex output.tex
  • --service option lets you choose one of the available translation services to use.

According to the translation service that you choose to use, you may want to change the token format generated by TransLaTeX to one that works better. The default is [{}-{}] since this seems to work best with Google’s translation.

For example, you can do the following to get better results if the default format tokens get corrupted after the translation. Suppose ={}.{}= format works better with IRMA - M2M100:

translatex -sl en -dl fr --service "IRMA - M2M100" -tf "={}.{}=" input.tex output.tex

This makes it so that TransLaTeX generates tokens in your custom specified format before sending out the tokenized text for translation. The curly braces ({}) indicate where numbers will be put during the numbering of the tokens. For example a token with the previous format could be like =12.6= or [12-6] with the default format. The tokens use two numbers so their format string has to contain at least two distinct pairs of curly braces.

Note

See the Extra options section to find out details on how to visualize the tokens and the generation of intermediary files to have a peek at the inner workings of TransLaTeX.

If you want to show logs and have more information on the execution of TransLaTeX, you can use the -v or -vv flags for verbose output. These output logs of INFO level or higher to stderr. The former shows info on only TransLaTeX and the latter on all, including the imported modules, making for a more detailed output.

Note

For even more details on the logging behaviour of TransLaTeX, see main().

Manual substitution syntax#

It is possible to exclude certain parts of your LaTeX file from the automatic translation with some special TransLaTeX preprocessor syntax (see process()). This is useful if TransLaTeX doesn’t recognize certain structures in your LaTeX file or the automatic translation produces a poor quality translation that you want to provide by hand. The manual substitution block’s syntax is as follows inside a LaTeX file:

%@{            -> Beginning indicator
\textbf{Welcome to France!}
%@--           -> Seperator indicator
\textit{Bienvenue en France !}
% $x < 3$
%@}            -> Ending indicator

The top section (before %@--) of the block is considered the original text and the bottom part is the replacement text. Default behaviour is to remove these blocks before any further processing thus preventing them from being sent to parsing and translation. After the translation, the bottom part of the block is put where the block once was thus producing:

\textit{Bienvenue en France !}
$x < 3$

The commented out lines in the replacement section are uncommented before any replacement. Notice the second line is no longer a comment.

In short,

  • Use %@{ and %@} to begin and end a block.

  • Use %@-- to separate the original lines and the lines that will replace those.

  • Any commented out lines in the bottom part are uncommented before replacing.

You can write anything on the same lines as these indicators without causing issues just like in the given example. This allows you to annotate your manual replacement blocks. One use could be to continue with the number of dashes to improve visibility:

%@{ This is my very special block
\textbf{Welcome to France!}
%@-----------------------------------
\textit{Bienvenue en France !}
% $x < 3$
%@} For anyone wondering, this block is for TransLaTeX

This syntax is compatible with LaTeX as it uses the line comment character (%) and is invisible to it. Files that contain this syntax still compile without issues. In case, you don’t want the manual substitution to be included during compilation, resulting in a pdf file with two versions of some lines, you can comment out all replacement lines in the bottom part:

%@{
\textbf{Welcome to France!}
%@--
% \textit{Bienvenue en France !}
% $x < 3$
%@}

This still produces the aforementioned result but if you were to compile this LaTeX file you would only get your original text in its result.

Lastly, you can write the same things twice in both parts of the block causing TransLaTeX to not translate, and replace with the same thing, basically leaving parts of your file untouched by any operation, same as they were in the beginning while other parts still being parsed and translated:

%@{
\textbf{Here is some gibberish sşaıdfajş}
%@--
% \textbf{Here is some gibberish sşaıdfajş}
%@}

After preprocessing, this results in:

\textbf{Here is some gibberish sşaıdfajş}

thus allowing you to keep certain parts untouched.

An interesting CLI option is --no-pre which disables the manual replacement in the preprocessor stage.

This still removes the blocks from translation but at the end, instead of replacing the whole block with the bottom part, it just recreates the block where it was, as it was, untouched, untranslated.

This allows you to test your file with or without the manual replacement that you have in your source file.

Extra options#

Here you will find details on some options mostly used for debugging during development, but that can still come in handy for the end user on understanding what’s going wrong while your file is being processed and how you can potentially make it succeed.

  • -n or --dry-run: This makes it so that no API call is made, no internet is used and TransLaTeX runs offline. This means that no translation is made but the pipeline is run up until the translation stage including parsing and tokenization. This helps to test the inner workings of TransLaTeX and especially the idempotency. Since the same contents of the given file is returned, one can check for differences. If you are using a paid API with limits, you can use this option to detect undetected errors in the processing of your file before calling for a real translation.

  • -d or --debug: This is the debug option, and it does multiple things. Firstly, it enables the output of logs of level DEBUG or higher to stderr. This is the ultimate option as far as logs compared to -v and -vv since this also enables logs for TransLaTeX and for the imported modules while lowering the log level resulting in even more information. Secondly, it generates intermediary files produced while the input is being processed. These include the preprocessed, marked and tokenized versions of your given file, each relative to their respective stages and also the dictionaries that hold the markers and tokens and their associated LaTeX structures that they replace. This can give huge insight on what went on when TransLaTeX ran on your given file and what maybe went wrong.

  • -s or --stop: This option takes an argument and enables you to stop TransLaTeX’s execution at a given stage. Once stopped, that stage completes its operation and writes its result to the output file if specified or else to stdout. This can help you hunt down where a problem appears during the process.

Note

Take a look at the Modules section for even more details on the inner workings and the architecture of TransLaTeX.

Customizing behaviour#

You can customize and improve TransLaTeX’s behaviour by modifying or more importantly adding to the list of known LaTeX structures in the data module.

The default behaviour of TransLaTeX is dependent on the type of LaTeX structure encountered.

  • For LaTeX commands:

\LaTeX
\TeX{}
\textbf{Bold text}
\href{https://www.overleaf.com/}{Link to Overleaf}
\url{https://www.overleaf.com/}
\madeupcommand[with][options]{and}{many}{arguments}[last option]

it is to tokenize everything except the very last occurring argument (last set of curly braces {}) since this seems to be a common place to find the text that needs translation inside LaTeX commands even though not standard for all commands.

Some commands that are known to never contain text to translate are tokenized as a whole including all their arguments and options, like in the case for the \url command. This would for example produce the following, before sending out for translation:

[0-2]
[0-3]{}
[0-4]{Bold text}
[0-5]{Link to Overleaf}
[0-1]
[0-6]{arguments}
  • For LaTeX environments:

\begin{document}
    \section{New section}
    This is some text.
    \(x < 0 \textnormal{some text inside an inline math environment}\)
    \begin{lstlisting}[language=Python]
    import numpy as np
    print("Hello, world!")
    \end{lstlisting}
    \begin{enumerate}
        \item Foo bar
        \item $ x < 0 $
        \item Spam eggs
    \end{enumerate}
    \[
        D=P^{-1}AP.
        \textsf{some text inside a block math environment}
    \]
    \begin{equation}
        A=\left[
            \begin{array}{ccc}
                3  & 3  & 4   \\
                6  & -2 & -12 \\
                -2 & 3  & 9
            \end{array}
            \right].\label{eq:A}
    \end{equation}
\end{document}

it is to tokenize the \begin/\end statements and recurse into the environment to find and process the nested commands and environments if it is a regular environment. If it is a math environment, it is recursed into only if it contains text to translate (any known commands that can contain text inside a math environment), otherwise, it is tokenized as a whole. This is also true for environments that are known to never contain any text to translate, they are tokenized as a whole, like in the case for \begin{lstlisting}. This produces the following result after tokenization:

[0-10]
  [0-7]{New section}
  This is some text.
  [0-2][0-8]{some text inside an inline math environment}[0-5]
  [0-11]
  [0-12]
    [0-1] Foo bar
    [0-1] [0-3]
    [0-1] Spam eggs
  [0-14]
  [0-4][0-9]{some text inside a block math environment}
  [0-6]
  [0-13]
[0-15]

So here is a relatively user-friendly way of customizing and altering these default behaviours of TransLaTeX to make your file process correctly:

  • Add the names of the LaTeX commands or environments in COMPLETELY_REMOVED_COMMANDS or COMPLETELY_REMOVED_ENVS respectively that are to be completely removed, and that never contain any text to translate.

  • Add the names of the LaTeX math environments in MATH_ENVS for them to be processed as such. Add the names of any LaTeX commands used to insert text inside math environments in TEXT_COMMANDS for them to be detected and prepared for translation.

  • Add the names of the LaTeX commands in SPECIAL_COMMANDS that are to be handled during the tokenization stage via regular expressions instead of during the marking stage via a parser. This is handy for complex and variable structures that need finer handling to be able to extract their text every time. Since entering in additional regex by the end user hasn’t been implemented yet, anything you add here will only be skipped on all stages leaving these commands and all their arguments and options intact before sending out for translation (they could thus get altered by the translator resulting in a broken LaTeX file).

Custom translation service#

You can provide your own translation service by creating a python file containing one or several custom translation service classes that derive from TranslationService and implement the translate() method.

Suppose that a custom.py file contains the following code:

from translatex.translator import TranslationService


class DoNoTTranslate(TranslationService):
    """A Mockup translation service that does not translate anything."""

    name = "Do not translate"

    def translate(self, text: str, source_lang: str, dest_lang: str) -> str:
        return text

Then, you can use it with the --custom_api option together with the --service option:

translatex --custom_api custom.py --service "Do not translate" input.tex output.tex

Here is another example with a custom translation service that uses the TextSynth API:

import logging
import os

import requests

from translatex.translator import ApiKeyError, TranslationService

log = logging.getLogger("translatex.custom_api")


class TextSynth(TranslationService):
    """Translate using TextSynth API."""

    name = "TextSynth"
    char_limit = 1000
    url = "https://api.textsynth.com/v1/engines/m2m100_1_2B/translate"
    doc_url = "https://textsynth.com/documentation.html#translations"
    short_description = "TextSynth translation service based on M2M100 model."

    def __init__(self):
        try:
            self.textsynth_api_key = os.environ["TEXTSYNTH_API_KEY"]
        except KeyError:
            raise ApiKeyError(self.name, "TEXTSYNTH_API_KEY")

    def translate(self, text: str, source_lang: str, dest_lang: str) -> str:
        headers = {"Authorization": f"Bearer {self.textsynth_api_key}"}
        payload = {
            "text": [text],
            "source_lang": source_lang,
            "target_lang": dest_lang,
        }
        r = requests.post(self.url, headers=headers, json=payload, timeout=10)
        try:
            return r.json()["translations"][0]["text"]
        except Exception as e:
            log.error(e)
            log.error(str(r))
            log.error(r.json())
            return text

Indices and tables#