Localization files: Medium to exchange languages
In a project that involves multiple languages, or implies that you might at some point go global, it is highly likely that you will be dealing with the problem of exchanging multi-language content within your team, or cross-teams, or with external vendors. It is important to understand how it has been done in a modern working environment, what mediums and instruments are used, and why it has overall importance.
In this article, I will explain to you the basic requirements for the multi-language medium in modern production, show you some examples of such files, as well as some most basic instruments in this area.
Why it matters
Extensive use of computers in the translation industry caused a big shift, and localization, as a branch of the translation industry, was also affected by this. In the international business, documents are often translated in not one but multiple languages at the same time. An increased amount of texts, shorten turn-around time shows that optimization is required to make the process more efficient.
A couple of decades ago, a simple text document, such as TXT or RTF, was sufficient enough to exchange multiple languages between vendors and users (developers, designers) of such content: all the requirements would be stated in the file, and a vendor would fill in the blanks for the customer.
However, such practice might a lot of inconvenience for both sides, here are but a few of the most obvious ones:
- Time-consuming to extract info for translation
- Room for error when copying translations into the program
- Incompatible encoding within a document
In addition to that, other practices suggest to just send the source document and expect vendors to return it “translated” for them, which worked well for a single translation request, but was hardly scalable. Even a minor change in the source document would require to perform the review or re-translation of the whole document.
Even today, companies that just start working in the global or localization direction, might be unaware of best localization practices out there, using a good old Word to store content.
In the past 20 years, several file formats were introduced to face the challenge of this issue, and to standardize the way different people, companies, and teams are exchanging and storing multi-language content.
Today I will introduce you to three most popular file formats – XLIFF, TMX, and PO, which are widely used for localization purpose.
XLIFF (XML Localization Interchange File Format) was introduced in 2002 by OASIS organization to standardize the way localizable data are passed between tools during a localization process and a common format for CAT tool exchange. It was used since by numbers of localization products, including memoQ, Memsource, SDL Trados, Wordfast, OmegaT, etc.
It is an XML-based format, meaning that it is human-readable to a certain extent, and can be parsed by numbers of tools. Specifications are open and available to use by any third-party company.
File format itself specifies elements and attributes to store content from original file format, and it’s corresponding translation.
The current version of the specification is XLIFF 2.0, and a document might look like this:
Basically, an XLIFF document is a list of numbered segments of the source file, with the specified source and target languages. It can store source only to hand over to localization agency, and both source and target content to return to the engineer who will use it in his software.
Additionally, there is “skeleton” part of the document, that preserves the part of the original document that is not related to localization. You can read the detailed specification of the XLIFF file format at OASIS organization official website.
XLIFF is one of the most commonly used medium to separate localization texts from the software.
TMX (Translation Memory eXchange) is also an XML-based file format that was firstly introduced back in 1997. It was created to face similar problems as XLIFF will be later on, though it’s primary area of applications was Computer assisted translation (CAT) systems.
TMX specifications is also based on XML. It’s main difference, however, is that TMX can hold multiple target languages in the same file, compare to bi-lingual XLIFF. Also it doesn’t preserve any parts of the original document except for the source content itself, render it incapable of producing the result document on it’s own.
This is the basic structure of TMX 1.4b, the most commonly used and the latest standardized format of the file.
<seg>Bonjour tout le monde!</seg>
You can find the most recent specifications, which is distributed under Creative Commons Attribution 3.0, hosted at Gala-Global.
To the day, TMX is widely supported by professional software such as memoQ, SDL Trados, Wordbee, SmartCAT, OmegaT, etc., used to import translation memories in some and as native translation memory storage in others.
TMX is very capable of long-term storage of multi-language content. It’s high popularity guarantees that you will be able to re-use the translation memory in another environment. This format can also be used as a medium between different vendors, or if you want to switch from one vendor to another.
Ability to store multiple languages at the same time is one of the biggest advantages of the format. It is possible to combine multiple translation memories from different vendors and translation teams to form a unified, project-based storage which can serve as the basis of term base, or “content library”, within a project in the long run.
PO files are part of the gettext internationalization and localization system for Unix-like operating systems. It is widely used in number of open source software. You can read it in more details here.
PO files represents a human readable file that allows to link the text in the software with it’s localized counterpart. By default, software uses English text as an id, and each locale of the software is stored in a separate *.po file with the corresponding name.
Usually the translation vendor or translator will receive a clean copy of the *.po file with only source strings, and add translations for each string using some CAT tool or dedicated *.po compatible software such as POedit. Then this file will be returned to the developers and compiled into machine-readable format for further usage.
The human readable part of gettext localization is a bi-lingual file of the following format, where msgid is the source language, and msgstr is the target:
#: src/PackageCommands.cs:57 src/PackageCommands.cs:3181
msgid "Search for a match to any of the search strings"
msgstr "Vyhledat výsledek odpovídající alespoň některému z řetězců"
The big advantage of this approach is, unlike aforementioned XLIFF format, with gettext the programmer can see the actual text instead of a placeholder during the development process. Users, on the other hand, will see the localized string only if his system/software is set up to the required language, if the corresponding *.po file exists, and if the string is translated in the file.
There are more file formats there that are being used for localization purpose – CSV, Java JSON, Android XML, etc. (you can check out more at this portal, for example). The main idea of this article, however, wasn’t to cover them all. Instead, this is the material for the project manager, owner, or engineer, to see how people were dealt with similar challenges, and to help the reader find a tool that will suit his needs.
It is also worth to mention that most of the described formats and specifications are open and does not require licensing or fees, meaning that you can use those solutions for your projects in almost any way possible. Above that, there are numbers of software exists already that supports working with those files, and lots of them are open sourced and free to use.
In other words, do not hesitate or stick to good ol’ Google Docs. Spent some time to explore what has been done before, and take the best out of it. Every hour you spend preparing the localization ground for your product before you started saves 100 hours of frustration, refactoring and workaround findings in the future.
- XLIFF specifications
- GNU gettext
- TMX specifications
- XML in localization: Reuse translations with TM and TMX