Localization files: Medium to exchange languages

2018/09/03 English

In projects involving multiple languages, or when planning to go global, you will likely face the challenge of exchanging multilingual content within your team, across teams, or with external vendors. It is important to understand how this is handled in a modern working environment, what mediums and tools are used, and why it matters. In this article, I will explain the basic requirements for multilingual content management in modern production, show examples of such files, and introduce some essential tools in this area.

Why It Matters

The extensive use of computers in the translation industry has caused a significant shift, affecting localization as a branch of translation. In international business, documents are often translated into multiple languages simultaneously. The increased volume of text and shortened turnaround times show that optimization is required to make the process more efficient.

A couple of decades ago, simple text documents such as TXT or RTF were sufficient for exchanging multilingual content between vendors and users (developers, designers). All requirements would be stated in the file, and a vendor would fill in the blanks for the customer. However, this practice caused considerable inconvenience for both sides. Here are just a few of the most obvious issues:

Time-consuming extraction of content for translation
Room for error when copying translations back into the program
Incompatible encoding within documents

In addition, other practices involved simply sending the source document and expecting vendors to return it “translated.” This worked well for single translation requests but was hardly scalable. Even a minor change in the source document would require reviewing or re-translating the entire document. Even today, companies just starting to work in globalization or localization may be unaware of best practices, using good old Word to store content.

Over the past 20 years, several file formats have been introduced to address these challenges and standardize how different people, companies, and teams exchange and store multilingual content.

File Formats

Today I will introduce you to three of the most popular file formats—XLIFF, TMX, and PO—which are widely used for localization purposes.

XLIFF

XLIFF (XML Localization Interchange File Format) was introduced in 2002 by OASIS to standardize how localizable data is passed between tools during the localization process and to provide a common format for CAT tool exchange. It has since been used by numerous localization products, including memoQ, Memsource, SDL Trados, Wordfast, OmegaT, and others. It is an XML-based format, meaning it is human-readable to a certain extent and can be parsed by many tools. The specifications are open and available for use by any third-party company. The file format itself specifies elements and attributes to store content from the original file format and its corresponding translation.

Structure

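The example below follows XLIFF 2.0, the current major version of the specification. A minimal document might look like this:

<xliff xmlns="urn:oasis:names:tc:xliff:document:2.0" version="2.0" srcLang="en-US" trgLang="zh-CN">
  <file id="f1">
    <unit id="1">
      <segment>
        <source>Hello, world!</source>
        <target>你好世界</target>
      </segment>
    </unit>
  </file>
</xliff>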

Basically, an XLIFF document is a list of identified units, each holding one or more segments from the source file, together with the declared source and target languages. It can store source-only content to hand over to a localization agency, or both source and target content to return to the engineer who will use it in their software. Additionally, there is a “skeleton” part of the document that preserves portions of the original document not related to localization. You can read the detailed specification of the XLIFF file format on the official OASIS website.
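To make the structure concrete, here is a minimal Python sketch that reads source/target pairs from an XLIFF 2.0 document using only the standard library. It assumes a spec-conformant file that declares the XLIFF 2.0 namespace and wraps its units in a file element; the function name read_segments and the SAMPLE string are illustrative, not part of any tool.

```python
import xml.etree.ElementTree as ET

# Core namespace defined by the XLIFF 2.0 specification.
XLIFF_NS = {"x": "urn:oasis:names:tc:xliff:document:2.0"}

# Illustrative sample document (an assumption for this sketch).
SAMPLE = """<xliff xmlns="urn:oasis:names:tc:xliff:document:2.0" version="2.0"
       srcLang="en-US" trgLang="zh-CN">
  <file id="f1">
    <unit id="1">
      <segment>
        <source>Hello, world!</source>
        <target>你好世界</target>
      </segment>
    </unit>
  </file>
</xliff>"""


def read_segments(xliff_text):
    """Return (srcLang, trgLang, rows); each row is (unit id, source, target or None)."""
    root = ET.fromstring(xliff_text)
    rows = []
    for unit in root.iterfind(".//x:unit", XLIFF_NS):
        for segment in unit.iterfind("x:segment", XLIFF_NS):
            source = segment.find("x:source", XLIFF_NS)
            target = segment.find("x:target", XLIFF_NS)
            rows.append((
                unit.get("id"),
                "".join(source.itertext()),
                # A source-only file has no <target> yet.
                "".join(target.itertext()) if target is not None else None,
            ))
    return root.get("srcLang"), root.get("trgLang"), rows


src_lang, trg_lang, rows = read_segments(SAMPLE)
for unit_id, source, target in rows:
    print(unit_id, source, target)
```

Inline markup inside segments is flattened to plain text here; a real integration would need to preserve it.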

Best For

XLIFF is one of the most commonly used mediums for separating localization texts from software.

TMX

TMX (Translation Memory eXchange) is another XML-based file format, first introduced in 1997. It was created to address problems similar to those XLIFF would later tackle, though its primary area of application was Computer-Assisted Translation (CAT) systems. Its main difference is that TMX can hold multiple target languages in the same file, whereas an XLIFF file is bilingual. TMX also doesn’t preserve any part of the original document except the source content itself, so the final document cannot be produced from a TMX file alone.

Structure

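This is the basic structure of TMX 1.4b, the latest standardized and most commonly used version of the format (the header attribute values shown are placeholders):

<tmx version="1.4">
  <header creationtool="ExampleTool" creationtoolversion="1.0" segtype="sentence" o-tmf="unknown" adminlang="en" srclang="en" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="en">
        <seg>Hello world!</seg>
      </tuv>
      <tuv xml:lang="fr">
        <seg>Bonjour tout le monde!</seg>
      </tuv>
      <tuv xml:lang="zh-CN">
        <seg>世界你好！</seg>
      </tuv>
    </tu>
  </body>
</tmx>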

You can find the most recent specifications, distributed under Creative Commons Attribution 3.0, hosted at GALA Global. To this day, TMX is widely supported by professional software such as memoQ, SDL Trados, Wordbee, SmartCAT, OmegaT, and others. It is used to import translation memories in some systems and serves as native translation memory storage in others.
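As a rough illustration of how a translation unit maps to data, the following Python sketch loads each tu element into a dictionary keyed by language code. It uses only the standard library; the function name read_units and the header attribute values in SAMPLE are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Illustrative sample; header values are placeholders.
SAMPLE = """<tmx version="1.4">
  <header creationtool="ExampleTool" creationtoolversion="1.0" segtype="sentence"
          o-tmf="unknown" adminlang="en" srclang="en" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Hello world!</seg></tuv>
      <tuv xml:lang="fr"><seg>Bonjour tout le monde!</seg></tuv>
      <tuv xml:lang="zh-CN"><seg>世界你好！</seg></tuv>
    </tu>
  </body>
</tmx>"""


def read_units(tmx_text):
    """Return one {language: text} dictionary per translation unit."""
    root = ET.fromstring(tmx_text)
    units = []
    for tu in root.iterfind("./body/tu"):
        variants = {}
        for tuv in tu.iterfind("tuv"):
            seg = tuv.find("seg")
            if seg is not None:
                variants[tuv.get(XML_LANG)] = "".join(seg.itertext())
        units.append(variants)
    return units


units = read_units(SAMPLE)
for unit in units:
    print(unit)
```

Unlike the bilingual XLIFF case, a single unit here can carry any number of languages.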

Best For

TMX is well suited to long-term storage of multilingual content. Its wide support means you will most likely be able to reuse the translation memory in another environment. The format can also serve as a medium between different vendors, or help when you want to switch from one vendor to another. The ability to store multiple languages in one file is one of its biggest advantages: multiple translation memories from different vendors and translation teams can be combined into a unified, project-based storage that can serve as the basis of a term base or “content library” for the project in the long run.
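Combining memories from several vendors can be sketched in a few lines of Python. This is a simplified illustration, not how any particular CAT tool does it: it assumes each memory has already been loaded as a list of {language: text} dictionaries, aligns units by exact source text, and lets the first memory win when two disagree. The function name merge_memories and the sample data are hypothetical.

```python
def merge_memories(memories, source_lang="en"):
    """Merge lists of {language: text} units, aligned on exact source text."""
    merged = {}
    for units in memories:
        for unit in units:
            source = unit.get(source_lang)
            if source is None:
                continue  # a unit without source text cannot be aligned
            entry = merged.setdefault(source, {source_lang: source})
            for lang, text in unit.items():
                # Keep the first translation seen for each language.
                entry.setdefault(lang, text)
    return list(merged.values())


# Hypothetical memories delivered by two vendors.
vendor_a = [{"en": "Hello world!", "fr": "Bonjour tout le monde!"}]
vendor_b = [
    {"en": "Hello world!", "zh-CN": "世界你好！"},
    {"en": "Goodbye", "fr": "Au revoir"},
]

combined = merge_memories([vendor_a, vendor_b])
for unit in combined:
    print(unit)
```

A production merge would also have to deal with inline tags, near-duplicate source strings, and conflicting translations that need human review.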

PO

PO files are part of gettext, an internationalization and localization system for Unix-like operating systems that is widely used in open-source software projects. You can read more details here. PO is a human-readable format that links the text in the software with its localized counterpart. By convention, the source text (usually English) serves as the ID of each string, and each locale of the software is stored in a separate *.po file named after that locale. Usually, the translation vendor or translator receives a clean copy of the *.po file containing only source strings and adds a translation for each string using a CAT tool or dedicated PO-compatible software such as Poedit. The file is then returned to the developers and compiled into a machine-readable *.mo file for use at runtime.

Structure

The human-readable part of gettext localization is a bilingual file in the following format, where msgid holds the source string, msgstr holds its translation, and the #: comment lists where the string occurs in the source code:

#: src/PackageCommands.cs:57 src/PackageCommands.cs:3181
msgid "Search for a match to any of the search strings"
msgstr "Vyhledat výsledek odpovídající alespoň některému z řetězců"
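To show how little structure a simple entry needs, here is a deliberately minimal Python sketch that reads msgid/msgstr pairs. It is an illustration only: it ignores comments, plural forms, and message contexts, and it approximates PO string escapes with Python's own string-literal rules. The function name parse_po is hypothetical; real projects would rely on gettext tooling or a dedicated library instead.

```python
import ast

SAMPLE = '''#: src/PackageCommands.cs:57 src/PackageCommands.cs:3181
msgid "Search for a match to any of the search strings"
msgstr "Vyhledat výsledek odpovídající alespoň některému z řetězců"
'''


def _unquote(token):
    # PO strings use C-style escapes; Python literal syntax covers the
    # common ones (\n, \t, \", \\), which is close enough for a sketch.
    return ast.literal_eval(token)


def parse_po(text):
    """Return a list of {"msgid": ..., "msgstr": ...} for simple entries."""
    entries = []
    entry = None
    field = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("msgid "):
            entry = {"msgid": _unquote(line[6:]), "msgstr": ""}
            entries.append(entry)
            field = "msgid"
        elif line.startswith("msgstr ") and entry is not None:
            entry["msgstr"] = _unquote(line[7:])
            field = "msgstr"
        elif line.startswith('"') and entry is not None and field:
            # Continuation line: long strings may be split across lines.
            entry[field] += _unquote(line)
        else:
            field = None  # comments, blank lines, unsupported keywords
    return entries


entries = parse_po(SAMPLE)
for item in entries:
    print(item["msgid"], "->", item["msgstr"])
```

An empty msgstr in the result marks a string that still needs translation.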

Best For

The big advantage of this approach, compared with the aforementioned XLIFF format, is that with gettext the programmer sees the actual text instead of a placeholder during development. Users, on the other hand, will see the localized string only if their system or software is set to the required language, the corresponding translation file exists, and the string is translated in that file; otherwise the source text is shown.
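The fallback behavior can be observed with Python's standard gettext module, which reads compiled *.mo catalogs. In this sketch the domain name "messages" and the helper load_translator are assumptions for illustration; the locale directory is intentionally empty, so no catalog is found and the source string comes back unchanged.

```python
import gettext
import tempfile


def load_translator(domain, locale_dir, language):
    """Return a translations object, falling back to source text if no catalog exists."""
    # With fallback=True, a missing catalog yields NullTranslations,
    # whose gettext() returns the msgid as-is.
    return gettext.translation(
        domain, localedir=locale_dir, languages=[language], fallback=True
    )


with tempfile.TemporaryDirectory() as empty_locale_dir:
    translator = load_translator("messages", empty_locale_dir, "cs")
    _ = translator.gettext
    text = _("Search for a match to any of the search strings")

print(text)
```

Once a compiled catalog is placed at cs/LC_MESSAGES/messages.mo under the locale directory, the same call would return the translated string instead.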

Conclusion

There are more file formats used for localization purposes, such as CSV, Java properties, JSON, and Android XML (you can check out more at this portal, for example). The main idea of this article, however, wasn’t to cover them all. Instead, this material is meant to show project managers, owners, and engineers how people have dealt with similar challenges and to help readers find tools that suit their needs.

It is also worth mentioning that most of the described formats and specifications are open and do not require licensing or fees, meaning that you can use them in your projects in almost any way you like. Furthermore, numerous software tools already support working with these files, and many of them are open-source and free to use. In other words, do not hesitate, and do not stick to good ol’ Google Docs. Spend some time exploring what has been done before, and take the best out of it. Every hour you spend preparing the localization groundwork for your product before you start can save 100 hours of frustration, refactoring, and hunting for workarounds in the future.
