Working with transliterations on the web

Whilst developing a new site for a client with a Yiddish collection, I learnt some interesting things about the limits of modern fonts when dealing with transliterations and rarely-used characters.

The collection featured both Yiddish and transliterated strings of text in their metadata, and our task was to bring all these into the website and display them correctly and uniformly.

What is a transliteration?

Before I dive in, I should explain what transliteration is, and why you’d use it. Consider this Yiddish string:

ל.נ טאלסטאי

If you don’t read Yiddish, it’s hard to figure out how you’d begin to pronounce or search for this. Most users can read Roman characters (a-z), so converting it from Yiddish glyphs to Roman characters makes it much more legible. With a transliteration we would get this:

L.N. Ṭolsṭoy

Now most users could at least approximate how this would be pronounced, and likely recognise the name as that of Leo Tolstoy. Having this transliterated version in our metadata also makes it more searchable—it’s much simpler to match a search for ‘Tolstoy’ to ‘Ṭolsṭoy’ than to ‘טאלסטאי’.
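To illustrate why the transliterated form is easier to search, here is a minimal sketch of diacritic-insensitive matching in Python. It is not the search code from the project; the function names fold_diacritics and matches are hypothetical, and it assumes that stripping combining marks and folding case is an acceptable comparison for your data.

```python
import unicodedata


def fold_diacritics(text: str) -> str:
    """Return a case-folded copy of text with combining marks removed."""
    # NFD splits characters such as U+1E6D into a base letter plus a mark.
    decomposed = unicodedata.normalize("NFD", text)
    # combining() returns a non-zero class for combining marks.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


def matches(query: str, candidate: str) -> bool:
    """True if query appears in candidate once diacritics and case are ignored."""
    return fold_diacritics(query) in fold_diacritics(candidate)


# "L.N. Ṭolsṭoy", written with escapes so the code points are unambiguous.
transliterated = "L.N. \u1e6cols\u1e6doy"
print(fold_diacritics(transliterated))     # l.n. tolstoy
print(matches("Tolstoy", transliterated))  # True
```

Matching the same query against the original Hebrew-script string would need a transliteration step first, which is exactly the work the transliterated metadata saves.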

Transliterations are created using transliteration tables, which have mappings for various languages. These tables are published in various standard forms—one of the more common is the ALA-LC standard from the Library of Congress.

Back to the problem

We knew that there would be a large number of transliterated strings, so the first task was to import the metadata and see what it looked like with the chosen font. Immediately problems started cropping up—some letters looked odd, others were obviously rendered in a different font. After digging into the data, it became apparent we were dealing with two separate problems.

Character composition

The first problem, the odd-looking letters, was caused by the presence of both pre- and de-composed characters in the metadata. Character composition is a feature of Unicode which determines how characters with diacritics, such as ‘ṭ’, are represented. There are two ways to represent this character:

LATIN SMALL LETTER T WITH DOT BELOW (U+1E6D)
LATIN SMALL LETTER T (U+0074) + COMBINING DOT BELOW (U+0323)

The first representation is known as a pre-composed character (the letter and its diacritic are a single code point), and the second as a de-composed character (the letter and diacritic are two separate code points). For de-composed characters, rendering varies widely between fonts, because it depends on how well each font positions the combining mark against the base letter.
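The two representations can be inspected directly in Python. This is a small illustrative sketch using the standard unicodedata module; the describe helper is a hypothetical name introduced only for this example.

```python
import unicodedata

precomposed = "\u1e6d"  # LATIN SMALL LETTER T WITH DOT BELOW
decomposed = "t\u0323"  # LATIN SMALL LETTER T + COMBINING DOT BELOW


def describe(text: str) -> list[str]:
    """List each code point in text with its official Unicode name."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


print(describe(precomposed))  # one entry
print(describe(decomposed))   # two entries

# The strings look the same on screen but are not equal as Python strings...
print(precomposed == decomposed)  # False
# ...until both are brought to the same normalisation form.
print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
```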

Testing

To identify what the best solution would be, I put together a test-case page containing all of our known diacritic variants, and tested it with a selection of fonts on both Windows and OS X. It was immediately apparent that the pre-composed variant displayed most consistently, so we made the decision to normalise all the strings to their pre-composed (NFC) form. As it turned out this was a straightforward transformation: we were already processing the data with Python, and the unicodedata module lets you do this:

import unicodedata
composed_string = unicodedata.normalize('NFC', decomposed_string)

Apart from some edge cases like combining ligatures (a topic for another time), this worked very well.
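For anyone wanting to find those edge cases in their own data, here is a rough sketch of normalising a string and then listing any combining marks that survive, meaning Unicode has no pre-composed form for that sequence. The function name normalise_and_audit is hypothetical, and U+0361 (COMBINING DOUBLE INVERTED BREVE) is used purely as an illustrative ligature mark; it may not be the one in your data.

```python
import unicodedata


def normalise_and_audit(text: str) -> tuple[str, list[tuple[int, str]]]:
    """Normalise text to NFC and report combining marks that remain.

    Returns the normalised string and a list of (index, code point) pairs
    for every combining mark that NFC could not merge into a base letter.
    """
    composed = unicodedata.normalize("NFC", text)
    leftovers = [
        (index, f"U+{ord(ch):04X}")
        for index, ch in enumerate(composed)
        if unicodedata.combining(ch)
    ]
    return composed, leftovers


# t + dot below has a pre-composed form, so nothing is left over.
print(normalise_and_audit("t\u0323"))   # ('ṭ', [])

# A ligature tie spanning two letters has no pre-composed form, so it stays.
print(normalise_and_audit("t\u0361s"))  # mark reported at index 1
```

Anything in the leftovers list still relies on the font to position the mark, so those strings are worth adding to a test page.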

Finding a font

The second issue, of characters not rendering in our desired font face, was due to missing characters in the font. This is something to be aware of when venturing into the domain of rarely-used characters—font designers will rarely include every possible character in their font, but will instead design glyphs for only the most commonly used characters.
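A quick way to spot this in advance is to compare the characters in your metadata against the code points a font covers. The sketch below is an assumption-laden illustration rather than what we ran: missing_characters is a hypothetical helper, and the coverage set here is a stand-in for a real font's character map (which could be read with a font library such as fontTools, for example via TTFont(path).getBestCmap()).

```python
def missing_characters(text: str, supported_code_points: set[int]) -> list[str]:
    """Return the distinct code points in text that the font does not cover."""
    missing = set()
    for ch in text:
        if ch.isspace():
            continue  # whitespace is not interesting for glyph coverage
        if ord(ch) not in supported_code_points:
            missing.add(f"U+{ord(ch):04X}")
    return sorted(missing)


# Hypothetical font that only covers printable ASCII.
ascii_only_font = set(range(0x20, 0x7F))

# "L.N. Ṭolsṭoy" with the dotted letters written as escapes.
sample = "L.N. \u1e6cols\u1e6doy"
print(missing_characters(sample, ascii_only_font))  # ['U+1E6C', 'U+1E6D']
```

This only checks code-point coverage. A font can contain a base letter and a combining mark and still position them badly, so it complements visual testing rather than replacing it.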

Our client got in touch with the font designer and arranged to have the missing characters added. However, we still needed the site to display correctly whilst this took place. As an interim solution we looked around for a compatible font. Google Noto was our final choice, as it supports a huge range of characters, and this worked very well.

Lessons learned

This was a very interesting process, and I learnt a lot from it: don’t assume that your font will support all of the glyphs that you require if you’re working outside of common ASCII characters; test a known set of non-standard characters on all supported platforms; and if you don’t require de-composed characters for other purposes, pre-composition will help your data display more consistently.

This is by no means an exhaustive guide to potential issues with fonts and characters—if this talk of Unicode and ASCII sounds esoteric I’d recommend Joel Spolsky’s guide to character encoding and Unicode (a must-read for all programmers). If you’ve had similar issues, I’d love to know how you’ve solved them—let me know on Twitter via @benkyriakou.