Two More Cents

#A Translation Map of Indian Languages

I recently finished working on my latest project: Indian Translate, a translation map of 22 Indian languages - enter text in English, and view the translations (and their English transliterations) on the map in the region in which the language is spoken. This post is a deep dive into the project and its implementation.

#Inspiration

The project was inspired by this post on Hacker News - a similar map for European languages. India has comparable linguistic diversity, making it a great candidate for such a project. But unlike most European languages, which use the Latin script, India's 22 official languages are written in 13 different scripts, so an English transliteration would be immensely helpful as well.

#Creating a Map

My first goal for the project was to create an interactive map, ideally rendered as an SVG. D3.js seemed like the best tool for data visualization on the web, but it needed a source of data (ideally in GeoJSON) to plot. (Sidenote: GeoJSON is a very cool format, because it's a human-readable representation of cartographic data.) In my search for the right data source, I came across Natural Earth, which offers free downloads of maps in a wide variety of formats - although GeoJSON wasn't among them, it was easy enough to convert their mapfiles into it. The only problem was that their data only defined state boundaries, which wasn't granular enough for my use case: some Indian languages (like Tulu) are only spoken in parts of states. India's 28 states are divided into 780 districts, which I figured would be the ideal unit of division for such a project. I found this GeoJSON map on GitHub which, in addition to representing each district, also records the state each district belongs to. Perfect!
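Grouping district features under their parent state is then a one-pass walk over the GeoJSON. Here's a minimal sketch - the property names (`district`, `st_nm`) and the tiny inline FeatureCollection are illustrative stand-ins, not necessarily what the real file uses:

```python
import json
from collections import defaultdict

# A tiny stand-in for the district-level GeoJSON file; the real file has
# 780 districts and may use different property names.
geojson = json.loads("""
{
  "type": "FeatureCollection",
  "features": [
    {"type": "Feature",
     "properties": {"district": "Udupi", "st_nm": "Karnataka"},
     "geometry": {"type": "Polygon",
       "coordinates": [[[74.7,13.3],[74.8,13.3],[74.8,13.4],[74.7,13.3]]]}},
    {"type": "Feature",
     "properties": {"district": "Mysuru", "st_nm": "Karnataka"},
     "geometry": {"type": "Polygon",
       "coordinates": [[[76.6,12.3],[76.7,12.3],[76.7,12.4],[76.6,12.3]]]}}
  ]
}
""")

# Group district features by the state they belong to.
districts_by_state = defaultdict(list)
for feature in geojson["features"]:
    props = feature["properties"]
    districts_by_state[props["st_nm"]].append(props["district"])

print(districts_by_state["Karnataka"])  # ['Udupi', 'Mysuru']
```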

#Gathering Data

Now that I had a GeoJSON map of India, I had to collect the requisite data. I wanted to color-code different regions of the map based on language, similar to the European translation map. However, this was easier with the European map because it assumed a 1:1 mapping between language and country (or at least a 1:many). As I mentioned earlier, this isn’t always true with Indian languages, so I couldn’t just assign a language to each state. Instead, I went through census records (which collect linguistic data), finding districts where a sizable portion of residents spoke a certain language; in some cases, the Wikipedia entry on the district had done the work for me.

Rather than going through census records and Wikipedia entries for each district one by one, I used the following method.

Using this method, I was able to create a list of each state’s primary language and secondary languages (if any), along with the districts in which those secondary languages were spoken.
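The resulting list amounts to a small lookup table. A sketch of its shape (the structure and lookup function are my own illustration; the Tulu districts match the coastal-Karnataka example mentioned above):

```python
# Each state maps to a primary language plus optional secondary languages,
# each scoped to the districts where it is widely spoken.
LANGUAGES = {
    "Karnataka": {
        "primary": "Kannada",
        "secondary": [
            {"language": "Tulu", "districts": ["Dakshina Kannada", "Udupi"]},
        ],
    },
    "Goa": {
        "primary": "Konkani",
        "secondary": [],
    },
}

def language_for(state, district):
    """Resolve which language a district is colored with on the map."""
    entry = LANGUAGES[state]
    for sec in entry["secondary"]:
        if district in sec["districts"]:
            return sec["language"]
    return entry["primary"]

print(language_for("Karnataka", "Udupi"))   # Tulu
print(language_for("Karnataka", "Mysuru"))  # Kannada
```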

#Plotting the Map

D3.js, as I mentioned earlier, is an excellent tool for geographic data visualization. Since India lies close to the equator, where Mercator distortion is minimal, I went with a simple Mercator projection to plot the map. I assigned each language a color (which the corresponding region was filled with) and a language code, which would be used to communicate with the translation backend. I also added a CSS hover action to display the language name when a region is hovered over.
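The Mercator math itself is simple: longitude maps linearly to x, and latitude maps through ln(tan(π/4 + φ/2)). The sketch below reimplements the core formula for illustration; the site itself relies on D3's built-in `d3.geoMercator()`, which adds scaling and translation to fit the SVG viewport:

```python
import math

def mercator(lon_deg, lat_deg):
    """Project (longitude, latitude) in degrees to Mercator (x, y)
    on a unit sphere."""
    lon = math.radians(lon_deg)
    lat = math.radians(lat_deg)
    x = lon
    y = math.log(math.tan(math.pi / 4 + lat / 2))
    return x, y

# Near the equator, y is approximately the latitude in radians, so shapes
# at India's latitudes are barely distorted; distortion grows toward the poles.
x, y = mercator(77.2, 28.6)  # roughly New Delhi
```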

At this point, the frontend was mostly finished. The only work left here was to display data received, which couldn’t be done until I had a fully-functional backend.

#The Backend

I wanted my map to display two things - the translation (in the language’s native script) and a transliteration into English. Each of those is covered below.
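So each backend response carries both fields. A minimal sketch of the payload the front-end consumes (the field names here are my own invention, not necessarily what the site uses; note `ensure_ascii=False` so native scripts survive serialization intact):

```python
import json

# Hypothetical response payload for one language: the native-script
# translation plus its Latin transliteration.
response = {
    "lang": "kn",                  # language code assigned on the front-end
    "translation": "ನಮಸ್ಕಾರ",       # native script (Kannada here)
    "transliteration": "namaskara",
}

payload = json.dumps(response, ensure_ascii=False)
decoded = json.loads(payload)
```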

#Fetching translations

The Indian Institute of Technology (IIT) Madras, in Chennai, runs a project called AI4Bharat, whose aim is to build LLMs and other AI models tailored to Indian languages. One of their models, IndicTrans2, seemed like the perfect fit for my website: it supports exactly the 22 languages I was looking for, and I would have the added benefit of running everything locally.

Unfortunately, setting it up proved to be a hassle. After wrestling with the dependencies for a few days, I had to give up (to be clear, this is a Python ecosystem problem, not the project’s fault). I also realized I would need a pretty beefy server to run the translation model fast enough to serve requests in real-time. This meant that, much to my annoyance, I had to resort to Google Translate’s API with its inane bureaucratic management and token setup. I wrote a quick-and-dirty server in Go to fetch translations using the API, and ended up tweaking that into the final version.
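The server's job boils down to building requests against the public Translation v2 REST endpoint. A sketch in Python of just the request-construction step (no network call; the endpoint and the `q`/`target`/`key` parameters are from the public v2 API, while the function itself is my own illustration - the actual server is in Go):

```python
import urllib.parse

def build_translate_request(text, target, api_key):
    """Build a Google Cloud Translation v2 REST request URL.

    Construction only - sending it (and parsing the JSON response)
    is left to the caller.
    """
    base = "https://translation.googleapis.com/language/translate/v2"
    params = {"q": text, "target": target, "key": api_key}
    return base + "?" + urllib.parse.urlencode(params)

url = build_translate_request("hello", "kn", "YOUR_API_KEY")
```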

#Fetching transliterations

The AI4Bharat project also develops a transliteration model called IndicXLit. For some reason, this was much easier to set up and run on my VM (if I had to guess, it's because transliteration doesn't care about semantics, so the processing is much simpler). After writing a simple Python wrapper to fetch the transliterations, I had both pieces of the back-end ready to go.
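The wrapper's main trick is worth showing: caching results so repeated words skip model inference entirely. This is a sketch of that shape only - the `_engine_translit` stub stands in for the real IndicXLit call, whose API is not reproduced here:

```python
from functools import lru_cache

# Stand-in for the real IndicXLit model call; the table below is a
# hard-coded placeholder, not actual model output handling.
def _engine_translit(word, lang):
    table = {("ನಮಸ್ಕಾರ", "kn"): "namaskara"}
    return table.get((word, lang), word)

@lru_cache(maxsize=4096)
def transliterate(word, lang):
    """Cache transliterations so repeated requests never re-run the model."""
    return _engine_translit(word, lang)

print(transliterate("ನಮಸ್ಕಾರ", "kn"))  # namaskara
```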

#Putting it together

Now that the hard parts were done, I just had to glue the pieces together. The result was this:

Funnily enough, the vast majority of this project's code lives in the front-end. Rendering the map SVG, defining language boundaries, and displaying results from the back-end turned out to be a lot more work than I anticipated. While I did spend a lot of time on the back-end, that time was mostly spent setting up the API and the transliteration server.

Obviously, I’m very proud of the result. I hope the project is a good showcase of India’s diversity, and I hope you enjoy using it.