The Status of Research in Localisation

In this 10th and concluding article in the series on localisation, the author takes a closer look at the challenges of Indian language localisation and the current status of research in the field.

Ten years ago, very few people knew how to use Indian languages on the desktop. Now, we not only have millions of people who read content on the Internet in their native languages on desktops and smartphones, but also thousands of people who edit and contribute content. This has been possible due to the availability of computing devices with Indian language support, their falling prices, and the ability to access the Internet through various communication media.
We have seen a few early signs of the outcome of the research of the last decade, in the form of machine translation support for the Web. While five Indian languages are supported by Google’s machine translation tools, the quality is still not up to the mark due to the complex nature of these languages. Speech and touch interfaces have made their appearance, particularly on smartphones. The speech interface is now supported in limited domains, such as searching through the contacts list or searching the Internet. However, support for Indian-accented English needs to be improved.

As per the framework of the Centre for Next Generation Localisation, a specialised centre of excellence in Ireland, the challenges for localisation are volume, access (interaction mode) and personalisation. These three dimensions represent three axes, with most of the localisation work focused on high-volume content in corporate environments, with support for the traditional keyboard-and-screen mode of access, and limited support for personalisation in terms of language variations. The research challenge is to leverage various core technologies and frameworks to be able to instantly translate content and localise applications, duly considering the profile of the user. The next few paragraphs explore the status of localisation and the challenges faced in dealing with Indian languages.

a) Volume: The quantity of information on the Web is exploding due to its popularity as a medium of communication and interaction, and also the popularity of Web 2.0 platforms such as Twitter, Facebook, Google+, etc. The industry has tried to address this by defining and improving the process for localisation in corporate environments, as well as leveraging the crowd sourcing opportunity in social media environments.

The core component of localisation is translation technology. For a long time, the route explored was rule-based translation, which parses the source text and uses dictionaries and grammar rules to produce the translation. Subsequently, Statistical Machine Translation (SMT), in which algorithms are trained on paired human translations of source-language and target-language texts, has become popular. Translating websites at the click of a button has become feasible for the dominant languages, though the quality of the translation may be inadequate for professional requirements.
When there is no proper match in the translation memory, localisers can start from the automated translation suggestions of SMT and improve them. The resulting improved translation can then be used to train the statistical machine translation system.
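The workflow described above can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the translation memory is a small in-memory dictionary with made-up entries, fuzzy matching uses the standard library's difflib rather than the matching algorithm of any particular localisation tool, and machine_translate() is a hypothetical placeholder for a call to an SMT engine.

```python
from difflib import SequenceMatcher

# Hypothetical translation memory: source segment -> approved translation.
TRANSLATION_MEMORY = {
    "Open file": "फ़ाइल खोलें",
    "Save file": "फ़ाइल सहेजें",
}


def machine_translate(segment):
    """Hypothetical stand-in for an SMT engine; returns a tagged draft."""
    return "[MT draft] " + segment


def suggest(segment, threshold=0.75):
    """Return (suggestion, origin), where origin is 'tm' or 'mt'."""
    best_score, best_source = 0.0, None
    for source in TRANSLATION_MEMORY:
        # Similarity ratio between 0.0 (no overlap) and 1.0 (identical).
        score = SequenceMatcher(None, segment.lower(), source.lower()).ratio()
        if score > best_score:
            best_score, best_source = score, source
    if best_source is not None and best_score >= threshold:
        return TRANSLATION_MEMORY[best_source], "tm"
    # No close enough match in the memory: fall back to a machine draft.
    return machine_translate(segment), "mt"


def approve(segment, translation):
    """Store the localiser's corrected translation for later reuse."""
    TRANSLATION_MEMORY[segment] = translation
```

A localiser would call suggest() for each segment, correct the draft where necessary, and pass the final text to approve(). In a real setup, the approved pairs would also be exported as training data for the SMT system.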

b) Access (interaction mode): The traditional access (interaction mode) method while working with computers is through a screen and a physical keyboard. We have seen the emergence of the touchscreen, which allows for virtual keyboards and alternate methods of input like writing on the screen or composing the input by rapid selection of letters from the virtual keyboard by tracing a finger from letter to letter. In addition, with the popularity of smartphones, voice input and output is becoming another key interaction mode.

Due to the small screen size of phones, there is potential for errors in inputting text. Dictionary-based approaches that prompt the user to pick a word from a limited choice have been helpful. Other technologies that have reached a level of maturity in English, but need further development for Indian languages, are spell-checkers and grammar checkers.
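A dictionary-based prompt of this kind can be sketched in a few lines of Python. This is a simplified illustration, not the method used by any specific keyboard: the WordPredictor class is hypothetical, and the Hindi words and their frequency counts are invented for the example.

```python
from bisect import bisect_left


class WordPredictor:
    """Suggests dictionary words that start with what the user has typed."""

    def __init__(self, frequencies):
        # frequencies: word -> usage count (illustrative numbers only).
        self._freq = dict(frequencies)
        self._words = sorted(self._freq)

    def suggest(self, prefix, limit=3):
        # In a sorted list, all words sharing a prefix sit next to each
        # other, so a binary search finds where they start.
        start = bisect_left(self._words, prefix)
        matches = []
        for word in self._words[start:]:
            if not word.startswith(prefix):
                break
            matches.append(word)
        # Offer the most frequently used words first.
        matches.sort(key=lambda w: (-self._freq[w], w))
        return matches[:limit]


predictor = WordPredictor({"नमस्ते": 120, "नमक": 40, "नदी": 30, "नया": 90})
top_choices = predictor.suggest("नम")
```

The prefix test here works on Unicode code points. Indian scripts combine consonants, vowel signs and conjuncts, so a production keyboard would probably need to match on syllable boundaries instead.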

Speech technologies for text-to-speech and speech-to-text are critical for the voice mode of interaction. This works fine in a limited context like search or interactive customer support in English. The support for Indian languages is limited in text-to-speech, and barely exists for speech-to-text. Speaker independence and operation in noisy environments remain challenges for all languages.
Character recognition technology, which was developed to rapidly process huge volumes of data from physical books, supported by image processing and pattern recognition techniques, has matured for English, whereas current offerings for Indian languages are not adequate.

Handwriting recognition is another area of active research, as it allows for more natural user interfaces. Here again, the complex nature of most Indian scripts makes this a challenging research area.

c) Personalisation: Traditionally, localisation is coarse-grained, in the sense that it focuses on the language and not much on its variation across countries and regions within a country.

Personalisation refers to making information available as per the personal and information requirements of the user in a given context. This makes such information more valuable. If the user interface and other content can be made specific to a language as spoken in a particular region, the quality of localisation will become much better. This requires several resources, such as dictionaries at the dialect level and also a way to transform sentences from a standard language into its dialect forms.

Localisation tools
In the previous articles, we looked at the advances in tools from basic text-editor models to Web-based platforms. The tools have live interfaces to Translation Memory repositories, and support various project management tasks such as planning, tracking and work flow, as well as reporting mechanisms. Support for XML interoperability standards like XLIFF, TMX and TBX is also available. Several commercial business models based on the purchase and exchange of language resources have become common. Tools that allow Web localisation to be done directly on the displayed web page have appeared (e.g., Mozilla Pontoon). Further improving the tools so that they manage the complexities of localisation as per user constraints, while leveraging Web services and crowd sourcing efficiently, is an active research area.
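To give a feel for what such interoperability looks like in practice, here is a minimal Python sketch that reads translation pairs from a TMX document using only the standard library. The element names (tu, tuv, seg) and the xml:lang attribute come from the TMX format; the sample document, its entries and the load_pairs() helper are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Tiny illustrative TMX document; the entries are invented.
SAMPLE_TMX = """<tmx version="1.4">
  <header creationtool="example" creationtoolversion="1.0"
          segtype="sentence" o-tmf="none" adminlang="en"
          srclang="en" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Open file</seg></tuv>
      <tuv xml:lang="hi"><seg>फ़ाइल खोलें</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Untranslated entry</seg></tuv>
    </tu>
  </body>
</tmx>"""


def load_pairs(tmx_text, source_lang, target_lang):
    """Return (source, target) pairs for units that have both languages."""
    root = ET.fromstring(tmx_text)
    pairs = []
    for unit in root.iter("tu"):
        segments = {}
        for variant in unit.findall("tuv"):
            seg = variant.find("seg")
            if seg is not None:
                # itertext() also collects text nested in inline markup.
                segments[variant.get(XML_LANG)] = "".join(seg.itertext())
        if source_lang in segments and target_lang in segments:
            pairs.append((segments[source_lang], segments[target_lang]))
    return pairs


pairs = load_pairs(SAMPLE_TMX, "en", "hi")
```

Because the format is plain XML, the same pairs can be exchanged between different tools or loaded into a translation memory without depending on any one vendor.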

Future of Indian Language Technology Research
The Indian government’s Department of Electronics and Information Technology (DeitY) has an initiative called ‘Technology Development for Indian Languages’ (TDIL). The objective is to popularise the support for Indian languages on computing platforms. For more than a decade, it has been promoting work on machine translation systems (from English to Indian languages and from one Indian language to another), cross-lingual information access, Optical Character Recognition and handwriting recognition, through a consortium of academic institutions and research organisations. Demo versions of products, along with relevant fonts and software for each language, have been developed and were made available through free physical CDs seven years ago. The same are now available for download from its data centre website. However, all the offerings are only meant for non-commercial use. Red Hat, Google, Microsoft and various small and medium enterprises have been pioneering their own initiatives to popularise Indic computing. Free and open source groups have also worked tirelessly to improve support for Indian languages.
Involving all language computing stakeholders on a common platform to define the strategic goals and assess the outcomes, as well as releasing the results of basic research, tools and language-related databases under unrestricted licences, would be a great step towards rapid progress.

End note
It has been a great opportunity for me to introduce localisation and explore its various aspects, over the past year, through this magazine. I express my thanks to the OSFY editors and management for their support. I acknowledge and thank all the people and organisations who persevere passionately to bring Indian languages on par with English in the computing arena.

References
[1]     Next Generation Localisation, Josef van Genabith, Localisation Focus, Vol. 8, Issue 1, http://www.localisation.ie/resources/locfocus/vol8issue1.htm
[2]     Pontoon Introduction, Zbigniew Braniecki, http://diary.braniecki.net/2010/04/19/pontoon-introduction/
[3]     TDIL website, http://tdil.mit.gov.in/

All published articles are released under Creative Commons Attribution-NonCommercial 3.0 Unported License, unless otherwise noted.