Exploring Software: Video Subtitle Translation

This column looks at how to make MOOC videos accessible to non-English speakers.

The news that California State Universities were tying up with Udacity to offer inexpensive MOOCs (massive open online courses) for credit was not surprising. The only surprise is the speed with which the changes are taking place. I am inclined to agree with the analysis on TechCrunch (http://goo.gl/fRppX) that this online project is going to end college education as we know it. The advantage is that a wealth of learning options will become available to all. The disadvantage is that most of the content will be in English.
Can videos in English be made accessible to students who are not very comfortable with the language? They would benefit a lot if subtitles (http://en.wikipedia.org/wiki/Subtitling) were provided in their language.

So how does one go about getting content subtitled in the languages that learners need? The time and effort required to translate the content into the vast number of languages would be huge. Crowd-sourcing can be an answer, for example, by using http://www.amara.org/.

Subtitling/captioning and the Web
Video players can merge video frames with subtitles while playing. There are numerous formats available for subtitling; the basic content, though, is similar. Each subtitle is a line of text to be displayed, along with information about when to start the display and when to stop. The best way to provide this information is by specifying the start time and either the end time or the duration of each subtitle. This makes the subtitle file independent of the frame rate at which a video file may be created. One common format is SubRip (.srt extension), which was the basis of another useful format, WebVTT, which may become widespread, as it is being developed as a W3C specification. The W3C also has a competing timed text (TTML) standard, an XML format intended to ensure interoperability of streaming video and captions on the Web.
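To make the structure of a subtitle file concrete, here is a minimal Python sketch that reads SubRip-style text into a list of cues, each with a start time, an end time and the text to display. The names Cue, parse_srt and SAMPLE are made up for this illustration, and the sample cues are invented; real .srt files may carry extra formatting that this sketch ignores.

```python
import re
from dataclasses import dataclass

# A SubRip timing line looks like "00:00:01,000 --> 00:00:04,500".
TIMING = re.compile(
    r"(\d+):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2})[,.](\d{3})"
)


@dataclass
class Cue:
    start: float  # seconds from the beginning of the video
    end: float    # seconds from the beginning of the video
    text: str     # the line(s) to display


def _seconds(hours, minutes, seconds, millis):
    return int(hours) * 3600 + int(minutes) * 60 + int(seconds) + int(millis) / 1000


def parse_srt(data):
    """Return the cues found in SubRip-formatted text (illustrative only)."""
    cues = []
    # Cues are separated from each other by blank lines.
    for block in re.split(r"\n\s*\n", data.strip()):
        lines = block.splitlines()
        for i, line in enumerate(lines):
            match = TIMING.search(line)
            if match:
                parts = match.groups()
                text = "\n".join(lines[i + 1:])
                cues.append(Cue(_seconds(*parts[:4]), _seconds(*parts[4:]), text))
                break
    return cues


# Invented sample data, for illustration only.
SAMPLE = """1
00:00:01,000 --> 00:00:04,500
Welcome to the course.

2
00:00:05,000 --> 00:00:08,250
Today we look at subtitles.
"""

cues = parse_srt(SAMPLE)
for cue in cues:
    print(f"{cue.start:.3f}s to {cue.end:.3f}s: {cue.text}")
```

Because the times are stored as offsets in seconds rather than as frame numbers, the same cues apply whatever the frame rate of the video.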
The HTML5 video element, meanwhile, supports a track element, which can be used to specify a subtitle file, for example in WebVTT format, and so meet the needs of streaming video with captions in a language chosen by the user.
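Since WebVTT grew out of SubRip, an existing .srt file can usually be turned into something a track element can load with very little work. The Python sketch below shows the two changes that matter for simple files: a WEBVTT header line followed by a blank line, and a period instead of a comma before the milliseconds in each timestamp. The function name srt_to_webvtt and the sample cue are invented for this example, and SubRip styling tags, if present, are passed through untouched.

```python
import re

# Matches a SubRip timestamp such as 00:01:02,345.
SRT_TIME = re.compile(r"(\d{2,}:\d{2}:\d{2}),(\d{3})")


def srt_to_webvtt(srt_text):
    """Convert simple SubRip text to WebVTT text (illustrative only)."""
    out = ["WEBVTT", ""]  # required header, then a blank line
    for line in srt_text.strip().splitlines():
        if "-->" in line:
            # Only timing lines are changed; commas in the dialogue stay.
            line = SRT_TIME.sub(r"\1.\2", line)
        out.append(line.rstrip())
    return "\n".join(out) + "\n"


# Invented sample data, for illustration only.
SAMPLE_SRT = """1
00:00:01,000 --> 00:00:04,500
Welcome, everyone.
"""

vtt = srt_to_webvtt(SAMPLE_SRT)
print(vtt)
```

The result could be saved under a name such as lecture.hi.vtt (a hypothetical file name) and referenced from the track element's src attribute, with srclang set to the language of the subtitles.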
It is common these days to have same-language subtitles (http://en.wikipedia.org/wiki/Same_language_subtitling) for television and video. The obvious advantage is that it makes content accessible to the hearing-impaired. Another advantage is its educational value: it helps viewers practise reading as an incidental and subconscious part of entertainment. On the Web, however, it has an even greater significance, which is probably why Google has been in the forefront of the WebVTT format: it allows video content to be searched easily!
Machine translation of captions
Manual translation is time-consuming and expensive, even with crowd-sourcing. The quantum of content is too large to be translated within a useful timeframe for all languages of interest. Furthermore, the content of technical courses is likely to be unambiguous, and not open to subtle differences in the interpretation of words and phrases. Machine translation may provide the answer.

If you search the Web for open source machine translation engines, you will find Moses (http://www.statmt.org/moses/), a statistical translator, and Apertium (http://www.apertium.org/), a rule-based translator. Moses’ capabilities are, in principle, similar to the software used by Google and Microsoft. However, it does not come with language models and datasets for carrying out translation—so, for it to be useful, you need to provide language models and training datasets. Apertium, however, comes with translation capabilities for a number of language pairs. The current list and status can be seen at http://wiki.apertium.org/wiki/List_of_language_pairs.
Unfortunately, the progress in pure open source tools is likely to be slow. The reason is fairly obvious; Web-based translators from Google, Microsoft and others provide excellent functional alternatives. These sites have a wealth of data, e.g., pages from multi-lingual sites, which may be used for training and fine-tuning translations.

If same-language subtitles are available, you may rely on machine translation to generate subtitles in any language for which a machine translator is available. YouTube provides this feature for translated captions on its site by using Google Translate, e.g., http://www.youtube.com/watch?v=1St0tJVGCW8. So, the easiest option is to use the Google or Bing translators on the Web. Several open source tools were created to translate subtitles using the Google Translate API. These tools stopped working after changes in the usage policy of Google’s Translate API, but they may be modified to use Microsoft’s translation API instead.
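The idea of machine-translating a subtitle file can be sketched in a few lines of Python: keep the cue numbers and timing lines as they are, and pass only the displayed text through a translator. In the sketch below, translate_srt, apertium_translate and the sample cues are names and data invented for this illustration. The translator is a plain function argument, so any engine can be plugged in; apertium_translate assumes that the apertium command-line tool and the chosen language pair (en-es here) are installed, and it is not exercised in the demonstration, which uses str.upper as a stand-in translator.

```python
import re
import subprocess


def apertium_translate(text, pair="en-es"):
    """Pipe text through the Apertium command-line tool.

    Assumption: the `apertium` executable and the given language
    pair are installed on this machine.
    """
    result = subprocess.run(
        ["apertium", pair], input=text,
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def translate_srt(srt_text, translate):
    """Translate the text of every cue, leaving numbers and timings alone."""
    out = []
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.splitlines()
        # Find the timing line; everything after it is the text to display.
        idx = next((i for i, line in enumerate(lines) if "-->" in line), None)
        if idx is None:
            out.append(block)  # not a cue; keep it unchanged
            continue
        header = lines[: idx + 1]
        # Join wrapped lines so the translator sees the whole phrase.
        text = " ".join(lines[idx + 1:])
        out.append("\n".join(header + ([translate(text)] if text else [])))
    return "\n\n".join(out) + "\n"


# Invented sample data, for illustration only.
SAMPLE = """1
00:00:01,000 --> 00:00:04,500
Welcome to
the course.

2
00:00:05,000 --> 00:00:08,250
Today we look at subtitles.
"""

# str.upper stands in for a real translator in this demonstration.
translated = translate_srt(SAMPLE, str.upper)
print(translated)
```

With Apertium installed, the call would become translate_srt(SAMPLE, apertium_translate). Joining the lines of a cue before translation gives the engine a complete phrase to work with, at the cost of losing the original line breaks; sentences that span several cues would still be translated piecemeal.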
We can hope that MOOC videos will come with same-language captions, so that machine translation can spread this knowledge to an even wider group of learners.

A side lesson: The sudden changes in the usage policy for the Google Translate API reinforce the need for pure open source solutions for translation applications as well as for language models and translation datasets. The generosity of commercial sites will be aligned with their commercial interests and cannot be taken for granted.

All published articles are released under Creative Commons Attribution-NonCommercial 3.0 Unported License, unless otherwise noted.