Anuvaad: Domain-Specific Translation Engine for the Supreme Court of India
About The Customer
Anuvaad is a project conceptualized as a general-purpose, open-domain translation module for Indic languages. The Indian Constitution lists 22 official languages, and the country has 6,000-plus dialects and 55-plus languages with more than 1 million speakers each. Built to bridge the gap between translation and policy action, it acts as an accelerator for knowledge translation.
Knowledge translation, in turn, not only aids dissemination of the existing evidence base but also drives outcomes that are scalable and sustainable in the local socio-cultural context. The project aspires to have high-quality Neural Machine Translation (NMT) models for all major Indian languages. It is open sourced under the MIT license and is funded by the EkStep Foundation.
Overview
Project Anuvaad has assisted the Honorable Supreme Court of India in taking a step forward in providing translated copies of judgments. Reducing the time and effort needed to obtain high-quality translations to and from Indian languages can significantly speed up the decision-making and delivery process.
Tarento as the partner of choice
Tarento’s capabilities with OCR technology and its expertise in AI and NMT models played a significant part in positioning it as a reliable solution provider.
Challenges
The Indian Constitution lists 22 official languages, and the country has 6,000-plus dialects and 55-plus languages with more than 1 million speakers each. More than 20,000 domain-specific documents had to be digitized and then accurately translated. This required a solution that compromised on neither accuracy nor scale.
Solution
Tarento provided a tool that enables high-quality, accurate translations for Indic languages using a scalable, ML-based language solution. It is a fully automated system designed for quick model testing.
We created an end-to-end translation pipeline and toolchains to achieve state-of-the-art translation quality for the selected domain.
- Scanned judgments can be digitized through OCR.
- The system is completely open source, available on GitHub, and based on Neural Machine Translation (NMT).
- Tools to create parallel corpora and improve translation, as well as benchmarking tools to evaluate translation accuracy, were also created as part of the project.
- The system currently supports multiple vernacular Indian languages with high quality digitization and translation capabilities.
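The benchmarking tools mentioned above are not described in detail here. As a rough illustration of how translation accuracy can be scored against a reference translation, the following is a minimal Python sketch of a simplified BLEU-style metric (clipped n-gram precision with a brevity penalty). The function names `ngrams` and `simple_bleu`, the whitespace tokenization, and the sample sentences are assumptions made for this example and are not taken from the Anuvaad codebase.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count every contiguous n-gram in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate, reference, max_n=2):
    """Simplified sentence-level BLEU against a single reference.

    Hypothetical helper for illustration only: whitespace tokenization,
    no smoothing, and n-gram orders 1..max_n weighted equally.
    """
    cand = candidate.split()
    ref = reference.split()
    if not cand:
        return 0.0

    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or clipped == 0:
            return 0.0
        log_precisions.append(math.log(clipped / total))

    # Brevity penalty: penalize candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    # Geometric mean of the n-gram precisions, scaled by the penalty.
    return bp * math.exp(sum(log_precisions) / max_n)


reference = "the court dismissed the appeal"
print(simple_bleu("the court dismissed the appeal", reference))
print(simple_bleu("the court dismissed appeal", reference))
```

An exact match scores 1.0, and the score falls as the candidate drops or changes words. A production benchmark would normally use an established implementation and language-appropriate tokenization for Indic scripts rather than a hand-rolled metric like this one.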
Technologies Involved
- Spark, Apache Airflow, Apache HBase, MongoDB
- ULCA (Universal Language Contribution API) is an open-source, scalable data platform that supports various types of datasets for Indic languages, along with a user interface for interacting with the datasets.
Outcomes and Impact
Our model provides highly accurate domain-specific digitization and translation, and shows a qualitative and quantitative edge over Google Translate on judicial-domain data because of the additional parallel corpus. It shows comparable performance to Google Translate on general sentences.
- We have digitized over 22 million documents and then converted them into multiple languages.
- Anuvaad was deployed by the Supreme Court of India as SUVAS ('Supreme Court Vidhik Anuvaad Software') from November 26, 2019.
- The Bangladesh Supreme Court also launched this Artificial Intelligence (AI) based translation software in February 2021.