publications | Ashwin Sankar

2024

Rasa: Building Expressive Speech Synthesis Systems for Indian Languages in Low-resource Settings

Praveen Srinivasa Varadhan, Ashwin Sankar, Giri Raju, and 1 more author

In Proc. INTERSPEECH 2024, 2024

We release Rasa, the first multilingual expressive TTS dataset for any Indian language, which contains 10 hours of neutral speech and 1-3 hours of expressive speech for each of the 6 Ekman emotions covering 3 languages: Assamese, Bengali, & Tamil. Our ablation studies reveal that just 1 hour of neutral and 30 minutes of expressive data can yield a Fair system as indicated by MUSHRA scores. Increasing neutral data to 10 hours, with minimal expressive data, significantly enhances expressiveness. This offers a practical recipe for resource-constrained languages, prioritizing easily obtainable neutral data alongside smaller amounts of expressive data. We show the importance of syllabically balanced data and pooling emotions to enhance expressiveness. We also highlight challenges in generating specific emotions, e.g., fear and surprise.

Enhancing Out-of-Vocabulary Performance of Indian TTS Systems for Practical Applications through Low-Effort Data Strategies

Srija Anand, Praveen Srinivasa Varadhan, Ashwin Sankar, and 2 more authors

In Proc. INTERSPEECH 2024, 2024

Publicly available TTS datasets for low-resource languages like Hindi and Tamil typically contain 10-20 hours of data, leading to poor vocabulary coverage. This limitation becomes evident in downstream applications, where domain-specific vocabulary, coupled with frequent code-mixing with English, results in many OOV words. To highlight this problem, we create a benchmark containing OOV words from several real-world applications. Indeed, state-of-the-art Hindi and Tamil TTS systems perform poorly on this OOV benchmark, as indicated by intelligibility tests. To improve the model’s OOV performance, we propose a low-effort and economically viable strategy to obtain more training data. Specifically, we propose using volunteers, as opposed to high-quality voice artists, to record words containing character bigrams unseen in the training data. We show that with such inexpensive data, the model’s performance improves on OOV words without affecting voice quality or in-domain performance.
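
As a rough illustration of the selection idea in this abstract (picking words that contain character bigrams unseen in the training data), here is a minimal Python sketch. It is not the authors' implementation: the function names, the greedy update of the seen set, and the use of Unicode code points as "characters" are all assumptions made for illustration. For Indic scripts, a real pipeline may well segment text differently.

```python
def char_bigrams(word):
    """Return the set of adjacent character pairs in a word.

    Assumption: a "character" is a Unicode code point, which may not
    match how a TTS front end segments Indic scripts.
    """
    return {word[i:i + 2] for i in range(len(word) - 1)}


def select_words_with_unseen_bigrams(training_words, candidate_words):
    """Pick candidate words that add at least one unseen bigram.

    Hypothetical helper: the greedy update below (marking a word's
    bigrams as seen once it is selected) is an assumption meant to
    keep the recording list short, not a detail taken from the paper.
    """
    seen = set()
    for word in training_words:
        seen |= char_bigrams(word)

    selected = []
    for word in candidate_words:
        unseen = char_bigrams(word) - seen
        if unseen:
            selected.append(word)
            seen |= unseen
    return selected


if __name__ == "__main__":
    training = ["cat", "hat"]
    candidates = ["chat", "that", "dog", "cat"]
    print(select_words_with_unseen_bigrams(training, candidates))
```

In this toy example, "cat" is skipped because every bigram in it already appears in the training words, while the other candidates each contribute at least one new bigram.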

2022

Comparative Study of Transformer Models

Ashwin Sankar and R. Dhanalakshmi

In Databases Theory and Applications, 2022

Machine Reading Comprehension (MRC) is the process by which computers, or machines, are taught to understand a paragraph, more technically called a context. Like humans, machines also need to be evaluated on their understanding through question answering. MRC is one of the formidable sub-domains of Natural Language Processing (NLP) and has seen considerable progress over the years. In recent years, many novel datasets have tried to challenge MRC models with inference-based question answering. With the advancement in NLP, many models have surpassed human-level performance on these datasets, albeit ignoring the obvious disparity between genuine human-level performance and state-of-the-art performance. This highlights the need for attention on the collective improvement of existing datasets, metrics, and models towards “real” comprehension. Addressing the lack of sanity in the domain, this paper performs a comparative study of various transformer-based models and tries to highlight the success factors of each model. Subsequently, we discuss an MRC model that performs comparatively better, if not the best, on question answering and give directions for future research.