Real and synthetic Punjabi speech datasets for automatic speech recognition

Date

2024-02

Publisher

Elsevier Inc

Rights

(c) 2023 The Author/s
CC BY 4.0

Abstract

Automatic speech recognition (ASR) has been an active area of research. Training with large annotated datasets is the key to the development of robust ASR systems. However, most available datasets are focused on high-resource languages like English, leaving a significant gap for low-resource languages. Punjabi is one of these languages: despite its large number of speakers, it lacks high-quality annotated datasets for accurate speech recognition. To address this gap, we introduce three labeled Punjabi speech datasets: Punjabi Speech (a real speech dataset) and Google-synth and CMU-synth (synthesized speech datasets). The Punjabi Speech dataset consists of read speech recordings captured in various environments, including both studio and open settings. The Google-synth dataset is synthesized using Google's Punjabi text-to-speech cloud services. The CMU-synth dataset is created using the Clustergen model available in the Festival speech synthesis system developed by CMU. These datasets aim to facilitate the development of accurate Punjabi speech recognition systems, bridging the resource gap for this important language.

Keywords

Automatic speech recognition, Punjabi language, Speech dataset, Low-resource languages

Citation

Singh S, Hou F, Wang R. (2024). Real and synthetic Punjabi speech datasets for automatic speech recognition. Data Brief. 52. February 2024. (pp. 109865-).

Creative Commons license

Except where otherwise noted, this item's license is described as (c) 2023 The Author/s