
Audio Processing for Minority Language Heritage

RESEARCH DESCRIPTION

The Audio Processing for Minority Language Heritage project aims to preserve, protect, and enhance minority languages, which are considered intangible heritage at risk of extinction. Particular attention is given to the Germanic linguistic islands of Friuli Venezia Giulia, for which two digital archives have been created: ArDLiS for the Saurano dialect spoken in Sauris, and ArDLiT for the Timavese dialect spoken in Timau, developed by the Communication and Linguistics Laboratory of DIUM.
The project combines humanities methodologies with computational approaches, developing semi-automatic and automatic transcription and annotation methods that rely on small corpora of manually transcribed data. Pre-trained Transformer models such as Wav2Vec2 and XLS-R provide general multilingual speech representations, which can then be fine-tuned for specific tasks, including automatic transcription, translation, or speech synthesis. In the case of Saurano, the corpus comprises approximately two hours of controlled speech from six speakers.
The results contribute to the preservation of minority languages, to the development of digital linguistic tools, and to advances in automatic speech recognition (ASR) research, including the evaluation of small corpora and the optimization of model hyperparameters.
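
As an illustration of how ASR output is commonly evaluated, the sketch below computes the word error rate (WER): the number of word substitutions, deletions, and insertions needed to turn the automatic transcription into the manual reference, divided by the number of reference words. The page does not state which metric the project uses, so WER is an assumption here, and the function names and the placeholder sentences are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    # prev[j] holds the distance between the processed part of ref
    # and the first j tokens of hyp.
    prev = list(range(len(hyp) + 1))
    for i, ref_token in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_token in enumerate(hyp, start=1):
            cost = 0 if ref_token == hyp_token else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference token missing in hyp)
                curr[j - 1] + 1,     # insertion (extra token in hyp)
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def word_error_rate(references, hypotheses):
    """Corpus-level WER: total word edits divided by total reference words."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have the same length")
    total_errors = 0
    total_words = 0
    for ref, hyp in zip(references, hypotheses):
        ref_words = ref.split()
        hyp_words = hyp.split()
        total_errors += edit_distance(ref_words, hyp_words)
        total_words += len(ref_words)
    if total_words == 0:
        raise ValueError("reference transcriptions contain no words")
    return total_errors / total_words


# Placeholder tokens, not real Saurano or Timavese data.
manual = ["w1 w2 w3 w4"]
automatic = ["w1 wX w3"]
wer = word_error_rate(manual, automatic)  # one substitution + one deletion over 4 words
```

With only about two hours of speech, a corpus-level figure like this is sensitive to individual utterances, so it is usually worth inspecting per-speaker or per-utterance scores as well.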

Technological Tools and Methods

  • Transformer neural architectures for multilingual text and speech models (Wav2Vec2, XLS-R)
  • Hugging Face APIs for model integration and experimentation
  • Semi-automatic and automatic methods for speech transcription and annotation
  • Controlled speech corpus for training and validation
  • Fine-tuning of general models for specific tasks (transcription, translation, speech synthesis)
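
One preparatory step that fine-tuning a speech model for transcription typically requires is deriving an output vocabulary from the manually transcribed corpus. The sketch below builds a character-level vocabulary of the kind used with CTC-based Wav2Vec2 fine-tuning. The use of CTC, the function name, the lowercasing, and the special token names ("|" as word delimiter, "[UNK]", "[PAD]") are assumptions taken from common Hugging Face practice, not details confirmed by the project.

```python
import json


def build_ctc_vocab(transcripts):
    """Map every character found in the transcripts to an integer id."""
    chars = set()
    for text in transcripts:
        chars.update(text.lower())
    # The space is represented by a visible word delimiter instead.
    chars.discard(" ")
    chars.discard("|")
    vocab = {ch: idx for idx, ch in enumerate(sorted(chars))}
    vocab["|"] = len(vocab)      # word delimiter (assumed convention)
    vocab["[UNK]"] = len(vocab)  # characters unseen during training
    vocab["[PAD]"] = len(vocab)  # padding, often also used as the CTC blank
    return vocab


# Placeholder strings, not real corpus data.
vocab = build_ctc_vocab(["ab a", "b c"])
vocab_json = json.dumps(vocab, ensure_ascii=False, indent=2)
```

The resulting JSON could then be written to a file and passed to a tokenizer, so that the fine-tuned model predicts exactly the characters attested in the manual transcriptions, including any dialect-specific diacritics.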

OPPORTUNITIES FOR PARTICIPATION IN THE PROJECT

  • Analysis and transcription of voice recordings to enrich the corpus
  • Experimentation and optimization of ASR models
  • Creation of tools and methods for the digital enhancement of linguistic heritage
  • Interdisciplinary collaborations between linguistics, computer science, and Digital Humanities

PROJECT TEAM