[Thesis defence] 12/12/2025 – Nathan Griot: «Robust text-dependent speaker verification through temporal alignment, multitask learning, adversarial and self-supervised learning» (UPR LIA)
Mr Nathan GRIOT will publicly defend his thesis entitled «Robust text-dependent speaker verification through temporal alignment, multitask learning, adversarial and self-supervised learning», supervised by Mr Driss MATROUF and Mr Jean-François BONASTRE, on Friday 12 December 2025.
Date and place
Defence scheduled for Friday, 12 December 2025 at 3:00 p.m.
Location: CERI, 339 Chemin des Meinajaries, 84000 Avignon
Room: Ada Lecture Theatre
Discipline
Computer Science
Laboratory
UPR 4128 LIA - Avignon Computing Laboratory
Composition of the jury
| Name | Institution | Role |
| --- | --- | --- |
| Mr Driss MATROUF | Avignon University | Thesis supervisor |
| Ms Irina ILLINA | University of Lorraine | Rapporteur |
| Mr Massimiliano TODISCO | EURECOM | Rapporteur |
| Mr Jean-François BONASTRE | Avignon University | Thesis co-director |
| Mr Raphael BLOUET | Ardelan | Thesis supervisor |
| Mr Anthony LARCHER | Le Mans University | Examiner |
| Mr Reda DEHAK | LRE – EPITA | Examiner |
| Ms Martine ADDA-DECKER | CNRS | Examiner |
Summary
Speaker verification is a natural and secure form of biometric authentication. Among its variants, text-dependent speaker verification (TD-SV) offers enhanced protection by validating both the speaker's identity and the spoken content, thus combining the advantages of a biometric characteristic and a knowledge factor. Despite these advantages, TD-SV has attracted less interest than its text-independent counterpart. This thesis addresses several key challenges: the lack of suitable data, the entanglement of voice and text information, and the need for better generalisation across languages and acoustic conditions. These issues are addressed through three main contributions.

First, we explore deep neural networks, notably ResNet34 combined with Attentive Statistical Pooling, for textual validation. Analysis of intermediate activations shows that they retain relevant linguistic information. On this basis, we propose the Comparative Attentive System (CAS), a temporal-alignment architecture that combines frame-level activations with Dynamic Time Warping (DTW) to capture subtle differences in content and timing between two utterances.

Second, we introduce a unified multitask architecture leveraging self-supervised learning (SSL) via WavLM and a teacher-student scheme. The teacher model, trained on sentence-level textual validation, guides the student to jointly learn voice and text representations from data annotated only by speaker. This approach enables the student model to acquire lexical discrimination capabilities without explicit supervision and to generalise across multiple languages. We also propose an adversarial method, based on gradient reversal, to remove unwanted information that could hinder performance.

Finally, we investigate synthetic data augmentation using modern voice-cloning techniques to overcome the limitations of existing corpora.
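The CAS compares frame-level activations of two utterances via Dynamic Time Warping. As an illustrative sketch only (the function name `dtw_cost` and the Euclidean frame distance are assumptions, not the thesis's implementation), the core DTW recurrence over two embedding sequences looks like this:

```python
import numpy as np

def dtw_cost(x, y):
    """Accumulated DTW alignment cost between two frame sequences.

    x: (n, d) array of frame-level embeddings for utterance 1.
    y: (m, d) array of frame-level embeddings for utterance 2.
    """
    n, m = len(x), len(y)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(x[i - 1] - y[j - 1])  # local frame distance
            cost[i, j] = d + min(cost[i - 1, j],      # skip a frame of x
                                 cost[i, j - 1],      # skip a frame of y
                                 cost[i - 1, j - 1])  # match both frames
    return cost[n, m]
```

A low cost indicates that the two utterances can be warped onto each other with little content or timing mismatch; the same alignment path can also be used to pair frames before comparing their activations.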
Experiments show that models trained on these synthetic data achieve performance close to that obtained with real data, demonstrating the potential of this approach for large-scale resource creation. Comprehensive evaluations on the Common Voice, VoxCeleb and DeepMine databases show significant reductions in equal error rate (EER) and tandem equal error rate (t-EER), outperforming strong baseline systems in a multilingual context. These results confirm that alignment-based attention, self-supervised multitask learning, adversarial training, and the use of synthetic data are effective strategies for building robust, generalisable, and data-efficient TD-SV systems.
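The EER reported above is the operating point at which the false-acceptance rate (impostor trials accepted) equals the false-rejection rate (genuine trials rejected). A minimal sketch of how it can be estimated from trial scores (illustrative only; `compute_eer` and the simple threshold sweep are assumptions, not the evaluation code used in the thesis):

```python
import numpy as np

def compute_eer(target_scores, nontarget_scores):
    """Estimate the Equal Error Rate from genuine and impostor trial scores."""
    target = np.asarray(target_scores)
    nontarget = np.asarray(nontarget_scores)
    # Sweep every observed score as a candidate decision threshold.
    thresholds = np.sort(np.concatenate([target, nontarget]))
    best_gap, eer = np.inf, None
    for t in thresholds:
        frr = np.mean(target < t)       # genuine trials rejected
        far = np.mean(nontarget >= t)   # impostor trials accepted
        if abs(far - frr) < best_gap:   # keep threshold where the rates meet
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer
```

For example, perfectly separated score distributions give an EER of 0, while fully overlapping ones give 0.5 (chance level).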
Keywords: speaker verification, text-dependent, neural networks
Updated on 1 December 2025