Section 7 of Experimentation

Abstract

This section establishes the foundational principles of Automatic Speech Recognition (ASR), detailing the transition from raw acoustic signals to linguistic text through decoding algorithms like Connectionist Temporal Classification (CTC). It further examines the architectural requirements for building ASR systems, including acoustic and language model integration, and evaluates performance via metrics such as Word Error Rate (WER) and Character Error Rate (CER). The core contribution focuses on the NVIDIA Riva ASR framework, analyzing methods for customizing models at inference time through word boosting and offline through vocabulary extension, inverse text normalization, and acoustic model fine-tuning. These techniques are critical for adapting generic speech recognition capabilities to domain-specific applications where out-of-vocabulary terms and specialized syntax demand robust handling beyond standard deployment.

Key Concepts

  • Automatic Speech Recognition Architecture An ASR system functions by converting speech signals into sequences of words using three primary components: a lexicon mapping words to phonemes, a language model estimating word sequence probability, and an acoustic model identifying likely phoneme sequences from signal features. The pipeline involves a speech preprocessor for segmentation and enhancement, a feature extractor generating spectral features, and a decoder that utilizes pretrained models to find the optimal word sequence, as sketched below. This structure reflects the observation that the speech signal has a correlational, but not one-to-one, relationship with the underlying phonetic units.
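
As a rough illustration of how these stages chain together, here is a minimal sketch. All four callables (preprocess, extract_features, acoustic_model, decoder) are hypothetical stand-ins for the components named above, not any specific library's API.

```python
import numpy as np

def transcribe(waveform: np.ndarray,
               preprocess, extract_features, acoustic_model, decoder) -> str:
    """Minimal sketch of the ASR pipeline: preprocess -> features -> decode."""
    segments = preprocess(waveform)                   # segmentation / enhancement
    feats = [extract_features(s) for s in segments]   # e.g. log-mel spectral features
    logits = [acoustic_model(f) for f in feats]       # per-frame label distributions
    texts = [decoder(l) for l in logits]              # lexicon- and LM-guided search
    return " ".join(texts)
```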

  • Human vs. Machine Recognition Paradigms Human recognition involves identifying a sequence of words $W = w_1, \ldots, w_n$ from a sequence of phones $\Phi = \phi_1, \ldots, \phi_m$, heavily dependent on language context and domain expertise. Machine recognition adapts this probabilistic framework by identifying the most likely word sequence $W$ from a sequence of observed acoustic features $X = x_1, \ldots, x_T$. While humans can leverage semantic knowledge to disambiguate sounds lacking context, machines must rely on extracted features and statistical models derived from training data to perform the mapping without inherent conceptual understanding of the language.

  • Connectionist Temporal Classification (CTC) Riva ASR employs the CTC algorithm to address the lack of one-to-one mapping between the speech signal and text labels. CTC computes an output distribution over all possible label alignments with the input sequence, utilizing a blank token (Ɛ) to handle repeating characters or silences, such as mapping “harry” to “h,a,a,r,Ɛ,r,y”. This allows the model to output a distribution over labels for each time step, enabling alignment without explicit forced alignment during training.
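
To make the blank-token mechanics concrete, here is a minimal collapse function: it merges adjacent repeated labels and then removes blanks, so the alignment “h,a,a,r,Ɛ,r,y” recovers “harry”. This is an illustrative sketch, not Riva's implementation.

```python
from itertools import groupby

BLANK = "Ɛ"  # blank token, written Ɛ to match the example above

def collapse_ctc(alignment: list[str]) -> str:
    """Collapse a CTC alignment: merge adjacent repeats, then drop blanks."""
    merged = [label for label, _ in groupby(alignment)]  # "a","a" -> "a"
    return "".join(label for label in merged if label != BLANK)

assert collapse_ctc(list("haar") + [BLANK] + list("ry")) == "harry"
```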

  • Decoding Strategies: Greedy vs. Beam Search Decoders interpret CTC output distributions to generate text sequences. The greedy decoder is computationally efficient, assuming that concatenating the highest-logit character at each time step yields the most probable sequence. Conversely, the beam search decoder is more computationally intensive but more accurate, iteratively computing prefix paths and pruning those with probabilities below a specific threshold to account for dependencies that CTC logits at single time steps might miss.
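
A greedy CTC decoder is then just an argmax per time step followed by the collapse rule. The sketch below assumes a T×V logits matrix and an index-to-character alphabet; a beam-search counterpart appears under Key Equations and Algorithms.

```python
import numpy as np
from itertools import groupby

def greedy_ctc_decode(logits: np.ndarray, alphabet: list[str],
                      blank_id: int = 0) -> str:
    """Greedy CTC decoding: highest-logit label per time step,
    merge adjacent repeats, drop blanks. logits has shape (T, V)."""
    best_path = logits.argmax(axis=-1)                   # (T,) label ids
    merged = [label for label, _ in groupby(best_path)]  # collapse repeats
    return "".join(alphabet[i] for i in merged if i != blank_id)
```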

  • Inference-Time Adaptation (Word Boosting) Riva supports temporary model adaptation via word boosting, which extends vocabulary at inference time to recognize out-of-vocabulary (OOV) terms like proper names or domain-specific terminology. This process requires explicitly passing a list of boosted words with associated weights to the API request, biasing the acoustic model output score for those words. Unlike offline customization, this adaptation is ephemeral and applies only for the duration of the specific request.
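
Based on the published Riva Python client tutorials, per-request word boosting looks roughly like the following; the server address, audio file, boosted terms, and weight are placeholders.

```python
import riva.client

auth = riva.client.Auth(uri="localhost:50051")   # placeholder server address
asr_service = riva.client.ASRService(auth)

config = riva.client.RecognitionConfig(
    language_code="en-US",
    max_alternatives=1,
    enable_automatic_punctuation=True,
)
# Bias decoding toward OOV terms for this request only (placeholder words/weight).
riva.client.add_word_boosting_to_config(config, ["AntiBERTa", "ABLooper"], 20.0)

with open("sample.wav", "rb") as f:              # placeholder audio file
    response = asr_service.offline_recognize(f.read(), config)
print(response.results[0].alternatives[0].transcript)
```

Because the boost list lives in the request config, it must be resent with every call; dropping it reverts the model to its default vocabulary behavior.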

  • Offline Model Customization and Lexicon Management Permanent vocabulary expansion requires modifying the lexicon file, which maps vocabulary words to tokenized forms such as SentencePiece tokens. This can be executed at build time for custom models or post-deployment by modifying the lexicon file and restarting the server. The Flashlight decoder, the lexicon-based default in Riva, strictly restricts generation to the provided vocabulary, contrasting with greedy decoders that can produce any character sequence but lack lexical constraints.
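
Adding a word permanently means appending its tokenized form to the lexicon file. Assuming the deployed model ships a SentencePiece tokenizer model (the path, word, and the word-tab-tokens line layout here are assumptions; check the deployed lexicon for the exact format), the tokens for a new entry could be produced like this:

```python
import sentencepiece as spm

# Placeholder path: the tokenizer model distributed with the ASR model.
sp = spm.SentencePieceProcessor(model_file="tokenizer.model")

word = "AntiBERTa"  # placeholder OOV term
tokens = sp.encode(word.lower(), out_type=str)
# One common lexicon layout: word, tab, space-separated tokens.
print(word.lower() + "\t" + " ".join(tokens))
```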

  • Language Model Training and Interpolation Custom language models provide permanent improvements for domain-specific phrase recognition through n-gram estimation, where $P(w_k \mid w_{k-1}, \ldots, w_{k-n+1})$ approximates the probability of each word given the preceding $n-1$ words. Riva supports n-gram models trained via KenLM, which can be mixed with general-domain language models via interpolation. This allows the system to balance general language fluency with specific domain knowledge without losing broad capabilities.
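
One way to picture interpolation is a per-word linear mix of a domain model and a general model, $P = \lambda P_{\text{domain}} + (1-\lambda) P_{\text{general}}$. The sketch below uses the kenlm Python bindings to score a sentence under such a mix; the model paths and the weight λ are placeholders, and Riva/KenLM actually perform the mixing at model-build time rather than per query.

```python
import math
import kenlm

domain_lm = kenlm.Model("domain.arpa")    # placeholder paths
general_lm = kenlm.Model("general.arpa")
LAMBDA = 0.3                              # placeholder interpolation weight

def interpolated_logprob(sentence: str) -> float:
    """Per-word linear interpolation of two n-gram LMs, summed in log10 space."""
    d_scores = [s for s, _, _ in domain_lm.full_scores(sentence)]
    g_scores = [s for s, _, _ in general_lm.full_scores(sentence)]
    total = 0.0
    for d, g in zip(d_scores, g_scores):  # full_scores yields log10 probs per word
        p = LAMBDA * 10 ** d + (1 - LAMBDA) * 10 ** g
        total += math.log10(p)
    return total
```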

  • Acoustic Model Fine-Tuning and Data Requirements Adapting the acoustic model involves either training from scratch, which requires thousands of hours of data, or fine-tuning existing models. Fine-tuning is recommended with data on the order of 100 hours for sufficient performance, whereas NeMo specifically suggests several hundred hours for lossless fine-tuning to avoid catastrophic forgetting. Low-resource adaptation (e.g., 10 hours) necessitates mixing domain data with a larger base dataset to prevent the model from sacrificing general-domain accuracy, as in the sketch below.
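
To illustrate the low-resource recipe, here is a hypothetical helper that blends a small domain manifest with a sample of a larger base manifest. It assumes NeMo-style JSON-lines manifests with a "duration" field; the function name, paths, and mixing ratio are all this sketch's inventions.

```python
import json
import random

def mix_manifests(domain_path: str, base_path: str, out_path: str,
                  base_multiple: float = 9.0, seed: int = 0) -> None:
    """Pair N hours of domain data with ~base_multiple * N hours sampled
    from the base set, to reduce catastrophic forgetting during fine-tuning."""
    rng = random.Random(seed)
    with open(domain_path) as f:
        domain = [json.loads(line) for line in f]
    with open(base_path) as f:
        base = [json.loads(line) for line in f]
    target_hours = sum(x["duration"] for x in domain) / 3600 * base_multiple
    rng.shuffle(base)
    picked, hours = [], 0.0
    for x in base:
        if hours >= target_hours:
            break
        picked.append(x)
        hours += x["duration"] / 3600
    with open(out_path, "w") as f:
        for x in domain + picked:
            f.write(json.dumps(x) + "\n")
```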

Key Equations and Algorithms

  • Posterior Probability Formulation The speech recognition problem is formulated as choosing the word sequence $\hat{W}$ for the phone sequence $\Phi$ that maximizes $P(W \mid \Phi)$. This relies on the Bayes decomposition into $P(\Phi \mid W)\,P(W)$, where $P(\Phi \mid W)$ is a Markov model derived from a lexicon and $P(W)$ is the language model representing word sequence probability. This equation defines the probabilistic search space that the decoder navigates to identify the most likely transcription.

  • Word Error Rate Calculation The standard evaluation metric for transcription accuracy is defined as $\mathrm{WER} = (S + D + I) / N$. Here, $N$ represents the total number of words in the reference transcription, $S$ is the count of substituted words, $D$ is the count of deleted words, and $I$ is the count of wrongly inserted words. Alignment of predicted and reference transcripts is required before computation due to potential length differences.
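
A reference implementation via word-level edit distance is shown below; the alignment falls out of the dynamic program, so no separate alignment step is needed in code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (S + D + I) / N via Levenshtein distance over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / len(ref)

# Example: one substitution over four reference words -> WER = 0.25
assert word_error_rate("the cat sat down", "the cat sat dawn") == 0.25
```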

  • CTC Alignment with Blank Tokens CTC alignments handle non-1:1 mapping by inserting blank tokens between repeating characters or silences. For example, the sequence “harry” might align as “h,a,a,r,Ɛ,r,y”, ensuring that repeated characters are not collapsed incorrectly. The algorithm modifies the standard path computation to account for these blank tokens, effectively expanding the output space to match acoustic durations.

  • WFST Composition Formula The final decoding graph is constructed via the composition of the Token ($T$), Lexicon ($L$), and Grammar ($G$) transducers. The composition is represented as $T \circ L \circ G$. This mathematical formulation defines how the token-level mapping, vocabulary mapping, and language model are composed into a single weighted finite state transducer for efficient decoding.
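
With OpenFst-style tooling such as pynini, the composition itself is a one-liner; constructing $T$, $L$, and $G$ is the real work and is assumed already done here. This is a conceptual sketch, not Riva's build procedure.

```python
import pynini

def build_decoding_graph(T: pynini.Fst, L: pynini.Fst, G: pynini.Fst) -> pynini.Fst:
    """Compose token, lexicon, and grammar transducers into one decoding
    graph (T o L o G), then optimize. T, L, G are assumed prebuilt Fsts."""
    return pynini.compose(T, pynini.compose(L, G)).optimize()
```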

  • Beam Search Pruning Procedure The beam search algorithm iteratively computes the sum of paths corresponding to the most probable text sequence from prefix paths. It manages computational complexity by pruning all paths with probabilities less than a particular value. This optimization ensures that the decoder maintains a manageable set of hypotheses while avoiding the sub-optimality of the greedy approach which ignores context from previous time steps.
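
For concreteness, a compact prefix beam search over CTC outputs is sketched below, in the spirit of Hannun's decoding tutorial. The names are this sketch's choices, and it prunes by keeping the top-k prefixes rather than the absolute probability threshold described above.

```python
from collections import defaultdict

def ctc_prefix_beam_search(probs, alphabet, blank_id=0, beam_width=8):
    """Simplified CTC prefix beam search.

    probs: sequence of per-timestep probability vectors over the label set.
    Each prefix tracks two masses: ending in blank (pb) and non-blank (pnb).
    """
    beams = {(): (1.0, 0.0)}  # empty prefix: all mass "ends in blank"
    for frame in probs:
        next_beams = defaultdict(lambda: (0.0, 0.0))
        for prefix, (pb, pnb) in beams.items():
            for c, p in enumerate(frame):
                if c == blank_id:
                    nb, nnb = next_beams[prefix]
                    next_beams[prefix] = (nb + (pb + pnb) * p, nnb)
                    continue
                extended = prefix + (c,)
                nb, nnb = next_beams[extended]
                if prefix and prefix[-1] == c:
                    # Repeat: extending the prefix requires an intervening blank...
                    next_beams[extended] = (nb, nnb + pb * p)
                    # ...otherwise the repeat merges into the same prefix.
                    sb, snb = next_beams[prefix]
                    next_beams[prefix] = (sb, snb + pnb * p)
                else:
                    next_beams[extended] = (nb, nnb + (pb + pnb) * p)
        # Prune: keep only the beam_width most probable prefixes.
        beams = dict(sorted(next_beams.items(),
                            key=lambda kv: sum(kv[1]), reverse=True)[:beam_width])
    best = max(beams, key=lambda k: sum(beams[k]))
    return "".join(alphabet[i] for i in best)
```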

Key Claims and Findings

  • Human-level transcription accuracy is typically estimated at greater than 95%, with specialized domains often requiring accuracies closer to 99%.
  • Word Error Rate is often too simplistic for specific use cases like slot-filling, where Intent Classification Error Rate (ICER) or Entity Error Rate (EER) are more appropriate metrics.
  • Inference-time word boosting allows the ASR engine to recognize OOV words by giving them a higher score during decoding, but requires explicit specification at every API request.
  • Fine-tuning acoustic models with insufficient data, such as approximately 10 hours, carries a high risk of catastrophic forgetting unless the domain data is mixed with larger base datasets.
  • Inverse Text Normalization (ITN) is a critical post-processing step that converts raw spoken-form output into readable written text using weighted finite state transducers (a brief usage sketch follows this list).
  • The Flashlight decoder is lexicon-bound and cannot generate words absent from the vocabulary file, whereas greedy decoders can produce arbitrary character sequences.
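
The WFST-based ITN referenced above is available standalone in NeMo's text-processing package; usage is roughly as follows, with the example sentence and expected output being illustrative.

```python
from nemo_text_processing.inverse_text_normalization.inverse_normalize import InverseNormalizer

itn = InverseNormalizer(lang="en")  # builds the English ITN WFST grammars
spoken = "it costs one hundred twenty three dollars"  # illustrative input
print(itn.inverse_normalize(spoken, verbose=False))
# Expected written form, roughly: "it costs $123"
```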

Terminology

  • Phone: The actual speech sound as physically uttered, constituting the raw acoustic unit produced by a speaker before language context is applied.
  • Phoneme: The abstract perceptual unit of sound, represented symbolically within the lexicon to map words to sequences of sound units.
  • Lexicon: A pronunciation dictionary that maps words to sequences of phonemes, letters, or other units, defining the vocabulary of a specific domain.
  • Word Error Rate (WER): A metric calculated by aligning predicted and reference transcriptions to count substitutions, deletions, and insertions relative to the total word count.
  • Connectionist Temporal Classification (CTC): An algorithm used for sequence discriminative tasks where labels are assigned to portions of a signal without a strict one-to-one mapping to input frames.
  • Out Of Vocabulary (OOV): Words that are not present in the decoder’s provided vocabulary file, often requiring word boosting or vocabulary expansion to be recognized.
  • Inverse Text Normalization (ITN): A process using finite state transducers to convert the raw spoken output of an ASR model into its formal written form for improved readability.
  • Weighted Finite State Transducer (WFST): A graph structure used to represent grammars, lexicons, and token mappings, allowing for composition of language and acoustic models.
  • Language Model (LM): A component that estimates the probability distribution over groups of words, guiding the decoder toward more likely word sequences based on context.
  • Catastrophic Forgetting: A phenomenon in fine-tuning where adapting a model to a small new dataset causes significant loss of accuracy in the general domains it was originally trained on.