Abstract
This presentation outlines core performance optimization strategies for deep learning systems, targeting ML engineers, researchers, and infrastructure architects. It systematically addresses mixed-precision training, model quantization, neural network pruning, and data center energy efficiency, with contextual examples in automatic speech recognition. The central takeaway is that sustainable model performance at scale requires a coordinated approach: leveraging lower-precision arithmetic for compute/memory savings, applying structured compression techniques to preserve accuracy, and aligning software optimization with power-aware hardware and cooling infrastructure.
Key Concepts
- Mixed-precision training using FP16 forward/backward passes with FP32 master weight accumulation and loss scaling
- Post-training quantization (PTQ) and range mapping techniques for converting FP32 models to INT8 or lower
- Combinatorial optimization-based neural network pruning (CHITA) for accuracy-preserving weight removal
- Distinction between energy efficiency (tasks completed per kWh of energy consumed) and power efficiency (throughput per watt of power drawn) in compute infrastructure
- Sequence alignment in ASR using the Connectionist Temporal Classification (CTC) blank token mechanism
Key Points by Section
- Mixed Precision Training: Replacing uniform FP32 with FP16 activations, weights, and gradients nearly halves memory bandwidth requirements and accelerates arithmetic on modern GPUs, provided that gradient underflow is prevented via loss scaling and FP32 master weights are maintained for optimizer updates (see the training-loop sketch after this list).
- CTC & ASR Alignment: Models like Wav2Vec2 and HuBERT map overlapping audio windows to character sequences without requiring sample-level timing labels, relying on CTC’s blank token to resolve monotonic alignment and eliminate duplicate predictions during inference.
- Quantization Fundamentals & Calibration: Converting models to lower precision (primarily INT8) requires careful range mapping, scale/zero-point calibration, and configurable granularity; PTQ offers a practical deployment path but often demands high-percentile activation calibration or quantization-aware fine-tuning to keep accuracy loss below 1% (see the range-mapping sketch below).
- Energy & Power Efficiency: Maximizing computational throughput per watt requires infrastructure-level interventions including hardware acceleration (GPUs/DPUs), purpose-built interconnects, virtualization, liquid cooling, and improved Power Usage Effectiveness (PUE) rather than relying solely on software-level tuning.
- Neural Network Pruning: Standard magnitude-based pruning ignores parameter dependencies, whereas CHITA formulates pruning as a best-subset selection problem, using low-rank Fisher information to evaluate the joint impact of removing weights; this combinatorial view retains more accuracy at a given sparsity while keeping second-order information tractable (see the pruning-criterion sketch below).
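
The following is a minimal sketch of an FP16/FP32 mixed-precision training loop using PyTorch's AMP utilities (`autocast` and `GradScaler`); the toy model, synthetic data, and hyperparameters are illustrative placeholders rather than anything from the source presentation.

```python
# Minimal mixed-precision training loop: FP16 forward/backward with loss scaling,
# FP32 master weights updated by the optimizer. Requires a CUDA-capable GPU.
import torch

model = torch.nn.Linear(512, 10).cuda()                    # toy model (placeholder)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)   # FP32 master weights live here
scaler = torch.cuda.amp.GradScaler()                       # dynamic loss scaling

# Synthetic batches standing in for a real DataLoader.
loader = [(torch.randn(32, 512), torch.randint(0, 10, (32,))) for _ in range(8)]

for inputs, targets in loader:
    inputs, targets = inputs.cuda(), targets.cuda()
    optimizer.zero_grad(set_to_none=True)

    # Forward pass runs eligible ops in FP16, cutting activation memory traffic.
    with torch.cuda.amp.autocast():
        loss = torch.nn.functional.cross_entropy(model(inputs), targets)

    scaler.scale(loss).backward()   # scale the loss so small gradients do not underflow in FP16
    scaler.step(optimizer)          # unscale gradients, then apply the FP32 weight update
    scaler.update()                 # grow/shrink the loss scale based on overflow history
```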
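As a companion to the CTC point above, here is a tiny greedy-decoding sketch showing how the blank token lets repeated frame-level predictions be collapsed without sample-level alignments; the vocabulary and frame outputs are invented for illustration.

```python
# Greedy CTC decoding: collapse consecutive duplicates, then drop blank tokens.
import itertools

BLANK = 0
VOCAB = {0: "<blank>", 1: "h", 2: "e", 3: "l", 4: "o"}   # toy vocabulary (assumption)

def ctc_greedy_decode(frame_ids):
    """frame_ids: per-frame argmax token ids emitted by an acoustic model."""
    # 1) Merge runs of identical predictions from overlapping audio windows.
    collapsed = [token for token, _ in itertools.groupby(frame_ids)]
    # 2) Remove blanks, which only separate repeated characters and mark "no output".
    return "".join(VOCAB[t] for t in collapsed if t != BLANK)

# A blank between the two 'l' runs preserves the double letter in "hello".
print(ctc_greedy_decode([1, 1, 2, 2, 3, 0, 3, 3, 0, 4, 4, 0]))  # -> "hello"
```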
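The quantization bullet above mentions range mapping and scale/zero-point calibration; below is a NumPy sketch of asymmetric (affine) INT8 quantization with percentile clipping. The percentile value and calibration data are assumptions chosen for illustration, not settings from the source.

```python
# Asymmetric INT8 range mapping: derive scale/zero-point from calibration data,
# then quantize and dequantize to inspect the rounding error.
import numpy as np

def calibrate_int8(x, pct=99.9):
    """Compute scale and zero-point from a calibration tensor, clipping outliers."""
    lo, hi = np.percentile(x, 100.0 - pct), np.percentile(x, pct)
    scale = (hi - lo) / 255.0                              # spread the clipped range over 256 levels
    zero_point = int(np.clip(round(-128 - lo / scale), -128, 127))
    return scale, zero_point

def quantize(x, scale, zero_point):
    return np.clip(np.round(x / scale) + zero_point, -128, 127).astype(np.int8)

def dequantize(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale

x = np.random.randn(10_000).astype(np.float32)             # stand-in calibration activations
scale, zp = calibrate_int8(x)
roundtrip = dequantize(quantize(x, scale, zp), scale, zp)
print(f"scale={scale:.5f}  zero_point={zp}  mean abs error={np.abs(roundtrip - x).mean():.5f}")
```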
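To make the energy-vs-power distinction and PUE concrete, here is a back-of-envelope calculation; every number in it is a made-up placeholder rather than a figure from the presentation.

```python
# Illustrative efficiency metrics for a small data-center deployment (all numbers invented).
it_power_kw = 400.0            # power drawn by servers and accelerators
facility_power_kw = 560.0      # IT load plus cooling, power conversion, and networking overhead

pue = facility_power_kw / it_power_kw              # Power Usage Effectiveness (1.0 is ideal)

inferences_per_second = 2_000_000
power_efficiency = inferences_per_second / (facility_power_kw * 1_000)    # throughput per watt
energy_efficiency = inferences_per_second * 3_600 / facility_power_kw     # tasks per kWh

print(f"PUE = {pue:.2f}")
print(f"power efficiency  = {power_efficiency:,.1f} inferences/s per W")
print(f"energy efficiency = {energy_efficiency:,.0f} inferences per kWh")
```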
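Finally, a toy comparison of pruning criteria: magnitude-only scoring versus a diagonal-Fisher (OBD-style) saliency that weighs each parameter by how strongly the loss reacts to it. This is only a simplified illustration of why second-order information matters; it is not the CHITA algorithm itself, which jointly solves a best-subset problem over all weights.

```python
# Toy pruning-criterion comparison on random data (not CHITA itself).
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(size=1_000)                     # flattened model parameters
grads = rng.normal(size=(64, 1_000))                 # per-example gradients from a calibration batch
fisher_diag = (grads ** 2).mean(axis=0)              # empirical diagonal Fisher information

sparsity = 0.5
k = int(sparsity * weights.size)

# Magnitude pruning: drop the smallest |w|, ignoring loss sensitivity entirely.
magnitude_mask = np.argsort(np.abs(weights))[:k]

# Second-order saliency: approximate loss increase from zeroing w_i as 0.5 * F_ii * w_i^2.
saliency = 0.5 * fisher_diag * weights ** 2
fisher_mask = np.argsort(saliency)[:k]

agreement = len(np.intersect1d(magnitude_mask, fisher_mask)) / k
print(f"overlap between the two pruning masks: {agreement:.0%}")
```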
Key Claims and Findings
- Mixed-precision training achieves comparable accuracy to FP32 while reducing memory footprint by ~50% and accelerating training/inference on GPU hardware.
- INT8 quantization provides the best practical accuracy–compression–speed tradeoff, typically delivering 2×–4× throughput gains with negligible accuracy degradation when calibrated correctly.
- Post-training quantization is deployment-ready but requires model-specific calibration strategies; accuracy recovery often necessitates quantization-aware training or selective layer quantization for sensitive architectures.
- Data center power efficiency is fundamentally constrained by hardware architecture; offloading networking/management to DPUs and leveraging AI accelerators yields greater energy savings than CPU-only scaling.
- Optimization-based pruning that accounts for weight interdependencies substantially outperforms magnitude-based pruning in preserving downstream task performance.
Connections to Existing Wiki Pages
- [[ai-ml/nvidia-certs/Generative AI LLM Exam Study Guide/sec-06-mastering-llm-techniques-inference-optimization]]
- [[ai-ml/nvidia-certs/Generative AI LLM Exam Study Guide/sec-04-llms-training-customizing-and-inferencing]]
- [[ai-ml/nvidia-certs/NCA-GENM Experimentation/sec-07-basics-of-speech-recognition-and-customization-of-riva-asr]]
- [[ai-ml/nvidia-certs/NCP-AAI_Part_1_Exam_Prep_FULL]]
- [[ai-ml/nvidia-certs/NCA-GENM Multimodal Data]]