Deep Learning for Audio
About
This course covers modern deep learning approaches to audio processing and analysis, with particular attention to digital signal processing fundamentals, automatic speech recognition, text-to-speech synthesis, and neural audio generation. It aims to introduce students to advanced techniques in audio machine learning and their practical applications.
Syllabus
- Digital Signal Processing: Audio fundamentals, spectrograms, STFT, and classical audio preprocessing techniques.
- Automatic Speech Recognition I: Word Error Rate (WER), Connectionist Temporal Classification (CTC), Listen-Attend-Spell (LAS), and beam search algorithms.
- Automatic Speech Recognition II: RNN-Transducer (RNN-T), Conformer architecture, Whisper model, language models in ASR, and Byte-Pair Encoding (BPE).
- Keyword Spotting (KWS): Wake word detection, small-footprint models, and streaming KWS systems.
- Text-to-Speech I: Tacotron architecture, FastSpeech models, and guided attention mechanisms.
- Text-to-Speech II: Neural vocoders including WaveNet, Parallel WaveGAN, and DiffWave for high-quality audio synthesis.
- Voice Conversion: Voice transformation techniques and neural approaches to voice cloning.
- Self-supervised Learning in Audio: Wav2Vec, HuBERT, and other self-supervised approaches for audio representation learning.
- Unsupervised Learning in Audio: Clustering, representation learning, and unsupervised audio analysis methods.
- Music Generation with Neural Networks: Neural approaches to music composition and generation.
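As a small illustration of the Word Error Rate (WER) metric listed under Automatic Speech Recognition I, the sketch below computes WER as the word-level edit distance between a reference and a hypothesis, divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for this example, not part of the course materials.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()  # assumption: words are separated by whitespace
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution ("sat" -> "sit") and one deletion ("the"): 2 errors / 6 words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.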
Lectures and Seminars
The course combines theoretical lectures with practical seminars that include hands-on coding exercises:
- Lectures: Fundamental concepts, model architectures, and theoretical foundations
- Seminars: Practical implementation using PyTorch, audio preprocessing, model training, and evaluation
Labworks
Four homework assignments cover practical aspects of audio deep learning:
- Audio Classification & Preprocessing: Fundamental audio processing and classification tasks
- ASR with CTC: Implementing connectionist temporal classification for speech recognition
- ASR with RNN-T: Advanced speech recognition using RNN-Transducer architecture
- Text-to-Speech: Building FastPitch-based text-to-speech systems
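For orientation on the ASR with CTC homework, the sketch below shows the collapse rule used in greedy CTC decoding: merge consecutive repeated labels, then drop the blank symbol. It is a minimal illustration and not the course's reference implementation; the name `ctc_collapse` and the choice of index 0 for the blank are assumptions.

```python
from itertools import groupby

BLANK = 0  # assumption: the blank symbol has index 0


def ctc_collapse(frame_labels, blank=BLANK):
    """Map a per-frame label sequence to an output sequence under the CTC rule."""
    # Step 1: merge runs of identical consecutive labels into a single label.
    merged = [label for label, _ in groupby(frame_labels)]
    # Step 2: remove blanks, which only serve to separate repeated labels.
    return [label for label in merged if label != blank]


if __name__ == "__main__":
    # Frames: blank, 1, 1, blank, 1, 2, 2, blank
    # The blank between the 1s keeps them as two separate output labels.
    print(ctc_collapse([0, 1, 1, 0, 1, 2, 2, 0]))
```

In a greedy decoder, `frame_labels` would typically be the per-frame argmax over the model's output distribution.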
Grading
Each homework is worth 2 points, and the final test is worth 2 points. Maximum score: 4×2 + 2 = 10 points.
Prerequisites
Students should be familiar with digital signal processing basics, machine learning fundamentals, deep learning with PyTorch, and the basics of sequence modeling.