Deep Learning for Audio

About

The course is devoted to modern deep learning approaches for audio processing and analysis. Special attention is paid to digital signal processing fundamentals, automatic speech recognition systems, text-to-speech synthesis, and neural audio generation methods. The aim of the course is to introduce students to advanced techniques in audio machine learning and their practical applications.

Syllabus

Digital Signal Processing: Audio fundamentals, spectrograms, STFT, and classical audio preprocessing techniques.
Automatic Speech Recognition I: Word Error Rate (WER), Connectionist Temporal Classification (CTC), Listen-Attend-Spell (LAS), and beam search algorithms.
Automatic Speech Recognition II: RNN-Transducer (RNN-T), Conformer architecture, Whisper model, language models in ASR, and Byte-Pair Encoding (BPE).
Keyword Spotting (KWS): Wake word detection, small-footprint models, and streaming KWS systems.
Text-to-Speech I: Tacotron architecture, FastSpeech models, and guided attention mechanisms.
Text-to-Speech II: Neural vocoders including WaveNet, Parallel WaveGAN, and DiffWave for high-quality audio synthesis.
Voice Conversion: Voice transformation techniques and neural approaches to voice cloning.
Self-supervised Learning in Audio: Wav2Vec, HuBERT, and other self-supervised approaches for audio representation learning.
Unsupervised Learning in Audio: Clustering, representation learning, and unsupervised audio analysis methods.
Music Generation with Neural Networks: AI-powered music composition and generation techniques.
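
As an illustration of the Word Error Rate (WER) metric named in the ASR part of the syllabus, the sketch below computes WER as the word-level edit distance (substitutions + deletions + insertions) divided by the number of reference words. It is a minimal sketch and not the course's reference implementation: the function name `word_error_rate` and the whitespace tokenization are assumptions made for this example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are separated by whitespace and compared exactly
    (no lowercasing or punctuation stripping).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution (sat -> sit) and one deletion (the): 2 errors / 6 words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because insertions count as errors while the denominator is the reference length, WER can exceed 1.0 when the hypothesis is much longer than the reference.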

Lectures and Seminars

The course combines theoretical lectures with practical seminars that include hands-on coding exercises:

Lectures: Fundamental concepts, model architectures, and theoretical foundations
Seminars: Practical implementation using PyTorch, audio preprocessing, model training, and evaluation

Labworks

Four homework assignments cover practical aspects of audio deep learning:

Audio Classification & Preprocessing: Fundamental audio processing and classification tasks
ASR with CTC: Implementing connectionist temporal classification for speech recognition
ASR with RNN-T: Advanced speech recognition using RNN-Transducer architecture
Text-to-Speech: Building FastPitch-based text-to-speech systems
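
To give a sense of what the CTC homework involves, here is a minimal sketch of greedy (best-path) CTC decoding: pick the highest-scoring label in each frame, collapse consecutive repeats, then drop blanks. It is an illustration under stated assumptions, not the homework solution: the function name `ctc_greedy_decode`, the list-of-lists input format, and blank index 0 are all choices made for this example.

```python
from itertools import groupby

BLANK = 0  # assumption: the blank symbol has index 0


def ctc_greedy_decode(frame_scores, blank=BLANK):
    """Greedy CTC decoding over per-frame label scores.

    frame_scores: one list of scores per time frame, one score per label.
    Returns the decoded label indices without repeats or blanks.
    """
    # Step 1: take the best-scoring label in every frame.
    best_path = [
        max(range(len(scores)), key=scores.__getitem__)
        for scores in frame_scores
    ]
    # Step 2: collapse runs of identical consecutive labels.
    collapsed = [label for label, _ in groupby(best_path)]
    # Step 3: remove blanks; a blank between two equal labels keeps both.
    return [label for label in collapsed if label != blank]


if __name__ == "__main__":
    scores = [
        [0.1, 0.8, 0.1],    # label 1
        [0.2, 0.7, 0.1],    # label 1 (repeat, collapsed)
        [0.9, 0.05, 0.05],  # blank
        [0.1, 0.8, 0.1],    # label 1 (kept, separated by a blank)
        [0.1, 0.2, 0.7],    # label 2
        [0.1, 0.1, 0.8],    # label 2 (repeat, collapsed)
        [0.6, 0.2, 0.2],    # blank
    ]
    print(ctc_greedy_decode(scores))
```

Greedy decoding is the simplest baseline. The beam search algorithms listed in the syllabus keep several candidate prefixes per frame instead of a single best path.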

Grading

Each of the four homeworks is worth 2 points, and the final test is worth 2 points. Maximum score: 4×2 + 2 = 10 points.

Prerequisites

Students are expected to know digital signal processing basics, machine learning fundamentals, and deep learning with PyTorch, and to have a basic understanding of sequence modeling.