Stanford Online
5/11/2026

Stanford CME296 Diffusion & Large Vision Models | Spring 2026 | Lecture 5 - Architectures
Short summary
Stanford graduate lecture on diffusion models and large vision architectures, covering U-Net evolution, Diffusion Transformers (DiT), multimodal variants, and advanced positional encoding techniques. Assumes a background in deep neural networks; aimed at AI product builders and researchers.
- Covers U-Net and Diffusion Transformer architectures and how they evolved over time
- Explores multimodal DiT models and state-of-the-art implementations (FLUX.1, Qwen-Image)
- Takes a technical deep dive into rotary position embeddings (RoPE) and adaptive layer normalization