Dev.to
6/15/2026

aMuseMe: When Small Models Compose a Visual Symphony
Short summary
aMuseMe generates lyric videos from audio using only 3.5B total parameters: Whisper for word timestamps, MiniCPM5-1B for line breaks (with the output schema enforced by Outlines), SD-Turbo for backgrounds, and Pillow for the final render. The project demonstrates efficient multi-stage ML composition without cloud APIs. Each stage was aggressively tuned for the hackathon's parameter budget.
- 3.5B parameters total: Whisper (word timestamps) + MiniCPM5-1B (line breaks) + SD-Turbo (backgrounds) + Pillow (render)
- Schema-enforced generation with Outlines prevents malformed LLM outputs without retries or parsing logic
- Multi-stage pipeline demonstrates efficient model composition and parameter budget optimization
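To make the hand-off between the timestamp stage and the line-break stage concrete, here is a minimal sketch in plain Python. It is not the project's code: the `Word` and `LyricLine` types, the "list of break indices" format, and the function names are all assumptions made for illustration. It also checks the break list after the fact, whereas schema-enforced decoding with Outlines is described above as guaranteeing a well-formed shape during generation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """One transcribed word with its timing (hypothetical shape)."""
    text: str
    start: float  # seconds
    end: float    # seconds


@dataclass(frozen=True)
class LyricLine:
    """One rendered lyric line spanning its first to last word."""
    text: str
    start: float
    end: float


def validate_breaks(breaks: list[int], n_words: int) -> list[int]:
    """Check the assumed line-break contract: strictly increasing word
    indices in 1..n_words, with the last one equal to n_words so that
    every word lands in exactly one line."""
    if not breaks or breaks[-1] != n_words:
        raise ValueError("last break must equal the number of words")
    prev = 0
    for b in breaks:
        if not isinstance(b, int) or b <= prev or b > n_words:
            raise ValueError(f"invalid break index: {b!r}")
        prev = b
    return breaks


def group_lines(words: list[Word], breaks: list[int]) -> list[LyricLine]:
    """Split timed words into lines; each break is an exclusive end index."""
    validate_breaks(breaks, len(words))
    lines: list[LyricLine] = []
    begin = 0
    for stop in breaks:
        chunk = words[begin:stop]
        text = " ".join(w.text for w in chunk)
        lines.append(LyricLine(text, chunk[0].start, chunk[-1].end))
        begin = stop
    return lines


# Made-up sample data standing in for the timestamp stage's output.
words = [
    Word("small", 0.0, 0.4),
    Word("models", 0.4, 0.9),
    Word("big", 1.2, 1.5),
    Word("sound", 1.5, 2.0),
]
demo = group_lines(words, [2, 4])
for line in demo:
    print(f"{line.start:.1f}-{line.end:.1f}  {line.text}")
```

With a contract like this, the renderer only ever sees complete lines with start and end times, which is one plausible reason a small model can handle the line-break step: it only has to emit a short list of integers.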
Generated with AI, which can make mistakes.