Dev.to
6/15/2026
aMuseMe: When Small Models Compose a Visual Symphony

Short summary

aMuseMe generates lyric videos from audio with a pipeline totaling just 3.5B parameters: Whisper for word timestamps, MiniCPM5-1B for line breaks (with output constrained to a schema by Outlines), SD-Turbo for backgrounds, and Pillow for the final render. The project shows that multi-stage ML composition can run efficiently without cloud APIs. Each stage was tuned aggressively to fit the hackathon's parameter budget.

  • 3.5B parameters total: Whisper (word timestamps) + MiniCPM5-1B (line breaks) + SD-Turbo (backgrounds) + Pillow (render)
  • Schema-enforced generation with Outlines prevents malformed LLM outputs without retries or parsing logic
  • Multi-stage pipeline demonstrates efficient model composition and parameter budget optimization
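
As a rough illustration of how the timestamp and line-break stages might fit together, here is a minimal sketch in plain Python. It is not the project's code: the schema, the `breaks` field, the `LyricLine` type, and the `group_words` helper are all hypothetical, and the word dictionaries only assume a Whisper-style shape with `word`, `start`, and `end` keys. The idea is that when generation is constrained to a schema, the downstream stage can parse the model output directly and only needs to check the values, not the shape.

```python
import json
from dataclasses import dataclass

# Hypothetical JSON Schema for the line-break stage. The source does not
# show the real schema; this is only what a constrained-generation library
# could be asked to enforce.
LINE_BREAK_SCHEMA = {
    "type": "object",
    "properties": {
        "breaks": {"type": "array", "items": {"type": "integer", "minimum": 1}},
    },
    "required": ["breaks"],
}


@dataclass
class LyricLine:
    text: str
    start: float  # seconds, taken from the first word in the line
    end: float    # seconds, taken from the last word in the line


def group_words(words, llm_json):
    """Split timestamped words into lyric lines at the model-chosen indices."""
    if not words:
        return []
    # The schema guarantees the shape, so no defensive parsing is needed here.
    breaks = json.loads(llm_json)["breaks"]
    # It does not guarantee sensible values: keep in-range indices only,
    # sorted and de-duplicated.
    cuts = sorted({b for b in breaks if 0 < b < len(words)})
    bounds = [0, *cuts, len(words)]
    lines = []
    for lo, hi in zip(bounds, bounds[1:]):
        chunk = words[lo:hi]
        lines.append(
            LyricLine(
                text=" ".join(w["word"].strip() for w in chunk),
                start=chunk[0]["start"],
                end=chunk[-1]["end"],
            )
        )
    return lines


if __name__ == "__main__":
    sample_words = [
        {"word": " hello", "start": 0.0, "end": 0.4},
        {"word": " darkness", "start": 0.5, "end": 0.9},
        {"word": " my", "start": 1.0, "end": 1.2},
        {"word": " friend", "start": 1.3, "end": 1.8},
    ]
    for line in group_words(sample_words, '{"breaks": [2]}'):
        print(f"{line.start:.1f}-{line.end:.1f}: {line.text}")
```

Each resulting line carries its own start and end time, which is the information a renderer would need to decide when to draw the text over a background frame.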

Generated with AI, which can make mistakes.
