Multi-Modal AI Video: Combining Images, Video & Audio with Seedance
True multi-modal AI video means feeding the model more than just a prompt. Here's how to combine images, video, and audio references in Seedance 2.0 Reference.

"Multi-modal" gets thrown around loosely in AI marketing. Most tools that claim it actually mean "we accept text and one image." Seedance 2.0 Reference is one of the few models that takes the term seriously: up to 9 images, 3 video clips, and 3 audio clips, all fused into a single generation.
Here's how to actually use all three input types together — and why the combination unlocks things no single input can.
TL;DR
- Seedance 2.0 Reference accepts images + video + audio as reference in one call
- Images drive style, video drives motion, audio drives mood
- Combining all three produces tighter output than any single input type
- Pricing: $0.3024/sec, regardless of how many references you use
- Generation: 90-180 seconds for the fused multi-modal pipeline
- Try the full multi-modal stack free
What Each Input Type Actually Controls
These three reference types don't overlap — they handle different dimensions of the output.
Images control style. Color palette, lighting, composition, texture, framing. Everything visual that doesn't involve motion.
Video controls motion. Camera moves, pacing, action vectors, temporal rhythm. Not color or lighting — the model extracts motion independently of the video's visual style.
Audio controls mood. Emotional bias that tilts the visual output subtly. Not the sound you hear in the output (that's generated separately), but the mood cue that influences what's on screen.
When you use all three together, each one fills a gap the others can't.
The Stack in Action
Here's a concrete multi-modal generation I ran.
Goal: A dreamy, slow-motion shot of a woman running through a field at dawn, in the style of a Terrence Malick film.
Image references (6):
- 2 Malick film stills showing warm dawn light
- 2 images of fields at golden hour
- 1 shot of the exact lens character I wanted (soft, dreamy bokeh)
- 1 hero frame showing the precise mood
Video references (1):
- A 3-second clip of a slow-motion running figure shot with a long lens
Audio references (1):
- 4 seconds of gentle, sparse piano
Prompt:
A woman in a white dress runs through tall grass at dawn, 8 seconds
Notice how minimal the prompt is. The references handle everything else.
The output was startlingly close to intent — style from the images, the specific slow-motion running feel from the video, and a subtle emotional weight from the audio that I couldn't have gotten with just images.
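If you're wiring this up programmatically, here's a minimal sketch of that exact generation as a request payload. Everything API-shaped here is an assumption for illustration: the `api.example.com` URL, the `image_refs`/`video_refs`/`audio_refs` field names, and the response format are not Seedance's documented interface, so check the official docs for the real shape.

```python
import requests

# Hypothetical endpoint and field names for illustration only --
# consult the official Seedance docs for the real API shape.
API_URL = "https://api.example.com/v1/generations"  # placeholder URL

payload = {
    "model": "seedance-2.0-reference",
    "prompt": "A woman in a white dress runs through tall grass at dawn, 8 seconds",
    "image_refs": [                                  # style channel: up to 9 images
        "malick_still_01.jpg", "malick_still_02.jpg",
        "field_golden_hour_01.jpg", "field_golden_hour_02.jpg",
        "lens_bokeh_sample.jpg",                     # lens character
        "hero_frame.jpg",                            # the one precise-mood frame
    ],
    "video_refs": ["slowmo_run_longlens.mp4"],       # motion channel: up to 3 clips
    "audio_refs": ["sparse_piano_4s.mp3"],           # mood channel: up to 3 clips
    "duration_sec": 8,
}

response = requests.post(API_URL, json=payload, timeout=300)
response.raise_for_status()
print(response.json())
```

Note how the prompt stays one line while the references carry the rest, which mirrors the curation work above.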

Try the full multi-modal stack. Upload images, a motion clip, and an audio reference in one shot. Start free with 50 credits.
Image References: The Foundation
Images are the heaviest-weighted input in the fusion and should always be your primary style channel. Use 4-6 as a baseline, up to 9 for complex styles.
See the dedicated multi-image tutorial for the full methodology on image curation. Summary: agreement matters more than count, cover different facets of the style, include one hero frame.
Video References: The Motion Layer
Video references are where multi-modal really separates from pure image-based workflows. Describing camera motion in words is genuinely hard — "a slow handheld drift left while subtly craning up" loses fidelity every time you translate it to prose.
A 2-5 second video reference skips the translation. The model samples motion vectors directly.
Best uses:
- Specific handheld feel you want to match
- Tricky camera moves (craning, dolly + pan combos, whip transitions)
- Pacing cues (a fast-cut feel vs a slow, contemplative feel)
- Action rhythm for sports or dance content
What video references don't do:
- They don't carry their own style. Color grade from the reference won't transfer — only motion.
- They don't dictate content. The subject in the reference doesn't appear in your output.
You can use up to 3 video references per generation. Stacking multiple motion references blends them — useful for getting "between" two reference moves.
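A blended-motion call might look like the sketch below, reusing the same assumed payload shape (`video_refs` and the other field names are the hypothetical ones from the earlier example):

```python
# Sketch: two motion references blended in one generation. Field names
# are the same illustrative assumptions as in the earlier payload.
payload = {
    "model": "seedance-2.0-reference",
    "prompt": "A dancer spins through falling confetti, 8 seconds",
    "image_refs": ["style_01.jpg", "style_02.jpg", "hero_frame.jpg"],
    "video_refs": [
        "slow_dolly_in.mp4",   # contemplative push-in
        "whip_pan_left.mp4",   # faster transition energy
    ],
    "duration_sec": 8,
}
# The fused motion tends to land somewhere between the two moves.
```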
Audio References: The Mood Layer
Audio references are the strangest of the three. They don't appear in your output's audio track (that's generated fresh to match the scene). Instead, they nudge the visuals emotionally.
A stormy audio reference biases toward dramatic lighting and darker palette. A cheerful uptempo reference biases toward brighter framing and more movement. A sparse melancholy reference biases toward longer, slower shots with cooler tones.
When to use them:
- The style references and prompt are solid but the output feels emotionally flat
- You want a mood that's hard to describe visually
- You're matching output to an existing score or song
When to skip them:
- Your first couple of generations. Start with images only. Add audio references once you're comfortable with the baseline.
Up to 3 audio references per generation, blended together.
Try Seedance 2.0 Reference — multi-modal video generation
Combine images, video, and audio references in one generation. 50 free credits, no card required.
Try Seedance 2.0 Reference Free
The Multi-Modal Decision Tree
Not every project needs all three inputs. Here's when to add each:
- Always: image references (3+ minimum)
- Add a video reference if: you need specific motion that's hard to describe
- Add an audio reference if: your output feels emotionally off after image-only attempts
Don't force multi-modal when a simpler stack works. Image-only generations are faster to set up and often sufficient.
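The decision tree compresses into a small helper. This is a sketch with illustrative names: `build_reference_stack` and the `image_refs`/`video_refs`/`audio_refs` fields are assumptions carried over from the earlier payload examples.

```python
def build_reference_stack(style_images, motion_clip=None, audio_clip=None):
    """Encode the decision tree above as a helper (illustrative names,
    hypothetical field names from the earlier payload sketches)."""
    if len(style_images) < 3:
        raise ValueError("Always use at least 3 image references")
    stack = {"image_refs": style_images[:9]}   # images: always, capped at 9
    if motion_clip is not None:                # only when motion is hard to describe
        stack["video_refs"] = [motion_clip]
    if audio_clip is not None:                 # only if image-only output felt flat
        stack["audio_refs"] = [audio_clip]
    return stack

# Image-only baseline first; add layers as the project demands them.
refs = build_reference_stack(["style_01.jpg", "style_02.jpg", "style_03.jpg"])
```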
Common Mistakes with Multi-Modal
Contradicting the images with the video. If your images are moody and slow but your video reference is fast and bright, the fusion gets confused. Pick inputs that agree.
Using video references for style. They're motion-only. Style still comes from images.
Overusing audio references. A single tight audio cue is more effective than 3 conflicting ones.
Skipping the prompt. References define how, the prompt still defines what. Don't leave the prompt blank.
Cost and Generation Time
Multi-modal pricing is identical to any other Reference mode call: $0.3024 per second of output. Adding video and audio references doesn't cost extra.
| Duration | Credits | Cost |
|---|---|---|
| 4 sec | 243 | $2.42 |
| 8 sec | 484 | $4.84 |
| 15 sec | 907 | $9.07 |
Generation time is slightly longer than for image-only calls because the fusion pipeline processes more data. Expect 90-180 seconds for multi-modal calls versus 60-120 for image-only.
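For budgeting, a quick estimator helps. The constants below are read off the table above, not official pricing: 484 credits over 8 seconds implies roughly 60.5 credits per second, converting at about $0.01 per credit.

```python
# Rough cost estimator derived from the pricing table above. These
# constants are assumptions read off the table, not official pricing;
# rounding lands within a credit of the published figures.
CREDITS_PER_SECOND = 60.5
USD_PER_CREDIT = 0.01

def estimate_cost(duration_sec: int) -> tuple[int, float]:
    """Return (credits, dollars) for a clip of the given duration."""
    credits = round(CREDITS_PER_SECOND * duration_sec)
    return credits, credits * USD_PER_CREDIT

for seconds in (4, 8, 15):
    credits, dollars = estimate_cost(seconds)
    print(f"{seconds:>2} sec -> {credits} credits (~${dollars:.2f})")
```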
A Full Multi-Modal Production Workflow
For a 10-clip project where every clip needs to match:
1. Build your reference bundle (once, re-use for all 10 clips):
- 6 images defining the style
- 2 video references for your most-used camera moves
- 1 audio reference for the emotional tone
2. Write 10 shot-specific prompts that describe subject and action only.
3. Run all 10 generations using the same reference bundle, varying only the prompt.
4. Review and iterate on any clips that drifted. Usually 1-2 out of 10.
Total cost for 10 clips at 8 seconds: ~4,840 credits = ~$48. That's inside the $50 Pro tier with change left.
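In code, the bundle-reuse loop might look like the sketch below. As before, the endpoint URL, the field names, and the response's `id` field are assumptions for illustration, not Seedance's documented API.

```python
import requests

API_URL = "https://api.example.com/v1/generations"  # hypothetical endpoint

# Step 1: one reference bundle, built once and reused for all 10 clips.
reference_bundle = {
    "image_refs": [f"style_{i:02d}.jpg" for i in range(1, 7)],  # 6 style images
    "video_refs": ["dolly_in.mp4", "handheld_drift.mp4"],       # 2 motion refs
    "audio_refs": ["score_excerpt.mp3"],                        # 1 mood ref
}

# Step 2: shot-specific prompts describing subject and action only.
shot_prompts = [
    "A courier cycles through rain-slick streets at dusk, 8 seconds",
    "Close-up of hands sorting letters under a desk lamp, 8 seconds",
    # ...8 more shot prompts
]

# Step 3: run every generation with the same bundle, varying only the prompt.
for prompt in shot_prompts:
    payload = {"model": "seedance-2.0-reference", "duration_sec": 8,
               "prompt": prompt, **reference_bundle}
    resp = requests.post(API_URL, json=payload, timeout=300)
    resp.raise_for_status()
    print(prompt[:40], "->", resp.json().get("id"))  # assumed response field
```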
Multi-Modal vs Standard Comparison
| Capability | Standard Seedance 2.0 | Reference (Multi-Modal) |
|---|---|---|
| Style control | Prompt only | Up to 9 images |
| Motion control | Prompt only | Up to 3 video refs |
| Mood control | Prompt only | Up to 3 audio refs |
| Setup complexity | Low | Moderate |
| Output precision | Moderate | High |
| Best for | One-offs | Precision work |
Standard is faster for simple ideas. Reference multi-modal is the precision option when you need the output to match a specific intent. See the full Reference vs Standard breakdown.
Advanced: Layering References Across a Sequence
For multi-shot sequences with distinct moments, you can change your multi-modal stack between shots while keeping one constant.
- Constant across the sequence: 5-6 style images (for visual consistency)
- Variable per shot: 1-2 shot-specific images + 1 motion reference
The constant images lock the sequence together. The variable references let you capture shot-specific nuance. This is the workflow most pro users land on.
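As a sketch, with the same hypothetical field names as the earlier examples:

```python
# Sketch of the constant-plus-variable pattern, reusing the hypothetical
# field names from the earlier examples.
CONSTANT_STYLE = [f"style_{i:02d}.jpg" for i in range(1, 6)]  # 5 locked style images

def build_shot_refs(shot_images, motion_clip):
    """Merge the constant style set with shot-specific references,
    staying inside the 9-image cap."""
    image_refs = (CONSTANT_STYLE + shot_images)[:9]
    return {"image_refs": image_refs, "video_refs": [motion_clip]}

shot_refs = build_shot_refs(["diner_interior.jpg"], "slow_pan_right.mp4")
```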
Where To Go From Here
If this is your first time with multi-modal, start small. Try image-only generations first (multi-image tutorial). Once comfortable, add a motion reference. Once you're getting reliable results, experiment with audio.
For the complete feature picture, read the Seedance 2.0 Reference guide. For specific use cases, check brand videos or music videos.
Multi-modal isn't a gimmick — it's the feature that separates Seedance 2.0 Reference from every other AI video tool on the market. Use it when precision matters.
Stack images, video, and audio in one call
See how multi-modal input locks down your AI video output. 50 free credits, no card required.
Start Creating Free