OmniHuman v1.5: Create Lifelike AI Avatars from a Single Photo
The complete guide to OmniHuman v1.5 on Seedance. Learn how to create realistic AI avatar videos from a single photo and audio file, including features, input requirements, pricing, output specs, and real-world use cases.

One photo. One audio file. One realistic talking video for $9.60 — no subscription, no credit card lock-in, no monthly minimum. That is the promise of OmniHuman v1.5, and this guide shows you exactly how it delivers.
TL;DR
- Turn any portrait photo plus an audio file into a lifelike talking video
- 960 credits ($9.60) per generation — pay only when you create
- 720p up to 60 seconds, or 1080p up to 30 seconds
- Phoneme-accurate lip sync, natural gestures, audio-driven expressions
- No subscription required, unlike HeyGen ($24-$48/mo) or Synthesia ($30-$90/mo)
What OmniHuman v1.5 Actually Does
OmniHuman v1.5 is ByteDance's flagship avatar model, available exclusively on Seedance. You upload a single portrait photo and a speech audio file, and the model generates a complete talking-head video — lip sync, eye blinks, head tilts, shoulder shifts, and micro-expressions all synthesized from scratch.
This is not the old "paste an animated mouth on a static face" trick. Every frame is freshly generated by a diffusion transformer that understands both identity and audio. The person in your photo moves the way a real human would while delivering that specific audio.
If you have used Seedance 2.0 for cinematic video or Seedream for image generation, OmniHuman v1.5 completes the set: text to image, image to video, and now photo plus audio to avatar.
Create your AI presenter now
Turn one photo + audio into a lifelike talking video. $9.60 per video, no subscription.
Try OmniHuman Free
The Five Features That Matter
1. Phoneme-Level Lip Sync
The model extracts phonemes from your audio and maps them to exact mouth shapes frame by frame. Plosives like "p" and "b" show lip closure. Fricatives like "f" and "v" show the teeth against the lower lip. Vowels produce proper jaw opening. Viewers watching with sound on will not catch the model faking it.
2. Audio-Driven Facial Expressions
When the audio sounds enthusiastic, the eyebrows lift. When there is a pause for emphasis, the eyes narrow slightly. These cues are inferred from vocal prosody, not baked into rigid templates.
3. Natural Gesture Generation
Shoulder shifts, head turns, and subtle postural changes emerge from the speech signal itself. A speaker making a point leans forward. One listing items shifts weight. The motion correlates with meaning, not with a timer.
4. Turbo Mode
Need speed for iteration? Turbo mode trims generation time without changing the 960-credit cost. Use it to preview prompts and inputs, then switch to standard mode for the hero render.
5. Dual Resolution, Dual Duration
| Resolution | Max Audio Duration | Best For |
|---|---|---|
| 720p (1280x720) | 60 seconds | Social clips, training, drafts |
| 1080p (1920x1080) | 30 seconds | Client work, product demos, marketing |
How the Pipeline Works
Understanding the stages helps you feed the model better inputs.
Stage 1: Identity extraction. A facial encoder turns your photo into a high-dimensional embedding capturing bone structure, skin tone, eye shape, and hundreds of other markers. This embedding keeps the face consistent across every frame.
Stage 2: Audio analysis. The speech track is analyzed for phonemes, prosody, energy contour, and emotional tone.
Stage 3: Motion synthesis. A motion network turns audio features into per-frame parameters for face, head, and shoulders.
Stage 4: Diffusion rendering. A transformer generates final pixels, combining identity, motion, and your prompt description of the scene.
Stage 5: Temporal coherence. A refinement pass kills flicker, judder, and identity drift between frames.
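The five stages above can be sketched as a simple dataflow. This is a conceptual sketch only; every function name and data shape here is illustrative, not Seedance's actual internals:

```python
# Conceptual sketch of the five-stage pipeline described above.
# All names and return values are illustrative placeholders.

def extract_identity(photo: str) -> dict:
    """Stage 1: encode the portrait into an identity embedding reused every frame."""
    return {"identity": f"embedding({photo})"}

def analyze_audio(audio: str) -> dict:
    """Stage 2: extract phonemes, prosody, energy contour, emotional tone."""
    return {"phonemes": ["p", "ah"], "prosody": "neutral"}

def synthesize_motion(audio_features: dict) -> list:
    """Stage 3: turn audio features into per-frame face/head/shoulder parameters."""
    return [{"frame": i, "mouth": p} for i, p in enumerate(audio_features["phonemes"])]

def render_frames(identity: dict, motion: list, prompt: str) -> list:
    """Stage 4: diffusion rendering combines identity, motion, and the scene prompt."""
    return [f"{identity['identity']}|{m['mouth']}|{prompt}" for m in motion]

def refine(frames: list) -> list:
    """Stage 5: temporal-coherence pass removes flicker and identity drift."""
    return frames  # placeholder: a real pass would smooth across frames

def generate(photo: str, audio: str, prompt: str) -> list:
    features = analyze_audio(audio)
    return refine(render_frames(extract_identity(photo),
                                synthesize_motion(features), prompt))
```

The key structural point the sketch captures: identity is computed once and held fixed, while motion is recomputed per frame from the audio.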
What You Need to Supply
Reference Photo
- Resolution: minimum 512x512, ideally 1024x1024 or higher
- Composition: head and shoulders, face at least 30% of frame
- Lighting: even and diffused beats harsh shadows every time
- Expression: neutral or mildly pleasant gives the model the most room to work
- Format: JPEG or PNG
Audio File
- Format: MP3, WAV, or M4A
- Duration: up to 60s at 720p, up to 30s at 1080p
- Quality: clean speech with no music bed, no heavy reverb
- Language: any spoken language works
Text Prompt
Describe the scene: background ("modern office with large windows"), lighting ("soft key light from the left"), framing ("medium shot, head and shoulders"), and any style notes.
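Before uploading, you can sanity-check your inputs against the requirements above. A minimal pre-flight check (the limits come from this guide; the function itself is an illustrative sketch, not a Seedance tool):

```python
def preflight(width: int, height: int, audio_seconds: float,
              resolution: str, audio_format: str, image_format: str) -> list:
    """Return a list of problems; an empty list means the inputs look valid."""
    problems = []
    if min(width, height) < 512:
        problems.append("photo below 512x512 minimum")
    if image_format.upper() not in {"JPEG", "JPG", "PNG"}:
        problems.append("photo must be JPEG or PNG")
    if audio_format.upper() not in {"MP3", "WAV", "M4A"}:
        problems.append("audio must be MP3, WAV, or M4A")
    # Duration cap depends on the output resolution you pick.
    max_seconds = {"720p": 60, "1080p": 30}[resolution]
    if audio_seconds > max_seconds:
        problems.append(f"audio exceeds {max_seconds}s cap at {resolution}")
    return problems
```

For example, a 45-second clip passes at 720p but fails at 1080p, which is the most common mismatch to catch before spending credits.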
Pricing: The Pay-Per-Use Math
OmniHuman v1.5 costs 960 credits per video, equal to roughly $9.60 at the base rate. There is no subscription. You buy credits, you spend them when you generate, and unused credits stay in your account.
| Credit Package | Price | Credits | Effective Cost per Video |
|---|---|---|---|
| Starter | $10 | 1,050 | ~$9.14 |
| Popular | $25 | 2,750 | ~$8.73 |
| Pro | $50 | 5,750 | ~$8.35 |
| Max | $100 | 12,000 | ~$8.00 |
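The effective-cost column follows directly from the credit math: each video costs 960 credits, so cost per video is package price divided by credits, times 960. A quick check:

```python
CREDITS_PER_VIDEO = 960  # OmniHuman v1.5 cost per generation

def effective_cost(price_usd: float, credits: int) -> float:
    """Dollar cost of one 960-credit generation at a package's credit rate."""
    return round(price_usd / credits * CREDITS_PER_VIDEO, 2)

packages = {"Starter": (10, 1_050), "Popular": (25, 2_750),
            "Pro": (50, 5_750), "Max": (100, 12_000)}
for name, (price, credits) in packages.items():
    full_videos = credits // CREDITS_PER_VIDEO
    print(f"{name}: ${effective_cost(price, credits)} per video, "
          f"{full_videos} full videos")
```

The same arithmetic also tells you how many full generations a package buys: credits divided by 960, rounded down.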
Why This Beats Subscriptions
HeyGen starts at $24/month with a monthly minute cap. Synthesia starts at $30/month. D-ID starts at $5/month for a tiny allowance and scales to $300/month.
With OmniHuman v1.5, you pay nothing in months you do not create. Make one video for $9.60. Make ten for $96. Make zero and pay zero. New accounts also get 50 free credits on signup for exploring Seedream and Seedance 1.0 Lite before buying.
See the full breakdown in our OmniHuman v1.5 pricing guide.
Ready to try OmniHuman v1.5? Start creating free →

Step-by-Step: Your First Video in Minutes
- Prep the photo. Pick a well-lit portrait, or generate one with Seedream.
- Prep the audio. Record yourself, or use a high-quality TTS voice. Trim leading and trailing silence.
- Open the model. Go to seedance.it.com and pick OmniHuman v1.5, or jump straight to the model page.
- Upload inputs. Photo, audio, and a short scene prompt.
- Configure. Choose 720p or 1080p, toggle turbo if you want speed.
- Generate. A few minutes later, your video is ready.
- Review and download. MP4 out, ready for any editor or platform.
For a deeper walkthrough, see the talking head tutorial.
Real Use Cases Worth Your Attention
Corporate training — Consistent presenter videos without scheduling shoots. Update content by swapping audio. Full guide.
E-learning — Course instructors can ship lessons at scale. Avatars hold attention better than static slides with voiceover. Full guide.
Sales outreach — Personalized video messages that address prospects by name. Full guide.
Customer support — Video answers to FAQs resolve issues faster than help articles. Full guide.
YouTube and podcasting — AI co-hosts, explainer segments, and video versions of audio shows. See YouTube and podcast guides.
Healthcare education — Patient explainers in any language from a single trusted face. Full guide.
Multilingual content — Same photo, swap audio, ship ten language versions with consistent identity. Full guide.
Tips That Improve Output Quality
Photo Selection
- Even, soft lighting with no harsh shadows
- Front-facing or slight three-quarter angle
- Hands away from face
- Clean background that contrasts with the subject
Audio Preparation
- Record in a quiet room with minimal reverb
- Speak at a natural pace, include real pauses
- Normalize levels to prevent clipping
- Strip music or ambient noise before uploading
Prompt Writing
- Be specific: "modern office with large windows" beats "nice background"
- Name the lighting: "soft studio key light" or "warm afternoon sun"
- State the framing: "medium close-up, head and shoulders"
- Do not contradict the photo (no describing a different hair color)
Iteration
- Turbo mode for test runs
- Change one variable at a time
- Keep a file of prompts that worked
OmniHuman v1.5 vs the Alternatives
| Feature | OmniHuman v1.5 | HeyGen | Synthesia | D-ID |
|---|---|---|---|---|
| Custom photo input | Yes | Limited | No (preset avatars) | Yes |
| Lip sync quality | Excellent | Good | Good | Moderate |
| Gesture generation | Audio-driven | Template-based | Template-based | Limited |
| Max resolution | 1080p | 1080p | 1080p | 1024x1024 |
| Pricing model | Pay-per-use | Subscription | Subscription | Mixed |
| Cost per video | ~$8-10 | $24-48/mo | $30-90/mo | $5-300/mo |
| Free credits on signup | 50 credits | Limited trial | Free trial | Limited trial |
Skip the monthly subscription trap
HeyGen, Synthesia, and D-ID charge you every month — even when you make zero videos. OmniHuman v1.5 charges $9.60 only when you create.
Generate Your First Video
Honest Limitations
- Duration caps. 60s at 720p, 30s at 1080p. Longer content needs stitching.
- One person per video. Multi-person scenes require compositing.
- Upper body only. No full-body animation.
- Audio quality matters. Bad input, bad lip sync.
- Not real-time. Generations take minutes, not milliseconds.
- No live streaming. Output is a pre-rendered MP4.
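The duration cap is the easiest limitation to work around: generate segments and join them with ffmpeg's concat demuxer, which concatenates same-codec MP4s without re-encoding. This is standard ffmpeg usage, not a Seedance feature; file names below are placeholders:

```python
import pathlib
import subprocess
import tempfile

def stitch(segments: list, output: str = "full_video.mp4") -> list:
    """Build the ffmpeg concat-demuxer command that joins same-codec MP4
    segments without re-encoding. Returns the command as an argument list."""
    # The concat demuxer reads a text file listing one input per line.
    listing = pathlib.Path(tempfile.gettempdir()) / "segments.txt"
    listing.write_text("".join(f"file '{s}'\n" for s in segments))
    cmd = ["ffmpeg", "-f", "concat", "-safe", "0",
           "-i", str(listing), "-c", "copy", output]
    # subprocess.run(cmd, check=True)  # run this once ffmpeg is installed
    return cmd
```

Because `-c copy` skips re-encoding, stitching is fast and lossless, but the segments must share codec, resolution, and frame rate, which OmniHuman outputs at the same settings will.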
Frequently Asked Questions
Can I use any photo?
You can use any photo meeting the input requirements, but you are responsible for rights and permissions. Seedance's terms of service cover the legal details.
Does it work with non-English audio?
Yes. Any spoken language works, with highest accuracy on languages that have strong training representation. See the multilingual guide.
Can I edit the output?
Yes. The output is a standard MP4. Trim it, crop it, composite it — whatever your editor supports.
Is there an API?
Yes. See the API guide.
Start Creating
- Sign up for Seedance and grab your 50 free credits
- Buy credits — the $25 Popular tier (2,750 credits) covers two full OmniHuman videos with 830 credits left over
- Prep a photo and audio file
- Open OmniHuman v1.5
- Upload, configure, generate
No subscription. No minimum. Just $9.60 per lifelike talking video, whenever you need one.
Ready to try OmniHuman v1.5? Start creating free →