OmniHuman v1.5: Create Lifelike AI Avatars from a Single Photo
The complete guide to OmniHuman v1.5 on Seedance. Learn how to create realistic AI avatar videos from a single photo and audio file, including features, input requirements, pricing, output specs, and real-world use cases.

One photo. One audio file. One realistic talking video for $9.60 — no subscription, no credit card lock-in, no monthly minimum. That is the promise of OmniHuman v1.5, and this guide shows you exactly how it delivers.
TL;DR
- Turn any portrait photo plus an audio file into a lifelike talking video
- 960 credits ($9.60) per generation — pay only when you create
- 720p up to 60 seconds, or 1080p up to 30 seconds
- Phoneme-accurate lip sync, natural gestures, audio-driven expressions
- No subscription required, unlike HeyGen ($24-$48/mo) or Synthesia ($30-$90/mo)
What OmniHuman v1.5 Actually Does
OmniHuman v1.5 is ByteDance's flagship avatar model, available exclusively on Seedance. You upload a single portrait photo and a speech audio file, and the model generates a complete talking-head video — lip sync, eye blinks, head tilts, shoulder shifts, and micro-expressions all synthesized from scratch.
This is not the old "paste an animated mouth on a static face" trick. Every frame is freshly generated by a diffusion transformer that understands both identity and audio. The person in your photo moves the way a real human would while delivering that specific audio.
If you have used Seedance 2.0 for cinematic video or Seedream for image generation, OmniHuman v1.5 completes the set: text to image, image to video, and now photo plus audio to avatar.
Create your AI presenter now
Turn one photo + audio into a lifelike talking video. $9.60 per video, no subscription.
Try OmniHuman Free
The Five Features That Matter
1. Phoneme-Level Lip Sync
The model extracts phonemes from your audio and maps them to exact mouth shapes frame by frame. Plosives like "p" and "b" show lip closure. Fricatives like "f" and "v" show the teeth against the lower lip. Vowels produce proper jaw opening. Viewers watching with sound on will not catch the model faking it.
2. Audio-Driven Facial Expressions
When the audio sounds enthusiastic, the eyebrows lift. When there is a pause for emphasis, the eyes narrow slightly. These cues are inferred from vocal prosody, not baked into rigid templates.
3. Natural Gesture Generation
Shoulder shifts, head turns, and subtle postural changes emerge from the speech signal itself. A speaker making a point leans forward. One listing items shifts weight. The motion correlates with meaning, not with a timer.
4. Turbo Mode
Need speed for iteration? Turbo mode trims generation time without changing the 960-credit cost. Use it to preview prompts and inputs, then switch to standard mode for the hero render.
5. Dual Resolution, Dual Duration
| Resolution | Max Audio Duration | Best For |
|---|---|---|
| 720p (1280x720) | 60 seconds | Social clips, training, drafts |
| 1080p (1920x1080) | 30 seconds | Client work, product demos, marketing |
How the Pipeline Works
Understanding the stages helps you feed the model better inputs.
Stage 1: Identity extraction. A facial encoder turns your photo into a high-dimensional embedding capturing bone structure, skin tone, eye shape, and hundreds of other markers. This embedding keeps the face consistent across every frame.
Stage 2: Audio analysis. The speech track is analyzed for phonemes, prosody, energy contour, and emotional tone.
Stage 3: Motion synthesis. A motion network turns audio features into per-frame parameters for face, head, and shoulders.
Stage 4: Diffusion rendering. A transformer generates final pixels, combining identity, motion, and your prompt description of the scene.
Stage 5: Temporal coherence. A refinement pass kills flicker, judder, and identity drift between frames.
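The five stages above can be sketched as a simple dataflow. This is a conceptual sketch only; every function name and data shape here is illustrative, not Seedance's actual internals:

```python
# Conceptual sketch of the five-stage pipeline described above.
# All names and return values are illustrative placeholders.

def extract_identity(photo: str) -> dict:
    """Stage 1: encode the portrait into an identity embedding reused every frame."""
    return {"identity": f"embedding({photo})"}

def analyze_audio(audio: str) -> dict:
    """Stage 2: extract phonemes, prosody, energy contour, emotional tone."""
    return {"phonemes": ["p", "ah"], "prosody": "neutral"}

def synthesize_motion(audio_features: dict) -> list:
    """Stage 3: turn audio features into per-frame face/head/shoulder parameters."""
    return [{"frame": i, "mouth": p} for i, p in enumerate(audio_features["phonemes"])]

def render_frames(identity: dict, motion: list, prompt: str) -> list:
    """Stage 4: diffusion rendering combines identity, motion, and the scene prompt."""
    return [f"{identity['identity']}|{m['mouth']}|{prompt}" for m in motion]

def refine(frames: list) -> list:
    """Stage 5: temporal-coherence pass removes flicker and identity drift."""
    return frames  # placeholder: a real pass would smooth across frames

def generate(photo: str, audio: str, prompt: str) -> list:
    features = analyze_audio(audio)
    return refine(render_frames(extract_identity(photo),
                                synthesize_motion(features), prompt))
```

The key structural point the sketch captures: identity is computed once and held fixed, while motion is recomputed per frame from the audio.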
What You Need to Supply
Reference Photo
- Resolution: minimum 512x512, ideally 1024x1024 or higher
- Composition: head and shoulders, face at least 30% of frame
- Lighting: even and diffused beats harsh shadows every time
- Expression: neutral or mildly pleasant gives the model the most room to work
- Format: JPEG or PNG
Audio File
- Format: MP3, WAV, or M4A
- Duration: up to 60s at 720p, up to 30s at 1080p
- Quality: clean speech with no music bed, no heavy reverb
- Language: any spoken language works
Text Prompt
Describe the scene: background ("modern office with large windows"), lighting ("soft key light from the left"), framing ("medium shot, head and shoulders"), and any style notes.
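Before uploading, you can sanity-check your inputs against the requirements above. A minimal pre-flight check (the limits come from this guide; the function itself is an illustrative sketch, not a Seedance tool):

```python
def preflight(width: int, height: int, audio_seconds: float,
              resolution: str, audio_format: str, image_format: str) -> list:
    """Return a list of problems; an empty list means the inputs look valid."""
    problems = []
    if min(width, height) < 512:
        problems.append("photo below 512x512 minimum")
    if image_format.upper() not in {"JPEG", "JPG", "PNG"}:
        problems.append("photo must be JPEG or PNG")
    if audio_format.upper() not in {"MP3", "WAV", "M4A"}:
        problems.append("audio must be MP3, WAV, or M4A")
    # Duration cap depends on the output resolution you pick.
    max_seconds = {"720p": 60, "1080p": 30}[resolution]
    if audio_seconds > max_seconds:
        problems.append(f"audio exceeds {max_seconds}s cap at {resolution}")
    return problems
```

For example, a 45-second clip passes at 720p but fails at 1080p, which is the most common mismatch to catch before spending credits.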
Pricing: The Pay-Per-Use Math
OmniHuman v1.5 costs 960 credits per video, equal to roughly $9.60 at the base rate. There is no subscription. You buy credits, you spend them when you generate, and unused credits stay in your account.
| Credit Package | Price | Credits | Effective Cost per Video |
|---|---|---|---|
| Starter | $10 | 1,050 | ~$9.14 |
| Popular | $25 | 2,750 | ~$8.73 |
| Pro | $50 | 5,750 | ~$8.35 |
| Max | $100 | 12,000 | ~$8.00 |
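The effective-cost column follows directly from the credit math: each video costs 960 credits, so cost per video is package price divided by credits, times 960. A quick check:

```python
CREDITS_PER_VIDEO = 960  # OmniHuman v1.5 cost per generation

def effective_cost(price_usd: float, credits: int) -> float:
    """Dollar cost of one 960-credit generation at a package's credit rate."""
    return round(price_usd / credits * CREDITS_PER_VIDEO, 2)

packages = {"Starter": (10, 1_050), "Popular": (25, 2_750),
            "Pro": (50, 5_750), "Max": (100, 12_000)}
for name, (price, credits) in packages.items():
    full_videos = credits // CREDITS_PER_VIDEO
    print(f"{name}: ${effective_cost(price, credits)} per video, "
          f"{full_videos} full videos")
```

The same arithmetic also tells you how many full generations a package buys: credits divided by 960, rounded down.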
Why This Beats Subscriptions
HeyGen starts at $24/month with a monthly minute cap. Synthesia starts at $30/month. D-ID starts at $5/month for a tiny allowance and scales to $300/month.
With OmniHuman v1.5, you pay nothing in months you do not create. Make one video for $9.60. Make ten for $96. Make zero and pay zero. New accounts also get 50 free credits on signup for exploring Seedream and Seedance 1.0 Lite before buying.
See the full breakdown in our OmniHuman v1.5 pricing guide.
Ready to try OmniHuman v1.5? Start creating free →

Step-by-Step: Your First Video in Minutes
- Prep the photo. Pick a well-lit portrait, or generate one with Seedream.
- Prep the audio. Record yourself, or use a high-quality TTS voice. Trim leading and trailing silence.
- Open the model. Go to seedance.it.com and pick OmniHuman v1.5, or jump straight to the model page.
- Upload inputs. Photo, audio, and a short scene prompt.
- Configure. Choose 720p or 1080p, toggle turbo if you want speed.
- Generate. A few minutes later, your video is ready.
- Review and download. MP4 out, ready for any editor or platform.
For a deeper walkthrough, see the talking head tutorial.
Real Use Cases Worth Your Attention
Corporate training — Consistent presenter videos without scheduling shoots. Update content by swapping audio. Full guide.
E-learning — Course instructors can ship lessons at scale. Avatars hold attention better than static slides with voiceover. Full guide.
Sales outreach — Personalized video messages that address prospects by name. Full guide.
Customer support — Video answers to FAQs resolve issues faster than help articles. Full guide.
YouTube and podcasting — AI co-hosts, explainer segments, and video versions of audio shows. See YouTube and podcast guides.
Healthcare education — Patient explainers in any language from a single trusted face. Full guide.
Multilingual content — Same photo, swap audio, ship ten language versions with consistent identity. Full guide.
Tips That Improve Output Quality
Photo Selection
- Even, soft lighting with no harsh shadows
- Front-facing or slight three-quarter angle
- Hands away from face
- Clean background that contrasts with the subject
Audio Preparation
- Record in a quiet room with minimal reverb
- Speak at a natural pace, include real pauses
- Normalize levels to prevent clipping
- Strip music or ambient noise before uploading
Prompt Writing
- Be specific: "modern office with large windows" beats "nice background"
- Name the lighting: "soft studio key light" or "warm afternoon sun"
- State the framing: "medium close-up, head and shoulders"
- Do not contradict the photo (no describing a different hair color)
Iteration
- Turbo mode for test runs
- Change one variable at a time
- Keep a file of prompts that worked
OmniHuman v1.5 vs the Alternatives
| Feature | OmniHuman v1.5 | HeyGen | Synthesia | D-ID |
|---|---|---|---|---|
| Custom photo input | Yes | Limited | No (preset avatars) | Yes |
| Lip sync quality | Excellent | Good | Good | Moderate |
| Gesture generation | Audio-driven | Template-based | Template-based | Limited |
| Max resolution | 1080p | 1080p | 1080p | 1024x1024 |
| Pricing model | Pay-per-use | Subscription | Subscription | Mixed |
| Cost per video | ~$8-10 | $24-48/mo | $30-90/mo | $5-300/mo |
| Free credits on signup | 50 credits | Limited trial | Free trial | Limited trial |
Skip the monthly subscription trap
HeyGen, Synthesia, and D-ID charge you every month — even when you make zero videos. OmniHuman v1.5 charges $9.60 only when you create.
Generate Your First Video
Honest Limitations
- Duration caps. 60s at 720p, 30s at 1080p. Longer content needs stitching.
- One person per video. Multi-person scenes require compositing.
- Upper body only. No full-body animation.
- Audio quality matters. Bad input, bad lip sync.
- Not real-time. Generations take minutes, not milliseconds.
- No live streaming. Output is a pre-rendered MP4.
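The duration cap is the easiest limitation to work around: generate segments and join them with ffmpeg's concat demuxer, which concatenates same-codec MP4s without re-encoding. This is standard ffmpeg usage, not a Seedance feature; file names below are placeholders:

```python
import pathlib
import subprocess
import tempfile

def stitch(segments: list, output: str = "full_video.mp4") -> list:
    """Build the ffmpeg concat-demuxer command that joins same-codec MP4
    segments without re-encoding. Returns the command as an argument list."""
    # The concat demuxer reads a text file listing one input per line.
    listing = pathlib.Path(tempfile.gettempdir()) / "segments.txt"
    listing.write_text("".join(f"file '{s}'\n" for s in segments))
    cmd = ["ffmpeg", "-f", "concat", "-safe", "0",
           "-i", str(listing), "-c", "copy", output]
    # subprocess.run(cmd, check=True)  # run this once ffmpeg is installed
    return cmd
```

Because `-c copy` skips re-encoding, stitching is fast and lossless, but the segments must share codec, resolution, and frame rate, which OmniHuman outputs at the same settings will.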
Frequently Asked Questions
Can I use any photo?
You can use any photo meeting the input requirements, but you are responsible for rights and permissions. Seedance's terms of service cover the legal details.
Does it work with non-English audio?
Yes. Any spoken language works, with highest accuracy on languages that have strong training representation. See the multilingual guide.
Can I edit the output?
Yes. The output is a standard MP4. Trim it, crop it, composite it — whatever your editor supports.
Is there an API?
Yes. See the API guide.
Start Creating
- Sign up for Seedance and grab your 50 free credits
- Buy credits — the $25 Popular tier (2,750 credits) covers two full OmniHuman videos with 830 credits left over
- Prep a photo and audio file
- Open OmniHuman v1.5
- Upload, configure, generate
No subscription. No minimum. Just $9.60 per lifelike talking video, whenever you need one.
Ready to try OmniHuman v1.5? Start creating free →