Alibaba HappyHorse 1.1 · joint audio-video, multilingual lip-sync

HappyHorse 1.1 AI Video Generator

Alibaba's audio-video model — characters that speak, lip-synced across seven languages, consistent across every scene

HappyHorse 1.1 is Alibaba's audio-video model, and SupaImagine runs it in the browser. It generates picture and sound together, so when a character speaks, their mouth tracks the dialogue in English, Mandarin, Cantonese, Japanese, Korean, German, or French. Start from text, a first frame, or up to nine reference images that hold a character, product, or palette steady from one shot to the next. It renders clips of three to fifteen seconds at 720p or 1080p, and every render is saved to your library.

  • Lip-synced dialogue in seven languages
  • Audio synthesized in the same pass as the picture
  • Up to 9 reference images for a consistent cast
  • Text, a frame, or references — three ways in
  • Every clip saved to your private library

Generate With HappyHorse 1.1

Write a line of dialogue or a scene, add a first frame or up to nine reference images, pick 720p or 1080p, then run. Your clip lands straight in the library.

The Video Model Built to Talk

Joint audio, lip-sync across seven languages, and a cast that stays consistent — Alibaba's HappyHorse 1.1, in your browser.

HappyHorse 1.1 is the model to reach for when the video has to talk. Most generators hand you motion and leave the sound for later; HappyHorse synthesizes audio and picture together and syncs a character's lips to the dialogue in any of seven languages, with no separate scoring or dubbing step. Hand it up to nine reference images and a recurring character, product, or look stays consistent scene to scene without fine-tuning. The honest trade is resolution: it tops out at 1080p, so this is the model for spokespeople, dubbed explainers, and character-driven sequences, not a 4K hero-shot finisher. For 4K, Seedance 2 sits one switch away in the same generator.

How it works

From Prompt to Talking Clip in Four Steps

1. Write the scene and the line

Type what happens and what's said. Add a first frame to animate, or up to nine reference images to lock a character, product, or palette. Spell out the dialogue and the language — HappyHorse syncs the mouth to it.

2. Set resolution, ratio, and length

Choose 720p or 1080p, one of nine aspect ratios from 21:9 to 9:16, and a length from three to fifteen seconds. The generator shows the exact credit cost before you run.

3. Render the take

Send it off. Picture and audio are synthesized together in one pass, so a clip can come back already scored and lip-synced — and if a run fails, its credits come straight back to your balance.

4. Carry it into the next scene

Every clip saves to your private library. Re-run with a tweaked line, swap the reference set, or carry the same character forward so the next shot still matches.
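
The limits named in the steps above (720p or 1080p, three to fifteen seconds, up to nine reference images) can be checked before a run. What follows is a minimal Python sketch under stated assumptions, not SupaImagine's actual API: `RenderRequest` and `validate` are hypothetical names, and aspect ratio is left out because this page names only the 21:9 and 9:16 endpoints of the nine ratios.

```python
from dataclasses import dataclass

# Limits quoted on this page. The names below are hypothetical, not a real API.
RESOLUTIONS = ("720p", "1080p")
MIN_SECONDS, MAX_SECONDS = 3, 15
MAX_REFERENCE_IMAGES = 9


@dataclass
class RenderRequest:
    prompt: str
    resolution: str = "720p"
    seconds: float = 5
    reference_images: tuple = ()


def validate(request: RenderRequest) -> list:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    if request.resolution not in RESOLUTIONS:
        problems.append(f"resolution must be one of {RESOLUTIONS}")
    if not MIN_SECONDS <= request.seconds <= MAX_SECONDS:
        problems.append(f"length must be {MIN_SECONDS} to {MAX_SECONDS} seconds")
    if len(request.reference_images) > MAX_REFERENCE_IMAGES:
        problems.append(f"at most {MAX_REFERENCE_IMAGES} reference images")
    return problems
```

A request for a 15-second 1080p clip passes with an empty list, while a request for 4K or for 20 seconds comes back with one message per broken limit.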

Why HappyHorse 1.1 Is Built for Dialogue

Lip-sync that speaks seven languages

When a character has a line, HappyHorse 1.1 matches their mouth to the audio instead of letting the speech float free — across English, Mandarin, Cantonese, Japanese, Korean, German, and French, because the model was trained on dialogue in all seven. That's the difference between a clip you can ship as a spokesperson piece or a dubbed explainer and one where the lips give it away.

Illustration: four sequential frames of one presenter speaking, suggesting lip-synced dialogue

Sound generated, not added later

HappyHorse synthesizes audio in the same pass as the motion — dialogue, ambience, music, and Foley generated alongside the picture rather than dropped in during a separate scoring session. The intent is a take that arrives ready to watch, not a silent render still waiting on its soundtrack.

Illustration: a singer mid-performance, suggesting video that generates its own sound

A cast that stays consistent

Hand it up to nine reference images and point to them in the prompt — "the woman in [Image 1]", "the bottle in [Image 2]" — and a character, a product, or a colour palette holds its look from one shot to the next. It's how a multi-scene sequence reads as one piece without fine-tuning a model first.

Illustration: the same character kept consistent across reference frames and a new scene
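
Since references are named in the prompt as "[Image 1]", "[Image 2]", and so on, a quick check can catch a tag that points at an image you never attached. This is a small Python sketch, assuming the tag format is exactly the one quoted above; `unmatched_tags` is a hypothetical helper, not part of any SupaImagine tooling.

```python
import re

MAX_REFERENCE_IMAGES = 9  # limit quoted on this page
IMAGE_TAG = re.compile(r"\[Image (\d+)\]")  # assumed tag format, from the examples above


def unmatched_tags(prompt: str, attached_count: int) -> list:
    """Return the image numbers used in the prompt that have no attached image."""
    if attached_count > MAX_REFERENCE_IMAGES:
        raise ValueError(f"at most {MAX_REFERENCE_IMAGES} reference images")
    numbers = {int(n) for n in IMAGE_TAG.findall(prompt)}
    # Tags are 1-based, so anything outside 1..attached_count has no match.
    return sorted(n for n in numbers if not 1 <= n <= attached_count)
```

With two images attached, a prompt that mentions "[Image 1]" and "[Image 2]" returns an empty list, and one that mentions "[Image 3]" returns `[3]`.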

Any shape, up to fifteen seconds

Render three to fifteen seconds in nine aspect ratios, from a 21:9 cinematic crop to a 9:16 vertical for social — so a clip drops into the cut or the feed without reframing. Pick the canvas to match where the video lands, not the other way around.

Illustration: one scene reframed as wide, square, and vertical aspect ratios

Three ways in

Start From Words, a Still, or a Set of References

Pick the input that matches what you're starting from — the mode switches inside the same generator.

Text-to-video — write the scene and the script

Describe the shot and the dialogue and HappyHorse builds the whole take, sound included, with a character's mouth synced to the words. No footage to start from.

Image-to-video — animate a first frame

Drop in a still — a portrait, a product shot, a piece of key art — and HappyHorse moves it into a clip, deriving the aspect ratio from the frame you give it.

Reference-to-video — hold one cast across shots

Attach up to nine reference images so a recurring character, object, or palette stays consistent across a sequence; name them in the prompt to place each one.

Where it fits

Where a Talking Model Earns Its Keep

The work that needs a voice, a face, and a consistent cast — not a silent 4K hero shot.

Spokespeople & talking avatars

A presenter who delivers a line straight to camera, mouth synced to the audio — for product intros, announcements, and talking-head clips without a shoot.

Dubbed & multilingual explainers

Walk through a feature once, then ship it lip-synced in English, Mandarin, Cantonese, Japanese, Korean, German, or French — the same explainer, localized to the viewer.

Localized social ads

Run one ad concept across markets with the dialogue swapped per language and the lips matching each cut, so a campaign doesn't read as a bad dub.

Character-driven sequences

Keep a recurring character consistent across shots with a fixed reference set, so a short story or episodic clip holds together scene to scene.

Product demos with a narrator

Animate a product still and pair it with a synced voiceover, so the walkthrough explains itself instead of needing captions bolted on later.

Course & tutorial clips

Turn a script into a narrated lesson with an on-screen presenter, lip-synced and saved to your library to update as the material changes.

Lip-Sync, Audio, and the Rest — Answered

What is HappyHorse 1.1?

HappyHorse 1.1 is Alibaba's audio-video generation model, and SupaImagine runs it in the browser. It synthesizes picture and sound together, syncs a character's lips to dialogue across seven languages, and holds a cast consistent with up to nine reference images. It works from text, a first frame, or a set of references, at 720p or 1080p. You can run it alongside other models such as Seedance 2 and Veo 3 in the same workspace.

What languages can HappyHorse 1.1 lip-sync?

Seven: English, Mandarin, Cantonese, Japanese, Korean, German, and French. HappyHorse 1.1 was trained on dialogue in each, so a character's mouth tracks the spoken audio in that language rather than drifting out of sync — which is what makes it usable for spokespeople, dubbed explainers, and localized ads where the same scene ships in more than one language. You write the line in the prompt; the audio and the lip movement are generated with the clip.

What can I feed HappyHorse 1.1 — text, an image, or references?

All three, in separate modes. Text-to-video builds a scene and its dialogue from a written prompt; image-to-video animates a single first frame you upload; reference-to-video takes up to nine images to keep a character, product, or palette consistent across a sequence. You switch modes inside the same generator, and your prompt carries across.

What resolution and clip length does HappyHorse 1.1 support?

It renders at 720p or 1080p — it doesn't go to 4K, so for a 4K master reach for Seedance 2 in the same generator instead. Clips run from three to fifteen seconds, in nine aspect ratios from 21:9 down to 9:16. The generator shows the exact credit cost for each combination before you run.

Do HappyHorse 1.1 clips come with sound?

HappyHorse 1.1 synthesizes audio jointly with the picture — dialogue, ambience, music, and Foley generated in the same pass — so a clip can come back already scored and, when a character speaks, lip-synced. It's part of how the model works rather than a separate step you trigger afterward.

How much does a HappyHorse 1.1 clip cost on SupaImagine?

Video is billed by the second and scales with resolution, so a longer or 1080p clip costs more than a short 720p one, and the generator shows the exact credit cost before you run. A new account starts with a small credit grant: enough to explore the workspace but not to render a full clip, so you'll need a plan or a credit pack before your first render. The pricing page lists the current packages.
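
To make the per-second billing concrete, here is a rough Python sketch. It assumes cost is simply seconds multiplied by a per-resolution rate, which this page does not confirm, and the rates in `EXAMPLE_RATES` are placeholders for illustration only, not SupaImagine's prices. The generator's own cost readout is the real figure.

```python
def estimate_credits(seconds: float, resolution: str, credits_per_second: dict) -> float:
    """Estimate a clip's cost, assuming a flat per-second rate for each resolution."""
    if resolution not in credits_per_second:
        raise ValueError(f"no rate for {resolution}")
    return seconds * credits_per_second[resolution]


# Placeholder rates for illustration only; NOT actual SupaImagine pricing.
EXAMPLE_RATES = {"720p": 1.0, "1080p": 2.0}
```

Under those placeholder rates, a five-second 720p clip comes to 5.0 credits and a fifteen-second 1080p clip to 30.0, which shows how both length and resolution move the total.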

Can I use HappyHorse 1.1 clips commercially?

On a paid plan, yes: those clips are cleared for commercial use, including ads, client spots, and localized campaigns. The free starter credits are for trying the workspace and don't carry those rights; the legal page spells out the exact terms.

Give your characters a voice — start with HappyHorse 1.1

Joint audio, lip-sync across seven languages, and a consistent cast — with every clip saved to your SupaImagine library.