Wan 2.2: Performance Improvements and Crafting High-Impact Prompts

Wan 2.2 marks a significant upgrade over version 2.1, featuring a Mixture-of-Experts (MoE) architecture, a substantially expanded training dataset, and a new 5B hybrid model that handles both text-to-video and image-to-video.
In practice, this translates into noticeable performance improvements over Wan 2.1:
- Cleaner, sharper visuals – The MoE architecture balances detail enhancement with overall image consistency.
- More natural motion – Fast camera movements, layered scenes, and large-scale complex motion render more smoothly and reliably.
- Better control – Cinematic-level labels for lighting, composition, and colour give finer-grained direction over the final look.
- Budget-friendly experimentation – The 5B hybrid model runs on just 8 GB with offloading, making local testing more accessible.
Prompt Engineering Guide for Wan 2.2
Just like with Wan 2.1, prompts work best at 80–120 words. Leave out too much, and Wan 2.2 fills the gaps with its own “cinematic” defaults—which can be hit or miss.
1. Shot Structure
Begin with what the camera first captures, then outline how the shot develops:
- Opening scene → Camera motion → Reveal or payoff
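The opening → camera motion → payoff structure can be sketched as a small prompt-assembly helper, with a word-count check against the 80–120-word guideline. The function names and the example text are illustrative, not part of any Wan tooling:

```python
def build_prompt(opening, camera, payoff, style_tags=()):
    """Assemble a prompt in opening -> camera motion -> payoff order."""
    parts = [opening, camera, payoff, ", ".join(style_tags)]
    return " ".join(p.strip() for p in parts if p)

def word_count(prompt):
    """Rough word count to check against the 80-120 word sweet spot."""
    return len(prompt.split())

prompt = build_prompt(
    "A rain-soaked neon alley at night, steam rising from vents.",
    "The camera dollies in at shoulder height, following a hooded figure.",
    "It cranes up to reveal the skyline as the figure vanishes into the crowd.",
    ["teal-and-orange grade", "anamorphic bokeh", "shallow depth of field"],
)
print(word_count(prompt))  # pad the description until this lands in 80-120
```

Keeping the three beats as separate arguments makes it easy to swap camera moves while holding the scene constant.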
2. Camera Movements
Unlike its predecessor, Wan 2.2 does a much better job following detailed camera directions:
- Pan left/right
- Tilt up/down
- Dolly in/out
- Orbital arc
- Crane up
3. Describing Motion
Use descriptive motion cues to direct flow and depth:
- Speed terms: slow-motion, whip-pan, time-lapse
- Depth cues: “foreground grass sways while mountains remain still”
4. Visual Style Tags
Help define the look and feel with precise aesthetic tags:
- Lighting: volumetric dusk, harsh noon sun, neon rim light, etc.
- Colour-grade: teal-and-orange, bleach-bypass, Kodak Portra
- Lens/style: anamorphic bokeh, 16mm grain, CGI stylized
5. Timing & Resolution
Wan 2.2 performs best on clips under 5 seconds. You can fine-tune this with:
- Frame count: Don't go higher than 120
- Resolution: Use 960×540 for drafts, 1280×720 for production
- FPS: Default is 24; use 16 fps for faster prototyping
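The duration, FPS, and frame-count guidance above reduces to simple arithmetic; here is a minimal helper (names are my own) that derives a frame count and clamps it to the suggested 120-frame ceiling:

```python
MAX_FRAMES = 120  # per the frame-count guidance above

def frames_for(seconds, fps=24):
    """Frame count for a clip of the given length, clamped to the ceiling."""
    return min(round(seconds * fps), MAX_FRAMES)

frames_for(5, fps=24)   # 120 -> exactly at the ceiling
frames_for(5, fps=16)   # 80  -> cheaper prototyping run
frames_for(8, fps=24)   # 120 -> clamped; the clip would exceed 5 s
```

Dropping to 16 fps for drafts cuts the frame budget by a third before you commit to a 24 fps production render.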
6. Negative Prompting
Negative prompts are enforced more reliably in this version. We typically stick with the default Chinese negative prompt that ships with the model.
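A small sketch of how we keep a base negative prompt and append scene-specific terms. The English terms below are illustrative placeholders, not the model's actual default, which is a Chinese string bundled with Wan:

```python
# Illustrative placeholders -- the real default Wan negative prompt is a
# Chinese string shipped with the model, not this list.
DEFAULT_NEGATIVES = ["overexposed", "static", "blurry details", "worst quality"]

def negative_prompt(extra_terms=()):
    """Combine default negatives with scene-specific terms, dropping duplicates."""
    seen, out = set(), []
    for term in [*DEFAULT_NEGATIVES, *extra_terms]:
        if term not in seen:
            seen.add(term)
            out.append(term)
    return ", ".join(out)

negative_prompt(["text overlay", "watermark"])
```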
Examples
Tracking Shot
Prompt: "Cinematic NYC alley chase: The camera starts shoulder-height behind a hooded man steadily tracking forward as he weaves through crowds. Cold tones, high contrast, neon lights. Smooth glide with intense shake for immersive pursuer tension. Blurred steam and wet pavement. Lens flare, shallow depth of field."
Pan Left/Right
Prompt: "A low-angle shot of a young man in dappled sunlight. Backlighting, warm low-saturation tones. Slow-motion glide with handheld tremor for dreamy nostalgia. Blurred foliage for emotional focus. The camera pans left to a low-angle shot of a cute girl."
Dolly in/out
Prompt: "In the style of an American drama promotional poster, Iron Man sits in a sleek, futuristic metal chair inside a dimly lit industrial setting. He is fully suited in his iconic red and gold armor, the arc reactor glowing in his chest. Around him are scattered high-tech gadgets, and stacks of prototype schematics. He sits still, helmet off, revealing Tony Stark’s face—confident, composed, with a subtle smirk. Camera dollies out. The background shows an abandoned, dim factory with light filtering through the windows. There's a noticeable grainy texture. A medium shot with a straight-on close-up of the character."
Getting Started with Wan 2.2
Wan 2.2 comes in multiple variants tailored for different use cases. Here's a quick breakdown:
| Model Type | Model Name | Parameters | Use Case |
| --- | --- | --- | --- |
| Hybrid | Wan2.2-TI2V-5B | 5B | Dual-purpose model supporting both text-to-video and image-to-video |
| Image-to-Video | Wan2.2-I2V-A14B | 14B | Transforms still images into videos with consistent visual fidelity |
| Text-to-Video | Wan2.2-T2V-A14B | 14B | Generates videos directly from text, with strong aesthetic and semantic control |
The 5B hybrid model is ideal for local development, especially if you're working with limited GPU resources. It handles both major generation modes (text-to-video and image-to-video) within a single, lightweight package.
For those looking to get started immediately on high-performance GPUs, both 14B versions are ready to use on ViewComfy templates. These larger models offer enhanced fidelity and smoother motion—perfect for production-level content generation.
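The variants in the table can be captured in a small lookup. The model identifiers come from the table above; the selection function and its 24 GB VRAM threshold are illustrative assumptions, not official guidance:

```python
# Model identifiers copied from the table above.
WAN_22_MODELS = {
    ("t2v", "low_vram"): "Wan2.2-TI2V-5B",   # hybrid model covers both modes
    ("i2v", "low_vram"): "Wan2.2-TI2V-5B",
    ("t2v", "high_vram"): "Wan2.2-T2V-A14B",
    ("i2v", "high_vram"): "Wan2.2-I2V-A14B",
}

def pick_model(task, vram_gb):
    """Pick a Wan 2.2 variant; the 24 GB cutoff is an illustrative assumption."""
    tier = "low_vram" if vram_gb < 24 else "high_vram"
    return WAN_22_MODELS[(task, tier)]

pick_model("t2v", 8)    # 'Wan2.2-TI2V-5B'  -> fits the 8 GB offloading note
pick_model("i2v", 48)   # 'Wan2.2-I2V-A14B'
```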