| name | ltx-video |
| description | Set up, configure, and run LTX-2/LTX-2.3 (Lightricks) for AI video and audio generation. Use when: (1) Installing LTX-2 locally or via ComfyUI, (2) Generating video from text, image, audio, or keyframes, (3) Running inference with the 22B dev/distilled models, (4) Fine-tuning with LoRA/IC-LoRA, (5) Configuring upscalers (spatial/temporal), (6) Optimizing VRAM usage (FP8, quantization, low-VRAM modes), (7) Building content-generation pipelines with LTX-2, (8) Writing prompts for video generation, (9) Integrating LTX-2 into ComfyUI workflows. Triggers on: ltx, ltx-video, ltx-2, lightricks video, text-to-video generation, video diffusion model.
LTX-2 Video Generation
LTX-2.3, developed by Lightricks, is the first DiT-based audio-video foundation model: 22B parameters,
synchronized audio and video generation, and real-time capable.
Quick Setup
git clone https://github.com/Lightricks/LTX-2.git
cd LTX-2
uv sync --frozen
source .venv/bin/activate
Or run scripts/setup-ltx.sh for automated setup with model downloads.
Model Selection
| Model | Steps | Speed | Quality | Min VRAM |
|---|---|---|---|---|
| ltx-2.3-22b-dev | 40 | Baseline | Highest | 24 GB |
| ltx-2.3-22b-distilled | 8 | ~3x faster | Very good | 16 GB |
| ltx-2.3-22b-distilled + FP8 | 8 | ~3x faster | Very good | 12 GB |
Supporting models (download alongside):
- ltx-2.3-spatial-upscaler-x2-1.1 -- 2x resolution enhancement
- ltx-2.3-spatial-upscaler-x1.5-1.0 -- 1.5x resolution (less VRAM)
- ltx-2.3-temporal-upscaler-x2-1.0 -- doubles FPS (smoother motion)
- ltx-2.3-22b-distilled-lora-384 -- distilled LoRA for the dev model
- Gemma 3 text encoder (12B) -- required for all pipelines
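
The table above can be read as a simple lookup: take the highest-quality row whose minimum VRAM fits the GPU. The following is a minimal sketch of that reading, not part of LTX-2 itself. `ModelOption` and `pick_model` are hypothetical names, and the `fp8-cast` value for the FP8 row is an assumption borrowed from the low-VRAM section of this document.

```python
from typing import NamedTuple, Optional


class ModelOption(NamedTuple):
    name: str
    steps: int
    min_vram_gb: int
    quantization: Optional[str]  # value for --quantization, or None


# Rows copied from the Model Selection table, ordered from highest
# quality to lowest VRAM requirement.
MODEL_OPTIONS = [
    ModelOption("ltx-2.3-22b-dev", 40, 24, None),
    ModelOption("ltx-2.3-22b-distilled", 8, 16, None),
    # Assumption: "+ FP8" in the table means --quantization fp8-cast.
    ModelOption("ltx-2.3-22b-distilled", 8, 12, "fp8-cast"),
]


def pick_model(vram_gb: float) -> Optional[ModelOption]:
    """Return the first (highest-quality) option that fits, or None."""
    for option in MODEL_OPTIONS:
        if vram_gb >= option.min_vram_gb:
            return option
    return None
```

A `None` result means none of the listed configurations fit, in which case the Cloud API section is the fallback.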
Pipelines
Choose based on use case:
| Pipeline | Use Case | Notes |
|---|---|---|
| TI2VidTwoStagesPipeline | Default production: text/image to video | 2x spatial upsampling, best quality |
| TI2VidTwoStagesHQPipeline | Higher quality, fewer steps needed | Second-order sampling |
| TI2VidOneStagePipeline | Quick prototyping | Lower res, faster |
| DistilledPipeline | Fastest inference | 8 fixed steps, CFG=1 |
| ICLoraPipeline | Video/image transformations | Style transfer, control |
| KeyframeInterpolationPipeline | Animate between keyframes | Start/end frame interpolation |
| A2VidPipelineTwoStage | Audio-conditioned video | Synced audio+video |
| RetakePipeline | Regenerate video regions | Selective re-generation |
Running Inference
CLI
Text-to-video:

python -m ltx_pipelines.run \
  --config configs/ltx-2.3-22b-dev-2stage.yaml \
  --prompt "A golden retriever running through autumn leaves in a park" \
  --height 704 --width 1216 \
  --num_frames 97 \
  --output output.mp4

Image-to-video (adds a conditioning image):

python -m ltx_pipelines.run \
  --config configs/ltx-2.3-22b-dev-2stage.yaml \
  --prompt "The scene comes alive with motion..." \
  --conditioning_image path/to/image.png \
  --height 704 --width 1216 \
  --num_frames 97 \
  --output output.mp4

Distilled model (faster):

python -m ltx_pipelines.run \
  --config configs/ltx-2.3-22b-distilled-2stage.yaml \
  --prompt "..." \
  --output output.mp4

FP8 quantization (lower VRAM):

python -m ltx_pipelines.run \
  --config configs/ltx-2.3-22b-dev-2stage.yaml \
  --quantization fp8-cast \
  --prompt "..." \
  --output output.mp4
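
When these commands are launched from a script, it can help to assemble the argument list in one place. The sketch below is a hypothetical helper (`build_ltx_command` is not part of LTX-2) that uses only the flags shown in the commands above. It assumes the flags can be combined freely, which this document does not confirm.

```python
from typing import List, Optional


def build_ltx_command(
    config: str,
    prompt: str,
    output: str = "output.mp4",
    height: Optional[int] = None,
    width: Optional[int] = None,
    num_frames: Optional[int] = None,
    conditioning_image: Optional[str] = None,
    quantization: Optional[str] = None,
) -> List[str]:
    """Build an argv list for `python -m ltx_pipelines.run`."""
    cmd = ["python", "-m", "ltx_pipelines.run", "--config", config, "--prompt", prompt]
    if conditioning_image:
        cmd += ["--conditioning_image", conditioning_image]
    if height is not None and width is not None:
        cmd += ["--height", str(height), "--width", str(width)]
    if num_frames is not None:
        cmd += ["--num_frames", str(num_frames)]
    if quantization:
        cmd += ["--quantization", quantization]
    cmd += ["--output", output]
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)`. Passing a list rather than a shell string avoids quoting problems with long prompts.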
Python API
from ltx_video.inference import infer, InferenceConfig

infer(InferenceConfig(
    pipeline_config="configs/ltx-2.3-22b-dev-2stage.yaml",
    prompt="A cinematic shot of ocean waves crashing on volcanic rocks at sunset",
    height=704,
    width=1216,
    num_frames=97,
    output_path="output.mp4",
    enhance_prompt=True,
))
Resolution & Frame Rules
- Width/height: must be divisible by 32 (e.g., 480x832, 704x1216, 768x512)
- Frame count: must be 8n + 1 (e.g., 33, 65, 97, 129, 257)
- Default: 1216x704 at 30 FPS
- Max frames: 1,441 (long-form with distilled models)
- FPS: 1-60, default 24-30
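
The rules above are easy to check before launching a long run. The following is a minimal sketch of such a pre-flight check. The function names are hypothetical, and the limits (divisibility by 32, 8n + 1 frames, at most 1,441 frames, 1-60 FPS) are taken directly from the list above rather than from the LTX-2 source.

```python
from typing import List

MAX_FRAMES = 1441  # from the rules above


def validate_generation_params(width: int, height: int, num_frames: int, fps: int = 24) -> List[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    errors = []
    if width % 32 != 0:
        errors.append(f"width {width} is not divisible by 32")
    if height % 32 != 0:
        errors.append(f"height {height} is not divisible by 32")
    if num_frames < 1 or (num_frames - 1) % 8 != 0:
        errors.append(f"num_frames {num_frames} is not of the form 8n + 1")
    if num_frames > MAX_FRAMES:
        errors.append(f"num_frames {num_frames} exceeds {MAX_FRAMES}")
    if not 1 <= fps <= 60:
        errors.append(f"fps {fps} is outside 1-60")
    return errors


def nearest_valid_frame_count(num_frames: int) -> int:
    """Snap to the closest 8n + 1 value, capped at MAX_FRAMES."""
    n = max(0, (num_frames - 1 + 4) // 8)  # integer round-half-up
    return min(8 * n + 1, MAX_FRAMES)
```

For example, a request for 100 frames snaps to 97, and a 1280x720 request is rejected because 720 is not divisible by 32.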
Prompting
Write detailed, chronological scene descriptions in ~200 words. Include:
- Main action and specific movements
- Character/object appearance details
- Environment and setting
- Camera angle and movement
- Lighting and atmosphere
- Notable transitions or changes
Use enhance_prompt=True for automatic prompt elaboration.
See references/prompting-guide.md for advanced techniques and examples.
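
One way to keep prompts consistent is to fill in the elements listed above separately and join them in order. The sketch below is only an illustration of that idea: `ScenePrompt` and `word_count` are hypothetical helpers, not part of LTX-2, and the field order simply mirrors the list above.

```python
from dataclasses import dataclass


@dataclass
class ScenePrompt:
    action: str            # main action and specific movements
    appearance: str        # character/object appearance details
    environment: str       # environment and setting
    camera: str            # camera angle and movement
    lighting: str          # lighting and atmosphere
    transitions: str = ""  # notable transitions or changes (optional)

    def render(self) -> str:
        """Join the non-empty parts into one chronological description."""
        parts = [
            self.action,
            self.appearance,
            self.environment,
            self.camera,
            self.lighting,
            self.transitions,
        ]
        return " ".join(p.strip() for p in parts if p.strip())


def word_count(text: str) -> int:
    """Whitespace-delimited word count, for checking against ~200 words."""
    return len(text.split())
```

Checking `word_count(prompt.render())` against the ~200-word target gives a quick signal that a prompt is too thin before spending GPU time on it.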
VRAM Optimization
| Technique | Savings | Flag/Config |
|---|---|---|
| FP8 quantization | ~40% | --quantization fp8-cast |
| FP8 scaled (Hopper GPUs) | ~45% | --quantization fp8-scaled-mm |
| Single-stage pipeline | ~30% | Use OneStagePipeline config |
| Distilled model | Fewer steps = less peak VRAM | Use distilled config |
| xFormers attention | ~15% | Install xformers, auto-detected |
| Flash Attention 3 | ~20% | Hopper GPUs, auto-detected |
| Gradient estimation | Reduce steps 40 to 20 | --gradient_estimation |
Apple Silicon (MLX)
LTX-2 runs on Apple Silicon via MLX-native ports or the MPS PyTorch backend. Q4 quantized
models fit on 24 GB machines; full precision needs 64 GB+. For MPS, pin PyTorch to 2.4.1
(2.5+ has a Conv3d regression). See references/apple-silicon.md for setup, patches, and memory tables.
Cloud API
Use hosted APIs when local hardware is insufficient: LTX API ($0.04-0.08/sec, official),
fal.ai ($0.20/video), or Replicate ($0.10/run). For self-hosted cloud, RunPod H100 runs
at ~$2.49/hr and Vast.ai from ~$0.50/hr. See references/cloud-api.md for Python examples and cost comparison.
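
To compare these options for a given clip, the arithmetic can be sketched as follows. The prices are the ones quoted above. The sketch assumes the LTX API rate is billed per second of generated video, which this document does not state explicitly, and both function names are hypothetical.

```python
from typing import Tuple


def ltx_api_cost_range(num_frames: int, fps: float,
                       low_per_sec: float = 0.04,
                       high_per_sec: float = 0.08) -> Tuple[float, float]:
    """Estimated (low, high) USD cost, assuming per-second-of-video billing."""
    seconds = num_frames / fps
    return seconds * low_per_sec, seconds * high_per_sec


def self_hosted_cost(runtime_seconds: float, hourly_rate: float) -> float:
    """USD cost of occupying a rented GPU for the given wall-clock time."""
    return hourly_rate * runtime_seconds / 3600


# A 97-frame clip at 24 FPS is about 4.04 s of video.
low, high = ltx_api_cost_range(97, 24)
print(f"LTX API: ${low:.2f}-${high:.2f}")  # LTX API: $0.16-$0.32
```

Under the same assumption, a rented H100 at ~$2.49/hr costs less than a $0.20 flat per-video fee only if the whole job, including model loading, finishes in under roughly 289 seconds.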
Low-VRAM CUDA (12 GB)
Run LTX-2 on 12 GB GPUs (RTX 4070, 3060 12GB) by combining three techniques:
- FP8 quantization (--quantization fp8-cast) -- reduces model memory ~40%
- Distilled model (ltx-2.3-22b-distilled) -- 8 fixed steps instead of 40, lower peak VRAM
- Single-stage pipeline (TI2VidOneStagePipeline or DistilledPipeline) -- avoids loading the spatial upscaler
python -m ltx_pipelines.run \
--config configs/ltx-2.3-22b-distilled-1stage.yaml \
--quantization fp8-cast \
--prompt "A cat sitting on a windowsill watching rain" \
--height 480 --width 832 \
--num_frames 33 \
--output output.mp4
At 480x832 with 33 frames, this fits comfortably in 12 GB. Increase resolution or frame
count only if monitoring shows headroom. Adding xFormers (pip install xformers) saves
another ~15% VRAM when available.
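
One way to do the monitoring mentioned above is to poll `nvidia-smi` while a generation runs. The sketch below parses its CSV output and reports free headroom. The helper names are hypothetical, and a single reading is a snapshot rather than the peak, so it should be sampled repeatedly during a run.

```python
import subprocess
from typing import List, Tuple


def parse_gpu_memory(csv_text: str) -> List[Tuple[int, int]]:
    """Parse 'used, total' lines (MiB), one per GPU."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(value.strip()) for value in line.split(","))
        rows.append((used, total))
    return rows


def headroom_fraction(used_mib: int, total_mib: int) -> float:
    """Fraction of GPU memory currently free."""
    return (total_mib - used_mib) / total_mib


def query_gpu_memory() -> List[Tuple[int, int]]:
    """Run nvidia-smi and return (used, total) in MiB for each GPU."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=memory.used,memory.total",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_memory(result.stdout)
```

If the sampled peak leaves little headroom, stay at 480x832 with 33 frames. How much headroom justifies stepping up is a judgment call that this document does not quantify.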
LoRA & Fine-Tuning
Train custom LoRAs in <1 hour for motion, style, or likeness:
cd packages/ltx-trainer
python train.py --config configs/lora_training.yaml \
--data_dir /path/to/videos \
--output_dir /path/to/output
Specialized control LoRAs available:
- Union IC-LoRA (depth + Canny edge)
- Motion tracking, Detailer, Pose control, Camera movement LoRAs
ComfyUI Integration
See references/comfyui-setup.md for full ComfyUI workflow setup.
Quick start: ComfyUI Manager → Install Custom Nodes → search "LTXVideo" → Install.
Requires: CUDA GPU with 32GB+ VRAM, 100GB+ disk space.
Troubleshooting
- CUDA OOM: Add --quantization fp8-cast, use distilled model, or reduce resolution
- Slow generation: Use distilled pipeline (8 steps vs 40), enable xFormers
- Poor quality: Write longer prompts (~200 words), enable enhance_prompt, use two-stage pipeline
- Audio issues: Audio quality lower without speech; provide audio conditioning when possible
- Frame count error: Ensure frames = 8n+1 (33, 65, 97, 129...)
- Resolution error: Ensure width/height divisible by 32