| name | stable-diffusion-image-generation |
| description | State-of-the-art text-to-image generation with Stable Diffusion models via HuggingFace Diffusers. Use when generating images from text prompts, performing image-to-image translation, inpainting, or building custom diffusion pipelines. |
| version | 1.0.0 |
| author | Orchestra Research |
| license | MIT |
| dependencies | ["diffusers>=0.30.0","transformers>=4.41.0","accelerate>=0.31.0","torch>=2.0.0"] |
| platforms | ["linux","macos","windows"] |
| metadata | {"hermes":{"tags":["Image Generation","Stable Diffusion","Diffusers","Text-to-Image","Multimodal","Computer Vision"]}} |
Stable Diffusion Image Generation
Comprehensive guide to generating images with Stable Diffusion using the HuggingFace Diffusers library.
When to use Stable Diffusion
Use Stable Diffusion when:
- Generating images from text descriptions
- Performing image-to-image translation (style transfer, enhancement)
- Inpainting (filling in masked regions)
- Outpainting (extending images beyond boundaries)
- Creating variations of existing images
- Building custom image generation workflows
Key features:
- Text-to-Image: Generate images from natural language prompts
- Image-to-Image: Transform existing images with text guidance
- Inpainting: Fill masked regions with context-aware content
- ControlNet: Add spatial conditioning (edges, poses, depth)
- LoRA Support: Efficient fine-tuning and style adaptation
- Multiple Models: Supports SD 1.5, SDXL, SD 3.0, and Flux
Use alternatives instead:
- DALL-E 3: For API-based generation without GPU
- Midjourney: For artistic, stylized outputs
- Imagen: For Google Cloud integration
- Leonardo.ai: For web-based creative workflows
Quick start
Installation
# Core dependencies
pip install diffusers transformers accelerate torch
# Optional: xFormers memory-efficient attention kernels
pip install xformers
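The examples in this guide hard-code `"cuda"` and `torch.float16`. Since the guide lists Linux, macOS, and Windows as platforms, a small helper for picking a device and precision may be useful. This is a minimal sketch: `pick_device_and_dtype` and `detect` are hypothetical helper names, and the fallback choices (float16 on Apple's `mps` backend, float32 on CPU) are assumptions rather than Diffusers requirements.

```python
def pick_device_and_dtype(has_cuda: bool, has_mps: bool = False) -> tuple[str, str]:
    """Return (device, dtype_name) for loading a pipeline.

    dtype_name is an attribute name on the torch module, e.g. "float16".
    """
    if has_cuda:
        return "cuda", "float16"
    if has_mps:
        return "mps", "float16"
    # Assumption: half precision is a poor fit for CPU inference, so use float32.
    return "cpu", "float32"


def detect() -> tuple[str, str]:
    """Query PyTorch for available hardware (torch is imported lazily)."""
    import torch

    return pick_device_and_dtype(
        torch.cuda.is_available(),
        torch.backends.mps.is_available(),
    )


print(pick_device_and_dtype(has_cuda=False, has_mps=False))  # ('cpu', 'float32')
```

With PyTorch installed, `device, dtype_name = detect()` followed by `torch_dtype=getattr(torch, dtype_name)` and `pipe.to(device)` would replace the hard-coded values.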
Basic text-to-image
from diffusers import DiffusionPipeline
import torch
pipe = DiffusionPipeline.from_pretrained(
"stable-diffusion-v1-5/stable-diffusion-v1-5",
torch_dtype=torch.float16
)
pipe.to("cuda")
image = pipe(
"A serene mountain landscape at sunset, highly detailed",
num_inference_steps=50,
guidance_scale=7.5
).images[0]
image.save("output.png")
Using SDXL (higher quality)
from diffusers import AutoPipelineForText2Image
import torch
pipe = AutoPipelineForText2Image.from_pretrained(
"stabilityai/stable-diffusion-xl-base-1.0",
torch_dtype=torch.float16,
variant="fp16"
)
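# Model CPU offload moves each component to the GPU only while it is needed.
# Use it instead of pipe.to("cuda"); calling both defeats the memory savings.
pipe.enable_model_cpu_offload()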
image = pipe(
prompt="A futuristic city with flying cars, cinematic lighting",
height=1024,
width=1024,
num_inference_steps=30
).images[0]
Architecture overview
Three-pillar design
Diffusers is built around three core components:
Pipeline (orchestration)
├── Model (neural networks)
│ ├── UNet / Transformer (noise prediction)
│ ├── VAE (latent encoding/decoding)
│ └── Text Encoder (CLIP/T5)
└── Scheduler (denoising algorithm)
Pipeline inference flow
Text Prompt → Text Encoder → Text Embeddings
                                   ↓
Random Noise → [Denoising Loop] ← Scheduler
                      ↓
               Predicted Noise
                      ↓
               VAE Decoder → Final Image
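The flow above can be sketched as a plain loop. This is a schematic only, not the Diffusers implementation: `denoise`, `toy_predict_noise`, and `toy_scheduler_step` are hypothetical names, and the toy functions stand in for the UNet/Transformer and the scheduler so the example runs without model weights. The argument order of `scheduler_step` mirrors the `step(model_output, timestep, sample)` convention used by Diffusers schedulers.

```python
from typing import Callable, Sequence

Latents = list[float]


def denoise(
    latents: Latents,
    timesteps: Sequence[int],
    predict_noise: Callable[[Latents, int], Latents],
    scheduler_step: Callable[[Latents, int, Latents], Latents],
) -> Latents:
    """Repeatedly predict noise and let the scheduler update the latents."""
    for t in timesteps:
        noise = predict_noise(latents, t)            # role of the UNet/Transformer
        latents = scheduler_step(noise, t, latents)  # role of the scheduler
    return latents  # a real pipeline passes these to the VAE decoder


# Toy stand-ins (hypothetical): treat half of each value as noise and subtract it.
def toy_predict_noise(latents: Latents, t: int) -> Latents:
    return [0.5 * x for x in latents]


def toy_scheduler_step(noise: Latents, t: int, latents: Latents) -> Latents:
    return [x - n for x, n in zip(latents, noise)]


result = denoise(
    [8.0, -4.0],
    timesteps=[3, 2, 1],
    predict_noise=toy_predict_noise,
    scheduler_step=toy_scheduler_step,
)
print(result)  # [1.0, -0.5]
```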
Core concepts
Pipelines
Pipelines orchestrate complete workflows:
| Pipeline | Purpose |
|---|---|
| StableDiffusionPipeline | Text-to-image (SD 1.x/2.x) |
| StableDiffusionXLPipeline | Text-to-image (SDXL) |
| StableDiffusion3Pipeline | Text-to-image (SD 3.0) |
| FluxPipeline | Text-to-image (Flux models) |
| StableDiffusionImg2ImgPipeline | Image-to-image |
| StableDiffusionInpaintPipeline | Inpainting |
Schedulers
Schedulers control the denoising process:
| Scheduler | Steps | Quality | Use Case |
|---|---|---|---|
| EulerDiscreteScheduler | 20-50 | Good | Default choice |
| EulerAncestralDiscreteScheduler | 20-50 | Good | More variation |
| DPMSolverMultistepScheduler | 15-25 | Excellent | Fast, high quality |
| DDIMScheduler | 50-100 | Good | Deterministic |
| LCMScheduler | 4-8 | Good | Very fast |
| UniPCMultistepScheduler | 15-25 | Excellent | Fast convergence |
Swapping schedulers
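from diffusers import DPMSolverMultistepScheduler
# Build the new scheduler from the current one's config so model-specific settings carry over
pipe.scheduler = DPMSolverMultistepScheduler.from_config(
    pipe.scheduler.config
)
# Fewer steps are usually enough with this scheduler; `prompt` is any prompt string
image = pipe(prompt, num_inference_steps=20).images[0]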
Generation parameters
Key parameters
| Parameter | Default | Description |
|---|---|---|
| prompt | Required | Text description of desired image |
| negative_prompt | None | What to avoid in the image |
| num_inference_steps | 50 | Denoising steps (more steps cost time and give diminishing quality gains) |
| guidance_scale | 7.5 | Prompt adherence (7-12 typical; higher follows the prompt more closely) |
| height, width | 512 (SD 1.5) or 1024 (SDXL) | Output dimensions in pixels (multiples of 8) |
| generator | None | Torch generator for reproducibility |
| num_images_per_prompt | 1 | Number of images generated per prompt |
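`guidance_scale` drives classifier-free guidance: at each step the model predicts noise once with the prompt and once without it (or with the negative prompt), and the two predictions are combined. The sketch below shows that combination on plain Python lists. `apply_guidance` is a hypothetical name, and real pipelines do the same arithmetic on tensors.

```python
def apply_guidance(
    uncond: list[float], cond: list[float], guidance_scale: float
) -> list[float]:
    """Combine unconditional and prompt-conditioned noise predictions.

    Each element becomes uncond + guidance_scale * (cond - uncond).
    """
    return [u + guidance_scale * (c - u) for u, c in zip(uncond, cond)]


uncond = [0.0, 1.0]  # prediction without the prompt
cond = [1.0, 1.0]    # prediction with the prompt

print(apply_guidance(uncond, cond, 1.0))  # [1.0, 1.0]
print(apply_guidance(uncond, cond, 7.5))  # [7.5, 1.0]
```

A scale of 1.0 returns the conditioned prediction unchanged, which is why few-step setups such as the LCM workflow later in this guide can run with `guidance_scale=1.0`. Larger values push the result further in the direction the prompt suggests.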
Reproducible generation
import torch
generator = torch.Generator(device="cuda").manual_seed(42)
image = pipe(
prompt="A cat wearing a top hat",
generator=generator,
num_inference_steps=50
).images[0]
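When several images are generated in one call, a single generator reproduces the batch as a whole but not any one image on its own. A common pattern is to derive one seed per image and pass a list of generators, which Diffusers pipelines accept for the `generator` argument. The sketch below is illustrative: `make_seeds` and `make_generators` are hypothetical helper names.

```python
def make_seeds(base_seed: int, count: int) -> list[int]:
    """Derive one seed per image so each image can be regenerated on its own."""
    return [base_seed + i for i in range(count)]


def make_generators(seeds: list[int], device: str = "cuda") -> list:
    """Create one seeded torch.Generator per seed (torch is imported lazily)."""
    import torch

    return [torch.Generator(device=device).manual_seed(s) for s in seeds]


seeds = make_seeds(42, 4)
print(seeds)  # [42, 43, 44, 45]
```

With a loaded pipeline, `pipe(prompt, num_images_per_prompt=4, generator=make_generators(seeds))` would produce four images, and storing `seeds` alongside the outputs lets you recreate a single favorite later.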
Negative prompts
image = pipe(
prompt="Professional photo of a dog in a garden",
negative_prompt="blurry, low quality, distorted, ugly, bad anatomy",
guidance_scale=7.5
).images[0]
Image-to-image
Transform existing images with text guidance:
import torch
from diffusers import AutoPipelineForImage2Image
from PIL import Image
pipe = AutoPipelineForImage2Image.from_pretrained(
"stable-diffusion-v1-5/stable-diffusion-v1-5",
torch_dtype=torch.float16
).to("cuda")
init_image = Image.open("input.jpg").resize((512, 512))
image = pipe(
prompt="A watercolor painting of the scene",
image=init_image,
strength=0.75,
num_inference_steps=50
).images[0]
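`strength` sets how far the input image is pushed back toward noise, and therefore how many of the requested denoising steps actually run. The arithmetic below is a minimal sketch, assuming the rule used by the Diffusers img2img pipelines (`int(num_inference_steps * strength)`, capped at `num_inference_steps`). `effective_steps` is a hypothetical helper name.

```python
def effective_steps(num_inference_steps: int, strength: float) -> int:
    """Number of denoising steps that run for a given img2img strength."""
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0.0 and 1.0")
    return min(int(num_inference_steps * strength), num_inference_steps)


print(effective_steps(50, 0.75))  # 37
print(effective_steps(50, 0.5))   # 25
print(effective_steps(50, 1.0))   # 50
```

Low strength keeps the result close to the input image and runs few steps. A strength of 1.0 runs the full schedule and largely ignores the input.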
Inpainting
Fill masked regions:
import torch
from diffusers import AutoPipelineForInpainting
from PIL import Image
pipe = AutoPipelineForInpainting.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-inpainting",
torch_dtype=torch.float16
).to("cuda")
# Source image and a mask of the same size: white pixels are repainted, black pixels are kept
image = Image.open("photo.jpg")
mask = Image.open("mask.png")
result = pipe(
prompt="A red car parked on the street",
image=image,
mask_image=mask,
num_inference_steps=50
).images[0]
ControlNet
Add spatial conditioning for precise control:
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
import torch
controlnet = ControlNetModel.from_pretrained(
"lllyasviel/control_v11p_sd15_canny",
torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
"stable-diffusion-v1-5/stable-diffusion-v1-5",
controlnet=controlnet,
torch_dtype=torch.float16
).to("cuda")
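# get_canny_image is a user-supplied helper (not part of diffusers) that returns
# a Canny edge map as a PIL image; input_image is your source PIL image
control_image = get_canny_image(input_image)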
image = pipe(
prompt="A beautiful house in the style of Van Gogh",
image=control_image,
num_inference_steps=30
).images[0]
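The ControlNet example above calls `get_canny_image`, which this guide does not define. One possible implementation is sketched here, assuming `opencv-python` and `numpy` are installed (neither appears in this guide's dependency list). The default thresholds of 100 and 200 are illustrative starting points, not recommended values.

```python
def get_canny_image(image, low_threshold: int = 100, high_threshold: int = 200):
    """Return a 3-channel Canny edge map of a PIL image, as a PIL image."""
    if not 0 <= low_threshold < high_threshold:
        raise ValueError("expected 0 <= low_threshold < high_threshold")

    # Imported lazily so the argument check works without the optional packages.
    import cv2
    import numpy as np
    from PIL import Image

    pixels = np.array(image.convert("RGB"))                   # (H, W, 3) uint8
    edges = cv2.Canny(pixels, low_threshold, high_threshold)  # (H, W) uint8
    edges_rgb = np.stack([edges] * 3, axis=-1)                # repeat to 3 channels
    return Image.fromarray(edges_rgb)
```

Usage would look like `input_image = Image.open("house.jpg")` followed by `control_image = get_canny_image(input_image)`, where `house.jpg` is a placeholder file name. Lower thresholds keep more edges and constrain the generated image more tightly.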
Available ControlNets
| ControlNet | Input Type | Use Case |
|---|---|---|
| canny | Edge maps | Preserve structure |
| openpose | Pose skeletons | Human poses |
| depth | Depth maps | 3D-aware generation |
| normal | Normal maps | Surface details |
| mlsd | Line segments | Architectural lines |
| scribble | Rough sketches | Sketch-to-image |
LoRA adapters
Load fine-tuned style adapters:
import torch
from diffusers import DiffusionPipeline
pipe = DiffusionPipeline.from_pretrained(
"stable-diffusion-v1-5/stable-diffusion-v1-5",
torch_dtype=torch.float16
).to("cuda")
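# Load LoRA weights from a local directory or Hub repository
pipe.load_lora_weights("path/to/lora", weight_name="style.safetensors")
# Generate with the LoRA applied
image = pipe("A portrait in the trained style").images[0]
# Optional: merge the LoRA into the base weights at reduced strength
pipe.fuse_lora(lora_scale=0.8)
# Remove the adapter when done (call pipe.unfuse_lora() first to restore the original base weights)
pipe.unload_lora_weights()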
Multiple LoRAs
# Named adapters require the peft package
pipe.load_lora_weights("lora1", adapter_name="style")
pipe.load_lora_weights("lora2", adapter_name="character")
# Activate both adapters with per-adapter weights
pipe.set_adapters(["style", "character"], adapter_weights=[0.7, 0.5])
image = pipe("A portrait").images[0]
Memory optimization
Enable CPU offloading
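# Option 1: model-level offload; moves whole components to the GPU as they are needed
pipe.enable_model_cpu_offload()
# Option 2: sequential offload; lowest memory use but much slower. Use one option, not both
pipe.enable_sequential_cpu_offload()
Attention slicing
# Compute attention in slices to reduce peak memory
pipe.enable_attention_slicing()
# Or use the smallest slices for the largest savings
pipe.enable_attention_slicing("max")
xFormers memory-efficient attention
# Requires the xformers package
pipe.enable_xformers_memory_efficient_attention()
VAE slicing and tiling
# Decode a batch of latents one image at a time
pipe.enable_vae_slicing()
# Decode in tiles so large images fit in memory
pipe.enable_vae_tiling()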
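The options above can be grouped into presets so a script can switch between them with one argument. This is a sketch under stated assumptions: `PRESETS`, `apply_memory_optimizations`, and the two preset names are hypothetical, and the groupings are illustrative rather than tuned recommendations. `RecordingPipe` is a stand-in object so the example runs without a GPU. A real pipeline would be passed in its place.

```python
PRESETS = {
    # Hypothetical groupings; adjust them for your own hardware.
    "balanced": ["enable_model_cpu_offload", "enable_vae_slicing"],
    "low_vram": [
        "enable_sequential_cpu_offload",
        "enable_attention_slicing",
        "enable_vae_slicing",
        "enable_vae_tiling",
    ],
}


def apply_memory_optimizations(pipe, level: str = "balanced") -> list[str]:
    """Call each optimization method in the preset and return the names called."""
    if level not in PRESETS:
        raise ValueError(f"unknown level: {level!r}")
    for method_name in PRESETS[level]:
        getattr(pipe, method_name)()
    return list(PRESETS[level])


class RecordingPipe:
    """Stand-in that records which methods were called on it."""

    def __init__(self) -> None:
        self.calls: list[str] = []

    def __getattr__(self, name: str):
        # Only reached for attributes that do not exist, i.e. the enable_* methods.
        def method() -> None:
            self.calls.append(name)

        return method


fake = RecordingPipe()
apply_memory_optimizations(fake, "low_vram")
print(fake.calls)
```

Each preset contains only one of the two offload modes, since model-level and sequential offloading are alternatives.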
Model variants
Loading different precisions
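import torch
from diffusers import DiffusionPipeline
# FP16: half-precision weights; variant="fp16" downloads the fp16 checkpoint files
pipe = DiffusionPipeline.from_pretrained(
    "model-id",
    torch_dtype=torch.float16,
    variant="fp16"
)
# BF16: wider dynamic range than FP16; needs hardware with bfloat16 support
pipe = DiffusionPipeline.from_pretrained(
    "model-id",
    torch_dtype=torch.bfloat16
)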
Loading specific components
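import torch
from diffusers import AutoencoderKL, DiffusionPipeline
# Load a replacement VAE in the same dtype as the rest of the pipeline
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse", torch_dtype=torch.float16)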
pipe = DiffusionPipeline.from_pretrained(
"stable-diffusion-v1-5/stable-diffusion-v1-5",
vae=vae,
torch_dtype=torch.float16
)
Batch generation
Generate multiple images efficiently:
prompts = [
"A cat playing piano",
"A dog reading a book",
"A bird painting a picture"
]
images = pipe(prompts, num_inference_steps=30).images
images = pipe(
"A beautiful sunset",
num_images_per_prompt=4,
num_inference_steps=30
).images
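Passing a long prompt list in one call generates every image in a single batch, which can exhaust GPU memory. One workaround is to split the list into smaller batches and call the pipeline once per batch. The sketch below shows only the splitting step: `chunked` is a hypothetical helper name, and the batch size of 2 is arbitrary.

```python
from typing import Iterator


def chunked(prompts: list[str], batch_size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of `prompts` with at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(prompts), batch_size):
        yield prompts[start:start + batch_size]


prompts = [
    "A cat playing piano",
    "A dog reading a book",
    "A bird painting a picture",
    "A fox writing a letter",
    "An owl fixing a clock",
]
batches = list(chunked(prompts, 2))
print([len(batch) for batch in batches])  # [2, 2, 1]
```

With a loaded pipeline, the images could then be collected with `[img for batch in chunked(prompts, 2) for img in pipe(batch, num_inference_steps=30).images]`.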
Common workflows
Workflow 1: High-quality generation
from diffusers import StableDiffusionXLPipeline, DPMSolverMultistepScheduler
import torch
pipe = StableDiffusionXLPipeline.from_pretrained(
"stabilityai/stable-diffusion-xl-base-1.0",
torch_dtype=torch.float16,
variant="fp16"
)
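# Swap in a faster scheduler
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
# Model CPU offload places components on the GPU as needed, so pipe.to("cuda") is not called here
pipe.enable_model_cpu_offload()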
image = pipe(
prompt="A majestic lion in the savanna, golden hour lighting, 8k, detailed fur",
negative_prompt="blurry, low quality, cartoon, anime, sketch",
num_inference_steps=30,
guidance_scale=7.5,
height=1024,
width=1024
).images[0]
Workflow 2: Fast prototyping
from diffusers import AutoPipelineForText2Image, LCMScheduler
import torch
pipe = AutoPipelineForText2Image.from_pretrained(
"stabilityai/stable-diffusion-xl-base-1.0",
torch_dtype=torch.float16
).to("cuda")
pipe.load_lora_weights("latent-consistency/lcm-lora-sdxl")
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
pipe.fuse_lora()
image = pipe(
"A beautiful landscape",
num_inference_steps=4,
guidance_scale=1.0
).images[0]
Common issues
CUDA out of memory:
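# Load the weights in half precision (model_id is a placeholder for your model)
pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
# Then enable memory optimizations
pipe.enable_model_cpu_offload()
pipe.enable_attention_slicing()
pipe.enable_vae_slicing()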
Black/noise images:
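# The safety checker replaces flagged outputs with black images; disabling it rules that cause out
pipe.safety_checker = None
# Make sure every pipeline component uses the same dtype
pipe = pipe.to(dtype=torch.float16)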
Slow generation:
from diffusers import DPMSolverMultistepScheduler
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
image = pipe(prompt, num_inference_steps=20).images[0]