| name | gem |
| description | Multimodal AI processing and image generation using Google Gemini. Use for analyzing PDFs, images, videos, YouTube links, and other large documents. Also generates images with Nano Banana Pro. Ideal when you need to extract information from files that require vision or multimodal understanding, or generate images from text prompts. |
Gemini Multimodal Tool
Use the ai-gem CLI tool for multimodal AI processing and image generation via Google's Gemini API.
Usage
ai-gem "Write a haiku about Python programming"
ai-gem "Summarize this document" document.pdf
ai-gem "What's in this image?" photo.jpg
ai-gem "Create a 5-point summary" "https://youtu.be/VIDEO_ID"
ai-gem "Compare these files" file1.pdf file2.png
ai-gem "Current AI news" --search
ai-gem --image "A cute robot reading a book in a cozy library"
ai-gem --image "A landscape at sunset" --aspect-ratio 16:9
ai-gem --image "A cat wearing a hat" -o cat.png
ai-gem --image "Edit this to add sunglasses" reference.jpg
ai-gem --image "A blue triangle" -m gemini-2.5-flash-image
Image Generation Options
- --image / -i: Generate an image instead of text
- --output / -o: Output file path (auto-generated if omitted)
- --aspect-ratio / -a: Aspect ratio (1:1, 9:16, 16:9, etc.)
- --model / -m: Override model (default: nano-banana-pro-preview)
- Attachments serve as reference images for editing
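
The options above can be combined in a small wrapper. The sketch below renders one image per aspect ratio using only the flags listed in this section. `render_variants`, the `GEM_CMD` variable, and the `image-<ratio>.png` naming scheme are hypothetical names made up for this example, and it assumes `--aspect-ratio` and `-o` can be passed together in a single call, which the usage examples show only separately.

```shell
# Command to run; override it (e.g. GEM_CMD=echo) for a dry run.
GEM_CMD="${GEM_CMD:-ai-gem}"

# render_variants PROMPT RATIO [RATIO...]
# Hypothetical helper: calls ai-gem once per aspect ratio.
render_variants() {
  rv_prompt=$1
  shift
  for rv_ratio in "$@"; do
    # Turn 16:9 into 16x9 so the ratio is safe to use in a file name.
    rv_out="image-$(printf '%s' "$rv_ratio" | tr ':' 'x').png"
    "$GEM_CMD" --image "$rv_prompt" --aspect-ratio "$rv_ratio" -o "$rv_out"
  done
}
```

Calling `render_variants "A landscape at sunset" 1:1 16:9` would then issue two ai-gem calls, writing image-1x1.png and image-16x9.png. Setting `GEM_CMD=echo` first prints the commands instead of running them.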
Requirements
- GEMINI_API_KEY environment variable must be set
- The hamel package must be installed: pip install hamel
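
Both requirements can be checked before the first call. The following is a minimal pre-flight sketch. `gem_preflight` is a hypothetical helper name, and it assumes that installing the hamel package puts the ai-gem command on PATH.

```shell
# gem_preflight [COMMAND]
# Hypothetical helper: returns 0 only if the API key is set and the
# command (ai-gem by default) can be found on PATH.
gem_preflight() {
  gp_cmd=${1:-ai-gem}
  if [ -z "${GEMINI_API_KEY:-}" ]; then
    echo "GEMINI_API_KEY is not set" >&2
    return 1
  fi
  if ! command -v "$gp_cmd" >/dev/null 2>&1; then
    echo "$gp_cmd not found on PATH; try: pip install hamel" >&2
    return 1
  fi
  return 0
}
```

A script could then guard its calls with `gem_preflight && ai-gem "Summarize this document" document.pdf`.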
Supported Input Types
- PDFs
- Images (PNG, JPEG, GIF, WebP)
- Videos (MP4, etc.)
- YouTube URLs
- Plain text files
- Multiple files for comparison
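
Since attachments may be either local files or URLs, it can help to confirm that local paths exist before sending a request. The sketch below is one way to do that. `gem_check_inputs` is a hypothetical helper, not part of ai-gem, and it checks only that files exist, not that their type is supported.

```shell
# gem_check_inputs ATTACHMENT [ATTACHMENT...]
# Hypothetical helper: returns 1 if any non-URL attachment is not a regular file.
gem_check_inputs() {
  for gci_arg in "$@"; do
    case "$gci_arg" in
      http://*|https://*)
        # URLs (e.g. YouTube links) are passed through unchecked.
        ;;
      *)
        if [ ! -f "$gci_arg" ]; then
          echo "missing attachment: $gci_arg" >&2
          return 1
        fi
        ;;
    esac
  done
  return 0
}
```

For example, `gem_check_inputs file1.pdf file2.png && ai-gem "Compare these files" file1.pdf file2.png` skips the call when either file is missing.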