gemini-vision Skill
Google Gemini API for image understanding - captioning, classification, visual QA, object detection, segmentation.
When to Use
Use gemini-vision when you need:
- Image captioning
- Object detection
- Visual question answering
- Document OCR
- Multi-image comparison
- Segmentation masks
Quick Start
Invoke the Skill
"Use gemini-vision to analyze product images and extract:
- Product name
- Color
- Condition
- Defects if any"
What You Get
The skill will help you:
- Set up Gemini API
- Process images
- Extract information
- Handle responses
- Manage API costs
Common Use Cases
Product Analysis
"Use gemini-vision to analyze product photos:
- Identify product type
- Extract visible text
- Assess quality
- Detect damage"
Document OCR
"Use gemini-vision to extract text from invoice.jpg and structure as JSON"
Multi-Image Comparison
"Use gemini-vision to compare before/after photos and list differences"
Object Detection
"Use gemini-vision with gemini-2.0-flash to detect all objects in image with bounding boxes"
Supported Formats
- Images: PNG, JPEG, WEBP, HEIC, HEIF
- Documents: PDF (up to 1,000 pages)
- Size: up to 20 MB per request when files are sent inline; use the File API for larger files
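The format and size rules above can be checked before anything is sent. This is a minimal sketch: the helper name `plan_upload` and the extension-to-MIME table are illustrative assumptions rather than part of any SDK, and the threshold treats 20 MB as 20 * 1024 * 1024 bytes.

```python
import os

# Extension -> MIME type for the formats listed above (illustrative table).
SUPPORTED_MIME_TYPES = {
    ".png": "image/png",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".webp": "image/webp",
    ".heic": "image/heic",
    ".heif": "image/heif",
    ".pdf": "application/pdf",
}

# Inline limit taken from the list above; the exact byte count is an assumption.
INLINE_LIMIT_BYTES = 20 * 1024 * 1024


def plan_upload(filename: str, size_bytes: int) -> tuple[str, str]:
    """Return (mime_type, method), where method is 'inline' or 'file_api'."""
    ext = os.path.splitext(filename)[1].lower()
    if ext not in SUPPORTED_MIME_TYPES:
        raise ValueError(f"Unsupported file type: {ext or filename}")
    method = "inline" if size_bytes <= INLINE_LIMIT_BYTES else "file_api"
    return SUPPORTED_MIME_TYPES[ext], method
```

For example, `plan_upload("scan.pdf", 30 * 1024 * 1024)` would route the file to the File API, while a small JPEG stays inline.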
Available Models
- gemini-2.5-pro: Most capable, segmentation + detection
- gemini-2.5-flash: Fast, efficient
- gemini-2.0-flash: Object detection
- gemini-1.5-pro/flash: Previous generation
API Setup
Get API Key
- Visit Google AI Studio
- Create API key
- Set environment variable:
export GEMINI_API_KEY="your-key-here"
Or in a .env file (Python does not read this automatically; load it with a tool such as python-dotenv):
GEMINI_API_KEY=your-key-here
Install SDK
pip install google-genai
Usage Examples
Basic Analysis
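import os
from google import genai
from google.genai import types

# Read the key from the environment instead of hard-coding it
client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])
# Send the image bytes; a bare filename string would be treated as prompt text
with open('image.jpg', 'rb') as f:
    image_part = types.Part.from_bytes(data=f.read(), mime_type='image/jpeg')
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=['What objects are in this image?', image_part]
)
print(response.text)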
Structured Output
"Use gemini-vision to analyze receipt.jpg and return JSON with:
{
'store': 'store name',
'date': 'purchase date',
'items': ['item list'],
'total': 'total amount'
}"
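Models sometimes wrap JSON in a Markdown code fence or omit a field, so it can help to validate the reply before using it. The sketch below is one possible approach; `parse_receipt_json` and `REQUIRED_FIELDS` are hypothetical names, and the field list mirrors the receipt prompt above.

```python
import json

FENCE = "`" * 3  # Markdown code fence marker
REQUIRED_FIELDS = ("store", "date", "items", "total")


def parse_receipt_json(text: str) -> dict:
    """Parse model output into a dict and check the receipt fields are present."""
    cleaned = text.strip()
    # Strip a surrounding Markdown code fence if the model added one.
    if cleaned.startswith(FENCE):
        cleaned = cleaned.split("\n", 1)[1] if "\n" in cleaned else ""
        cleaned = cleaned.rsplit(FENCE, 1)[0]
    data = json.loads(cleaned)
    missing = [field for field in REQUIRED_FIELDS if field not in data]
    if missing:
        raise ValueError(f"Missing fields: {missing}")
    return data
```

Passing `response.text` from a Gemini call to this helper would be the typical use. Malformed JSON and missing fields both surface as `ValueError`.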
Batch Processing
"Use gemini-vision to process folder of product images and create CSV with product details"
Token Costs
Images consume tokens based on size:
- Small (both dimensions ≤384px): 258 tokens
- Larger: tiled into 768x768 chunks, 258 tokens each
Example: a 960x540 image is split into 6 tiles, so 6 x 258 = 1,548 tokens
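The tiling rule can be turned into a rough estimator. This sketch assumes the crop unit is about min(width, height) / 1.5 and that partial tiles round up. That assumption reproduces the 960x540 example, but the service may count differently, so treat the result as an estimate. `estimate_image_tokens` is a hypothetical helper name.

```python
import math

TOKENS_PER_TILE = 258


def estimate_image_tokens(width: int, height: int) -> int:
    """Rough token estimate for one image under the tiling rule above."""
    if width <= 384 and height <= 384:
        return TOKENS_PER_TILE
    # Assumed crop unit: about min(width, height) / 1.5, never below 1 pixel.
    crop_unit = max(1, math.floor(min(width, height) / 1.5))
    tiles = math.ceil(width / crop_unit) * math.ceil(height / crop_unit)
    return tiles * TOKENS_PER_TILE


print(estimate_image_tokens(960, 540))  # 3 x 2 tiles -> 1548
```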
Best Practices
Image Quality
- Use clear, well-lit images
- Ensure correct rotation
- Higher resolution for text extraction
- Compress large files
Prompting
- Be specific in questions
- Request structured output (JSON/CSV)
- Provide examples for accuracy
- Specify required fields
Cost Optimization
- Resize images before upload
- Use File API for repeated analysis
- Choose appropriate model (Flash for speed)
- Batch related images
Advanced Features
Object Detection (2.0+)
"Use gemini-vision with gemini-2.0-flash to:
- Detect all objects
- Return bounding boxes
- Label each object
- Estimate a confidence score for each object"
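Detected boxes usually need converting to pixel coordinates before they can be drawn or cropped. The sketch below assumes each box is returned as [ymin, xmin, ymax, xmax] on a 0-1000 scale, which is the convention Gemini detection output commonly follows. Verify it against the current documentation. `to_pixel_box` is a hypothetical helper.

```python
def to_pixel_box(box_2d, image_width, image_height):
    """Convert [ymin, xmin, ymax, xmax] on a 0-1000 scale to pixels.

    Returns (left, top, right, bottom), the order most drawing libraries expect.
    """
    ymin, xmin, ymax, xmax = box_2d
    return (
        round(xmin * image_width / 1000),
        round(ymin * image_height / 1000),
        round(xmax * image_width / 1000),
        round(ymax * image_height / 1000),
    )


print(to_pixel_box([100, 200, 500, 800], 1000, 500))  # (200, 50, 800, 250)
```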
Segmentation (2.5+)
"Use gemini-vision with gemini-2.5-pro to:
- Create segmentation masks
- Identify distinct objects
- Separate foreground/background"
Multi-Image Analysis
"Use gemini-vision to compare these 5 product photos and identify which one shows damage"
Integration Examples
E-commerce Product Listing
"Use gemini-vision to:
1. Analyze product photo
2. Extract product attributes
3. Generate description
4. Categorize product
5. Output as JSON for database"
Quality Control
"Use gemini-vision for manufacturing QC:
- Detect defects
- Compare to reference image
- Classify defect types
- Generate inspection report"
Document Processing
"Use gemini-vision to:
1. Extract text from scanned invoice
2. Parse line items
3. Calculate totals
4. Validate against expected format"
Error Handling
Common errors:
- 401: Invalid API key
- 429: Rate limit exceeded
- 400: Invalid image format/size
- 403: Restricted content
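Of the errors above, only 429 is worth retrying. The others point to a problem with the key, the image, or the content. A minimal backoff sketch follows. It assumes the raised exception exposes the HTTP status as a `code` attribute, and `call_with_backoff` is a hypothetical helper, not an SDK function.

```python
import random
import time

RETRYABLE_CODES = {429}  # rate limit exceeded


def call_with_backoff(fn, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(); retry with exponential backoff when the error is retryable."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:
            # Assumption: the exception carries the HTTP status in `code`.
            code = getattr(exc, "code", None)
            if code not in RETRYABLE_CODES or attempt == max_attempts - 1:
                raise
            # Wait about 1s, 2s, 4s, ... plus a little jitter.
            sleep(base_delay * 2 ** attempt + random.uniform(0, 0.1))
```

A Gemini request could then be wrapped as `call_with_backoff(lambda: client.models.generate_content(...))`.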
Quick Examples
Simple Caption:
"Use gemini-vision to caption this image"
Product Catalog:
"Use gemini-vision to analyze product images and create catalog with:
- Product name
- Description
- Key features
- Suggested price range"
Document Extraction:
"Use gemini-vision to extract all text and tables from multi-page PDF invoice"
Rate Limits
Free tier (limits vary by model and change over time):
- 15 RPM (requests per minute)
- 1 million TPM (tokens per minute)
- 1,500 RPD (requests per day)
Paid tiers scale up significantly.
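One simple way to stay under a requests-per-minute budget is to space calls evenly on the client side. The sketch below uses the 15 RPM figure as an example. `MinIntervalLimiter` is a hypothetical class, and it does not track token or daily limits.

```python
import time


class MinIntervalLimiter:
    """Space calls evenly so they stay under a requests-per-minute budget."""

    def __init__(self, rpm, clock=time.monotonic, sleep=time.sleep):
        self.interval = 60.0 / rpm  # e.g. 15 RPM -> one request every 4 seconds
        self.clock = clock
        self.sleep = sleep
        self.next_allowed = None

    def wait(self):
        """Block until the next request is allowed, then reserve the slot."""
        now = self.clock()
        if self.next_allowed is not None and now < self.next_allowed:
            self.sleep(self.next_allowed - now)
            now = self.next_allowed
        self.next_allowed = now + self.interval
```

Calling `limiter.wait()` immediately before each request keeps a loop over many images within the budget.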
Bottom Line: gemini-vision analyzes images with AI. Extract text, detect objects, answer visual questions - all with simple prompts.