gemini-vision Skill

Google Gemini API for image understanding - captioning, classification, visual QA, object detection, segmentation.

When to Use

Use gemini-vision when you need:

Image captioning
Object detection
Visual question answering
Document OCR
Multi-image comparison
Segmentation masks

Quick Start

Invoke the Skill

"Use gemini-vision to analyze product images and extract:
- Product name
- Color
- Condition
- Defects if any"

What You Get

The skill will help you:

Set up Gemini API
Process images
Extract information
Handle responses
Manage API costs

Common Use Cases

Product Analysis

"Use gemini-vision to analyze product photos:
- Identify product type
- Extract visible text
- Assess quality
- Detect damage"

Document OCR

"Use gemini-vision to extract text from invoice.jpg and structure as JSON"

Multi-Image Comparison

"Use gemini-vision to compare before/after photos and list differences"

Object Detection

"Use gemini-vision with gemini-2.0-flash to detect all objects in image with bounding boxes"

Supported Formats

Images: PNG, JPEG, WEBP, HEIC, HEIF
Documents: PDF (up to 1,000 pages)
Size: 20MB max total request size when sending images inline; use the File API for larger files
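The limits above can be checked before sending anything. This is a minimal sketch: `plan_upload` and the `MIME_TYPES` table are hypothetical helpers, not part of the SDK, and the threshold simply mirrors the 20MB figure listed above.

```python
from pathlib import Path

# MIME types for the formats listed above
MIME_TYPES = {
    ".png": "image/png",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".webp": "image/webp",
    ".heic": "image/heic",
    ".heif": "image/heif",
    ".pdf": "application/pdf",
}

INLINE_LIMIT_BYTES = 20 * 1024 * 1024  # 20MB inline limit from the list above


def plan_upload(filename, size_bytes):
    """Return (mime_type, strategy), where strategy is 'inline' or 'file_api'."""
    suffix = Path(filename).suffix.lower()
    if suffix not in MIME_TYPES:
        raise ValueError(f"unsupported format: {suffix or filename}")
    # Hypothetical rule: anything at or over the limit goes through the File API
    strategy = "inline" if size_bytes < INLINE_LIMIT_BYTES else "file_api"
    return MIME_TYPES[suffix], strategy
```

The prompt text counts toward the request size too, so a real check would probably leave some headroom below 20MB.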

Available Models

gemini-2.5-pro: Most capable, segmentation + detection
gemini-2.5-flash: Fast, efficient
gemini-2.0-flash: Object detection
gemini-1.5-pro/flash: Previous generation

API Setup

Get API Key

Visit Google AI Studio
Create API key
Set environment variable:

export GEMINI_API_KEY="your-key-here"

Or in .env file:

GEMINI_API_KEY=your-key-here
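A script can fail early with a readable message when the key is missing. The sketch below assumes the variable name used above; `load_api_key` is a hypothetical helper, not an SDK function. A .env file is not read by the shell on its own, so it would need a loader such as python-dotenv before this check runs.

```python
import os


def load_api_key(env=None):
    """Return the Gemini API key from the environment, or raise a clear error."""
    env = os.environ if env is None else env
    key = env.get("GEMINI_API_KEY", "").strip()
    # Treat the placeholder from the setup instructions as "not set"
    if not key or key == "your-key-here":
        raise RuntimeError("GEMINI_API_KEY is not set; create a key in Google AI Studio")
    return key
```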

Install SDK

pip install google-genai

Usage Examples

Basic Analysis

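from google import genai
from google.genai import types

# With no api_key argument, the client reads GEMINI_API_KEY from the environment
client = genai.Client()

# Send the image bytes; a bare file name in contents would be treated as plain text
with open('image.jpg', 'rb') as f:
    image_bytes = f.read()

response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        types.Part.from_bytes(data=image_bytes, mime_type='image/jpeg'),
        'What objects are in this image?',
    ]
)

print(response.text)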

Structured Output

"Use gemini-vision to analyze receipt.jpg and return JSON with:
{
  'store': 'store name',
  'date': 'purchase date',
  'items': ['item list'],
  'total': 'total amount'
}"
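Replies to a prompt like this sometimes arrive wrapped in a Markdown code fence, so parsing them takes a little care. The sketch below is one way to handle that. `parse_json_reply` and `missing_fields` are hypothetical helpers, and the field names are the ones from the prompt above.

```python
import json
import re

FENCE = "`" * 3  # Markdown code fence marker
FENCED_REPLY = re.compile(rf"^{FENCE}(?:json)?\s*(.*?)\s*{FENCE}$", re.DOTALL)

REQUIRED_FIELDS = ("store", "date", "items", "total")


def parse_json_reply(text):
    """Parse a model reply as JSON, tolerating an optional Markdown code fence."""
    cleaned = text.strip()
    match = FENCED_REPLY.match(cleaned)
    if match:
        cleaned = match.group(1)
    return json.loads(cleaned)


def missing_fields(receipt):
    """List the required receipt fields that the parsed reply does not contain."""
    return [name for name in REQUIRED_FIELDS if name not in receipt]
```

The SDK can also be asked for JSON directly by setting `response_mime_type` to `application/json` in the request config, which should make the fence handling unnecessary. The field check is still worth keeping.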

Batch Processing

"Use gemini-vision to process folder of product images and create CSV with product details"

Token Costs

Images consume tokens based on size:

Small (both dimensions ≤384px): 258 tokens
Larger: Cropped and scaled into 768x768 tiles, 258 tokens each

Example: 960x540 image = 6 tiles x 258 = ~1,548 tokens
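The arithmetic can be sketched as a small estimator. The tiling rule here is an assumption, a rough formula in which the crop unit is the shorter side divided by 1.5. It reproduces the 960x540 example but should not be treated as exact billing logic. `estimate_image_tokens` is a hypothetical helper.

```python
import math

TOKENS_PER_TILE = 258
SMALL_IMAGE_MAX = 384


def estimate_image_tokens(width, height):
    """Rough token estimate for one image, following the rules listed above."""
    if width <= SMALL_IMAGE_MAX and height <= SMALL_IMAGE_MAX:
        return TOKENS_PER_TILE
    # Assumed tiling rule: crop unit is the shorter side divided by 1.5
    # (clamped to at least 1 so extreme aspect ratios cannot divide by zero)
    crop_unit = max(1, math.floor(min(width, height) / 1.5))
    tiles = math.ceil(width / crop_unit) * math.ceil(height / crop_unit)
    return tiles * TOKENS_PER_TILE


print(estimate_image_tokens(960, 540))  # 3 x 2 tiles x 258 = 1548
```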

Best Practices

Image Quality

Use clear, well-lit images
Ensure correct rotation
Higher resolution for text extraction
Compress large files

Prompting

"Use gemini-vision to analyze image:
- Be specific in questions
- Request structured output (JSON/CSV)
- Provide examples for accuracy
- Specify required fields"

Cost Optimization

Resize images before upload
Use File API for repeated analysis
Choose appropriate model (Flash for speed)
Batch related images
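For the resize step, the target size can be computed before touching the image data. A minimal sketch follows, with `fit_within` as a hypothetical helper. The resize itself would be done with an imaging library such as Pillow.

```python
def fit_within(width, height, max_side):
    """Scale (width, height) down so the longer side is at most max_side."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height  # never upscale
    scale = max_side / longest
    # Keep the aspect ratio; clamp to 1 px so thin images stay valid
    return max(1, round(width * scale)), max(1, round(height * scale))


print(fit_within(960, 540, 384))  # (384, 216)
```

A 384px target puts the image in the small bracket from Token Costs, at the price of detail. Text extraction usually wants a larger target.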

Advanced Features

Object Detection (2.0+)

"Use gemini-vision with gemini-2.0-flash to:
- Detect all objects
- Return bounding boxes
- Label each object
- Calculate confidence scores"
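Gemini reports each bounding box as `[ymin, xmin, ymax, xmax]` on a 0-1000 scale, so the values need rescaling before they can be drawn on the image. A minimal sketch, with `to_pixel_box` as a hypothetical helper:

```python
def to_pixel_box(box_2d, image_width, image_height):
    """Convert [ymin, xmin, ymax, xmax] on a 0-1000 scale to pixel (left, top, right, bottom)."""
    ymin, xmin, ymax, xmax = box_2d
    left = round(xmin * image_width / 1000)
    top = round(ymin * image_height / 1000)
    right = round(xmax * image_width / 1000)
    bottom = round(ymax * image_height / 1000)
    return left, top, right, bottom


print(to_pixel_box([100, 250, 500, 750], 800, 600))  # (200, 60, 600, 300)
```

The y coordinate comes first in the model's output, which is the opposite of most drawing APIs.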

Segmentation (2.5+)

"Use gemini-vision with gemini-2.5-pro to:
- Create segmentation masks
- Identify distinct objects
- Separate foreground/background"
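Segmentation replies are expected to be a JSON list where each item carries a `box_2d`, a `label`, and a `mask` holding a base64-encoded PNG behind a `data:image/png;base64,` prefix. The sketch below assumes that shape and only extracts the PNG bytes. `decode_mask_png` is a hypothetical helper, and decoding the pixels would need an imaging library.

```python
import base64

DATA_URI_PREFIX = "data:image/png;base64,"
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def decode_mask_png(item):
    """Return the raw PNG bytes for one segmentation item, or None if malformed."""
    mask = item.get("mask", "")
    if not mask.startswith(DATA_URI_PREFIX):
        return None
    png_bytes = base64.b64decode(mask[len(DATA_URI_PREFIX):])
    # Reject payloads that decode but are not PNG data
    return png_bytes if png_bytes.startswith(PNG_SIGNATURE) else None
```

The decoded mask covers only the item's bounding box, so it has to be resized to the box dimensions and thresholded before being overlaid on the full image.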

Multi-Image Analysis

"Use gemini-vision to compare these 5 product photos and identify which one shows damage"

Integration Examples

E-commerce Product Listing

"Use gemini-vision to:
1. Analyze product photo
2. Extract product attributes
3. Generate description
4. Categorize product
5. Output as JSON for database"

Quality Control

"Use gemini-vision for manufacturing QC:
- Detect defects
- Compare to reference image
- Classify defect types
- Generate inspection report"

Document Processing

"Use gemini-vision to:
1. Extract text from scanned invoice
2. Parse line items
3. Calculate totals
4. Validate against expected format"
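Steps 3 and 4 are safer done in code than left to the model. A minimal sketch, assuming each parsed line item has `quantity` and `unit_price` fields (hypothetical names; use whatever the prompt requests):

```python
from decimal import Decimal


def check_invoice_total(line_items, stated_total, tolerance="0.01"):
    """Return (computed_total, matches) for parsed invoice line items."""
    computed = sum(
        (Decimal(str(item["quantity"])) * Decimal(str(item["unit_price"]))
         for item in line_items),
        Decimal("0"),
    )
    # Compare against the total printed on the invoice, allowing for rounding
    matches = abs(computed - Decimal(str(stated_total))) <= Decimal(tolerance)
    return computed, matches
```

`Decimal` avoids the float rounding drift that would otherwise flag correct invoices as mismatched.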

Error Handling

Common errors:

401: Invalid API key
429: Rate limit exceeded
400: Invalid image format/size
403: Restricted content
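Of these, only 429 is worth retrying automatically. The sketch below retries with exponential backoff, assuming the raised exception exposes the HTTP status as a `code` attribute. The SDK's `google.genai.errors.APIError` is expected to, but verify that against the installed version. `call_with_retries` is a hypothetical helper.

```python
import random
import time

RETRYABLE_CODES = {429}


def call_with_retries(fn, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fn(), retrying with exponential backoff when the error code is retryable."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as exc:
            code = getattr(exc, "code", None)  # assumed attribute, see note above
            if code not in RETRYABLE_CODES or attempt == max_attempts:
                raise
            # 1s, 2s, 4s, ... plus a little jitter to avoid synchronized retries
            sleep(base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.1))
```

Usage would look like `call_with_retries(lambda: client.models.generate_content(...))`. The other codes in the list are re-raised immediately because retrying them cannot help.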

Quick Examples

Simple Caption:

"Use gemini-vision to caption this image"

Product Catalog:

"Use gemini-vision to analyze product images and create catalog with:
- Product name
- Description
- Key features
- Suggested price range"
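Once each image's reply has been parsed into a dict, writing the catalog is plain CSV work. A minimal sketch, with `write_catalog_csv` as a hypothetical helper and made-up rows standing in for parsed model replies:

```python
import csv
import io


def write_catalog_csv(rows, fieldnames, out):
    """Write one CSV row per analyzed image; missing fields become empty cells."""
    writer = csv.DictWriter(out, fieldnames=fieldnames, restval="", extrasaction="ignore")
    writer.writeheader()
    for row in rows:
        writer.writerow(row)


# Made-up results for illustration only
rows = [
    {"file": "a.jpg", "name": "Mug", "price_range": "5-10"},
    {"file": "b.jpg", "name": "Lamp"},
]
buffer = io.StringIO()
write_catalog_csv(rows, ["file", "name", "price_range"], buffer)
csv_text = buffer.getvalue()
```

Fixing the column list up front keeps the file rectangular even when the model omits a field for some images.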

Document Extraction:

"Use gemini-vision to extract all text and tables from multi-page PDF invoice"

Rate Limits

Free tier (limits vary by model):

15 RPM (requests per minute)
1 million TPM (tokens per minute)
1,500 RPD (requests per day)

Paid tiers scale up significantly.
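A batch job can stay under the requests-per-minute limit by spacing calls client-side. The sketch below uses the 15 RPM figure from the list above. `MinIntervalLimiter` is a hypothetical helper and does not track the TPM or RPD limits.

```python
import time


class MinIntervalLimiter:
    """Space calls at least 60 / rpm seconds apart (4 s for 15 RPM)."""

    def __init__(self, rpm=15, clock=time.monotonic, sleep=time.sleep):
        self.interval = 60.0 / rpm
        self.clock = clock
        self.sleep = sleep
        self.next_allowed = None  # no call made yet

    def wait(self):
        """Block until the next request is allowed, then reserve the slot."""
        now = self.clock()
        if self.next_allowed is not None and now < self.next_allowed:
            self.sleep(self.next_allowed - now)
            now = self.next_allowed
        self.next_allowed = now + self.interval
```

Calling `limiter.wait()` before each request in a loop is enough for a single process. Several workers sharing one key would need a shared limiter.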

Next Steps

Bottom Line: gemini-vision analyzes images with AI. Extract text, detect objects, answer visual questions - all with simple prompts.