Debugger Agent

The debugger agent systematically investigates complex issues, analyzes system behavior, collects diagnostic data, and provides comprehensive root cause analysis with actionable solutions.

Purpose

Investigate and diagnose technical issues including API errors, CI/CD failures, performance degradation, database problems, and system anomalies.

When Activated

The debugger agent activates when:

Using /debug [description] command
API endpoints returning errors
CI/CD pipelines failing
Performance issues detected
Database queries slow or failing
System behavior unexpected
Production incidents
Integration failures

Capabilities

Issue Investigation

Systematic Analysis: Structured problem-solving approach
Data Collection: Logs, metrics, traces, database state
Pattern Recognition: Identify common error patterns
Impact Assessment: Scope and severity evaluation
Timeline Analysis: When issues started, frequency

Database Analysis

Schema Inspection: Using psql \d commands
Query Analysis: EXPLAIN plans, slow query logs
Connection Monitoring: Active connections, locks
Data Validation: Integrity checks, constraint violations
Performance Metrics: Query timing, index usage

Log Analysis

Server Logs: Application error logs, access logs
CI/CD Logs: Build output, test failures, deployment errors
GitHub Actions: Workflow logs, job failures, step errors
Container Logs: Docker/Kubernetes pod logs
System Logs: OS-level errors, resource constraints

Performance Analysis

Response Times: API endpoint latency
Resource Usage: CPU, memory, disk, network
Database Performance: Query execution time, connection pool
Cache Efficiency: Hit rates, eviction patterns
Bottleneck Identification: Slowest components

Root Cause Identification

Error Tracing: Follow error stack traces
Dependency Analysis: External service failures
Configuration Issues: Environment variables, settings
Code Analysis: Bug identification, logic errors
Infrastructure Problems: Network, storage, compute

Methodology

1. Initial Assessment (5-10 min)

## Issue Reported
- **Type**: API Error / CI Failure / Performance / System
- **Severity**: Critical / High / Medium / Low
- **Environment**: Production / Staging / Development
- **First Occurred**: Timestamp
- **Frequency**: Constant / Intermittent / Once

## Initial Questions
- What changed recently?
- Can it be reproduced?
- Are there error messages?
- Is it environment-specific?
- Are other systems affected?
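
The assessment template above could be captured as a typed record so that reports stay consistent across incidents. The following is a minimal sketch in TypeScript; the `IssueReport` type and the `needsImmediateEscalation` helper are hypothetical names, and the escalation rule (critical or high severity in production) is an assumption for illustration, not documented agent behavior.

```typescript
// Hypothetical record mirroring the "Issue Reported" template fields.
type IssueType = "API Error" | "CI Failure" | "Performance" | "System";
type Severity = "Critical" | "High" | "Medium" | "Low";
type Environment = "Production" | "Staging" | "Development";
type Frequency = "Constant" | "Intermittent" | "Once";

interface IssueReport {
  type: IssueType;
  severity: Severity;
  environment: Environment;
  firstOccurred: string; // ISO 8601 timestamp
  frequency: Frequency;
}

// Assumed triage rule: escalate critical/high issues that hit production.
function needsImmediateEscalation(report: IssueReport): boolean {
  const severe = report.severity === "Critical" || report.severity === "High";
  return severe && report.environment === "Production";
}

// Values taken from the API error example in this document.
const ordersIncident: IssueReport = {
  type: "API Error",
  severity: "Critical",
  environment: "Production",
  firstOccurred: "2024-10-20T14:23:00Z",
  frequency: "Constant",
};

console.log(needsImmediateEscalation(ordersIncident));
```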

2. Data Collection (10-20 min)

# Server logs
kubectl logs -l app=api --tail=1000 --since=1h

# Database state
psql -c "SELECT * FROM pg_stat_activity WHERE state != 'idle';"

# GitHub Actions logs
gh run view 1234567890 --log-failed

# System metrics
docker stats --no-stream

3. Analysis (15-30 min)

Review collected data
Identify error patterns
Check related systems
Test hypotheses
Narrow down root cause
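
Identifying error patterns often starts with grouping log lines by message. The sketch below assumes log lines follow the `[ERROR] HH:MM:SS - message` shape seen in the example output in this document; `countErrorPatterns` is a hypothetical helper, not part of the agent, and the sample lines are made up for illustration.

```typescript
// Group error log lines by message and count occurrences, most frequent first.
// Assumes lines look like "[ERROR] 14:23:01 - POST /api/orders".
function countErrorPatterns(lines: string[]): Array<[string, number]> {
  const pattern = /^\[ERROR\]\s+\d{2}:\d{2}:\d{2}\s+-\s+(.*)$/;
  const counts = new Map<string, number>();
  for (const line of lines) {
    const match = pattern.exec(line.trim());
    if (!match) continue; // skip non-error lines and stack frames
    const message = match[1] ?? "";
    counts.set(message, (counts.get(message) ?? 0) + 1);
  }
  return Array.from(counts.entries()).sort((a, b) => b[1] - a[1]);
}

// Illustrative sample input (not real log output).
const sampleLines = [
  "[ERROR] 14:23:01 - POST /api/orders",
  "[ERROR] 14:23:15 - Database connection pool exhausted",
  "[ERROR] 14:23:16 - POST /api/orders",
  "[INFO] 14:23:17 - GET /health",
];

console.log(countErrorPatterns(sampleLines));
```

Sorting by count puts the dominant failure first, which helps separate the primary error from follow-on noise.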

4. Solution Development (10-20 min)

Propose fixes
Document workarounds
Create action plan
Estimate impact
Define validation steps

Example Usage

API Error Investigation

Input:

/debug [API endpoint /api/orders returning 500 errors]

Process:

1. Initial Assessment (2 min)
   Issue: POST /api/orders returns 500 Internal Server Error
   Severity: Critical (blocks order placement)
   Environment: Production
   Started: 2024-10-20 14:23:00 UTC
   Frequency: 100% of requests

2. Data Collection (8 min)

   Server Logs:
   $ kubectl logs -l app=api --tail=500 | grep "POST /api/orders"

   Found:
   [ERROR] 14:23:01 - POST /api/orders
   Error: Cannot read property 'userId' of undefined
     at OrderController.create (src/controllers/order.controller.ts:45)
     at Layer.handle (express/lib/router/layer.js:95)

   [ERROR] 14:23:15 - Database connection pool exhausted
   Error: Knex: Timeout acquiring a connection. Pool is exhausted

   Database State:
   $ psql -c "SELECT COUNT(*) FROM pg_stat_activity WHERE state = 'active';"
   count: 47 (max: 20)

   $ psql -c "SELECT query, state, wait_event FROM pg_stat_activity WHERE state != 'idle';"

   Found: 35 queries waiting on a tuple lock (wait_event_type = 'Lock', wait_event = 'tuple')

3. Analysis (12 min)

   Root Causes Identified:

   A. Missing Null Check
      - File: src/controllers/order.controller.ts:45
      - Issue: req.user is undefined
      - Cause: Auth middleware not setting req.user
      - Impact: All POST /api/orders requests fail

   B. Database Connection Leak
      - Connections not released after queries
      - Pool exhausted (47 active, max 20)
      - Transactions not committed/rolled back
      - Impact: Cascading failures across all DB operations

   C. Table Lock Contention
      - 35 queries waiting on orders table lock
      - Long-running transaction blocking others
      - No query timeout configured
      - Impact: Slow queries, timeouts

   Timeline:
   - 14:15:00 - Deploy of v2.3.4
   - 14:20:00 - Connection pool warnings
   - 14:23:00 - First 500 errors
   - 14:25:00 - Complete service degradation

   Changed in v2.3.4:
   - Auth middleware refactor
   - Order service database queries
   - Transaction handling changes

4. Solution Development (8 min)
   Creating fix plan...

Generated Report:

# Debugging Report: API Orders Endpoint Failure

## Executive Summary

POST /api/orders endpoint failing with 500 errors due to three interconnected issues:
1. Auth middleware not setting req.user (deployment bug)
2. Database connection pool exhaustion (resource leak)
3. Table lock contention (transaction management)

**Status**: Critical - Blocking all order placements
**Root Cause**: Code bug in v2.3.4 deployment
**Estimated Fix Time**: 30 minutes
**Recommended Action**: Rollback to v2.3.3, then fix and redeploy

---

## Root Cause Analysis

### Issue 1: Missing req.user (PRIMARY)

**Location:** `src/controllers/order.controller.ts:45`

**Error:**
```typescript
Cannot read property 'userId' of undefined
```

Code:

// src/controllers/order.controller.ts
export const createOrder = async (req: Request, res: Response) => {
  const userId = req.user.userId; // ← req.user is undefined
  // ...
}

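Root Cause: The auth middleware was refactored in v2.3.4 to set req.payload instead of req.user, but the controllers that read req.user (and the request's TypeScript interface) were not updated.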

Before (v2.3.3):

// src/auth/middleware.ts
export const authenticate = async (req, res, next) => {
  const token = extractToken(req);
  const user = await verifyToken(token);
  req.user = user; // ✓ Sets req.user
  next();
};

After (v2.3.4 - BROKEN):

// src/auth/middleware.ts
export const authenticate = async (req, res, next) => {
  const token = extractToken(req);
  const payload = await verifyToken(token); // ← Renamed to payload
  req.payload = payload; // ✗ Sets req.payload instead
  next();
};

Fix:

// Option A: Restore original behavior
export const authenticate = async (req, res, next) => {
  const token = extractToken(req);
  const user = await verifyToken(token);
  req.user = user; // ✓ Restore req.user
  next();
};

// Option B: Update all controllers
// src/controllers/order.controller.ts
const userId = req.payload.userId; // Use req.payload
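
Whichever option is chosen, the controller could also fail fast instead of crashing when the authenticated user is missing, which addresses the missing null check noted in the analysis. This is a minimal sketch with simplified stand-in types rather than the real Express `Request`; `AuthedRequest`, `getUserId`, and the mapping to a 401 status are assumptions for illustration.

```typescript
// Simplified stand-ins for the request shape; not the real Express types.
interface AuthUser {
  userId: string;
}
interface AuthedRequest {
  user?: AuthUser;
}

type UserIdResult =
  | { ok: true; userId: string }
  | { ok: false; httpStatus: 401; error: string };

// Return the user id, or a 401-style result when auth did not populate req.user.
function getUserId(req: AuthedRequest): UserIdResult {
  if (!req.user) {
    return { ok: false, httpStatus: 401, error: "Unauthenticated: req.user is not set" };
  }
  return { ok: true, userId: req.user.userId };
}

console.log(getUserId({}));
console.log(getUserId({ user: { userId: "u-1" } }));
```

A guard like this turns a middleware regression into a clear 401 rather than an unhandled exception and a 500.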

Issue 2: Connection Pool Exhaustion (SECONDARY)

Symptom:

Knex: Timeout acquiring a connection. Pool is exhausted
Active connections: 47 (max: 20)

Root Cause: Database transactions not properly closed in order service.

Problematic Code:

// src/services/order.service.ts
export const createOrder = async (orderData) => {
  const trx = await db.transaction(); // ← Transaction started

  try {
    const order = await trx('orders').insert(orderData);
    // insert() does not resolve to the order object, so the items come from the input
    const items = await trx('order_items').insert(orderData.items);

    // ✗ Missing: await trx.commit();
    return order;
  } catch (error) {
    // ✗ Missing: await trx.rollback();
    throw error;
  }
  // ✗ Transaction never closed - connection leaked
};

Fix:

export const createOrder = async (orderData) => {
  const trx = await db.transaction();

  try {
    const order = await trx('orders').insert(orderData);
    // insert() does not resolve to the order object, so the items come from the input
    const items = await trx('order_items').insert(orderData.items);

    await trx.commit(); // ✓ Commit transaction
    return order;
  } catch (error) {
    await trx.rollback(); // ✓ Rollback on error
    throw error;
  }
};
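
To make the commit/rollback pairing harder to forget, the pattern in the fix can be wrapped in a helper that owns the transaction lifecycle. The sketch below is written against a hypothetical `Tx` interface (only `commit` and `rollback`), not the Knex API itself; `withTransaction` and `FakeTx` are illustrative names.

```typescript
// Minimal transaction surface assumed for this sketch (not the Knex types).
interface Tx {
  commit(): Promise<void>;
  rollback(): Promise<void>;
}

// Run `work` inside a transaction: roll back if it throws, commit otherwise.
async function withTransaction<T extends Tx, R>(
  begin: () => Promise<T>,
  work: (trx: T) => Promise<R>,
): Promise<R> {
  const trx = await begin();
  let result: R;
  try {
    result = await work(trx);
  } catch (error) {
    await trx.rollback(); // release the connection on failure
    throw error;
  }
  await trx.commit(); // release the connection on success
  return result;
}

// In-memory fake used only to demonstrate the control flow.
class FakeTx implements Tx {
  state: "open" | "committed" | "rolled back" = "open";
  async commit(): Promise<void> {
    this.state = "committed";
  }
  async rollback(): Promise<void> {
    this.state = "rolled back";
  }
}

const demoTx = new FakeTx();
withTransaction(async () => demoTx, async () => "order-1").then((id) => {
  console.log(id, demoTx.state);
});
```

Knex's callback form of `db.transaction(async (trx) => { ... })` offers similar commit-on-resolve and rollback-on-reject behavior, so a custom helper may not be needed in practice.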

Immediate Mitigation:

# Restart API pods to reset connection pool
kubectl rollout restart deployment/api

# Increase pool size temporarily (not a fix)
export DB_POOL_MAX=50

Issue 3: Table Lock Contention (TERTIARY)

Symptom:

-- 35 queries waiting on a tuple lock
SELECT query, state, wait_event
FROM pg_stat_activity
WHERE wait_event_type = 'Lock' AND wait_event = 'tuple';

Root Cause: Long-running transaction holding locks on orders table.

Analysis:

-- Find blocking queries
SELECT
  blocked.pid AS blocked_pid,
  blocked.query AS blocked_query,
  blocking.pid AS blocking_pid,
  blocking.query AS blocking_query,
  blocking.query_start AS blocking_since
FROM pg_stat_activity blocked
JOIN pg_stat_activity blocking
  ON blocking.pid = ANY(pg_blocking_pids(blocked.pid))
WHERE blocked.wait_event_type = 'Lock';

-- Result:
blocked_pid: 12345
blocked_query: INSERT INTO orders...
blocking_pid: 11111
blocking_query: UPDATE orders SET status = 'pending' WHERE...
blocking_since: 2024-10-20 14:20:00 (5 minutes ago!)

Root Cause: Transaction started but never committed due to Issue #2.

Fix:

-- Immediate: Kill blocking query
SELECT pg_terminate_backend(11111);

-- Permanent: Add query timeout
ALTER DATABASE production SET statement_timeout = '30s';

-- Code: Ensure transactions are short
export const updateOrderStatus = async (orderId, status) => {
  // Use separate connection, not transaction for simple updates
  return db('orders')
    .where({ id: orderId })
    .update({ status })
    .timeout(5000); // 5 second timeout
};
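
Where a query builder timeout is not available, a generic application-level guard can bound how long a caller waits. The sketch below is an assumption-laden illustration, not the Knex `.timeout()` implementation; `withTimeout` is a hypothetical helper, and it only stops waiting rather than cancelling the underlying query.

```typescript
// Reject if `work` does not settle within `ms` milliseconds.
// Note: this only stops waiting; it does not cancel the underlying operation.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`Timed out after ${ms} ms`)),
      ms,
    );
    work.then(
      (value) => {
        clearTimeout(timer); // avoid a dangling timer once the work settles
        resolve(value);
      },
      (error) => {
        clearTimeout(timer);
        reject(error);
      },
    );
  });
}

// A promise that settles quickly passes straight through.
withTimeout(Promise.resolve("updated"), 5000).then((value) => console.log(value));
```

Because the database keeps running an abandoned statement, a server-side limit such as `statement_timeout` remains the stronger safeguard.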

Impact Assessment

Affected Systems

✗ Order Placement API (100% failure)
✗ Order Status Updates (95% failure)
⚠ User Authentication (degraded, slow)
⚠ Inventory Management (slow queries)
✓ Product Catalog (unaffected)
✓ User Profile (unaffected)

Business Impact

Revenue Loss: No orders processed (normally $2,000/min)
Customer Impact: Unable to place orders
Duration: 30 minutes so far
Estimated Loss: $60,000 so far ($120,000 if the outage lasts 1 hour)
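
The loss figure follows from the normal order revenue rate multiplied by the outage duration. A minimal sketch of that arithmetic, assuming revenue drops to zero for the whole outage; `estimateRevenueLoss` is a hypothetical helper name.

```typescript
// Estimate revenue lost during an outage in which no orders are processed.
function estimateRevenueLoss(revenuePerMinute: number, outageMinutes: number): number {
  if (revenuePerMinute < 0 || outageMinutes < 0) {
    throw new RangeError("Inputs must be non-negative");
  }
  return revenuePerMinute * outageMinutes;
}

const normalRevenuePerMinute = 2000; // from the report: normally $2,000/min

console.log(estimateRevenueLoss(normalRevenuePerMinute, 30)); // 60000 after 30 minutes
console.log(estimateRevenueLoss(normalRevenuePerMinute, 60)); // 120000 after 1 hour
```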

Technical Debt

No integration tests for auth middleware changes
No database connection monitoring/alerting
No query timeout enforcement
No canary deployment strategy

Recommended Actions

Immediate (0-5 min)

# 1. Rollback to v2.3.3
kubectl rollout undo deployment/api

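# 2. Kill blocking database connections (including leaked 'idle in transaction' sessions)
psql -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state IN ('active', 'idle in transaction') AND query_start < NOW() - INTERVAL '1 minute' AND pid <> pg_backend_pid();"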

# 3. Verify recovery
curl https://api.example.com/health
curl -X POST https://api.example.com/api/orders -d '{...}'

# Expected: Orders processing successfully

Short-term (30-60 min)

# 1. Fix auth middleware
# Restore req.user assignment
git checkout v2.3.3 -- src/auth/middleware.ts

# 2. Fix transaction handling
# Add commit/rollback to order service
# See Fix #2 above

# 3. Add query timeout
psql -c "ALTER DATABASE production SET statement_timeout = '30s';"

# 4. Add integration tests
npm run test:integration -- tests/auth

# 5. Redeploy v2.3.5 with fixes
npm run build
docker build -t api:v2.3.5 .
kubectl set image deployment/api api=api:v2.3.5

# 6. Monitor deployment
kubectl rollout status deployment/api
kubectl logs -f deployment/api

Long-term (Next Sprint)

Improve Testing
- Add integration tests for auth flows
- Add database transaction tests
- Add load tests for connection pool
Improve Monitoring
- Alert on connection pool usage >80%
- Alert on query duration >5s
- Alert on error rate >1%
- Dashboard for database metrics
Improve Deployment
- Implement canary deployments
- Add automatic rollback on errors
- Require integration tests to pass
Improve Code Quality
- Add lint rule: no unused function params
- Add lint rule: require explicit transaction handling
- Add pre-commit hooks for tests

Validation Steps

After applying fixes:

# 1. Health check
curl https://api.example.com/health
# Expected: { "status": "healthy", "database": "connected" }

# 2. Create order
curl -X POST https://api.example.com/api/orders \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "items": [{ "productId": "123", "quantity": 1 }],
    "shippingAddress": {...}
  }'
# Expected: 201 Created with order details

# 3. Check database connections
psql -c "SELECT COUNT(*) FROM pg_stat_activity WHERE state = 'active';"
# Expected: <20 (within pool limit)

# 4. Check for errors
kubectl logs -l app=api --tail=100 | grep ERROR
# Expected: No errors

# 5. Monitor for 10 minutes
# Expected: No errors, normal response times

Prevention Measures

Code Review Checklist

Auth middleware changes include tests
Database transactions have commit/rollback
Integration tests pass
No breaking API changes without version bump
Backward compatibility maintained

Deployment Checklist

Integration tests pass in CI
Canary deployment successful (10% traffic)
Error rates normal for 5 minutes
Gradually increase to 100%
Monitor for 15 minutes post-deployment

Monitoring Alerts

# alerts.yml
- alert: HighErrorRate
  expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.01
  annotations:
    summary: "High error rate detected"

- alert: DatabaseConnectionPoolHigh
  expr: db_connections_active / db_connections_max > 0.8
  annotations:
    summary: "Database connection pool usage high"

- alert: SlowDatabaseQueries
  expr: histogram_quantile(0.95, sum by (le) (rate(db_query_duration_seconds_bucket[5m]))) > 5
  annotations:
    summary: "Slow database queries detected"
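
The thresholds in these rules can be sanity-checked against known values before they are deployed. The sketch below evaluates the same three conditions in plain TypeScript, interpreting HighErrorRate as the ratio of 5xx responses to all requests (matching the "error rate >1%" goal listed under Improve Monitoring). `MetricsSnapshot` and `firingAlerts` are hypothetical names, and the request and latency figures in the example are assumed.

```typescript
// Hypothetical snapshot of the metrics the alert rules rely on.
interface MetricsSnapshot {
  requestsPerSecond: number;
  serverErrorsPerSecond: number; // 5xx responses
  dbConnectionsActive: number;
  dbConnectionsMax: number;
  dbQueryP95Seconds: number;
}

// Return the names of alerts whose thresholds are exceeded.
function firingAlerts(m: MetricsSnapshot): string[] {
  const firing: string[] = [];
  const errorRatio =
    m.requestsPerSecond > 0 ? m.serverErrorsPerSecond / m.requestsPerSecond : 0;
  if (errorRatio > 0.01) firing.push("HighErrorRate");
  if (m.dbConnectionsMax > 0 && m.dbConnectionsActive / m.dbConnectionsMax > 0.8) {
    firing.push("DatabaseConnectionPoolHigh");
  }
  if (m.dbQueryP95Seconds > 5) firing.push("SlowDatabaseQueries");
  return firing;
}

// Connection counts come from the incident (47 active, max 20); the rest is assumed.
console.log(
  firingAlerts({
    requestsPerSecond: 100, // assumed
    serverErrorsPerSecond: 100, // assumed: every request failing
    dbConnectionsActive: 47,
    dbConnectionsMax: 20,
    dbQueryP95Seconds: 2, // assumed
  }),
);
```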

Lessons Learned

What Went Wrong

Auth middleware refactored without updating consumers
Transaction handling code merged without review
No integration tests for auth changes
No database monitoring/alerting
No canary deployment process

What Went Right

Quick identification of root causes
Clear error messages in logs
Ability to rollback quickly
Database tools available for diagnosis

Action Items

Add integration tests for auth middleware
Add database connection monitoring
Implement canary deployments
Add query timeout enforcement
Create runbook for similar issues
Post-mortem meeting scheduled

Similar: #1234 - Connection pool exhaustion (2024-09-15)
Similar: #5678 - Auth middleware breaking change (2024-08-01)
Related: #9012 - Add integration tests (ongoing)

Appendix: Useful Commands

Database Diagnostics

-- Active connections
SELECT COUNT(*) FROM pg_stat_activity WHERE state != 'idle';

-- Long-running queries
SELECT pid, now() - query_start AS duration, query
FROM pg_stat_activity
WHERE state != 'idle' AND query_start < NOW() - INTERVAL '1 minute'
ORDER BY duration DESC;

-- Table locks
SELECT t.relname, l.locktype, l.mode, l.granted
FROM pg_locks l
JOIN pg_stat_all_tables t ON l.relation = t.relid
WHERE t.relname = 'orders';

-- Kill connection
SELECT pg_terminate_backend(12345);

Kubernetes Diagnostics

# Logs from all API pods
kubectl logs -l app=api --all-containers --tail=1000

# Logs from the previous (crashed or restarted) container instances
kubectl logs -l app=api --previous

# Describe pod for events
kubectl describe pod api-7d8c9f4b6-x9z8k

# Pod resource usage
kubectl top pod -l app=api

# Restart deployment
kubectl rollout restart deployment/api

# Rollback deployment
kubectl rollout undo deployment/api

# Deployment history
kubectl rollout history deployment/api

GitHub Actions Diagnostics

# View recent runs
gh run list --limit 20

# View specific run
gh run view 1234567890

# View failed logs
gh run view 1234567890 --log-failed

# Re-run failed jobs
gh run rerun 1234567890 --failed

Report Generated: 2024-10-20 14:51:00 UTC
Debugger: ClaudeKit Debugger Agent v1.0
Total Investigation Time: 30 minutes


## Tools Integration

### psql Commands

```bash
# Connection info
psql -c "\conninfo"

# List tables
psql -c "\dt"

# Describe table
psql -c "\d orders"

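# Table indexes (\di matches index names, so query pg_indexes by table name)
psql -c "SELECT indexname, indexdef FROM pg_indexes WHERE tablename = 'orders';"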

# Table size
psql -c "SELECT pg_size_pretty(pg_total_relation_size('orders'));"

# Slow queries
psql -c "SELECT query, mean_exec_time FROM pg_stat_statements ORDER BY mean_exec_time DESC LIMIT 10;"
```

gh Command

# View workflow runs
gh run list --workflow=test.yml --limit=10

# View run details
gh run view 1234567890

# Download run artifacts (for full logs, use: gh run view 1234567890 --log)
gh run download 1234567890

# View specific job
gh run view 1234567890 --job=12345

# Re-run workflow
gh run rerun 1234567890

docs-seeker

# Search for error patterns
docs-seeker "PostgreSQL connection pool exhausted"
docs-seeker "Knex transaction best practices"
docs-seeker "Express middleware testing"

repomix

# Generate codebase summary for analysis
repomix --output analysis.xml

# Include specific patterns
repomix --include "src/**/*.ts" --output src-only.xml

scout agents

// Parallel investigation
await Promise.all([
  scout('Find all database transaction usage', 3),
  scout('Find all auth middleware implementations', 2),
  scout('Find all error handling patterns', 2)
]);

Diagnostic Patterns

Memory Leak Detection

# Monitor memory over time
watch -n 5 'kubectl top pod -l app=api'

# Heap snapshot analysis
node --inspect app.js
# Chrome DevTools → Memory → Take Heap Snapshot

Network Issues

# Test connectivity (TCP check on the PostgreSQL port; assumes nc is available in the image)
kubectl exec -it api-pod -- nc -zv database 5432
kubectl exec -it api-pod -- ping database

# DNS resolution
kubectl exec -it api-pod -- nslookup database

# Network policies
kubectl get networkpolicies
kubectl describe networkpolicy api-policy

Performance Profiling

# Node.js profiling
node --prof app.js
node --prof-process isolate-*.log > profile.txt

# Database query profiling
psql -c "EXPLAIN ANALYZE SELECT * FROM orders WHERE user_id = 123;"

# API endpoint profiling
curl -w "@curl-format.txt" https://api.example.com/orders

Success Metrics

Effective debugging achieves:

✅ Root cause identified (not just symptoms)
✅ Solution validated in affected environment
✅ Prevention measures documented
✅ Knowledge shared with team
✅ Similar issues prevented

Workflow Integration

Incident Response

# 1. Acknowledge incident
/debug [API returning 500 errors]

# 2. Investigate (debugger agent)
# Generates diagnostic report

# 3. Implement fix
/fix:fast [apply suggested fixes]

# 4. Validate
/test [verify fix works]

# 5. Deploy
git commit && git push

# 6. Monitor
kubectl logs -f deployment/api

Post-Mortem

After major incidents:

Complete debugging report becomes post-mortem
Share lessons learned with team
Create preventive measures
Update runbooks
Improve monitoring

Next Steps

Testing - Validate fixes with tests
Code Review - Review fix quality
Fix Commands - Apply fixes systematically

Key Takeaway: The debugger agent provides systematic investigation, comprehensive root cause analysis, and actionable solutions with validation steps and prevention measures.