# 🎯 LLM Monitoring Solution - START HERE

**Complete, production-ready monitoring for LLM failures and retries**

---

## ⚡ Quick Start (30 seconds)

```bash
cd ~/Coding/mathgaps-org/backend/app/resources

# 1. Start port-forward (separate terminal, keep running)
kubectl port-forward -n observability svc/loki-gateway 3100:80 --context=dev

# 2. Check system health
./llm-monitor stats

# 3. See recent problems
./llm-monitor failures
./llm-monitor latex-failures
```

**That's it!** You now have full visibility into LLM operations.

---

## 📦 What You Got

### 🔧 Tools

1. **`llm-monitor`** (executable)
   - 10 specialized commands for LLM monitoring
   - Built on klogs, uses existing Loki infrastructure
   - NEW: `latex-failures` command for LaTeX correction tracking

2. **`test-llm-monitor.sh`** (executable)
   - Validates your setup in 5 checks
   - Run before first use

### 📖 Documentation

1. **`START-HERE.md`** ← You are here
   - Quick start and overview

2. **`LLM-MONITORING-README.md`**
   - Architecture and capabilities
   - Best practices
   - Common use cases

3. **`LLM-MONITORING-QUICKSTART.md`**
   - Detailed user guide
   - Investigation workflows
   - Troubleshooting

4. **`LLM-REAL-WORLD-PATTERNS.md`** ✨ NEW
   - Based on actual production logs
   - Real error patterns with examples
   - LaTeX correction failure deep dive
   - Retry behavior analysis

5. **`LLM-MONITORING-ENHANCEMENTS.md`**
   - Go code snippets for enhanced instrumentation
   - 6 enhancement patterns ready to implement
   - Priority-based implementation guide

6. **`QUICKSTART.txt`**
   - Quick reference card
   - All commands in one place

---

## 🎯 Common Commands

```bash
# Health check
./llm-monitor stats              # Summary statistics
./llm-monitor stats --time 24h   # Last 24 hours

# Investigate failures
./llm-monitor failures           # All LLM failures
./llm-monitor latex-failures     # LaTeX correction issues
./llm-monitor retries            # Retry attempts
./llm-monitor validation-errors  # Validation issues

# Analyze patterns
./llm-monitor by-provider        # Group by AI provider
./llm-monitor trace <trace-id>   # Full request trace

# Live monitoring
./llm-monitor live               # Real-time error stream
```

---

## 🔍 Real-World Investigation Examples

### Example 1: "Node taking too long to generate"

```bash
# Find the lessonPlanID and nodeID from the UI or logs
export NODE_ID="01K8QCAXFVSXZMTAAC7M4MBVV9"

# Check for retries
./llm-monitor retries --time 2h | grep "$NODE_ID"

# Check for LaTeX issues (common cause)
./llm-monitor latex-failures --time 2h | grep "$NODE_ID"

# Get full trace
# (Extract trace ID from above, then)
./llm-monitor trace <trace-id>
```

**What you'll find:**
- Retry attempts with backoff durations
- LaTeX correction failures
- Complete timeline of operations

---

### Example 2: "Is LaTeX correction working?"

```bash
# Check LaTeX correction failures
./llm-monitor latex-failures --time 1h

# Get statistics
./llm-monitor stats --time 1h

# Calculate LaTeX failure rate
# (Compare "corrected field" errors to total operations)
```

**Common findings:**
- Initial LaTeX errors detected
- Correction attempted via LLM
- Re-validation fails → triggers retry
- Eventually succeeds after retry
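
The failure-rate comparison mentioned in the commands above can be sketched in a few lines of Python. This is a rough illustration, not part of `llm-monitor`: it assumes the logs are available as one JSON object per line, treats every `[prompt-input-and-output]` entry as one operation, and counts a failure whenever the `error` field contains "corrected field". The function name `latex_failure_rate` is hypothetical.

```python
import json


def latex_failure_rate(log_lines):
    """Return failed LaTeX corrections divided by total LLM operations.

    Assumptions (not confirmed by the tooling):
    - each line is one JSON log entry;
    - an entry with message "[prompt-input-and-output]" is one operation;
    - an entry whose "error" mentions "corrected field" is one failure.
    """
    total_ops = 0
    failed = 0
    for line in log_lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # ignore lines that are not JSON
        if not isinstance(entry, dict):
            continue  # ignore JSON values that are not log objects
        if entry.get("message") == "[prompt-input-and-output]":
            total_ops += 1
        elif "corrected field" in str(entry.get("error", "")):
            failed += 1
    # Avoid dividing by zero when no operations were logged.
    return failed / total_ops if total_ops else 0.0
```

Because correction calls are themselves logged with the `[prompt-input-and-output]` prefix, this denominator includes them; filter on `promptType` first if you want generation calls only.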

---

### Example 3: "System health check"

```bash
# Quick health dashboard
./llm-monitor stats

# Recent problems
./llm-monitor failures --time 30m
./llm-monitor latex-failures --time 30m

# Active retries
./llm-monitor retries --time 15m
```

**Health indicators:**
- Error rate < 5% = Good
- Error rate 5-15% = Warning
- Error rate > 15% = Critical
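
As a small sketch, the thresholds above can be turned into a helper that maps raw counts to a status. The function name `health_status` and the "Unknown" result for an empty window are assumptions made for illustration; the boundary values (exactly 5% and exactly 15%) are treated as Warning, following the "5-15%" range.

```python
def health_status(errors: int, total: int) -> str:
    """Classify an error rate using the thresholds listed above.

    Integer arithmetic is used so the 5% and 15% boundaries are exact.
    """
    if total <= 0:
        return "Unknown"  # assumption: no operations means no verdict
    if errors * 100 < 5 * total:
        return "Good"      # error rate below 5%
    if errors * 100 <= 15 * total:
        return "Warning"   # error rate from 5% to 15% inclusive
    return "Critical"      # error rate above 15%
```

For example, 12 errors out of 100 operations falls in the Warning band, while 16 out of 100 is Critical.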

---

## 📐 LaTeX Correction Pattern (Most Common Issue)

### What Happens

```
1. LLM generates content with malformed LaTeX
   ↓
2. System detects errors (e.g., unmatched delimiters)
   ↓
3. Calls LaTeX correction LLM to fix it
   ↓
4. Re-validates corrected output
   ↓
5. If STILL invalid → Retry entire generation
   ↓
6. Exponential backoff between retries: 500ms, then ×5.7 each time (about 2.85s, 16s, 1.5m, ...), capped at 20m
```
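
The flow above can be read as a single attempt that either returns valid content or raises so that the caller retries. The sketch below is illustrative only: `generate` and `correct` are hypothetical stand-ins for the two LLM calls, and the brace-balance check is a toy substitute for the real LaTeX validator.

```python
def find_latex_errors(text):
    """Toy validator: report unbalanced curly braces (not the real checker)."""
    depth = 0
    for ch in text:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:
                return ["unmatched '}'"]
    return ["unmatched '{'"] if depth else []


def reconcile_once(generate, correct):
    """Run one generation attempt: generate, validate, correct, re-validate."""
    content = generate()                   # step 1: LLM generates content
    errors = find_latex_errors(content)    # step 2: detect malformed LaTeX
    if not errors:
        return content
    corrected = correct(content, errors)   # step 3: ask the correction LLM
    if find_latex_errors(corrected):       # step 4: re-validate the fix
        # step 5: still invalid, so signal the caller to retry generation
        raise ValueError("corrected field still has malformed latex")
    return corrected
```

Raising on a failed re-validation, rather than looping on `correct`, mirrors the behavior described here: the next attempt starts again from `generate`.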

### Tracking It

```bash
# See the pattern
./llm-monitor latex-failures --time 1h

# You'll see:
# - "latex error before reformat" (initial detection)
# - "latex error after reformat" (correction attempt)
# - "corrected field still has malformed latex" (correction failed)
# - "failed to reconcile node, retrying" (triggering retry)
```

### Why It Matters

- **Most common cause of retries** in lesson plan generation
- Can delay a successful generation by several minutes
- Affects user experience (loading times)

**Not yet tracked:**
- LaTeX correction success rate
- Which LaTeX patterns fail most often
- Before/after comparison of corrected content

---

## 🔄 Retry Behavior

### How Retries Work

**Configuration:**
- Initial delay: 500ms
- Multiplier: 5.7 (aggressive)
- Maximum delay: 20 minutes
- Total timeout: 1 hour

**Retry schedule:**
```
Attempt 1: +500ms    (0.5 seconds)
Attempt 2: +2.85s    (2.85 seconds)
Attempt 3: +16.2s    (about 16 seconds)
Attempt 4: +1.5m     (about 93 seconds)
Attempt 5: +8.8m     (about 528 seconds)
Attempt 6: +20m      (capped at the maximum delay)
...continues until 1 hour total elapsed
```
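
The schedule follows directly from the configuration, and a short calculation makes it easy to check. This is a sketch under stated assumptions: no jitter is applied, the time spent on generation itself is ignored, and scheduling stops once the next delay would push the cumulative wait past the total timeout. The function name `backoff_schedule` is hypothetical.

```python
def backoff_schedule(initial=0.5, multiplier=5.7,
                     max_delay=1200.0, total_timeout=3600.0):
    """Return the list of retry delays in seconds.

    Defaults mirror the configuration above: 500ms initial delay,
    multiplier 5.7, 20-minute cap, 1-hour total budget.
    """
    delays = []
    delay = initial
    elapsed = 0.0
    # Keep scheduling while the next wait still fits in the budget.
    while elapsed + delay <= total_timeout:
        delays.append(delay)
        elapsed += delay
        # Grow exponentially, but never beyond the maximum delay.
        delay = min(delay * multiplier, max_delay)
    return delays


if __name__ == "__main__":
    for attempt, seconds in enumerate(backoff_schedule(), start=1):
        print(f"Attempt {attempt}: +{seconds:.1f}s")
```

Under these assumptions the delays grow as 0.5s, 2.85s, about 16.2s, about 92.6s, and about 527.8s, after which every delay is held at the 20-minute cap until the 1-hour budget is used up.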

### What Gets Retried

**Each retry = complete restart:**
- New LLM API call
- New content generation
- New validation
- New correction attempt (if needed)

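**This is not a correction loop:** a retry regenerates the content from scratch instead of re-running correction on the previous output.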

---

## 📊 Key Insights from Real Logs

### What We Learned

1. **LaTeX corrections fail re-validation frequently**
   - Correction model doesn't always fix the issue
   - Complex nested expressions are problematic
   - Triggers expensive retries

2. **Retry backoffs can get long**
   - After 4-5 attempts, each delay is measured in minutes
   - Total time can exceed 30 minutes
   - But most succeed eventually

3. **Trace IDs are crucial**
   - Connect all operations in a request
   - Include LLM calls, retries, errors
   - Essential for investigation

4. **Error metadata is rich**
   - Every LLM call has structured metadata
   - Includes prompts, responses, errors
   - Searchable via `[prompt-input-and-output]` prefix

5. **Multiple providers in use**
   - OpenAI, Claude, Gemini, Groq
   - The model `openai/gpt-oss-120b` appears frequently
   - Provider performance varies

---

## 🎓 Understanding the Logs

### Key Log Patterns

```json
// SUCCESS: Normal LLM operation
{
  "message": "[prompt-input-and-output]",
  "severity": "DEBUG",
  "model": "openai/gpt-oss-120b",
  "nodeType": "*lessonplan.ActivityNode",
  "output": "{...}"
}

// FAILURE: LaTeX correction failed
{
  "message": "failed to reconcile node, retrying",
  "severity": "WARNING",
  "attempt": 1,
  "backoff": 500,
  "error": "corrected field still has malformed latex"
}

// CORRECTION: LaTeX fix attempt
{
  "message": "[prompt-input-and-output]",
  "promptType": "lesson_plan_latex_correction",
  "invalidLatexData": [{...}],
  "output": "{\"correctedFields\": [...]}"
}
```

### Important Fields

- `traceID`: Connect all logs in a request
- `promptLogID`: Unique ID for LLM call
- `lessonPlanID`: What's being generated
- `nodeID`: Specific node within plan
- `attempt`: Which retry (1 = first failure)
- `backoff`: Delay until next retry (ms)
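
To show how these fields fit together, here is a minimal sketch that labels each entry by the patterns above and groups the labels by `traceID`. It assumes the entries have already been parsed into Python dictionaries; the label names and the functions `classify_entry` and `group_by_trace` are made up for this example.

```python
from collections import defaultdict


def classify_entry(entry):
    """Label one parsed log entry using the key log patterns."""
    message = entry.get("message", "")
    if message == "failed to reconcile node, retrying":
        return "retry"
    if message == "[prompt-input-and-output]":
        # Correction calls share the prefix but carry a distinct promptType.
        if entry.get("promptType") == "lesson_plan_latex_correction":
            return "latex-correction"
        return "llm-call"
    return "other"


def group_by_trace(entries):
    """Map each traceID to the ordered list of labels seen for it."""
    traces = defaultdict(list)
    for entry in entries:
        # Entries without a traceID are collected under "unknown".
        traces[entry.get("traceID", "unknown")].append(classify_entry(entry))
    return dict(traces)
```

A trace that reads `llm-call`, `latex-correction`, `retry` is the LaTeX correction pattern in miniature: generation, a fix attempt, then a failed re-validation.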

---

## 💡 Pro Tips

### 1. Use Trace IDs

Every investigation should start with a trace:
```bash
./llm-monitor trace <trace-id>
```

This shows the complete timeline of every operation in the request.

### 2. Monitor Patterns, Not Errors

One error = normal. Pattern = problem.

Watch for:
- Sustained high error rate
- Same node retrying repeatedly
- Increasing backoff durations
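
One way to spot the "same node retrying repeatedly" pattern is to count retry entries per `nodeID`. The sketch below assumes retry log entries have been parsed into dictionaries and carry a `nodeID` field; the function name `repeated_retry_nodes` and the default threshold of 3 are arbitrary choices for illustration.

```python
from collections import Counter

RETRY_MESSAGE = "failed to reconcile node, retrying"


def repeated_retry_nodes(entries, threshold=3):
    """Return {nodeID: retry_count} for nodes with at least `threshold` retries."""
    counts = Counter(
        entry["nodeID"]
        for entry in entries
        # Only retry warnings that identify their node are counted.
        if entry.get("message") == RETRY_MESSAGE and "nodeID" in entry
    )
    return {node: n for node, n in counts.items() if n >= threshold}
```

A node that shows up here across several time windows is a better signal than any single error.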

### 3. Time-Based Analysis

```bash
# Compare time periods
./llm-monitor stats --time 5m
./llm-monitor stats --time 1h
./llm-monitor stats --time 24h
```

### 4. Combine Tools

```bash
# llm-monitor for structured queries
./llm-monitor latex-failures

# klogs for custom searches
klogs search --message "lessonPlanID.*lp_..." --time 2h

# grep for post-processing
./llm-monitor retries | grep -A 3 "backoff.*8"
```

### 5. Create Aliases

Add to `~/.zshrc`:
```bash
alias llm='cd ~/Coding/mathgaps-org/backend/app/resources && ./llm-monitor'
alias llm-health='llm stats && llm failures --time 30m'
alias llm-latex='llm latex-failures'
```

---

## 🚨 When to Investigate

### Normal (No Action)

- Error rate < 5%
- Occasional retries (< 10% of operations)
- LaTeX corrections mostly succeeding

### Warning (Monitor)

- Error rate 5-15%
- Frequent retries (10-20% of operations)
- LaTeX corrections failing often
- Backoff durations > 5 minutes

### Critical (Investigate)

- Error rate > 15%
- Most operations retrying
- Retries exceeding 30 minutes
- Same errors repeating continuously

**Check:**
```bash
./llm-monitor stats --time 1h
./llm-monitor failures --time 30m
./llm-monitor latex-failures --time 30m
```

---

## 📈 Next Steps

### Immediate (Today)

1. ✅ Test the setup: `./test-llm-monitor.sh`
2. ✅ Run health check: `./llm-monitor stats`
3. ✅ Read `LLM-REAL-WORLD-PATTERNS.md` for real examples

### Short Term (This Week)

4. Create shell aliases for convenience
5. Monitor during deployments
6. Share with team

### Long Term (This Month)

7. Implement Phase 1 enhancements (see `LLM-MONITORING-ENHANCEMENTS.md`)
8. Create Grafana dashboards
9. Set up automated alerts

---

## 📞 Need Help?

### Quick Checks

```bash
# Test setup
./test-llm-monitor.sh

# Make sure the port-forward is running (start it again if it has stopped)
kubectl port-forward -n observability svc/loki-gateway 3100:80 --context=dev

# Check klogs works
klogs search --time 5m --limit 5
```

### Documentation

- **Setup issues** → `LLM-MONITORING-QUICKSTART.md` (Troubleshooting section)
- **Understanding logs** → `LLM-REAL-WORLD-PATTERNS.md`
- **Code changes** → `LLM-MONITORING-ENHANCEMENTS.md`
- **Architecture** → `LLM-MONITORING-README.md`

---

## ✅ Success Criteria

You'll know this is working when you can:

- ✅ Check LLM system health in 10 seconds
- ✅ Investigate any failure with trace ID
- ✅ Track LaTeX correction patterns
- ✅ Monitor retry behavior
- ✅ Correlate errors across operations
- ✅ Make data-driven decisions about LLM operations

---

## 🎉 What Makes This Special

### Built on Real Data

- ✅ Analyzed actual production logs
- ✅ Identified real error patterns
- ✅ Validated against live system
- ✅ Based on codebase deep dive

### Zero Infrastructure Changes

- ✅ Uses existing Loki/Grafana
- ✅ No new services needed
- ✅ No code changes required
- ✅ Works immediately

### Production Ready

- ✅ Real-world patterns documented
- ✅ LaTeX correction tracking
- ✅ Retry behavior analysis
- ✅ Investigation workflows
- ✅ Screen reader friendly

---

## 🚀 Get Started Now

```bash
cd ~/Coding/mathgaps-org/backend/app/resources

# Test setup
./test-llm-monitor.sh

# Check health
./llm-monitor stats

# Investigate
./llm-monitor failures
./llm-monitor latex-failures
```

**Happy monitoring!** 🎯
