# LLM Monitoring: Real-World Patterns & Investigation Guide

🎯 **Based on actual production logs and codebase analysis**

## 📋 Table of Contents

1. [Common Error Patterns](#common-error-patterns)
2. [LaTeX Correction Failures](#latex-correction-failures)
3. [Retry Behavior Deep Dive](#retry-behavior-deep-dive)
4. [Investigation Workflows](#investigation-workflows)
5. [Real Log Examples](#real-log-examples)

---

## 🔴 Common Error Patterns

### Pattern 1: LaTeX Correction Failure Loop

**What you'll see:**
```
WARNING: failed to reconcile node, retrying
  lessonPlanID: lp_01K8QCAV2FEZT5NBDB12HVNMF7
  nodeID: 01K8QCAXFVSXZMTAAC7M4MBVV9
  attempt: 1
  backoff: 500ms
  error: corrected field still has malformed latex
```

**What this means:**
1. LLM generated content with malformed LaTeX
2. System tried to auto-correct via LaTeX correction LLM
3. Corrected output STILL had malformed LaTeX
4. Node generation failed → will retry entire process

**Investigation steps:**
```bash
# 1. Find all LaTeX failures
./llm-monitor latex-failures --time 1h

# 2. Get the trace ID from the log
./llm-monitor trace <trace-id>

# 3. Check if it eventually succeeded
./llm-monitor search --message "lessonPlanID.*lp_01K8QCAV2FEZT5NBDB12HVNMF7" --time 2h
```

**Common causes:**
- Complex mathematical expressions with nested structures
- Mixed LaTeX/text content confusing the correction model
- Edge cases in KaTeX delimiter handling

---

### Pattern 2: Unmarshal Errors

**What you'll see:**
```
ERROR: failed to unmarshal LLM output
  nodeID: 01K8QCAXFVSXZMTAAC81EYWVEZ
  rawJSON: {"easy":{"picked_skill":"...
  error: invalid character '\n' in string literal
```

**What this means:**
- LLM returned JSON that Go's unmarshaler can't parse
- Often due to unescaped newlines, quotes, or invalid UTF-8
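
The error text in the log above matches what Go's standard `encoding/json` package reports for a raw newline inside a string value. The sketch below reproduces it in isolation; `decode` and the sample payloads are illustrative and are not taken from the service code, which may use a different decoder or target type.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// decode unmarshals raw LLM output into a generic map and returns any
// parse error. It is a stand-in for the service's real unmarshal step.
func decode(raw string) (map[string]any, error) {
	var out map[string]any
	err := json.Unmarshal([]byte(raw), &out)
	return out, err
}

func main() {
	// A raw (unescaped) newline inside a string value is not valid JSON.
	bad := "{\"picked_skill\":\"line one\nline two\"}"
	_, err := decode(bad)
	fmt.Println("raw newline:", err)

	// The same content with the newline escaped as \n parses cleanly.
	good := "{\"picked_skill\":\"line one\\nline two\"}"
	out, err := decode(good)
	fmt.Printf("escaped newline: %q, err=%v\n", out["picked_skill"], err)
}
```

When a trace shows this error, checking `rawJSON` for literal line breaks inside quoted values is usually a quick first step.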

**Investigation:**
```bash
# Find unmarshal errors
./llm-monitor validation-errors --time 1h | grep -i unmarshal

# Get full context with trace
./llm-monitor trace <trace-id>
```

---

### Pattern 3: GraphQL/gRPC Errors

**What you'll see:**
```
ERROR: GraphQL response error
  error: failed to get template user for year from grpc
  rpc error: code = Unknown desc = cannot unmarshal records
```

**What this means:**
- Not an LLM error - database/service communication issue
- Often network, database query, or data format problems

**Not tracked by llm-monitor** - use regular error monitoring

---

## 📐 LaTeX Correction Failures (Deep Dive)

### How LaTeX Correction Works

**Flow:**
```
1. LLM generates content
   ↓
2. Validate all LaTeX fields (parallel, 5 workers)
   ↓
3. If errors found → Call LaTeX correction LLM
   ↓
4. Re-validate ALL corrected fields
   ↓
5. If ANY field still invalid → FAIL (trigger retry)
   ↓
6. Else → merge corrections back → SUCCESS
```
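
The flow above can be sketched in Go roughly as follows. This is a minimal illustration under stated assumptions, not the actual implementation: `unbalancedDollar` is a toy validator, `appendDollar` is a toy stand-in for the correction LLM call, and all function names are hypothetical. Only the worker count (5) and the final error message come from this guide.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"sync"
)

// errStillMalformed reuses the error text that appears in the logs.
var errStillMalformed = errors.New("corrected field still has malformed latex")

// unbalancedDollar is a toy validator (assumption): a field is invalid when
// it contains an odd number of `$` delimiters.
func unbalancedDollar(field string) error {
	if strings.Count(field, "$")%2 != 0 {
		return errors.New("unbalanced $ delimiter")
	}
	return nil
}

// invalidIndexes validates fields with a fixed pool of workers and returns
// the indexes of the fields that failed.
func invalidIndexes(fields []string, validate func(string) error, workers int) []int {
	failed := make([]bool, len(fields))
	jobs := make(chan int)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range jobs {
				failed[i] = validate(fields[i]) != nil // one writer per index
			}
		}()
	}
	for i := range fields {
		jobs <- i
	}
	close(jobs)
	wg.Wait()

	var idx []int
	for i, bad := range failed {
		if bad {
			idx = append(idx, i)
		}
	}
	return idx
}

// validateAndCorrect follows steps 2-6 of the flow. correct stands in for the
// LaTeX correction LLM call and must return one field per input field.
func validateAndCorrect(fields []string, validate func(string) error,
	correct func([]string) ([]string, error)) ([]string, error) {
	idx := invalidIndexes(fields, validate, 5)
	if len(idx) == 0 {
		return fields, nil // nothing to correct
	}
	broken := make([]string, len(idx))
	for k, i := range idx {
		broken[k] = fields[i]
	}
	fixed, err := correct(broken)
	if err != nil {
		return nil, err
	}
	if len(fixed) != len(broken) {
		return nil, errors.New("correction returned an unexpected number of fields")
	}
	if len(invalidIndexes(fixed, validate, 5)) > 0 {
		return nil, errStillMalformed // caller retries the whole node
	}
	merged := append([]string(nil), fields...)
	for k, i := range idx {
		merged[i] = fixed[k]
	}
	return merged, nil
}

// appendDollar is a toy corrector (assumption) that closes the delimiter.
func appendDollar(in []string) ([]string, error) {
	out := make([]string, len(in))
	for i, f := range in {
		out[i] = f + "$"
	}
	return out, nil
}

func main() {
	fields := []string{"Area is $x^2$", "Solve $2x = 4"}
	out, err := validateAndCorrect(fields, unbalancedDollar, appendDollar)
	fmt.Println(out, err)
}
```

The point to notice is step 5: a single field that is still invalid after correction discards the whole result, which is why one stubborn expression can push a node into the retry loop.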

### Retry Behavior

**Exponential backoff:**
- Attempt 1: 500ms delay
- Attempt 2: ~2.85s delay (500ms × 5.7)
- Attempt 3: ~16.2s delay
- Attempt 4: ~1.5min delay
- Attempt 5: ~8.8min delay
- Attempt 6 onward: 20min delay (capped by `MaxInterval`)
- Maximum: 1 hour total elapsed time
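
To sanity-check these delays, the nominal schedule can be computed from the configuration values quoted in this guide (500ms initial interval, 5.7 multiplier, 20min cap, 1 hour total). The sketch below is an approximation under explicit assumptions: it ignores any randomization (jitter) the real retry library may apply and the time spent inside each attempt. `backoffConfig`, `guideConfig`, and `nominalSchedule` are hypothetical names.

```go
package main

import (
	"fmt"
	"time"
)

// backoffConfig mirrors the retry settings quoted in this guide.
type backoffConfig struct {
	InitialInterval time.Duration
	Multiplier      float64
	MaxInterval     time.Duration
	MaxElapsedTime  time.Duration
}

// guideConfig holds the values listed under "Repository-Level Retries".
var guideConfig = backoffConfig{
	InitialInterval: 500 * time.Millisecond,
	Multiplier:      5.7,
	MaxInterval:     20 * time.Minute,
	MaxElapsedTime:  time.Hour,
}

// nominalSchedule returns the delay before each retry, ignoring jitter and
// the time spent inside each attempt (both are simplifying assumptions).
func nominalSchedule(c backoffConfig) []time.Duration {
	var delays []time.Duration
	var elapsed time.Duration
	d := c.InitialInterval
	for elapsed+d <= c.MaxElapsedTime {
		delays = append(delays, d)
		elapsed += d
		// Grow the delay, round to the millisecond, then apply the cap.
		d = time.Duration(float64(d) * c.Multiplier).Round(time.Millisecond)
		if d > c.MaxInterval {
			d = c.MaxInterval
		}
	}
	return delays
}

func main() {
	for i, d := range nominalSchedule(guideConfig) {
		fmt.Printf("attempt %d: wait %v\n", i+1, d)
	}
}
```

Under these assumptions only a handful of retries fit inside the one-hour budget, because every delay from the sixth onward is held at `MaxInterval`.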

**What gets retried:**
- **Entire node generation** - not just LaTeX correction
- New LLM call → new content → new validation
- Each retry is a fresh attempt from scratch

### Observability Gaps (Current State)

**What you CAN track:**
- ✅ That LaTeX error occurred (`latexError` in metadata)
- ✅ Retry attempts with backoff durations
- ✅ Final success or failure

**What you CAN'T easily track:**
- ❌ LaTeX correction success rate
- ❌ Comparison of errors before vs after correction
- ❌ Which specific LaTeX patterns cause failures
- ❌ Which models are better at LaTeX correction
- ❌ How many attempts until success

---

## 🔄 Retry Behavior Deep Dive

### Repository-Level Retries

**Location:** `repository_impl.go:RetryNodeReconciler`

**Configuration:**
```go
InitialInterval: 500 * time.Millisecond,
Multiplier:      5.7,              // aggressive growth
MaxInterval:     20 * time.Minute, // cap on any single delay
MaxElapsedTime:  1 * time.Hour,    // total timeout across all retries
```

**Logging on each attempt:**
```go
log.Warn("failed to reconcile node, retrying",
    "lessonPlanID", lpID,
    "nodeID", nodeID,
    "attempt", attemptNumber,
    "backoff", duration,
    "error", err)
```

### What Triggers Retries

**Any of these will trigger retry:**
1. LaTeX validation failure
2. JSON unmarshal error
3. Empty LLM response
4. Network/API errors
5. Validation logic errors

**NOT all retries are LLM-related!**

### Tracking Retry Patterns

```bash
# See all retry attempts
./llm-monitor retries --time 2h

# Correlation: Are retries related to LaTeX?
./llm-monitor latex-failures --time 2h
./llm-monitor retries --time 2h

# Check if node eventually succeeded
# (Look for same nodeID without "retrying" afterwards)
klogs search --message "nodeID.*<your-node-id>" --time 3h
```

---

## 🔍 Investigation Workflows

### Workflow 1: "Why is this lesson plan taking so long?"

**Steps:**
```bash
# 1. Find all operations for this lesson plan
export LP_ID="lp_01K8QCAV2FEZT5NBDB12HVNMF7"
klogs search --message "$LP_ID" --time 2h

# 2. Check for retry patterns
./llm-monitor retries --time 2h | grep "$LP_ID"

# 3. Check for LaTeX issues
./llm-monitor latex-failures --time 2h | grep "$LP_ID"

# 4. Get statistics
./llm-monitor stats --time 2h

# 5. Check specific trace
./llm-monitor trace <trace-id-from-logs>
```

**What to look for:**
- Multiple retry attempts (indicates errors)
- LaTeX correction failures (common cause)
- Long backoff durations (a backoff of 8+ minutes means the node has already failed about five times)

---

### Workflow 2: "Is the system healthy right now?"

**Quick health check:**
```bash
# Overall stats
./llm-monitor stats --time 1h

# Recent failures
./llm-monitor failures --time 30m

# Active retries
./llm-monitor retries --time 15m

# LaTeX issues
./llm-monitor latex-failures --time 30m
```

**Health indicators:**
- **Good**: < 5% error rate, few retries
- **Warning**: 5-15% error rate, moderate retries
- **Critical**: > 15% error rate, many retries, long backoffs
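
If these bands are applied in a script or dashboard, the boundaries need a decision the list leaves open. A minimal sketch, assuming that exactly 5% and exactly 15% both count as Warning (`classify` is a hypothetical helper, and the retry-volume criteria are not modeled):

```go
package main

import "fmt"

// classify maps an error rate (in percent) to the bands listed above.
// Assumption: the boundary values 5 and 15 fall into "Warning".
func classify(errorRatePct float64) string {
	switch {
	case errorRatePct < 5:
		return "Good"
	case errorRatePct <= 15:
		return "Warning"
	default:
		return "Critical"
	}
}

func main() {
	for _, rate := range []float64{1.2, 5, 12.5, 15, 22} {
		fmt.Printf("%5.1f%% -> %s\n", rate, classify(rate))
	}
}
```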

---

### Workflow 3: "What's failing right now?"

**Live monitoring:**
```bash
# Terminal 1: Watch for errors
./llm-monitor live

# Terminal 2: Watch for retries
klogs tail --severity WARN --message "retrying"

# Terminal 3: Health dashboard
watch -n 60 './llm-monitor stats --time 1h'
```

---

### Workflow 4: "Why did this specific node fail?"

**Given a nodeID:**
```bash
export NODE_ID="01K8QCAXFVSXZMTAAC7M4MBVV9"

# 1. Find all logs for this node
klogs search --message "$NODE_ID" --time 4h

# 2. Extract trace ID from output
export TRACE_ID="bc230ad0acae5a94d08d8513e16830b2"

# 3. View full trace
./llm-monitor trace $TRACE_ID

# 4. Check for specific error patterns
klogs search --message "$NODE_ID" --time 4h | grep -i "error\|fail\|latex"
```

---

## 📊 Real Log Examples

### Example 1: Successful LLM Operation

```json
{
  "severity": "DEBUG",
  "time": "2025-10-29T07:01:31Z",
  "message": "[prompt-input-and-output]",
  "traceID": "bc230ad0acae5a94d08d8513e16830b2",
  "userID": "edKE1BIYC7O1u8cHonCSw3jgMl23",
  "NodeID": "01K8QCAXFVSXZMTAAC7R790AVK",
  "promptLogID": "01K8QCB1FMNV9F7CPN8AX7KWYW",
  "model": "openai/gpt-oss-120b",
  "nodeType": "*lessonplan.ActivityNode",
  "provider": "groq",
  "output": "{...JSON output...}"
}
```

**What to extract:**
- `traceID`: For full trace correlation
- `promptLogID`: Unique ID for this LLM call
- `model`: Which model was used
- `nodeType`: What was being generated

---

### Example 2: LaTeX Correction Failure

```json
{
  "severity": "WARNING",
  "time": "2025-10-29T07:01:32Z",
  "message": "failed to reconcile node, retrying",
  "traceID": "bc230ad0acae5a94d08d8513e16830b2",
  "lessonPlanID": "lp_01K8QCAV2FEZT5NBDB12HVNMF7",
  "nodeID": "01K8QCAXFVSXZMTAAC7M4MBVV9",
  "attempt": 1,
  "error": {
    "domain": "lessonPlanNodeReconciler.generateFromLLM",
    "messages": ["corrected field still has malformed latex"],
    "context": {
      "correctedField": "Park rangers estimate the water..."
    }
  },
  "backoff": 500
}
```

**What to extract:**
- `attempt`: Which retry this is (1 = first failure)
- `backoff`: How long until next retry (milliseconds)
- `error.messages`: Root cause
- `error.context.correctedField`: The problematic content
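
For scripted analysis, the fields above can be decoded into a small Go struct. This is a sketch based only on the JSON shape shown in this example; `retryEntry`, `parseRetryEntry`, and `sampleLine` are hypothetical names, and the real log schema may carry more fields or different types.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// retryEntry models only the fields this guide calls out; other keys in the
// log line are ignored by encoding/json.
type retryEntry struct {
	TraceID      string `json:"traceID"`
	LessonPlanID string `json:"lessonPlanID"`
	NodeID       string `json:"nodeID"`
	Attempt      int    `json:"attempt"`
	BackoffMS    int64  `json:"backoff"` // milliseconds, per the example
	Error        struct {
		Domain   string   `json:"domain"`
		Messages []string `json:"messages"`
		Context  struct {
			CorrectedField string `json:"correctedField"`
		} `json:"context"`
	} `json:"error"`
}

// Backoff converts the raw millisecond value into a time.Duration.
func (e retryEntry) Backoff() time.Duration {
	return time.Duration(e.BackoffMS) * time.Millisecond
}

// parseRetryEntry decodes one JSON log line.
func parseRetryEntry(line []byte) (retryEntry, error) {
	var e retryEntry
	err := json.Unmarshal(line, &e)
	return e, err
}

// sampleLine is a trimmed copy of the entry shown above.
const sampleLine = `{"severity":"WARNING","message":"failed to reconcile node, retrying",
"nodeID":"01K8QCAXFVSXZMTAAC7M4MBVV9","attempt":1,
"error":{"messages":["corrected field still has malformed latex"]},"backoff":500}`

func main() {
	e, err := parseRetryEntry([]byte(sampleLine))
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Printf("node=%s attempt=%d next retry in %v: %v\n",
		e.NodeID, e.Attempt, e.Backoff(), e.Error.Messages)
}
```

Decoding into typed fields makes it straightforward to group entries by `nodeID` and see how far each node progressed through the retry schedule.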

---

### Example 3: LaTeX Correction Metadata

```json
{
  "severity": "DEBUG",
  "message": "[prompt-input-and-output]",
  "promptType": "lesson_plan_latex_correction",
  "invalidLatexData": [
    {
      "field": "$2\\times3\\times4$\\n$= 2\\times3$...",
      "errorDetail": [
        "Undelimited KaTeX found in word `\\n`..."
      ]
    }
  ],
  "promptLogID": "01K8QCB61YB11V9RR6Z934MRX8",
  "model": "openai/gpt-oss-120b",
  "rawOutput": "{\"correctedFields\": [...]}"
}
```

**What to extract:**
- `promptType`: "lesson_plan_latex_correction" = correction attempt
- `invalidLatexData`: What errors were found
- Link this to the parent operation via `lessonPlanMetadataID` (not shown in this excerpt)

---

## 🎯 Key Metrics to Track

### Error Rates

```bash
# Calculate error rate
./llm-monitor stats --time 1h

# Compare across time periods
./llm-monitor stats --time 1h > now.txt
./llm-monitor stats --time 24h > day.txt
diff now.txt day.txt
```

### Retry Patterns

```bash
# Count retry attempts
./llm-monitor retries --time 1h | grep -c "attempt"

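# Average backoff duration
# (assumes each matching line ends with a millisecond value such as "500ms")
./llm-monitor retries --time 1h | grep "backoff" | \
  awk '{print $NF}' | \
  sed 's/ms$//' | \
  awk '{sum+=$1; count++} END {if (count) print sum/count " ms"; else print "no retries"}'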
```

### LaTeX Issues

```bash
# LaTeX failure rate (skips the division when there were no operations)
latex_failures=$(./llm-monitor latex-failures --time 1h | grep -c "corrected field")
total_ops=$(./llm-monitor stats --time 1h | grep "Total LLM Operations" | awk '{print $NF}')
if [ "${total_ops:-0}" -gt 0 ]; then
  echo "LaTeX failure rate: $(echo "scale=2; $latex_failures * 100 / $total_ops" | bc)%"
else
  echo "No LLM operations in this window"
fi
```

---

## 💡 Tips for Effective Monitoring

### 1. Use Trace IDs

Trace IDs connect everything:
```bash
# From any log, extract traceID, then:
./llm-monitor trace <trace-id>
```

This shows:
- All LLM operations in the request
- All retries
- All errors
- Full timeline

### 2. Monitor Patterns, Not Individual Errors

Don't alert on single errors. Watch for:
- Sustained high error rate (>10% for 10+ minutes)
- Increasing retry attempts
- Same error pattern repeating
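
One way to express "sustained" rather than "single spike" is to require every reading in a trailing window to exceed the threshold. The sketch below assumes error-rate readings are collected periodically (for example from repeated `stats` runs) and sorted oldest first; `sample`, `sustainedAbove`, `everyMinute`, and `tenMinutes` are hypothetical names.

```go
package main

import (
	"fmt"
	"time"
)

// sample is one error-rate reading taken at a point in time.
type sample struct {
	At           time.Time
	ErrorRatePct float64
}

const tenMinutes = 10 * time.Minute

// sustainedAbove reports whether every sample in the trailing window is above
// thresholdPct. It returns false when the history is shorter than the window.
// Samples must be sorted oldest first.
func sustainedAbove(samples []sample, thresholdPct float64, window time.Duration) bool {
	if len(samples) == 0 {
		return false
	}
	latest := samples[len(samples)-1].At
	for i := len(samples) - 1; i >= 0; i-- {
		if samples[i].ErrorRatePct <= thresholdPct {
			return false // the streak is broken inside the window
		}
		if latest.Sub(samples[i].At) >= window {
			return true // the streak covers the whole window
		}
	}
	return false // not enough history yet
}

// everyMinute builds demo samples spaced one minute apart.
func everyMinute(rates ...float64) []sample {
	base := time.Date(2025, 10, 29, 7, 0, 0, 0, time.UTC)
	out := make([]sample, len(rates))
	for i, r := range rates {
		out[i] = sample{At: base.Add(time.Duration(i) * time.Minute), ErrorRatePct: r}
	}
	return out
}

func main() {
	spike := everyMinute(2, 3, 40, 2, 3)
	outage := everyMinute(12, 14, 13, 15, 12, 16, 18, 14, 13, 12, 15)
	fmt.Println("single spike alerts:", sustainedAbove(spike, 10, tenMinutes))
	fmt.Println("sustained outage alerts:", sustainedAbove(outage, 10, tenMinutes))
}
```

With the demo data, the isolated 40% spike does not alert, while eleven consecutive readings above 10% (spanning ten minutes) do.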

### 3. Correlate Multiple Views

```bash
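# Are retries due to LaTeX? (counts output lines, so treat the result as approximate)
retries=$(./llm-monitor retries --time 1h | wc -l)
latex=$(./llm-monitor latex-failures --time 1h | wc -l)
if [ "$retries" -gt 0 ]; then
  echo "LaTeX accounts for $(echo "scale=0; $latex * 100 / $retries" | bc)% of retries"
else
  echo "No retries in this window"
fi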
```

### 4. Time-Based Analysis

```bash
# Compare different time windows
for period in 5m 30m 1h 4h; do
  echo "=== Last $period ==="
  ./llm-monitor stats --time $period | grep "Total\|Error"
done
```

### 5. Provider/Model Comparison

```bash
# Which provider is more reliable?
./llm-monitor by-provider --time 24h > providers.txt
grep -A 5 "openai\|claude\|groq" providers.txt
```

---

## 🚨 Alert Recommendations

### Critical Alerts

**Trigger when:**
- Error rate > 20% for 10+ minutes
- Any node retrying for > 30 minutes
- LaTeX correction failing > 50% of attempts

```bash
# Example check script
error_rate=$(./llm-monitor stats --time 10m | grep "Error Rate" | awk '{print $NF}' | sed 's/%//')
# awk handles decimal rates such as 12.5, which `[ -gt ]` cannot compare
if awk -v rate="$error_rate" 'BEGIN {exit !(rate > 20)}'; then
  piper-say "Critical: LLM error rate above 20 percent"
fi
```

### Warning Alerts

**Trigger when:**
- Error rate > 10% for 30+ minutes
- Retry backoffs exceeding 5 minutes
- Sustained LaTeX correction issues

---

## 📚 Reference

### Log Prefixes

- `[prompt-input-and-output]` - LLM operation metadata
- `failed to reconcile node, retrying` - Retry attempt
- `latex error before reformat` - Pre-correction LaTeX errors
- `latex error after reformat` - Post-correction LaTeX errors
- `corrected field still has malformed latex` - Correction failed

### Important Fields

- `traceID` - Correlate all logs in a request
- `promptLogID` - Unique ID for an LLM call
- `lessonPlanID` - Entity being generated
- `nodeID` - Specific node within lesson plan
- `attempt` - Retry attempt number
- `backoff` - Delay until next retry (ms)
- `promptType` - Type of LLM operation

### Severity Levels

- `DEBUG` - LLM metadata, normal operations
- `INFO` - General information
- `WARNING` - Retries, non-critical issues
- `ERROR` - Operation failures
- `DPANIC` - Critical failures (dev mode panic)

---

## 🎓 Understanding the System

### Key Insights from Codebase

1. **No fixed retry count** - Time-based (1 hour max)
2. **Exponential backoff** - Each delay is 5.7× the previous one, starting at 500ms and capped at 20min
3. **LaTeX correction** - One attempt per generation
4. **Fresh retries** - Each retry = completely new LLM call
5. **Parallel validation** - 5 workers validate fields simultaneously

### Why LaTeX Corrections Fail

Common reasons:
1. **Nested complexity** - Deep nesting of brackets/braces
2. **Mixed content** - LaTeX + plain text confusion
3. **Escape sequences** - Newlines, special chars in JSON
4. **Model limitations** - Correction model not sophisticated enough
5. **Edge cases** - Unusual KaTeX syntax

### System Design

- **Defensive** - Validate everything, retry automatically
- **User-friendly** - Auto-correction attempts before failing
- **Observable** - Rich logging at all stages
- **Resilient** - Can recover from transient issues

---

## ✅ Quick Reference Card

```bash
# Daily health check
./llm-monitor stats

# Investigate issues
./llm-monitor failures
./llm-monitor retries
./llm-monitor latex-failures

# Track specific problem
./llm-monitor trace <trace-id>

# Live monitoring
./llm-monitor live
```

**Remember:** Most retries eventually succeed. Focus on patterns, not single failures.
