# klogs - Log Monitoring Tools

Production-ready scripts for monitoring LLM operations and failures in the resources app.

## Quick Start

```bash
# 1. Start port-forward (keep running in separate terminal)
kubectl port-forward -n observability svc/loki-gateway 3100:80 --context=dev

# 2. Use the tools
./llm-show-failures prompts --time 24h    # Show actual LLM calls
./klogs search --message "latex" --time 1h # Search logs
```

## Tools

### 1. klogs (Main CLI)

Query Loki logs directly from the command line.

**Common commands:**
```bash
# Search logs
./klogs search --message "unmarshal" --time 24h
./klogs search --message "latex" --time 1h
./klogs search --severity ERROR --time 30m

# View trace
./klogs trace <trace-id>

# Show errors
./klogs errors --time 1h

# Live tail
./klogs tail --severity ERROR
```

**All options:**
```bash
./klogs search --help
```

### 2. llm-show-failures (Data Inspector)

Show the ACTUAL data that failed - raw JSON, bad LaTeX, full prompts.

**Commands:**
```bash
# Show actual LLM prompts and responses
./llm-show-failures prompts --time 24h

# Show actual bad LaTeX
./llm-show-failures latex --time 1h

# Show JSON that failed to unmarshal
./llm-show-failures unmarshal --time 30m

# Dump latest errors with full data
./llm-show-failures dump --time 1h

# View ALL data for a specific trace
./llm-show-failures trace <trace-id>
```

**What you get:**
- Full LLM system and user prompts
- Complete LLM responses (JSON output)
- Actual malformed LaTeX that failed validation
- Raw JSON that couldn't be unmarshaled
- Complete error context with trace IDs
- Stack traces and metadata

## Prerequisites

1. **Port-forward to Loki:**
   ```bash
   kubectl port-forward -n observability svc/loki-gateway 3100:80 --context=dev
   ```
   Keep this running in a separate terminal.

2. **Python 3.7+** (for `klogs` and `llm-show-failures`)
   ```bash
   python3 --version  # Should be 3.7+
   ```

3. **kubectl** configured with the `dev` context (the port-forward command uses `--context=dev`)

## Common Workflows

### Investigation: "Why did this lesson plan fail?"

```bash
# 1. Find errors
./klogs search --message "lessonPlanID.*lp_..." --severity ERROR --time 2h

# 2. Get trace ID from output, then view full trace
./llm-show-failures trace <trace-id>

# 3. See actual LLM calls
./llm-show-failures prompts --time 2h

# 4. Check for LaTeX issues
./llm-show-failures latex --time 2h
```

### Monitoring: "What's failing right now?"

```bash
# Recent errors with full data
./llm-show-failures dump --time 30m

# Live stream
./klogs tail --severity ERROR
```

### Debugging: "What did the LLM actually generate?"

```bash
# See last 5 LLM calls with full prompts and responses
./llm-show-failures prompts --time 1h 5
```

### Analysis: "Show me LaTeX correction failures"

```bash
# Find LaTeX errors
./klogs search --message "latex" --severity DEBUG --time 4h

# See the actual bad LaTeX
./llm-show-failures latex --time 4h
```

## Understanding the Output

### klogs Output

Human-readable format:
```
DEBUG [08:31:50] latex corrected output
       ↪ lesson_plan/node_reconciler.go:939
       🔗 Trace: 4004a689b0954b4eb07dec2e78600073
       👤 User: 7jySsBW1Zda0WXOiMOvvfPFbLSg1
```

With the `--json` flag, klogs also prints the full JSON record after the human-readable output.
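
When copying trace IDs out of the human-readable output by hand gets tedious, they can be pulled out with a small script. This is a minimal sketch, assuming trace IDs always appear as 32 lowercase hex characters after `Trace:` (as in the sample above) and that no color codes sit between the label and the ID. `extract_trace_ids` is a hypothetical helper, not part of klogs.

```python
import re
from typing import List

# Assumption: trace IDs are 32 lowercase hex characters following "Trace:",
# matching the sample klogs output shown above.
TRACE_RE = re.compile(r"Trace:\s*([0-9a-f]{32})\b")


def extract_trace_ids(output: str) -> List[str]:
    """Return unique trace IDs in order of first appearance."""
    seen = {}
    for match in TRACE_RE.finditer(output):
        # dicts keep insertion order, so this de-duplicates without reordering
        seen.setdefault(match.group(1), None)
    return list(seen)


sample = """DEBUG [08:31:50] latex corrected output
       ↪ lesson_plan/node_reconciler.go:939
       🔗 Trace: 4004a689b0954b4eb07dec2e78600073
       👤 User: 7jySsBW1Zda0WXOiMOvvfPFbLSg1
"""
print(extract_trace_ids(sample))
```

Each ID returned can then be passed to `./llm-show-failures trace <trace-id>`.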

### llm-show-failures Output

Shows structured data extracted from the logs:

**Prompts command:**
- Model and provider used
- Full system prompt (with instructions)
- User prompt (with input data)
- Complete LLM response (parsed JSON)
- Trace ID for correlation

**LaTeX command:**
- The actual malformed LaTeX text
- Error context (what field, what node)
- Trace ID for full investigation

**Unmarshal command:**
- The exact JSON that failed parsing
- Error message
- Node and lesson plan IDs
- Trace ID

## Troubleshooting

### "Cannot connect to Loki"

**Fix:** Start port-forward:
```bash
kubectl port-forward -n observability svc/loki-gateway 3100:80 --context=dev
```
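
To tell a missing port-forward apart from other failures, it can help to check whether anything is listening on the local port first. This is a minimal sketch, assuming the port-forward binds to `127.0.0.1:3100` as in the command above. `port_is_open` is a hypothetical helper, not part of klogs, and it only proves that a TCP connection succeeds, not that Loki itself is healthy.

```python
import socket


def port_is_open(host: str = "127.0.0.1", port: int = 3100, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts
        return False


if __name__ == "__main__":
    if port_is_open():
        print("Port 3100 is accepting connections")
    else:
        print("Nothing is listening on port 3100; start the port-forward")
```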

### "No logs found"

**Causes:**
1. Time range too narrow: try `--time 24h`
2. No matching logs exist in that window (for an error search, that means nothing failed)
3. Search term too specific: try a broader search

**Debug:**
```bash
# Check if ANY logs exist
./klogs search --time 24h --limit 5

# Try different severities
./klogs search --severity DEBUG --time 24h
```

### "Error running klogs" in llm-show-failures

**Causes:**
1. Port-forward not running
2. Search pattern that causes a LogQL error (for example, `[...]` in a regex)

**Fix:**
- Restart port-forward
- Check if klogs works directly: `./klogs search --message "test" --time 1h`
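
If a search term has to contain characters such as `[` or `]`, escaping them before the term reaches a regex filter is one way to avoid the LogQL error. This is a sketch under assumptions: the examples in this README suggest `--message` is treated as a regex, but that is not confirmed here, and shell or LogQL string quoting may need further escaping on top. `escape_literal` is a hypothetical helper, not part of klogs.

```python
# Regex metacharacters in RE2, the regex syntax Loki uses for LogQL filters.
_META = set(r"\.+*?()|[]{}^$")


def escape_literal(term: str) -> str:
    """Backslash-escape regex metacharacters so the term matches literally."""
    return "".join("\\" + ch if ch in _META else ch for ch in term)


print(escape_literal("[prompt-input-and-output]"))
```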

## Tips

### Create Aliases

Add to `~/.zshrc`:
```bash
alias llm-prompts='cd /path/to/klogs && ./llm-show-failures prompts --time 24h'
alias llm-latex='cd /path/to/klogs && ./llm-show-failures latex --time 24h'
alias llm-dump='cd /path/to/klogs && ./llm-show-failures dump --time 1h'
```

### Pipe to Tools

```bash
# Count errors
./llm-show-failures dump --time 24h | grep -c "ERROR"   # counts matching lines

# Save to file
./llm-show-failures prompts --time 24h > llm-calls.txt

# Search within results
./llm-show-failures prompts --time 4h | grep -A 20 "gpt-4"
```

### Time Ranges

Supported formats:
- `5m` - 5 minutes
- `30m` - 30 minutes
- `1h` - 1 hour
- `2h` - 2 hours
- `24h` - 24 hours
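
For scripts that wrap these tools, the formats above can be validated before calling klogs. This is a minimal sketch, assuming only the `m` and `h` suffixes listed here. The real klogs parser may accept more, and `parse_time_range` is a hypothetical helper.

```python
import re
from datetime import timedelta

# Assumption: only minute ("m") and hour ("h") suffixes, as listed above.
_UNITS = {"m": "minutes", "h": "hours"}
_RANGE_RE = re.compile(r"^(\d+)([mh])$")


def parse_time_range(value: str) -> timedelta:
    """Convert a string such as '30m' or '24h' into a timedelta."""
    match = _RANGE_RE.match(value.strip())
    if not match:
        raise ValueError(f"unsupported time range: {value!r}")
    amount, unit = match.groups()
    return timedelta(**{_UNITS[unit]: int(amount)})


print(parse_time_range("30m"))
```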

## Key Log Patterns

### LLM Operations

**Log prefix:** `[prompt-input-and-output]`

**Search:**
```bash
./klogs search --message "prompt-input-and-output" --time 24h
```

**Contains:**
- model, provider, promptType
- Full system and user prompts
- LLM response (rawOutput/output)
- Metadata: lessonPlanID, nodeID, etc.

### LaTeX Errors

**Messages:**
- `latex error before reformat` - Initial detection
- `latex corrected output` - After correction
- `corrected field still has malformed latex` - Correction failed

**Search:**
```bash
./klogs search --message "latex" --time 4h
```
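
The three messages above describe a progression, so search results can be sorted into stages to find the traces where correction ultimately failed. This is a sketch only: the stage names and the `(trace_id, message)` input shape are assumptions made for illustration, and only the message substrings come from the list above.

```python
from typing import Iterable, List, Optional, Tuple

# Message substrings from the list above, mapped to hypothetical stage names.
STAGES = [
    ("corrected field still has malformed latex", "correction_failed"),
    ("latex corrected output", "corrected"),
    ("latex error before reformat", "detected"),
]


def classify(message: str) -> Optional[str]:
    """Return the stage name for a log message, or None if it is unrelated."""
    for needle, stage in STAGES:
        if needle in message:
            return stage
    return None


def unresolved_traces(records: Iterable[Tuple[str, str]]) -> List[str]:
    """Return trace IDs with at least one failed correction, in first-seen order."""
    failed: List[str] = []
    for trace_id, message in records:
        if classify(message) == "correction_failed" and trace_id not in failed:
            failed.append(trace_id)
    return failed
```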

### Retries

**Messages:**
- `failed to reconcile node, retrying` - Retry trigger

**Fields:**
- `attempt` - Which retry (1 = first)
- `backoff` - Delay until next retry (ms)

**Search:**
```bash
./klogs search --message "retrying" --severity WARN --time 2h
```
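
The `attempt` and `backoff` fields can be rolled up to see which nodes retry most. This is a sketch under assumptions: it expects each retry log already parsed into a dict, and it groups by a `nodeID` key, which this README lists as metadata on LLM logs but does not confirm for retry logs. `summarize_retries` is a hypothetical helper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping


def summarize_retries(records: Iterable[Mapping]) -> Dict[str, dict]:
    """Per node: highest attempt number seen and total backoff in milliseconds."""
    summary = defaultdict(lambda: {"retries": 0, "backoff_ms": 0})
    for rec in records:
        # Assumption: retry records carry a nodeID; fall back to "unknown"
        entry = summary[rec.get("nodeID", "unknown")]
        entry["retries"] = max(entry["retries"], int(rec["attempt"]))
        entry["backoff_ms"] += int(rec["backoff"])
    return dict(summary)
```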

## Files

- `klogs` - Main log query CLI (Python)
- `llm-show-failures` - Data extraction tool (Python)
- `test-real-logs.sh` - Test script
- `docs/` - This documentation

## Architecture

```
llm-show-failures
      ↓
   klogs (with --json)
      ↓
   Loki API (via port-forward)
      ↓
   Loki (log storage)
      ↑
   OpenTelemetry Collector
      ↑
   resources-graphql pods
```
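
The hand-off from klogs to a consumer such as llm-show-failures involves mixed output: human-readable lines, JSON, and possibly ANSI color codes. The sketch below shows one way to handle that, not the tools' actual implementation. It assumes each JSON record occupies a single line starting with `{`, which this README does not confirm. `strip_ansi` and `parse_json_lines` are hypothetical names.

```python
import json
import re
from typing import List

# Matches ANSI CSI escape sequences such as "\x1b[31m" (color) and "\x1b[0m" (reset).
ANSI_RE = re.compile(r"\x1b\[[0-?]*[ -/]*[@-~]")


def strip_ansi(text: str) -> str:
    """Remove ANSI CSI escape sequences from text."""
    return ANSI_RE.sub("", text)


def parse_json_lines(output: str) -> List[dict]:
    """Collect JSON objects found on their own lines; skip everything else."""
    records = []
    for line in strip_ansi(output).splitlines():
        line = line.strip()
        if not line.startswith("{"):
            continue  # human-readable line
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            continue  # looked like JSON but was not valid
        if isinstance(obj, dict):
            records.append(obj)
    return records
```

Stripping the escape codes before calling `json.loads` matters because a color code wrapped around a JSON line makes it invalid JSON.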

## Performance

**klogs:**
- Fast queries (<1 second for recent logs)
- Slow for large time ranges (>24h)
- Limit results with `--limit`

**llm-show-failures:**
- Adds parsing overhead
- Use specific time ranges
- Limit prompts display (e.g., `prompts --time 24h 5`)

## Getting Help

```bash
# klogs help
./klogs --help
./klogs search --help

# llm-show-failures help
./llm-show-failures --help
```

## Updates

**Last Updated:** October 29, 2024  
**Version:** 2.0 (Fixed ANSI color code parsing)

**Changes:**
- Fixed JSON parsing (strips ANSI escape codes)
- Updated search patterns (removed `[...]` regex issues)
- Improved error messages
- Added test script
