## Existing work that basically solves the interface problem

- **RL-style step interfaces**: `reset(seed) -> obs`, `step(action) -> obs, reward, done, info`. This makes “fast-paced” irrelevant because time becomes _discrete ticks you control_, not wall-clock. Gym/Gymnasium documents this exact contract. ([Gymnasium][1])
- **Expose internal state, not pixels**: the NetHack Learning Environment kept the original game but added seeding and **internal state exposure** to the frontend. ([NeurIPS Proceedings][2])
- **LLM agents that avoid vision by using environment APIs**: Voyager in Minecraft leans on APIs to get precise state and execute higher-level actions (instead of trying to “see” everything). ([arXiv][3])
- **Forward-model / clone-and-rollout**: the GVGAI planning track explicitly gives agents a **forward model** so they can simulate outcomes from a state. This is _huge_ for autonomous debugging and test generation. ([AAAI][4])
- **Headless automated testing is normal in engines**: Unity supports running tests from the CLI in batchmode; Godot projects are commonly tested headlessly via frameworks like GUT. ([Unity Documentation][5])
- **Deterministic replay**: standard game-dev wisdom: if your sim is deterministic, you can record inputs + seed and replay. ([GameDev][6])
- **Determinism is hard in practice** (floats, iteration order, physics, threading): there’s dedicated analysis of nondeterminism sources in game engines used for simulation. ([Bristol Research Information][7])

## The mental model I’d replace yours with

### 1) Make the “game” a **deterministic program** with a controllable clock

Instead of “run/jump continuously,” define:

- a fixed timestep (e.g.
1 tick = 16 ms)
- `step(action, n_ticks=1)`, where `action` is held constant for those ticks (or provide a per-tick action list)

Now “fast-paced platformer” becomes: “simulate 240 ticks with this input schedule.”

### 2) Observations should be **semantic**, not visual

Don’t dump the entire world every frame. Give:

- **stable entity IDs**
- component values needed for reasoning (pos/vel, grounded, animation state, controller state, collision contacts)
- **events since last step** (landed, took damage, entered trigger, collected item)
- optionally a **tile-window** around the player (cheap spatial grounding)

Crucially: support **querying** instead of flooding:

- `get_entity(id)`
- `get_entities_in_aabb(x1,y1,x2,y2, filters…)`
- `get_contacts(entity_id)`
- `get_tilemap_patch(cx,cy,radius)`
- `diff_since(last_tick)` returning only changed fields

### 3) Give the agent a forward-model superpower

Add:

- `snapshot() -> handle`
- `restore(handle)`
- `clone_and_step(handle, action, ticks) -> (handle2, obs2)`

This lets an LLM _branch_ and test hypotheses without re-running whole scenarios (the same trick GVGAI leans on). ([AAAI][4])

### 4) Testing needs an oracle (otherwise you get “it ran”)

Provide first-class assertions/invariants the LLM can lean on:

- physics invariants (no tunneling through solid tiles; max penetration depth)
- controller invariants (can only jump if grounded/within coyote time; jump-buffer behavior)
- gameplay invariants (HP never negative; collectible counts monotonic; camera stays in bounds)

And add property-based fuzz hooks:

- random seeds + random input schedules
- shrink failing cases (store the minimal failing input trace)

### 5) Keep rendering as a separate, _tested adapter_

Your “thin UI” instinct is solid, with one tweak: **treat rendering as a consumer of a render packet** (a draw list / sprite-batch commands). Then you can snapshot-test render packets without pixels, and only run pixel golden tests occasionally.
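To make the controllable-clock idea in §1 concrete, here is a minimal sketch of the `step(action, n_ticks)` contract plus a `hash_state()` digest for replay regression checks. The `Sim`/`Player` classes, the action names, and the toy movement rules are all invented for illustration, not a prescribed engine API:

```python
import hashlib
import random
from dataclasses import dataclass

TICK_MS = 16  # fixed timestep: one tick = 16 ms of simulated time, never wall-clock


@dataclass
class Player:
    x: float = 0.0
    vx: float = 0.0


class Sim:
    """Minimal deterministic sim: seeded RNG, fixed ticks, no wall-clock anywhere."""

    def __init__(self, seed: int) -> None:
        self.rng = random.Random(seed)  # all randomness flows from the seed
        self.tick = 0
        self.player = Player()

    def step(self, action: str, n_ticks: int = 1) -> dict:
        """Hold `action` constant for n_ticks and advance in fixed increments."""
        for _ in range(n_ticks):
            if action == "run_right":
                self.player.vx = 2.0
            elif action == "idle":
                self.player.vx = 0.0
            self.player.x += self.player.vx  # integrate exactly one tick
            self.tick += 1
        return {"tick": self.tick, "x": self.player.x, "vx": self.player.vx}

    def hash_state(self) -> str:
        """Stable digest of the full sim state, for golden regression checks."""
        blob = f"{self.tick}:{self.player.x:.6f}:{self.player.vx:.6f}"
        return hashlib.sha256(blob.encode()).hexdigest()


# Same seed + same input schedule => identical state hash on every run.
a, b = Sim(seed=7), Sim(seed=7)
for sim in (a, b):
    sim.step("run_right", n_ticks=30)
    sim.step("idle", n_ticks=10)
assert a.hash_state() == b.hash_state()
```

The key design point is that nothing in `step` reads a clock or an unseeded RNG, so a recorded `(seed, actions)` pair replays to the same hash forever.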
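A sketch of the “query, don’t flood” observation surface from §2, including `diff_since`. The `World` class and its per-field change tracking are one illustrative implementation, not the only way to do it:

```python
from typing import Any


class World:
    """Toy entity store exposing query-style observations instead of full dumps."""

    def __init__(self) -> None:
        self.tick = 0
        self.entities: dict[int, dict[str, Any]] = {}  # entity id -> components
        self._changed_at: dict[tuple, int] = {}        # (id, field) -> write tick

    def set_field(self, eid: int, field: str, value: Any) -> None:
        self.entities.setdefault(eid, {})[field] = value
        self._changed_at[(eid, field)] = self.tick

    def get_entity(self, eid: int) -> dict:
        return dict(self.entities[eid])

    def get_entities_in_aabb(self, x1, y1, x2, y2) -> list[int]:
        """Cheap spatial query: ids of entities whose position falls in the box."""
        return [eid for eid, e in self.entities.items()
                if x1 <= e.get("x", 0) <= x2 and y1 <= e.get("y", 0) <= y2]

    def diff_since(self, last_tick: int) -> dict:
        """Only the fields written after last_tick, keyed by entity id."""
        out: dict[int, dict] = {}
        for (eid, field), t in self._changed_at.items():
            if t > last_tick:
                out.setdefault(eid, {})[field] = self.entities[eid][field]
        return out


w = World()
w.set_field(1, "x", 0.0)
w.set_field(1, "y", 1.0)
w.set_field(1, "grounded", True)
w.tick = 1
w.set_field(1, "x", 2.0)                   # only x moved on tick 1
assert w.diff_since(0) == {1: {"x": 2.0}}  # the agent sees just the delta
```

The point of `diff_since` is bandwidth for the agent: after the first full observation, each step only needs the changed fields plus the event list.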
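The snapshot/branch API from §3 can be sketched with deep copies; the dict-of-state `Sim`, the handle table, and the action names are assumptions for the demo (a real engine would serialize its ECS state instead):

```python
import copy


class Sim:
    """Tiny sim whose entire state lives in one dict, so snapshots are trivial."""

    def __init__(self) -> None:
        self.state = {"tick": 0, "x": 0.0}

    def step(self, action: str, ticks: int = 1) -> dict:
        for _ in range(ticks):
            self.state["x"] += {"right": 1.0, "left": -1.0}.get(action, 0.0)
            self.state["tick"] += 1
        return dict(self.state)


_snapshots: dict[int, dict] = {}


def snapshot(sim: Sim) -> int:
    """snapshot() -> handle: stash a deep copy of the current state."""
    handle = len(_snapshots)
    _snapshots[handle] = copy.deepcopy(sim.state)
    return handle


def restore(sim: Sim, handle: int) -> None:
    """Rewind the live sim to a previously saved state."""
    sim.state = copy.deepcopy(_snapshots[handle])


def clone_and_step(handle: int, action: str, ticks: int):
    """Branch from a saved state without disturbing the live sim."""
    clone = Sim()
    clone.state = copy.deepcopy(_snapshots[handle])
    obs = clone.step(action, ticks)
    return clone, obs


sim = Sim()
sim.step("right", 5)                      # live sim: x = 5
h = snapshot(sim)
_, left = clone_and_step(h, "left", 3)    # hypothesis branch: x = 2
_, right = clone_and_step(h, "right", 3)  # hypothesis branch: x = 8
assert sim.state["x"] == 5.0              # branching never touched the live sim
```

This is exactly the shape an LLM needs to test “what happens if I jump one tick later?” without replaying the whole scenario from `reset`.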
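The fuzz-and-shrink loop from §4 can be hand-rolled in a few lines. The damage model, the planted bug, and the greedy shrinker are all hypothetical demo pieces; the invariant (“HP never negative”) is the one from the list above:

```python
import random


def run_trace(trace: list[int]) -> int:
    """Toy damage model: returns hp after a list of hit amounts.
    Deliberately buggy for the demo: hp is never clamped at zero."""
    hp = 3
    for hit in trace:
        hp -= hit
    return hp


def violates(trace: list[int]) -> bool:
    return run_trace(trace) < 0  # invariant under test: HP never negative


def shrink(trace: list[int]) -> list[int]:
    """Greedy shrink: drop hits one at a time while the trace still fails."""
    i = 0
    while i < len(trace):
        candidate = trace[:i] + trace[i + 1:]
        if violates(candidate):
            trace = candidate  # still failing without this hit: keep the cut
        else:
            i += 1
    return trace


rng = random.Random(0)  # seeded fuzzing: the whole run is reproducible
minimal = None
for _ in range(200):    # property-based fuzz loop over random input schedules
    trace = [rng.randint(0, 2) for _ in range(rng.randint(1, 20))]
    if violates(trace):
        minimal = shrink(trace)  # store the minimal failing input trace
        break
```

Storing `minimal` (plus the seed) as a regression artifact is what turns a one-off fuzz hit into a permanent test case.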
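And the render-packet idea from §5, as a pure function you can snapshot-test without pixels. The draw-command tuple shape and the `build_render_packet` name are illustrative assumptions:

```python
def build_render_packet(entities: list[dict]) -> list[tuple]:
    """Pure function: semantic state -> sorted draw-command list (no pixels)."""
    cmds = [("sprite", e["sprite"], round(e["x"]), round(e["y"]))
            for e in entities]
    return sorted(cmds)  # stable ordering => byte-stable golden snapshots


world = [
    {"sprite": "player", "x": 3.2, "y": 1.0},
    {"sprite": "coin", "x": 10.0, "y": 1.0},
]
packet = build_render_packet(world)

# Golden snapshot test: compare draw commands, not rendered pixels.
golden = [("sprite", "coin", 10, 1), ("sprite", "player", 3, 1)]
assert packet == golden
```

Because the packet is plain data, a failed snapshot diff tells you *which* draw command changed; occasional pixel golden tests then only cover the renderer itself.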
## A concrete interface that tends to work well

Think “Gym + replay + introspection”:

- `reset(seed, scenario_id)`
- `step(action | action_macro, ticks=1)`
- `observe(mode="minimal"|"debug"|"tile_patch"|"entity_dump")`
- `events()` (since the last step)
- `snapshot()` / `restore()`
- `replay(record: {seed, actions[]}) -> final_state_hash`
- `hash_state()` (for deterministic regression checks)
- `assert(predicate_id, args)` (engine-owned assertions, not LLM-judged)

**State hashes + deterministic replay** give you a brutal, reliable pass/fail signal for regressions (recorded traces become golden tests). ([GameDev][6])

## Hard critique of your current approach

- **“LLM triggers running and jumping” is too low-level**: you want _macro-actions_ (“run right for 30 ticks, jump at tick 12”) and the ability to auto-search for a macro that satisfies a goal/invariant.
- **Determinism will make or break this**: if physics/ordering isn’t deterministic, you’ll chase ghosts forever. Bake determinism in early (fixed timestep, seeded RNG, stable iteration order, single-threaded sim in test mode). ([Bristol Research Information][7])

If you build only one thing first: build the **headless step runner + semantic observation + deterministic replay**. Once that exists, an LLM can autonomously add features, because every change can be validated by (1) invariants, (2) replay traces, and (3) state hashes, without ever “playing” visually.

[1]: https://gymnasium.farama.org/api/env/ "Env - Gymnasium Documentation"
[2]: https://proceedings.nips.cc/paper/2020/file/569ff987c643b4bedf504efda8f786c2-Paper.pdf "The NetHack Learning Environment - NIPS papers"
[3]: https://arxiv.org/abs/2305.16291 "Voyager: An Open-Ended Embodied Agent with Large ..."
[4]: https://cdn.aaai.org/ojs/9869/9869-13-13397-1-2-20201228.pdf "General Video Game AI: Competition, Challenges, and ..."
[5]: https://docs.unity3d.com/6000.4/Documentation/Manual/test-framework/run-tests-from-command-line.html "Run tests from the command line"
[6]: https://www.gamedev.net/forums/topic/439336-replay-system/ "Replay System - General and Gameplay Programming"
[7]: https://research-information.bris.ac.uk/files/331516996/Full_text_PDF_final_published_version_.pdf "On Determinism of Game Engines Used for Simulation- ..."