## Existing work that basically solves the interface problem

- **RL-style step interfaces**: `reset(seed) -> obs`, `step(action) -> obs, reward, done, info`. This makes “fast-paced” irrelevant because time becomes _discrete ticks you control_, not wall-clock. Gym/Gymnasium documents this exact contract. ([Gymnasium][1])
- **Expose internal state, not pixels**: the NetHack Learning Environment kept the original game but added seeding and **internal state exposure** to the frontend. ([NeurIPS Proceedings][2])
- **LLM agents that avoid vision by using environment APIs**: Voyager in Minecraft leans on APIs to get precise state and execute higher-level actions (instead of trying to “see” everything). ([arXiv][3])
- **Forward-model / clone-and-rollout**: the GVGAI planning track explicitly gives agents a **forward model** so they can simulate outcomes from a state. This is _huge_ for autonomous debugging and test generation. ([AAAI][4])
- **Headless automated testing is normal in engines**: Unity supports running tests from the CLI in batchmode; Godot projects are commonly tested headlessly via frameworks like GUT. ([Unity Documentation][5])
- **Deterministic replay**: standard game-dev wisdom: if your sim is deterministic, you can record inputs + seed and replay. ([GameDev][6])
- **Determinism is hard in practice** (floats, iteration order, physics, threading): there’s dedicated analysis of nondeterminism sources in game engines used for simulation. ([Bristol Research Information][7])

## The mental model I’d replace yours with

### 1) Make the “game” a **deterministic program** with a controllable clock

Instead of “run/jump continuously,” define:

- a fixed timestep (e.g.
1 tick = 16 ms)
- `step(action, n_ticks=1)`, where `action` is held constant for those ticks (or provide a per-tick action list)

Now “fast-paced platformer” becomes: “simulate 240 ticks with this input schedule.”

### 2) Observations should be **semantic**, not visual

Don’t dump the entire world every frame. Give:

- **stable entity IDs**
- component values needed for reasoning (pos/vel, grounded, animation state, controller state, collision contacts)
- **events since last step** (landed, took damage, entered trigger, collected item)
- optionally a **tile-window** around the player (cheap spatial grounding)

Crucially: support **querying** instead of flooding:

- `get_entity(id)`
- `get_entities_in_aabb(x1,y1,x2,y2, filters…)`
- `get_contacts(entity_id)`
- `get_tilemap_patch(cx,cy,radius)`
- `diff_since(last_tick)` returning only changed fields

### 3) Give the agent a forward-model superpower

Add:

- `snapshot() -> handle`
- `restore(handle)`
- `clone_and_step(handle, action, ticks) -> (handle2, obs2)`

This lets an LLM _branch_ and test hypotheses without re-running whole scenarios (the same trick GVGAI leans on). ([AAAI][4])

### 4) Testing needs an oracle (otherwise you get “it ran”)

Provide first-class assertions/invariants the LLM can lean on:

- physics invariants (no tunneling through solid tiles; max penetration depth)
- controller invariants (can only jump if grounded/within coyote time; jump-buffer behavior)
- gameplay invariants (HP never negative; collectible counts monotonic; camera stays in bounds)

And add property-based fuzz hooks:

- random seeds + random input schedules
- shrink failing cases (store the minimal failing input trace)

### 5) Keep rendering as a separate, _tested adapter_

Your “thin UI” instinct is solid, with one tweak: **treat rendering as a consumer of a render packet** (a draw list / sprite-batch commands). Then you can snapshot-test render packets without pixels, and only run pixel golden tests occasionally.
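To make the controllable-clock idea in §1 concrete, here is a minimal sketch of the `step(action, n_ticks)` contract plus a `hash_state()` digest for replay regression checks. The `Sim`/`Player` classes, the action names, and the toy movement rules are all invented for illustration, not a prescribed engine API:

```python
import hashlib
import random
from dataclasses import dataclass

TICK_MS = 16  # fixed timestep: one tick = 16 ms of simulated time, never wall-clock


@dataclass
class Player:
    x: float = 0.0
    vx: float = 0.0


class Sim:
    """Minimal deterministic sim: seeded RNG, fixed ticks, no wall-clock anywhere."""

    def __init__(self, seed: int) -> None:
        self.rng = random.Random(seed)  # all randomness flows from the seed
        self.tick = 0
        self.player = Player()

    def step(self, action: str, n_ticks: int = 1) -> dict:
        """Hold `action` constant for n_ticks and advance in fixed increments."""
        for _ in range(n_ticks):
            if action == "run_right":
                self.player.vx = 2.0
            elif action == "idle":
                self.player.vx = 0.0
            self.player.x += self.player.vx  # integrate exactly one tick
            self.tick += 1
        return {"tick": self.tick, "x": self.player.x, "vx": self.player.vx}

    def hash_state(self) -> str:
        """Stable digest of the full sim state, for golden regression checks."""
        blob = f"{self.tick}:{self.player.x:.6f}:{self.player.vx:.6f}"
        return hashlib.sha256(blob.encode()).hexdigest()


# Same seed + same input schedule => identical state hash on every run.
a, b = Sim(seed=7), Sim(seed=7)
for sim in (a, b):
    sim.step("run_right", n_ticks=30)
    sim.step("idle", n_ticks=10)
assert a.hash_state() == b.hash_state()
```

The key design point is that nothing in `step` reads a clock or an unseeded RNG, so a recorded `(seed, actions)` pair replays to the same hash forever.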
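A sketch of the “query, don’t flood” observation surface from §2, including `diff_since`. The `World` class and its per-field change tracking are one illustrative implementation, not the only way to do it:

```python
from typing import Any


class World:
    """Toy entity store exposing query-style observations instead of full dumps."""

    def __init__(self) -> None:
        self.tick = 0
        self.entities: dict[int, dict[str, Any]] = {}  # entity id -> components
        self._changed_at: dict[tuple, int] = {}        # (id, field) -> write tick

    def set_field(self, eid: int, field: str, value: Any) -> None:
        self.entities.setdefault(eid, {})[field] = value
        self._changed_at[(eid, field)] = self.tick

    def get_entity(self, eid: int) -> dict:
        return dict(self.entities[eid])

    def get_entities_in_aabb(self, x1, y1, x2, y2) -> list[int]:
        """Cheap spatial query: ids of entities whose position falls in the box."""
        return [eid for eid, e in self.entities.items()
                if x1 <= e.get("x", 0) <= x2 and y1 <= e.get("y", 0) <= y2]

    def diff_since(self, last_tick: int) -> dict:
        """Only the fields written after last_tick, keyed by entity id."""
        out: dict[int, dict] = {}
        for (eid, field), t in self._changed_at.items():
            if t > last_tick:
                out.setdefault(eid, {})[field] = self.entities[eid][field]
        return out


w = World()
w.set_field(1, "x", 0.0)
w.set_field(1, "y", 1.0)
w.set_field(1, "grounded", True)
w.tick = 1
w.set_field(1, "x", 2.0)                   # only x moved on tick 1
assert w.diff_since(0) == {1: {"x": 2.0}}  # the agent sees just the delta
```

The point of `diff_since` is bandwidth for the agent: after the first full observation, each step only needs the changed fields plus the event list.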
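The snapshot/branch API from §3 can be sketched with deep copies; the dict-of-state `Sim`, the handle table, and the action names are assumptions for the demo (a real engine would serialize its ECS state instead):

```python
import copy


class Sim:
    """Tiny sim whose entire state lives in one dict, so snapshots are trivial."""

    def __init__(self) -> None:
        self.state = {"tick": 0, "x": 0.0}

    def step(self, action: str, ticks: int = 1) -> dict:
        for _ in range(ticks):
            self.state["x"] += {"right": 1.0, "left": -1.0}.get(action, 0.0)
            self.state["tick"] += 1
        return dict(self.state)


_snapshots: dict[int, dict] = {}


def snapshot(sim: Sim) -> int:
    """snapshot() -> handle: stash a deep copy of the current state."""
    handle = len(_snapshots)
    _snapshots[handle] = copy.deepcopy(sim.state)
    return handle


def restore(sim: Sim, handle: int) -> None:
    """Rewind the live sim to a previously saved state."""
    sim.state = copy.deepcopy(_snapshots[handle])


def clone_and_step(handle: int, action: str, ticks: int):
    """Branch from a saved state without disturbing the live sim."""
    clone = Sim()
    clone.state = copy.deepcopy(_snapshots[handle])
    obs = clone.step(action, ticks)
    return clone, obs


sim = Sim()
sim.step("right", 5)                      # live sim: x = 5
h = snapshot(sim)
_, left = clone_and_step(h, "left", 3)    # hypothesis branch: x = 2
_, right = clone_and_step(h, "right", 3)  # hypothesis branch: x = 8
assert sim.state["x"] == 5.0              # branching never touched the live sim
```

This is exactly the shape an LLM needs to test “what happens if I jump one tick later?” without replaying the whole scenario from `reset`.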
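The fuzz-and-shrink loop from §4 can be hand-rolled in a few lines. The damage model, the planted bug, and the greedy shrinker are all hypothetical demo pieces; the invariant (“HP never negative”) is the one from the list above:

```python
import random


def run_trace(trace: list[int]) -> int:
    """Toy damage model: returns hp after a list of hit amounts.
    Deliberately buggy for the demo: hp is never clamped at zero."""
    hp = 3
    for hit in trace:
        hp -= hit
    return hp


def violates(trace: list[int]) -> bool:
    return run_trace(trace) < 0  # invariant under test: HP never negative


def shrink(trace: list[int]) -> list[int]:
    """Greedy shrink: drop hits one at a time while the trace still fails."""
    i = 0
    while i < len(trace):
        candidate = trace[:i] + trace[i + 1:]
        if violates(candidate):
            trace = candidate  # still failing without this hit: keep the cut
        else:
            i += 1
    return trace


rng = random.Random(0)  # seeded fuzzing: the whole run is reproducible
minimal = None
for _ in range(200):    # property-based fuzz loop over random input schedules
    trace = [rng.randint(0, 2) for _ in range(rng.randint(1, 20))]
    if violates(trace):
        minimal = shrink(trace)  # store the minimal failing input trace
        break
```

Storing `minimal` (plus the seed) as a regression artifact is what turns a one-off fuzz hit into a permanent test case.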
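And the render-packet idea from §5, as a pure function you can snapshot-test without pixels. The draw-command tuple shape and the `build_render_packet` name are illustrative assumptions:

```python
def build_render_packet(entities: list[dict]) -> list[tuple]:
    """Pure function: semantic state -> sorted draw-command list (no pixels)."""
    cmds = [("sprite", e["sprite"], round(e["x"]), round(e["y"]))
            for e in entities]
    return sorted(cmds)  # stable ordering => byte-stable golden snapshots


world = [
    {"sprite": "player", "x": 3.2, "y": 1.0},
    {"sprite": "coin", "x": 10.0, "y": 1.0},
]
packet = build_render_packet(world)

# Golden snapshot test: compare draw commands, not rendered pixels.
golden = [("sprite", "coin", 10, 1), ("sprite", "player", 3, 1)]
assert packet == golden
```

Because the packet is plain data, a failed snapshot diff tells you *which* draw command changed; occasional pixel golden tests then only cover the renderer itself.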
## A concrete interface that tends to work well

Think “Gym + replay + introspection”:

- `reset(seed, scenario_id)`
- `step(action | action_macro, ticks=1)`
- `observe(mode="minimal"|"debug"|"tile_patch"|"entity_dump")`
- `events()` (since the last step)
- `snapshot()` / `restore()`
- `replay(record: {seed, actions[]}) -> final_state_hash`
- `hash_state()` (for deterministic regression checks)
- `assert(predicate_id, args)` (engine-owned assertions, not LLM-judged)

**State hashes + deterministic replay** give you a brutal, reliable pass/fail signal for regressions (recorded traces become golden tests). ([GameDev][6])

## Hard critique of your current approach

- **“LLM triggers running and jumping” is too low-level**: you want _macro-actions_ (“run right for 30 ticks, jump at tick 12”) and the ability to auto-search for a macro that satisfies a goal/invariant.
- **Determinism will make or break this**: if physics/ordering isn’t deterministic, you’ll chase ghosts forever. Bake determinism in early (fixed timestep, seeded RNG, stable iteration order, single-threaded sim in test mode). ([Bristol Research Information][7])

If you build only one thing first: build the **headless step runner + semantic observation + deterministic replay**. Once that exists, an LLM can autonomously add features, because every change can be validated by (1) invariants, (2) replay traces, and (3) state hashes, without ever “playing” visually.

[1]: https://gymnasium.farama.org/api/env/ "Env - Gymnasium Documentation"
[2]: https://proceedings.nips.cc/paper/2020/file/569ff987c643b4bedf504efda8f786c2-Paper.pdf "The NetHack Learning Environment - NIPS papers"
[3]: https://arxiv.org/abs/2305.16291 "Voyager: An Open-Ended Embodied Agent with Large ..."
[4]: https://cdn.aaai.org/ojs/9869/9869-13-13397-1-2-20201228.pdf "General Video Game AI: Competition, Challenges, and ..."
[5]: https://docs.unity3d.com/6000.4/Documentation/Manual/test-framework/run-tests-from-command-line.html "Run tests from the command line"
[6]: https://www.gamedev.net/forums/topic/439336-replay-system/ "Replay System - General and Gameplay Programming"
[7]: https://research-information.bris.ac.uk/files/331516996/Full_text_PDF_final_published_version_.pdf "On Determinism of Game Engines Used for Simulation- ..."