From one-shot generation to mechanic-aware iterative improvement.
What the system does
CreativeGame decomposes generation into planning, code generation, testing, evaluation, reflection, and memory writing. The system is designed to address runtime failure, weak cross-generation memory, saturated LLM creativity scores, and the tendency to treat mechanics only as post-hoc descriptions.
Abstract
The current pipeline couples mechanic retrieval, CurrentMechanicSet planning, runtime validation, lineage memory, and proxy-based reward to optimize structural game change rather than surface novelty alone.
Five stages from intent to evaluated game.
Planner
Retrieves mechanic-library context and emits CurrentMechanicSet declaring what to preserve, add, remove, or recombine.
Code Generation
Skeleton → Feature → Visual → Refinement. Each sub-stage receives the planned mechanic set.
Tester
Static and optional browser validation. Failed runtime checks block reward.
Evaluator
Scores creativity and functionality (1-10), extracts realized mechanics, and computes reward against CurrentMechanicSet.
Reflector + Memory
Writes mechanic-level deltas to lineage memory and optionally updates the global archive.
Before this version
- Mechanic library acted mainly as retrieval support and novelty baseline.
- Versions were generated primarily from prompt interpretation.
- Evaluation compared versions, but not plan versus implementation.
What CurrentMechanicSet adds
- Explicit separation between preserved and newly added mechanics.
- Reward can be interpreted as planned structural realization rather than only post-hoc comparison.
- Memory can store mechanic-level deltas instead of only version summaries.
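A minimal sketch of what CurrentMechanicSet could look like as a data structure. The four operation fields come from the document; everything else (the class layout, the `delta` helper) is an assumption, not the codebase's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class CurrentMechanicSet:
    # Explicit separation between preserved and newly added mechanics,
    # as described above. Field layout is illustrative, not the real schema.
    preserve: list = field(default_factory=list)
    add: list = field(default_factory=list)
    remove: list = field(default_factory=list)
    recombine: list = field(default_factory=list)

    def delta(self):
        # A mechanic-level delta that memory could store instead of
        # a whole-version summary. Hypothetical helper.
        return {"added": self.add, "removed": self.remove}

cms = CurrentMechanicSet(preserve=["jump"], add=["wall_run"],
                         remove=["double_jump"])
print(cms.delta())  # {'added': ['wall_run'], 'removed': ['double_jump']}
```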
Key numbers from the current codebase.
What is in code, in progress, and still open.
In the current code
- Planner reads mechanic-library context from a 774-entry archive.
- CurrentMechanicSet is parsed and stored per iteration.
- Planner output is forwarded to Evaluator and Reflector for plan-vs-realized comparison.
- API responses expose mechanic trace and reward breakdown.
- The web gallery visualizes mechanic planning and evaluation.
Being strengthened
- LLM extraction is being wired into the live pipeline to replace regex-based mechanic delta tracking.
- Planned-vs-realized comparison is being formalized through realization_score.
- Write-back is evolving into a two-path loop: lineage-local memory plus cross-lineage archive update.
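One plausible shape for realization_score is the fraction of planned mechanics (preserved plus added) that actually appear in the generated game. The document only names the metric, so this formula is an assumption:

```python
def realization_score(planned, realized):
    """Fraction of planned mechanics (preserve + add) actually realized.

    Hypothetical definition: the document names realization_score but
    does not specify its formula.
    """
    target = set(planned.get("preserve", [])) | set(planned.get("add", []))
    if not target:
        return 1.0  # nothing was planned, so the plan is trivially realized
    return len(target & set(realized)) / len(target)

planned = {"preserve": ["jump"], "add": ["gravity_flip", "combo_meter"]}
realized = ["jump", "gravity_flip"]
print(realization_score(planned, realized))  # 2 of 3 planned mechanics realized
```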
Missing for research-grade claims
- Human pairwise evaluation of creativity improvement.
- MemRL and reward ablation studies.
- Controlled evidence that mechanic-guided planning outperforms prompt-only generation.
Four lineages across four mechanic anchors.
Each tab is one lineage: Base → Round 2 → Round 3 → Round 4. These source games act as representative mechanic anchors rather than direct cloning targets.
Current gaps and caveats.
No human evaluation yet
LLM self-evaluation still inflates scores, and no controlled pairwise study has yet been run to validate creative improvement from a human perspective.
No memory ablation yet
Memory is implemented but its contribution has not been isolated. It is still unclear how much improvement comes from memory rather than base model capability.
Mechanic extraction remains partial
The archive already uses LLM-based extraction, but live generation still relies partly on regex-based mechanic_delta. Full end-to-end integration remains the next concrete engineering task.
The key gap has shifted
The central question is no longer whether the system can generate games, but whether mechanic-aware planning can reliably produce more structurally creative outcomes than prompt-only generation.
Why reward decreases in later iterations.
Round 1 rewards are consistently 1.00. By rounds 3-4, reward drops to 0.55-0.80 in three of four lineages despite stable LLM scores of C=8/F=8. This reflects a structural tension in the reward formula rather than simple quality degradation.
structural_change decay
structural_change rewards added mechanics and explicit structural delta. In preserve-and-deepen rounds, both naturally approach zero, so the formula cannot distinguish deliberate integration from no change.
novelty decay
relative_mechanic_novelty is measured against the archive. As later versions are written back, the lineage begins to compete with its own prior states, driving novelty downward.
cosmetic penalty misfire
Visual-deepening rounds can trigger cosmetic-only penalties even when they are planned and mechanically meaningful, which misaligns the reward with the intended creative strategy.
baseline offset
The formula applies composite = raw * 2 - 0.3. A neutral iteration with no penalties still scores negatively, making later refinement rounds structurally harder to score well.
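The baseline offset is easy to see numerically. Using the stated formula composite = raw * 2 - 0.3, a neutral iteration with raw = 0 scores -0.3, and raw must exceed 0.15 just to break even:

```python
def composite(raw):
    # Reward formula as stated in the document.
    return raw * 2 - 0.3

print(composite(0.0))   # -0.3: a neutral, penalty-free iteration still scores negatively
print(composite(0.15))  # 0.0: the break-even point
print(composite(0.65))  # 1.0: consistent with round-1 rewards of 1.00
```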
The current reward is better at measuring per-round novelty than lineage deepening. A later-round game that tightly integrates several prior mechanics can therefore score lower than an earlier-round game that introduces one shallow new mechanic. This suggests two next steps: a cumulative mechanic coupling score and a lineage-relative novelty baseline.
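The proposed lineage-relative novelty baseline could work by measuring novelty against the archive with the lineage's own prior states excluded, so a lineage stops competing with itself. The sketch below uses a Jaccard-style nearest-neighbour distance as the novelty metric; that metric and all names are assumptions, since the document only names relative_mechanic_novelty.

```python
def mechanic_novelty(candidate, reference):
    # Novelty = distance to the nearest reference mechanic set,
    # using Jaccard distance. Hypothetical metric: the document does
    # not specify how relative_mechanic_novelty is computed.
    def distance(a, b):
        a, b = set(a), set(b)
        return 1 - len(a & b) / len(a | b) if (a | b) else 0.0
    return min((distance(candidate, r) for r in reference), default=1.0)

archive = [["jump", "shoot"], ["jump", "gravity_flip"]]
lineage = [["jump", "gravity_flip"]]  # this lineage's own written-back state

candidate = ["jump", "gravity_flip", "combo"]

# Global novelty: the lineage's own write-backs drag the score down.
global_novelty = mechanic_novelty(candidate, archive)

# Lineage-relative novelty: exclude the lineage's own prior states.
external = [r for r in archive if r not in lineage]
lineage_relative = mechanic_novelty(candidate, external)

print(global_novelty, lineage_relative)  # lineage-relative score is higher
```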
Next priorities.
Consolidate CurrentMechanicSet schema across generation paths, connect planned mechanics to evaluation via realization_score, integrate LLM-based mechanic extraction into the live pipeline, and run human pairwise evaluation to test whether mechanic-aware planning produces measurably more creative outcomes than prompt-only baselines.