From one-shot generation to mechanic-aware iterative improvement.
What the system does
CreativeGame decomposes generation into planning, code generation, testing, evaluation, reflection, and memory writing. The system is designed to address runtime failure, weak cross-generation memory, saturated LLM creativity scores, and the tendency to treat mechanics only as post-hoc descriptions.
Abstract
The current pipeline couples mechanic retrieval, CurrentMechanicSet planning, runtime validation, lineage memory, and proxy-based reward to optimize structural game change rather than surface novelty alone.
Five stages from intent to evaluated game.
Planner
Retrieves mechanic-library context and emits CurrentMechanicSet declaring what to preserve, add, remove, or recombine.
Code Generation
Skeleton → Feature → Visual → Refinement. Each sub-stage receives the planned mechanic set.
Tester
Static and optional browser validation. Failed runtime checks block reward.
Evaluator
Scores creativity and functionality (1-10), extracts realized mechanics, and computes reward against CurrentMechanicSet.
Reflector + Memory
Writes mechanic-level deltas to lineage memory and optionally updates the global archive.
Before this version
- Mechanic library acted mainly as retrieval support and novelty baseline.
- Versions were generated primarily from prompt interpretation.
- Evaluation compared versions, but not plan versus implementation.
What CurrentMechanicSet adds
- Explicit separation between preserved and newly added mechanics.
- Reward can be interpreted as planned structural realization rather than only post-hoc comparison.
- Memory can store mechanic-level deltas instead of only version summaries.
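A minimal sketch of what CurrentMechanicSet could look like as a data structure. The four operation fields come from the document; everything else (the class layout, the `delta` helper) is an assumption, not the codebase's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class CurrentMechanicSet:
    # Explicit separation between preserved and newly added mechanics,
    # as described above. Field layout is illustrative, not the real schema.
    preserve: list = field(default_factory=list)
    add: list = field(default_factory=list)
    remove: list = field(default_factory=list)
    recombine: list = field(default_factory=list)

    def delta(self):
        # A mechanic-level delta that memory could store instead of
        # a whole-version summary. Hypothetical helper.
        return {"added": self.add, "removed": self.remove}

cms = CurrentMechanicSet(preserve=["jump"], add=["wall_run"],
                         remove=["double_jump"])
print(cms.delta())  # {'added': ['wall_run'], 'removed': ['double_jump']}
```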
Key numbers from the current codebase.
What is in code, in progress, and still open.
In the current code
- Planner reads mechanic-library context from a 774-entry archive.
- CurrentMechanicSet is parsed and stored per iteration.
- Planner output is forwarded to Evaluator and Reflector for plan-vs-realized comparison.
- API responses expose mechanic trace and reward breakdown.
- The web gallery visualizes mechanic planning and evaluation.
Being strengthened
- LLM extraction is being wired into the live pipeline to replace regex-based mechanic delta tracking.
- Planned-vs-realized comparison is being formalized through realization_score.
- Write-back is evolving into a two-path loop: lineage-local memory plus cross-lineage archive update.
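One plausible shape for realization_score is the fraction of planned mechanics (preserved plus added) that actually appear in the generated game. The document only names the metric, so this formula is an assumption:

```python
def realization_score(planned, realized):
    """Fraction of planned mechanics (preserve + add) actually realized.

    Hypothetical definition: the document names realization_score but
    does not specify its formula.
    """
    target = set(planned.get("preserve", [])) | set(planned.get("add", []))
    if not target:
        return 1.0  # nothing was planned, so the plan is trivially realized
    return len(target & set(realized)) / len(target)

planned = {"preserve": ["jump"], "add": ["gravity_flip", "combo_meter"]}
realized = ["jump", "gravity_flip"]
print(realization_score(planned, realized))  # 2 of 3 planned mechanics realized
```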
Missing for research-grade claims
- Human pairwise evaluation of creativity improvement.
- MemRL and reward ablation studies.
- Controlled evidence that mechanic-guided planning outperforms prompt-only generation.
Four lineages across four mechanic anchors.
Each tab is one lineage: Base → Round 2 → Round 3 → Round 4. These source games act as representative mechanic anchors rather than direct cloning targets.
Current gaps and caveats.
No human evaluation yet
LLM self-evaluation still inflates scores, and no controlled pairwise study has yet been run to validate creative improvement from a human perspective.
No memory ablation yet
Memory is implemented but its contribution has not been isolated. It is still unclear how much improvement comes from memory rather than base model capability.
Mechanic extraction remains partial
The archive already uses LLM-based extraction, but live generation still relies partly on regex-based mechanic_delta. Full end-to-end integration remains the next concrete engineering task.
The key gap has shifted
The central question is no longer whether the system can generate games, but whether mechanic-aware planning can reliably produce more structurally creative outcomes than prompt-only generation.
Why reward decreases in later iterations.
Round 1 rewards are consistently 1.00. By rounds 3-4, reward drops to 0.55-0.80 in three of four lineages despite stable LLM scores of C=8/F=8. This reflects a structural tension in the reward formula rather than simple quality degradation.
structural_change decay
structural_change rewards added mechanics and explicit structural delta. In preserve-and-deepen rounds, both naturally approach zero, so the formula cannot distinguish deliberate integration from no change.
novelty decay
relative_mechanic_novelty is measured against the archive. As later versions are written back, the lineage begins to compete with its own prior states, driving novelty downward.
cosmetic penalty misfire
Visual-deepening rounds can trigger cosmetic-only penalties even when they are planned and mechanically meaningful, which misaligns the reward with the intended creative strategy.
baseline offset
The formula applies composite = raw * 2 - 0.3. A neutral iteration with no penalties still scores negatively, making later refinement rounds structurally harder to score well.
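The baseline offset is easy to see numerically. Using the stated formula composite = raw * 2 - 0.3, a neutral iteration with raw = 0 scores -0.3, and raw must exceed 0.15 just to break even:

```python
def composite(raw):
    # Reward formula as stated in the document.
    return raw * 2 - 0.3

print(composite(0.0))   # -0.3: a neutral, penalty-free iteration still scores negatively
print(composite(0.15))  # 0.0: the break-even point
print(composite(0.65))  # 1.0: consistent with round-1 rewards of 1.00
```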
The current reward is better at measuring per-round novelty than lineage deepening. A later-round game that tightly integrates several prior mechanics can therefore score lower than an earlier-round game that introduces one shallow new mechanic. This suggests two next steps: a cumulative mechanic coupling score and a lineage-relative novelty baseline.
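The proposed lineage-relative novelty baseline could work by measuring novelty against the archive with the lineage's own prior states excluded, so a lineage stops competing with itself. The sketch below uses a Jaccard-style nearest-neighbour distance as the novelty metric; that metric and all names are assumptions, since the document only names relative_mechanic_novelty.

```python
def mechanic_novelty(candidate, reference):
    # Novelty = distance to the nearest reference mechanic set,
    # using Jaccard distance. Hypothetical metric: the document does
    # not specify how relative_mechanic_novelty is computed.
    def distance(a, b):
        a, b = set(a), set(b)
        return 1 - len(a & b) / len(a | b) if (a | b) else 0.0
    return min((distance(candidate, r) for r in reference), default=1.0)

archive = [["jump", "shoot"], ["jump", "gravity_flip"]]
lineage = [["jump", "gravity_flip"]]  # this lineage's own written-back state

candidate = ["jump", "gravity_flip", "combo"]

# Global novelty: the lineage's own write-backs drag the score down.
global_novelty = mechanic_novelty(candidate, archive)

# Lineage-relative novelty: exclude the lineage's own prior states.
external = [r for r in archive if r not in lineage]
lineage_relative = mechanic_novelty(candidate, external)

print(global_novelty, lineage_relative)  # lineage-relative score is higher
```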
Next priorities.
Consolidate CurrentMechanicSet schema across generation paths, connect planned mechanics to evaluation via realization_score, integrate LLM-based mechanic extraction into the live pipeline, and run human pairwise evaluation to test whether mechanic-aware planning produces measurably more creative outcomes than prompt-only baselines.