Technical Report · April 2026

CreativeGame: Toward Mechanic-Aware Iterative Creative Game Generation

CreativeGame Team

Hongnan Ma · Han Wang · Shenglin Wang · Tieyue Yin · Yiwei Shi · Yucong Huang

Yingtian Zou · Muning Wen · Mengyue Yang

University of Bristol · Shanghai Jiao Tong University · Shandong University · Nanjing University · Sreal AI

CreativeGame is a multi-agent pipeline for iterative HTML5 game generation. A Planner retrieves mechanic context, emits CurrentMechanicSet, and drives Skeleton → Feature → Visual → Refinement generation. Each version is evaluated against its plan, reflected upon, and written into lineage memory for subsequent iterations.

71 lineages in repository
774 mechanic archive entries
Plan → Trace: mechanic planning visible across all stages
88 saved nodes across lineages

From one-shot generation to mechanic-aware iterative improvement.

What the system does

CreativeGame decomposes generation into planning, code generation, testing, evaluation, reflection, and memory writing. The system is designed to address runtime failure, weak cross-generation memory, saturated LLM creativity scores, and the tendency to treat mechanics only as post-hoc descriptions.

Abstract

The current pipeline couples mechanic retrieval, CurrentMechanicSet, runtime validation, lineage memory, and proxy-based reward to optimize structural game change rather than surface novelty alone.

Five stages from intent to evaluated game.

01

Planner

Retrieves mechanic-library context and emits CurrentMechanicSet declaring what to preserve, add, remove, or recombine.

02

Code Generation

Skeleton → Feature → Visual → Refinement. Each sub-stage receives the planned mechanic set.

03

Tester

Static and optional browser validation. Failed runtime checks block reward.

04

Evaluator

Scores creativity and functionality (1-10), extracts realized mechanics, and computes reward against CurrentMechanicSet.

05

Reflector + Memory

Writes mechanic-level deltas to lineage memory and optionally updates the global archive.
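The five stages above pivot on the CurrentMechanicSet the Planner emits. A minimal sketch of what that object might look like, assuming a set-valued schema built from the preserve / add / remove / recombine verbs described in stage 01 (the field names and the planned() helper are illustrative, not the actual schema):

```python
from dataclasses import dataclass, field

@dataclass
class CurrentMechanicSet:
    """Planner output declaring intended mechanic changes for one iteration.

    Field names mirror the preserve / add / remove / recombine verbs from the
    Planner description; the real schema may differ.
    """
    preserve: set[str] = field(default_factory=set)   # mechanics kept from the parent version
    add: set[str] = field(default_factory=set)        # new mechanics to introduce
    remove: set[str] = field(default_factory=set)     # mechanics to drop
    recombine: list[tuple[str, str]] = field(default_factory=list)  # mechanic pairs to fuse

    def planned(self) -> set[str]:
        """Mechanics the generated game is expected to contain."""
        fused = {f"{a}+{b}" for a, b in self.recombine}
        return (self.preserve | self.add | fused) - self.remove

plan = CurrentMechanicSet(
    preserve={"gravity", "double-jump"},
    add={"wall-slide"},
    recombine=[("gravity", "wall-slide")],
)
print(sorted(plan.planned()))
```

Passing this object through all four code-generation sub-stages (Skeleton → Feature → Visual → Refinement) is what lets the Evaluator later compare planned() against the mechanics it extracts from the finished game.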

Before this version

  • Mechanic library acted mainly as retrieval support and novelty baseline.
  • Versions were generated primarily from prompt interpretation.
  • Evaluation compared versions, but not plan versus implementation.

What CurrentMechanicSet adds

  • Explicit separation between preserved and newly added mechanics.
  • Reward can be interpreted as planned structural realization rather than only post-hoc comparison.
  • Memory can store mechanic-level deltas instead of only version summaries.
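To make the last bullet concrete: a hypothetical shape for one mechanic-level lineage-memory entry, storing set-valued deltas rather than a free-text version summary. All keys and the diff_mechanics helper are assumptions for illustration, not the stored format:

```python
# Hypothetical shape of one lineage-memory record (keys are illustrative).
memory_entry = {
    "lineage_id": "lineage-042",
    "round": 3,
    "planned": {"preserve": ["gravity"], "add": ["wall-slide"], "remove": []},
    "realized": ["gravity", "wall-slide"],
    "delta": {"added": ["wall-slide"], "dropped": [], "kept": ["gravity"]},
    "reward": 0.72,
}

def diff_mechanics(prev: set[str], curr: set[str]) -> dict[str, list[str]]:
    """Set-difference view of how mechanics changed between two versions."""
    return {
        "added": sorted(curr - prev),
        "dropped": sorted(prev - curr),
        "kept": sorted(prev & curr),
    }

print(diff_mechanics({"gravity", "dash"}, {"gravity", "wall-slide"}))
```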

Key numbers from the current codebase.

71 stored lineages
9 multi-node lineages explored to depth 4
~4.6M total tokens across all saved nodes
91.7% Python code reduction vs ChatDev prototype

What is in code, in progress, and still open.

Implemented

In the current code

  • Planner reads mechanic-library context from a 774-entry archive.
  • CurrentMechanicSet is parsed and stored per iteration.
  • Planner output is forwarded to Evaluator and Reflector for plan-vs-realized comparison.
  • API responses expose mechanic trace and reward breakdown.
  • The web gallery visualizes mechanic planning and evaluation.
In Progress

Being strengthened

  • LLM extraction is being wired into the live pipeline to replace regex-based mechanic_delta tracking.
  • Planned-vs-realized comparison is being formalized through realization_score.
  • Write-back is evolving into a two-path loop: lineage-local memory plus cross-lineage archive update.
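One plausible formalization of realization_score, assuming it measures set overlap between the planned and realized mechanic sets; the report names the score but does not define its formula, so this is a sketch:

```python
def realization_score(planned: set[str], realized: set[str]) -> float:
    """Fraction of planned mechanics actually present in the generated game.

    Assumption: realization is plain set overlap. Partial credit for
    near-matches (e.g. renamed mechanics) is out of scope for this sketch.
    """
    if not planned:
        return 1.0  # nothing was planned, so nothing could fail to be realized
    return len(planned & realized) / len(planned)

# One mechanic realized out of two planned ones.
print(realization_score({"gravity", "wall-slide"}, {"gravity", "combo"}))  # prints 0.5
```

A score of 1.0 means the plan was fully realized; the Evaluator could feed this value into the reward alongside the creativity and functionality scores.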
Open

Missing for research-grade claims

  • Human pairwise evaluation of creativity improvement.
  • MemRL and reward ablation studies.
  • Controlled evidence that mechanic-guided planning outperforms prompt-only generation.

Four lineages across four mechanic anchors.

Each tab is one lineage: Base → Round 2 → Round 3 → Round 4. These source games act as representative mechanic anchors rather than direct cloning targets.

Current gaps and caveats.

No human evaluation yet

LLM self-evaluation still inflates scores, and no controlled pairwise study has yet been run to validate creative improvement from a human perspective.

No memory ablation yet

Memory is implemented but its contribution has not been isolated. It is still unclear how much improvement comes from memory rather than base model capability.

Mechanic extraction remains partial

The archive already uses LLM-based extraction, but live generation still relies partly on regex-based mechanic_delta. Full end-to-end integration remains the next concrete engineering task.
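To illustrate why regex-based extraction is the weak link: a pattern like the sketch below only catches mechanics phrased in the exact expected form and silently misses anything described differently, which is what LLM-based extraction is meant to fix. The pattern and helper are illustrative; the real mechanic_delta heuristic is not shown in this report:

```python
import re

# Illustrative regex over changelog-style text. Hyphenated lowercase mechanic
# names in the fixed phrasing "added/removed <name> mechanic" are matched;
# anything else ("introduced a grappling hook") is silently dropped.
MECHANIC_PATTERN = re.compile(r"\b(added|removed)\s+([a-z][a-z\-]+)\s+mechanic\b")

def regex_mechanic_delta(changelog: str) -> dict[str, list[str]]:
    delta: dict[str, list[str]] = {"added": [], "removed": []}
    for verb, name in MECHANIC_PATTERN.findall(changelog.lower()):
        delta[verb].append(name)
    return delta

log = "Added wall-slide mechanic; removed dash mechanic; tweaked colors."
print(regex_mechanic_delta(log))
```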

The key gap has shifted

The central question is no longer whether the system can generate games, but whether mechanic-aware planning can reliably produce more structurally creative outcomes than prompt-only generation.

Why reward decreases in later iterations.

Round 1 rewards are consistently 1.00. By rounds 3-4, reward drops to 0.55-0.80 in three of four lineages despite stable LLM scores of C=8/F=8. This reflects a structural tension in the reward formula rather than simple quality degradation.

structural_change decay

structural_change rewards added mechanics and explicit structural delta. In preserve-and-deepen rounds, both naturally approach zero, so the formula cannot distinguish deliberate integration from no change.

novelty decay

relative_mechanic_novelty is measured against the archive. As later versions are written back, the lineage begins to compete with its own prior states, driving novelty downward.

cosmetic penalty misfire

Visual-deepening rounds can trigger cosmetic-only penalties even when they were planned and mechanically meaningful, which misaligns the reward with the intended creative strategy.

baseline offset

The formula applies composite = raw * 2 - 0.3. A neutral iteration with no penalties still scores negatively, making later refinement rounds structurally harder to score well.

The current reward is better at measuring per-round novelty than lineage deepening. A later-round game that tightly integrates several prior mechanics can therefore score lower than an earlier-round game that introduces one shallow new mechanic. This suggests two next steps: a cumulative mechanic coupling score and a lineage-relative novelty baseline.
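The decay dynamics above can be reproduced with a toy version of the reward. The affine map composite = raw * 2 - 0.3 is taken from the text; the equal weighting of structural change and novelty inside raw is an assumption for illustration:

```python
def composite_reward(structural_change: float,
                     relative_novelty: float,
                     cosmetic_penalty: float) -> float:
    """Toy reward with the report's affine map raw * 2 - 0.3.

    The 50/50 weighting of the raw components is assumed, not documented.
    """
    raw = 0.5 * structural_change + 0.5 * relative_novelty - cosmetic_penalty
    return raw * 2 - 0.3

# Round 1: new mechanics, high novelty against the archive, no penalty.
print(composite_reward(0.9, 0.8, 0.0))

# A preserve-and-deepen round: structural delta near zero, novelty measured
# against an archive that already contains this lineage's prior states.
print(composite_reward(0.0, 0.0, 0.0))  # prints -0.3: a neutral round scores negatively
```

Even this toy version shows the baseline-offset problem: a round that deliberately integrates existing mechanics without adding new ones lands at -0.3, below any round that adds a single shallow mechanic.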

cumulative coupling score · lineage-relative novelty · cosmetic penalty calibration · LLM evaluation inflation

Next priorities.

Consolidate CurrentMechanicSet schema across generation paths, connect planned mechanics to evaluation via realization_score, integrate LLM-based mechanic extraction into the live pipeline, and run human pairwise evaluation to test whether mechanic-aware planning produces measurably more creative outcomes than prompt-only baselines.

human pairwise evaluation · realization_score · LLM mechanic extraction · memory ablation · reward redesign