OneLife synthesizes world laws from a single unguided episode (no environment-specific rewards or goals) in a hostile, stochastic environment.
OneLife models the world as a mixture of laws written in code, each with a precondition-effect structure governing one aspect of the world, and infers the mixture parameters that best explain the observed dynamics of the world.
The resulting world model (WM) provides a probability distribution over attributes of an object-oriented world state, such as the position of a particular zombie.
OneLife outperforms a strong baseline on 16 of the 23 core game mechanics tested, measured by the Mean Reciprocal Rank (MRR) of the true next state under the world model's likelihood (see Evaluation).
Abstract
Symbolic world modeling is the task of inferring and representing the transitional dynamics of an environment as an executable program. Previous research on symbolic world modeling has focused on largely deterministic environments with abundant interaction data, simple mechanics, and human-provided guidance. We address the more realistic and challenging problem of learning a symbolic world model in a complex, stochastic environment with severe constraints: a limited interaction budget where the agent has only "one life" to explore a hostile environment and no external guidance in the form of human-provided, environment-specific rewards or goals. We introduce OneLife, a framework that models world dynamics through conditionally activated programmatic laws within a probabilistic programming framework. Each law operates through a precondition-effect structure, allowing it to remain silent on irrelevant aspects of the world state and predict only the attributes it directly governs. This creates a dynamic computation graph that routes both inference and optimization only through relevant laws for each transition, avoiding the scaling challenges that arise when all laws must contribute to predictions about a complex, hierarchical state space, and enabling accurate learning of stochastic dynamics even when most rules are inactive at any given moment. To evaluate our approach under these demanding constraints, we introduce a new evaluation protocol that measures (a) state ranking, the ability to distinguish plausible future states from implausible ones, and (b) state fidelity, the ability to generate future states that closely resemble reality. We develop and evaluate our framework on Crafter-OO, our reimplementation of the popular Crafter environment that exposes a structured, object-oriented symbolic state and a pure transition function that operates on that state alone. OneLife can successfully learn key environment dynamics from minimal, unguided interaction, outperforming a strong baseline on 16 out of 23 scenarios tested. We also demonstrate the world model's utility for planning, where rollouts simulated within the world model successfully identify superior strategies in goal-oriented tasks. Our work establishes a foundation for autonomously constructing programmatic world models of unknown, complex environments.
How OneLife Works
Illustration of the inference process. The active laws for each observable (defined by \(\mathcal{I}_k(s_t, a)\)) determine the structure of the computation graph, i.e., which laws and their corresponding parameters \(\theta_i\) are related to which observables. This structure in turn informs the parameter updates.
Shown here is a dataset with a single transition instance, in which the player (P) moves right; at the same time, a zombie (Z) independently moves left.
This implicates two laws, PlayerMovementLaw and ZombieMovementLaw, while not implicating the InventoryUpdateLaw.
As a result, the loss computation is only a function of \(\theta_1\) and \(\theta_2\). Note we use \(Z\) here to denote the normalizing factor.
Mixture of programmatic laws (precondition–effect) + observables: We represent dynamics as a mixture of modular laws written in code. Each law activates when its precondition holds and only predicts a subset of state observables (e.g., player.position), creating a sparse, modular interface that scales to complex, object-oriented states. (A code sketch of this law interface and the routed likelihood follows these points.)
Unguided exploration with atomic law synthesis: A language-model–driven exploration policy collects a single episode without rewards or goals. A general synthesizer then proposes simple, atomic laws that explain observed transitions (decomposing complex events into minimal attribute changes) to form a broad hypothesis set.
Dynamic routing inference + forward simulation: For each transition, gradients and credit are dynamically routed only through active laws for the relevant observables; we fit the law weights via a weighted product-of-experts objective (optimized with L-BFGS). The learned model supports likelihood scoring and generative rollouts by sampling per-observable predictions and reconstructing the next symbolic state.
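Below is a minimal Python sketch of the law interface and the routed, weighted product-of-experts likelihood described in the points above. It is illustrative only: the class and function names, the flattened-state convention, and the numerical details are assumptions, not the actual OneLife code.

import math
from typing import Any, Callable, Dict, List

State = Dict[str, Any]   # flattened object-oriented state, e.g. {"player.position": (3, 4), ...}
Action = str

class Law:
    """A programmatic law with a precondition-effect structure.

    A law stays silent unless its precondition holds, and even then it only
    predicts the observables it governs (e.g. "player.position")."""
    def __init__(self, name: str, observables: List[str],
                 precondition: Callable[[State, Action], bool],
                 effect: Callable[[State, Action], Dict[str, Dict[Any, float]]]):
        self.name = name
        self.observables = observables            # observables this law governs
        self.precondition = precondition          # activation test, cf. I_k(s_t, a)
        self.effect = effect                      # (s_t, a) -> {observable: {value: prob}}

def transition_log_likelihood(laws: List[Law], weights: Dict[str, float],
                              s_t: State, a: Action, s_next: State) -> float:
    """Score a transition with a weighted product of experts, routed per observable.

    Only laws whose precondition fires and that govern observable k contribute
    to the score for k, so credit flows through a sparse subgraph of the laws."""
    total = 0.0
    for k, true_value in s_next.items():          # values assumed hashable (e.g. tuples)
        active = [law for law in laws
                  if k in law.observables and law.precondition(s_t, a)]
        if not active:                            # no active law governs k: stay silent
            continue
        # Log of the (unnormalized) weighted product of experts per candidate value.
        log_scores: Dict[Any, float] = {}
        for law in active:
            for value, p in law.effect(s_t, a)[k].items():
                log_scores[value] = log_scores.get(value, 0.0) \
                                    + weights[law.name] * math.log(max(p, 1e-9))
        log_Z = math.log(sum(math.exp(v) for v in log_scores.values()))  # normalizer Z
        total += log_scores.get(true_value, math.log(1e-9)) - log_Z
    return total

Fitting the per-law weights then amounts to maximizing this routed log-likelihood over the collected transitions, for example with L-BFGS as noted above.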
Crafter-OO: A Testbed for Symbolic World Modeling
A common design assumption in previous work on symbolic world modeling is that an object-oriented world state is available as input to the symbolic world model under construction. In practice, such a state is only readily accessible for simple environments such as Minigrid or BabyAI. Programmatic access to the state of more complex environments, such as Atari games, is only possible thanks to standalone development efforts such as OCAtari, which exposes the internal object-oriented state of those environments to researchers. The lack of an environment that exposes an object-oriented state while being more complex than gridworlds and more mechanically diverse than Atari games has so far prevented the development and evaluation of symbolic world modeling approaches in richer settings.
To close this gap, we implement Crafter-OO, which emulates the Crafter environment by operating purely on an explicit, object-oriented game state. Additionally, we contribute utilities for programmatically modifying the game state to create evaluation scenarios.
Interactive State Transition Example
Below is a simple example showing how an action transforms the object-oriented game state in Crafter-OO: the current state is shown alongside the diff produced when the player takes an action.
(Interactive viewer: a Current State panel alongside a Changes (Diff) panel.)
The state representation captures the complete game world including player attributes (position, health, inventory), entities (cows, zombies, trees), and environmental properties. When an action is taken, multiple aspects of the state can change simultaneously: the player's action and inventory update, entities may move (like the cow and zombie), and environmental objects (like trees) may be removed. The world model must learn to predict these cascading changes from observing transitions.
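To make this concrete, here is a small, hypothetical Crafter-OO-style state before and after the player acts in front of a tree, together with the resulting diff expressed as atomic JSON-Patch-style operations (the unit counted by the edit distance metrics later on). The schema, field names, and values are illustrative, not the environment's actual format.

# Hypothetical Crafter-OO-style state before the player uses the "do" action
# while facing a tree (schema and values are illustrative).
state_t = {
    "player": {"position": [4, 3], "facing": "up", "health": 9,
               "inventory": {"wood": 0, "sapling": 1}},
    "entities": [
        {"type": "tree",   "id": 7,  "position": [4, 2]},
        {"type": "zombie", "id": 12, "position": [9, 5]},
    ],
    "time": 131,
}

# State after the action: wood is collected, the tree disappears, and the
# zombie moves independently of the player.
state_t1 = {
    "player": {"position": [4, 3], "facing": "up", "health": 9,
               "inventory": {"wood": 1, "sapling": 1}},
    "entities": [
        {"type": "zombie", "id": 12, "position": [8, 5]},
    ],
    "time": 132,
}

# The same transition expressed as JSON-Patch-style operations, applied in order:
diff = [
    {"op": "replace", "path": "/player/inventory/wood", "value": 1},
    {"op": "remove",  "path": "/entities/0"},                             # tree removed
    {"op": "replace", "path": "/entities/0/position", "value": [8, 5]},   # zombie moved
    {"op": "replace", "path": "/time", "value": 132},
]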
Evaluation Protocols and Metrics
Two evaluation metric categories. We assess world models through state ranking (ability to distinguish plausible future states from implausible ones using programmatic mutators) and state fidelity (ability to generate states that closely resemble reality using edit distance metrics).
Evaluating world models for stochastic environments requires measuring two key capabilities:
State Ranking
These metrics assess whether the model ranks the true next state higher than distractor states. We create distractors using mutators: programmatic functions that apply semantically meaningful, rule-breaking changes (e.g., allowing crafting without prerequisites). A sketch of a mutator and of the ranking computation appears after the metric definitions below.
Rank @ 1 (R@1): A binary metric measuring whether the model assigns the highest probability to the true next state.
Mean Reciprocal Rank (MRR): Averages the reciprocal rank of the correct state: \(\text{MRR} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{r_i}\), where \(r_i\) is the rank of the true next state for transition \(i\) and \(N\) is the number of evaluated transitions.
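To make the ranking protocol concrete, here is a minimal Python sketch of one mutator and of the per-transition ranking computation. The specific mutator shown (granting a crafted item without prerequisites) and the function names are illustrative assumptions, not the paper's implementation.

import copy
from typing import Any, Callable, Dict, List

State = Dict[str, Any]

def craft_without_prereqs_mutator(s_next: State) -> State:
    """Rule-breaking distractor: grant a crafted item even though the
    prerequisites (enough wood, a nearby table) were never satisfied."""
    distractor = copy.deepcopy(s_next)
    inventory = distractor["player"]["inventory"]
    inventory["wood_pickaxe"] = inventory.get("wood_pickaxe", 0) + 1
    return distractor

def ranking_metrics(score_next_state: Callable[[State], float],
                    true_next: State, distractors: List[State]) -> Dict[str, float]:
    """Rank the true next state against mutated distractors under the model."""
    scores = [score_next_state(s) for s in [true_next] + distractors]
    # Rank of the true state (1 = best); ties are broken pessimistically.
    rank = 1 + sum(s >= scores[0] for s in scores[1:])
    return {"R@1": float(rank == 1), "RR": 1.0 / rank}

# MRR over an evaluation set is the mean of the per-transition "RR" values.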
State Fidelity
These metrics measure the error between predicted and ground truth states:
Raw Edit Distance: Number of atomic JSON Patch operations needed to transform the predicted state into the ground-truth state.
Normalized Edit Distance: Raw edit distance divided by the total number of elements in the state.
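Here is a minimal sketch of how these fidelity metrics can be computed. It approximates a JSON Patch diff by counting mismatched leaf values between flattened states; the actual implementation may differ.

from typing import Any, Dict, Tuple

def flatten(state: Any, prefix: str = "") -> Dict[str, Any]:
    """Flatten a nested state into leaf paths, e.g. {"/player/health": 9, ...}."""
    if isinstance(state, dict):
        leaves = {}
        for key, value in state.items():
            leaves.update(flatten(value, f"{prefix}/{key}"))
        return leaves
    if isinstance(state, list):
        leaves = {}
        for index, value in enumerate(state):
            leaves.update(flatten(value, f"{prefix}/{index}"))
        return leaves
    return {prefix: state}

def edit_distances(pred_state: Dict[str, Any], true_state: Dict[str, Any]) -> Tuple[int, float]:
    """Return (raw edit distance, normalized edit distance) between two states."""
    pred, true = flatten(pred_state), flatten(true_state)
    # One atomic operation per leaf that must be added, removed, or replaced.
    raw = sum(1 for path in set(pred) | set(true) if pred.get(path) != true.get(path))
    return raw, raw / max(len(true), 1)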
Evaluation Framework Implementation on Crafter-OO
Evaluating a world model on random rollouts may not provide sufficient coverage of rare or important events in an environment. To ensure our evaluation is comprehensive, we create evaluation trajectories from a suite of scenarios. Each scenario runs a short, scripted policy from an initial state designed to reliably exercise a specific game mechanic or achieve a particular goal.
Our scenarios cover every achievement in the achievement tree of Crafter-OO/Crafter, ranging from basic actions like collecting wood to complex, multi-step tasks like crafting an iron sword. We generate distractors for each transition in the evaluation dataset using a bank of 8 mutators, each of which produces a subtle but illegal transformation of the game state in response to an action, such as causing the wrong item to be produced by a crafting action, or allowing an item to be produced without the correct requirements.
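For concreteness, one way such an evaluation set could be assembled is sketched below. The attribute and function names (scenario.initial_state, scenario.policy, env.step, and the mutator bank) are hypothetical stand-ins, not the paper's actual API.

def build_eval_transitions(scenarios, env, mutators):
    """Collect (state, action, next_state, distractors) tuples by running each
    scenario's short scripted policy from its designed initial state."""
    dataset = []
    for scenario in scenarios:
        state = scenario.initial_state
        for _ in range(scenario.horizon):
            action = scenario.policy(state)             # scripted, mechanic-targeting policy
            next_state = env.step(state, action)        # ground-truth transition
            distractors = [mutate(next_state) for mutate in mutators]   # bank of 8 mutators
            dataset.append((state, action, next_state, distractors))
            state = next_state
    return dataset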
Per-scenario state ranking performance of OneLife (Ours) versus PoE-World, measured by Mean Reciprocal Rank (MRR ↑). Scenarios are grouped by the core game mechanic they test. Horizontal lines show the average MRR across all scenarios in a group for OneLife and PoE-World. OneLife demonstrates a more accurate understanding of the environment's laws, achieving a higher average MRR and outperforming the baseline on the majority of individual scenarios.
Experimental Setup and Results
We conduct a series of experiments to evaluate OneLife. First, we quantitatively assess the model's predictive accuracy using our state ranking and fidelity metrics across a comprehensive suite of scenarios. Second, we test the model's ability to support planning in imagination by performing simulated rollouts of different policies.
Baseline Models
Random World Model: A model that assigns uniform probability to all candidate states. Its performance is equivalent to random guessing and serves as a sanity check.
PoE-World: A state-of-the-art symbolic world model that scaled symbolic world modeling to domains like Atari. Both PoE-World and OneLife represent the transition function as a weighted product of programs. We reimplement this baseline with our exploration policy and law synthesizer.
Results
Performance comparison of world modeling methods on the Crafter-OO environment, averaged over ten trials. We evaluate models on two criteria: state fidelity and state ranking. All methods use the OneLife exploration policy and law synthesizer but differ in their parameter inference method. OneLife shows significant improvements over the PoE-World inference algorithm and over a OneLife variant without parameter inference. The random baseline is shaded in gray.
Planning with the Learned World Model
To assess the practical utility of the learned world model, we evaluate its effectiveness in a planning context. Our protocol tests the model's ability to distinguish between effective and ineffective plans through forward simulation. For a set of scenarios, we define a reward function and two distinct, programmatic policies (plans) to achieve a goal within the scenario. Each plan is represented as a hierarchical policy (in code) that composes subroutines for navigation, interaction, and crafting.
We execute rollouts of both plans within our learned world model and, separately, within the ground-truth environment. The measure of success is whether the world model's simulation yields the same preference ranking over the two plans as the true environment, based on the final reward. This assesses if the model has captured the causal dynamics necessary for goal-directed reasoning.
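A minimal sketch of this protocol is shown below, assuming the learned model exposes the same step interface as the environment and that each scenario bundles an initial state, two programmatic plans, a reward function, and a horizon. All names here are illustrative, not the paper's implementation.

def rollout_return(step_fn, initial_state, plan, reward, horizon):
    """Execute a programmatic plan for a fixed horizon under a given transition
    function (learned model or real environment) and return the final reward."""
    state = initial_state
    for _ in range(horizon):
        state = step_fn(state, plan(state))
    return reward(state)

def model_agrees_with_env(world_model, env, scenario):
    """Check whether the learned world model prefers the same plan as the
    ground-truth environment, based on final rewards of simulated rollouts."""
    def preferred_plan(step_fn):
        returns = [rollout_return(step_fn, scenario.initial_state, plan,
                                  scenario.reward, scenario.horizon)
                   for plan in (scenario.plan_a, scenario.plan_b)]
        return returns.index(max(returns))              # 0 or 1: index of preferred plan
    return preferred_plan(world_model.step) == preferred_plan(env.step)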
Setup
We design three scenarios that test distinct aspects of the environment's mechanics: combat, tool-use and resource consumption. In the Zombie Fighter scenario, an agent with low health must defeat two zombies. The superior plan involves a multi-step process: pathfinding to locate and harvest trees, crafting a table and then a sword, and only then engaging in combat. The alternative is to fight immediately.
The Stone Miner scenario tests the model's understanding of resource collection. The effective plan is to first harvest wood, craft a pickaxe, pathfind to a stone, and then mine. Attempting to mine stone directly is ineffective. Finally, the Sword Maker scenario evaluates knowledge of resource consumption. The goal is to craft multiple swords. The efficient plan places a single crafting table and reuses it, whereas the inefficient plan wastes wood by placing a new table for each sword.
On average, a plan requires approximately 18 steps to execute, with the longest plans taking over 30 steps. Thus, simulating the results of these plans tests the ability of the world model to accurately model the consequences of long sequences of actions upon the world.
Example of plan execution within OneLife's world model for the "Stone Miner" scenario. The effective plan carries out a multi-step sequence of gathering wood, crafting a wooden pickaxe, and then attempting to mine. The ineffective plan attempts to mine the stone directly. The world model learned by OneLife correctly simulates the causal game mechanics that cause the effective plan to succeed and the ineffective plan to fail.
Results
Across all three scenarios, our learned world model correctly predicts the more effective plan. The ranking of plans generated by simulating rollouts in OneLife matches the ranking from the ground-truth environment. For instance, in the Zombie Fighter scenario, the model correctly simulates that the multi-step plan of crafting a sword leads to higher Damage Per Second, identifying it as the superior strategy. This demonstrates that OneLife captures a sufficiently accurate causal model of the world to support basic, goal-oriented planning.
Planning results across three scenarios. OneLife correctly identifies the superior plan in each scenario by simulating rollouts and comparing final rewards.
Citation
@article{khan2025onelife,
title={One Life to Learn: Inferring Symbolic World Models for Stochastic Environments from Unguided Exploration},
author={Khan, Zaid and Prasad, Archiki and Stengel-Eskin, Elias and Cho, Jaemin and Bansal, Mohit},
journal={arXiv preprint arXiv:2510.12088},
year={2025}
}