WorldGym: World Model as An Environment for Policy Evaluation

Julian Quevedo¹, Ansh Kumar Sharma², Yixiang Sun², Varad Suryavanshi², Percy Liang¹, Sherry Yang¹,²,³

¹Stanford University, ²NYU, ³Google DeepMind

[Demo panels: Gripper Control, Sweep X-Axis, Sweep Y-Axis, Sweep Z-Axis]

Abstract

Evaluating robot control policies is difficult: real-world testing is costly, and handcrafted simulators require manual effort to improve in realism and generality. We propose a world-model-based policy evaluation environment (WorldGym), an autoregressive, action-conditioned video generation model that serves as a proxy for real-world environments. Policies are evaluated via Monte Carlo rollouts in the world model, with a vision-language model providing rewards. We evaluate a set of VLA-based real-robot policies in the world model using only initial frames from real robots, and show that policy success rates within the world model correlate strongly with real-world success rates. Moreover, we show that WorldGym preserves relative policy rankings across different policy versions, sizes, and training checkpoints. Because it requires only a single start frame as input, the world model further enables efficient evaluation of robot policies' generalization ability on novel tasks and environments. We find that modern VLA-based robot policies still struggle to distinguish object shapes and can become distracted by adversarial facades of objects. While generating highly realistic object interaction remains challenging, WorldGym faithfully emulates robot motions and offers a practical starting point for safe and reproducible policy evaluation before deployment.

Overview of WorldGym

Overview of WorldGym. Given an initial frame and an action sequence predicted by a policy, WorldGym uses a world model to interactively predict future frames, serving as a generative simulator. WorldGym then passes the generated rollout to a VLM which provides rewards. WorldGym can easily be used to test policies on OOD tasks and environments by changing the input language instruction or directly modifying the initial image.
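The evaluation loop described above can be sketched roughly as follows. This is a minimal toy illustration, not the WorldGym implementation: the names `rollout`, `estimate_success_rate`, `toy_policy`, `toy_world_model`, and `toy_reward` are all hypothetical, frames are reduced to integers, and the VLM reward is replaced by a simple threshold check.

```python
from typing import Callable, List, Sequence

# Toy stand-ins (assumptions): a "frame" is an integer gripper position and an
# "action" is an integer displacement. Real frames would be images.
Frame = int
Action = int
Policy = Callable[[Frame, str], Action]
WorldModel = Callable[[Sequence[Frame], Action], Frame]
RewardFn = Callable[[Sequence[Frame], str], float]


def rollout(initial_frame: Frame, instruction: str, policy: Policy,
            world_model: WorldModel, horizon: int) -> List[Frame]:
    """Generate frames autoregressively: each new frame is conditioned on
    the frames generated so far and on the policy's latest action."""
    frames = [initial_frame]
    for _ in range(horizon):
        action = policy(frames[-1], instruction)
        frames.append(world_model(frames, action))
    return frames


def estimate_success_rate(initial_frames: Sequence[Frame], instruction: str,
                          policy: Policy, world_model: WorldModel,
                          reward_fn: RewardFn, horizon: int = 10,
                          rollouts_per_frame: int = 1) -> float:
    """Monte Carlo estimate: average the reward over all rollouts."""
    if not initial_frames:
        raise ValueError("need at least one initial frame")
    rewards = []
    for frame in initial_frames:
        for _ in range(rollouts_per_frame):
            video = rollout(frame, instruction, policy, world_model, horizon)
            rewards.append(float(reward_fn(video, instruction)))
    return sum(rewards) / len(rewards)


# Hypothetical toy components, used only to make the sketch runnable.
def toy_world_model(frames: Sequence[Frame], action: Action) -> Frame:
    return frames[-1] + action  # next position = last position + displacement


def toy_policy(frame: Frame, instruction: str) -> Action:
    return 1 if frame < 5 else 0  # step toward position 5, then stop


def toy_reward(frames: Sequence[Frame], instruction: str) -> float:
    return 1.0 if frames[-1] >= 5 else 0.0  # stand-in for a VLM judgment


if __name__ == "__main__":
    rate = estimate_success_rate([0, 2, 4], "reach position 5",
                                 toy_policy, toy_world_model, toy_reward)
    print(f"estimated success rate: {rate:.2f}")
```

In this framing, testing an OOD task amounts to passing a different `instruction` string or a modified initial frame, while the loop itself stays unchanged.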

Correlation between Real-World and Simulated Policy Performance

Per-Task Success Rates. Each point represents a task, with different policies being represented by different shaped markers. There is a strong correlation (r=0.78) between policy performance in our world model (y-axis) and within the real world (x-axis).
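For readers who want to run the same kind of check on their own results, a Pearson correlation over per-task success rates could be computed along these lines. The function name `pearson_r` and the sample numbers are illustrative assumptions; they are not the data behind the reported r=0.78.

```python
import math
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Made-up per-task success rates, one entry per (task, policy) pair.
    real_world = [0.0, 0.2, 0.5, 0.7, 1.0]
    world_model = [0.1, 0.1, 0.6, 0.6, 0.9]
    print(f"r = {pearson_r(real_world, world_model):.2f}")
```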

Mean Success Rates. Robot policies' mean success rates in the world model differ by an average of only 3.3% from the real world, near the standard error range for each policy. Relative performance rankings between RT-1-X, Octo, and OpenVLA are also preserved.
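The two quantities in this caption, the average gap between simulated and real success rates and whether the policy ranking is preserved, can be sketched as below. This assumes the gap is measured in absolute success-rate points, which the caption does not specify; the helper names and the example numbers are hypothetical.

```python
from typing import Dict, List


def mean_abs_gap(real: Dict[str, float], sim: Dict[str, float]) -> float:
    """Average absolute difference between real and simulated success rates."""
    policies = sorted(real)
    if not policies or sorted(sim) != policies:
        raise ValueError("both dicts must cover the same non-empty policy set")
    return sum(abs(real[p] - sim[p]) for p in policies) / len(policies)


def ranking(scores: Dict[str, float]) -> List[str]:
    """Policy names ordered from best to worst (ties keep insertion order)."""
    return sorted(scores, key=lambda p: scores[p], reverse=True)


def same_ranking(real: Dict[str, float], sim: Dict[str, float]) -> bool:
    """True if both evaluations order the policies identically."""
    return ranking(real) == ranking(sim)


if __name__ == "__main__":
    # Made-up numbers for illustration only, not the reported results.
    real = {"policy_a": 0.20, "policy_b": 0.35, "policy_c": 0.60}
    sim = {"policy_a": 0.25, "policy_b": 0.30, "policy_c": 0.65}
    print(f"mean gap: {mean_abs_gap(real, sim):.3f}")
    print(f"ranking preserved: {same_ranking(real, sim)}")
```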

Policy Rollouts on Bridge and Google Robot

Qualitative policy rollouts on Bridge and Google Robot for RT-1-X, Octo, and OpenVLA. OpenVLA rollouts often lead to more visual successes than the other two policies across environments.

Policy Ranking within a World Model

Success Rates of different model versions in WorldGym. We evaluate different generations of Octo and OpenVLA in the world model, showing that WorldGym assigns higher scores to larger and more recent versions.

Success Rate within WorldGym throughout training. We train a video-based policy and a diffusion policy from scratch and evaluate each within our world model as it trains. The mean task success rate within the world model increases with additional training steps.

Out-of-Distribution Inputs

OOD: Color Classification. We add red and blue pieces of paper to a table, and ask the policies to "pick red" or "pick blue" (OOD image and language). OpenVLA excels, picking the correct colored paper in all trials, whereas all other policies score near chance.

OOD: Unseen object. We use Nano Banana to add an orange to the world model's initial frame. When both the orange and the carrot are present, (a-b) OpenVLA grabs whichever is closer. After (c) editing the carrot's color to red, however, the orange is correctly picked up.

OOD: Failure modes. Left: We add a laptop displaying an image of a carrot to the scene. In 15% of trials, OpenVLA grabs the laptop instead of the real carrot. Right: We test the ability to distinguish between squares and circles, celebrity faces, and cats and dogs, with all policies scoring near chance.

OOD Language Instructions

OOD Language Instructions. We pick a set of tasks from the OpenVLA Bridge evaluation suite and modify the language instruction, e.g. changing the target object and/or its goal destination.

Effect of OOD Distractors

OOD Distraction Examples. We use Nano Banana to add distractions to every image of the OpenVLA Bridge task suite.

Effect of OOD Distractors. When distractor objects are added to the Bridge evaluation suite, RT-1-X drops in performance by 51%, Octo by 83%, and OpenVLA by 41.5%, making OpenVLA the most robust to distractors.
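One way to compute a performance drop of this kind is as a relative change in success rate, sketched below. The caption does not say whether the reported percentages are relative or absolute, so the relative form here is an assumption, and `relative_drop` and `most_robust` are hypothetical helper names. The drop values in the usage example are the ones stated in the caption.

```python
from typing import Dict


def relative_drop(baseline: float, perturbed: float) -> float:
    """Fraction of the baseline success rate lost under perturbation."""
    if baseline <= 0:
        raise ValueError("baseline success rate must be positive")
    return (baseline - perturbed) / baseline


def most_robust(drops: Dict[str, float]) -> str:
    """Name of the policy with the smallest performance drop."""
    return min(drops, key=lambda name: drops[name])


if __name__ == "__main__":
    # Drops as stated in the caption, expressed as fractions.
    drops = {"RT-1-X": 0.51, "Octo": 0.83, "OpenVLA": 0.415}
    print(most_robust(drops))
```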

Rollout Examples

Comparing (ground truth, generated) rollouts for unseen action trajectories across different robots, using a single interactive diffusion model.

In-World-Model Rollouts on Bridge Tasks

Task ▼ / Policy ▶	OPENVLA-7B	OCTO BASE-1.5	RT-1-X
Put eggplant into Pot
Put corn on plate
Take grapes out of Pot

In-World-Model Rollouts on RT-1 Tasks

Task ▼ / Policy ▶	OPENVLA-7B	OCTO BASE-1.5	RT-1-X
Pick Blue Chip Bag
Close the drawer

Out of Distribution Image Rollouts

Task ▼ / Policy ▶	OPENVLA-7B	OCTO BASE-1.5	RT-1-X
Pick up Carrot (With Computer on side)
Pick up Orange (Replace Carrot with Radish)
Pick Circle

Out of Distribution Language Instruction Rollouts

Task ▼ / Policy ▶	OPENVLA-7B	OCTO BASE-1.5	RT-1-X
Put Plate On Drying Rack
Move Pot With Grapes Into Drying Rack

Citation

If you find this work useful, please cite:

@misc{quevedo2025worldgymworldmodelenvironment,
      title={WorldGym: World Model as An Environment for Policy Evaluation},
      author={Julian Quevedo and Ansh Kumar Sharma and Yixiang Sun and Varad Suryavanshi and Percy Liang and Sherry Yang},
      year={2025},
      eprint={2506.00613},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2506.00613},
}