Absolute Zero: Teaching AI to Reason Without Human Data

TL;DR

Researchers at Tsinghua University and collaborating institutions have developed “Absolute Zero Reasoner” (AZR), a groundbreaking system that enables large language models to improve their reasoning abilities without any human-curated data. The model achieves state-of-the-art performance on coding and mathematical reasoning benchmarks by generating its own training tasks and learning from solving them, potentially transforming how we think about AI learning and scaling.

Introduction

How do we train AI systems to reason better? The conventional approach has been to collect high-quality human-curated examples, have humans label and evaluate reasoning processes, or, at minimum, provide gold-standard question-answer pairs for training. But what if an AI could learn to reason entirely on its own, without any human examples?

This is the ambitious question addressed by researchers at Tsinghua University in their paper “Absolute Zero: Reinforced Self-play Reasoning with Zero Data” [1]. The research team, led by Andrew Zhao and including Yiran Wu, Yang Yue, and others, introduces a novel paradigm that could transform how we approach AI training for complex reasoning tasks.

The Problem with Traditional Training Methods

Recent advancements in AI reasoning have leveraged Reinforcement Learning with Verifiable Rewards (RLVR), which enhances language models by using outcome-based rewards rather than relying on process-based supervision. While this has shown promising results, even these “zero-setting” approaches still depend on manually curated collections of questions and answers.

This dependence creates two significant challenges:

  1. Scalability concerns: High-quality, human-produced examples are scarce and expensive to create, limiting long-term scaling potential.

  2. Learning ceiling: In a hypothetical future where AI surpasses human intelligence, tasks provided by humans may not offer sufficient learning potential for more advanced systems.

The Absolute Zero Paradigm

The researchers propose a radically different approach called “Absolute Zero,” in which a single model learns to both propose tasks that maximize its own learning and improve by solving them—without using any external data whatsoever.

The Absolute Zero paradigm functions through a continuous cycle:

  1. Task proposal: The model creates potential learning tasks
  2. Task validation: A code executor environment verifies task integrity
  3. Task solution: The model attempts to solve its self-proposed tasks
  4. Solution verification: The code executor provides verifiable feedback
  5. Reinforcement learning: The model updates based on performance

Similar to how AlphaZero learned chess through self-play, AZR learns reasoning entirely through self-interaction, using a code executor as an open-ended yet grounded environment.
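
To make the cycle concrete, here is a minimal Python sketch of one self-play step. Every name in it (`model.propose`, `model.solve`, `model.update`, the buffer API) is a hypothetical stand-in for illustration, not the authors’ actual interface:

```python
def run_program(program: str, inp):
    """Execute a candidate program (assumed to define a function `f`)
    on an input and return its output."""
    scope = {}
    exec(program, scope)  # assumes a sandboxed executor; see validation below
    return scope["f"](inp)


def self_play_step(model, buffer):
    # 1. Task proposal: draft a new task, conditioned on a few past
    #    tasks sampled from the buffer to encourage diversity.
    program, inp = model.propose(references=buffer.sample(k=3))

    # 2. Task validation: the code executor checks that the task runs.
    try:
        target = run_program(program, inp)
    except Exception:
        return  # invalid task: discard it
    buffer.add((program, inp, target))

    # 3. Task solution: the same model attempts its own task.
    prediction = model.solve(program, inp)

    # 4. Solution verification: the executor grades with a binary reward.
    r_solve = 1.0 if prediction == target else 0.0

    # 5. Reinforcement learning: both roles update from verifiable
    #    rewards (the proposer's own reward is sketched later).
    model.update(solve_reward=r_solve)
```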

How Absolute Zero Reasoner (AZR) Works

The Absolute Zero Reasoner implementation uses a unified language model that serves dual roles:

Task Proposer

  • Generates new coding challenges designed to maximize learning potential
  • Creates tasks that are neither too easy (offering minimal learning) nor impossible (providing no useful feedback)
  • Uses buffer memory of previous tasks to ensure diversity
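
The paper operationalizes this “sweet spot” with a learnability reward for the proposer. A minimal sketch of that shape, where `solve_rate` is the solver’s empirical success rate on the proposed task over several attempts:

```python
def proposer_reward(solve_rate: float) -> float:
    """Learnability-style reward (our sketch): tasks the current solver
    always solves (rate 1.0) or never solves (rate 0.0) teach nothing
    and earn zero; difficult-but-solvable tasks earn the most."""
    if solve_rate in (0.0, 1.0):
        return 0.0
    return 1.0 - solve_rate
```

Under this scheme, the reward grows as tasks get harder, but collapses to zero the moment they become impossible for the current solver.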

Task Solver

  • Attempts to solve self-proposed tasks
  • Receives concrete feedback from the code executor environment
  • Uses this feedback to improve both task proposal and solution capabilities

AZR constructs three types of coding tasks corresponding to three complementary modes of reasoning:

  1. Deduction: Given code and input, predict the output
  2. Abduction: Given code and output, infer the input
  3. Induction: Given input-output pairs, create the code that produces the transformation
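
A toy example (ours, not from the paper) shows how all three modes arise from a single (program, input, output) triple:

```python
# The underlying triple: a program f, an input x, and its output y.
def f(xs):
    return sorted(set(xs))

x = [3, 1, 3, 2]
y = f(x)  # [1, 2, 3]

# Deduction: given f and x, predict y.
assert f([3, 1, 3, 2]) == [1, 2, 3]

# Abduction: given f and y = [1, 2, 3], infer a plausible x
# (any list whose distinct elements are {1, 2, 3} works).
assert f([2, 3, 1, 1]) == [1, 2, 3]

# Induction: given pairs like ([3, 1, 3, 2], [1, 2, 3]), write f itself.
def g(xs):  # a solver's reconstructed program
    return sorted(set(xs))

assert g(x) == y
```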

This approach uses a code executor as an environment that serves two critical functions:

  • Validating the integrity and solvability of proposed tasks
  • Providing verifiable feedback for the model’s solutions
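
What that validation might look like in practice is sketched below; the specific checks (a crude keyword safety filter, crash rejection, and a determinism check) are our reading of the paper’s validation step, not its exact implementation:

```python
import copy

FORBIDDEN = ("import os", "import sys", "subprocess", "open(")  # illustrative filter

def run_program(program: str, inp):
    # Same hypothetical helper as in the earlier loop sketch.
    scope = {}
    exec(program, scope)  # assumes the program defines a function `f`
    return scope["f"](inp)

def validate_task(program: str, inp):
    """Return the gold output for a valid task, or None to reject it."""
    if any(token in program for token in FORBIDDEN):
        return None  # reject obviously unsafe code
    try:
        out1 = run_program(program, copy.deepcopy(inp))
        out2 = run_program(program, copy.deepcopy(inp))
    except Exception:
        return None  # reject programs that crash
    if out1 != out2:
        return None  # reject nondeterministic programs
    return out1  # a trustworthy gold answer for the solver
```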

Technical Innovations

Several technical innovations make AZR effective:

  1. Task-Relative REINFORCE++ (TRR++): A specialized advantage estimator tailored to the multitask nature of self-play reasoning. Rather than normalizing rewards against a single global baseline, TRR++ computes a separate baseline for each task-role configuration (see the sketch after this list).

  2. Dual-role architecture: By using the same model for both proposing and solving tasks, AZR creates a virtuous cycle of improvement.

  3. Diversity and challenge balancing: The reward system is carefully designed to encourage task diversity and appropriate difficulty levels—tasks that are challenging but solvable.
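
A sketch of the TRR++ idea, with one stated assumption: each reward is normalized only against samples sharing the same (task type, role) configuration, giving six separate baselines, and the mean/standard-deviation normalization below is our guess at the details:

```python
import statistics
from collections import defaultdict

def trr_advantages(samples):
    """Compute advantages from (task_type, role, reward) tuples, e.g.
    ("abduction", "propose", 1.0). Each reward is judged relative to
    its own (task_type, role) group rather than a single global baseline."""
    groups = defaultdict(list)
    for task_type, role, reward in samples:
        groups[(task_type, role)].append(reward)

    advantages = []
    for task_type, role, reward in samples:
        rewards = groups[(task_type, role)]
        mean = statistics.fmean(rewards)
        std = statistics.pstdev(rewards) or 1.0  # guard against zero variance
        advantages.append((reward - mean) / std)
    return advantages
```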

Remarkable Results

Despite being trained entirely without external data, AZR achieves state-of-the-art performance on coding and mathematical reasoning tasks. The researchers report:

  • Superior performance: AZR outperforms existing zero-setting models that rely on tens of thousands of in-domain human-curated examples
  • Cross-domain transfer: AZR models trained only on coding tasks show substantial improvements in mathematical reasoning (10.9–15.2 percentage points)
  • Scaling benefits: Performance gains increase with model size, with 7B and 14B parameter models showing continued improvement beyond 200 training steps

Perhaps most remarkably, Absolute Zero Reasoner-Coder-7B achieved state-of-the-art results in both the overall average and the coding average categories, surpassing the previous best models by 1.8 absolute percentage points, even though its self-generated training tasks were entirely out-of-distribution for both the math and code reasoning benchmarks.

Emergent Behaviors

One of the most fascinating aspects of AZR is the emergence of sophisticated reasoning patterns that weren’t explicitly programmed. For example:

  • Self-documentation: When solving code induction tasks, AZR naturally develops the habit of interleaving step-by-step plans as comments within its code
  • Task-specific reasoning: The model exhibits different reasoning patterns depending on the type of task it’s solving
  • Curriculum emergence: The system naturally progresses from simpler to more complex tasks as its capabilities improve

Implications and Future Directions

The Absolute Zero paradigm represents a significant step toward AI systems that can autonomously achieve advanced reasoning capabilities through self-improvement. Key implications include:

  1. Reduced human dependence: By eliminating the need for human-curated examples, this approach could dramatically accelerate AI advancement

  2. Beyond human limitations: Self-play learning allows models to potentially explore reasoning strategies that might not occur to human teachers

  3. Knowledge boundaries: The current system is limited to tasks that can be verified by a code executor, but future work could expand this to other domains

  4. Safety considerations: The researchers note some “uh-oh moments” where the system exhibited concerning reasoning patterns, highlighting the ongoing need for safety research in self-improving systems

Conclusion

The Absolute Zero paradigm introduces a compelling new direction for AI research that addresses fundamental limitations in how we currently train reasoning systems. By enabling models to create their own learning curriculum and improve through self-play, this approach could significantly accelerate the development of more capable, flexible, and independent AI systems.

As with any major advancement, this research raises as many questions as it answers. Could similar self-play approaches work in domains beyond coding and mathematics? How might we ensure safety as systems become more autonomous in their learning? And what does this tell us about the nature of knowledge acquisition and reasoning itself?

The full implications of this research are still emerging, but one thing is clear: by teaching AI to teach itself, we may be witnessing an important step toward more genuinely intelligent systems.


  1. Zhao, A., Wu, Y., Yue, Y., Wu, T., Xu, Q., Yue, Y., Lin, M., Wang, S., Wu, Q., Zheng, Z., & Huang, G. (2025). Absolute Zero: Reinforced Self-play Reasoning with Zero Data. arXiv:2505.03335v2. https://arxiv.org/abs/2505.03335