Automated Scientific Discovery via Bayesian Surprise & Monte Carlo Tree Search — NLS Socioeconomic Status Dataset
Upload a CSV file to preview its structure, then run AutoDiscovery on it for free.
Briefly describe what each column represents. This helps the AI generate better hypotheses.
Aggregate statistics from the MCTS exploration run.
Hypotheses ranked by self-reward score. Each card shows the experimental finding, belief update, and information-theoretic metrics.
Prior mean belief P(H) vs. posterior mean belief P(H | data) for each tested hypothesis, illustrating how the evidence shifted confidence.
KL Divergence — bits of information gained from each experiment, measuring how much the posterior diverged from the prior.
Detailed prior and posterior belief sample distributions across the five confidence categories.
AutoDiscovery is an AI-driven scientific discovery framework developed by the Allen Institute for AI (AI2). It uses a Monte Carlo Tree Search (MCTS) strategy to explore a combinatorial space of hypotheses and experiments, selecting the most informative path at each step via the UCB1 bandit algorithm.
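As a rough illustration of the selection step, UCB1 picks the child node that balances its average reward against how rarely it has been visited. The node fields and function names below are hypothetical, not the framework's actual API; this is a minimal sketch of the bandit rule, not AI2's implementation.

```python
import math

def ucb1_select(children, total_visits, c=math.sqrt(2)):
    """Pick the child node with the highest UCB1 score.

    Each child is assumed to expose `visits` and `total_reward`.
    Unvisited children are tried first; the exploration constant `c`
    trades off exploiting high-reward nodes against exploring new ones.
    """
    def score(node):
        if node.visits == 0:
            return float("inf")  # always try unvisited nodes first
        exploit = node.total_reward / node.visits
        explore = c * math.sqrt(math.log(total_visits) / node.visits)
        return exploit + explore

    return max(children, key=score)
```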
At each node, a large language model proposes a hypothesis and designs an experiment (including executable code). After execution, a separate Bayesian Belief Update module evaluates the result: an LLM jury samples categorical beliefs on a five-point scale from "definitely true" to "definitely false", both before (prior) and after (posterior) seeing the experimental evidence, with n = 30 independent samples per distribution.
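The belief-update step can be pictured as turning the jury's categorical votes into a probability vector. Below is a minimal sketch: only the two endpoint labels are stated above, so the three intermediate category names are assumptions, and the small smoothing term (also an assumption) just keeps the later divergence computation finite.

```python
from collections import Counter

CATEGORIES = [
    "definitely true", "probably true", "uncertain",
    "probably false", "definitely false",
]

def belief_distribution(jury_samples, smoothing=1e-3):
    """Turn n categorical jury votes into a probability distribution.

    `jury_samples` is a list of category strings (e.g. 30 independent
    LLM samples). Smoothing keeps every category nonzero so that
    KL divergence against this distribution is well defined.
    """
    counts = Counter(jury_samples)
    raw = [counts.get(c, 0) + smoothing for c in CATEGORIES]
    total = sum(raw)
    return [x / total for x in raw]
```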
The reward signal is derived from the KL divergence between the prior and posterior belief distributions, scaled by a configurable factor (here κ = 5). Larger divergence means the experiment was more informative: it changed the model's beliefs substantially. A "surprise" flag (not triggered in this run) would indicate findings that contradicted the prior expectation.
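In that setup, the information gain and reward could be computed roughly as follows. The bit-based KL divergence matches the chart description above; the exact transform AutoDiscovery applies (clipping, normalization, etc.) is not stated, so the reward here is simply the divergence multiplied by κ.

```python
import math

def kl_divergence(posterior, prior):
    """D_KL(posterior || prior) in bits: information gained from the experiment."""
    return sum(p * math.log2(p / q) for p, q in zip(posterior, prior) if p > 0)

def reward(posterior, prior, kappa=5.0):
    """Reward signal: KL divergence scaled by the configurable factor kappa."""
    return kappa * kl_divergence(posterior, prior)
```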
The tree is grown iteratively: each round selects a parent node, generates candidate experiments, runs the most promising one, updates beliefs, and back-propagates the reward. This demo ran 4 MCTS iterations with 8 candidate experiments per round, using GPT-4o as both the hypothesis-generating and belief-evaluating model.
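Putting the pieces together, one round of the loop described above might look like the following skeleton. It reuses the ucb1_select and kl_divergence sketches from earlier; the Node class and the injected callables (propose_fn, execute_fn, belief_fn) are placeholders standing in for the LLM and experiment runner, not AutoDiscovery's real interfaces.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """One hypothesis/experiment node in the search tree (illustrative only)."""
    hypothesis: str = ""
    parent: "Node | None" = None
    children: list = field(default_factory=list)
    visits: int = 0
    total_reward: float = 0.0

def run_round(root, propose_fn, execute_fn, belief_fn, n_candidates=8, kappa=5.0):
    """One illustrative MCTS round; all callables are hypothetical placeholders.

    propose_fn(node)  -> ranked list of candidate hypothesis strings
    execute_fn(h)     -> experimental result / evidence for hypothesis h
    belief_fn(h, evidence=None) -> probability vector over belief categories
    """
    # 1. Selection: descend via UCB1 until reaching a leaf.
    node = root
    while node.children:
        node = ucb1_select(node.children, total_visits=max(node.visits, 1))

    # 2. Expansion: the LLM proposes candidates; take the top-ranked one.
    candidates = propose_fn(node)[:n_candidates]
    hypothesis = candidates[0]

    # 3. Execution and belief update: prior vs. posterior jury distributions.
    result = execute_fn(hypothesis)
    prior = belief_fn(hypothesis)
    posterior = belief_fn(hypothesis, evidence=result)
    r = kappa * kl_divergence(posterior, prior)

    # 4. Back-propagation: credit the reward to every ancestor up to the root.
    child = Node(hypothesis=hypothesis, parent=node)
    node.children.append(child)
    walker = child
    while walker is not None:
        walker.visits += 1
        walker.total_reward += r
        walker = walker.parent
    return child
```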