Vector Policy Optimization

Training for Diversity Improves Test-Time Search

Ryan Bahlous-Boldi¹,², Isha Puri², Idan Shenfeld¹,², Akarsh Kumar², Mehul Damani²,
Sebastian Risi⁴, Omar Khattab², Zhang-Wei Hong¹,²,³, Pulkit Agrawal¹,²

¹Improbable AI Lab ²MIT ³MIT-IBM Computing Research Lab ⁴Sakana AI


TL;DR

Modern LLM pipelines rarely take a single answer. They sample many candidates and keep the best: best-of-n, verifier reranking, evolutionary search like AlphaEvolve. This is test-time search, and it only pays off if the candidates differ. GRPO works against that. It collapses the model onto one high-probability answer, so the extra samples are near-duplicates and search has nothing to choose between.

Vector Policy Optimization (VPO) keeps the reward a vector. Most rewards already are vectors: several objectives summed into one score. VPO trains the model to emit a set of solutions that spread across different trade-offs, so search gets distinct candidates. It is a drop-in replacement for the GRPO advantage estimator. It improves pass@k / best@k, and the gap grows as the search budget grows.

**Figure 1.** VPO vs. GRPO. When maximizing a scalar, GRPO sends every rollout toward the same point. VPO optimizes across many reward weightings at once, spreading candidates across the Pareto frontier of good solutions. On LiveCodeBench this yields higher pass@k, and inside an AlphaEvolve-style search loop VPO cracks hard problems GRPO never solves.

1. The problem: scalar rewards collapse diversity

Modern LLM pipelines pair a learned policy with explicit test-time search: any procedure that draws several candidate solutions from the model and then selects among them. Examples include rejection sampling with a verifier, best-of-$n$, self-consistency, evolutionary loops like AlphaEvolve and FunSearch, and agents that plan and backtrack. In all of them, search picks from a pool of candidates the policy generates, so the diversity of that pool sets a ceiling on how well search can do.

Standard RL post-training works against that. Policy-gradient methods like GRPO drive the model toward a narrow set of high-probability responses, so after training the extra samples are near-duplicates. This is the well-documented entropy (mode) collapse of RL post-training: KL-regularised policy gradients sharpen the output distribution, so pass@$k$ at large $k$ often degrades relative to the base model even as pass@1 improves. That collapse is exactly what makes downstream search ineffective. VPO reframes the job of post-training: the goal is not to converge on the single best response, but to maximize the diversity of a set of competent solutions, leaving exploitation entirely to search. A clean division of labor: training explores, search exploits.

2. The idea: keep the reward a vector

In most real tasks the reward is already a vector $r(x,y)=[r_1,\dots,r_d]$ of sub-scores that we choose to sum up: per-test-case correctness in code, per-criterion ratings in RLHF, per-sub-question success in multi-hop reasoning, or the format / tool-name / argument fields in tool use. Scalar RL throws this structure away by fixing a single weighting $w^\star$ and optimizing $w^{\star\top} r$. Every answer is judged on one axis, so the policy has no reason to keep an answer that is excellent on $r_2$ but mediocre on $r_1$, exactly the answers search needs.

VPO keeps the reward a vector. Any weighting $w\in\Delta^{d-1}$ on the simplex defines a valid scalar goal $w^\top r$. (The $(d{-}1)$-simplex $\Delta^{d-1}$ is the set of non-negative weight vectors that sum to 1, i.e. every convex combination of the objectives; each point is one "user preference" over the reward components.) VPO refuses to commit to one weighting and instead asks the model for a set of $m$ answers, rewarding the set for doing well under many weightings at once.

3. The VPO objective

For a sampled weighting $w$, VPO scores not the average answer but the best one, $\max_{y\in S} w^\top r(x,y)$. The $\max$ is what makes this a coverage objective rather than an averaging one: averaging over the set would just reward each answer for being individually good under the mean weighting, recovering scalar RL, whereas the max rewards the set for containing a specialist for each $w$, which is exactly the expected best-of-$m$ that downstream search will enjoy. VPO then averages that best-of-set score over a distribution of weightings:

$$R(S) \;=\; \mathbb{E}_{w\sim\mathrm{Dir}(\alpha)}\Big[\max_{y\in S} w^\top r(x,y)\Big].$$

The $\max$ inside the expectation is the whole trick. A set gets credit only for its best member under each weighting, so the way to score highly is to have a different member be best for different weightings. Near-duplicates cover a narrow slice of the simplex; a set that spreads across trade-offs covers all of it. The objective rewards coverage of the reward space, which is precisely the diversity downstream search exploits.
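A small worked example may make the coverage effect concrete. The numbers are illustrative assumptions, not taken from the paper: take $d=2$ objectives, $m=3$ answers, and $\alpha=\mathbf{1}$, so that $w=(t,\,1-t)$ with $t\sim\mathrm{Unif}[0,1]$. Compare a set of near-duplicates with a set that contains two specialists and one generalist:

$$S_{\text{dup}}=\{(0.6,0.6),\,(0.6,0.6),\,(0.6,0.6)\},\qquad S_{\text{spread}}=\{(1,0),\,(0,1),\,(0.6,0.6)\}.$$

$$R(S_{\text{dup}})=\int_0^1 0.6\,dt=0.6,$$

$$R(S_{\text{spread}})=\int_0^1 \max\{t,\;1-t,\;0.6\}\,dt=\int_0^{0.4}(1-t)\,dt+\int_{0.4}^{0.6}0.6\,dt+\int_{0.6}^{1}t\,dt=0.32+0.12+0.32=0.76.$$

Under a single fixed weighting $w^\star=(\tfrac12,\tfrac12)$ the picture reverses: the answers in $S_{\text{spread}}$ average $(0.5+0.5+0.6)/3\approx 0.53$ against $0.6$ for $S_{\text{dup}}$. In this toy setting a per-answer scalar objective prefers the duplicates, while the set-level objective prefers the spread.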

4. The algorithm

Drop that set-level reward into GRPO and you have VPO. In practice $R(S)$ is a Monte-Carlo estimate: draw $K$ weights once and share them across the GRPO group of $G$ rollouts, so every set is scored under the same weightings and the advantages stay comparable:

$$\hat{R}\big(S^{(g)}\big) \;=\; \frac{1}{K}\sum_{k=1}^{K}\,\max_{y\in S^{(g)}} w^{(k)\top} r(x,y), \qquad w^{(1)},\dots,w^{(K)}\stackrel{\text{iid}}{\sim}\mathrm{Dir}(\mathbf{1}).$$
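The estimator above can be sketched in a few lines of standard-library Python. This is a minimal illustration rather than the released implementation: the helper names `sample_dirichlet` and `set_reward`, the toy reward vectors, and the choice $K=4096$ are all assumptions made for the example.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one weight vector from Dir(alpha) by normalizing Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [g / total for g in draws]


def set_reward(reward_vectors, weights):
    """Monte-Carlo estimate of R(S): for each weighting take the best
    w . r over the set, then average over the weightings."""
    total = 0.0
    for w in weights:
        total += max(sum(wi * ri for wi, ri in zip(w, r)) for r in reward_vectors)
    return total / len(weights)


rng = random.Random(0)
d, K = 3, 4096  # illustrative sizes, not the paper's settings

# Draw the K weightings once and reuse them for every set in the group,
# so all sets are scored under identical weightings.
weights = [sample_dirichlet([1.0] * d, rng) for _ in range(K)]

# Hypothetical reward vectors for two rollouts (m = 3 answers each).
group = {
    "duplicates": [[0.5, 0.5, 0.5]] * 3,
    "specialists": [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
}
scores = {name: set_reward(S, weights) for name, S in group.items()}
print(scores)
```

In this toy group the duplicate set scores 0.5 under every weighting, while the specialist set is credited with its largest weight component each time and so ends up higher, even though each specialist alone averages only 1/3 under uniform weights.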

The model emits all $m$ answers in one autoregressive rollout (a shared reasoning prefix, then the answers), so it can condition each answer on the previous ones and deliberately diversify. The scalar $\hat R$ replaces the usual per-rollout reward; the GRPO advantage is computed and applied to every token as normal. Two knobs: the weight sampler (i.i.d. Dirichlet, or a Sobol quasi-Monte-Carlo sampler with lower variance) and the concentration $\alpha$ ($\alpha=\mathbf{1}$ is uniform). Sharing the $K$ weight draws across the whole group already removes the between-rollout noise; the Sobol (low-discrepancy) sampler additionally spreads the $K$ weights more evenly over the simplex than i.i.d. draws, cutting the variance of $\hat R$ for the same $K$. In code it is a 3-file patch to veRL: algorithm.adv_estimator=grpo becomes algorithm.adv_estimator=vpo.
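How the set-level reward feeds the advantage can be sketched as follows. This assumes the common GRPO form (subtract the group mean, divide by the group standard deviation); the function names, the population-std choice, and the `eps` value are assumptions for illustration, and veRL's own estimator may differ in those details.

```python
import statistics


def group_advantages(set_rewards, eps=1e-6):
    """Group-normalized advantages: (R_g - mean) / (std + eps).

    set_rewards holds one set-level reward R_hat per rollout in the group.
    Using the population std is an assumption made for this sketch.
    """
    mean = statistics.fmean(set_rewards)
    std = statistics.pstdev(set_rewards)
    return [(r - mean) / (std + eps) for r in set_rewards]


def broadcast_to_tokens(advantage, num_tokens):
    """Every token of a rollout (shared prefix and all m answers) gets the same advantage."""
    return [advantage] * num_tokens


# Hypothetical set-level rewards for a group of G = 4 rollouts.
r_hat = [0.50, 0.61, 0.55, 0.50]
advantages = group_advantages(r_hat)
token_advantages = broadcast_to_tokens(advantages[1], num_tokens=7)
print(advantages)
```

Because the whole rollout shares one advantage, credit is assigned to the set as a unit: an answer that lowers its own score but raises coverage is still reinforced when the set as a whole beats the group mean.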

**Figure 2.** How a VPO rollout is scored. From prompt $x$ the model writes $m$ answers in one chain after a shared reasoning prefix; each answer gets a reward vector. We sample many weightings $w\sim\mathrm{Dir}(\mathbf 1)$, take the best-of-$m$ under each, and average into the set-level reward $\hat R$, which becomes the rollout reward.

5. Results

Across four domains, VPO matches or beats the strongest scalar RL baselines on test-time search, measured by best@$k$ (the highest eval-scalar score among $k$ sampled candidates) and pass@$k$ (the fraction of problems where at least one of $k$ candidates is correct). Both metrics reward a diverse pool: they only improve if the extra samples differ. The gap widens with the search budget. VPO can lose at $k=1$ but wins once you sample more than once, the regime where search sits downstream.
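The two metrics can be sketched as below. The pass@$k$ function uses the standard unbiased combinatorial estimator from $n$ samples with $c$ correct; whether the paper's evaluation uses this estimator or a direct empirical count is not stated here, so treat that choice, and the toy counts, as assumptions.

```python
from math import comb


def best_at_k(scores, k):
    """Highest eval score among the first k sampled candidates."""
    return max(scores[:k])


def pass_at_k(n, c, k):
    """Probability that at least one of k candidates, drawn without
    replacement from n samples of which c are correct, is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts, n, k):
    """Average pass@k over problems, given the number correct out of n for each."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Toy comparison on two problems with n = 10 samples each (made-up counts).
collapsed = [10, 0]  # always solves one problem, never the other
diverse = [4, 4]     # solves each problem on some samples

for k in (1, 5):
    print(k, mean_pass_at_k(collapsed, 10, k), mean_pass_at_k(diverse, 10, k))
```

With these made-up counts the collapsed pool stays at 0.5 for both $k=1$ and $k=5$, while the diverse pool starts lower at 0.4 for $k=1$ and rises to about 0.98 at $k=5$. That is the qualitative pattern described above: behind at $k=1$, ahead once several samples are drawn.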

best@30 on the GRPO training scalar, and reward-space diversity (higher = more coverage). m=3 throughout; 4 eval seeds on EUREQA and ToolRL.

| Domain (model) | GRPO | Max-at-K | Multi-RLVR | VPO | Diversity (GRPO → VPO) |
|---|---|---|---|---|---|
| Maze (Qwen3-4B, k=4) | 0.432 | 0.577 | 0.436 | 0.593 | 0.003 → 1.01 |
| MuSiQue (Qwen3-1.7B, k=5) | 0.728 | 0.802 | 0.633 | 0.832 | 0.054 → 0.59 |
| EUREQA (Qwen3-8B, k=5) | 0.236 | 0.237 | 0.267 | 0.279 | 0.119 → 0.51 |
| ToolRL (Qwen3-1.7B, k=4) | 0.925 | 0.954 | 0.905 | 0.952 | 0.044 → 1.30 |

On ToolRL everything saturates near the reward ceiling, so VPO ties Max-at-K (0.952 vs 0.954) while producing about 10× the candidate diversity. Where the Pareto front is genuinely wide (Maze, MuSiQue, EUREQA), VPO is the top performer.

**Figure 3.** LiveCodeBench case study (Qwen2.5-Coder-7B, trained on DeepCoder): pass@k and best@k as a function of k, and over OpenEvolve search iterations. (A) Pass@k on the 279-problem held-out split: at k=1 GRPO is ahead, and VPO overtakes as k grows. (B) Best@k: VPO is above GRPO at every k, with the gap widening. (C, D) Inside the OpenEvolve search loop on the 32 hardest held-out problems (those where both methods score 0 at best@30 in the standard regime), VPO keeps finding solutions over 200 iterations and cracks problems neither could touch; GRPO plateaus early.

**Figure 4.** Reward-space diversity over training: pairwise $L_1$ distance between per-rollout reward vectors. VPO sustains much higher diversity than Multi-RLVR throughout. The one domain where Multi-RLVR keeps up (EUREQA) is also the only one where it comes close on best@k.
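One plausible reading of this diversity measure is the mean $L_1$ distance over all unordered pairs of reward vectors. The sketch below implements that reading; the averaging over pairs and the function name are assumptions, since the exact normalization is not spelled out here.

```python
from itertools import combinations


def pairwise_l1_diversity(reward_vectors):
    """Mean L1 distance over all unordered pairs of reward vectors.

    Returns 0.0 when fewer than two vectors are given.
    """
    pairs = list(combinations(reward_vectors, 2))
    if not pairs:
        return 0.0
    total = sum(sum(abs(a - b) for a, b in zip(u, v)) for u, v in pairs)
    return total / len(pairs)


# Hypothetical reward vectors: a collapsed pool versus a spread pool.
collapsed_pool = [[0.6, 0.6], [0.6, 0.6], [0.6, 0.6]]
spread_pool = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

print(pairwise_l1_diversity(collapsed_pool))
print(pairwise_l1_diversity(spread_pool))
```

A pool of identical reward vectors scores exactly zero under this measure, so values near zero (as in the GRPO column of the table) indicate candidates that are indistinguishable in reward space.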

6. Relation to prior work

Temperature sampling adds undirected token-level noise and degrades quality; VPO produces directed diversity in reward space. Multi-objective RL conditions a policy on a weight vector to trace the Pareto front, one policy per preference; VPO never conditions on a weight, and its goal is performance under a fixed $w^\star$ using a single unconditional policy. GDPO is single-solution against a vector reward; VPO is multi-solution. Multi-RLVR samples multiple answers but collapses the vector to a scalar before scoring, so its pool still drifts toward duplicates. VPO is for the regime where the reward genuinely decomposes and search is downstream; when the reward components are near-collinear it reduces to ordinary scalar RL. Concretely, on UltraFeedback scored by a near-collinear ArmoRM reward, the simplex effectively collapses to a line, and VPO lands slightly below scalar baselines on absolute best@$k$ while still retaining the largest per-$k$ headroom of any method.

Getting started

git clone https://github.com/ryanboldi/vpo.git && cd vpo
uv venv .venv && source .venv/bin/activate
cd verl && pip install -e . && cd .. && pip install -e .

python data/preprocess_maze.py --local_save_dir $HOME/data/maze
bash train.sh METHOD=vpo TASK=maze MODEL=Qwen/Qwen3-4B

Code released under Apache 2.0. github.com/ryanboldi/vpo

Acknowledgments

The authors express gratitude to Jyo Pari, Jonas Hübotter, Nitish Dashora, Andre Ye, Itamar Pres, Navodita Sharma, John Marangola, Nolan Fey, Oliver Sieberling, Linlu Qiu, Luca Grillotti, Benjamin Eysenbach, Yoon Kim, Jacob Andreas, and members of the Improbable AI lab at MIT for discussions that helped shape this work. This work was supported by National Science Foundation graduate research fellowships to RB, IP, and AK, the MIT-IBM Computing Research Lab, the MIT-Google Program for Computing Innovation and Hyundai Motor Company. This research was sponsored by the Army Research Office and was accomplished under Grant Numbers W911NF2110328 and W911NF-23-1-0277. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Office or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein.

BibTeX

@misc{bahlousboldi2026vpo,
  title={Vector Policy Optimization: Training for Diversity Improves Test-Time Search},
  author={Ryan Bahlous-Boldi and Isha Puri and Idan Shenfeld and Akarsh Kumar and Mehul Damani and Sebastian Risi and Omar Khattab and Zhang-Wei Hong and Pulkit Agrawal},
  year={2026},
  eprint={2605.22817},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2605.22817},
}