A Multi-Fidelity Control Variate Approach for Policy Gradient Estimation

The University of Texas at Austin, The University of British Columbia, DEVCOM Army Research Laboratory
*Indicates Equal Contribution
Transactions on Machine Learning Research (TMLR), 2026
[Figure: MFPG cover figure]

Challenge: Real-world or high-fidelity simulation data are expensive and scarce for online reinforcement learning (RL), while abundant low-fidelity data are inexpensive but biased.

TL;DR: Multi-Fidelity Policy Gradient (MFPG) is a sample-efficient, multi-fidelity RL framework that boosts training in the target environment with cheap, imperfect low-fidelity data.

Key properties:

  • Unbiasedness: grounds learning on scarce, accurate, high-fidelity data
  • Reduced variance: uses large amounts of cheap, imperfect, low-fidelity data to form a control variate for variance reduction (see the sketch below the teaser figure)
  • Robustness: remains reliable even when low-fidelity data are biased or harmful
  • Generality: handles dynamics gaps and reward misspecification
[Figure: Overview of the proposed multi-fidelity policy gradient approach]
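As a rough sketch of the control-variate construction behind these properties (the notation here is illustrative, not necessarily the paper's exact estimator): suppose each of the $n$ high-fidelity episodes yields a policy-gradient sample $g^{\mathrm{hi}}_i$ paired with a correlated low-fidelity sample $g^{\mathrm{lo}}_i$, and suppose $\tilde{g}^{\mathrm{lo}}_j$, $j = 1, \dots, m$ with $m \gg n$, are additional cheap low-fidelity samples. A standard multi-fidelity combination is

$$
\hat{g}_{\mathrm{MF}} \;=\; \frac{1}{n}\sum_{i=1}^{n} g^{\mathrm{hi}}_i \;-\; c\left(\frac{1}{n}\sum_{i=1}^{n} g^{\mathrm{lo}}_i \;-\; \frac{1}{m}\sum_{j=1}^{m} \tilde{g}^{\mathrm{lo}}_j\right),
$$

which remains an unbiased estimate of the high-fidelity gradient because the term in parentheses has zero expectation, and has lower variance than the plain high-fidelity average whenever the paired samples are well correlated and the coefficient $c$ is chosen appropriately.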

Video Presentation

Slides (PDF)

Abstract

Many reinforcement learning (RL) algorithms are impractical for deployment in operational systems or for training with computationally expensive high-fidelity simulations, as they require large amounts of data. Meanwhile, low-fidelity simulators (such as reduced-order models, heuristic reward functions, or generative world models) can cheaply provide useful data for RL training, even if they are too coarse for direct sim-to-real transfer. We propose multi-fidelity policy gradients (MFPGs), an RL framework that mixes a small amount of data from the target environment with a control variate formed from a large volume of low-fidelity simulation data to construct an unbiased, variance-reduced estimator for on-policy policy gradients. We instantiate the framework by developing a practical, multi-fidelity variant of the classical REINFORCE algorithm. We show that under standard assumptions, the MFPG estimator guarantees asymptotic convergence of multi-fidelity REINFORCE to locally optimal policies in the target environment, and achieves faster finite-sample convergence rates compared to training with high-fidelity data alone. We evaluate the MFPG algorithm across a suite of simulated robotics benchmark tasks in scenarios with limited high-fidelity data but abundant off-dynamics, low-fidelity data. In our baseline comparisons, for scenarios where low-fidelity data are neutral or beneficial and dynamics gaps are mild to moderate, MFPG is, among the evaluated off-dynamics RL and low-fidelity-only approaches, the only method that consistently achieves statistically significant improvements in mean performance over a baseline trained solely on high-fidelity data. When low-fidelity data become harmful, MFPG exhibits the strongest robustness against performance degradation among the evaluated methods, whereas strong off-dynamics RL methods tend to exploit low-fidelity data aggressively and fail substantially more severely. An additional experiment in which the high- and low-fidelity environments are assigned anti-correlated rewards shows that MFPG can remain effective even when the low-fidelity environment exhibits reward misspecification. Thus, MFPG not only offers a reliable and robust paradigm for exploiting low-fidelity data, e.g., to enable efficient sim-to-real transfer, but also provides a principled approach to managing the trade-off between policy performance and data collection costs.
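Below is a minimal, self-contained NumPy sketch of how such an estimator could be assembled from per-episode gradient samples. The function name (`mfpg_gradient`), the pairing of rollouts via shared seeds, and the per-dimension coefficient choice are assumptions made for illustration; the paper's MFPG algorithm may differ in these details.

```python
# Hedged sketch of a multi-fidelity REINFORCE-style gradient estimate, following
# the control-variate construction described in the abstract. Names and the
# coefficient choice are illustrative assumptions, not the authors' exact code.
import numpy as np

def mfpg_gradient(g_hi, g_lo_paired, g_lo_extra, eps=1e-8):
    """Combine per-episode policy-gradient samples from two fidelities.

    g_hi:        (n, d) REINFORCE gradient samples from the target (high-fidelity)
                 environment, e.g. sum_t grad log pi(a_t|s_t) * return.
    g_lo_paired: (n, d) low-fidelity samples correlated with g_hi (e.g. rollouts
                 under the same policy with shared seeds/initial states).
    g_lo_extra:  (m, d) abundant additional low-fidelity samples, m >> n.
    Returns a variance-reduced estimate of the high-fidelity policy gradient.
    """
    mean_hi = g_hi.mean(axis=0)
    mean_lo_paired = g_lo_paired.mean(axis=0)
    mean_lo_extra = g_lo_extra.mean(axis=0)

    # Per-dimension control-variate coefficient c ~ Cov(g_hi, g_lo) / Var(g_lo),
    # estimated from the paired samples here for simplicity; estimating c from
    # held-out data keeps the combined estimator exactly unbiased.
    cov = ((g_hi - mean_hi) * (g_lo_paired - mean_lo_paired)).mean(axis=0)
    var = g_lo_paired.var(axis=0) + eps
    c = cov / var

    # The bracketed difference has (approximately) zero mean, so it can only
    # remove variance: the estimate stays anchored to the high-fidelity data.
    return mean_hi - c * (mean_lo_paired - mean_lo_extra)

# Toy usage with synthetic, correlated gradient samples.
rng = np.random.default_rng(0)
true_grad = np.array([1.0, -2.0])
noise = rng.normal(size=(8, 2))
g_hi = true_grad + noise                     # few samples, unbiased but noisy
g_lo_paired = 0.5 * true_grad + noise + 0.1  # correlated with g_hi, but biased
g_lo_extra = 0.5 * true_grad + 0.1 + rng.normal(size=(2000, 2))
print(mfpg_gradient(g_hi, g_lo_paired, g_lo_extra))
```

The key design choice reflected here is that the low-fidelity data enter only through a zero-mean difference, so they can reduce variance but cannot shift the estimate away from the high-fidelity objective.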