Pretrained Vision-Language-Action Models are Surprisingly Resistant to Forgetting in Continual Learning

1 The University of Texas at Austin  ·  2 KAIST  ·  3 Microsoft Superintelligence
Comparison of continual learning performance between a pretrained VLA (GR00T N1.5) and a non-pretrained small policy model (BC-Transformer). Each checkpoint corresponds to a model obtained by sequentially training over ten tasks under Experience Replay (ER). Each matrix entry (i, j) denotes the success rate on Task j after training on Task i. We compare a pretrained VLA (top) with a non-pretrained small BC policy (bottom) across multiple LIBERO benchmark suites.
Teaser: Comparison of continual learning performance between a pretrained VLA and a non-pretrained small policy model
Key Findings
Pretrained VLAs are remarkably resistant to forgetting compared with smaller policy models trained from scratch.
Simple Experience Replay (ER) works surprisingly well on VLAs: it can achieve zero forgetting with as little as 2% of the training data retained for replay, and sometimes even yields positive backward transfer on previous tasks.
Pretraining plays an integral role in improving continual learning performance in both forward and backward transfer.
The pretrained knowledge in VLAs is especially effective at mitigating forgetting, even with an extremely small replay dataset. Furthermore, pretraining reduces forgetting while maintaining a high success rate on new tasks, avoiding the usual trade-off between preserving old knowledge and forward transfer.
VLAs preserve relevant knowledge of prior tasks despite apparent performance degradation from learning new ones.
Even when performance on previous tasks appears to degrade through catastrophic forgetting, the underlying knowledge is still preserved in the VLA's internal representations: a few finetuning steps can quickly restore performance on past tasks.
Abstract

Continual learning is a long-standing challenge in robot policy learning, where a policy must acquire new skills over time without catastrophically forgetting previously learned ones. While prior work has extensively studied continual learning in relatively small behavior cloning (BC) policy models trained from scratch, its behavior in modern large-scale pretrained Vision-Language-Action (VLA) models remains underexplored. In this work, we find that pretrained VLAs are remarkably resistant to forgetting compared with smaller policy models trained from scratch. Simple Experience Replay (ER) works surprisingly well on VLAs, sometimes achieving zero forgetting even with a small replay data size. Our analysis reveals that pretraining plays a critical role in downstream continual learning performance: large pretrained models mitigate forgetting with a small replay buffer size while maintaining strong forward learning capabilities. Furthermore, we find that VLAs can retain relevant knowledge from prior tasks despite performance degradation during learning new tasks. This knowledge retention enables rapid recovery of seemingly forgotten skills through finetuning. Together, these insights imply that large-scale pretraining fundamentally changes the dynamics of continual learning, enabling models to continually acquire new skills over time with simple replay.
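As a concrete illustration of the replay setup studied here, the sketch below shows a minimal Experience Replay loop: the policy trains on tasks sequentially, each batch mixes fresh demonstrations with samples drawn from a small buffer, and only a small fraction (e.g. 2%) of each task's data is retained after training on it. All names and the loop structure are illustrative placeholders, not the paper's implementation; the actual gradient step is elided.

```python
import random

def train_continually(tasks, replay_fraction=0.02, batch_size=4, seed=0):
    """Sequentially train over tasks, mixing replayed samples into each batch.

    tasks: list of tasks, each a list of demonstrations (any hashable items).
    replay_fraction: fraction of each task's data kept in the replay buffer.
    """
    rng = random.Random(seed)
    buffer = []        # demonstrations retained from past tasks
    batches_seen = []  # (num_new, num_replayed) per batch, for illustration
    for demos in tasks:
        for _ in range(len(demos)):  # one pass per task, for brevity
            new = rng.sample(demos, min(batch_size, len(demos)))
            old = rng.sample(buffer, min(batch_size, len(buffer))) if buffer else []
            batch = new + old        # a gradient step on `batch` would go here
            batches_seen.append((len(new), len(old)))
        # After finishing a task, retain only a small subset for replay.
        keep = max(1, int(len(demos) * replay_fraction))
        buffer.extend(rng.sample(demos, keep))
    return buffer, batches_seen
```

With 2% retention and 100 demonstrations per task, only two demonstrations per past task are ever replayed, which is the low-replay regime where the pretrained/non-pretrained gap is largest.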

VLAs are Surprisingly Resistant to Forgetting

Pretrained VLAs trained with experience replay (ER) surprisingly achieve near-zero forgetting, or even positive backward transfer, while maintaining strong forward transfer.
Table: Continual learning performance on LIBERO benchmarks: Average Success Rate (SR ↑) and Negative Backward Transfer (NBT ↓). Results are with replay buffer size = 1000 (20% of dataset size per task).
Average Performance
Method | SR ↑ | NBT ↓
Pi0 | 0.768±.017 | -0.016±.022
GR00T N1.5 | 0.919±.011 | 0.027±.021
BC-Diffusion | 0.696±.068 | 0.127±.071
BC-Transformer | 0.585±.066 | 0.245±.080
BC-ViT | 0.508±.142 | 0.193±.082
Per-Benchmark Performance
Method | LIBERO-Spatial (SR ↑ / NBT ↓) | LIBERO-Object (SR ↑ / NBT ↓) | LIBERO-Goal (SR ↑ / NBT ↓) | LIBERO-10 (SR ↑ / NBT ↓)
Pi0 | 0.879±.008 / 0.019±.019 | 0.897±.011 / -0.011±.034 | 0.732±.015 / -0.005±.010 | 0.563±.028 / -0.068±.018
GR00T N1.5 | 0.940±.013 / 0.007±.009 | 0.975±.004 / 0.019±.013 | 0.943±.004 / 0.023±.017 | 0.820±.017 / 0.059±.035
BC-Diffusion | 0.663±.074 / 0.050±.070 | 0.847±.069 / 0.025±.056 | 0.809±.089 / 0.230±.101 | 0.464±.024 / 0.201±.044
BC-Transformer | 0.659±.058 / 0.299±.096 | 0.595±.112 / 0.132±.120 | 0.709±.022 / 0.356±.042 | 0.376±.034 / 0.192±.019
BC-ViT | 0.513±.069 / 0.171±.065 | 0.543±.268 / 0.077±.134 | 0.661±.024 / 0.319±.022 | 0.316±.062 / 0.204±.063
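The NBT numbers above can be derived from the success-rate matrix described in the teaser figure, where entry (i, j) is the success rate on Task j after training on Task i. The helper below assumes one common definition of negative backward transfer (the average drop on each earlier task between when it was just learned and the end of training); the paper's exact formula may differ.

```python
def negative_backward_transfer(R):
    """NBT from a success-rate matrix R, where R[i][j] is the success rate
    on task j after sequentially training through task i.

    Assumed definition: average of R[j][j] - R[T-1][j] over earlier tasks.
    Lower is better; a negative value means positive backward transfer
    (earlier tasks actually improved by the end of training)."""
    T = len(R)
    return sum(R[j][j] - R[T - 1][j] for j in range(T - 1)) / (T - 1)
```

Under this convention, Pi0's NBT of -0.016 in the table corresponds to earlier tasks being slightly better at the end of the sequence than when they were first learned.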
Pretrained VLAs are resistant to forgetting especially under a low replay data regime.
Backward Transfer (BWT) across different replay buffer sizes. Each subplot shows BWT as a function of replay buffer size ({0.2%, 2%, 20%}) for all methods across benchmarks. Shaded regions indicate ±1 standard deviation. Higher BWT indicates more forgetting; values near zero indicate no forgetting. Pretrained VLAs (Pi0, GR00T N1.5) maintain near-zero BWT even with small replay buffers, while non-pretrained models (BC-Transformer, BC-ViT, BC-Diffusion) show 2–4× more forgetting.
Backward Transfer across different replay buffer sizes

Pretraining Plays an Integral Role in Improving Continual Learning Performance

Pretraining enables models to mitigate forgetting while maintaining strong forward learning capabilities.
Pareto frontier of average NBT vs. replay buffer size. As the replay buffer size decreases, the gap between pretrained and non-pretrained models increases, indicating that at smaller buffer sizes, pretraining plays an increasingly important role in mitigating forgetting. When the curve reaches 0, it means zero forgetting; below 0 indicates positive backward transfer.
Pareto frontier of NBT vs replay buffer size
Comparison of forgetting performance across different buffer sizes (10, 100, 1000) for Pi0 pretrained, Pi0 initialized from PaliGemma, and Pi0 trained from scratch. Each panel shows per-task success rates as training progresses through the sequence of 10 tasks.
Comparison of forgetting across buffer sizes
Knowledge transfer curves across benchmarks. We compare Pi0 trained from scratch (orange), Pi0 from PaliGemma (green), and Pi0 pretrained (blue) under different replay buffer sizes (10, 100, 1000). Pretrained models exhibit steadily increasing knowledge transfer over time, whereas from-scratch models show slower growth—low forgetting in non-pretrained models can arise from inability to learn new tasks.
Knowledge transfer curves

VLAs Retain Knowledge that is Seemingly Forgotten

A drastic decrease in prior task performance does not necessarily indicate complete forgetting of task-relevant knowledge in pretrained VLAs.

Instead, this knowledge is still retained, as prior task performance can often be rapidly recovered with minimal finetuning.
Methodology for investigating VLA knowledge loss in continual learning. (a) Continual learning pipeline of learning Tasks k-1, k, k+1 sequentially. (b) Identifying knowledge loss in the vision-language (VL) backbone and action head by swapping components. (c) Recovering vision-language knowledge with finetuning.
Methodology for investigating VLA knowledge loss
Knowledge loss in VLAs is not monolithic.

We find that knowledge is compartmentalized across VLA components: the vision-language (VL) backbone is the dominant source of forgetting, while the action head retains most of its prior task knowledge:

Comparison of mean success rates retained under different vision-language and action component combinations. Knowledge loss in VLAs is not monolithic but affects different modules separately. The VL backbone is the dominant source of forgetting. Knowledge loss correlates with task diversity—LIBERO-10 (most diverse) exhibits the largest drop.
Component swapping analysis
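A component swap of the kind used in this analysis can be sketched as a state-dict splice: parameters under one module's prefix are taken from a different checkpoint while everything else is kept. The prefixes ("vl_backbone.", "action_head.") and the flat-dict checkpoint format are hypothetical, chosen only for illustration; real VLA checkpoints name their modules differently.

```python
def swap_component(target_state, source_state, prefix):
    """Return a copy of target_state in which every parameter whose name
    starts with `prefix` is replaced by the corresponding parameter from
    source_state. Both arguments are flat name->tensor dicts (hypothetical
    checkpoint format). The inputs are not modified."""
    swapped = dict(target_state)
    for name, param in source_state.items():
        if name.startswith(prefix):
            swapped[name] = param
    return swapped
```

For example, splicing the pre-forgetting VL backbone into a later checkpoint (while keeping the later action head) isolates how much of the performance drop is attributable to the backbone versus the head.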
VLAs preserve relevant knowledge of prior tasks despite performance degradation from learning new ones.

In the finetuning experiments, Pi0 reaches peak performance within a small fraction of training steps, significantly faster than its initial training. In contrast, BC-Transformer requires a similar number of steps as initial training to reach peak performance, suggesting that task knowledge has been largely erased:

Performance comparison averaged across tasks. Each column represents a different LIBERO benchmark, where the x-axis shows the percentage of training steps during finetuning. Pi0 reaches peak performance within a small fraction of training steps—significantly faster than initial training—indicating task knowledge is preserved and reused. BC-Transformer requires comparable or more steps, suggesting knowledge is largely erased.
Performance recovery comparison

In terms of recovery efficiency, Pi0 consistently recovers peak performance with fewer than 10% of the original training steps, whereas BC-Transformer often requires a comparable or greater number of steps:

Table: Recovery efficiency. Finetuning steps (as ratio of original training time) needed to regain the peak success rate. Pi0 consistently recovers with fewer than 10% of original steps.
Benchmark | Pi0 | BC-Transformer
LIBERO-Spatial | 0.066 | 1.36
LIBERO-10 | 0.105 | 1.87
LIBERO-Object | 0.067 | 1.80
LIBERO-Goal | 0.062 | 0.33
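The recovery-efficiency ratio in the table can be computed as follows, assuming a hypothetical evaluation log that maps finetuning step counts to measured success rates; this is a sketch of the measurement, not the paper's evaluation code.

```python
def recovery_ratio(finetune_success, peak_sr, original_steps):
    """Fraction of the original training budget needed during finetuning to
    regain the peak success rate.

    finetune_success: dict mapping finetuning step count -> success rate
                      (an assumed evaluation-log format).
    peak_sr: the peak success rate reached during original training.
    original_steps: number of steps used in the original training run.
    Returns None if the peak is never regained in the log."""
    for step in sorted(finetune_success):
        if finetune_success[step] >= peak_sr:
            return step / original_steps
    return None
```

For example, a policy that regains its peak success rate after 500 finetuning steps, having originally trained for 10,000 steps, has a recovery ratio of 0.05, i.e. well under the 10% threshold Pi0 consistently stays below.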

BibTeX

@article{liu2026continualvla,
  title   = {Pretrained Vision-Language-Action Models are Surprisingly Resistant to Forgetting in Continual Learning},
  author  = {Liu, Huihan and Kim, Changyeon and Liu, Bo and Liu, Minghuan and Zhu, Yuke},
  journal = {arXiv preprint},
  year    = {2026}
}