Action-and-wrench references for online contact adaptation

TORL-VLA: Tactile Guided Online Reinforcement Learning for Contact-Rich Manipulation

Huaihang Zheng1, Yi Yang1,4, Kai Ma1, Shenglin Xu1,2, Tian Xie1, Guozheng Li1,5, Xiangyu Wang1,3, Yiren Ma1, Si Liu3, Yinian Mao1, Baoxu Liu1

1Meituan   2Beijing Institute of Technology   3Beihang University   4State Key Lab of Multimodal Artificial Intelligence Systems, Institute of Automation, CAS   5China University of Mining and Technology (Beijing)

Overview

Turning contact-aware references into contact-adaptive execution

Contact-rich manipulation can fail when visually similar states hide different physical conditions: partial insertion, incomplete latch engagement, or excessive grasp force. TORL-VLA uses tactile-derived wrench feedback as physical cues for these contact bottlenecks during execution.

A wrench-aware VLA predicts both action references and future wrench sequences. Lightweight stage-specific actor-critic refiners adapt those references online, while an intervention-censored critic prevents post-intervention success from being over-credited to preceding policy actions.

Method

Method Overview

TORL-VLA uses a frozen wrench-aware VLA as an action-and-wrench reference model, then updates only lightweight stage-specific actor-critic refiners during online adaptation.

Reference

Wrench-aware VLA reference model

Three camera views, language, and robot state are encoded by the VLA, while a recent wrench-history token is fused after visual-language encoding to predict reference actions and future wrench sequences.

Refinement

Stage-specific online reference adaptation

The online refiner receives compact VLA tokens, Aref, Wref, current measured wrench, and recent wrench history for stage-specific actor-critic refinement.

Evaluation

Real-robot deployment and evaluation

The refined policy is evaluated on long-horizon latch-box manipulation using subtask success, full-task success, and 60-minute throughput.

Overview of the TORL-VLA framework.
TORL-VLA pipeline: a wrench-aware VLA predicts action-and-wrench references, stage-specific online actor-critic heads refine contact-window actions, and the refined policy is evaluated on long-horizon latch-box manipulation.

Tactile-Aware VLA as an Action-and-Wrench Reference Model

The reference model encodes recent tactile-derived wrench history as a compact token, fuses it after visual-language encoding, and jointly predicts action chunks and future wrench sequences.
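A joint action-and-wrench objective of this kind can be sketched as an action term plus a weighted future-wrench term. The sketch below is illustrative only: it assumes mean-squared error for both terms (the actual action head may use a different loss) and assumes that the 0.3 listed under "Future wrench horizon / loss" in the deployment table is the wrench-loss weight. The names `mse` and `joint_loss` are hypothetical.

```python
from typing import Sequence

Seq2D = Sequence[Sequence[float]]  # (steps x dims)

def mse(pred: Seq2D, target: Seq2D) -> float:
    """Mean-squared error over every element of a (steps x dims) sequence."""
    total, count = 0.0, 0
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            total += (p - t) ** 2
            count += 1
    return total / count

def joint_loss(pred_actions: Seq2D, target_actions: Seq2D,
               pred_wrench: Seq2D, measured_wrench: Seq2D,
               wrench_weight: float = 0.3) -> float:
    """Action loss plus weighted future-wrench loss.

    wrench_weight=0.3 is an ASSUMED reading of the deployment table,
    and MSE is a stand-in for whatever loss the action head really uses.
    """
    action_term = mse(pred_actions, target_actions)
    wrench_term = mse(pred_wrench, measured_wrench)
    return action_term + wrench_weight * wrench_term
```

With a weight below 1, the wrench term acts as an auxiliary signal and does not dominate the action objective.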

Tactile-to-wrench mapping.
Each raw fingertip tactile array is mapped to a calibrated 6D wrench; left and right fingertips are concatenated into the 12D wrench observation.
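As a minimal sketch of the observation layout described above, the code below concatenates two 6D fingertip wrenches into a 12D observation and keeps a rolling history. The history length of 10 comes from the deployment table; the component order (fx, fy, fz, tx, ty, tz) and the class and function names are assumptions for illustration. The calibrated tactile-to-wrench mapping itself is not reproduced here.

```python
from collections import deque
from typing import Deque, List, Sequence

WRENCH_DIM_PER_FINGER = 6  # assumed order: fx, fy, fz, tx, ty, tz
HISTORY_LEN = 10           # wrench history length from the deployment table

def concat_wrench(left: Sequence[float], right: Sequence[float]) -> List[float]:
    """Stack left and right fingertip wrenches into one 12D observation."""
    if len(left) != WRENCH_DIM_PER_FINGER or len(right) != WRENCH_DIM_PER_FINGER:
        raise ValueError("each fingertip wrench must have 6 components")
    return list(left) + list(right)

class WrenchHistory:
    """Fixed-length buffer of the most recent 12D wrench observations."""

    def __init__(self, maxlen: int = HISTORY_LEN) -> None:
        self._buf: Deque[List[float]] = deque(maxlen=maxlen)

    def push(self, left: Sequence[float], right: Sequence[float]) -> None:
        # Once the buffer is full, the oldest sample is dropped automatically.
        self._buf.append(concat_wrench(left, right))

    def as_list(self) -> List[List[float]]:
        """Return samples ordered oldest to newest."""
        return [list(w) for w in self._buf]
```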
Implementation details of the wrench-aware VLA reference model.
Wrench-aware reference model: the wrench-history token is fused after visual-language encoding through attention and MoE routing, with a zero-initialized bypass and joint action-wrench prediction head.
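The effect of a zero-initialized bypass can be illustrated with a toy residual projection: while the bypass weights are zero, the fused features equal the original features, so the pretrained behavior is unchanged at the start of training. This is a sketch under assumptions, not the authors' architecture. The source does not say where the bypass sits or what shape it has, and `bypass_fuse` and the dimensions used are hypothetical.

```python
from typing import List, Sequence

def zeros(rows: int, cols: int) -> List[List[float]]:
    """Zero-initialized weight matrix."""
    return [[0.0] * cols for _ in range(rows)]

def matvec(weights: Sequence[Sequence[float]], x: Sequence[float]) -> List[float]:
    """Plain matrix-vector product."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def bypass_fuse(features: Sequence[float],
                wrench_token: Sequence[float],
                bypass_weights: Sequence[Sequence[float]]) -> List[float]:
    """Add a linear projection of the wrench token to the features (residual form)."""
    residual = matvec(bypass_weights, wrench_token)
    return [f + r for f, r in zip(features, residual)]

features = [1.0, 2.0]          # toy stand-in for a visual-language feature
wrench_token = [0.5] * 12      # toy stand-in for the wrench-history token
weights = zeros(2, 12)

fused_at_init = bypass_fuse(features, wrench_token, weights)  # equals features
weights[0][0] = 2.0            # pretend training moved one weight away from zero
fused_after = bypass_fuse(features, wrench_token, weights)
```

Only after training moves the bypass weights away from zero does the wrench token change the fused features.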

Stage-Specific Online Reference Adaptation

The frozen wrench-aware VLA remains the reference policy. Stage II refines only routed contact-critical windows, using measured wrench feedback and predicted future-wrench cues while keeping the reference model fixed.

Step 1 · Reference interface

Action-and-wrench reference

The frozen wrench-aware VLA predicts a 50-step action sequence and future-wrench sequence; Stage II uses the first 10 steps as Aref and Wref before replanning.

Aref: executable action reference
Wref: aligned future-wrench cue
VLA token zt, proprioception qt, current wrench wt, recent measured wrench history
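The reference interface above amounts to receding-horizon chunking: predict 50 steps, keep the first 10 as Aref and Wref, execute them, then query again. The following sketch shows that slicing with a placeholder model. `placeholder_reference`, `take_reference`, and `run` are hypothetical names, and the placeholder outputs are dummy values, not VLA predictions.

```python
from typing import List, Tuple

H = 50           # reference horizon (steps predicted per query)
K = 10           # executed horizon (steps used before replanning)
ACTION_DIM = 7
WRENCH_DIM = 12
CONTROL_HZ = 20

Chunk = List[List[float]]

def placeholder_reference(query_index: int) -> Tuple[Chunk, Chunk]:
    """Hypothetical stand-in for the frozen VLA: H actions and H future wrenches."""
    actions = [[float(query_index)] * ACTION_DIM for _ in range(H)]
    wrenches = [[0.0] * WRENCH_DIM for _ in range(H)]
    return actions, wrenches

def take_reference(actions: Chunk, wrenches: Chunk, k: int = K) -> Tuple[Chunk, Chunk]:
    """Keep the first k aligned steps as (A_ref, W_ref)."""
    if len(actions) != len(wrenches):
        raise ValueError("action and wrench sequences must be aligned")
    if not 0 < k <= len(actions):
        raise ValueError("k must lie in 1..len(actions)")
    return actions[:k], wrenches[:k]

def run(num_queries: int) -> Tuple[int, float]:
    """Execute K steps per query; return (steps executed, seconds elapsed)."""
    steps = 0
    for q in range(num_queries):
        a_ref, w_ref = take_reference(*placeholder_reference(q))
        steps += len(a_ref)  # a real loop would send each step to the robot here
    return steps, steps / CONTROL_HZ
```

At 20 Hz, each 10-step chunk spans 0.5 s, which matches the replanning interval listed in the deployment table.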

Step 2 · Stage routing

Route each chunk

The stage estimator maps each chunk to base execution or contact-window refinement. Contact bottlenecks activate the matching actor head.
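The routing decision can be sketched as a dispatch over actor heads: a chunk with no matching head follows the frozen reference, and a contact-window chunk is handed to its stage actor, which outputs the refined chunk directly. The stage labels, the `context` dictionary, and the toy head below are hypothetical placeholders for illustration.

```python
from typing import Callable, Dict, List

Chunk = List[List[float]]
ActorHead = Callable[[dict, Chunk], Chunk]

def route_chunk(stage: str, a_ref: Chunk,
                actor_heads: Dict[str, ActorHead], context: dict) -> Chunk:
    """Return the chunk to execute for the estimated stage.

    Stages without a registered head (e.g. non-contact approach or
    transport) fall through to the frozen reference A_ref.
    """
    head = actor_heads.get(stage)
    if head is None:
        return a_ref
    return head(context, a_ref)

def toy_latch_head(context: dict, a_ref: Chunk) -> Chunk:
    """Toy actor: shifts the first action dimension by a context-supplied offset."""
    offset = context.get("offset", 0.0)
    return [[row[0] + offset] + row[1:] for row in a_ref]

heads = {"latch": toy_latch_head}
a_ref = [[0.0, 1.0], [0.0, 1.0]]
base_out = route_chunk("transport", a_ref, heads, {"offset": 0.1})
latch_out = route_chunk("latch", a_ref, heads, {"offset": 0.1})
```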

Base route: non-contact approach and transport follow the frozen VLA reference
Stage actor route: the selected actor head directly outputs Aφ
Actor inputs: zt + qt, Aref, Wref, wt, wrench history

Step 3 · Intervention-censored actor-critic learning

Learn from corrections without over-crediting

Qtask estimates task return, while Qic prevents success after human correction from being assigned to the preceding policy chunk.

Task critic task return
IC critic intervention-censored value
Behavior regularization anchors to Aref or human corrections
Latch-stage online reference adaptation during RL training.
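One plausible reading of the two critics is sketched below; it is an assumption for illustration, not the authors' stated update rule. Qtask uses an ordinary one-step bootstrapped target. Qic cuts the bootstrap whenever a human takes over right after the policy chunk and charges the intervention cost instead, so return earned after the correction never flows back to that chunk. The cost of 1.0 and discount of 0.99 come from the deployment table. The function names, the per-chunk discounting, and the exact form of the censored target are hypothetical.

```python
GAMMA = 0.99              # discount factor from the deployment table
INTERVENTION_COST = 1.0   # intervention cost from the deployment table

def task_target(reward: float, next_q_task: float, done: bool,
                gamma: float = GAMMA) -> float:
    """Standard one-step TD target for the task critic."""
    return reward + (0.0 if done else gamma * next_q_task)

def ic_target(reward: float, next_q_ic: float, done: bool,
              human_took_over: bool, gamma: float = GAMMA,
              cost: float = INTERVENTION_COST) -> float:
    """ASSUMED intervention-censored target.

    If a human correction follows this policy chunk, the target is the
    negative intervention cost and nothing is bootstrapped from the
    post-intervention states.
    """
    if human_took_over:
        return -cost
    return reward + (0.0 if done else gamma * next_q_ic)
```

Under this reading, a policy chunk followed by a human takeover receives a target of -1.0 regardless of how well the corrected rollout ends.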

Experiments

Latch-Box Benchmark

The experiments evaluate contact-centric subtasks, full-task execution, 60-minute throughput, and component ablations under the same robot platform, action representation, demonstrations, and evaluation protocol.

Experimental Setup

The benchmark contains three contact-centric subtasks: Coffee Cup for tight insertion under partial occlusion, Latch for mechanical engagement, and Egg for fragile-object grasping and placement. Full-task evaluation completes all stages in one autonomous rollout.

Robot platform with tactile sensors and camera views.
Real-robot latch-box setup with global, fisheye, and wrist camera views plus dual-fingertip tactile sensing.
Coffee cup placement, latch locking, and egg placement task visualization.
Representative contact-rich subtasks: coffee-cup placement, latch locking, and egg placement.

Evaluation Protocol

Success and failure are scored at the task-outcome level, not from internal wrench traces. Each method is evaluated over 30 autonomous trials for subtask and full-task success; final autonomous evaluation uses no human intervention, reset, safety-stop recovery, or online parameter update.

| Task | Success condition | Failure condition |
|---|---|---|
| Coffee Cup | Pick cup; insert into holder; release; cup remains upright and stable. | Holder miss; rim hang-up; partial insertion; large tilt; drop; unsafe contact. |
| Latch | Grip latch; flip toward locking direction; press into locked state; latch remains secured after release. | Missed latch; slip during flipping; missed locking edge; incomplete lock; rebound; unsafe force. |
| Egg | Grasp egg; transfer to holder; release; egg remains stable without visible damage. | Slip; drop; holder collision; rolling out after release; excessive grasping force; visible damage. |
| Full Task | Complete Coffee Cup, Latch, and Egg sequentially in one autonomous rollout. | Any subtask failure; human intervention; reset; safety stop; object drop; object damage. |

Main Results on Contact-Rich Tasks

Each method is evaluated over 30 autonomous trials. TORL-VLA improves full-task completion from 12/30 with the base π0.5 policy to 28/30 and achieves the lowest average time over successful full-task trials.

28/30 Full-task success
165.45 s Full-task average time
30/30 Coffee Cup
29/30 Latch
30/30 Egg
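| Method | Coffee Cup | Latch | Egg | Full-task success | Full-task average time |
|---|---|---|---|---|---|
| π0.5 | 18/30 | 15/30 | 20/30 | 12/30 | 199.65 s |
| TA-VLA | 19/30 | 17/30 | 20/30 | 12/30 | 204.45 s |
| ForceVLA | 21/30 | 20/30 | 22/30 | 15/30 | 195.34 s |
| TORL-VLA without RL | 25/30 | 23/30 | 25/30 | 21/30 | 191.91 s |
| RLT | 26/30 | 25/30 | 25/30 | 23/30 | 175.23 s |
| TORL-VLA | 30/30 | 29/30 | 30/30 | 28/30 | 165.45 s |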

RLT denotes the matched reference-guided online refinement baseline; unlike TORL-VLA, it does not use wrench input during online refinement and does not include the intervention-censored critic.
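The full-task counts in the results table convert directly into success rates by dividing by the 30 trials. The short sketch below does that arithmetic using only the reported counts. The dictionary keys are plain-text method labels chosen for illustration, and `success_rates` is a hypothetical helper.

```python
TRIALS = 30

# Full-task successes out of 30 autonomous trials, copied from the results table.
FULL_TASK_SUCCESSES = {
    "pi0.5": 12,
    "TA-VLA": 12,
    "ForceVLA": 15,
    "TORL-VLA without RL": 21,
    "RLT": 23,
    "TORL-VLA": 28,
}

def success_rates(successes: dict, trials: int = TRIALS) -> dict:
    """Convert success counts into rates rounded to three decimals."""
    return {method: round(count / trials, 3) for method, count in successes.items()}

rates = success_rates(FULL_TASK_SUCCESSES)
```

The resulting rates are 0.4 for π0.5 and 0.933 for TORL-VLA.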

Full-task success

Successful autonomous rollouts out of 30 trials.

π0.5: 12
TA-VLA: 12
ForceVLA: 15
TORL-VLA without RL: 21
RLT: 23
TORL-VLA: 28

Full-task average time

Lower is better; computed over successful full-task trials.

π0.5: 199.65 s
TA-VLA: 204.45 s
ForceVLA: 195.34 s
TORL-VLA without RL: 191.91 s
RLT: 175.23 s
TORL-VLA: 165.45 s

Reliability and throughput

Full-task success rate against successful full-task completions within a fixed 60-minute evaluation window.

Full-task success rate and throughput: TORL-VLA reaches a 0.933 full-task success rate and 21 successful completions per 60 minutes, outperforming all baselines. Axes: throughput (successes / 60 min) and full-task success rate.

TA-VLA: 0.40 success rate, 5 successes per 60 minutes.
π0.5: 0.40 success rate, 7 successes per 60 minutes.
ForceVLA: 0.50 success rate, 9 successes per 60 minutes.
TORL-VLA without RL: 0.70 success rate, 12 successes per 60 minutes.
RLT: 0.767 success rate, 16 successes per 60 minutes.
TORL-VLA: 0.933 success rate, 21 successes per 60 minutes.

Ablation Studies

Reference-model ablations isolate the wrench-history token, future-wrench prediction objective, MoE fusion, and physical bypass; online-adaptation ablations isolate wrench-context conditioning and the intervention-censored critic.

| Reference Model | Cup | Latch | Egg |
|---|---|---|---|
| Without wrench history token | 24/30 | 22/30 | 22/30 |
| Without future wrench prediction | 25/30 | 21/30 | 24/30 |
| Without MoE fusion | 18/30 | 17/30 | 19/30 |
| Without physical bypass | 23/30 | 20/30 | 21/30 |
| Full reference, without RL | 25/30 | 23/30 | 25/30 |

| Online Adaptation | Cup | Latch | Egg |
|---|---|---|---|
| Without wrench context | 27/30 | 27/30 | 26/30 |
| Without IC critic | 27/30 | 26/30 | 28/30 |
| Full TORL-VLA | 30/30 | 29/30 | 30/30 |

Online Adaptation on Latch

All online-adaptation variants use the same frozen wrench-aware reference model and matched online interaction budget. Removing wrench context or the IC critic slows adaptation and reduces final performance on the Latch stage.

Throughput curve during online adaptation.
Latch online adaptation: throughput measured over successive 10-minute windows as online data increases.
Success-rate curve during online adaptation.
Latch online adaptation: checkpoint success rate as online interaction data accumulates.

Analysis & Details

Analysis and Implementation Details

These materials inspect measured wrench evolution, future-wrench prediction, intervention-censored critic behavior, stage routing, and deployment implementation. They are process diagnostics and setup details, not additional final autonomous evaluation metrics.

Measured contact evolution

The measured right-fingertip wrench changes continuously during latch and cup contact. In the cup task, the gripper maintains normal force fz to hold the cup while applying downward force fy during insertion.

Measured contact evolution in cup insertion.
Cup insertion: right-fingertip force traces show normal holding force and downward insertion force during placement into the holder.
Measured contact evolution during initial latch grasping.
Initial latch-grasping phase with synchronized camera views, right-fingertip wrench, and action traces.
Measured contact evolution during latch flipping.
Latch-flipping phase showing continuous contact variation while the latch is being manipulated.

Future-wrench prediction examples

The future-wrench head predicts contact trends over the next 50 action steps. The examples cover mechanism contact, environmental insertion contact, and delicate-object contact, pairing current and future observations with predicted and measured wrench norms.

Latch beginning-of-locking observations. Latch beginning-of-locking future wrench norms.
Latch task: beginning of locking.
Latch end-of-locking observations. Latch end-of-locking future wrench norms.
Latch task: end of locking.
Cup grasping observations. Cup grasping future wrench norms.
Cup task: cup grasping.
Cup insertion observations. Cup insertion future wrench norms.
Cup task: insertion into the holder.
Egg beginning-of-grasping observations. Egg beginning-of-grasping future wrench norms.
Egg task: beginning of grasping.
Egg end-of-grasping observations. Egg end-of-grasping future wrench norms.
Egg task: end of grasping.
Six-dimensional future-wrench prediction.
Detailed six-dimensional future-wrench prediction on a latch-locking segment for the left-fingertip tactile sensor, complementing the norm-based examples above.

Intervention-censored critic analysis

These process-level analyses inspect the Latch-Lock contact window. They test whether Qic captures intervention-risk signals and whether the learned actor moves away from low-IC-value actions on policy-to-human boundary contexts.

| Boundary distribution | Value |
|---|---|
| Episodes with replay | 162 |
| Boundary contexts | 165 |
| Episodes ≥ 1 boundary | 125 / 162 = 77.2% |
| Episodes ≥ 2 boundaries | 32 / 162 = 19.8% |
| Episodes ≥ 3 boundaries | 8 / 162 = 4.9% |
| Mean / median per episode | 1.02 / 1 |
| 90th percentile / maximum | 2 / 3 |

| Actor behavior | Value |
|---|---|
| Logged Qic(c, Adata) | -1.003 |
| Reference Qic(c, Aref) | -0.573 |
| Final actor Qic(c, Aphi) | -0.271 |
| Delta Qic: actor vs. reference | +0.301 |
| Delta Qic: actor vs. logged data | +0.731 |
| IC-hinge-active boundary contexts | 21 / 165 = 12.73% |

| Critic / Action | Clean Mean | Boundary Mean | Clean-Boundary | AUC |
|---|---|---|---|---|
| Qic(c, Adata) | 0.073 | -1.003 | 1.076 | 0.999 |
| Qic(c, Aref) | -0.007 | -0.573 | 0.565 | 0.913 |
| Qic(c, Aphi) | 0.091 | -0.271 | 0.363 | 0.806 |
| Qtask(c, Adata) | 0.333 | 0.247 | 0.086 | 0.621 |
| Qtask(c, Aref) | 0.316 | 0.241 | 0.074 | 0.626 |
| Qtask(c, Aphi) | 0.351 | 0.272 | 0.079 | 0.618 |

| Actor training trend | Value |
|---|---|
| IC-hinge-active contexts at 10k | 89 / 165 = 53.94% |
| IC-hinge-active contexts at 20k | 66 / 165 = 40.00% |
| IC-hinge-active contexts at final 33k | 21 / 165 = 12.73% |
| 10k active contexts resolved by final actor | 71 / 89 = 79.8% |
| Mean margin improvement | +0.421 |

| Variant / Window | Human-Controlled Steps | Average Interventions | Intervention-Free Episodes |
|---|---|---|---|
| Without IC critic, last 100 ep. | 27.78% | 13.40 | 11.0% |
| Full model, last 100 ep. | 25.32% | 10.68 | 29.0% |
| Without IC critic, last 50 ep. | 25.42% | 13.10 | 12.0% |
| Full model, last 50 ep. | 18.12% | 7.48 | 40.0% |
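The AUC column in the critic table can be read as a separability score: how often a critic assigns a higher value to a clean context than to a boundary context. The sketch below computes that pairwise (Mann-Whitney) form of AUC. Treating clean contexts as the higher-scoring class is an assumption about the table's convention, and the scores in the example are made-up toy values, not data from the experiments.

```python
from typing import Sequence

def pairwise_auc(clean_scores: Sequence[float],
                 boundary_scores: Sequence[float]) -> float:
    """Probability that a random clean score exceeds a random boundary score.

    Ties count as half a win. A value of 1.0 means perfect separation and
    0.5 means the critic cannot tell the two groups apart.
    """
    if not clean_scores or not boundary_scores:
        raise ValueError("both groups must be non-empty")
    wins = 0.0
    for c in clean_scores:
        for b in boundary_scores:
            if c > b:
                wins += 1.0
            elif c == b:
                wins += 0.5
    return wins / (len(clean_scores) * len(boundary_scores))

# Toy values for illustration only (hypothetical, not measured).
toy_clean = [0.1, 0.0, -0.2]
toy_boundary = [-1.0, -0.9, -0.1]
toy_auc = pairwise_auc(toy_clean, toy_boundary)
```

Read this way, an AUC near 1.0 for Qic and near 0.62 for Qtask indicates that the intervention-censored critic separates boundary contexts far more sharply than the task critic does.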

Stage routing details
Stage estimator architecture.
Stage routing predicts a stage label and confidence from proprioception, visual-token, and wrench contexts. A route is accepted only after confidence and temporal-stability gating.
Full-task stage routing alignment between ground-truth annotations and estimated routes.
Full-task routing trace. Estimated routes align with ground-truth contact-window annotations; remaining mismatches concentrate near gradual phase boundaries rather than stable task stages.
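Confidence and temporal-stability gating can be sketched as a small state machine: a new route is accepted only after the estimator proposes it with sufficient confidence for several consecutive chunks. The threshold of 0.8, the window of 3, the stage labels, and the class name `RouteGate` are hypothetical values chosen for illustration; the source does not report the actual gating parameters.

```python
from typing import Optional

class RouteGate:
    """Accept a route change only after confident, repeated predictions."""

    def __init__(self, conf_threshold: float = 0.8, stable_steps: int = 3,
                 initial: str = "base") -> None:
        self.conf_threshold = conf_threshold  # hypothetical value
        self.stable_steps = stable_steps      # hypothetical value
        self.active = initial
        self._candidate: Optional[str] = None
        self._count = 0

    def update(self, label: str, confidence: float) -> str:
        """Feed one stage prediction; return the currently accepted route."""
        if confidence < self.conf_threshold or label == self.active:
            # Low-confidence predictions, or ones that agree with the active
            # route, clear any pending candidate.
            self._candidate, self._count = None, 0
            return self.active
        if label == self._candidate:
            self._count += 1
        else:
            self._candidate, self._count = label, 1
        if self._count >= self.stable_steps:
            self.active = label
            self._candidate, self._count = None, 0
        return self.active
```

Gating of this kind trades a short switching delay for fewer spurious route changes near gradual phase boundaries.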

Deployment and implementation details

| Reference / execution | Value |
|---|---|
| Reference horizon H | 50 |
| Executed horizon K | 10 |
| Action / wrench dimension | 7 / 12 |
| Robot command representation | delta chunk |
| Control frequency | 20 Hz |
| Replanning interval | 0.5 s |
| Wrench history length / window | 10 / 2.0 s |
| Future wrench horizon / loss | 50 / 0.3 |
| VLA token dimension | 2048 |

| Online learning / AC | Value |
|---|---|
| Discount factor gamma | 0.99 |
| Update-to-data ratio | 5 |
| Critic / actor update frequency | every step / every 2 steps |
| Target-network EMA tau | 0.005 |
| Actor snapshot / checkpoint interval | 100 / 1000 steps |
| Intervention cost / threshold | 1.0 / 0.5 |
| Warmup replay / updates | 300 chunks / 5k |
| Hidden dimension | 256 |
| AC layers | Cup 3/3, Egg grasp 3/3, Egg place 3/3, Latch 4/4 |
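Two of the online-learning settings above have a standard form that is easy to show in code: the target-network EMA with tau = 0.005, and the update schedule implied by an update-to-data ratio of 5 with the actor updated every second critic step. The sketch below assumes the ratio counts gradient updates per collected sample and that "every 2 steps" refers to gradient steps; both readings, and the function names, are assumptions.

```python
from typing import List, Sequence, Tuple

TAU = 0.005       # target-network EMA coefficient from the table
UTD_RATIO = 5     # update-to-data ratio from the table
ACTOR_EVERY = 2   # actor updated every 2 critic updates (assumed reading)

def ema_update(target: Sequence[float], online: Sequence[float],
               tau: float = TAU) -> List[float]:
    """Polyak averaging: move each target parameter a fraction tau toward online."""
    return [(1.0 - tau) * t + tau * o for t, o in zip(target, online)]

def count_updates(num_samples: int, utd: int = UTD_RATIO,
                  actor_every: int = ACTOR_EVERY) -> Tuple[int, int]:
    """Return (critic updates, actor updates) after num_samples collected samples."""
    critic_updates = num_samples * utd
    actor_updates = critic_updates // actor_every
    return critic_updates, actor_updates
```

With tau = 0.005, the target network moves 0.5% of the way toward the online network per update, which keeps bootstrapped targets slow-moving.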

| Module / Query | Timed Calls | Mean Latency |
|---|---|---|
| Base π0.5 policy query (comparison) | 30 | 94.7 ms |
| TORL-VLA wrench-aware reference query | 30 | 100.2 ms |
| Stage estimator routing | 30 | 1.7 ms |
| Stage-specific online actor | 30 | 1.1 ms |
| End-to-end TORL-VLA query with actor refinement | 30 | 103.0 ms |

Latency is measured over warm-call policy queries on a single RTX 4090. The end-to-end TORL-VLA row is measured directly on the actor-refinement route, not obtained by summing separately timed modules.
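The reported numbers allow a quick timing check: executing K = 10 steps at 20 Hz gives the 0.5 s replanning interval, and the 103.0 ms end-to-end query occupies roughly a fifth of that interval. The sketch below reproduces this arithmetic. Whether inference runs concurrently with execution is not stated in the source, so the fraction is only a rough budget indicator, and the helper names are hypothetical.

```python
def replanning_interval_s(executed_steps: int, control_hz: float) -> float:
    """Seconds of execution covered by one chunk of executed_steps."""
    return executed_steps / control_hz

def latency_fraction(latency_ms: float, executed_steps: int,
                     control_hz: float) -> float:
    """Share of the replanning interval taken by one policy query."""
    interval_ms = replanning_interval_s(executed_steps, control_hz) * 1000.0
    return latency_ms / interval_ms

interval = replanning_interval_s(10, 20.0)    # K = 10 steps at 20 Hz
fraction = latency_fraction(103.0, 10, 20.0)  # end-to-end TORL-VLA query
```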

Citation

Resources and BibTeX

TORL-VLA code will be released. Our earlier open-source RLT reproduction, Yyshadow/openpi-RLT, is already available for readers interested in reference-guided online RL.

@article{zheng2026torlvla,
  title   = {TORL-VLA: Tactile Guided Online Reinforcement Learning for Contact-Rich Manipulation},
  author  = {Zheng, Huaihang and Yang, Yi and Ma, Kai and Xu, Shenglin and Xie, Tian and Li, Guozheng and Wang, Xiangyu and Ma, Yiren and Liu, Si and Mao, Yinian and Liu, Baoxu},
  year    = {2026},
  journal = {arXiv preprint arXiv:2606.09337}
}