Action-and-wrench references for online contact adaptation

TORL-VLA: Tactile Guided Online Reinforcement Learning for Contact-Rich Manipulation

Huaihang Zheng¹, Yi Yang^1,4, Kai Ma¹, Shenglin Xu^1,2, Tian Xie¹, Guozheng Li^1,5, Xiangyu Wang^1,3, Yiren Ma¹, Si Liu³, Yinian Mao¹, Baoxu Liu¹

¹Meituan ²Beijing Institute of Technology ³Beihang University ⁴State Key Lab of Multimodal Artificial Intelligence Systems, Institute of Automation, CAS ⁵China University of Mining and Technology (Beijing)

PDF Paper arXiv Preprint Play Project Video Code Coming Soon Code Our RLT Repro Bib BibTeX

Project Video

0:00/0:00

Overview

Turning contact-aware references into contact-adaptive execution

Contact-rich manipulation can fail when visually similar states hide different physical conditions: partial insertion, incomplete latch engagement, or excessive grasp force. TORL-VLA uses tactile-derived wrench feedback as physical cues for these contact bottlenecks during execution.

A wrench-aware VLA predicts both action references and future wrench sequences. Lightweight stage-specific actor-critic refiners adapt those references online, while an intervention-censored critic prevents post-intervention success from being over-credited to preceding policy actions.

Method

Method Overview

TORL-VLA uses a frozen wrench-aware VLA as an action-and-wrench reference model, then updates only lightweight stage-specific actor-critic refiners during online adaptation.

Reference

Wrench-aware VLA reference model

Three camera views, language, and robot state are encoded by the VLA, while a recent wrench-history token is fused after visual-language encoding to predict reference actions and future wrench sequences.

Refinement

Stage online reference adaptation

The online refiner receives compact VLA tokens, A_ref, W_ref, current measured wrench, and recent wrench history for stage-specific actor-critic refinement.

Evaluation

Real-robot deployment and evaluation

Evaluates the refined policy on long-horizon latch-box manipulation with subtask success, full-task success, and 60-minute throughput.

Overview of the TORL-VLA framework. — TORL-VLA pipeline: a wrench-aware VLA predicts action-and-wrench references, stage-specific online actor-critic heads refine contact-window actions, and the refined policy is evaluated on long-horizon latch-box manipulation.

Tactile-Aware VLA as an Action-and-Wrench Reference Model

The reference model encodes recent tactile-derived wrench history as a compact token, fuses it after visual-language encoding, and jointly predicts action chunks and future wrench sequences.

Tactile-to-wrench mapping. — Each raw fingertip tactile array is mapped to a calibrated 6D wrench; left and right fingertips are concatenated into the 12D wrench observation.

Implementation details of the wrench-aware VLA reference model. — Wrench-aware reference model: the wrench-history token is fused after visual-language encoding through attention and MoE routing, with a zero-initialized bypass and joint action-wrench prediction head.

Experiments

Latch-Box Benchmark

The experiments evaluate contact-centric subtasks, full-task execution, 60-minute throughput, and component ablations under the same robot platform, action representation, demonstrations, and evaluation protocol.

Experimental Setup

The benchmark contains three contact-centric subtasks: Coffee Cup for tight insertion under partial occlusion, Latch for mechanical engagement, and Egg for fragile-object grasping and placement. Full-task evaluation completes all stages in one autonomous rollout.

Robot platform with tactile sensors and camera views. — Real-robot latch-box setup with global, fisheye, and wrist camera views plus dual-fingertip tactile sensing.

Coffee cup placement, latch locking, and egg placement task visualization. — Representative contact-rich subtasks: coffee-cup placement, latch locking, and egg placement.

Evaluation Protocol

Success and failure are scored at the task-outcome level, not from internal wrench traces. Each method is evaluated over 30 autonomous trials for subtask and full-task success; final autonomous evaluation uses no human intervention, reset, safety-stop recovery, or online parameter update.

Task	Success condition	Failure condition
Coffee Cup	Pick cup; insert into holder; release; cup remains upright and stable.	Holder miss; rim hang-up; partial insertion; large tilt; drop; unsafe contact.
Latch	Grip latch; flip toward locking direction; press into locked state; latch remains secured after release.	Missed latch; slip during flipping; missed locking edge; incomplete lock; rebound; unsafe force.
Egg	Grasp egg; transfer to holder; release; egg remains stable without visible damage.	Slip; drop; holder collision; rolling out after release; excessive grasping force; visible damage.
Full Task	Complete Coffee Cup, Latch, and Egg sequentially in one autonomous rollout.	Any subtask failure; human intervention; reset; safety stop; object drop; object damage.

Main Results on Contact-Rich Tasks

Each method is evaluated over 30 autonomous trials. TORL-VLA improves full-task completion from 12/30 with the base π_0.5 policy to 28/30 and achieves the lowest average time over successful full-task trials.

28/30 Full-task success

165.45 s Full-task average time

30/30 Coffee Cup

29/30 Latch

30/30 Egg

Method	Coffee Cup	Latch	Egg	Full-task success	Full-task average time
π_0.5	18/30	15/30	20/30	12/30	199.65 s
TA-VLA	19/30	17/30	20/30	12/30	204.45 s
ForceVLA	21/30	20/30	22/30	15/30	195.34 s
TORL-VLA without RL	25/30	23/30	25/30	21/30	191.91 s
RLT	26/30	25/30	25/30	23/30	175.23 s
TORL-VLA	30/30	29/30	30/30	28/30	165.45 s

RLT denotes the matched reference-guided online refinement baseline; unlike TORL-VLA, it does not use wrench input during online refinement and does not include the intervention-censored critic.

Full-task success

Successful autonomous rollouts out of 30 trials.

12π_0.5

12TA-VLA

15ForceVLA

21Without RL

23RLT

28TORL-VLA

Full-task average time

Lower is better; computed over successful full-task trials.

π_0.5199.65 s

TA-VLA204.45 s

ForceVLA195.34 s

Without RL191.91 s

RLT175.23 s

TORL-VLA165.45 s

Reliability and throughput

Full-task success rate against successful full-task completions within a fixed 60-minute evaluation window.

Ablation Studies

Reference-model ablations isolate the wrench-history token, future-wrench prediction objective, MoE fusion, and physical bypass; online-adaptation ablations isolate wrench-context conditioning and the intervention-censored critic.

Reference Model	Cup	Latch	Egg
Without wrench history token	24/30	22/30	22/30
Without future wrench prediction	25/30	21/30	24/30
Without MoE fusion	18/30	17/30	19/30
Without physical bypass	23/30	20/30	21/30
Full reference, without RL	25/30	23/30	25/30

Online Adaptation	Cup	Latch	Egg
Without wrench context	27/30	27/30	26/30
Without IC critic	27/30	26/30	28/30
Full TORL-VLA	30/30	29/30	30/30

Online Adaptation on Latch

All online-adaptation variants use the same frozen wrench-aware reference model and matched online interaction budget. Removing wrench context or the IC critic slows adaptation and reduces final performance on the Latch stage.

Throughput curve during online adaptation. — Latch online adaptation: throughput measured over successive 10-minute windows as online data increases.

Success-rate curve during online adaptation. — Latch online adaptation: checkpoint success rate as online interaction data accumulates.

Analysis & Details

Analysis and Implementation Details

These materials inspect measured wrench evolution, future-wrench prediction, intervention-censored critic behavior, stage routing, and deployment implementation. They are process diagnostics and setup details, not additional final autonomous evaluation metrics.

Measured contact evolution

The measured right-fingertip wrench changes continuously during latch and cup contact. In the cup task, the gripper maintains normal force f_z to hold the cup while applying downward force f_y during insertion.

Measured contact evolution during initial latch grasping. — Initial latch-grasping phase with synchronized camera views, right-fingertip wrench, and action traces.

Measured contact evolution during latch flipping. — Latch-flipping phase showing continuous contact variation while the latch is being manipulated.

Future-wrench prediction examples

The future-wrench head predicts contact trends over the next 50 action steps. The examples cover mechanism contact, environmental insertion contact, and delicate-object contact, pairing current and future observations with predicted and measured wrench norms.

Latch beginning-of-locking observations. — Latch task: beginning of locking.

Latch beginning-of-locking future wrench norms. — Latch task: beginning of locking.

Latch end-of-locking observations. — Latch task: end of locking.

Latch end-of-locking future wrench norms. — Latch task: end of locking.

Cup grasping observations. — Cup task: cup grasping.

Cup grasping future wrench norms. — Cup task: cup grasping.

Cup insertion observations. — Cup task: insertion into the holder.

Cup insertion future wrench norms. — Cup task: insertion into the holder.

Egg beginning-of-grasping observations. — Egg task: beginning of grasping.

Egg beginning-of-grasping future wrench norms. — Egg task: beginning of grasping.

Egg end-of-grasping observations. — Egg task: end of grasping.

Egg end-of-grasping future wrench norms. — Egg task: end of grasping.

Six-dimensional future-wrench prediction. — Detailed six-dimensional future-wrench prediction on a latch-locking segment for the left-fingertip tactile sensor, complementing the norm-based examples above.

Intervention-censored critic analysis

These process-level analyses inspect the Latch-Lock contact window. They test whether Q_ic captures intervention-risk signals and whether the learned actor moves away from low-IC-value actions on policy-to-human boundary contexts.

Boundary distribution	Value
Episodes with replay	162
Boundary contexts	165
Episodes ≥ 1 boundary	125 / 162 = 77.2%
Episodes ≥ 2 boundaries	32 / 162 = 19.8%
Episodes ≥ 3 boundaries	8 / 162 = 4.9%
Mean / median per episode	1.02 / 1
90th percentile / maximum	2 / 3

Actor behavior	Value
Logged Q_ic(c, A^data)	-1.003
Reference Q_ic(c, A^ref)	-0.573
Final actor Q_ic(c, A^phi)	-0.271
Delta Q_ic: actor vs. reference	+0.301
Delta Q_ic: actor vs. logged data	+0.731
IC-hinge-active boundary contexts	21 / 165 = 12.73%

Critic / Action	Clean Mean	Boundary Mean	Clean-Boundary	AUC
Q_ic(c, A^data)	0.073	-1.003	1.076	0.999
Q_ic(c, A^ref)	-0.007	-0.573	0.565	0.913
Q_ic(c, A^phi)	0.091	-0.271	0.363	0.806
Q_task(c, A^data)	0.333	0.247	0.086	0.621
Q_task(c, A^ref)	0.316	0.241	0.074	0.626
Q_task(c, A^phi)	0.351	0.272	0.079	0.618

Actor training trend	Value
IC-hinge-active contexts at 10k	89 / 165 = 53.94%
IC-hinge-active contexts at 20k	66 / 165 = 40.00%
IC-hinge-active contexts at final 33k	21 / 165 = 12.73%
10k active contexts resolved by final actor	71 / 89 = 79.8%
Mean margin improvement	+0.421

Variant / Window	Human-Controlled Steps	Average Interventions	Intervention-Free Episodes
Without IC critic, last 100 ep.	27.78%	13.40	11.0%
Full model, last 100 ep.	25.32%	10.68	29.0%
Without IC critic, last 50 ep.	25.42%	13.10	12.0%
Full model, last 50 ep.	18.12%	7.48	40.0%

Stage routing details

Stage estimator architecture. — Stage routing predicts a stage label and confidence from proprioception, visual-token, and wrench contexts. A route is accepted only after confidence and temporal-stability gating.

Full-task stage routing alignment between ground-truth annotations and estimated routes. — Full-task routing trace. Estimated routes align with ground-truth contact-window annotations; remaining mismatches concentrate near gradual phase boundaries rather than stable task stages.

Deployment and implementation details

Reference / execution	Value
Reference horizon H	50
Executed horizon K	10
Action / wrench dimension	7 / 12
Robot command representation	delta chunk
Control frequency	20 Hz
Replanning interval	0.5 s
Wrench history length / window	10 / 2.0 s
Future wrench horizon / loss	50 / 0.3
VLA token dimension	2048

Online learning / AC	Value
Discount factor gamma	0.99
Update-to-data ratio	5
Critic / actor update frequency	every step / every 2 steps
Target-network EMA tau	0.005
Actor snapshot / checkpoint interval	100 / 1000 steps
Intervention cost / threshold	1.0 / 0.5
Warmup replay / updates	300 chunks / 5k
Hidden dimension	256
AC layers	Cup 3/3, Egg grasp 3/3, Egg place 3/3, Latch 4/4

Module / Query	Timed Calls	Mean Latency
Base π_0.5 policy query (comparison)	30	94.7 ms
TORL-VLA wrench-aware reference query	30	100.2 ms
Stage estimator routing	30	1.7 ms
Stage-specific online actor	30	1.1 ms
End-to-end TORL-VLA query with actor refinement	30	103.0 ms

Latency is measured over warm-call policy queries on a single RTX 4090. The end-to-end TORL-VLA row is measured directly on the actor-refinement route, not obtained by summing separately timed modules.

Citation

Resources and BibTeX

TORL-VLA code will be released. Our earlier open-source RLT reproduction, Yyshadow/openpi-RLT, is already available for readers interested in reference-guided online RL.

PDF Paper PDF Supplementary arXiv Preprint Code Coming Soon Code Our RLT Reproduction

@article{zheng2026torlvla,
  title   = {TORL-VLA: Tactile Guided Online Reinforcement Learning for Contact-Rich Manipulation},
  author  = {Zheng, Huaihang and Yang, Yi and Ma, Kai and Xu, Shenglin and Xie, Tian and Li, Guozheng and Wang, Xiangyu and Ma, Yiren and Liu, Si and Mao, Yinian and Liu, Baoxu},
  year    = {2026},
  journal = {arXiv preprint arXiv:2606.09337}
}