TSD: A Physics-Inspired Trajectory Saliency Detector for Efficient Imitation Learning

Anonymous Authors


Abstract

In imitation learning for robotic manipulation, high data collection costs make high-quality data scarce. In this paper, we leverage the inherent heterogeneity of trajectories to address this challenge. Based on our observations of manipulation tasks, we categorize motions into transitional, precise, and agile types, and define the latter two as trajectory saliency because they are critical to task success, in contrast to the prevalent but less relevant transitional motions. We propose the Trajectory Saliency Detector (TSD), a training-free, plug-and-play framework for identifying trajectory saliency. TSD employs two physically grounded metrics: spatial entropy to capture fine-grained manipulation and centripetal acceleration to detect agile maneuvering. We further leverage TSD to develop a dataset compression method that reduces training costs and a dataset expansion strategy that improves data collection efficiency. Extensive experiments in both simulation and real-world settings show that models trained on TSD-condensed datasets achieve comparable or even superior performance with 25% less data on average. These results validate our dataset compression and expansion strategies and thereby confirm the utility of TSD. TSD thus offers a scalable and cost-effective pathway to synthesizing information-dense datasets for efficient robot learning.

Teaser image

Framework overview
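
The two metrics named in the abstract can be illustrated with a minimal sketch. The formulas here are assumptions made for illustration, not the paper's definitions: centripetal acceleration is taken as the component of acceleration normal to the velocity, |v × a| / |v|, estimated by finite differences, and spatial entropy is taken as the Shannon entropy of voxel occupancy of end-effector positions inside a sliding window. The function names and the parameters `dt`, `window`, and `voxel` are hypothetical.

```python
import numpy as np

def centripetal_acceleration(pos, dt):
    """Magnitude of the acceleration component normal to velocity.

    pos: (T, 3) array of end-effector positions; dt: sampling period (s).
    Assumed definition: |v x a| / |v|, with finite-difference derivatives.
    """
    vel = np.gradient(pos, dt, axis=0)
    acc = np.gradient(vel, dt, axis=0)
    speed = np.linalg.norm(vel, axis=1)
    cross = np.linalg.norm(np.cross(vel, acc), axis=1)
    # Guard against division by zero when the arm is at rest.
    return cross / np.maximum(speed, 1e-8)

def spatial_entropy(pos, window=20, voxel=0.01):
    """Shannon entropy (nats) of voxel occupancy in a sliding window.

    Assumed definition: positions are binned into cubes of side `voxel`,
    and the entropy of the bin histogram is computed around each timestep.
    """
    cells = np.floor(pos / voxel).astype(np.int64)
    ent = np.zeros(len(pos))
    for t in range(len(pos)):
        lo = max(0, t - window // 2)
        hi = min(len(pos), t + window // 2 + 1)
        _, counts = np.unique(cells[lo:hi], axis=0, return_counts=True)
        p = counts / counts.sum()
        ent[t] = -np.sum(p * np.log(p))
    return ent
```

Under these assumed definitions, a straight-line motion yields zero centripetal acceleration while a tight curve yields a large value, and a window in which the end effector lingers inside one small region yields low entropy. How TSD thresholds or combines the two signals is not specified here.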

Consistency of Saliency Detection

Through experimental analysis, we observe that the proposed TSD algorithm exhibits significant consistency in extracting critical task features, characterized by Numerical Consistency and Positional Consistency.

Consistency Convergence

Numerical Consistency signifies the stability of the detection results relative to the dataset size. Once the number of demonstrations reaches a fundamental threshold, the quantity and the spatiotemporal localization of identified precise and agile segments converge to a stable state.

Positional Consistency reflects the algorithm's robustness against spatial perturbations, such as objects being placed at random positions within a specific workspace.
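
One simple way to probe Numerical Consistency is to measure the number of detected segments at increasing dataset sizes and find the size from which that number stops changing. The sketch below is a hypothetical check, not the paper's procedure; the function name `convergence_threshold`, the tolerance `tol`, and the sample numbers are all illustrative assumptions.

```python
def convergence_threshold(sizes, counts, tol=0):
    """Smallest dataset size from which the detected segment count stays
    within `tol` of its final value; None if no measurements are given."""
    if not sizes:
        return None
    final = counts[-1]
    threshold = sizes[-1]
    # Walk backwards while the count remains close to the final value.
    for size, count in zip(reversed(sizes), reversed(counts)):
        if abs(count - final) > tol:
            break
        threshold = size
    return threshold

# Illustrative numbers only (not taken from the paper).
sizes = [10, 20, 40, 80, 160]   # number of demonstrations
counts = [5, 3, 2, 2, 2]        # detected segments at each size
print(convergence_threshold(sizes, counts))
```

The same check could be applied to segment positions instead of counts by replacing the absolute difference with a distance between segment boundaries.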

Agility convergence

Precision convergence

Consistency Visualization

The visualization shows the detected precise and agile segments and the corresponding visual frames, demonstrating TSD's accuracy in identifying key segments.

Experiments

Task settings

We validate the detection performance of TSD and the resulting model training effectiveness on five robomimic simulation tasks and three real-world manipulation tasks, covering single-arm and dual-arm scenarios.

  • Simulation: We report aggregate success rates over 450 trials across the last three training checkpoints.
  • Real-World: We conduct 30 real-robot trials with randomized locations to test the models' real-world adaptability.
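
As a small illustration of the simulation protocol, an aggregate success rate can be computed by pooling successes and trials over the evaluated checkpoints. This is a sketch under assumptions: pooling (rather than averaging per-checkpoint rates) is one reading of "aggregate", the even split of 150 trials per checkpoint is assumed, and the success counts are made up.

```python
def aggregate_success_rate(per_checkpoint):
    """Pool (successes, trials) pairs from several checkpoints into one rate."""
    successes = sum(s for s, _ in per_checkpoint)
    trials = sum(n for _, n in per_checkpoint)
    if trials == 0:
        raise ValueError("no trials recorded")
    return successes / trials

# Hypothetical counts: 150 trials per checkpoint is an assumed split of the
# 450 trials, and the per-checkpoint successes are illustrative only.
results = [(120, 150), (135, 150), (129, 150)]
rate = aggregate_success_rate(results)
print(f"{rate:.1%}")
```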

Experiment Results

| Setting | Task | Metric | Base | A1 | A2 (Ours) | A3 | B1 | B2 (Ours) | B3 |
|---|---|---|---|---|---|---|---|---|---|
| Sim | Can | Succ. | 53.3% | 72.8% | 82.0% | 83.1% | 91.7% | 92.0% | 93.7% |
| | | Size | 4617(19.9%) | 7999(34.5%) | 7884(33.9%) | 11561(49.8%) | 13863(59.7%) | 13321(57.4%) | 23207(100%) |
| | Lift | Succ. | 96.6% | 98.8% | 99.3% | 98.6% | 98.6% | 99.7% | 99.3% |
| | | Size | 1992(20.6%) | 3465(35.8%) | 3399(35.1%) | 4886(50.5%) | 5919(61.2%) | 5839(60.4%) | 9666(100%) |
| | Tool Hang | Succ. | 5.1% | 35.7% | 35.7% | 36.4% | 46.6% | 56.8% | 58.6% |
| | | Size | 18817(19.6%) | 37077(38.6%) | 34107(35.5%) | 46740(48.7%) | 65721(68.5%) | 64237(66.9%) | 95962(100%) |
| | Transport | Succ. | 24.6% | 49.5% | 69.5% | 73.3% | 88.8% | 95.3% | 89.3% |
| | | Size | 18837(20.1%) | 38774(41.2%) | 40188(42.8%) | 46902(49.2%) | 78032(83.2%) | 79044(84.3%) | 93752(100%) |
| | Square | Succ. | 11.1% | 66.2% | 68.8% | 68.6% | 73.3% | 76.8% | 77.5% |
| | | Size | 6060(20.1%) | 11922(39.5%) | 11526(38.2%) | 15125(50.5%) | 21021(69.7%) | 21327(70.7%) | 30154(100%) |
| Real | Tray Setting | Succ. | 0.0% | 46.7% | 60.0% | 53.3% | 76.7% | 86.6% | 83.3% |
| | | Size | 5335(21.0%) | 11502(45.3%) | 11418(45.0%) | 12819(50.6%) | 21516(84.9%) | 21527(84.9%) | 25350(100%) |
| | Water Stowing | Succ. | 40.0% | 53.3% | 56.7% | 63.3% | 70.0% | 70.0% | 76.6% |
| | | Size | 2108(20.5%) | 4146(40.4%) | 4194(40.8%) | 5167(50.3%) | 7767(75.6%) | 7789(75.8%) | 10269(100%) |
| | Book Fetching | Succ. | 46.7% | 53.3% | 60.0% | 66.7% | 76.7% | 93.3% | 93.3% |
| | | Size | 3661(19.7%) | 7889(42.5%) | 7854(42.3%) | 9232(49.7%) | 14886(80.1%) | 14921(80.3%) | 18578(100%) |

[1] Size(Ratio): Size refers to the total number of frames in the dataset. Ratio refers to its proportion of the full dataset.
[2] Bold values indicate the highest data efficiency (success rate per unit of data) within each group.
[3] Base: 20% of the full dataset; A1/B1: randomly sampled trajectories (size-matched to A2/B2); A2/B2: TSD-expanded (real) or TSD-compressed (sim) datasets, which integrate salient segments from the same number of trajectories as A3 and B3, respectively; A3/B3: 50% (A3) and 100% (B3) of the full dataset.
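
The data efficiency mentioned in note [2] (success rate per unit of data) can be recomputed from the table. The sketch below uses the Can row; normalizing per 1,000 frames and grouping the models as {A1, A2, A3} and {B1, B2, B3} are assumptions made for illustration, and the helper names are hypothetical.

```python
# Success rate (%) and dataset size (frames) for the Can task, from the table.
can = {
    "Base": (53.3, 4617),
    "A1": (72.8, 7999),
    "A2 (Ours)": (82.0, 7884),
    "A3": (83.1, 11561),
    "B1": (91.7, 13863),
    "B2 (Ours)": (92.0, 13321),
    "B3": (93.7, 23207),
}

def data_efficiency(success_pct, frames, per=1000):
    """Success-rate percentage points per `per` frames (assumed normalization)."""
    return success_pct / frames * per

def most_efficient(results, group):
    """Name of the model in `group` with the highest data efficiency."""
    return max(group, key=lambda name: data_efficiency(*results[name]))

for name, (succ, frames) in can.items():
    print(f"{name:10s} {data_efficiency(succ, frames):.2f}")

# Assumed grouping: A-models compared with each other, B-models likewise.
print(most_efficient(can, ["A1", "A2 (Ours)", "A3"]))
print(most_efficient(can, ["B1", "B2 (Ours)", "B3"]))
```

The same dictionary layout can be filled in for any other row of the table to compare models within a task.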

Per-task panels, one per model (Base, A1, A2 (Ours), A3, B1, B2 (Ours), B3): can, lift, transport, tool hang, square, book fetching, water stowing, tray setting.

TSD Test Results Visualization

Per-task panels: can, lift, tool hang, square, transport (left arm, right arm), book fetching, water stowing, tray setting (left arm, right arm).

Ablation Study

Ablation Experiment Results

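| Task | Metric | B2 (Ours) | C1 | C2 | D1 | D2 |
|---|---|---|---|---|---|---|
| Book Fetching | Succ. | 93.3% | 70.0% | 63.3% | 50.0% | 40.0% |
| | Size | 14921 | 13015 | 12938 | 7110 | 7220 |
| Tool Hang | Succ. | 56.8% | 46.2% | 52.9% | 30% | 16% |
| | Size | 64237 | 60930 | 62283 | 22870 | 22179 |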

[1] Size: Size refers to the total number of frames in the dataset.
[2] Bold values indicate the highest data efficiency (success rate per unit of data) within each group.
[3] C1: a complete dataset of the same size as C2;
[4] C2: B2 without agile segments;
[5] D1: a complete dataset of the same size as D2;
[6] D2: B2 without precise segments.
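
To read the ablation numbers, it can help to compute the gap in success rate between pairs of models, for example each ablated dataset against B2 or against its size-matched control (C1 for C2, D1 for D2). The sketch below copies the success rates from the table; the helper name `gap` is hypothetical, and no interpretation beyond the arithmetic is implied.

```python
# Success rates (%) copied from the ablation table.
ablation = {
    "Book Fetching": {"B2": 93.3, "C1": 70.0, "C2": 63.3, "D1": 50.0, "D2": 40.0},
    "Tool Hang": {"B2": 56.8, "C1": 46.2, "C2": 52.9, "D1": 30.0, "D2": 16.0},
}

def gap(task, model, reference):
    """Success-rate difference (percentage points) of `model` minus `reference`."""
    rates = ablation[task]
    return round(rates[model] - rates[reference], 1)

for task in ablation:
    print(
        task,
        "| C2 vs B2:", gap(task, "C2", "B2"),
        "| D2 vs B2:", gap(task, "D2", "B2"),
        "| C2 vs C1:", gap(task, "C2", "C1"),
        "| D2 vs D1:", gap(task, "D2", "D1"),
    )
```

Because C2 and D2 are smaller than B2, the comparison against B2 mixes the effect of removing segments with the effect of dataset size; the comparison against the size-matched C1 and D1 isolates the former.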

Visualization

Typical failures of models trained on datasets lacking specific segments.

  • Dataset w/o. agile segments: The model struggles to navigate around obstacles, leading to collisions and task failures.
  • Dataset w/o. precise segments: The model fails to grasp objects accurately, resulting in dropped items and unsuccessful task completion.

Dataset w/o. agile segments

Dataset w/o. precise segments

Per-task panels for Book Fetching and Tool Hang, one per model: B2 (Ours), C1, C2 (w/o agile), D1, D2 (w/o precise).