For imitation learning in robotic manipulation, high data collection costs make high-quality data scarce. In this paper, we leverage the inherent heterogeneity of trajectories to address this challenge. Based on our observations of manipulation tasks, we categorize motions into transitional, precise, and agile types, and define the latter two as trajectory saliency because they are critical to task success, in contrast to the prevalent but less relevant transitional motions. We therefore propose the Trajectory Saliency Detector (TSD), a training-free, plug-and-play framework for identifying trajectory saliency. TSD employs two physically grounded metrics: spatial entropy to capture fine-grained manipulation and centripetal acceleration to detect agile maneuvering. We further leverage TSD to develop a dataset compression method that reduces training costs and a dataset expansion strategy that improves data collection efficiency. Extensive experiments in both simulation and real-world settings demonstrate that models trained on TSD-condensed datasets achieve comparable or even superior performance with 25% less data on average. These results validate our dataset compression and expansion strategies and thereby confirm the utility of TSD. Consequently, TSD offers a scalable and cost-effective pathway to synthesize information-dense datasets for efficient robot learning.
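To make the two metrics named above concrete, here is a minimal sketch of how per-frame centripetal acceleration and a windowed spatial entropy could be computed from end-effector positions. The exact definitions used by TSD are not given in this text, so everything here is an assumption: the function names, the `|v x a| / |v|` form of centripetal acceleration, the voxel-occupancy form of spatial entropy, and the `window` and `cell` parameters are illustrative only.

```python
import numpy as np

def centripetal_acceleration(pos, dt):
    """Normal (centripetal) acceleration magnitude |v x a| / |v| per frame.

    pos: (T, 3) array of end-effector positions; dt: sampling period in seconds.
    Assumed formulation, not necessarily the one used by TSD.
    """
    vel = np.gradient(pos, dt, axis=0)   # finite-difference velocity
    acc = np.gradient(vel, dt, axis=0)   # finite-difference acceleration
    speed = np.linalg.norm(vel, axis=1)
    normal = np.linalg.norm(np.cross(vel, acc), axis=1)
    return normal / np.maximum(speed, 1e-8)  # guard against zero speed

def spatial_entropy(pos, window, cell):
    """Shannon entropy of voxel occupancy inside a sliding window.

    window: number of frames per window; cell: voxel edge length.
    Both parameters are hypothetical; the voxel-histogram definition is an assumption.
    """
    n = len(pos)
    half = window // 2
    out = np.zeros(n)
    for i in range(n):
        seg = pos[max(0, i - half): i + half + 1]
        voxels = np.floor(seg / cell).astype(int)
        _, counts = np.unique(voxels, axis=0, return_counts=True)
        p = counts / counts.sum()
        out[i] = -(p * np.log(p)).sum()
    return out
```

Under these assumptions, a trajectory that stays inside one voxel has zero spatial entropy, and uniform circular motion of radius r at angular speed w has centripetal acceleration close to r * w**2. How TSD thresholds or combines such signals into precise and agile segments is not specified here.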
Through experimental analysis, we observe that the proposed TSD algorithm exhibits significant consistency in extracting critical task features, characterized by Numerical Consistency and Positional Consistency.
Numerical Consistency signifies the stability of the detection results relative to the dataset size. Once the number of demonstrations reaches a fundamental threshold, the quantity and the spatiotemporal localization of identified precise and agile segments converge to a stable state.
Positional Consistency reflects the algorithm's robustness against spatial perturbations. Objects are often placed randomly within a specific workspace.
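One way to check the Numerical Consistency property described above is to run the detector on growing subsets of the demonstrations and find the dataset size from which the number of detected segments stops changing. The sketch below assumes a hypothetical helper name (`convergence_threshold`) and a hypothetical tolerance parameter; the counts in the usage example are made-up placeholders for illustration, not results from the experiments.

```python
def convergence_threshold(sizes, counts, tol=0):
    """Smallest dataset size from which the segment count stays within
    `tol` of the final count for all larger sizes; None if it never settles.

    sizes:  increasing numbers of demonstrations
    counts: number of detected segments at each size
    """
    final = counts[-1]
    threshold = None
    for size, count in zip(sizes, counts):
        if abs(count - final) <= tol:
            if threshold is None:
                threshold = size   # start of a stable run
        else:
            threshold = None       # run broken, reset
    return threshold

# Placeholder numbers, purely illustrative.
sizes = [5, 10, 20, 40, 80]
counts = [1, 3, 2, 2, 2]
print(convergence_threshold(sizes, counts))  # 20
```

The same check could be applied to segment start and end frames instead of counts to probe the spatiotemporal localization mentioned above.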
The visualization shows the detected precise and agile segments and the corresponding visual frames, demonstrating TSD's accuracy in identifying key segments.
We validate the detection performance of TSD and the corresponding model training effectiveness across five robomimic simulation tasks and three real-world manipulation tasks, including single-arm and dual-arm scenarios.
| Domain | Task | Metric | Base | A1 | A2 (Ours) | A3 | B1 | B2 (Ours) | B3 |
|---|---|---|---|---|---|---|---|---|---|
| Sim | Can | Succ. | 53.3% | 72.8% | 82.0% | 83.1% | 91.7% | 92.0% | 93.7% |
| | | Size (Ratio) | 4617 (19.9%) | 7999 (34.5%) | 7884 (33.9%) | 11561 (49.8%) | 13863 (59.7%) | 13321 (57.4%) | 23207 (100%) |
| | Lift | Succ. | 96.6% | 98.8% | 99.3% | 98.6% | 98.6% | 99.7% | 99.3% |
| | | Size (Ratio) | 1992 (20.6%) | 3465 (35.8%) | 3399 (35.1%) | 4886 (50.5%) | 5919 (61.2%) | 5839 (60.4%) | 9666 (100%) |
| | Tool Hang | Succ. | 5.1% | 35.7% | 35.7% | 36.4% | 46.6% | 56.8% | 58.6% |
| | | Size (Ratio) | 18817 (19.6%) | 37077 (38.6%) | 34107 (35.5%) | 46740 (48.7%) | 65721 (68.5%) | 64237 (66.9%) | 95962 (100%) |
| | Transport | Succ. | 24.6% | 49.5% | 69.5% | 73.3% | 88.8% | 95.3% | 89.3% |
| | | Size (Ratio) | 18837 (20.1%) | 38774 (41.2%) | 40188 (42.8%) | 46902 (49.2%) | 78032 (83.2%) | 79044 (84.3%) | 93752 (100%) |
| | Square | Succ. | 11.1% | 66.2% | 68.8% | 68.6% | 73.3% | 76.8% | 77.5% |
| | | Size (Ratio) | 6060 (20.1%) | 11922 (39.5%) | 11526 (38.2%) | 15125 (50.5%) | 21021 (69.7%) | 21327 (70.7%) | 30154 (100%) |
| Real | Tray Setting | Succ. | 0.0% | 46.7% | 60.0% | 53.3% | 76.7% | 86.6% | 83.3% |
| | | Size (Ratio) | 5335 (21.0%) | 11502 (45.3%) | 11418 (45.0%) | 12819 (50.6%) | 21516 (84.9%) | 21527 (84.9%) | 25350 (100%) |
| | Water Stowing | Succ. | 40.0% | 53.3% | 56.7% | 63.3% | 70.0% | 70.0% | 76.6% |
| | | Size (Ratio) | 2108 (20.5%) | 4146 (40.4%) | 4194 (40.8%) | 5167 (50.3%) | 7767 (75.6%) | 7789 (75.8%) | 10269 (100%) |
| | Book Fetching | Succ. | 46.7% | 53.3% | 60.0% | 66.7% | 76.7% | 93.3% | 93.3% |
| | | Size (Ratio) | 3661 (19.7%) | 7889 (42.5%) | 7854 (42.3%) | 9232 (49.7%) | 14886 (80.1%) | 14921 (80.3%) | 18578 (100%) |
[1] Size (Ratio): Size refers to the total number of frames in the dataset; Ratio refers to its proportion of the full dataset.
[2] Bold values indicate the highest data efficiency (success rate per unit of data) within each group.
[3] Base: 20% of the full dataset. A1/B1: randomly sampled trajectories (size-matched to A2/B2). A2/B2: TSD-expanded (real) / TSD-compressed (sim) datasets, integrating salient segments from the same number of trajectories as A3 and B3, respectively. A3/B3: 50% (A) and 100% (B) of the full dataset.
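The data-efficiency criterion in note [2] (success rate per unit of data) can be reproduced with a few lines of arithmetic. The sketch below uses the Can row of the table above; the per-1000-frames scaling and the function names are assumptions chosen for readability, since the notes do not fix a unit.

```python
# Success rate (%) and dataset size (frames) for the Can task,
# copied from the table above.
can = {
    "A1": (72.8, 7999), "A2": (82.0, 7884), "A3": (83.1, 11561),
    "B1": (91.7, 13863), "B2": (92.0, 13321), "B3": (93.7, 23207),
}

def efficiency(succ_pct, frames):
    """Success-rate points per 1000 frames (the scaling is an assumption)."""
    return succ_pct / (frames / 1000.0)

def best_in_group(results, group):
    """Model with the highest efficiency among names starting with `group`."""
    members = {k: efficiency(*v) for k, v in results.items() if k.startswith(group)}
    return max(members, key=members.get)

print(best_in_group(can, "A"), best_in_group(can, "B"))  # A2 B2
```

For this row, the most data-efficient entries under this definition are A2 in group A and B2 in group B.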
Model panels, one per task and model: can, lift, transport, tool hang, square, book fetching, and water stowing, each shown for Base, A1, A2 (Ours), A3, B1, B2 (Ours), and B3.
Task panels: can, lift, tool hang, square, transport (left arm), transport (right arm), book fetching, and water stowing; two further panels are labeled left arm and right arm.
| Task | Metric | B2 (Ours) | C1 | C2 | D1 | D2 |
|---|---|---|---|---|---|---|
| Book Fetching | Succ. | 93.3% | 70.0% | 63.3% | 50.0% | 40.0% |
| | Size | 14921 | 13015 | 12938 | 7110 | 7220 |
| Tool Hang | Succ. | 56.8% | 46.2% | 52.9% | 30% | 16% |
| | Size | 64237 | 60930 | 62283 | 22870 | 22179 |
[1] Size: Size refers to the total number of frames in the dataset.
[2] Bold values indicate the highest data efficiency (success rate per unit of data) within each group.
[3] C1: a complete dataset of the same size as C2.
[4] C2: B2 without agile segments.
[5] D1: a complete dataset of the same size as D2.
[6] D2: B2 without precise segments.
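Because C1 and D1 are size-matched controls for C2 and D2, the effect of removing a segment type can be read as the difference in success rate between each ablated dataset and its control. The following sketch computes that difference from the success rates in the ablation table above; the function and key names are hypothetical.

```python
# Success rates (%) copied from the ablation table above.
ablation = {
    "Book Fetching": {"C1": 70.0, "C2": 63.3, "D1": 50.0, "D2": 40.0},
    "Tool Hang":     {"C1": 46.2, "C2": 52.9, "D1": 30.0, "D2": 16.0},
}

def segment_effect(row):
    """Ablated minus size-matched control, in percentage points.

    Negative values mean the ablated dataset underperforms its control.
    """
    return {
        "w/o agile": round(row["C2"] - row["C1"], 1),
        "w/o precise": round(row["D2"] - row["D1"], 1),
    }

for task, row in ablation.items():
    print(task, segment_effect(row))
```

Comparing against the size-matched control rather than against B2 separates the effect of the missing segment type from the effect of simply having fewer frames.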
Typical failures of models trained on datasets lacking specific segments.
Panels: Book Fetching and Tool Hang, each shown for B2 (Ours), C1, C2 (w/o agile), D1, and D2 (w/o precise).