TSD: A Physics-Inspired Trajectory Saliency Detector for Efficient Imitation Learning

Anonymous Authors

Abstract

For imitation learning in robotic manipulation, high data collection costs make high-quality data scarce. In this paper, we leverage the inherent heterogeneity of trajectories to address this challenge. Based on our observations of manipulation tasks, we categorize motions into transitional, precise, and agile types, and define the latter two as trajectory saliency because they are critical to task success, in contrast to the prevalent but less relevant transitional motions. We therefore propose the Trajectory Saliency Detector (TSD), a training-free, plug-and-play framework for identifying trajectory saliency. TSD employs two physically grounded metrics: spatial entropy to capture fine-grained manipulation and centripetal acceleration to detect agile maneuvering. We further leverage TSD to develop a dataset compression method that reduces training costs and a dataset expansion strategy that improves data collection efficiency. Extensive experiments in both simulation and real-world settings demonstrate that models trained on TSD-condensed datasets achieve comparable or even superior performance with 25% less data on average. These results validate the effectiveness of our dataset compression and expansion strategies and thereby confirm the utility of TSD. Consequently, TSD offers a scalable and cost-effective pathway to synthesize information-dense datasets for efficient robot learning.
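
To make the two metrics named above concrete, the following is a minimal sketch and not the TSD implementation. It assumes end-effector positions sampled at a fixed interval `dt`, estimates centripetal acceleration with the standard kinematic identity |v × a| / |v| using central differences, and treats spatial entropy as the Shannon entropy of the grid cells occupied within a window of positions. The function names, the `cell_size` parameter, and the grid-based reading of spatial entropy are all assumptions made for illustration; the exact definitions and thresholds used by TSD are not specified here.

```python
import math
from collections import Counter


def centripetal_acceleration(points, dt):
    """Centripetal acceleration |v x a| / |v| at each interior sample.

    points: list of (x, y, z) positions sampled every dt seconds (assumed input).
    Velocity and acceleration come from central finite differences, so the
    first and last samples are skipped.
    """
    result = []
    for prev, cur, nxt in zip(points, points[1:], points[2:]):
        v = [(n - p) / (2 * dt) for p, n in zip(prev, nxt)]
        a = [(n - 2 * c + p) / dt ** 2 for p, c, n in zip(prev, cur, nxt)]
        cross = (
            v[1] * a[2] - v[2] * a[1],
            v[2] * a[0] - v[0] * a[2],
            v[0] * a[1] - v[1] * a[0],
        )
        speed = math.sqrt(sum(x * x for x in v))
        cross_norm = math.sqrt(sum(x * x for x in cross))
        # A (nearly) stationary sample has no well-defined curvature.
        result.append(cross_norm / speed if speed > 1e-9 else 0.0)
    return result


def spatial_entropy(points, cell_size):
    """Shannon entropy (in bits) of the grid cells occupied by the positions.

    This grid-occupancy definition is an illustrative assumption.
    """
    cells = Counter(tuple(math.floor(x / cell_size) for x in p) for p in points)
    total = sum(cells.values())
    return -sum((c / total) * math.log2(c / total) for c in cells.values())
```

As a sanity check, uniform circular motion with radius r and angular velocity ω should yield a centripetal acceleration close to rω², and a window whose positions all fall in one grid cell has zero entropy under this definition.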

Through experimental analysis, we observe that the proposed TSD algorithm exhibits significant consistency in extracting critical task features, characterized by Numerical Consistency and Positional Consistency.

Numerical Consistency signifies the stability of the detection results relative to the dataset size. Once the number of demonstrations reaches a fundamental threshold, the quantity and the spatiotemporal localization of identified precise and agile segments converge to a stable state.

Positional Consistency reflects the algorithm's robustness against spatial perturbations, such as objects being placed randomly within a specific workspace.

The visualization shows the detected precise and agile segments and the corresponding visual frames, demonstrating TSD's accuracy in identifying key segments.

Simulation: Success rates are aggregated over 450 trials across the last three training checkpoints.
Real-World: We conducted 30 real-robot trials with randomized locations to test the models' real-world adaptability.
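
The simulation protocol above pools trials from several checkpoints into one success rate. The sketch below shows one way such a pooled rate could be computed. It assumes the 450 trials are split evenly, 150 per checkpoint, which is not stated in the text; the function name and the success counts are hypothetical and are not results from the paper.

```python
def pooled_success_rate(successes_per_checkpoint, trials_per_checkpoint):
    """Pool successes over checkpoints and return a percentage."""
    total_successes = sum(successes_per_checkpoint)
    total_trials = trials_per_checkpoint * len(successes_per_checkpoint)
    return 100.0 * total_successes / total_trials


# Hypothetical success counts for three checkpoints (illustration only).
rate = pooled_success_rate([100, 110, 105], trials_per_checkpoint=150)
```

Pooling the raw counts rather than averaging per-checkpoint percentages gives the same result when every checkpoint has the same number of trials, as assumed here.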

Experiment Results

Domain	Task	Metric	Base	A1	A2 (Ours)	A3	B1	B2 (Ours)	B3
Sim	Can	Succ.	53.3%	72.8%	82.0%	83.1%	91.7%	92.0%	93.7%
	Can	Size	4617(19.9%)	7999(34.5%)	7884(33.9%)	11561(49.8%)	13863(59.7%)	13321(57.4%)	23207(100%)
	Lift	Succ.	96.6%	98.8%	99.3%	98.6%	98.6%	99.7%	99.3%
	Lift	Size	1992(20.6%)	3465(35.8%)	3399(35.1%)	4886(50.5%)	5919(61.2%)	5839(60.4%)	9666(100%)
	Tool Hang	Succ.	5.1%	35.7%	35.7%	36.4%	46.6%	56.8%	58.6%
	Tool Hang	Size	18817(19.6%)	37077(38.6%)	34107(35.5%)	46740(48.7%)	65721(68.5%)	64237(66.9%)	95962(100%)
	Transport	Succ.	24.6%	49.5%	69.5%	73.3%	88.8%	95.3%	89.3%
	Transport	Size	18837(20.1%)	38774(41.2%)	40188(42.8%)	46902(49.2%)	78032(83.2%)	79044(84.3%)	93752(100%)
	Square	Succ.	11.1%	66.2%	68.8%	68.6%	73.3%	76.8%	77.5%
	Square	Size	6060(20.1%)	11922(39.5%)	11526(38.2%)	15125(50.5%)	21021(69.7%)	21327(70.7%)	30154(100%)
Real	Tray Setting	Succ.	0.0%	46.7%	60.0%	53.3%	76.7%	86.6%	83.3%
	Tray Setting	Size	5335(21.0%)	11502(45.3%)	11418(45.0%)	12819(50.6%)	21516(84.9%)	21527(84.9%)	25350(100%)
	Water Stowing	Succ.	40.0%	53.3%	56.7%	63.3%	70.0%	70.0%	76.6%
	Water Stowing	Size	2108(20.5%)	4146(40.4%)	4194(40.8%)	5167(50.3%)	7767(75.6%)	7789(75.8%)	10269(100%)
	Book Fetching	Succ.	46.7%	53.3%	60.0%	66.7%	76.7%	93.3%	93.3%
	Book Fetching	Size	3661(19.7%)	7889(42.5%)	7854(42.3%)	9232(49.7%)	14886(80.1%)	14921(80.3%)	18578(100%)

[1] Size(Ratio): Size refers to the total number of frames in the dataset; Ratio refers to its proportion of the full dataset.
[2] Bold values indicate the highest data efficiency (success rate per unit of data) within each group.
[3] Base: 20% of the full dataset; A1/B1: randomly sampled trajectories (size-matched to A2/B2); A2/B2: TSD-expanded (real) / TSD-compressed (sim) datasets that integrate salient segments from the same number of trajectories as A3 and B3, respectively; A3/B3: 50% (A) and 100% (B) of the full dataset.
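
Note [2] describes data efficiency as success rate per unit of data. As a worked example, the sketch below computes it as the success rate divided by the dataset ratio, using the Can rows of the table. This ratio-based formula and the grouping into A (A1 to A3) and B (B1 to B3) are our reading of the notes, not a definition given in the text, and the helper names are hypothetical.

```python
# Can task, copied from the table: (success rate %, share of full dataset %).
can = {
    "A1": (72.8, 34.5), "A2": (82.0, 33.9), "A3": (83.1, 49.8),
    "B1": (91.7, 59.7), "B2": (92.0, 57.4), "B3": (93.7, 100.0),
}


def data_efficiency(success_pct, ratio_pct):
    """Success rate per percentage point of the full dataset (assumed definition)."""
    return success_pct / ratio_pct


def best_in_group(results, group):
    """Return the model with the highest data efficiency among those named group*."""
    members = {
        name: data_efficiency(*values)
        for name, values in results.items()
        if name.startswith(group)
    }
    return max(members, key=members.get)
```

Under this reading, `best_in_group(can, "A")` and `best_in_group(can, "B")` select A2 and B2 respectively: A2 reaches roughly 2.42 points of success rate per percentage point of data, compared with about 2.11 for A1 and 1.67 for A3.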

Task	Metric	B2 (Ours)	C1	C2	D1	D2
Book Fetching	Succ.	93.3%	70.0%	63.3%	50.0%	40.0%
Book Fetching	Size	14921	13015	12938	7110	7220
Tool Hang	Succ.	56.8%	46.2%	52.9%	30%	16%
Tool Hang	Size	64237	60930	62283	22870	22179

[1] Size: Size refers to the total number of frames in the dataset.
[2] Bold values indicate the highest data efficiency (success rate per unit of data) within each group.
[3] C1: a complete dataset of the same size as C2;
[4] C2: B2 without agile segments;
[5] D1: a complete dataset of the same size as D2;
[6] D2: B2 without precise segments.
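
To read the ablation table at a glance, the sketch below computes how many percentage points of success rate are lost when a segment type is removed, both relative to B2 and relative to the size-matched complete dataset (C1 or D1). The numbers are copied from the table; the dictionary layout and the function name are illustrative choices.

```python
# Success rates (%) copied from the ablation table.
ablation = {
    "Book Fetching": {"B2": 93.3, "C1": 70.0, "C2": 63.3, "D1": 50.0, "D2": 40.0},
    "Tool Hang": {"B2": 56.8, "C1": 46.2, "C2": 52.9, "D1": 30.0, "D2": 16.0},
}


def drops(task):
    """Percentage-point drop from removing agile (C2) or precise (D2) segments."""
    r = ablation[task]
    return {
        "agile_vs_B2": round(r["B2"] - r["C2"], 1),
        "agile_vs_control": round(r["C1"] - r["C2"], 1),
        "precise_vs_B2": round(r["B2"] - r["D2"], 1),
        "precise_vs_control": round(r["D1"] - r["D2"], 1),
    }
```

For Book Fetching this gives drops of 30.0 and 53.3 points relative to B2 for the agile and precise ablations, respectively. For Tool Hang the agile ablation loses 3.9 points relative to B2 but is 6.7 points above its size-matched control C1 (a negative drop), while the precise ablation loses 40.8 points relative to B2 and 14.0 points relative to D1.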

Typical failures of models trained on datasets lacking specific segments.

Dataset w/o agile segments: The model struggles to navigate around obstacles, leading to collisions and task failures.
Dataset w/o precise segments: The model fails to grasp objects accurately, resulting in dropped items and failed tasks.