PartInstruct: Part-level Instruction Following for Fine-grained Robot Manipulation

PartInstruct teaser image

An example fine-grained robot manipulation task in PartInstruct

Abstract

Fine-grained robot manipulation, such as lifting and rotating a bottle to display the label on the cap, requires robust reasoning about object parts and their relationships with intended tasks. Despite recent advances in training general-purpose robot manipulation policies guided by language instructions, there is a notable lack of large-scale datasets for fine-grained manipulation tasks with part-level instructions and diverse 3D object instances annotated with part-level labels. We introduce PartInstruct, the first benchmark for training and evaluating such models. It features 513 object instances across 14 categories, 1,302 manipulation tasks in 16 classes, and over 10,000 expert demonstrations synthesized in a 3D simulator. Each demonstration includes a high-level task instruction, a sequence of basic part-based skills, and ground-truth 3D object data. Additionally, we designed a comprehensive test suite to evaluate the generalizability of learned policies across new states, objects, and tasks. We evaluated several state-of-the-art robot manipulation approaches on our benchmark, including end-to-end vision-language policy learning and bi-level planning models. The experimental results reveal that current models struggle to robustly ground part concepts and predict actions in 3D space, and face challenges when manipulating object parts in long-horizon tasks.

Problem Setup

PartGym: 3D Simulation for Part-level Manipulation

PartGym is a realistic robot simulator for fine-grained manipulation tasks requiring part-level understanding. Built on PyBullet, it features a 7-DoF Franka Emika Panda robot with a two-finger parallel gripper and simulates manipulation tasks for 14 types of everyday objects from the PartNet-Mobility dataset. PartGym provides (1) rich 3D assets, (2) part-level annotations, and (3) a diverse task set with natural language instructions. It includes 513 object instances and 4,653 part labels for detailed manipulation.
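PartGym's own code is not shown on this page; purely as an illustrative sketch of the underlying simulation stack, the snippet below loads a Franka Panda arm in plain PyBullet. The asset paths and setup are generic PyBullet defaults, not the benchmark's actual environment API.

```python
# Minimal PyBullet sketch (illustrative only; not the PartGym API).
# Loads a 7-DoF Franka Panda with its parallel gripper on a flat plane.
import pybullet as p
import pybullet_data

client = p.connect(p.DIRECT)                          # headless physics server
p.setAdditionalSearchPath(pybullet_data.getDataPath())
p.setGravity(0, 0, -9.81)

plane_id = p.loadURDF("plane.urdf")
panda_id = p.loadURDF("franka_panda/panda.urdf", useFixedBase=True)

# A PartNet-Mobility object would be loaded from its own URDF, e.g.:
# obj_id = p.loadURDF("path/to/partnet_mobility/<instance>/mobility.urdf")

p.stepSimulation()
p.disconnect(client)
```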

Multimodal Observations in PartGym. PartGym supports multimodal observations, including RGB images, depth maps, and scene point clouds (PCDs). It also provides object masks, 2D part masks, 3D object PCDs, and 3D part PCDs for each object.
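The exact observation schema is not documented here; the following sketch illustrates one plausible way to package these modalities per timestep. All key names and array shapes are hypothetical.

```python
# Hypothetical per-timestep observation layout; names and shapes are
# illustrative, not the benchmark's actual schema.
import numpy as np

obs = {
    "rgb": np.zeros((480, 640, 3), dtype=np.uint8),       # camera image
    "depth": np.zeros((480, 640), dtype=np.float32),      # depth map (meters)
    "scene_pcd": np.zeros((4096, 3), dtype=np.float32),   # scene point cloud
    "object_mask": np.zeros((480, 640), dtype=bool),      # 2D object mask
    "part_masks": {                                        # 2D part masks
        "cap": np.zeros((480, 640), dtype=bool),
        "body": np.zeros((480, 640), dtype=bool),
    },
    "object_pcd": np.zeros((2048, 3), dtype=np.float32),  # 3D object PCD
    "part_pcds": {"cap": np.zeros((512, 3), dtype=np.float32)},  # 3D part PCDs
}
```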

Modality figure
Object Categories

The following star plot shows the relative distribution of episodes across object categories in the dataset.

Distribution of object category episodes
Object Parts

The following graphs show annotated parts grouped by object categories. Spatial part names are highlighted in light gray.

Annotated parts grouped by object category

PartInstruct: Benchmark for Part-level Instruction Following

Comparison of PartInstruct with Other Benchmarks. We compared PartInstruct with existing tabletop robot manipulation benchmarks based on the number of distinct part-level instructions, part labels, and part-level tasks, the availability of training demonstrations, and whether these demonstrations include part-level annotations such as 2D and 3D segmentation masks.

Benchmark comparison table
Interactive Demos

Here we provide interactive demos that animate part-level manipulation tasks. You can use the slider to see the observation at every timestep along with the corresponding skill instruction.

Evaluation Protocol

To systematically evaluate model performance, we designed a five-level evaluation protocol, where each level evaluates a policy on one type of generalization: over object initial states (OS), novel object instances (OI), novel part combinations within the same task type (TP), novel task categories (TC), and novel object categories (OC).

Evaluation Protocol Table
Left: Training set. Right: Test 1 (OS).
Evaluation Protocol Table
Left: Training set. Right: Test 2 (OI).
Evaluation Protocol Table
Above: Training set. Below: Test 3 (TP).
Evaluation Protocol Table
Above: Training set. Below: Test 4 (TC).
Evaluation Protocol Table
Left: Training set. Right: Test 5 (OC).
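As a rough sketch of how the five test splits could be evaluated, the snippet below loops over the split abbreviations defined above and reports a per-split success rate. The `load_episodes` and `rollout` helpers are hypothetical placeholders, not benchmark APIs.

```python
# Hypothetical evaluation loop over the five generalization splits.
SPLITS = ["OS", "OI", "TP", "TC", "OC"]  # states, instances, part combos, task cats, object cats

def evaluate(policy, load_episodes, rollout):
    """Return per-split success rates; `rollout` yields 1 on success, 0 otherwise."""
    results = {}
    for split in SPLITS:
        episodes = load_episodes(split)
        successes = sum(rollout(policy, ep) for ep in episodes)
        results[split] = successes / max(len(episodes), 1)
    return results
```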

Experiments

In our benchmark, we evaluate two types of approaches to general-purpose robot manipulation: (1) end-to-end policy learning, which directly maps observations and instructions to actions, and (2) bi-level planning, which first generates high-level plans (typically subgoals) and then computes and executes low-level action plans to achieve those subgoals.

Success rates of all baselines
Success Rates of all baselines. The left group represents end-to-end learning policies, while the right group corresponds to bi-level planning models.
End-to-End Policy Learning

We evaluate several state-of-the-art end-to-end robot manipulation policy learning methods, including DP, DP3, Act3D, RVT2, and Octo. Note that the original DP and DP3 models do not support language inputs. To fit the setup of PartInstruct, we use a pre-trained T5 language encoder to obtain a language embedding and concatenate it with the other features as the observation condition for the denoising diffusion process.
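For concreteness, here is a minimal sketch of such language conditioning, assuming a Hugging Face T5 encoder and a mean-pooled sentence embedding; the checkpoint choice, pooling, and feature layout are assumptions and may differ from the exact setup used in the benchmark.

```python
# Sketch of T5-based language conditioning for a diffusion policy
# (assumes the "t5-small" checkpoint and mean pooling; both are assumptions).
import torch
from transformers import T5Tokenizer, T5EncoderModel

tokenizer = T5Tokenizer.from_pretrained("t5-small")
text_encoder = T5EncoderModel.from_pretrained("t5-small")

@torch.no_grad()
def language_embedding(instruction: str) -> torch.Tensor:
    tokens = tokenizer(instruction, return_tensors="pt")
    hidden = text_encoder(**tokens).last_hidden_state   # (1, seq_len, d)
    return hidden.mean(dim=1)                            # (1, d) pooled embedding

def observation_condition(visual_features: torch.Tensor, instruction: str) -> torch.Tensor:
    # Concatenate visual features with the language embedding to form the
    # conditioning vector for the denoising diffusion process.
    return torch.cat([visual_features, language_embedding(instruction)], dim=-1)
```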

Bi-level Planning

We hypothesize that it is easier to train policies on skill instructions than to directly train a policy for the whole task. Such a low-level action policy can then be combined with a high-level task planner that generates skill instructions from a task instruction to solve the manipulation task. The figure below illustrates our bi-level planning framework.

Bi-level planning framework

High-level Task Planner. We leverage a Vision Language Model (VLM) for high-level task planning. At step \( t \), we prompt the VLM with the current observation \( o_t \) and the task instruction \( I_\text{task} \) to generate the skill instruction for the current step as the subgoal \( sg_t \), i.e., \( \pi_\text{VLM}(sg_t | o_t, I_\text{task}) \).
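A minimal sketch of this planner call is given below; `query_vlm` stands in for whatever VLM client is used (e.g., Gemini), and the prompt wording is hypothetical rather than the prompt used in the benchmark.

```python
# Hypothetical high-level planner call: the VLM receives the current image
# and the task instruction and returns the next skill instruction (subgoal).
def plan_subgoal(query_vlm, image, task_instruction: str) -> str:
    prompt = (
        "You are a robot task planner. Given the current observation and the "
        f"task instruction '{task_instruction}', output the single next "
        "part-level skill instruction."
    )
    # Returns e.g. "grasp the cap of the bottle"
    return query_vlm(image=image, text=prompt)
```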

Low-level Action Policy. The low-level action policy is a vision-language policy that generates manipulation actions based on a subgoal and the current observation, i.e., \( \pi(a_t | o_t, sg_t) \), where \( a_t \) is the action at step \( t \). We can train such policies using the skill instructions annotated for the training demonstrations in our dataset. Here, we take the best-performing end-to-end policy learning baseline, DP3, and train it with object part segmentation as part of the input, which we refer to as DP3-S.
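Putting the two components together, the sketch below shows one plausible execution loop in which the planner proposes a subgoal at each step and the low-level policy predicts the corresponding action; `env`, `planner`, and `policy` are placeholders rather than benchmark objects.

```python
# Hypothetical bi-level execution loop: at each step the VLM planner proposes
# a subgoal sg_t and the low-level policy (e.g., DP3-S, which also consumes
# part segmentation) predicts the action a_t. All objects are placeholders.
def run_episode(env, planner, policy, task_instruction: str, max_steps: int = 300) -> bool:
    obs = env.reset()
    for _ in range(max_steps):
        subgoal = planner(obs, task_instruction)   # pi_VLM(sg_t | o_t, I_task)
        action = policy(obs, subgoal)              # pi(a_t | o_t, sg_t)
        obs, done = env.step(action)
        if done:
            return True                            # task completed
    return False
```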

Qualitative Results

Test 2: Place gripper tip on the right of the bottle.

Test 3: Place gripper tip on the right of the bottle.

Test 2: Grab the left of the scissors and move it to the left.

Test 3: Take hold of the left of the mug, slide it backwards, then let go of it.

Test 1: Push the right of the kitchenpot in to the left.

Test 2: Place gripper tip on the left of the mug.

Test 3: Take hold of the right of the mug, slide it to the left, then let go of it.

Test 1: Place gripper tip on the top of the stapler.

Comparison of End-to-End and Bi-level Episodes

Test 5: Lift the box by its rotation lid, move it to the right, turn the bottom to face front, then set it down.

DP3

Gemini-2.0 Flash+DP3-S

Conclusion

In this work, we introduced PartInstruct, a large-scale benchmark designed to advance fine-grained robot manipulation with part-level instructions. By curating a diverse set of objects, tasks, and expert demonstrations, PartInstruct provides a foundation for training and evaluating robot manipulation models that must reason about object parts and their relationships with tasks. Our evaluations of state-of-the-art models highlight critical challenges in grounding part concepts and executing long-horizon tasks. Through comprehensive experiments and ablation studies, our work offers key insights for future research, underscoring the need for further innovation in perception, reasoning, and planning to enable robots to perform fine-grained, part-aware manipulation effectively.

BibTeX

Coming Soon