Publications

You can also find my articles on my Google Scholar profile.

Conference Papers


PartInstruct: Part-level Instruction Following for Fine-grained Robot Manipulation

Published in WRL@ICLR 2025, 2025

Fine-grained robot manipulation, such as lifting and rotating a bottle to display the label on the cap, requires robust reasoning about object parts and their relationships with intended tasks. Despite recent advances in training general-purpose robot manipulation policies guided by language instructions, there is a notable lack of large-scale datasets for fine-grained manipulation tasks with part-level instructions and diverse 3D object instances annotated with part-level labels. In this work, we introduce PartInstruct, the first large-scale benchmark for both training and evaluating fine-grained robot manipulation models using part-level instructions. PartInstruct comprises 513 object instances across 14 categories, each annotated with part-level information, and 1302 fine-grained manipulation tasks organized into 16 task classes. Our training set consists of over 10,000 expert demonstrations synthesized in a 3D simulator, where each demonstration is paired with a high-level task instruction, a chain of basic part-based skill instructions, and ground-truth 3D information about the object and its parts. Additionally, we designed a comprehensive test suite to evaluate the generalizability of learned policies across new states, objects, and tasks. We evaluated several state-of-the-art approaches on our benchmark, including end-to-end vision-language policy learning and bi-level planning models for robot manipulation. The experimental results reveal that current models struggle to robustly ground part concepts and predict actions in 3D space, and face challenges when manipulating object parts in long-horizon tasks.
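
As a quick illustration of the annotation structure described above, here is a minimal sketch of how a single PartInstruct demonstration could be represented, with its high-level task instruction, chain of part-based skill instructions, and ground-truth part-level 3D information. The field names and types are illustrative assumptions, not the benchmark's actual schema.

```python
# Hypothetical representation of one demonstration, based only on the fields
# described in the abstract; names and types are illustrative assumptions.
from dataclasses import dataclass
from typing import List

@dataclass
class PartLabel:
    part_name: str            # e.g. "cap", "handle"
    point_indices: List[int]  # indices into the object point cloud belonging to this part

@dataclass
class Demonstration:
    task_instruction: str                   # high-level instruction
    skill_instructions: List[str]           # chain of basic part-based skill steps
    object_category: str                    # one of the 14 object categories
    object_point_cloud: List[List[float]]   # N x 3 points from the simulator
    part_labels: List[PartLabel]            # ground-truth part-level annotation
    actions: List[List[float]]              # expert end-effector actions per timestep

demo = Demonstration(
    task_instruction="Lift the bottle and rotate it so the cap label faces the camera.",
    skill_instructions=[
        "grasp the bottle by its body",
        "lift the bottle",
        "rotate the cap toward the camera",
    ],
    object_category="bottle",
    object_point_cloud=[[0.0, 0.0, 0.0]],
    part_labels=[PartLabel("cap", [0])],
    actions=[[0.0] * 7],
)
```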

Recommended citation: @inproceedings{yinpartinstruct, title={PartInstruct: Part-level Instruction Following for Fine-grained Robot Manipulation}, author={Yin, Yifan and Han, Zhengtao and Aarya, Shivam and Xu, Shuhang and Wang, Jianxin and Peng, Jiawei and Wang, Angtian and Yuille, Alan and Shu, Tianmin}, booktitle={7th Robot Learning Workshop: Towards Robots with Human-Level Abilities}, year={2025} }
Download Paper

GOMA: Proactive Embodied Cooperative Communication via Goal-Oriented Mental Alignment

Published in 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2024

Verbal communication plays a crucial role in human cooperation, particularly when the partners only have incomplete information about the task, environment, and each other’s mental state. In this paper, we propose a novel cooperative communication framework, Goal-Oriented Mental Alignment (GOMA). GOMA formulates verbal communication as a planning problem that minimizes the misalignment between the parts of agents’ mental states that are relevant to the goals. This approach enables an embodied assistant to reason about when and how to proactively initiate verbal communication with humans in natural language to achieve better cooperation. We evaluate our approach against strong baselines in two challenging environments, Overcooked (a multiplayer game) and VirtualHome (a household simulator). Our experimental results demonstrate that large language models struggle with generating meaningful communication that is grounded in the social and physical context. In contrast, our approach successfully generates concise verbal communication that enables the embodied assistant to effectively boost both cooperation performance and human users’ perception of the assistant.
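
For readers who want the gist of the objective, here is a minimal conceptual sketch of the idea described above: pick the utterance (possibly staying silent) whose predicted effect on the partner's belief most reduces goal-relevant misalignment, net of a speaking cost. The belief representation, misalignment measure, utterance-effect model, and cost weight are illustrative assumptions, not the paper's actual formulation.

```python
# Conceptual sketch only: choose the utterance that minimizes goal-relevant
# belief misalignment plus a fixed communication cost.
from typing import Callable, Dict, List, Optional

Belief = Dict[str, float]  # probability assigned to each goal-relevant proposition

def misalignment(assistant_belief: Belief, predicted_human_belief: Belief,
                 goal_relevant: List[str]) -> float:
    """Total absolute disagreement over goal-relevant propositions."""
    return sum(abs(assistant_belief[p] - predicted_human_belief.get(p, 0.5))
               for p in goal_relevant)

def select_utterance(assistant_belief: Belief,
                     predicted_human_belief: Belief,
                     goal_relevant: List[str],
                     utterances: List[Optional[str]],
                     update: Callable[[Belief, Optional[str]], Belief],
                     cost: float = 0.1) -> Optional[str]:
    """Pick the utterance whose predicted effect on the human's belief most
    reduces goal-relevant misalignment, net of a speaking cost (None = stay silent)."""
    def score(u: Optional[str]) -> float:
        updated = update(predicted_human_belief, u)  # predicted belief after hearing u
        return misalignment(assistant_belief, updated, goal_relevant) + (cost if u else 0.0)
    return min(utterances, key=score)

# Example: the assistant knows the milk is in the fridge; the human does not.
utterance = select_utterance(
    assistant_belief={"milk_in_fridge": 1.0},
    predicted_human_belief={"milk_in_fridge": 0.5},
    goal_relevant=["milk_in_fridge"],
    utterances=[None, "The milk is in the fridge."],
    update=lambda b, u: {**b, "milk_in_fridge": 1.0} if u else b,
)
print(utterance)  # -> "The milk is in the fridge."
```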

Recommended citation: @inproceedings{ying2024goma, title={Goma: Proactive embodied cooperative communication via goal-oriented mental alignment}, author={Ying, Lance and Jha, Kunal and Aarya, Shivam and Tenenbaum, Joshua B and Torralba, Antonio and Shu, Tianmin}, booktitle={2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)}, pages={7099--7106}, year={2024}, organization={IEEE} }
Download Paper

SIFToM: Robust Spoken Instruction Following through Theory of Mind

Published in Second AAAI Symposium on Unifying Representations for Robot Application Development (UR-RAD), 2024

Spoken language instructions are ubiquitous in agent collaboration. However, in human-robot collaboration, recognition accuracy for human speech is often influenced by various speech and environmental factors, such as background noise, the speaker’s accent, and mispronunciation. When faced with noisy or unfamiliar auditory inputs, humans use context and prior knowledge to disambiguate the stimulus and take pragmatic actions, a process referred to as top-down processing in cognitive science. We present a cognitively inspired model, Speech Instruction Following through Theory of Mind (SIFToM), to enable robots to pragmatically follow human instructions under diverse speech conditions by inferring the human’s goal and joint plan as a prior for speech perception and understanding. We test SIFToM in simulated home experiments (VirtualHome 2). Results show that the SIFToM model outperforms state-of-the-art speech and language models, approaching human-level accuracy on challenging speech instruction following tasks. We then demonstrate its ability at the task planning level on a mobile manipulator for breakfast preparation tasks.
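
The top-down idea can be sketched as a simple Bayesian re-ranking: combine the speech recognizer's score for each candidate instruction with a prior derived from the inferred goal and joint plan. The candidate strings, scores, and plan-based prior below are illustrative assumptions, not the SIFToM model itself.

```python
# Sketch of plan-informed re-ranking of noisy speech hypotheses.
from typing import Dict

def rerank_instructions(asr_likelihood: Dict[str, float],
                        plan_prior: Dict[str, float]) -> str:
    """Return the candidate with the highest (unnormalized) posterior
    p(instruction | audio, plan) ∝ p(audio | instruction) * p(instruction | plan)."""
    return max(asr_likelihood, key=lambda c: asr_likelihood[c] * plan_prior.get(c, 1e-6))

# Example: the recognizer slightly prefers an implausible transcription,
# but the inferred breakfast-preparation plan makes the other reading far more likely.
best = rerank_instructions(
    asr_likelihood={"grab the bread": 0.45, "grab the red": 0.55},
    plan_prior={"grab the bread": 0.9, "grab the red": 0.01},
)
print(best)  # -> "grab the bread"
```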

Recommended citation: @article{ying2024siftom, title={SIFToM: Robust Spoken Instruction Following through Theory of Mind}, author={Ying, Lance and Liu, Jason Xinyu and Aarya, Shivam and Fang, Yizirui and Tellex, Stefanie and Tenenbaum, Joshua B and Shu, Tianmin}, journal={arXiv preprint arXiv:2409.10849}, year={2024} }
Download Paper

Towards Increasing the Robustness of Predictive Steering-Control Autonomous Navigation Systems Against Dash Cam Image Angle Perturbations Due to Pothole Encounters

Published in arXiv preprint arXiv:2310.03959, 2023

Vehicle manufacturers are racing to create autonomous navigation and steering control algorithms for their vehicles. These systems are designed to handle various real-life scenarios such as obstacle avoidance and lane maneuvering. There is some ongoing research on incorporating pothole avoidance into these autonomous systems. However, there is very little research on the effect of hitting a pothole on the autonomous navigation software that uses cameras to make driving decisions. Perturbations in the camera angle when hitting a pothole can cause errors in the predicted steering angle. In this paper, we present a new model to compensate for such angle perturbations and reduce the resulting errors in steering control prediction algorithms. We evaluate our model on perturbed versions of publicly available datasets and show that it can reduce the errors in the estimated steering angle from perturbed images to 2.3%, making autonomous steering control robust against the dash cam image angle perturbations induced when one wheel of a car goes over a pothole.
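
To make the general compensation idea concrete: if the camera's pitch perturbation at the moment of impact can be estimated (e.g., from IMU data), the dash cam frame can be rectified with a pure-rotation homography H = K R K^(-1) before it is passed to the steering-prediction network. This is a generic rectification recipe under that assumption, not the compensation model proposed in the paper.

```python
# Generic rectification sketch: undo an estimated camera pitch perturbation
# with a pure-rotation homography before steering prediction.
import numpy as np
import cv2  # OpenCV, assumed available

def rectify_pitch(image: np.ndarray, pitch_rad: float, K: np.ndarray) -> np.ndarray:
    """Warp the frame by H = K R K^(-1), where R undoes the estimated pitch."""
    c, s = np.cos(-pitch_rad), np.sin(-pitch_rad)  # inverse rotation about the x-axis
    R = np.array([[1.0, 0.0, 0.0],
                  [0.0,   c,  -s],
                  [0.0,   s,   c]])
    H = K @ R @ np.linalg.inv(K)
    h, w = image.shape[:2]
    return cv2.warpPerspective(image, H, (w, h))

# Example with an assumed intrinsic matrix for a 1280x720 dash cam:
K = np.array([[1000.0,    0.0, 640.0],
              [   0.0, 1000.0, 360.0],
              [   0.0,    0.0,   1.0]])
frame = np.zeros((720, 1280, 3), dtype=np.uint8)
corrected = rectify_pitch(frame, pitch_rad=np.deg2rad(2.0), K=K)
```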

Recommended citation: @article{aarya2023towards, title={Towards Increasing the Robustness of Predictive Steering-Control Autonomous Navigation Systems Against Dash Cam Image Angle Perturbations Due to Pothole Encounters}, author={Aarya, Shivam}, journal={arXiv preprint arXiv:2310.03959}, year={2023} }
Download Paper