ObjectVLA: End-to-End Open-World Object
Manipulation Without Demonstration

1Midea Group, 2East China Normal University, 3Shanghai University
*Equal contribution †Corresponding author

Abstract

Imitation learning has proven to be highly effective in teaching robots dexterous manipulation skills. However, it typically relies on large amounts of human demonstration data, which limits its scalability and applicability in dynamic, real-world environments. One key challenge in this context is object generalization: a robot trained to perform a task with one object, such as "hand over the apple," struggles to transfer its skills to a semantically similar but visually different object, such as "hand over the peach." This gap in generalization to novel objects beyond the training categories has yet to be adequately addressed in previous work on end-to-end visuomotor policy learning. In this paper, we present a simple yet effective approach for achieving object generalization through Vision-Language-Action (VLA) models, referred to as ObjectVLA. Our model enables robots to generalize learned skills to novel objects without requiring explicit human demonstrations for each new target object. By leveraging vision-language pair data, our method provides a lightweight and scalable way to inject knowledge about the target object, establishing an implicit link between the object and the desired action. We evaluate ObjectVLA on a real robotic platform, demonstrating its ability to generalize across more than 100 novel objects with a 64% success rate in selecting objects not seen during training. Furthermore, we propose a more accessible route to object generalization in VLA models: capturing a few images with a smartphone and fine-tuning the pre-trained model. These results highlight the effectiveness of our approach in enabling object-level generalization and reducing the need for extensive human demonstrations, paving the way for more flexible and scalable robotic learning systems.
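
To make the co-training idea above concrete, here is a minimal sketch of drawing training examples from a mixture of robot demonstrations (with action labels) and vision-language pairs (with captions and boxes only). The entries, field names, and the 50/50 sampling ratio are illustrative assumptions, not the authors' implementation.

import random

# Toy stand-ins for the two data sources: robot demonstrations carry action
# labels, while vision-language pairs carry only object knowledge.
robot_data = [
    {"image": "cam_apple.png", "instruction": "hand over the apple",
     "actions": [[0.10, 0.00, 0.20], [0.10, 0.05, 0.20]]},
]
vision_language_data = [
    {"image": "peach.jpg", "text": "a peach", "bbox": [120, 80, 260, 210]},
]

def sample_training_example(p_robot=0.5):
    """Draw from robot data with probability p_robot, otherwise from VL data."""
    source = robot_data if random.random() < p_robot else vision_language_data
    return random.choice(source)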

Experiments

We start by evaluating the generalization performance of ObjectVLA on object selection. The experimental results are shown in the following figure. Our method achieves a 100% success rate on in-distribution object selection and a 64% success rate on out-of-distribution object selection. Notably, ObjectVLA w/o bbox achieves only a 19% success rate in the OOD evaluation, despite reaching 100% in the ID test. This illustrates that, without explicit grounding and a structured reasoning process, the model struggles to differentiate objects it has only seen in vision-language data.
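
The "w/o bbox" ablation suggests the explicit grounding step is what matters. The sketch below shows one way such a grounded rollout could be structured (localize first, then act); the predict_bbox and predict_actions interfaces are hypothetical and only meant to illustrate the two-stage reasoning, not the actual ObjectVLA API.

def grounded_rollout(model, image, instruction):
    """Two-stage rollout: ground the referred object, then act on it."""
    # Step 1: explicit grounding -- predict a bounding box for the object
    # mentioned in the instruction (hypothetical interface).
    bbox = model.predict_bbox(image, instruction)
    # Step 2: predict actions conditioned on the image, the instruction, and
    # the predicted box, so the policy acts on the grounded object.
    return model.predict_actions(image, instruction, bbox)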

Rollouts


Combining with More Skills

We also expand the evaluation to more complex skills, specifically "pick & place", "push", and "rotate". Our experimental results show that ObjectVLA can transfer these skills to objects unseen in the robot data but present in the vision-language data. The following videos are played at 2× speed for better visualization.
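
As a rough illustration of how such a multi-skill evaluation can be tallied, the loop below pairs each skill with objects that appear only in the vision-language data. The object list is an example, and run_episode is a hypothetical stub standing in for one real-robot rollout.

def run_episode(skill, target_object):
    """Hypothetical stand-in for one real-robot rollout; returns True on success."""
    raise NotImplementedError("replace with a call to the robot stack")

skills = ["pick & place", "push", "rotate"]
novel_objects = ["peach", "toy cat", "toy duck"]  # unseen in robot data

def success_rate(skill):
    """Fraction of novel objects on which the skill succeeds."""
    outcomes = [run_episode(skill, obj) for obj in novel_objects]
    return sum(outcomes) / len(outcomes)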

Bin Picking

Rotate

Push

Cheap Object Generalization via Smartphone Pictures

Additionally, we find that ObjectVLA can generalize to novel objects given only a few images captured by a smartphone. The videos are played at 2× speed for better visualization.
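
Below is a minimal sketch of how such smartphone captures could be turned into vision-language pairs for fine-tuning. The directory layout, caption template, and JSON output are assumptions for illustration, not the authors' pipeline.

import json
from pathlib import Path

def build_phone_dataset(photo_dir, object_name, out_file="phone_pairs.json"):
    """Pair each smartphone photo with a short caption naming the new object."""
    pairs = [
        {"image": str(path), "text": f"a photo of a {object_name}"}
        for path in sorted(Path(photo_dir).glob("*.jpg"))
    ]
    Path(out_file).write_text(json.dumps(pairs, indent=2))
    return pairs

# Example usage: a handful of phone photos of a Pikachu toy.
# build_phone_dataset("photos/pikachu", "Pikachu toy")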

Pikachu

Toy Cat

BibTeX

@misc{zhu2025objectvla,
      title={ObjectVLA: End-to-End Open-World Object Manipulation Without Demonstration}, 
      author={Minjie Zhu and Yichen Zhu and Jinming Li and Zhongyi Zhou and Junjie Wen and Xiaoyu Liu and Chaomin Shen and Yaxin Peng and Feifei Feng},
      year={2025},
      eprint={2502.19250},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2502.19250}, 
}