A Python simulation that combines PyBullet robotic arm control with zero-shot image-text matching capabilities using the CLIP vision-language model. This project was created as part of my exploration into applying vision-language models in robotics.
This project demonstrates:
- Physics-based robotic arm simulation using PyBullet
- Vision-guided object selection using CLIP vision-language model
- Intelligent pick and place operations based on text prompts
The robot captures images from its camera, analyzes them using CLIP, selects objects based on text descriptions, and performs pick-and-place operations, including color sorting and interactive placement.
- VLM interaction: `vlm.mp4`
- Automatic Sorting: `Sorting.mp4`
├── panda_vision_simulation.py # Vision-guided robot simulation class
├── color_sorting_vlm.py # Color sorting and interactive demo with VLM
├── simple_pick_place_demo.py # Simple pick and place demo
├── requirements.txt # Python dependencies
└── README.md # This documentation
- Camera image capture from the PyBullet simulation
- Object detection using 3D-to-2D projection and segmentation masks (see the sketch after this list)
- Vision-language matching with CLIP for object selection
- Text-prompted pick and place ("pick up the red cube")
- Multiple objects (red cube, red sphere, blue cube, green sphere, yellow cylinder)
- Color sorting zones with physical borders to prevent objects from falling
- Interactive zone selection and throw action for object placement
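The camera capture and detection features above rely on PyBullet's synthetic camera and its per-pixel segmentation buffer. Here is a minimal, self-contained sketch of that idea; the camera pose, resolution, and loaded object are illustrative and not the values used in `panda_vision_simulation.py`:

```python
import numpy as np
import pybullet as p
import pybullet_data

p.connect(p.DIRECT)  # headless physics server is enough for a render test
p.setAdditionalSearchPath(pybullet_data.getDataPath())
p.loadURDF("plane.urdf")
cube_id = p.loadURDF("cube_small.urdf", basePosition=[0.5, 0.0, 0.05])

# Synthetic overhead camera (pose and intrinsics are illustrative).
view = p.computeViewMatrix(cameraEyePosition=[0.5, 0.0, 1.0],
                           cameraTargetPosition=[0.5, 0.0, 0.0],
                           cameraUpVector=[0.0, 1.0, 0.0])
proj = p.computeProjectionMatrixFOV(fov=60, aspect=1.0, nearVal=0.01, farVal=2.0)

width, height, rgb, depth, seg = p.getCameraImage(224, 224, view, proj)

# The segmentation buffer stores a body id per pixel, so each object's crop
# is just the bounding box of its mask in the RGB image.
rgb = np.reshape(rgb, (height, width, 4))[:, :, :3]
seg = np.reshape(seg, (height, width))
ys, xs = np.where(seg == cube_id)
crop = rgb[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
print("cube crop shape:", crop.shape)
```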
- Python 3.7+
- PyBullet
- NumPy
- For the CLIP-based demos, all of the basic requirements plus:
- PyTorch
- Transformers (Hugging Face)
- Pillow (PIL)
- Matplotlib
- Requests
- Clone or download this project:
  git clone https://github.com/Nabil-Miri/vlm-robot-color-sorting.git
- (Recommended) Create and activate a Python virtual environment:
  python3 -m venv .venv
  source .venv/bin/activate
- Install the required dependencies:
  pip install -r requirements.txt
To run the color sorting and interactive demo:
python3 color_sorting_vlm.py

When you start the demo, you will be prompted to choose a mode:
- Automatic Color Sorting: The robot will automatically sort all objects into their matching colored zones.
- Interactive Text Prompt: You can enter text prompts to select which object the robot should pick up, and choose where to place it (including a "throw away" option).
Follow the on-screen instructions to interact with the robot and sorting zones.
The robot uses CLIP to match images of objects to your text prompt:
# For each object crop, CLIP returns a similarity score with the text prompt
similarity_scores = model.compute_object_similarity(crops, text_prompt)
selected_object, best_score = model.select_best_object(similarity_scores)
# Robot picks and places the selected object

CLIP compares each cropped object image to your prompt (e.g., "red cube") and returns a similarity score for each. The robot picks the object with the highest score and moves it to the location you choose.
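For reference, here is a minimal sketch of this zero-shot scoring step using the Hugging Face `transformers` CLIP API. The `score_crops` helper is hypothetical (the project exposes its own `compute_object_similarity`) and assumes the object crops are already available as PIL images:

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def score_crops(crops, prompt):
    """Return one image-text similarity score per object crop (crops are PIL images)."""
    inputs = processor(text=[prompt], images=crops, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # logits_per_image has shape (num_crops, num_prompts); drop the prompt axis.
    return outputs.logits_per_image.squeeze(1).tolist()

# Usage: the crop with the highest score is the object the robot should grasp.
# scores = score_crops(crops, "a red cube")
# best_index = max(range(len(scores)), key=scores.__getitem__)
```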
The robot uses PyBullet's built-in inverse kinematics solver to calculate joint angles needed to reach target positions:
joint_positions = p.calculateInverseKinematics(
self.robot_id,
endEffectorLinkIndex=11, # Panda end-effector link
targetPosition=target_position,
targetOrientation=target_orientation
)

Joint control uses position control mode (a short sketch follows the list below):
- Position Control: Joints move to target positions
- Gripper Control: Two-finger gripper with synchronized motion
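As an illustration, here is a minimal position-control sketch built on PyBullet's bundled `franka_panda/panda.urdf`; the joint indices, forces, and target values are assumptions for demonstration, not the exact settings used in this project:

```python
import pybullet as p
import pybullet_data

p.connect(p.DIRECT)
p.setAdditionalSearchPath(pybullet_data.getDataPath())
robot_id = p.loadURDF("franka_panda/panda.urdf", useFixedBase=True)

ARM_JOINTS = list(range(7))   # the Panda's seven revolute arm joints
FINGER_JOINTS = [9, 10]       # the two prismatic gripper fingers

def move_arm(joint_positions):
    """Drive each arm joint toward its target position (e.g., an IK solution)."""
    for joint_index, target in zip(ARM_JOINTS, joint_positions):
        p.setJointMotorControl2(robot_id, joint_index,
                                controlMode=p.POSITION_CONTROL,
                                targetPosition=target,
                                force=200)

def set_gripper(opening):
    """Command both fingers to the same width so they move in sync."""
    for joint_index in FINGER_JOINTS:
        p.setJointMotorControl2(robot_id, joint_index,
                                controlMode=p.POSITION_CONTROL,
                                targetPosition=opening,
                                force=40)

move_arm([0.0, -0.5, 0.0, -1.8, 0.0, 1.5, 0.8])
set_gripper(0.04)             # ~4 cm per finger = fully open
for _ in range(240):          # step so the motors can track their targets
    p.stepSimulation()
```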
Possible extensions to this project:
- Integrate advanced object segmentation for more accurate identification.
- Detect and localize area zones visually, not just by preset coordinates.
- Enable multi-object reasoning for commands involving relationships (e.g., “Stack all blue cubes, then put the red sphere on top”).
Feel free to fork, modify, and submit PRs! Suggestions and improvements are welcome.
