Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 2 additions & 1 deletion README.md
Original file line number Diff line number Diff line change
Expand Up @@ -41,9 +41,10 @@ With the Agent SDK, you can:
|---|---|---|
| `anthropic/claude-sonnet-4-5-20250929` | `huggingface-local/xlangai/OpenCUA-{7B,32B}` | any all-in-one CUA |
| `openai/computer-use-preview` | `huggingface-local/HelloKKMe/GTA1-{7B,32B,72B}` | any VLM (using liteLLM, requires `tools` parameter) |
| `openrouter/z-ai/glm-4.5v` | `huggingface-local/Hcompany/Holo1.5-{3B,7B,72B}` | |
| `openrouter/z-ai/glm-4.5v` | `huggingface-local/Hcompany/Holo1.5-{3B,7B,72B}` | any LLM (using liteLLM, requires `moondream3+` prefix ) |
| `huggingface-local/OpenGVLab/InternVL3_5-{1B,2B,4B,8B,...}` | any all-in-one CUA | |
| `huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B` | |
| `moondream3+{ui planning}` (supports text-only models) | |
| `omniparser+{ui planning}` | | |
| `{ui grounding}+{ui planning}` | | |

Expand Down
18 changes: 18 additions & 0 deletions docs/content/docs/agent-sdk/supported-agents/composed-agents.mdx
Original file line number Diff line number Diff line change
Expand Up @@ -23,6 +23,7 @@ Any model that supports `predict_click()` can be used as the grounding component
- InternVL 3.5 family: `huggingface-local/OpenGVLab/InternVL3_5-{1B,2B,4B,8B,...}`
- UI‑TARS 1.5: `huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B` (also supports full CU)
- OmniParser (OCR): `omniparser` (requires combination with a LiteLLM vision model)
- Moondream3: `moondream3` (requires combination with a LiteLLM vision/text model)

## Supported Planning Models

Expand Down Expand Up @@ -83,6 +84,23 @@ async for _ in agent.run("Help me fill out this form with my personal informatio
pass
```

### Moondream3 + GPT-4o

Use the built-in Moondream3 grounding with any planning model. Moondream3 will detect UI elements on the latest screenshot, label them, and provide a user message listing detected element names.

```python
from agent import ComputerAgent
from computer import computer

agent = ComputerAgent(
"moondream3+openai/gpt-4o",
tools=[computer]
)

async for _ in agent.run("Close the settings window, then open the Downloads folder"):
pass
```

## Benefits of Composed Agents

- **Specialized Grounding**: Use models optimized for click prediction accuracy
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -45,6 +45,12 @@ OCR-focused set-of-marks model that requires an LLM for click prediction:

- `omniparser` (requires combination with any LiteLLM vision model)

### Moondream3 (Local Grounding)

Moondream3 is a powerful small model that can perform UI grounding and click prediction.

- `moondream3`

## Usage Examples

```python
Expand Down
2 changes: 1 addition & 1 deletion libs/python/agent/agent/cli.py
Original file line number Diff line number Diff line change
Expand Up @@ -167,7 +167,7 @@ async def chat_loop(agent, model: str, container_name: str, initial_prompt: str

# Process and display the output
for item in result.get("output", []):
if item.get("type") == "message":
if item.get("type") == "message" and item.get("role") == "assistant":
# Display agent text response
content = item.get("content", [])
for content_part in content:
Expand Down
2 changes: 2 additions & 0 deletions libs/python/agent/agent/loops/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -13,6 +13,7 @@
from . import opencua
from . import internvl
from . import holo
from . import moondream3

__all__ = [
"anthropic",
Expand All @@ -25,4 +26,5 @@
"opencua",
"internvl",
"holo",
"moondream3",
]
Loading