Add moondream3 model for CUA #437

ddupont808 · 2025-10-02T15:12:23Z

This PR adds support for the moondream3 model in the Agent SDK. The moondream3 model supports both UI screenshot parsing (element detection + captioning) and UI grounding (element localization). As a result, it also enables support for composed agents using text-only models.

The model can be used in a few different ways:

Grounding only

agent = ComputerAgent("moondream3")
agent.predict_click(screenshot, "start button") # ( x, y )

Composed Agent (Supports vision & text-only models)

agent = ComputerAgent("moondream3+ollama/gemma3")
agent.run("close this window")

This will caption each screenshot sent to the model and append a user message with a list of elements.

Example output:

Detected form UI elements on screen:

Full Name
Email Address
Priority Level
Subject
Detailed Description
Request Support

ddupont808 added 3 commits October 2, 2025 10:57

added moondream3 agent loop

0b3c677

added moondream3 to docs

b2ddfe2

Added working moondream3 agent

1e94b5d

ddupont808 marked this pull request as ready for review October 3, 2025 12:18

ddupont808 merged commit 9f18f9e into main Oct 6, 2025
2 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Add moondream3 model for CUA #437

Add moondream3 model for CUA #437

ddupont808 commented Oct 2, 2025

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Add moondream3 model for CUA #437

Add moondream3 model for CUA #437

Conversation

ddupont808 commented Oct 2, 2025

Grounding only

Composed Agent (Supports vision & text-only models)

Uh oh!

Uh oh!

Uh oh!