Skip to content

Conversation

ddupont808
Copy link
Contributor

This PR adds support for the moondream3 model in the Agent SDK. The moondream3 model supports both UI screenshot parsing (element detection + captioning) and UI grounding (element localization). As a result, it also enables support for composed agents using text-only models.

The model can be used in a few different ways:

Grounding only

agent = ComputerAgent("moondream3")
agent.predict_click(screenshot, "start button") # ( x, y )

Composed Agent (Supports vision & text-only models)

agent = ComputerAgent("moondream3+ollama/gemma3")
agent.run("close this window")

This will caption each screenshot sent to the model and append a user message with a list of elements.

Example output:
image
Detected form UI elements on screen:

  • Full Name
  • Email Address
  • Priority Level
  • Subject
  • Detailed Description
  • Request Support

@ddupont808 ddupont808 marked this pull request as ready for review October 3, 2025 12:18
@ddupont808 ddupont808 merged commit 9f18f9e into main Oct 6, 2025
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant