This project defines common usage scenarios for web-based code, model, and dataset hosting platforms, and provides the corresponding prompts and ground truth. These resources can be used to evaluate the localization performance of vision-language models (VLMs) in these specialized scenarios.
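
The source does not spell out how the reported metrics are defined, so the following is only a minimal sketch of one plausible scoring scheme. It assumes each ground-truth item is a bounding box, that a model's answer is a single click point, that an answer which cannot be parsed into a point counts as invalid, and that a sample with no answer at all counts as not completed. The names `Sample`, `point_in_box`, and `score_samples` are hypothetical and do not come from this project.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Sequence, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


@dataclass
class Sample:
    """One evaluation item (hypothetical structure, not the project's format)."""
    target: Box                # ground-truth region for the prompt
    answered: bool             # False if the model produced no output at all
    point: Optional[Point]     # parsed click point, or None if unparseable


def point_in_box(point: Point, box: Box) -> bool:
    """Return True if the point lies inside the box (edges included)."""
    x, y = point
    x_min, y_min, x_max, y_max = box
    return x_min <= x <= x_max and y_min <= y <= y_max


def score_samples(samples: Sequence[Sample]) -> Dict[str, float]:
    """Compute percentages over all samples under the assumed definitions."""
    total = len(samples)
    if total == 0:
        raise ValueError("at least one sample is required")

    correct = error = invalid = completed = 0
    for sample in samples:
        if not sample.answered:
            continue                      # assumed: counts against completion only
        completed += 1
        if sample.point is None:
            invalid += 1                  # assumed: unparseable output is invalid
        elif point_in_box(sample.point, sample.target):
            correct += 1
        else:
            error += 1

    def pct(count: int) -> float:
        return round(100.0 * count / total, 1)

    return {
        "accuracy": pct(correct),
        "error": pct(error),
        "invalid": pct(invalid),
        "completion_rate": pct(completed),
    }
```

Under these assumed definitions, accuracy, error, and invalid always add up to the completion rate, apart from rounding. Whether the figures reported here were computed this way is not stated in the source.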

Results on model and dataset hosting platforms:

| Model | Platform | Accuracy (%) | Error (%) | Invalid (%) | Completion Rate (%) |
|---|---|---|---|---|---|
| AriaUI | Hugging Face | 70.8 | 12.5 | 6.7 | 100.0 |
| AriaUI | ModelScope | 57.6 | 14.2 | 28.2 | 100.0 |
| AriaUI | OpenCSG | 81.0 | 9.5 | 9.5 | 100.0 |
| CogAgent | Hugging Face | 73.3 | 26.7 | 0.0 | 100.0 |
| CogAgent | ModelScope | 57.9 | 29.1 | 13.0 | 96.3 |
| CogAgent | OpenCSG | 57.1 | 19.0 | 23.8 | 100.0 |
| Qwen3B | Hugging Face | 8.3 | 15.8 | 19.2 | 41.7 |
| Qwen3B | ModelScope | 0.0 | 28.6 | 20.6 | 49.2 |
| Qwen3B | OpenCSG | 4.8 | 4.8 | 9.5 | 19.0 |
| Qwen7B | Hugging Face | 73.3 | 11.7 | 10.8 | 95.8 |
| Qwen7B | ModelScope | 55.5 | 30.2 | 8.5 | 95.2 |
| Qwen7B | OpenCSG | 71.4 | 14.3 | 14.3 | 100.0 |
| SeeClick | Hugging Face | 39.2 | 36.7 | 24.2 | 100.0 |
| SeeClick | ModelScope | 52.4 | 29.0 | 18.6 | 100.0 |
| SeeClick | OpenCSG | 52.4 | 14.3 | 33.3 | 100.0 |
| ShowUI | Hugging Face | 30.0 | 45.0 | 11.7 | 86.7 |
| ShowUI | ModelScope | 43.3 | 26.7 | 14.3 | 88.9 |
| ShowUI | OpenCSG | 23.8 | 52.4 | 9.5 | 85.7 |

Summary:

| Model | Accuracy (%) | Error (%) | Invalid (%) | Completion Rate (%) |
|---|---|---|---|---|
| AriaUI | 67.7 | 19.3 | 11.4 | 100.0 |
| CogAgent | 63.3 | 34.8 | 3.0 | 98.7 |
| Qwen3B | 4.5 | 10.8 | 12.9 | 62.6 |
| Qwen7B | 66.9 | 18.8 | 10.1 | 100.0 |
| SeeClick | 45.6 | 26.2 | 26.8 | 97.9 |
| ShowUI | 32.1 | 42.8 | 11.6 | 85.0 |

Results on code hosting platforms:

| Model | Platform | Accuracy (%) | Error (%) | Invalid (%) | Completion Rate (%) |
|---|---|---|---|---|---|
| AriaUI | GitCode | 57.1 | 28.5 | 14.3 | 100.0 |
| AriaUI | Gitea | 71.4 | 28.5 | 0.0 | 100.0 |
| AriaUI | Gitee | 57.1 | 28.5 | 14.3 | 100.0 |
| AriaUI | GitHub | 71.4 | 14.3 | 14.3 | 100.0 |
| AriaUI | GitLab | 71.4 | 14.3 | 14.3 | 100.0 |
| CogAgent | GitCode | 71.4 | 28.5 | 0.0 | 100.0 |
| CogAgent | Gitea | 71.4 | 28.5 | 0.0 | 100.0 |
| CogAgent | Gitee | 100.0 | 0.0 | 0.0 | 100.0 |
| CogAgent | GitHub | 57.1 | 42.8 | 0.0 | 100.0 |
| CogAgent | GitLab | 85.7 | 14.3 | 0.0 | 100.0 |
| Qwen3B | GitCode | 14.2 | 28.5 | 42.8 | 85.7 |
| Qwen3B | Gitea | 14.2 | 57.1 | 14.2 | 85.7 |
| Qwen3B | Gitee | 14.2 | 42.8 | 28.5 | 100.0 |
| Qwen3B | GitHub | 0.0 | 28.5 | 57.1 | 85.7 |
| Qwen3B | GitLab | 14.2 | 28.5 | 28.5 | 71.4 |
| Qwen7B | GitCode | 71.4 | 0.0 | 28.5 | 100.0 |
| Qwen7B | Gitea | 57.1 | 28.5 | 14.2 | 100.0 |
| Qwen7B | Gitee | 28.5 | 57.1 | 14.2 | 100.0 |
| Qwen7B | GitHub | 0.0 | 14.2 | 85.7 | 100.0 |
| Qwen7B | GitLab | 85.7 | 14.2 | 0.0 | 100.0 |
| SeeClick | GitCode | 28.5 | 48.5 | 28.5 | 100.0 |
| SeeClick | Gitea | 28.5 | 28.5 | 48.5 | 100.0 |
| SeeClick | Gitee | 28.5 | 57.1 | 14.2 | 100.0 |
| SeeClick | GitHub | 14.2 | 57.1 | 28.5 | 100.0 |
| SeeClick | GitLab | 0.0 | 71.4 | 28.5 | 100.0 |
| ShowUI | GitCode | 28.5 | 48.5 | 14.2 | 85.7 |
| ShowUI | Gitea | 57.1 | 48.5 | 0.0 | 100.0 |
| ShowUI | Gitee | 57.1 | 28.5 | 0.0 | 85.7 |
| ShowUI | GitHub | 48.5 | 14.2 | 28.5 | 85.7 |
| ShowUI | GitLab | 48.5 | 14.2 | 14.2 | 71.4 |

Summary:

| Model | Accuracy (%) | Error (%) | Invalid (%) | Completion Rate (%) |
|---|---|---|---|---|
| AriaUI | 65.7 | 22.8 | 11.4 | 100.0 |
| CogAgent | 62.9 | 22.8 | 0.0 | 100.0 |
| Qwen3B | 11.4 | 37.1 | 37.1 | 85.7 |
| Qwen7B | 48.5 | 22.9 | 28.6 | 100.0 |
| SeeClick | 20.0 | 51.4 | 28.6 | 100.0 |
| ShowUI | 45.7 | 28.6 | 11.4 | 85.7 |