PTIT Knowledge Portal

International Paper

OWLViz: An Open-World Benchmark for Visual Question Answering

Thuy Nguyen

We present OWLViz, a challenging benchmark for Open WorLd VISual question answering that evaluates multimodal AI systems on realistic, practical tasks. OWLViz features 248 carefully curated questions requiring the integration of multiple capabilities: common-sense knowledge, visual understanding, web exploration, and specialized tool usage. The benchmark specifically challenges models with visually degraded inputs, complex multi-step reasoning involving counting and measurement operations, and knowledge-intensive queries requiring external information retrieval from minimal visual cues. While humans achieve 69.2% accuracy on these intuitive tasks in under one minute, even state-of-the-art VLMs struggle dramatically, with the best model, Gemini 2.5 Pro, achieving only 27.09% accuracy. Current tool-calling agents and GUI agents, which rely on vision and vision-language models as tools, perform even worse, often failing to engage with available tools effectively. This substantial performance gap reveals critical limitations in multimodal systems' ability to select appropriate tools, coordinate heterogeneous resources, and execute complex reasoning sequences. OWLViz establishes new directions for advancing practical, open-world AI research and agent development.

Published in:

OWLViz: An Open-World Benchmark for Visual Question Answering

Publication date:

DOI:


Publisher:

Location:


Keywords:

Open World, VLM, Tool-calling agents, GUI agents, VQA, RR