OWLViz: An Open-World Benchmark for Visual Question Answering

Thuy Nguyen

We present OWLViz, a challenging benchmark for Open WorLd VISual question answering that evaluates multimodal AI systems on realistic, practical tasks. OWLViz features 248 carefully curated questions requiring the integration of multiple capabilities: common-sense knowledge, visual understanding, web exploration, and specialized tool usage. The benchmark specifically challenges models with visually degraded inputs, complex multi-step reasoning involving counting and measurement operations, and knowledge-intensive queries requiring external information retrieval from minimal visual cues. While humans achieve 69.2% accuracy on these intuitive tasks in under one minute, even state-of-the-art VLMs struggle dramatically, with the best model, Gemini 2.5 Pro, achieving only 27.09% accuracy. Current tool-calling agents and GUI agents, which rely on vision and vision-language models as tools, perform even worse, often failing to engage with available tools effectively. This substantial performance gap reveals critical limitations in multimodal systems' ability to select appropriate tools, coordinate heterogeneous resources, and execute complex reasoning sequences. OWLViz establishes new directions for advancing practical, open-world AI research and agent development.

Published in:

OWLViz: An Open-World Benchmark for Visual Question Answering

Keywords:

Open World, VLM, Tool-calling agents, GUI agents, VQA, RR

Related papers

Adaptive federated learning with k-Means++ for rare-class IoT intrusion detection

Huỳnh Trọng Thưa

Hybrid Federated Learning with TabTransformer and FedMADE-GSA for IoT Intrusion Detection

Huỳnh Trọng Thưa

Analyzing the Impact of IT Capability on Business Performance of SMEs in the Context of Digital Transformation in HCMC

Nguyễn Văn Sáu

R-IDF: Addressing the accuracy fallacy in evaluating LSTM-based intrusion detection

Phan Thanh Hy

A Smart Curriculum Vitae Analysis and Recommendation System for Job Application Support

Lai Quang Vinh

Detecting "Nine-Dash Line" Images in Digital Content via Faster R-CNN and DINOv2-Based Knowledge Distillation

Do Tran Tu

Deep Learning-Based Recognition and Classification of Technical Errors in Squat Movements

Le Mau Hai Dang