International paper
OWLViz: An Open-World Benchmark for Visual Question Answering
Thuy Nguyen
We present OWLViz, a challenging benchmark for Open WorLd VISual question answering that evaluates multimodal AI systems on realistic, practical tasks. OWLViz features 248 carefully curated questions requiring the integration of multiple capabilities: common-sense knowledge, visual understanding, web exploration, and specialized tool usage. The benchmark specifically challenges models with visually degraded inputs, complex multi-step reasoning involving counting and measurement operations, and knowledge-intensive queries requiring external information retrieval from minimal visual cues. While humans achieve 69.2% accuracy on these intuitive tasks in under one minute, even state-of-the-art VLMs struggle dramatically, with the best model, Gemini 2.5 Pro, achieving only 27.09% accuracy. Current tool-calling agents and GUI agents, which rely on vision and vision-language models as tools, perform even worse, often failing to engage with available tools effectively. This substantial performance gap reveals critical limitations in multimodal systems' ability to select appropriate tools, coordinate heterogeneous resources, and execute complex reasoning sequences. OWLViz establishes new directions for advancing practical, open-world AI research and agent development.
Related papers
Hybrid Federated Learning with TabTransformer and FedMADE-GSA for IoT Intrusion Detection
Huỳnh Trọng Thưa
Analyzing the Impact of IT Capability on Business Performance of SMEs in the Context of Digital Transformation in HCMC
Nguyễn Văn Sáu