International Papers


SADL: Sampling, Deliberation, and Pseudo-Labeling for In-Context Learning in Compositional Visual Question Answering

Đặng Hoàng Long

Large vision-language models (LVLMs) offer a novel capability for performing in-context learning (ICL) in Visual QA. When prompted with a few demonstrations of image-question-answer triplets, LVLMs have demonstrated the ability to discern underlying patterns and transfer this latent knowledge to answer new questions about unseen images without the need for expensive supervised fine-tuning. However, designing effective vision-language prompts, especially for compositional questions, remains poorly understood. Adapting language-only ICL techniques may not necessarily work because we need to bridge the visual-linguistic semantic gap: Symbolic concepts must be grounded in visual content, which does not share the syntactic linguistic structures. This paper introduces SADL, a new visual-linguistic prompting framework for the task. SADL revolves around three key components: SAmpling, Deliberation, and Pseudo-Labeling of image-question pairs. Given an image-question query, we sample image-question pairs from the training data that are in semantic proximity to the query. To address the compositional nature of questions, the deliberation step decomposes complex questions into a sequence of subquestions. Finally, the sequence is progressively annotated one subquestion at a time to generate a sequence of pseudo-labels. We investigate the behaviors of SADL under OpenFlamingo on large-scale Visual QA datasets, namely GQA, GQA-OOD, CLEVR, and CRIC. The evaluation demonstrates the critical roles of sampling in the neighborhood of the image, the decomposition of complex questions, and the accurate pairing of the subquestions and labels. These findings do not always align with those found in language-only ICL, suggesting fresh insights in vision-language settings.
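The three steps named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Example` record, the use of cosine similarity over a precomputed embedding, the `<image:...>` placeholder, and the `Q:`/`A:` prompt layout are all assumptions made for illustration. The abstract only says that demonstrations are sampled in semantic proximity to the query, that questions are decomposed into subquestions, and that subquestions are labeled one at a time.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Example:
    """Hypothetical training record; every field name here is an assumption."""
    image_id: str
    embedding: List[float]      # assumed precomputed image-question embedding
    subquestions: List[str]     # assumed decomposition of the original question
    pseudo_labels: List[str]    # one pseudo-label per subquestion


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def sample_neighbors(query: List[float], pool: List[Example], k: int) -> List[Example]:
    """Sampling step (assumed form): keep the k pool items closest to the query."""
    ranked = sorted(pool, key=lambda ex: cosine(query, ex.embedding), reverse=True)
    return ranked[:k]


def build_prompt(demos: List[Example], query_image_id: str,
                 answered: List[tuple], next_subquestion: str) -> str:
    """Assemble an interleaved prompt that ends at the next unanswered subquestion.

    `answered` holds (subquestion, pseudo_label) pairs already produced for the
    query, so the query can be annotated one subquestion at a time.
    """
    lines = []
    for ex in demos:
        if len(ex.subquestions) != len(ex.pseudo_labels):
            raise ValueError("each subquestion needs exactly one pseudo-label")
        lines.append(f"<image:{ex.image_id}>")
        for q, a in zip(ex.subquestions, ex.pseudo_labels):
            lines.append(f"Q: {q} A: {a}")
    lines.append(f"<image:{query_image_id}>")
    for q, a in answered:
        lines.append(f"Q: {q} A: {a}")
    lines.append(f"Q: {next_subquestion} A:")
    return "\n".join(lines)


pool = [
    Example("a", [1.0, 0.0], ["Is there a cup?", "What color is it?"], ["yes", "red"]),
    Example("b", [0.0, 1.0], ["Is there a dog?"], ["no"]),
    Example("c", [0.9, 0.1], ["Is there a mug?", "What color is it?"], ["yes", "blue"]),
]
demos = sample_neighbors([1.0, 0.0], pool, k=2)
prompt = build_prompt(demos, "query", [("Is there a cup?", "yes")], "What color is it?")
```

In this toy setup the two demonstrations closest to the query are selected, and the prompt stops at the open `A:` slot for the next subquestion, which a model would fill before the following subquestion is appended.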

Published in:

SADL: Sampling, Deliberation, and Pseudo-Labeling for In-Context Learning in Compositional Visual Question Answering

Date posted:

DOI:

Publisher:

Discover Artificial Intelligence

Location:

Keywords:

Visual-linguistic reasoning, Large vision-language models, In-context learning, Compositional visual question answering

Related papers

RAC: Few-shot fruit recognition through CLIP-based ambiguity reduction

Trần Anh Đạt

Optimising source code vulnerability detection using deep learning and deep graph network

Đỗ Xuân Chợ

Real-time phishing uniform resource locator detection based on hybrid embedding transformer and retraining-free inferencing

Đàm Minh Lịnh

Diffusion Model-Enhanced Environment Reconstruction in ISAC

Nguyễn Đức Minh Quang

Multiple teacher-student model guided knowledge distillation for malpositioned catheters and lines detection on chest x-rays

Trần Anh Đạt

AgriDetectVL: Emphasizes the Agriculture-Focused Application Combined With Visual–Language Integration

Vũ Hoài Nam

Applying machine learning algorithms for PE-malware detection on the Windows operating system

Đinh Trường Duy