Vision and Language

This page covers vision and language tasks other than visual question answering (which deals only with question answering) and grounded language learning (which includes a learning component in the task).

Overviews

Uppal et al 2020 - Multimodal Research in Vision and Language: A Review of Current and Emerging Trends
Multimodal Large Language Models (MLLMs)
- Yin et al 2023 - A Survey on Multimodal Large Language Models. Comprehensive GitHub (continuously updated)
- Li et al 2025 - A Survey of State of the Art Large Vision Language Models: Alignment, Benchmark, Evaluations and Challenges
- For Visual QA:
  - Kuang et al 2024 - Natural Language Understanding and Inference with MLLM in Visual Question Answering: A Survey
- Evaluation of MLLMs:
  - Huang & Zhang 2024 - A Survey on Evaluation of Multimodal Large Language Models

Multimodal Foundation Models (Visual Language Models)

Prompting Methods

Yang et al 2023 - Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V

Multimodal Dialog Agents

Navigation Tasks

Multimodal Pretraining

See also Awesome Vision & Language Pretraining Papers.

Bugliarello et al 2021 - Multimodal Pretraining Unmasked: A Meta-Analysis and a Unified Framework of Vision-and-Language BERTs

Bibliographies

Language and Behavior Learning
Vision-and-Language: A curated list of vision and language resources.
Vision-Language Navigation
Embodied Vision
