====== Vision and Language ======

This page covers vision-and-language tasks that are distinct from [[visual question answering]] (which deals only with question answering) and [[grounded language learning]] (which adds a learning component to the task).

===== Overviews =====

  * [[https://arxiv.org/pdf/2010.09522.pdf|Uppal et al 2020 - Multimodal Research in Vision and Language: A Review of Current and Emerging Trends]]
  * **Multimodal Large Language Models (MLLMs)**
    * [[https://arxiv.org/pdf/2306.13549|Yin et al 2023 - A Survey on Multimodal Large Language Models]] Comprehensive, continuously updated [[https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models|github]]
    * [[https://arxiv.org/pdf/2501.02189|Li et al 2025 - A Survey of State of the Art Large Vision Language Models: Alignment, Benchmark, Evaluations and Challenges]]
  * For Visual QA:
    * [[https://arxiv.org/pdf/2411.17558|Kuang et al 2024 - Natural Language Understanding and Inference with MLLM in Visual Question Answering: A Survey]]
  * Evaluation of MLLMs:
    * [[https://arxiv.org/pdf/2408.15769|Huang & Zhang 2024 - A Survey on Evaluation of Multimodal Large Language Models]]

===== Multimodal Foundation Models (Visual Language Models) =====

  * CLIP: [[https://arxiv.org/pdf/2103.00020.pdf|Radford et al 2021 - Learning Transferable Visual Models From Natural Language Supervision]] [[https://github.com/openai/CLIP|github]] (usage sketch below this list)
  * [[https://arxiv.org/pdf/2201.12086.pdf|Li et al 2022 - BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation]] [[https://github.com/salesforce/BLIP|github]]
  * [[https://arxiv.org/pdf/2204.14198.pdf|Alayrac et al 2022 - Flamingo: a Visual Language Model for Few-Shot Learning]]
  * [[https://arxiv.org/pdf/2301.12597.pdf|Li et al 2023 - BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models]] [[https://github.com/salesforce/LAVIS/tree/main/projects/blip2|github]]
  * LLaVA: **[[https://arxiv.org/pdf/2304.08485.pdf|Liu et al 2023 - Visual Instruction Tuning]]** [[https://llava-vl.github.io/|project page: LLaVA: Large Language and Vision Assistant]]
  * [[https://arxiv.org/pdf/2304.10592.pdf|Zhu et al 2023 - MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models]]
  * [[https://arxiv.org/pdf/2312.11805|Gemini Team 2023 - Gemini: A Family of Highly Capable Multimodal Models]]
  * [[https://arxiv.org/pdf/2403.13600|Qiao et al 2024 - VL-Mamba: Exploring State Space Models for Multimodal Learning]]
  * [[https://arxiv.org/pdf/2403.05530|Gemini Team 2024 - Gemini 1.5: Unlocking Multimodal Understanding Across Millions of Tokens of Context]]
  * [[https://arxiv.org/pdf/2405.09818|Meta 2024 - Chameleon: Mixed-Modal Early-Fusion Foundation Models]]
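As a minimal sketch of how the CLIP entry above is typically used (zero-shot image classification with the openai/CLIP package linked there); the image path, label prompts, and choice of the ''ViT-B/32'' checkpoint are illustrative placeholders, not taken from this page:

<code python>
import torch
import clip                      # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load one of the released CLIP checkpoints (ViT-B/32 is just one variant)
model, preprocess = clip.load("ViT-B/32", device=device)

# Placeholder image and candidate label prompts
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(["a photo of a dog", "a photo of a cat", "a diagram"]).to(device)

with torch.no_grad():
    # Similarity logits between the image and each text prompt
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

print("Label probabilities:", probs)   # shape (1, num_prompts)
</code>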
==== Prompting Methods ====

  * [[https://arxiv.org/pdf/2310.11441|Yang et al 2023 - Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V]]

===== Multimodal Dialog Agents =====

  * Overviews
    * [[https://arxiv.org/pdf/2205.06907.pdf|Sundar & Heck 2022 - Multimodal Conversational AI: A Survey of Datasets and Approaches]]
  * Diana
    * [[https://www.cs.colostate.edu/~draper/papers/narayana_intellisys18.pdf|Narayana et al 2018 - Cooperating with Avatars Through Gesture, Language and Action]]
    * [[https://www.cs.colostate.edu/~dwhite54/pubs/mcneely_hcc2019.pdf|McNeely-White et al 2019 - User-Aware Shared Perception for Embodied Agents]]
    * {{papers:diana_s_world_a_situated_multimodal_interactive_agent.pdf|Krishnaswamy et al 2020 - Diana's World: A Situated Multimodal Interactive Agent}}
    * [[https://arxiv.org/pdf/2003.07385.pdf|Krishnaswamy & Pustejovsky 2020 - A Formal Analysis of Multimodal Referring Strategies Under Common Ground]]

===== Navigation Tasks =====

See also this [[https://github.com/daqingliu/awesome-vln|bibliography]].

  * [[https://arxiv.org/pdf/1806.02724.pdf|Fried et al 2018 - Speaker-Follower Models for Vision-and-Language Navigation]]

===== Multimodal Pretraining =====

See also [[https://github.com/yuewang-cuhk/awesome-vision-language-pretraining-papers|Awesome Vision & Language Pretraining Papers]].

  * [[https://aclanthology.org/2021.tacl-1.58.pdf|Bugliarello et al 2021 - Multimodal Pretraining Unmasked: A Meta-Analysis and a Unified Framework of Vision-and-Language BERTs]]

===== Bibliographies =====

  * [[https://github.com/jacobandreas/bibs/blob/master/language_behavior.md|Language and Behavior Learning]]
  * [[https://github.com/sangminwoo/awesome-vision-and-language|Vision-and-Language]] A curated list of vision and language resources.
  * [[https://github.com/daqingliu/awesome-vln|Vision-Language Navigation]]
  * [[https://github.com/ChanganVR/awesome-embodied-vision|Embodied Vision]]

===== People =====

  * [[https://scholar.google.com/citations?user=_bs7PqgAAAAJ&hl=en|Dhruv Batra]]
  * [[https://scholar.google.com/citations?user=TP_JZm8AAAAJ&hl=en|Jason Baldridge]]
  * [[https://scholar.google.com/citations?user=4HrSWNcAAAAJ&hl=en|Tamara Berg]]
  * [[https://scholar.google.com/citations?user=vhP-tlcAAAAJ&hl=en|Yejin Choi]]
  * [[https://scholar.google.com/citations?user=mS5k4CYAAAAJ&hl=en|Justin Johnson]]
  * [[https://scholar.google.com/citations?user=p9RsPG4AAAAJ&hl=en|Ray Mooney]]
  * [[https://scholar.google.com/citations?user=56UT_6IAAAAJ&hl=en|James Pustejovsky]]
  * [[https://scholar.google.com/citations?user=8BeTDr0AAAAJ&hl=en|Jesse Thomason]]
  * [[https://scholar.google.com/citations?user=YjqluE0AAAAJ&hl=en|Xin Eric Wang]]
  * [[https://scholar.google.com/citations?user=6RxMYNEAAAAJ&hl=en|Mark Yatskar]]

===== Related Pages =====

  * [[robotics:Embodied AI]]
  * [[Grounding]]
  * [[Grounded Language Learning]]
  * [[Image Captioning]]
  * [[Image Captioning#Video Captioning]]
  * [[Visual Question Answering]]