====== Vision and Language ======

This page covers vision-and-language tasks that are distinct from [[visual question answering]] (which deals only with question answering) and [[grounded language learning]] (which adds a learning component to the task).

===== Overviews =====

  * [[https://arxiv.org/pdf/2010.09522.pdf|Uppal et al 2020 - Multimodal Research in Vision and Language: A Review of Current and Emerging Trends]]
  * **Multimodal Large Language Models (MLLMs)**
    * [[https://arxiv.org/pdf/2306.13549|Yin et al 2023 - A Survey on Multimodal Large Language Models]] Comprehensive, continuously updated [[https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models|github]]
    * [[https://arxiv.org/pdf/2501.02189|Li et al 2025 - A Survey of State of the Art Large Vision Language Models: Alignment, Benchmark, Evaluations and Challenges]]
  * For Visual QA:
    * [[https://arxiv.org/pdf/2411.17558|Kuang et al 2024 - Natural Language Understanding and Inference with MLLM in Visual Question Answering: A Survey]]
  * Evaluation of MLLMs:
    * [[https://arxiv.org/pdf/2408.15769|Huang & Zhang 2024 - A Survey on Evaluation of Multimodal Large Language Models]]

===== Multimodal Foundation Models (Visual Language Models) =====

  * CLIP: [[https://arxiv.org/pdf/2103.00020.pdf|Radford et al 2021 - Learning Transferable Visual Models From Natural Language Supervision]] [[https://github.com/openai/CLIP|github]] (usage sketch below this list)
  * [[https://arxiv.org/pdf/2201.12086.pdf|Li et al 2022 - BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation]] [[https://github.com/salesforce/BLIP|github]]
  * [[https://arxiv.org/pdf/2204.14198.pdf|Alayrac et al 2022 - Flamingo: a Visual Language Model for Few-Shot Learning]]
  * [[https://arxiv.org/pdf/2301.12597.pdf|Li et al 2023 - BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models]] [[https://github.com/salesforce/LAVIS/tree/main/projects/blip2|github]]
  * LLaVA: **[[https://arxiv.org/pdf/2304.08485.pdf|Liu et al 2023 - Visual Instruction Tuning]]** [[https://llava-vl.github.io/|project page: LLaVA: Large Language and Vision Assistant]]
  * [[https://arxiv.org/pdf/2304.10592.pdf|Zhu et al 2023 - MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models]]
  * [[https://arxiv.org/pdf/2312.11805|Gemini Team 2023 - Gemini: A Family of Highly Capable Multimodal Models]]
  * [[https://arxiv.org/pdf/2403.13600|Qiao et al 2024 - VL-Mamba: Exploring State Space Models for Multimodal Learning]]
  * [[https://arxiv.org/pdf/2403.05530|Gemini Team 2024 - Gemini 1.5: Unlocking Multimodal Understanding Across Millions of Tokens of Context]]
  * [[https://arxiv.org/pdf/2405.09818|Meta 2024 - Chameleon: Mixed-Modal Early-Fusion Foundation Models]]
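As a minimal sketch of how the CLIP entry above is typically used (zero-shot image classification with the openai/CLIP package linked there); the image path, label prompts, and choice of the ''ViT-B/32'' checkpoint are illustrative placeholders, not taken from this page:

<code python>
import torch
import clip                      # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load one of the released CLIP checkpoints (ViT-B/32 is just one variant)
model, preprocess = clip.load("ViT-B/32", device=device)

# Placeholder image and candidate label prompts
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(["a photo of a dog", "a photo of a cat", "a diagram"]).to(device)

with torch.no_grad():
    # Similarity logits between the image and each text prompt
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

print("Label probabilities:", probs)   # shape (1, num_prompts)
</code>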
==== Prompting Methods ====

  * [[https://arxiv.org/pdf/2310.11441|Yang et al 2023 - Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V]]

===== Multimodal Dialog Agents =====

  * Overviews
    * [[https://arxiv.org/pdf/2205.06907.pdf|Sundar & Heck 2022 - Multimodal Conversational AI: A Survey of Datasets and Approaches]]
  * Diana
    * [[https://www.cs.colostate.edu/~draper/papers/narayana_intellisys18.pdf|Narayana et al 2018 - Cooperating with Avatars Through Gesture, Language and Action]]
    * [[https://www.cs.colostate.edu/~dwhite54/pubs/mcneely_hcc2019.pdf|McNeely-White et al 2019 - User-Aware Shared Perception for Embodied Agents]]
    * {{papers:diana_s_world_a_situated_multimodal_interactive_agent.pdf|Krishnaswamy et al 2020 - Diana's World: A Situated Multimodal Interactive Agent}}
    * [[https://arxiv.org/pdf/2003.07385.pdf|Krishnaswamy & Pustejovsky 2020 - A Formal Analysis of Multimodal Referring Strategies Under Common Ground]]

===== Navigation Tasks =====

See also this [[https://github.com/daqingliu/awesome-vln|bibliography]].

  * [[https://arxiv.org/pdf/1806.02724.pdf|Fried et al 2018 - Speaker-Follower Models for Vision-and-Language Navigation]]

===== Multimodal Pretraining =====

See also [[https://github.com/yuewang-cuhk/awesome-vision-language-pretraining-papers|Awesome Vision & Language Pretraining Papers]].

  * [[https://aclanthology.org/2021.tacl-1.58.pdf|Bugliarello et al 2021 - Multimodal Pretraining Unmasked: A Meta-Analysis and a Unified Framework of Vision-and-Language BERTs]]

===== Bibliographies =====

  * [[https://github.com/jacobandreas/bibs/blob/master/language_behavior.md|Language and Behavior Learning]]
  * [[https://github.com/sangminwoo/awesome-vision-and-language|Vision-and-Language]] A curated list of vision and language resources.
  * [[https://github.com/daqingliu/awesome-vln|Vision-Language Navigation]]
  * [[https://github.com/ChanganVR/awesome-embodied-vision|Embodied Vision]]

===== People =====

  * [[https://scholar.google.com/citations?user=_bs7PqgAAAAJ&hl=en|Dhruv Batra]]
  * [[https://scholar.google.com/citations?user=TP_JZm8AAAAJ&hl=en|Jason Baldridge]]
  * [[https://scholar.google.com/citations?user=4HrSWNcAAAAJ&hl=en|Tamara Berg]]
  * [[https://scholar.google.com/citations?user=vhP-tlcAAAAJ&hl=en|Yejin Choi]]
  * [[https://scholar.google.com/citations?user=mS5k4CYAAAAJ&hl=en|Justin Johnson]]
  * [[https://scholar.google.com/citations?user=p9RsPG4AAAAJ&hl=en|Ray Mooney]]
  * [[https://scholar.google.com/citations?user=56UT_6IAAAAJ&hl=en|James Pustejovsky]]
  * [[https://scholar.google.com/citations?user=8BeTDr0AAAAJ&hl=en|Jesse Thomason]]
  * [[https://scholar.google.com/citations?user=YjqluE0AAAAJ&hl=en|Xin Eric Wang]]
  * [[https://scholar.google.com/citations?user=6RxMYNEAAAAJ&hl=en|Mark Yatskar]]

===== Related Pages =====

  * [[robotics:Embodied AI]]
  * [[Grounding]]
  * [[Grounded Language Learning]]
  * [[Image Captioning]]
  * [[Image Captioning#Video Captioning]]
  * [[Visual Question Answering]]