
Vision and Language

This page covers vision-and-language tasks other than visual question answering (which deals only with question answering) and grounded language learning (which includes a learning component in the task).

Overviews

Uppal et al 2020 - Multimodal Research in Vision and Language: A Review of Current and Emerging Trends

Multimodal Foundation Models (Visual Language Models)

CLIP: Radford et al 2021 - Learning Transferable Visual Models From Natural Language Supervision (GitHub)
Li et al 2022 - BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation (GitHub)
Alayrac et al 2022 - Flamingo: a Visual Language Model for Few-Shot Learning (Open model)
Li et al 2023 - BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models (GitHub)
LLaVA: Liu et al 2023 - Visual Instruction Tuning (GitHub: LLaVA: Large Language and Vision Assistant)
Zhu et al 2023 - MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
Qiao et al 2024 - VL-Mamba: Exploring State Space Models for Multimodal Learning
Gemini Team 2024 - Gemini 1.5: Unlocking Multimodal Understanding Across Millions of Tokens of Context

Multimodal Dialog Agents

Navigation Tasks

Multimodal Pretraining

Bugliarello et al 2021 - Multimodal Pretraining Unmasked: A Meta-Analysis and a Unified Framework of Vision-and-Language BERTs

Bibliographies

Language and Behavior Learning
Vision-and-Language: A curated list of vision and language resources.
Vision-Language Navigation
Embodied Vision
