Vision and Language

This page covers vision-and-language tasks other than visual question answering (which deals only with question answering) and grounded language learning (where learning is itself part of the task).

Overviews

Uppal et al 2020 - Multimodal Research in Vision and Language: A Review of Current and Emerging Trends

Multimodal Foundation Models (Visual Language Models)

CLIP: Radford et al 2021 - Learning Transferable Visual Models From Natural Language Supervision (github)
Li et al 2022 - BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
Alayrac et al 2022 - Flamingo: a Visual Language Model for Few-Shot Learning (Open model)
Li et al 2023 - BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

Multimodal Dialog Agents

Navigation Tasks

Multimodal Pretraining

Bugliarello et al 2021 - Multimodal Pretraining Unmasked: A Meta-Analysis and a Unified Framework of Vision-and-Language BERTs

Bibliographies

Language and Behavior Learning
Vision-and-Language: a curated list of vision and language resources.
Vision-Language Navigation
Embodied Vision
