====== Vision and Language ======

===== Overviews =====
  * [[https://arxiv.org/pdf/2010.09522.pdf|Uppal et al 2020 - Multimodal Research in Vision and Language: A Review of Current and Emerging Trends]]
  * **Multimodal Large Language Models (MLLMs)**
    * [[https://arxiv.org/pdf/2306.13549|Yin et al 2023 - A Survey on Multimodal Large Language Models]], with a comprehensive, continuously updated companion [[https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models|github]] list
    * [[https://arxiv.org/pdf/2501.02189|Li et al 2025 - A Survey of State of the Art Large Vision Language Models: Alignment, Benchmark, Evaluations and Challenges]]
    * For Visual QA:
      * [[https://arxiv.org/pdf/2411.17558|Kuang et al 2024 - Natural Language Understanding and Inference with MLLM in Visual Question Answering: A Survey]]
    * Evaluation of MLLMs:
      * [[https://arxiv.org/pdf/2408.15769|Huang & Zhang 2024 - A Survey on Evaluation of Multimodal Large Language Models]]
  
===== Multimodal Foundation Models (Visual Language Models) =====
  * [[https://arxiv.org/pdf/2301.12597.pdf|Li et al 2023 - BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models]] [[https://github.com/salesforce/LAVIS/tree/main/projects/blip2|github]] (minimal usage sketch after this list)
  * LLaVA: **[[https://arxiv.org/pdf/2304.08485.pdf|Liu et al 2023 - Visual Instruction Tuning]]** Project page: [[https://llava-vl.github.io/|LLaVA: Large Language and Vision Assistant]]
  * [[https://arxiv.org/pdf/2304.10592.pdf|Zhu et al 2023 - MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models]]
  * [[https://arxiv.org/pdf/2312.11805|Gemini-team 2023 - Gemini: A Family of Highly Capable Multimodal Models]]
  * [[https://arxiv.org/pdf/2403.13600|Qiao et al 2024 - VL-Mamba: Exploring State Space Models for Multimodal Learning]]
  * [[https://arxiv.org/pdf/2403.05530|Gemini-team 2024 - Gemini 1.5: Unlocking Multimodal Understanding Across Millions of Tokens of Context]]
  * [[https://arxiv.org/pdf/2405.09818|Meta 2024 - Chameleon: Mixed-Modal Early-Fusion Foundation Models]]
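A minimal sketch of querying one of these models, using BLIP-2 through the Hugging Face ''transformers'' library. The checkpoint id and the "Question: ... Answer:" prompt format follow the official BLIP-2 release; the image path and generation settings are illustrative assumptions:

<code python>
# Visual question answering with BLIP-2 (requires transformers >= 4.27).
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16
).to("cuda")

image = Image.open("example.jpg")  # hypothetical input image
prompt = "Question: what is shown in the image? Answer:"
inputs = processor(images=image, text=prompt, return_tensors="pt").to("cuda", torch.float16)

# Generate a short answer conditioned on the image and the text prompt.
generated = model.generate(**inputs, max_new_tokens=30)
print(processor.batch_decode(generated, skip_special_tokens=True)[0].strip())
</code>

LLaVA and MiniGPT-4 follow a broadly similar recipe (a vision encoder bridged into a pretrained LLM), each with its own processor and prompt template.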
==== Prompting Methods ====
  * [[https://arxiv.org/pdf/2310.11441|Yang et al 2023 - Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V]] (see the overlay sketch below)
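A minimal sketch of the Set-of-Mark idea: overlay numbered marks on image regions, then let the text prompt refer to the numbers. The paper derives regions from a segmentation model (e.g. SAM); the hard-coded boxes, file names, and drawing style here are illustrative assumptions:

<code python>
# Set-of-Mark style overlay with PIL: number each region, then reference
# the numbers in the prompt sent to a vision-language model.
from PIL import Image, ImageDraw

def add_marks(image, boxes):
    """Draw a numbered red box over each region; boxes are (x0, y0, x1, y1)."""
    draw = ImageDraw.Draw(image)
    for i, (x0, y0, x1, y1) in enumerate(boxes, start=1):
        draw.rectangle((x0, y0, x1, y1), outline="red", width=3)
        draw.text((x0 + 4, y0 + 4), str(i), fill="red")
    return image

image = Image.open("scene.jpg")  # hypothetical input image
boxes = [(40, 60, 200, 220), (260, 80, 390, 300)]  # hypothetical region proposals
add_marks(image, boxes).save("scene_som.jpg")

# The marked image is then sent to the VLM with a mark-referencing prompt,
# e.g. "What is the object labeled 2 resting on?"
</code>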
  
===== Multimodal Dialog Agents =====
===== Related Pages =====
  * [[robotics:Embodied AI]]
  * [[Grounding]]
  * [[Grounded Language Learning]]
  * [[Image Captioning]]