===== Overviews =====
  * **Multimodal Large Language Models (MLLMs)**
    * [[https://arxiv.org/pdf/2306.13549|Yin et al 2023 - A Survey on Multimodal Large Language Models]] Comprehensive [[https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models|github]] (continuously updated)
    * [[https://arxiv.org/pdf/2501.02189|Li et al 2025 - A Survey of State of the Art Large Vision Language Models: Alignment, Benchmark, Evaluations and Challenges]]
    * For Visual QA:
      * [[https://arxiv.org/pdf/2411.17558|Kuang et al 2024 - Natural Language Understanding and Inference with MLLM in Visual Question Answering: A Survey]]
    * Evaluation of MLLMs:
      * [[https://arxiv.org/pdf/2408.15769|Huang & Zhang 2024 - A Survey on Evaluation of Multimodal Large Language Models]]

===== Multimodal Foundation Models (Visual Language Models) =====
  * [[https://arxiv.org/pdf/2403.13600|Qiao et al 2024 - VL-Mamba: Exploring State Space Models for Multimodal Learning]]
  * [[https://arxiv.org/pdf/2403.05530|Gemini-team 2024 - Gemini 1.5: Unlocking Multimodal Understanding Across Millions of Tokens of Context]]
  * [[https://arxiv.org/pdf/2405.09818|Meta 2024 - Chameleon: Mixed-Modal Early-Fusion Foundation Models]]

==== Prompting Methods ====
  * [[https://arxiv.org/pdf/2310.11441|Yang et al 2023 - Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V]]

===== Multimodal Dialog Agents =====