====== BERT ======

===== Introductions to BERT =====

  * Paper: [[https://arxiv.org/pdf/1810.04805.pdf|Devlin et al 2018 - BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding]]
  * Blogs
    * [[https://www.analyticsvidhya.com/blog/2019/09/demystifying-bert-groundbreaking-nlp-framework/|Demystifying BERT: A Comprehensive Guide to the Groundbreaking NLP Framework]]
    * [[http://jalammar.github.io/illustrated-bert/|The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning)]]
  * Textbooks
    * [[https://web.stanford.edu/~jurafsky/slp3/11.pdf|SLP Ch 11]] (especially [[https://web.stanford.edu/~jurafsky/slp3/11.pdf#page=6|11.2]])
  * Training from scratch
    * [[https://aclanthology.org/2021.emnlp-main.831.pdf|Izsak et al 2021 - How to Train BERT with an Academic Budget]]
  * Retrospective analysis
    * [[https://arxiv.org/pdf/2306.02870.pdf|Nityasya et al 2023 - On “Scientific Debt” in NLP: A Case for More Rigour in Language Model Pre-Training Research]]

===== Extensions =====

  * [[https://arxiv.org/pdf/1902.04094.pdf|Wang & Cho 2019 - BERT has a Mouth, and It Must Speak: BERT as a Markov Random Field Language Model]] WARNING: this paper contains a mistake; [[https://sites.google.com/site/deepernn/home/blog/amistakeinwangchoberthasamouthanditmustspeakbertasamarkovrandomfieldlanguagemodel|it's not an MRF]].
  * [[https://arxiv.org/pdf/2106.02736.pdf|Goyal et al 2021 - Exposing the Implicit Energy Networks behind Masked Language Models via Metropolis–Hastings]]

===== Interpretation and Properties (BERTology) =====

Summary: [[https://arxiv.org/pdf/2002.12327.pdf|Rogers et al 2020 - A Primer in BERTology: What we know about how BERT works]]. See also [[ml:Neural Network Psychology]].

  * [[https://arxiv.org/pdf/1906.04341.pdf|Clark et al 2019 - What Does BERT Look At? An Analysis of BERT’s Attention]] Also points out that BERT attends to the [SEP] token as a no-op.
  * [[https://arxiv.org/pdf/1909.10430.pdf|Wiedemann et al 2019 - Does BERT Make Any Sense? Interpretable Word Sense Disambiguation with Contextualized Embeddings]]
  * [[https://arxiv.org/pdf/1905.06316.pdf|Tenney et al 2019 - What do you learn from context? Probing for sentence structure in contextualized word representations]]
  * [[https://arxiv.org/pdf/1905.05950.pdf|Tenney et al 2019 - BERT Rediscovers the Classical NLP Pipeline]]
  * [[https://twitter.com/lvwerra/status/1485301457813487619?s=21|2022 - Visualization of position embeddings in BERT and GPT-2 (Twitter)]]
  * [[https://arxiv.org/pdf/2203.06204.pdf|Papadimitriou et al 2022 - When classifying grammatical role, BERT doesn’t care about word order... except when it matters]]

===== Applications =====

  * [[https://arxiv.org/pdf/1905.05583.pdf|Sun et al 2019 - How to Fine-Tune BERT for Text Classification?]] An exhaustive study of different fine-tuning methods for BERT on text classification tasks; provides a general strategy for BERT fine-tuning.
  * [[https://arxiv.org/pdf/1904.05255.pdf|Shi & Lin 2019 - Simple BERT Models for Relation Extraction and Semantic Role Labeling]]
  * [[https://arxiv.org/pdf/1903.10318.pdf|Liu 2019 - Fine-tune BERT for Extractive Summarization]]
  * [[https://arxiv.org/pdf/1904.09675.pdf|Zhang et al 2019 - BERTScore: Evaluating Text Generation with BERT]]

===== Domain & Language Variants =====

  * [[https://arxiv.org/pdf/2004.10220.pdf|Mulyar et al 2020 - MT-Clinical BERT: Scaling Clinical Information Extraction with Multitask Learning]]

===== Other Variants =====

  * [[https://arxiv.org/pdf/1909.05840.pdf|Shen et al 2019 - Q-BERT: Hessian Based Ultra Low Precision Quantization of BERT]]
  * [[https://arxiv.org/pdf/1910.01108.pdf|Sanh et al 2019 - DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter]]

===== Related Pages =====

  * [[Pretraining]]
  * [[Transformers]]
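As a concrete companion to the Devlin et al 2018 pretraining objective cited above: BERT's masked-language-model corruption selects roughly 15% of input positions; of those, 80% are replaced by ''[MASK]'', 10% by a random vocabulary token, and 10% are left unchanged. A minimal sketch in plain Python (the ''mask_tokens'' helper and the toy ''VOCAB'' are illustrative, not part of any BERT implementation):

```python
import random

MASK = "[MASK]"
VOCAB = ["the", "a", "cat", "dog", "runs", "sleeps"]  # toy vocabulary for illustration

def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """Sketch of the BERT masking scheme (Devlin et al 2018): select
    mask_prob of the positions as prediction targets; replace 80% of
    the selected tokens with [MASK], 10% with a random vocabulary
    token, and leave 10% unchanged. Returns (corrupted, target_positions)."""
    rng = rng or random.Random()
    corrupted, targets = list(tokens), []
    for i in range(len(tokens)):
        if rng.random() < mask_prob:
            targets.append(i)          # the model must predict the original token here
            r = rng.random()
            if r < 0.8:
                corrupted[i] = MASK    # 80%: replace with [MASK]
            elif r < 0.9:
                corrupted[i] = rng.choice(VOCAB)  # 10%: replace with a random token
            # else 10%: keep the original token unchanged
    return corrupted, targets

# Usage: corrupt a short sequence with a seeded RNG for reproducibility.
corrupted, targets = mask_tokens("the cat sleeps".split(),
                                 mask_prob=0.5, rng=random.Random(0))
```

Keeping 10% of selected tokens unchanged matters because BERT never sees ''[MASK]'' at fine-tuning time; it forces the model to maintain useful representations for unmasked tokens too.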