====== ETC: Encoding Long and Structured Inputs in Transformers ======

[[https://arxiv.org/abs/2004.08483v5|ETC: Encoding Long and Structured Inputs in Transformers]]

**TLDR;** ETC encodes long inputs with global-local attention and represents structure by combining relative position representations with flexible attention masking. It also employs a CPC pre-training objective so the hierarchical global tokens learn to summarize the structure beneath them.

==== Key Points ====

  * Global-local attention: the input tokens are divided into two sets (a minimal masking sketch follows this list)
    * Global: can attend to all input tokens
    * Long: can only attend locally to nearby tokens
    * Related to [[https://arxiv.org/abs/2004.05150v1|Longformer]].
  * [[https://arxiv.org/abs/1803.02155v2|Relative position representations]]: allow encoding arbitrary structural relations between input tokens (second sketch below).
  * [[https://arxiv.org/abs/1807.03748v2|Contrastive predictive coding (CPC)]]: pre-training objective that teaches the model how to use the global summary tokens (third sketch below).
  * The experiments use benchmarks with long/structured inputs
    * Question answering: Natural Questions (NQ), HotpotQA, WikiHop
    * Keyphrase extraction: OpenKP
  * Pre-training uses the original BERT datasets, filtering out documents with fewer than 7 sentences
    * MLM
    * CPC
    * Lifting weights directly from RoBERTa (with RoBERTa's vocabulary)
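
A minimal NumPy sketch of the attention mask the global-local split implies (the four pieces the paper calls g2g, g2l, l2g, and l2l). The function name, the dense boolean mask, and the example sizes are illustrative assumptions; the actual model computes the banded local part efficiently rather than materializing an n×n mask.

<code python>
import numpy as np

def global_local_mask(n_global: int, n_long: int, local_radius: int) -> np.ndarray:
    """Boolean attention mask over [global tokens | long tokens].

    True at (i, j) means query token i may attend to key token j.
    Global queries see every token (g2g and g2l are dense); long queries
    see every global token (l2g) but only long tokens within
    `local_radius` positions of themselves (banded l2l).
    """
    n = n_global + n_long
    mask = np.zeros((n, n), dtype=bool)

    # g2g + g2l: global tokens attend to the full input.
    mask[:n_global, :] = True

    # l2g: long tokens attend to all global tokens.
    mask[n_global:, :n_global] = True

    # l2l: banded local attention among the long tokens.
    for i in range(n_long):
        lo = max(0, i - local_radius)
        hi = min(n_long, i + local_radius + 1)
        mask[n_global + i, n_global + lo:n_global + hi] = True

    return mask

# Tiny example: 2 global tokens, 6 long tokens, local radius 1.
print(global_local_mask(2, 6, 1).astype(int))
</code>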
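
A small sketch of Shaw-style relative position ids, assuming the usual clip-to-a-window scheme. ETC generalizes these ids to arbitrary relation labels (e.g. sentence-to-token links) to encode document structure, which this sketch does not show.

<code python>
import numpy as np

def relative_position_ids(seq_len: int, max_distance: int) -> np.ndarray:
    """Map each (query i, key j) pair to one of 2 * max_distance + 1 labels.

    Following Shaw et al., the signed offset j - i is clipped to
    [-max_distance, max_distance] and shifted to a non-negative id that
    indexes a learned embedding table; the selected vector is added to
    the attention keys.
    """
    offsets = np.arange(seq_len)[None, :] - np.arange(seq_len)[:, None]
    clipped = np.clip(offsets, -max_distance, max_distance)
    return clipped + max_distance  # ids in [0, 2 * max_distance]

# Each row shows the relative-position ids seen by one query token.
print(relative_position_ids(5, 2))
</code>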
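
And a toy InfoNCE-style loss in the spirit of the CPC objective: a masked sentence's global token should score its own sentence summary higher than the other sentences' summaries in the batch. The cosine similarity, temperature, and the way the targets are produced are simplifying assumptions here, not the paper's exact formulation.

<code python>
import numpy as np

def info_nce_loss(predictions: np.ndarray, targets: np.ndarray, temperature: float = 1.0) -> float:
    """Toy InfoNCE loss between predicted and target summary vectors.

    Row i of `predictions` is the global-token output for a masked
    sentence; row i of `targets` is a summary of that same sentence when
    its tokens are visible. Each prediction is trained to match its own
    target (the diagonal) against the other targets in the batch.
    """
    # Cosine-similarity logits between every prediction and every target.
    p = predictions / np.linalg.norm(predictions, axis=1, keepdims=True)
    t = targets / np.linalg.norm(targets, axis=1, keepdims=True)
    logits = p @ t.T / temperature

    # Cross-entropy with the matching (diagonal) target as the label.
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

rng = np.random.default_rng(0)
print(info_nce_loss(rng.normal(size=(4, 8)), rng.normal(size=(4, 8))))
</code>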