ETC: Encoding Long and Structured Inputs in Transformers
TL;DR: ETC encodes long inputs with global-local attention and represents structure by combining relative position representations with flexible attention masking. It also uses a CPC pre-training objective so that global tokens learn to summarize their corresponding structures (e.g., sentences).
Key Points
- Global-local attention: the input tokens are divided into two sets
- Global: can attend to all input tokens
- Long: can attend only to nearby long tokens within a fixed local radius, plus the global tokens
- Related to Longformer.
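The split above can be expressed as a boolean attention mask. The sketch below is a hypothetical, simplified construction (the function name and token ordering are assumptions, not the paper's implementation), showing the four attention pieces: global-to-all, all-to-global, and local long-to-long.

```python
import numpy as np

def global_local_mask(n_global: int, n_long: int, radius: int) -> np.ndarray:
    """Boolean mask for ETC-style global-local attention (illustrative sketch).

    Token order is assumed to be [global tokens | long tokens].
    mask[i, j] is True if token i may attend to token j.
    """
    n = n_global + n_long
    mask = np.zeros((n, n), dtype=bool)
    # Global tokens attend to every token (g2g and g2l).
    mask[:n_global, :] = True
    # Every token attends to the global tokens (l2g).
    mask[:, :n_global] = True
    # Long tokens attend to long tokens within a local radius (l2l).
    for i in range(n_long):
        lo = max(0, i - radius)
        hi = min(n_long, i + radius + 1)
        mask[n_global + i, n_global + lo:n_global + hi] = True
    return mask
```

Because long tokens only attend within a fixed radius, the cost of the long-to-long piece grows linearly with input length rather than quadratically, which is what makes long inputs tractable.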
- Relative position representations: allow encoding arbitrary structural relations between input tokens.
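A minimal sketch of the underlying idea (clipped relative positions, in the style of Shaw et al.): each token pair (i, j) is mapped to a relation id, and each id gets a learnable representation. The function name and bucketing are assumptions for illustration; ETC generalizes this by assigning extra relation ids to structural relations such as sentence membership.

```python
import numpy as np

def relative_position_ids(n: int, max_distance: int) -> np.ndarray:
    """Map each (i, j) token pair to a relative-position relation id.

    Distances are clipped to [-max_distance, max_distance], giving
    2 * max_distance + 1 relation types, each with its own learnable
    embedding. Illustrative sketch, not the paper's exact code.
    """
    idx = np.arange(n)
    rel = idx[None, :] - idx[:, None]           # signed distance j - i
    rel = np.clip(rel, -max_distance, max_distance)
    return rel + max_distance                   # shift to [0, 2 * max_distance]
```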
- Contrastive predictive coding (CPC): pre-training objective that helps the model learn how to use global summary tokens.
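The CPC objective can be sketched as an InfoNCE loss: the model's prediction for a masked sentence's global summary token should score its true summary higher than the summaries of other sentences. The standalone implementation below is a hypothetical sketch of that loss, not ETC's actual training code.

```python
import numpy as np

def info_nce(predicted: np.ndarray, targets: np.ndarray,
             temperature: float = 1.0) -> float:
    """InfoNCE loss as used in CPC-style pre-training (sketch).

    predicted: (n, d) predictions for n masked sentences' summaries.
    targets:   (n, d) summary vectors of the true sentences.
    Row i of `predicted` is treated as a classifier over the n targets,
    with target i as the correct class.
    """
    # Normalize so the dot products behave like cosine similarities.
    p = predicted / np.linalg.norm(predicted, axis=1, keepdims=True)
    t = targets / np.linalg.norm(targets, axis=1, keepdims=True)
    logits = p @ t.T / temperature              # (n, n) similarity matrix
    # Numerically stable cross-entropy with the diagonal as the label.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))
```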
- The experiments use benchmarks with long and/or structured inputs
- Question answering: Natural Questions (NQ), HotpotQA, WikiHop
- Keyphrase extraction: OpenKP
- It was pre-trained on the original BERT datasets, filtering out documents with fewer than 7 sentences.
- MLM (masked language modeling)
- CPC
- Weights lifted directly from RoBERTa (reusing RoBERTa's vocabulary)