This is an old revision of the document!

Dataset Creation

Annotation

For annotation tools, see Software - Annotation Tools.
To annotate data manually (without using crowdsourcing), practitioners generally follow these steps:

Gather data: Decide the data source and gather some data to annotate. Be sure to consider any ethical issues. If you want to be able to release the data publicly, check for potential copyright or privacy violations.
Decide what to annotate: Look at a portion of the data and decide a rough idea of what you want to annotate – that is the phenomena you want to capture and at what granularity. Write a document describing the preliminary annotation scheme.
Pilot annotation: Try annotating some data by yourself or with some collegues using the annotation scheme. Compare annotations and decide on edge cases (decide what to do on the difficult boardline cases). Decide if you want to simplify or extend the annotation scheme.
Refine annotation scheme (iterative): Refine your annotation scheme until you're happy with it and it is easy to annotate. This may take several rounds of pilot annotation.
Compute inter-annotator agreement: Make sure to doubly annotate a subset of the data you annotate so you can compute inter-annotator agreement
Full-scale annotation: Annotate a bunch of data yourself or using annotators you've trained (usually it helps to have these annotators involved in developing the annotation scheme during the pilot annotation). Depending on the complexity of the annotation task, you may need to have regular meetings during this time to decide on edge cases as they come up.
Release: Release it with a document or published paper describing the annotation scheme and a datasheet. Make sure there are no copyright or privacy violations when releasing the data. You can create a project page such as this one or this one.
Updates: You can release another version of the data to annotate more data or fix errors. You can also have a bug report form (like this one) for the dataset to allow others to fix errors.

Annotation Agreement

Measures of inter-annotator agreement
- Cohen's kappa Better than percent agreement, see slides here
- Weighted Cohen's kappa
- Fleiss' kappa More general than Cohen's kappa
Software
- R's Inter-Annotator Reliability Package (IRR) is great. docs example

Dataset and Data Selection Issues

Data Validation

See Data Cleaning and Validation.

Crowdsourcing

See Crowdsourcing.

Alternative Methods

Methods of Faster or Cheaper Annotation

Garrette & Baldridge 2013 - Learning a Part-of-Speech Tagger from Two Hours of Annotation Talks efficiency of annotating types vs tokens
Schneider et al 2013 - A Framework for (Under)specifying Dependency Syntax without Overloading Annotators
Ratner et al 2016 - Data Programming: Creating Large Training Sets, Quickly
See also Prompting and Scao & Rush 2021 - How Many Data Points is a Prompt Worth? Prompts are very helpful in small data regimes, and are worth 100's of datapoints.

Methods of Avoiding Dataset Bias or Improving Robustness

Adversarial Filtering
- Zellers et al 2018 - SWAG: A Large-Scale Adversarial Dataset for Grounded Commonsense Inference Introduced adversarial filtering
- Zellers et al 2019 - HellaSwag: Can a Machine Really Finish Your Sentence?

Reducing Bias

See Reducing Annotation Artifacts During Dataset Creation

Documentation

Good to document the data with data cards, such as this or this (or the old version)

NLP Wiki

Table of Contents

Dataset Creation

Annotation

Annotation Agreement

Dataset and Data Selection Issues

Data Validation

Crowdsourcing

Alternative Methods

Methods of Faster or Cheaper Annotation

Methods of Avoiding Dataset Bias or Improving Robustness

Reducing Bias

Documentation

Related Pages