====== nlp:dataset_creation ======
  
===== Annotation =====
For annotation tools, see [[nlp:software#annotation_tools|Software - Annotation Tools]]. Building your own annotation tool, with exactly the features you want for your application, can be a worthwhile time investment, since a well-designed tool can greatly speed up annotation.\\
To annotate data manually (without using [[crowdsourcing]]), practitioners generally follow these steps:
  * **Gather data**: Decide on the data source and gather some data to annotate.  Be sure to consider any ethical issues.  If you want to be able to release the data publicly, check for potential copyright or privacy violations.
  * **Decide what to annotate**: Look at a portion of the data and form a rough idea of what you want to annotate -- that is, the phenomena you want to capture and at what granularity.  Write a document describing the preliminary annotation scheme.
  * **Pilot annotation**: Try annotating some data by yourself or with some colleagues using the annotation scheme (annotate the same data).  Compare annotations and decide on edge cases (decide what to do with the difficult borderline cases).  Decide if you want to simplify or extend the annotation scheme.
  * **Refine annotation scheme (iterative)**: Refine your annotation scheme until you're happy with it and it is easy to annotate.  This may take several rounds of pilot annotation.
  * **Compute inter-annotator agreement**: Make sure to doubly annotate a subset of the data so you can compute inter-annotator agreement.
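The last step can be sketched in a few lines: hold out a random subset of the data for double annotation and give the rest to a single annotator. The function name and the 10% overlap fraction below are illustrative choices, not a fixed recipe:

```python
import random

def split_for_double_annotation(examples, overlap_fraction=0.1, seed=0):
    """Pick a random subset to be annotated by two annotators,
    so inter-annotator agreement can be computed on it."""
    rng = random.Random(seed)
    n_overlap = max(1, int(len(examples) * overlap_fraction))
    overlap = rng.sample(examples, n_overlap)
    overlap_set = set(overlap)
    single = [ex for ex in examples if ex not in overlap_set]
    return overlap, single

# Example: out of 100 items, 10 go to both annotators, 90 to one.
overlap, single = split_for_double_annotation(list(range(100)), overlap_fraction=0.1)
```

Fixing the random seed makes the split reproducible, which matters if the annotation batches are generated more than once.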
==== Annotation Agreement ====
  * Software
    * R's [[https://cran.r-project.org/web/packages/irr/index.html|Inter-Annotator Reliability Package]] (irr) is great. [[https://cran.r-project.org/web/packages/irr/irr.pdf|docs]] [[https://www.andywills.info/rminr/irr.html|example]]
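For two annotators, one common agreement statistic is Cohen's kappa (the irr package computes this among others). It is simple enough to compute directly; a minimal pure-Python sketch:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items.
    Undefined (division by zero) in the degenerate case where
    expected agreement is 1, i.e. every item gets one label."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items where the annotators match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under chance, from each annotator's label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)
```

Perfect agreement gives kappa = 1.0, while agreement at exactly chance level gives 0.0; what counts as "good" agreement depends on the task and is debated in the literature.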

==== Building Your Own Annotation Tool ====
  * For simple projects, annotation can be done in a spreadsheet.
  * When building your own annotation tool, here are some things to consider:
    * The purpose of the tool is to make annotation faster.  Think carefully about what interface will be fastest for trained annotators.
    * To speed up development, use whatever language and API you are familiar with or find easiest.
    * Think very carefully about ways to reduce unnecessary mouse clicks, typing, reading text, etc.  Every mouse click counts.  Aggressively remove anything that is unnecessary, like pressing Escape or Enter to save.  Instead, save automatically when moving to the next example.
    * Plan on doing some iterations on the tool.  You will need to try it and change it based on your experience.
    * It doesn't need to be perfect, it just needs to be fast to use.  It's OK to have bugs in the annotation tool if it's not widely used and they don't slow down annotation.
    * Don't make it full-featured.  You just need the features that make annotation fast.
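The auto-save advice above can be made concrete with a minimal sketch (the function names and CSV output format are illustrative, not from any particular tool): each example needs exactly one label action, and the label is persisted the moment the annotator moves on, so there is no separate save step to forget or to press keys for.

```python
import csv

def annotate(examples, get_label, out_path):
    """Minimal annotation loop with auto-save.
    `get_label` is any callable returning a label for the given text
    (e.g. a single-keypress prompt in a real terminal tool).  Each label
    is written to disk immediately on moving to the next example."""
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["text", "label"])
        for text in examples:
            label = get_label(text)         # one keypress in a real tool
            writer.writerow([text, label])  # auto-save: persist right away
            f.flush()

# Usage with a stand-in labeling function instead of interactive input:
annotate(["great movie", "awful plot"],
         lambda t: "pos" if "great" in t else "neg",
         "labels.csv")
```

Flushing after every row means a crash mid-session loses at most the current example, which removes another reason to add an explicit save feature.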
  
===== Dataset and Data Selection Issues =====
  * [[https://arxiv.org/pdf/1306.2091.pdf|Schneider et al 2013 - A Framework for (Under)specifying Dependency Syntax without Overloading Annotators]]
  * [[https://arxiv.org/pdf/1605.07723.pdf|Ratner et al 2016 - Data Programming: Creating Large Training Sets, Quickly]]
  * See also [[Prompting]] and [[https://arxiv.org/pdf/2103.08493.pdf|Scao & Rush 2021 - How Many Data Points is a Prompt Worth?]] Prompts are very helpful in small-data regimes, and can be worth hundreds of data points.
  
=== Methods of Avoiding Dataset Bias or Improving Robustness ===
  
===== Related Pages =====
  * [[nlp:software#Annotation Tools]]
  * [[Bias]]
  * [[Crowdsourcing]]
  * [[ml:Data Cleaning and Validation]]
  * [[Data Augmentation]]
  * [[Data Preparation]]
  * [[Ethics]]
  * [[Robustness in NLP]]
  
nlp/dataset_creation.1658425458.txt.gz · Last modified: 2023/06/15 07:36 (external edit)
