nlp:dataset_creation
Differences
This shows you the differences between two versions of the page.
| nlp:dataset_creation [2022/07/21 17:44] – [Annotation Agreement] jmflanig | nlp:dataset_creation [2023/12/10 06:18] (current) – [Building Your own Annotation Tool] jmflanig | ||
|---|---|---|---|
| Line 2: | Line 2: | ||
| ===== Annotation ===== | ===== Annotation ===== | ||
| - | For annotation tools, see [[nlp: | + | For annotation tools, see [[nlp: |
| To annotate data manually (without using [[crowdsourcing]]), | To annotate data manually (without using [[crowdsourcing]]), | ||
| * **Gather data**: Decide the data source and gather some data to annotate. | * **Gather data**: Decide the data source and gather some data to annotate. | ||
| * **Decide what to annotate**: Look at a portion of the data and form a rough idea of what you want to annotate, that is, the phenomena you want to capture and at what granularity. | * **Decide what to annotate**: Look at a portion of the data and form a rough idea of what you want to annotate, that is, the phenomena you want to capture and at what granularity. | ||
| - | * **Pilot annotation**: | + | * **Pilot annotation**: |
| * **Refine annotation scheme (iterative)**: | * **Refine annotation scheme (iterative)**: | ||
| * **Compute inter-annotator agreement**: | * **Compute inter-annotator agreement**: | ||
| Line 20: | Line 20: | ||
| * Software | * Software | ||
| * R's [[https:// | * R's [[https:// | ||
| + | |||
| + | |||
| + | ==== Building Your own Annotation Tool ==== | ||
| + | * For simple projects, annotation can be done in a spreadsheet | ||
| + | * When building your own annotation tool, here are some things to consider | ||
| + | * The purpose of the tool is to make the annotation faster. | ||
| + | * To speed up development, | ||
| + | * Think very carefully about ways to reduce unnecessary mouse clicks, typing, reading text, etc. Every mouse click counts. | ||
| + | * Plan to iterate on the tool. You will need to try it and change it based on your experience. | ||
| + | * It doesn' | ||
| + | * Don't make it full-featured. | ||
| ===== Dataset and Data Selection Issues ===== | ===== Dataset and Data Selection Issues ===== | ||
| Line 37: | Line 48: | ||
| * [[https:// | * [[https:// | ||
| * [[https:// | * [[https:// | ||
| - | * See also [[Prompting and Task Descriptions|Prompting]] and [[https:// | + | * See also [[Prompting]] and [[https:// |
| === Methods of Avoiding Dataset Bias or Improving Robustness === | === Methods of Avoiding Dataset Bias or Improving Robustness === | ||
| Line 51: | Line 62: | ||
| ===== Related Pages ===== | ===== Related Pages ===== | ||
| + | * [[nlp: | ||
| * [[Bias]] | * [[Bias]] | ||
| * [[Crowdsourcing]] | * [[Crowdsourcing]] | ||
| * [[ml:Data Cleaning and Validation]] | * [[ml:Data Cleaning and Validation]] | ||
| + | * [[Data Augmentation]] | ||
| * [[Data Preparation]] | * [[Data Preparation]] | ||
| * [[Ethics]] | * [[Ethics]] | ||
| * [[Robustness in NLP]] | * [[Robustness in NLP]] | ||
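
The **Compute inter-annotator agreement** step in the Annotation section can be illustrated with a small sketch. This is not taken from the page itself: it assumes two annotators, categorical labels, and Cohen's kappa as the agreement measure (other measures may suit other setups). The function name `cohen_kappa` and the example labels are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)

    # Observed agreement: fraction of items given identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected chance agreement, from each annotator's label distribution.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    p_e = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if p_e == 1.0:
        # Both annotators used one identical label for every item.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical pilot annotations (illustrative data only).
annotator_1 = ["POS", "NEG", "POS", "POS", "NEG", "NEG"]
annotator_2 = ["POS", "NEG", "NEG", "POS", "NEG", "POS"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.3f}")
```

A low value after the pilot round is a signal to revisit the annotation scheme before annotating more data.
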
nlp/dataset_creation.1658425458.txt.gz · Last modified: 2023/06/15 07:36 (external edit)