Corpus Analysis
Corpus analysis, often considered a topic in linguistics, is the study of language as it appears in a corpus. It typically examines the distribution of various phenomena (phonological, lexical, syntactic, etc.). The analysis is sometimes comparative, contrasting time periods, languages, or genres.
Frequency Distribution and Zipf's Law
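Zipf's law describes the frequency distribution of words in language: a word's frequency is roughly inversely proportional to its rank in the frequency table, so the second most frequent word occurs about half as often as the most frequent one.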
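A rank-frequency relation like this can be checked on any text by counting words, ranking them by count, and fitting a line to log(count) against log(rank). The following is a minimal sketch, not a reference implementation: the function names `rank_frequency` and `zipf_exponent`, the lowercase regex tokenizer, and the plain least-squares fit are all assumptions made for illustration. For text that follows Zipf's law, the estimated exponent should come out near 1.

```python
import math
import re
from collections import Counter


def rank_frequency(text):
    """Return (rank, word, count) tuples, most frequent word first.

    Tokenization is a deliberately crude assumption: lowercase the text
    and keep runs of letters and apostrophes.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    return [(rank, word, count)
            for rank, (word, count) in enumerate(counts.most_common(), start=1)]


def zipf_exponent(table):
    """Estimate the exponent s in count ~ C / rank**s.

    Fits a least-squares line to log(count) versus log(rank) and returns
    the negated slope. Needs at least two distinct word types.
    """
    xs = [math.log(rank) for rank, _, _ in table]
    ys = [math.log(count) for _, _, count in table]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx


if __name__ == "__main__":
    # Hypothetical toy input; replace with the contents of a real corpus file.
    sample = "the cat and the dog and the bird saw the cat"
    table = rank_frequency(sample)
    for rank, word, count in table:
        print(rank, word, count)
    print("estimated exponent:", zipf_exponent(table))
```

On a real corpus the fit is usually restricted to a range of ranks, since the very rarest words (those seen only once or twice) tend to pull a simple least-squares line away from the trend of the more frequent words.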
Historical Papers
- Bull 1952 - Problems of Vocabulary Frequency and Distribution. An interesting read, from here.
- Miller & Newman 1958 - Tests of a Statistical Explanation of the Rank-Frequency Relation for Words in Written English. A study on the UNIVAC computer.
- Miller et al 1959 - Length-frequency statistics for written English, available here. A study of frequency statistics of words using the UNIVAC. Talks about types and tokens. Introduces the terms “function words” and “content words” on p. 377 (p. 8 in the pdf).
Books
- Word Frequency Distributions, Harald Baayen (2002)
People
nlp/corpus_analysis.txt · Last modified: 2023/06/15 07:36 by 127.0.0.1