====== Corpus Analysis ======

Often considered a linguistics topic, //**corpus analysis**// is the study of language in a corpus, typically by analyzing the distribution of various phenomena (phonological, lexical, syntactic, etc.). The analysis is sometimes comparative, looking across time periods, languages, or genres.

===== Frequency Distribution and Zipf's Law =====

[[https://en.wikipedia.org/wiki/Zipf%27s_law|Zipf's law]] describes the frequency distribution of words in language: a word's frequency is roughly inversely proportional to its rank in the frequency table.

=== Historical Papers ===

  * [[https://psycnet.apa.org/record/1935-04756-000|Zipf 1935 - The Psycho-Biology of Language (book)]]
  * [[https://aclanthology.org/1952.earlymt-1.17.pdf|Bull 1952 - Problems of Vocabulary Frequency and Distribution]] An interesting read, from the [[https://aclanthology.org/events/earlymt-1952/|ACL Anthology's 1952 early MT collection]].
  * [[https://www.jstor.org/stable/1419208#metadata_info_tab_contents|Miller & Newman 1958 - Tests of a Statistical Explanation of the Rank-Frequency Relation for Words in Written English]] A study carried out on the UNIVAC computer.
  * [[https://www.sciencedirect.com/science/article/pii/S0019995858902298|Miller et al 1958 - Length-frequency statistics for written English]], in [[https://www.sciencedirect.com/journal/information-and-control/vol/1/issue/4|Information and Control 1(4)]]. A study of word frequency statistics using the UNIVAC. Discusses types and tokens. Introduces the terms "function words" and "content words" on p. 377 (p. 8 in the pdf).

===== Books =====

  * [[https://books.google.com/books?id=fzkQPKoFEb0C&pg=PA1|Word Frequency Distributions]], R. Harald Baayen (2002)

===== People =====

  * [[https://en.wikipedia.org/wiki/George_Armitage_Miller|George Miller]]
  * [[https://en.wikipedia.org/wiki/George_Kingsley_Zipf|George Zipf]]
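
===== Example: Ranks, Types, and Tokens =====

The rank-frequency relation and the type/token distinction mentioned in the Zipf's law section can be illustrated with a minimal Python sketch. It is not taken from any of the papers listed on this page. The function names, the tokenizing regular expression, and the sample sentence are all assumptions chosen for illustration, and real corpus work would need a more careful tokenizer.

```python
import re
from collections import Counter

# Assumption: a token is a run of lowercase letters or apostrophes.
TOKEN_PATTERN = re.compile(r"[a-z']+")


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return TOKEN_PATTERN.findall(text.lower())


def rank_frequency(text):
    """Return (rank, word, count) tuples, most frequent word first.

    Ties in count are broken alphabetically so the output is deterministic.
    """
    counts = Counter(tokenize(text))
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, word, count)
            for rank, (word, count) in enumerate(ordered, start=1)]


def type_token_counts(text):
    """Return (number of types, number of tokens).

    Tokens are all word occurrences; types are the distinct words.
    """
    tokens = tokenize(text)
    return len(set(tokens)), len(tokens)


# Hypothetical sample text, far too small to show Zipf's law itself.
sample = "the cat sat on the mat and the dog sat on the log"
table = rank_frequency(sample)

for rank, word, count in table:
    # Under Zipf's law, rank * count stays roughly constant in a large corpus.
    print(rank, word, count, rank * count)
```

On a large corpus, plotting the logarithm of the count against the logarithm of the rank should give a roughly straight line if the text follows Zipf's law. The sample sentence here only demonstrates the bookkeeping.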