nlp:corpus_analysis

Differences

This shows you the differences between two versions of the page.

nlp:corpus_analysis [2022/08/28 00:10] – [Corpus Analysis] jmflanig
nlp:corpus_analysis [2023/06/15 07:36] (current) – external edit 127.0.0.1
Line 1:
 ====== Corpus Analysis ======
-Often considered a linguistics topic, //**corpus analysis**// is the study of language use in a corpus, often analysing the distribution of various phenomena.
+Often considered a linguistics topic, //**corpus analysis**// is the study of language in a corpus, often analyzing the distribution of various phenomena (phonological, lexical, syntactic, etc.). Sometimes the analysis compares across time, languages, or genres.
  
-==== Frequency and Zipf's Law ====
+===== Frequency Distribution and Zipf's Law =====
 [[https://en.wikipedia.org/wiki/Zipf%27s_law|Zipf's law]] describes the frequency distribution of words in language.
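
In its usual form, Zipf's law says that a word's frequency is roughly inversely proportional to its frequency rank, i.e. f(r) ∝ 1/r^s with s close to 1. The sketch below is only an illustration and not part of the page history: it builds a rank-frequency table from a text and estimates s by a least-squares fit in log-log space. The function names, the naive regex tokenizer, and the sample sentence are all assumptions made for the example. A real corpus study would use a proper tokenizer and far more text.

```python
from collections import Counter
import math
import re


def rank_frequency(text):
    """Return (rank, word, count) tuples, most frequent word first."""
    # Naive tokenizer (an assumption): lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    # most_common() sorts by descending count; ties keep first-seen order.
    return [(rank, word, count)
            for rank, (word, count) in enumerate(counts.most_common(), start=1)]


def zipf_exponent(table):
    """Estimate s in f(r) ~ C / r**s as the negated log-log regression slope."""
    xs = [math.log(rank) for rank, _, _ in table]
    ys = [math.log(count) for _, _, count in table]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)  # needs at least two distinct ranks
    return -cov / var


# Hypothetical toy input; far too small to say anything about Zipf's law itself.
sample = "the cat and the dog and the bird saw the cat"
table = rank_frequency(sample)
for rank, word, count in table:
    print(rank, word, count)
print("estimated exponent:", round(zipf_exponent(table), 2))
```

On a large corpus, plotting log(count) against log(rank) from the same table gives the familiar near-straight line. The fitted exponent is sensitive to tokenization and to how the low-frequency tail is handled, so treat it as a rough diagnostic.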
  
 +=== Historical Papers ===
   * [[https://psycnet.apa.org/record/1935-04756-000|Zipf 1935 - The Psycho-Biology of Language (book)]]
-  * [[https://www.jstor.org/stable/1419208#metadata_info_tab_contents|Miller & Newman 1958 - Tests of a Statistical Explanation of the Rank-Frequency Relation for Words in Written English]] A study on the UNIVAC computer 
-  * Miller et al 1959 - Length-frequency statistics for written English, available [[https://www.sciencedirect.com/journal/information-and-control/vol/1/issue/4|here]]. A study of frequency statistics of words using the UNIVAC 
   * [[https://aclanthology.org/1952.earlymt-1.17.pdf|Bull 1952 - Problems of Vocabulary Frequency and Distribution]] An interesting read, from [[https://aclanthology.org/events/earlymt-1952/|here]].
 +  * [[https://www.jstor.org/stable/1419208#metadata_info_tab_contents|Miller & Newman 1958 - Tests of a Statistical Explanation of the Rank-Frequency Relation for Words in Written English]] A study on the UNIVAC computer
 +  * [[https://www.sciencedirect.com/science/article/pii/S0019995858902298|Miller et al 1959 - Length-frequency statistics for written English]], available [[https://www.sciencedirect.com/journal/information-and-control/vol/1/issue/4|here]]. A study of word frequency statistics using the UNIVAC. Discusses types and tokens, and introduces the terms "function words" and "content words" on p. 377 (p. 8 in the pdf).
 +
 +===== Books =====
 +  * [[https://books.google.com/books?id=fzkQPKoFEb0C&pg=PA1|Word Frequency Distributions]], R. Harald Baayen (2002)
 +
 +===== People =====
 +  * [[https://en.wikipedia.org/wiki/George_Armitage_Miller|George Miller]]
 +  * [[https://en.wikipedia.org/wiki/George_Kingsley_Zipf|George Zipf]]
  
nlp/corpus_analysis.1661645452.txt.gz · Last modified: 2023/06/15 07:36 (external edit)
