====== Language Identification ======

===== Overviews =====
  * [[https://arxiv.org/pdf/1804.08186.pdf|Jauhiainen et al 2018 - Automatic Language Identification in Texts: A Survey]]

===== Methods and Papers =====
  * [[https://www.aclweb.org/anthology/P12-3005.pdf|Lui & Baldwin 2012 - langid.py: An Off-the-shelf Language Identification Tool]]
  * [[https://arxiv.org/pdf/1909.12940.pdf|Palakodety et al 2020 - Hope Speech Detection: A Computational Analysis of the Voice of Peace]] Clustering based on polyglot word embeddings is an easy method for unsupervised language detection (see section 5.1).
  * [[https://www.aclweb.org/anthology/2020.wnut-1.24.pdf|Palakodety & KhudaBukhsh 2020 - Annotation Efficient Language Identification from Weak Labels]]

===== Software =====
Comparison [[https://towardsdatascience.com/benchmarking-language-detection-for-nlp-8250ea8b67c|here]].
  * [[https://fasttext.cc/blog/2017/10/02/blog-post.html|FastText Language ID]]
  * [[https://cloud.google.com/translate/docs/basic/detecting-language|GoogleLangID]]
  * [[https://github.com/saffsd/langid.py|langid.py]] [[https://www.aclweb.org/anthology/P12-3005.pdf|paper]]
  * [[https://pypi.org/project/langdetect/|langdetect]]
  * [[https://spacy.io/universe/project/spacy-langdetect|spaCy langdetect]]

===== Related Pages =====
  * [[Code Switching]]
  * [[Data Preparation]]