Work in progress towards a page listing (web) corpus linguistics references and resources.
Corpus Linguistics and Corpus Building
- The Routledge Handbook of Corpus Linguistics, 1 ed., O’Keeffe, A. and McCarthy, M., Eds., London, New York: Routledge, 2010.
- N. Bubenhofer, Einführung in die Korpuslinguistik: Praktische Grundlagen und Werkzeuge, Zürich, 2009.
- S. Loiseau, “Corpus, quantification et typologie textuelle”, Syntaxe et sémantique, vol. 9, pp. 73-85, 2008.
- C. Draxler, Korpusbasierte Sprachverarbeitung, Günter Narr, 2008.
- M. Cori, “Des méthodes de traitement automatique aux linguistiques fondées sur les corpus”, Langages, vol. 171, iss. 3, pp. 95-110, 2008.
- S. Loiseau, “CorpusReader : un dispositif de codage pour articuler une pluralité d’interprétations”, Corpus, vol. 6, pp. 153-186, 2007.
- M. Hundt, N. Nesselhauf, and C. Biewer, Corpus linguistics and the web, Rodopi, 2007.
- Sprachkorpora – Datenmengen und Erkenntnisfortschritt, Berlin: Walter de Gruyter, 2007.
- G. Rehm, A. Witt, H. Zinsmeister, and J. Dellert, “Corpus masking: Legally bypassing licensing restrictions for the free distribution of text collections”, Digital Humanities, pp. 166-170, 2007.
- T. McEnery, R. Xiao, and Y. Tono, Corpus-Based Language Studies: An advanced resource book, London and New York: Routledge, 2006.
- R. Duffner and A. Näf, “Digitale Textdatenbanken im Vergleich”, Linguistik Online, pp. 7-23, 2006.
- D. Biber, S. Conrad, and R. Reppen, Corpus linguistics – Investigating language structure and use, 5 ed., Cambridge: Cambridge University Press, 2006.
- B. Pincemin, “Introduction”, Corpus, vol. 6, pp. 5-15, 2006.
- C. Weiß, “Die thematische Erschließung von Sprachkorpora”, OPAL — Online publizierte Arbeiten zur Linguistik, iss. 1, 2005.
- L. Bowker and J. Pearson, Working with Specialized Language: A Practical Guide to Using Corpora, London and New York: Routledge, 2002.
- T. McEnery and A. Wilson, Corpus linguistics: an introduction, Edinburgh University Press, 2001.
- S. Wallis and G. Nelson, “Knowledge discovery in grammatically analysed corpora”, Data Mining and Knowledge Discovery, vol. 5, iss. 4, pp. 305-335, 2001.
- B. Habert, “Des corpus représentatifs: de quoi, pour quoi, comment”, Cahiers de l’Université de Perpignan, vol. 31, pp. 11-58, 2000.
- I. Jüttner, “Mannheimer Korpus und Urheberrecht. Die Einbeziehung zeitgenössischer digitalisierter Texte in die computergespeicherten Korpora des IDS und ihre juristischen Grundlagen”, Sprachreport, iss. 3, 2000.
- B. Habert, A. Nazarenko, and A. Salem, Les linguistiques de corpus, Armand Colin, 1997.
- J. Sinclair, “Preliminary recommendations on Corpus Typology”, EAGLES 1996.
- D. Biber, “Representativeness in corpus design”, Literary and linguistic computing, vol. 8, iss. 4, p. 243, 1993.
- Directions in Corpus Linguistics, Berlin, New York: Mouton de Gruyter, 1992.
- D. Biber, Variation across speech and writing, Cambridge University Press, 1988.
Web Corpora
Projects
- ACL SIGWAC
- COW – Corpora from the web (FU Berlin)
- Sketch Engine (proprietary software)
- WaCKy Initiative
Webcrawling
- Niocchi (distributed crawling engine)
- Nutch (Apache project)
- How to crawl a quarter billion webpages in 40 hours
- INFOMINE at UC Riverside (focused crawling, metadata extraction)
- Combine (another focused crawler with metadata extraction)
Document classification
Spam filtering
- Wikipedia regular expressions list
- shallalist.de (free list updated daily)
- Black list at the University of Toulouse (updated regularly)
Misc
For other (possibly dated) resource lists, see: