All Categories for Bits of Language: corpus linguistics, NLP and text analytics

Code 8

Wed 01 December 2021 How to make language detection with langid.py faster

Wed 10 February 2021 A simple multilingual lemmatizer for Python

Thu 26 March 2020 Evaluation of date extraction tools for Python

Wed 29 January 2020 Evaluating scraping and text extraction tools for Python

Wed 27 July 2016 Indexing text with ElasticSearch

Tue 22 December 2015 Parsing and converting HTML documents to XML format using Python’s lxml

Wed 27 November 2013 Guessing if a URL points to a WordPress blog

Thu 13 October 2011 Find and delete LaTeX temporary files

Code, Publications 1

Tue 17 December 2013 A one-pass valency-oriented chunker for German

Code, Software 1

Thu 05 July 2012 Two open-source corpus-builders for German and French

Code, Tutorial 2

Mon 14 December 2020 Filtering links to gather texts on the web

Mon 29 July 2013 Batch file conversion to the same encoding on Linux

Complexity & Readability 21

Fri 19 October 2012 A note on Computational Models of Psycholinguistics

Wed 25 July 2012 Review of the readability checker DeLite

Mon 02 July 2012 On global vs. local visualization of readability

Thu 24 May 2012 “Gerolinguistics” and text comprehension

Thu 26 April 2012 Microsoft to analyze social networks to determine comprehension level

Fri 13 April 2012 Amazon’s readability statistics by example

Mon 23 January 2012 Canadian research on readability in the ’90s

Wed 18 January 2012 Word lists, word frequency and contextual diversity

Mon 09 January 2012 Interview with children’s books author Sabine Ludwig

Wed 28 December 2011 Tendencies in research on readability

Tue 13 December 2011 A note on Amazon’s text readability stats

Wed 13 July 2011 About Google Reading Level

Thu 26 May 2011 Workshop on Complexity in Language – Day 2 (report)

Tue 24 May 2011 Workshop on Complexity in Language – Day 1 (report)

Mon 16 May 2011 Halliday on complexity (1992)

Thu 17 March 2011 Simon, Gell-Mann and Lloyd on complex systems

Tue 08 March 2011 Melanie Mitchell: defining and measuring complexity

Thu 03 March 2011 Renate Bartsch on linguistic complexity

Mon 06 December 2010 E. Castello, Text Complexity and Reading Comprehension Tests - Reading Notes

Mon 22 November 2010 Commented bibliography on readability assessment

Mon 15 November 2010 Comparison of Features for Automatic Readability Assessment: review

Corpora 8

Wed 02 November 2016 Ad hoc and general-purpose corpus construction from web sources

Mon 20 June 2016 Bibliography

Fri 10 June 2016 Collection and indexing of tweets with a geographical focus

Fri 27 November 2015 Analysis of the German Reddit corpus

Fri 25 October 2013 Review of the Czech internet corpus

Sun 11 March 2012 2nd release of the German Political Speeches Corpus

Tue 26 July 2011 Introducing the German Political Speeches Corpus and Visualization Tool

Wed 15 December 2010 On Text Linguistics

Corpora, Other 1

Mon 15 October 2012 Feeding the COW at the FU Berlin

Corpora, Publications 3

Thu 31 August 2017 On the interest of social media corpora

Mon 06 January 2014 Challenges in web corpus construction for low-resource languages

Fri 28 June 2013 What is good enough to become part of a web corpus?

Corpus Linguistics 1

Wed 26 January 2022 “Googleology is bad science”: Anatomy of a web corpus infrastructure

Digital Humanities 3

Wed 02 October 2019 Two studies on toponyms in literary texts

Thu 12 May 2016 Distant reading and text visualization

Fri 08 April 2016 Foucault and the spatial turn

Ideas 3

Mon 17 October 2011 Parallel work with two taggers

Mon 01 November 2010 Building a topic-specific corpus out of two different corpora

Fri 22 October 2010 Collecting academic papers

Links 5

Sat 24 December 2011 Bibliography and links updates

Mon 10 October 2011 Selected recent discoveries

Mon 04 July 2011 A few links on producing posters using LaTeX

Mon 17 January 2011 Resource links update

Tue 21 December 2010 Three series of recorded lectures

Misc 3

Wed 22 November 2017 On the creation and use of social media resources

Fri 22 October 2010 Bibliography

Fri 15 October 2010 Resources and links of interest

Miscellaneous 4

Sat 31 December 2011 My contribution to the Anglicism of the Year award

Fri 31 December 2010 Having fun and making money doing research

Mon 29 November 2010 Using and parsing the hCard microformat, an introduction

Thu 21 October 2010 Why I don’t blog on hypotheses.org and why I might do so (someday…)

Miscellaneous, Digital Humanities 2

Fri 23 March 2018 Franco-German workshop series on the historical illustrated press

Mon 08 November 2010 A short bibliography on Latent Semantic Analysis and Indexing

News 1

Wed 03 November 2021 Web scraping with Trafilatura just got faster

Other 4

Thu 22 August 2013 Overview of URL analysis and classification methods

Wed 05 September 2012 Ludovic Tanguy on Visual Analysis of Linguistic Data

Sat 03 March 2012 XML standards for language corpora (review)

Wed 12 January 2011 Quick review of the Falko Project

Philosophy of Technology 4

Tue 21 June 2011 Lord Kelvin, Bachelard and Dilbert on Measurement

Thu 28 April 2011 Approaches to philosophy of technology

Tue 22 February 2011 Philosophy of technology, how things started: a typology

Sat 05 February 2011 Philosophy of technology: a few resources

Philosophy of Technology, Digital Humanities 1

Tue 13 November 2012 Blind reason, Leibniz and the age of cybernetics

Publications 1

Wed 21 May 2014 Finding viable seed URLs for web corpora

Software 1

Tue 09 July 2013 Introducing the Microblog Explorer

Tutorial 19

Mon 13 December 2021 Replicating the BootCat method to build web corpora from search engines

Fri 05 November 2021 How to download web pages in parallel and follow politeness rules in Python

Thu 21 October 2021 An easy way to save time and resources: content-aware URL filtering

Tue 11 May 2021 Web scraping with R: Text and metadata extraction

Tue 23 February 2021 Using a rule-based tokenizer for German

Tue 16 February 2021 Using RSS and Atom feeds to collect web pages with Python

Mon 04 January 2021 Using sitemaps to crawl websites on the command-line

Wed 04 December 2019 Validating TEI-XML documents with Python

Fri 13 September 2019 Extracting the main text content from web pages using Python

Fri 15 September 2017 A module to extract date information from web pages

Sat 04 July 2015 Rule-based URL cleaning for text collections

Thu 07 February 2013 Recipes for several model fitting techniques in R

Fri 21 December 2012 Data analysis and modeling in R: a crash course

Sun 26 February 2012 Completing web pages on the fly with JavaScript

Mon 05 September 2011 Display long texts with CSS, tutorial and example

Mon 06 June 2011 Crawling a newspaper website to build a corpus

Sat 04 June 2011 Building a basic specialized crawler

Tue 25 January 2011 Binary search to find words in a list: Perl tutorial

Mon 18 October 2010 A fast bash pipe for TreeTagger