Tags for Bits of Language: corpus linguistics, NLP and text analytics

bash 4

Mon 29 July 2013 Batch file conversion to the same encoding on Linux

Tue 09 July 2013 Introducing the Microblog Explorer

Thu 13 October 2011 Find and delete LaTeX temporary files

Mon 18 October 2010 A fast bash pipe for TreeTagger

bibliography 10

Tue 13 November 2012 Blind reason, Leibniz and the age of cybernetics

Sat 24 December 2011 Bibliography and links updates

Tue 22 February 2011 Philosophy of technology, how things started: a typology

Sat 05 February 2011 Philosophy of technology: a few resources

Mon 17 January 2011 Resource links update

Tue 21 December 2010 Three series of recorded lectures

Wed 15 December 2010 On Text Linguistics

Mon 22 November 2010 Commented bibliography on readability assessment

Mon 15 November 2010 Comparison of Features for Automatic Readability Assessment: review

Mon 08 November 2010 A short bibliography on Latent Semantic Analysis and Indexing

canonicalization 1

Wed 10 February 2021 A simple multilingual lemmatizer for Python

chunk parsing 2

Tue 17 December 2013 A one-pass valency-oriented chunker for German

Mon 01 November 2010 Building a topic-specific corpus out of two different corpora

code snippet 17

Fri 05 November 2021 How to download web pages in parallel and follow politeness rules in Python

Thu 21 October 2021 An easy way to save time and resources: content-aware URL filtering

Tue 11 May 2021 Web scraping with R: Text and metadata extraction

Tue 16 February 2021 Using RSS and Atom feeds to collect web pages with Python

Wed 04 December 2019 Validating TEI-XML documents with Python

Fri 13 September 2019 Extracting the main text content from web pages using Python

Fri 15 September 2017 A module to extract date information from web pages

Wed 27 July 2016 Indexing text with ElasticSearch

Tue 22 December 2015 Parsing and converting HTML documents to XML format using Python’s lxml

Sat 04 July 2015 Rule-based URL cleaning for text collections

Wed 27 November 2013 Guessing if a URL points to a WordPress blog

Mon 29 July 2013 Batch file conversion to the same encoding on Linux

Thu 07 February 2013 Recipes for several model fitting techniques in R

Fri 21 December 2012 Data analysis and modeling in R: a crash course

Thu 13 October 2011 Find and delete LaTeX temporary files

Tue 25 January 2011 Binary search to find words in a list: Perl tutorial

Mon 18 October 2010 A fast bash pipe for TreeTagger

complex systems 4

Thu 26 May 2011 Workshop on Complexity in Language – Day 2 (report)

Tue 24 May 2011 Workshop on Complexity in Language – Day 1 (report)

Thu 17 March 2011 Simon, Gell-Mann and Lloyd on complex systems

Tue 08 March 2011 Melanie Mitchell: defining and measuring complexity

computational linguistics 1

Fri 15 October 2010 Resources and links of interest

computer-mediated communication 1

Thu 31 August 2017 On the interest of social media corpora

conference 8

Fri 23 March 2018 Franco-German workshop series on the historical illustrated press

Wed 22 November 2017 On the creation and use of social media resources

Thu 31 August 2017 On the interest of social media corpora

Wed 21 May 2014 Finding viable seed URLs for web corpora

Mon 04 July 2011 A few links on producing posters using LaTeX

Thu 26 May 2011 Workshop on Complexity in Language – Day 2 (report)

Tue 24 May 2011 Workshop on Complexity in Language – Day 1 (report)

Wed 15 December 2010 On Text Linguistics

corpus construction 2

Tue 16 February 2021 Using RSS and Atom feeds to collect web pages with Python

Wed 02 November 2016 Ad hoc and general-purpose corpus construction from web sources

corpus linguistics 21

Mon 13 December 2021 Replicating the BootCat method to build web corpora from search engines

Wed 02 October 2019 Two studies on toponyms in literary texts

Fri 10 June 2016 Collection and indexing of tweets with a geographical focus

Fri 27 November 2015 Analysis of the German Reddit corpus

Mon 06 January 2014 Challenges in web corpus construction for low-resource languages

Fri 25 October 2013 Review of the Czech internet corpus

Mon 29 July 2013 Batch file conversion to the same encoding on Linux

Fri 28 June 2013 What is good enough to become part of a web corpus?

Mon 15 October 2012 Feeding the COW at the FU Berlin

Thu 05 July 2012 Two open-source corpus-builders for German and French

Sun 11 March 2012 2nd release of the German Political Speeches Corpus

Sat 03 March 2012 XML standards for language corpora (review)

Mon 23 January 2012 Canadian research on readability in the ’90s

Wed 18 January 2012 Word lists, word frequency and contextual diversity

Mon 17 October 2011 Parallel work with two taggers

Tue 26 July 2011 Introducing the German Political Speeches Corpus and Visualization Tool

Mon 16 May 2011 Halliday on complexity (1992)

Wed 12 January 2011 Quick review of the Falko Project

Mon 01 November 2010 Building a topic-specific corpus out of two different corpora

Fri 22 October 2010 Collecting academic papers

Fri 15 October 2010 Resources and links of interest

COW 2

Fri 28 June 2013 What is good enough to become part of a web corpus?

Mon 15 October 2012 Feeding the COW at the FU Berlin

css 2

Sun 26 February 2012 Completing web pages on the fly with JavaScript

Mon 05 September 2011 Display long texts with CSS, tutorial and example

cybernetics 1

Tue 13 November 2012 Blind reason, Leibniz and the age of cybernetics

data mining 11

Thu 21 October 2021 An easy way to save time and resources: content-aware URL filtering

Fri 13 September 2019 Extracting the main text content from web pages using Python

Fri 15 September 2017 A module to extract date information from web pages

Wed 27 November 2013 Guessing if a URL points to a WordPress blog

Tue 09 July 2013 Introducing the Microblog Explorer

Fri 21 December 2012 Data analysis and modeling in R: a crash course

Wed 05 September 2012 Ludovic Tanguy on Visual Analysis of Linguistic Data

Mon 02 July 2012 On global vs. local visualization of readability

Thu 26 April 2012 Microsoft to analyze social networks to determine comprehension level

Mon 17 January 2011 Resource links update

Tue 21 December 2010 Three series of recorded lectures

databases 1

Wed 27 July 2016 Indexing text with ElasticSearch

digital humanities 4

Wed 02 October 2019 Two studies on toponyms in literary texts

Fri 23 March 2018 Franco-German workshop series on the historical illustrated press

Thu 12 May 2016 Distant reading and text visualization

Fri 08 April 2016 Foucault and the spatial turn

evaluation 2

Thu 26 March 2020 Evaluation of date extraction tools for Python

Wed 29 January 2020 Evaluating scraping and text extraction tools for Python

French 1

Thu 05 July 2012 Two open-source corpus-builders for German and French

geography 1

Fri 08 April 2016 Foucault and the spatial turn

German 7

Tue 23 February 2021 Using a rule-based tokenizer for German

Tue 17 December 2013 A one-pass valency-oriented chunker for German

Wed 25 July 2012 Review of the readability checker DeLite

Thu 05 July 2012 Two open-source corpus-builders for German and French

Sun 11 March 2012 2nd release of the German Political Speeches Corpus

Sat 24 December 2011 Bibliography and links updates

Wed 12 January 2011 Quick review of the Falko Project

google trends 1

Fri 31 December 2010 Having fun and making money doing research

history 1

Fri 23 March 2018 Franco-German workshop series on the historical illustrated press

htmldate 2

Thu 26 March 2020 Evaluation of date extraction tools for Python

Fri 15 September 2017 A module to extract date information from web pages

javascript 1

Sun 26 February 2012 Completing web pages on the fly with JavaScript

latent semantic indexing 2

Mon 08 November 2010 A short bibliography on Latent Semantic Analysis and Indexing

Mon 01 November 2010 Building a topic-specific corpus out of two different corpora

LaTeX 3

Sat 24 December 2011 Bibliography and links updates

Thu 13 October 2011 Find and delete LaTeX temporary files

Mon 04 July 2011 A few links on producing posters using LaTeX

lemmatization 1

Wed 10 February 2021 A simple multilingual lemmatizer for Python

linguistic complexity 1

Fri 22 October 2010 Bibliography

measurement 5

Fri 13 April 2012 Amazon’s readability statistics by example

Mon 23 January 2012 Canadian research on readability in the ’90s

Tue 13 December 2011 A note on Amazon’s text readability stats

Wed 13 July 2011 About Google Reading Level

Tue 21 June 2011 Lord Kelvin, Bachelard and Dilbert on Measurement

Microsoft 1

Thu 26 April 2012 Microsoft to analyze social networks to determine comprehension level

newspapers 3

Fri 25 October 2013 Review of the Czech internet corpus

Thu 05 July 2012 Two open-source corpus-builders for German and French

Mon 06 June 2011 Crawling a newspaper website to build a corpus

nlp 3

Wed 03 November 2021 Web scraping with Trafilatura just got faster

Tue 23 February 2021 Using a rule-based tokenizer for German

Wed 10 February 2021 A simple multilingual lemmatizer for Python

open source 2

Mon 13 December 2021 Replicating the BootCat method to build web corpora from search engines

Wed 01 December 2021 How to make language detection with langid.py faster

perl 7

Tue 09 July 2013 Introducing the Microblog Explorer

Fri 21 December 2012 Data analysis and modeling in R: a crash course

Thu 05 July 2012 Two open-source corpus-builders for German and French

Sat 24 December 2011 Bibliography and links updates

Tue 26 July 2011 Introducing the German Political Speeches Corpus and Visualization Tool

Mon 06 June 2011 Crawling a newspaper website to build a corpus

Tue 25 January 2011 Binary search to find words in a list: Perl tutorial

PhD 1

Fri 15 October 2010 Resources and links of interest

philosophy 1

Fri 08 April 2016 Foucault and the spatial turn

placenames 1

Wed 02 October 2019 Two studies on toponyms in literary texts

programming tips 1

Wed 01 December 2021 How to make language detection with langid.py faster

psycholinguistics 2

Fri 19 October 2012 A note on Computational Models of Psycholinguistics

Thu 24 May 2012 “Gerolinguistics” and text comprehension

python 12

Mon 13 December 2021 Replicating the BootCat method to build web corpora from search engines

Wed 01 December 2021 How to make language detection with langid.py faster

Fri 05 November 2021 How to download web pages in parallel and follow politeness rules in Python

Wed 03 November 2021 Web scraping with Trafilatura just got faster

Tue 23 February 2021 Using a rule-based tokenizer for German

Wed 10 February 2021 A simple multilingual lemmatizer for Python

Wed 04 December 2019 Validating TEI-XML documents with Python

Fri 13 September 2019 Extracting the main text content from web pages using Python

Fri 15 September 2017 A module to extract date information from web pages

Tue 22 December 2015 Parsing and converting HTML documents to XML format using Python’s lxml

Sat 04 July 2015 Rule-based URL cleaning for text collections

Tue 09 July 2013 Introducing the Microblog Explorer

R 2

Thu 07 February 2013 Recipes for several model fitting techniques in R

Fri 21 December 2012 Data analysis and modeling in R: a crash course

readability assessment 15

Wed 25 July 2012 Review of the readability checker DeLite

Mon 02 July 2012 On global vs. local visualization of readability

Thu 24 May 2012 “Gerolinguistics” and text comprehension

Thu 26 April 2012 Microsoft to analyze social networks to determine comprehension level

Fri 13 April 2012 Amazon’s readability statistics by example

Mon 09 January 2012 Interview with children’s books author Sabine Ludwig

Wed 28 December 2011 Tendencies in research on readability

Tue 13 December 2011 A note on Amazon’s text readability stats

Wed 13 July 2011 About Google Reading Level

Tue 21 June 2011 Lord Kelvin, Bachelard and Dilbert on Measurement

Thu 03 March 2011 Renate Bartsch on linguistic complexity

Mon 06 December 2010 E. Castello, Text Complexity and Reading Comprehension Tests - Reading Notes

Mon 22 November 2010 Commented bibliography on readability assessment

Mon 15 November 2010 Comparison of Features for Automatic Readability Assessment: review

Fri 22 October 2010 Bibliography

references 1

Fri 08 April 2016 Foucault and the spatial turn

research 16

Fri 25 October 2013 Review of the Czech internet corpus

Thu 22 August 2013 Overview of URL analysis and classification methods

Fri 19 October 2012 A note on Computational Models of Psycholinguistics

Mon 15 October 2012 Feeding the COW at the FU Berlin

Wed 05 September 2012 Ludovic Tanguy on Visual Analysis of Linguistic Data

Wed 25 July 2012 Review of the readability checker DeLite

Mon 02 July 2012 On global vs. local visualization of readability

Thu 24 May 2012 “Gerolinguistics” and text comprehension

Thu 26 April 2012 Microsoft to analyze social networks to determine comprehension level

Sat 03 March 2012 XML standards for language corpora (review)

Mon 23 January 2012 Canadian research on readability in the ’90s

Wed 18 January 2012 Word lists, word frequency and contextual diversity

Wed 28 December 2011 Tendencies in research on readability

Tue 26 July 2011 Introducing the German Political Speeches Corpus and Visualization Tool

Tue 21 June 2011 Lord Kelvin, Bachelard and Dilbert on Measurement

Fri 31 December 2010 Having fun and making money doing research

resources 2

Fri 22 October 2010 Bibliography

Fri 15 October 2010 Resources and links of interest

semantic markup 2

Wed 13 July 2011 About Google Reading Level

Mon 29 November 2010 Using and parsing the hCard microformat, an introduction

simplemma 1

Tue 23 February 2021 Using a rule-based tokenizer for German

social media 1

social networks 4

spidering 2

Tue 16 February 2021 Using RSS and Atom feeds to collect web pages with Python

Mon 04 January 2021 Using sitemaps to crawl websites on the command-line

statistics 5

Thu 07 February 2013 Recipes for several model fitting techniques in R

Fri 13 April 2012 Amazon’s readability statistics by example

Wed 18 January 2012 Word lists, word frequency and contextual diversity

Tue 13 December 2011 A note on Amazon’s text readability stats

Fri 31 December 2010 Having fun and making money doing research

stemming 1

Wed 10 February 2021 A simple multilingual lemmatizer for Python

summary 1

Wed 02 November 2016 Ad hoc and general-purpose corpus construction from web sources

tei 1

Wed 04 December 2019 Validating TEI-XML documents with Python

text classification 5

Wed 21 May 2014 Finding viable seed URLs for web corpora

Mon 06 January 2014 Challenges in web corpus construction for low-resource languages

Fri 25 October 2013 Review of the Czech internet corpus

Thu 22 August 2013 Overview of URL analysis and classification methods

Fri 28 June 2013 What is good enough to become part of a web corpus?

text cleaning 12

Wed 26 January 2022 “Googleology is bad science”: Anatomy of a web corpus infrastructure

Wed 03 November 2021 Web scraping with Trafilatura just got faster

Tue 23 February 2021 Using a rule-based tokenizer for German

Thu 26 March 2020 Evaluation of date extraction tools for Python

Wed 29 January 2020 Evaluating scraping and text extraction tools for Python

Wed 04 December 2019 Validating TEI-XML documents with Python

Mon 29 July 2013 Batch file conversion to the same encoding on Linux

Thu 05 July 2012 Two open-source corpus-builders for German and French

Mon 17 October 2011 Parallel work with two taggers

Mon 06 June 2011 Crawling a newspaper website to build a corpus

Sat 04 June 2011 Building a basic specialized crawler

Mon 18 October 2010 A fast bash pipe for TreeTagger

text linguistics 2

Mon 16 May 2011 Halliday on complexity (1992)

Wed 15 December 2010 On Text Linguistics

tokenization 1

Wed 27 July 2016 Indexing text with ElasticSearch

trafilatura 14

Wed 26 January 2022 “Googleology is bad science”: Anatomy of a web corpus infrastructure

Mon 13 December 2021 Replicating the BootCat method to build web corpora from search engines

Fri 05 November 2021 How to download web pages in parallel and follow politeness rules in Python

Wed 03 November 2021 Web scraping with Trafilatura just got faster

Tue 11 May 2021 Web scraping with R: Text and metadata extraction

Tue 16 February 2021 Using RSS and Atom feeds to collect web pages with Python

Mon 04 January 2021 Using sitemaps to crawl websites on the command-line

Mon 14 December 2020 Filtering links to gather texts on the web

Thu 26 March 2020 Evaluation of date extraction tools for Python

Wed 29 January 2020 Evaluating scraping and text extraction tools for Python

Wed 04 December 2019 Validating TEI-XML documents with Python

Fri 13 September 2019 Extracting the main text content from web pages using Python

Fri 15 September 2017 A module to extract date information from web pages

Wed 02 November 2016 Ad hoc and general-purpose corpus construction from web sources

TreeTagger 3

Mon 17 October 2011 Parallel work with two taggers

Wed 12 January 2011 Quick review of the Falko Project

Mon 18 October 2010 A fast bash pipe for TreeTagger

tweets 1

Wed 27 July 2016 Indexing text with ElasticSearch

URLs 5

Thu 21 October 2021 An easy way to save time and resources: content-aware URL filtering

Wed 21 May 2014 Finding viable seed URLs for web corpora

Mon 06 January 2014 Challenges in web corpus construction for low-resource languages

Wed 27 November 2013 Guessing if a URL points to a WordPress blog

Thu 22 August 2013 Overview of URL analysis and classification methods

visualization 9

Wed 02 October 2019 Two studies on toponyms in literary texts

Thu 12 May 2016 Distant reading and text visualization

Fri 27 November 2015 Analysis of the German Reddit corpus

Fri 21 December 2012 Data analysis and modeling in R: a crash course

Wed 05 September 2012 Ludovic Tanguy on Visual Analysis of Linguistic Data

Mon 02 July 2012 On global vs. local visualization of readability

Fri 13 April 2012 Amazon’s readability statistics by example

Sun 11 March 2012 2nd release of the German Political Speeches Corpus

Mon 05 September 2011 Display long texts with CSS, tutorial and example

web corpora 3

Wed 03 November 2021 Web scraping with Trafilatura just got faster

Thu 21 October 2021 An easy way to save time and resources: content-aware URL filtering

Tue 16 February 2021 Using RSS and Atom feeds to collect web pages with Python

web corpus construction 17

Wed 26 January 2022 “Googleology is bad science”: Anatomy of a web corpus infrastructure

Mon 13 December 2021 Replicating the BootCat method to build web corpora from search engines

Mon 04 January 2021 Using sitemaps to crawl websites on the command-line

Mon 14 December 2020 Filtering links to gather texts on the web

Thu 26 March 2020 Evaluation of date extraction tools for Python

Wed 29 January 2020 Evaluating scraping and text extraction tools for Python

Fri 13 September 2019 Extracting the main text content from web pages using Python

Fri 15 September 2017 A module to extract date information from web pages

Thu 31 August 2017 On the interest of social media corpora

Mon 20 June 2016 Bibliography

Fri 10 June 2016 Collection and indexing of tweets with a geographical focus

Fri 27 November 2015 Analysis of the German Reddit corpus

Wed 21 May 2014 Finding viable seed URLs for web corpora

Mon 06 January 2014 Challenges in web corpus construction for low-resource languages

Fri 25 October 2013 Review of the Czech internet corpus

Fri 28 June 2013 What is good enough to become part of a web corpus?

Sat 04 June 2011 Building a basic specialized crawler

web crawling 19

Wed 26 January 2022 “Googleology is bad science”: Anatomy of a web corpus infrastructure

Fri 05 November 2021 How to download web pages in parallel and follow politeness rules in Python

Thu 21 October 2021 An easy way to save time and resources: content-aware URL filtering

Tue 11 May 2021 Web scraping with R: Text and metadata extraction

Mon 04 January 2021 Using sitemaps to crawl websites on the command-line

Wed 02 November 2016 Ad hoc and general-purpose corpus construction from web sources

Mon 20 June 2016 Bibliography

Wed 21 May 2014 Finding viable seed URLs for web corpora

Mon 06 January 2014 Challenges in web corpus construction for low-resource languages

Wed 27 November 2013 Guessing if a URL points to a WordPress blog

Thu 22 August 2013 Overview of URL analysis and classification methods

Tue 09 July 2013 Introducing the Microblog Explorer

Fri 28 June 2013 What is good enough to become part of a web corpus?

Mon 15 October 2012 Feeding the COW at the FU Berlin

Thu 05 July 2012 Two open-source corpus-builders for German and French

Mon 06 June 2011 Crawling a newspaper website to build a corpus

Sat 04 June 2011 Building a basic specialized crawler

Mon 29 November 2010 Using and parsing the hCard microformat, an introduction

Fri 22 October 2010 Collecting academic papers

web scraping 4

Tue 11 May 2021 Web scraping with R: Text and metadata extraction

Mon 14 December 2020 Filtering links to gather texts on the web

Thu 26 March 2020 Evaluation of date extraction tools for Python

Wed 29 January 2020 Evaluating scraping and text extraction tools for Python

wordlist 2

Wed 18 January 2012 Word lists, word frequency and contextual diversity

Sat 24 December 2011 Bibliography and links updates

xml 4

Wed 04 December 2019 Validating TEI-XML documents with Python

Thu 05 July 2012 Two open-source corpus-builders for German and French

Sun 11 March 2012 2nd release of the German Political Speeches Corpus

Sat 03 March 2012 XML standards for language corpora (review)

ZDL 10

Wed 01 December 2021 How to make language detection with langid.py faster

Fri 05 November 2021 How to download web pages in parallel and follow politeness rules in Python

Wed 03 November 2021 Web scraping with Trafilatura just got faster

Tue 16 February 2021 Using RSS and Atom feeds to collect web pages with Python

Wed 10 February 2021 A simple multilingual lemmatizer for Python

Mon 04 January 2021 Using sitemaps to crawl websites on the command-line

Mon 14 December 2020 Filtering links to gather texts on the web

Wed 29 January 2020 Evaluating scraping and text extraction tools for Python

Wed 04 December 2019 Validating TEI-XML documents with Python

Fri 13 September 2019 Extracting the main text content from web pages using Python