Bits of Language: corpus linguistics, NLP and text analytics

Replicating the BootCat method to build web corpora from search engines

This post describes a simple, up-to-date way to gather web sources from search engines by adapting the BootCat method, and discusses the method's strengths and weaknesses.
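To give a rough idea of the first step such a method relies on, here is a minimal sketch of how seed terms might be combined into random tuples that serve as search engine queries. It is an illustration under stated assumptions rather than the code used in the post: the function name `build_queries`, its parameters, and the seed words are all hypothetical, and the actual search and download steps are left out.

```python
import itertools
import random


def build_queries(seeds, tuple_size=3, max_queries=10, rng_seed=None):
    """Combine seed terms into random query strings (hypothetical helper).

    Each query is a tuple of distinct seed terms joined by spaces; the
    queries would then be sent to a search engine to collect URLs.
    """
    # Deduplicate and sort so that results are reproducible for a given seed.
    unique_terms = sorted(set(seeds))
    # Enumerate all combinations of the requested size, then shuffle them.
    combos = list(itertools.combinations(unique_terms, tuple_size))
    rng = random.Random(rng_seed)
    rng.shuffle(combos)
    # Keep only as many queries as requested.
    return [" ".join(combo) for combo in combos[:max_queries]]


# Illustrative seed terms, not taken from the post.
seeds = ["corpus", "linguistics", "web", "crawling", "text"]
queries = build_queries(seeds, tuple_size=3, max_queries=5, rng_seed=42)
for query in queries:
    print(query)
```

Enumerating every combination is only practical for small seed lists; with many seeds, sampling tuples directly would be a more reasonable choice.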


How to make language detection with langid.py faster

The language detector langid.py has become quite popular. Using the modernized fork py3langid as an example, I show how to maintain and optimize a Python package.


About Adrien Barbaresi
I'm a research scientist at the Berlin-Brandenburg Academy of Sciences.

Welcome to my academic blog about web corpora, text mining, computational linguistics and digital humanities.


© 2021 Adrien Barbaresi

Content licensed under a Creative Commons Attribution-ShareAlike 4.0 International License, except where indicated otherwise.
