Replicating the BootCat method to build web corpora from search engines
This post describes an easy and modern way to gather web sources using search engines by adapting the BootCat method, whose positive and negative aspects are discussed.
Optimizing downloads is crucial when gathering data from a series of websites. However, one should respect “politeness” rules. Here is a simple way to keep an eye on all these constraints at once.
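As a rough illustration of what a "politeness" rule can look like in practice, here is a minimal sketch that remembers when each host was last contacted and reports how long to wait before the next request. The class name `PolitenessTracker`, its methods, and the 5-second default are assumptions made for this example, not the approach taken in the post itself.

```python
import time
from urllib.parse import urlsplit


class PolitenessTracker:
    """Hypothetical helper: remembers when each host was last contacted."""

    def __init__(self, min_delay=5.0):
        # Minimum number of seconds between two requests to the same host
        # (the default is an arbitrary assumption, not a recommendation).
        self.min_delay = min_delay
        self.last_seen = {}

    def wait_time(self, url, now=None):
        """Return how many seconds to wait before contacting the URL's host."""
        now = time.monotonic() if now is None else now
        last = self.last_seen.get(urlsplit(url).netloc)
        if last is None:
            return 0.0
        return max(0.0, self.min_delay - (now - last))

    def record(self, url, now=None):
        """Note that the URL's host has just been contacted."""
        now = time.monotonic() if now is None else now
        self.last_seen[urlsplit(url).netloc] = now
```

A download loop would call `wait_time()` before each request, sleep for the returned duration, and call `record()` afterwards. Hosts that have not been contacted recently can be served immediately, which keeps the overall download fast without hammering any single site.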
Avoid wasting bandwidth and processing time on webpages that are probably not worth the effort. Stay away from pages with little text in the target language, or focus on other pages to gather links.
Why choose between R and Python when you can have both? This tutorial shows how to install a Python scraper and use it for content discovery and text extraction, all straight from R.
Tokenization is a text segmentation process that divides written text into meaningful units. This post introduces two rule-based methods to perform tokenization on German, English and beyond.
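To give an idea of what "rule-based" means here, the following is a minimal sketch built on a single regular expression. It is not one of the two methods from the post. The function name `tokenize` and the pattern are assumptions chosen for illustration: words may contain internal hyphens or apostrophes, and every other non-space character becomes its own token.

```python
import re

# Assumed rule: a word is a run of letters/digits, optionally joined by
# hyphens or apostrophes; any other non-whitespace character stands alone.
TOKEN_PATTERN = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")


def tokenize(text):
    """Split text into word and punctuation tokens with one regex rule."""
    return TOKEN_PATTERN.findall(text)


print(tokenize("Das ist ein Satz, oder?"))
print(tokenize("Don't stop the E-Mail-Import."))
```

Since `\w` is Unicode-aware in Python 3, umlauts and "ß" are handled without extra rules. Abbreviations, URLs and numbers with separators would need additional patterns.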
This post describes practical ways to find recent URLs within a website and to extract text, metadata, and comments. It contains all necessary code snippets to optimize link discovery and document filtering.
Sitemaps are particularly useful for web crawling because they let machines crawl a site more intelligently. The post contains all necessary code snippets to optimize link discovery and filtering and to work with sitemaps on the command line.
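For readers unfamiliar with the format, here is a minimal sketch of link discovery from a sitemap using only the Python standard library. It is a simplified stand-in, not the tooling presented in the post. The function name `extract_sitemap_links` and the `must_contain` filter are assumptions made for this example, and the sample URLs are placeholders.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_sitemap_links(xml_text, must_contain=""):
    """Return the <loc> URLs of a sitemap, optionally filtered by substring."""
    root = ET.fromstring(xml_text)
    links = []
    for loc in root.findall(".//sm:loc", SITEMAP_NS):
        url = (loc.text or "").strip()
        if url and must_contain in url:
            links.append(url)
    return links


sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url><loc>https://example.org/blog/post-1</loc></url>
<url><loc>https://example.org/about</loc></url>
</urlset>"""

print(extract_sitemap_links(sample, must_contain="/blog/"))
```

The same function also collects the entries of a sitemap index file, since those list their child sitemaps in `<loc>` elements too. A real crawler would then fetch and parse each of them in turn.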
Here are two ways to validate XML documents in Python according to the guidelines of the Text Encoding Initiative. The tutorial shows how to parse and validate XML documents, either using a shortcut or detailing each step.
Trafilatura is a Python library designed to download, parse, and scrape web page data. It also offers tools that can easily help with website navigation and extraction of links from sitemaps and feeds.
The date is a critical component of web archives since it is one of the few pieces of metadata that are relevant for philologists and information scientists alike. The Python library htmldate provides a way to extract the creation or modification date of web pages.