Extracting the main text content from web pages using Python
Web corpus construction involves a significant number of design decisions and turning points in data processing. Depending of the purpose of data collection, it may also require a substantial filtering and quality assessment. While some large-scale algorithms can be expected to smooth out irregularities, uses requiring a low margin of error and close reading approaches (such as the search for examples in lexicographic research) imply constant refinements and improvements with respect to the building and processing of the dataset.
Interest
Because of the vastly increasing variety of corpora, text types and use cases, it becomes more and more difficult to assess the adequacy and quality of certain web data for given research objectives. A central operation in corpus construction consists in retaining the desired content while discarding the rest, a task which has many names referring to peculiar subtasks or to the whole: web scraping, boilerplate removal or boilerplate detection, web page template detection, web page cleaning, or web content extraction – for a recent overview see Lejeune & Zhu (2018).
Recently, approaches using the CommonCrawl have flourished, as they allow for faster download and processing by skipping (or more precisely outsourcing) the crawling phase. While I think that finding one’s …
more ...