Using RSS and Atom feeds to collect web pages with Python

Feeds are a convenient way to get hold of a website’s most recent publications. Used alone or together with text extraction, they can allow for regular updates to databases of web documents. That way, recent documents can be collected and stored for further use. Furthermore, it can be useful to download the portions of a website programmatically to save time and resources, so that a feed-based approach is light-weight and easier to maintain.

This post describes practical ways to find recent URLs within a website and to extract text, metadata, and comments. It contains all necessary code snippets to …

more ...

Validating TEI-XML documents with Python

This post introduces two ways to validate XML documents in Python according the guidelines of the Text Encoding Initiative, using a format commonly known as TEI-XML. The first one takes a shortcut using a library I am working on, while the second one shows an exhaustive way to perform the operation.

Both ground on LXML, an efficient library for processing XML and HTML. The following lines of code will try to parse and validate a document in the same directory as the terminal window or Python console.

Shortcut with the trafilatura library

I am currently using this web scraping …

more ...


A module to extract date information from web pages

Description

Metadata extraction

Diverse content extraction and scraping techniques are routinely used on web document collections by companies and research institutions alike. Being able to better qualify the contents allows for insights based on metadata (e.g. content type, authors or categories), better bandwidth control (e.g. by knowing when webpages have been updated), or optimization of indexing (e.g. language-based heuristics, LRU cache, etc.).

In short, metadata extraction is useful for different kinds of purposes ranging from knowledge extraction and business intelligence to classification and refined visualizations. It is often necessary to fully parse the document or apply robust …

more ...

Indexing text with ElasticSearch

The Lucene-based search engine Elasticsearch is fast and adaptable, so that it suits most demanding configurations, including large text corpora. I use it daily with tweets and began to release the scripts I use to do so. In this post, I give concrete tips for indexation of text and linguistic analysis.

Mapping

You do not need to define a type for the indexed fields, the database can guess it for you, however it speeds up the process and gives more control to use a mapping. The official documentation is extensive and it is sometimes difficult to get a general idea …

more ...

Parsing and converting HTML documents to XML format using Python’s lxml

The Internet is vast and full of different things. There are even tutorials explaining how to convert to or from XML formats using regular expressions. While this may work for very simple steps, as soon as exhaustive conversions and/or quality control is needed, working on a parsed document is the way to go.

In this post, I describe how I work using Python’s lxml module. I take the example of HTML to XML conversion, more specifically XML complying with the guidelines of the Text Encoding Initiative, also known as XML TEI.

Update: I released a Python module that …

more ...

Rule-based URL cleaning for text collections

I would like to introduce the way I clean lists of unknown URLs before going further (e.g. by retrieving the documents). I often use a Python script named clean_urls.py which I made available under a open-source license as a part of the FLUX-toolchain.

The following Python-based regular expressions show how malformed URLs, URLs leading to irrelevant content as well as URLs which obviously lead to adult content and spam can be filtered using a rule-based approach.

Avoid recurrent sites and patterns to save bandwidth

First, it can be useful to make sure that the URL was properly …

more ...

Guessing if a URL points to a WordPress blog

I am currently working on a project for which I need to identify WordPress blogs as fast as possible, given a list of URLs. I decided to write a review on this topic since I found relevant but sparse hints on how to do it.

First of all, let’s say that guessing if a website uses WordPress by analysing HTML code is straightforward if nothing was been done to hide it, which is almost always the case. As WordPress is one of the most popular content management systems, downloading every page and performing a check afterward is an option …

more ...

Batch file conversion to the same encoding on Linux

I recently had to deal with a series of files with different encodings in the same corpus, and I would like to share the solution I found in order to try to convert automatically all the files in a directory to the same encoding (here UTF-8).

file -i

I first tried to write a script in order to detect and correct the encoding, but it was everything but easy, so I decided to use UNIX software instead, assuming these tools would be adequate and robust enough.

I was not disappointed, as file for example gives relevant information when used …

more ...

Recipes for several model fitting techniques in R

As I recently tried several modeling techniques in R, I would like to share some of these, with a focus on linear regression.

Disclaimer: the code lines below work, but I would not suggest that they are the most efficient way to deal with this kind of data (as a matter of fact, all of them score slightly below 80% accuracy on the Kaggle datasets). Moreover, there are not always the most efficient way to implement a given model.

I see it as a way to quickly test several frameworks without going into details.

The column names used in the …

more ...