Feeds are a convenient way to get hold of a website’s most recent publications. Used alone or together with text extraction, they allow for regular updates to databases of web documents, so that recent documents can be collected and stored for further use. Since downloading only the relevant portions of a website programmatically saves time and resources, a feed-based approach is light-weight and easy to maintain.

This post describes practical ways to find recent URLs within a website and to extract text, metadata, and comments. It contains all necessary code snippets to optimize link discovery and document filtering.

Getting started

Interest of feeds

A feed is a file that generally lists the new documents published on a website or a particular website section, the main goal being to inform readers that news is available and to tell machines where to look for content. This process is also called web syndication, meaning a form of syndication in which content is made available from one website to other sites.

Most commonly, feeds are made available to provide either summaries or full renditions of a website’s recently added content. The term may also describe other kinds of content licensing for reuse. The kinds of content delivered by a web feed are typically HTML (webpage content) or links to webpages and other kinds of digital media. Many news websites, weblogs, schools, and podcasters operate web feeds. The feed icon is commonly used to indicate that a web feed is available.

The machine-based retrieval and download of documents within a website is called web crawling or web spidering. Web crawlers usually discover pages from links within the site and from other sites, following a series of rules and protocols. Feeds supplement this data to allow crawlers that support feeds to pick up all URLs in the feeds and learn about those URLs using the associated metadata.

Web (or news) feeds use standardized data formats to provide users with frequently updated content. The two main formats are Atom and RSS.
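To make the idea concrete, here is a minimal sketch of what such a feed looks like and how its links can be read with the Python standard library alone. The RSS document below is a trivial, made-up example; real feeds carry much more metadata per item.

```python
import xml.etree.ElementTree as ET

# A minimal (hypothetical) RSS 2.0 feed listing two recent posts
rss = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example site</title>
    <item><title>Post 1</title><link>https://www.example.com/post-1</link></item>
    <item><title>Post 2</title><link>https://www.example.com/post-2</link></item>
  </channel>
</rss>"""

# Parse the feed and collect the link of each item
root = ET.fromstring(rss)
links = [item.findtext("link") for item in root.iter("item")]
print(links)  # ['https://www.example.com/post-1', 'https://www.example.com/post-2']
```

In practice, a dedicated tool handles the many variants of Atom and RSS as well as malformed feeds, which is exactly what the package described below does.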

All in all, feeds are a way to discover content more intelligently. This is particularly true when there is a chance of overlooking some new or recently updated content, for example because some areas of a website are not reachable through the browsable interface, or because the website has a huge number of pages that are isolated or poorly interlinked (a typical example: a news outlet).

The extraction tool Trafilatura

The tool for web text extraction I am working on can process a list of URLs and find the main text along with useful metadata. Trafilatura is a Python package and command-line tool which seamlessly downloads, parses, and scrapes web page data: it can extract metadata, main body text and comments while preserving parts of the text formatting and page structure.

In addition, trafilatura includes support for multilingual and multinational sitemaps. For example, a site can target English language users through links like http://www.example.com/en/… and German language users through http://www.example.com/de/….
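The kind of heuristic at work here can be sketched in a few lines: inspect the first path segment of a URL and treat it as a language code if it looks like one. The function below is a simplified illustration, not part of trafilatura’s API.

```python
from urllib.parse import urlsplit

def path_language(url):
    "Return a two-letter language code if the first path segment looks like one."
    segments = [s for s in urlsplit(url).path.split("/") if s]
    if segments and len(segments[0]) == 2 and segments[0].isalpha():
        return segments[0].lower()
    return None

print(path_language("http://www.example.com/en/page"))   # en
print(path_language("http://www.example.com/de/seite"))  # de
print(path_language("http://www.example.com/page"))      # None
```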

With its focus on straightforward, easy extraction, this tool can be used straight from the command-line. In all, trafilatura should make it much easier to list links from sitemaps and also download and process them if required. It is the recommended method to perform the task described here.

If this command sounds familiar, you can install the package directly with it: pip3 install trafilatura. Otherwise, please refer to the tutorial on installing the trafilatura tool.

Trafilatura supports XML-based feeds with the two common formats Atom and RSS, as well as feed discovery. The tool can start from a homepage or a specific feed URL. Here are the steps needed:

  1. Grab a website’s homepage or feed
    • The software package will try to discover potential feeds automatically
    • Establishing a list of targets allows for mass parallel processing
  2. Retrieve a list
    • Let the software list the URLs found in the feeds
    • Download and process the discovered web pages

In the following, I show how to gather links and process them, both with Python and on the command-line.

Straightforward link discovery, listing and processing

With Python

Python can be easy to pick up whether you’re a first time programmer or you’re experienced with other languages. See this list of tutorials.

Gathering links

The function find_feed_urls is an all-in-one utility that attempts to discover feeds from a homepage where necessary, then downloads and parses the feeds it finds. It returns the extracted links as a sorted list of unique URLs.

>>> from trafilatura import feeds
>>> mylist = feeds.find_feed_urls('https://www.theguardian.com/')
# https://www.theguardian.com/international/rss has been found
>>> len(mylist)
74 # can change over time
>>> mylist
['https://www.theguardian.com/...', '...' ...]
# use a feed URL directly
>>> mylist = feeds.find_feed_urls('https://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml')
>>> len(mylist) > 0
True # it's not empty, great!

An optional argument target_lang makes it possible to filter links according to their expected target language. A series of heuristics are applied on the link path and parameters to try to discard unwanted URLs, thus saving processing time and download bandwidth.

>>> from trafilatura import feeds
>>> mylist = feeds.find_feed_urls('https://www.un.org/en/rss.xml', target_lang='en')
>>> len(mylist) > 0
True # links found as expected
>>> mylist = feeds.find_feed_urls('https://www.un.org/en/rss.xml', target_lang='ja')
>>> mylist
[] # target_lang set to Japanese, the English links were discarded this time

Processing extracted URLs

The links can then get processed as usual with trafilatura:

  • The fetch_url function fetches a web page and decodes the response. It returns the page content if the download succeeded, and None otherwise.
  • The extract function acts as a wrapper for text extraction and conversion to chosen output format (defaults to txt).

The following snippet demonstrates how to download and process a web page:

>>> from trafilatura import fetch_url, extract
>>> downloaded = fetch_url('https://github.blog/2019-03-29-leader-spotlight-erin-spiceland/')
>>> extract(downloaded)
# outputs main content and comments as plain text ...
>>> extract(downloaded, xml_output=True, include_comments=False)
# outputs main content without comments as XML ...

For more information see core functions documentation.
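The link gathering and extraction steps above can be combined into a small pipeline. To keep this sketch self-contained, the download and extraction callables are passed in as arguments; with trafilatura installed, fetch_url and extract would be plugged in directly. This wiring is an illustration of the workflow, not part of trafilatura itself.

```python
def process_feed_urls(urls, fetch, extract):
    "Download each URL and extract its text, skipping failed downloads."
    documents = {}
    for url in urls:
        downloaded = fetch(url)
        if downloaded is not None:
            documents[url] = extract(downloaded)
    return documents

# With trafilatura installed, this would read:
# from trafilatura import fetch_url, extract, feeds
# mylist = feeds.find_feed_urls('https://www.example.com/')
# docs = process_feed_urls(mylist, fetch_url, extract)
```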

On the command-line

The following instructions use the command-line interface (CLI). This mode notably allows for mass parallel processing, where trafilatura takes care of all necessary settings. For more information, see the documentation of trafilatura’s command-line usage.

Main features

  • Links can be gathered straight from the homepage (using heuristics) or using a particular URL if it is already known
  • The --list option is useful to list URLs prior to processing
  • Link discovery can start from an input file (-i) containing a list of sources, which will then be processed in parallel

Examples

The following examples return lists of links. If --list is absent, the pages that have been found are directly retrieved, processed, and returned in the chosen output format (by default: TXT on standard output).

# looking for feeds
$ trafilatura --feed "https://www.dwds.de/" --list
# already known feed
$ trafilatura --feed "https://www.dwds.de/api/feed/themenglossar/Corona" --list

Websites can also be examined in parallel by providing a list as input, here named mylist.txt. The list has to contain a series of URLs, one per line.

# parallel retrieval of feed links from a list of websites
$ trafilatura -i mylist.txt --feed --list
# same operation, targeting webpages in German
$ trafilatura -i mylist.txt --feed --list --target-language "de"

More on content discovery

See more about content discovery in this tutorial: Gathering a custom web corpus, which shows how a comprehensive overview of the available documents can be obtained faster and more efficiently.