HTML to text extraction just got faster with the dedicated Trafilatura software. That means faster execution times, as measured on the benchmark available in the repository, not only for plain HTML-to-text conversion but for all possible output formats: CSV, JSON, Markdown and XML. The change follows from two major updates recently introduced in the package dependencies, in which I was involved.
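To make this concrete, here is a minimal sketch of a typical extraction call, assuming the documented fetch_url and extract functions; the URL is only illustrative and the actual timings depend on the page:

```python
# Minimal sketch of the usual extraction workflow with trafilatura;
# the URL below is just a placeholder.
from trafilatura import fetch_url, extract

downloaded = fetch_url("https://example.org/article.html")  # raw HTML as a string
if downloaded is not None:
    txt = extract(downloaded)                        # plain text (default)
    xml = extract(downloaded, output_format="xml")   # structured XML output
    jsn = extract(downloaded, output_format="json")  # JSON including metadata
```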
charset_normalizer
First of all, the character encoding recognizer has been replaced. The chardet library is pretty standard in the Python universe, but there is a new kid on the block named charset_normalizer, which adopts a different approach based on the following characteristics:
- Character unigrams to detect which language or language family the input should be attributed to
- Character sets capturing various characteristics the target language has to feature, for example accented characters
- Heuristics focusing on the amount of noise contained in the input
The latter is a novelty introduced by charset_normalizer, which makes it more robust, especially if the input text differs from expectations. I know this for a fact since I contributed to the code and filed a few issues, so I am familiar with the internals of this new package, which is already, I would say, quite a success for its author, Ahmed R. Tahri.
In our use case, the document whose encoding is to be identified is a web page in HTML format. Measuring the amount of noise allows for more flexibility instead of trying to correct everything, croaking on the first error, or returning an encoding guess which causes errors down the line. While the changes in the charset_normalizer library itself led to speed-ups of a few percentage points, its adoption in trafilatura made everything faster.
Detection of text encoding is a critical component since encodings declared at the top of HTML documents cannot systematically be trusted, and since these documents indeed generate noise by embedding various elements in a single web page, following the Web 2.0 paradigm where content is dynamically injected from multiple sources. Overall, the web scraping and text extraction parts now run approximately 25 to 30 percent faster. For further information, see here to follow charset_normalizer releases and here for the series of changes mentioned above.
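For illustration, here is a small sketch of what noise-aware detection looks like with charset_normalizer's documented from_bytes interface; the input bytes are a toy example, not taken from the trafilatura code:

```python
# Sketch of encoding detection with charset_normalizer; the byte string
# is a constructed example of a page with an undeclared legacy encoding.
from charset_normalizer import from_bytes

raw = "Il était une fois un navigateur qui déclarait le mauvais charset.".encode("cp1252")
results = from_bytes(raw)   # ranked candidate encodings, noise-aware
best = results.best()       # candidate with the least noise/incoherence
if best is not None:
    print(best.encoding)    # e.g. "cp1252" or an equivalent alias
    print(str(best))        # the decoded text
```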
jusText
Secondly, changes have been made to one of the packages running in the background as a fallback for text extraction. jusText implements a generic algorithm developed by Jan Pomikálek during his PhD thesis in Brno (Czech Republic). After a series of reworks targeting precision and execution speed, the package has recently been updated to version 3.0, and a few bugs have been fixed along the way.
I started collaborating on this package as well, since I have been using it for some time and wanted to see whether its performance could be improved; indeed, the changes proved beneficial. Click here for further information on the latest release and here for a series of changes I have been involved in.
As a matter of fact, parts of the code were run repeatedly without it being necessary. By focusing on logical optimizations of a few parts of the code, it was possible to improve it considerably. There were two main ideas behind this, illustrated in the sketch after the list:
- Moving code out of loops, for instance by using package-level variables
- Benefiting from LRU function caching which is now standard in all common versions of Python.
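The following sketch shows both ideas in generic form rather than reproducing the actual jusText code: a pattern compiled once at module level instead of inside a loop, and functools.lru_cache memoizing a pure helper that is called repeatedly with the same inputs.

```python
# Generic illustration of the two optimizations (not the jusText internals):
# hoist work out of loops via module-level variables, and cache pure functions.
import re
from functools import lru_cache

MULTIPLE_SPACES = re.compile(r"\s+")  # compiled once at import time


@lru_cache(maxsize=1024)
def normalize_whitespace(text: str) -> str:
    """Collapse runs of whitespace; repeated inputs hit the cache."""
    return MULTIPLE_SPACES.sub(" ", text).strip()


def process(paragraphs):
    # the compiled regex and the cached helper are reused across iterations
    return [normalize_whitespace(p) for p in paragraphs]
```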
I think we were able to reach a much faster and more stable code base. On my benchmark it is now about 2.5 times faster than before! See here for a list of changes.
Since further processing is only triggered when the default extraction does not work properly, the changes are less visible than in the first case. Indeed, justext is currently used as a second fallback after readability-lxml, mostly if the output appears to be too short or too noisy, for example if too many embedded elements remain. In the end, it is not that much of a change, but it is still quite noticeable: I would say that the changes and the subsequent update (version 0.9.3 of trafilatura) result in 10 to 20 percent faster code.
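As an illustration, toggling the fallbacks could look like the following sketch, assuming the no_fallback flag as documented for trafilatura versions around 0.9.x; the HTML snippet is a placeholder:

```python
# Hedged sketch: controlling the fallback extractors in trafilatura.
from trafilatura import extract

html = "<html><body><p>Some short sample paragraph.</p></body></html>"

# default: fallback algorithms (readability-lxml, then jusText) may be
# tried when the main extraction looks too short or too noisy
with_fallbacks = extract(html)

# skip the fallbacks entirely for maximum speed, at the cost of robustness
fast_only = extract(html, no_fallback=True)
```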
Further work
There is still ongoing work on the libraries used by Trafilatura. Whether readability-lxml can be further updated or improved remains to be seen; at best, this involves a community effort throughout the Python package ecosystem, in particular the libraries of interest to me. See below for a synoptic view of Trafilatura's ecosystem:
There are a few things to be changed, but it also depends on the underlying libraries and on cooperation with the package maintainers, whom I would like to thank here for their efforts, and who do not always have the time to fully dedicate themselves to the packages. In certain cases, they wrote them some time ago before moving on to something else. Maintenance and stability over time remains a challenge in the open-source community.
We will see what the future brings. In any case, concordant signs indicate that Python keeps getting more and more popular: it topped the TIOBE index for the first time last month, so there is a fair chance these software packages or comparable alternatives will evolve positively. This is yet another reason to invest some time in making code bases easier to adapt and maintain.