Scientific criteria for Web corpora

While the pervasiveness of digital communication is undeniable, the numerous traces left by users and customers are mostly collected and exploited for commercial purposes. The creation of digital research objects should provide the scientific community with ways to access and analyze them. In linguistics in particular, the diversity and quantity of texts on the Internet have to be better assessed in order to make corpora of current texts available, allowing for the description of the variety of language uses and of ongoing changes. In addition, transferring the field of analysis from traditional written corpora to texts taken from the web results in the creation of new tools and new observables. We must therefore provide the theoretical and practical background needed to establish scientific criteria for research on these texts.

This is the subject of my PhD work, which was carried out under the supervision of Benoît Habert and led to a thesis entitled Ad hoc and general-purpose corpus construction from web sources, defended on June 19th, 2015 at the École Normale Supérieure de Lyon to obtain the degree of Doctor of Philosophy in linguistics.

Methodological considerations

The first chapter opens by introducing the interdisciplinary setting between linguistics, corpus linguistics, and computational linguistics. The notion of corpus is then put into focus, and existing definitions of corpus and text are discussed across these disciplines. Several milestones of corpus design are presented, from pre-digital corpora at the end of the 1950s to web corpora in the 2000s and 2010s. The continuities and changes between the linguistic tradition and web-native corpora are described, as well as the invisible layer of corpora and linguistic tools: a series of often implicit decisions and instruments that ought to be better documented, verified, and reproduced.

In this regard, the second chapter offers methodological insights into automated text scrutiny in computer science, computational linguistics, and natural language processing. The state of the art on text quality assessment and web text filtering exemplifies current interdisciplinary research trends on web texts. Readability studies and automated text classification serve as exemplary methods for finding salient features in order to grasp text characteristics. Text visualization exemplifies corpus processing within the digital humanities framework. In conclusion, guiding principles for research practice are listed, and reasons are given for finding a balance between quantitative analysis and corpus linguistics in an environment shaped by technological innovation and artificial intelligence techniques.
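To illustrate the kind of salient features that readability studies and text classification typically rely on, the following minimal sketch computes a few surface features of a text. It is not the feature set used in the thesis, only an assumed example with a naive sentence and word splitter:

```python
import re

def surface_features(text):
    """Compute simple surface features of the kind used in readability studies
    (illustrative only: naive splitting, no linguistic preprocessing)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[\w'-]+", text)
    if not sentences or not words:
        return {"avg_sentence_len": 0.0, "avg_word_len": 0.0, "type_token_ratio": 0.0}
    return {
        # average number of words per sentence
        "avg_sentence_len": len(words) / len(sentences),
        # average word length in characters
        "avg_word_len": sum(len(w) for w in words) / len(words),
        # lexical diversity: distinct lowercased words over total words
        "type_token_ratio": len({w.lower() for w in words}) / len(words),
    }

feats = surface_features("The cat sat. The cat slept.")
```

Such features can then be fed to a classifier that separates, for instance, running text from boilerplate or machine-generated pages.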


Third, current research on web corpora is summarized. Concepts from web science are presented from the perspective of searching for and downloading the data which, once transformed, will constitute actual corpora. I distinguish two main approaches to web document retrieval: restricted retrieval and web crawling. The notion of web corpus preprocessing is introduced and its salient steps are discussed. Preprocessing may lead to a better understanding of the data as well as to greater efficiency, and its impact on research results is therefore assessed. I explain why the importance of preprocessing should not be underestimated, and why it is important for linguists to acquire new skills in order to take on the whole data gathering and preprocessing phase. These elements justify the contributions of the thesis.
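The distinction between restricted retrieval and open-ended crawling can be sketched as follows. This is a hypothetical, much simplified breadth-first crawler (no politeness delays, robots.txt handling, or deduplication, all of which a real crawler needs); restricting the set of allowed hosts turns it into restricted retrieval, while leaving it open approximates web crawling:

```python
import re
from collections import deque
from urllib.parse import urljoin, urlparse

def crawl(seeds, fetch, max_pages=100, allowed_hosts=None):
    """Breadth-first crawl starting from seed URLs.
    `fetch` is a callable returning the HTML of a URL (or None on failure),
    injected so the logic stays independent of any HTTP library."""
    frontier = deque(seeds)
    seen, pages = set(seeds), {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        # naive link extraction; a real crawler would use an HTML parser
        for link in re.findall(r'href="([^"]+)"', html):
            absolute = urljoin(url, link)
            if allowed_hosts and urlparse(absolute).netloc not in allowed_hosts:
                continue  # restricted retrieval: stay within the given sites
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return pages
```

Injecting `fetch` also makes the traversal logic testable with a fake, in-memory "web".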

I present my work on web corpus construction in the fourth chapter. Two types of corpora are distinguished: on the one hand, restricted (ad hoc) corpora, which focus on a given genre or website and transpose “traditional” issues such as the choice of sources or the quality of texts into the web era; on the other hand, general-purpose corpora, heirs of reference corpora, which are meant to contain the largest possible variety of linguistic phenomena and thereby imply other choices and adjustments. My analyses concern two main aspects: first, the question of corpus sources (prequalification), and second, the problem of including valid, desirable documents in a corpus (document qualification).

My experiments focus on the scrutiny of sources and the detection of invalid or unwanted documents. A multilingual approach using a lightweight scout to prepare web crawling is presented. The work on document selection prior to inclusion in a corpus is then summarized: contributions from readability studies as well as machine learning techniques can be put to use during corpus construction. The efficiency of the process is evaluated on a set of annotated web page samples in English. The corpora built during the thesis, which target the study of German, are introduced. Last, I present work on corpus visualization, which consists of extracting certain corpus characteristics in order to give indications of corpus contents and quality.
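Document qualification in the thesis combines readability cues with machine learning; as a rough illustration of the idea, here is a hypothetical heuristic filter that rejects pages with too little running text or too high a link density, two common symptoms of navigation and index pages. The thresholds are assumptions for the sake of the example, not values from the thesis:

```python
import re

def qualify(html, min_words=250, max_link_density=0.5):
    """Toy document qualification: accept a page only if it contains enough
    running text and is not dominated by link anchor text.
    Thresholds are illustrative, not empirically calibrated."""
    # words appearing inside <a>...</a> elements (naive regex extraction)
    link_text = " ".join(re.findall(r"<a\b[^>]*>(.*?)</a>", html, re.S))
    link_words = re.findall(r"\w+", re.sub(r"<[^>]+>", " ", link_text))
    # all words after stripping markup
    words = re.findall(r"\w+", re.sub(r"<[^>]+>", " ", html))
    if len(words) < min_words:
        return False  # too short to be a useful corpus document
    return len(link_words) / len(words) <= max_link_density
```

In a realistic pipeline, decisions of this kind would come from a trained classifier over many such features rather than from two hard-coded thresholds.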

Focusing on the actual construction processes leads to better corpus design, beyond mere collections of heterogeneous resources. In the digital age, invisible and unexamined features of technology ought to be better known by the research community in order to meet scientific requirements and allow for an informed return to philology and text studies.