I recently presented a paper at the third LRL Workshop (a joint LTC-ELRA-FLaReNet-META_NET workshop on “Less Resourced Languages, new technologies, new challenges and opportunities”).
Motivation
The state-of-the-art tools of the “web as corpus” framework rely heavily
on URLs obtained from search engines. Recently, this querying process
has become very slow or impossible to perform on a low budget.
Moreover, search results carry diverse and partly unknown biases related
to search engine optimization tricks and undocumented PageRank
adjustments. Drawing URL seeds from diverse sources could at least
ensure that there is not a single bias, but …