I recently presented a paper at the third LRL Workshop (a joint
LTC-ELRA-FLaReNet-META_NET workshop on “Less Resourced Languages, new
technologies, new challenges and opportunities”).
The state-of-the-art tools of the “web as corpus” framework rely heavily
on URLs obtained from search engines. Recently, this querying process
has become very slow or impossible to perform on a low budget.
Moreover, search engine results carry diverse and partly unknown biases
related to search engine optimization tricks and undocumented PageRank
adjustments. Drawing on diverse sources of URL seeds could at least ensure
that there is not a single bias, but …