Building a basic specialized crawler

As I have been crawling again in the last few days, I thought it could be helpful to describe how I do it.

Note that this is for educational purposes only (I do not claim to have built the fastest and most reliable crawling engine ever) and that the aim is to crawl specific pages of interest. This implies that I can identify the links I want to follow with regular expressions alone, because I have looked at how a given website is organized.

I see two (or possibly three) steps in the process, which I will go through, giving a few hints in pseudocode.

A shell script

You might want to write a shell script to fire the two main phases automatically and/or to save your results on a regular basis (if something goes wrong after a sizeable number of pages have been explored, you don’t want to lose all the work, even if it’s mainly CPU time and electricity).

A list of links

If the website has an archive, a sitemap or a general list of its contents, you can save time by picking the interesting links once and for all.

going through a shortlist of archives DO {
     fetch page
     find …
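As a minimal sketch of this link-picking step in Python (not the original script): it assumes the interesting links can be recognized by a regular expression applied to their URL. The names `LinkCollector` and `extract_links`, the example.org address and the pattern are hypothetical and only serve as an illustration. Fetching the page is left out so that the sketch runs offline; in practice the HTML would come from an HTTP request.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(html, base_url, pattern):
    """Return the absolute URLs found in `html` that match `pattern`,
    in order of appearance and without duplicates."""
    collector = LinkCollector()
    collector.feed(html)
    wanted = re.compile(pattern)
    seen, links = set(), []
    for href in collector.hrefs:
        url = urljoin(base_url, href)  # resolve relative links
        if wanted.search(url) and url not in seen:
            seen.add(url)
            links.append(url)
    return links


# Hypothetical archive page: two posts (one linked twice) and an "about" page.
sample = (
    '<a href="/2011/05/post-one.html">One</a> '
    '<a href="/about.html">About</a> '
    '<a href="/2011/05/post-one.html">One again</a> '
    '<a href="/2011/06/post-two.html">Two</a>'
)
# Assumed URL scheme: /YYYY/MM/slug.html
links = extract_links(sample, "http://example.org/archive/",
                      r"/\d{4}/\d{2}/[\w-]+\.html$")
```

The resulting list can be written to a file once, so that the second phase only has to read it instead of exploring the archive pages again.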


Workshop on Complexity in Language – Day 2 (report)

I could not follow the whole second day of the Workshop on Complexity in Language (see previous post), but here is what I heard in the morning.

Salikoko Mufwene talked about the emergence of complexity, which he sees as a self-organization process: we don’t plan the way we are going to speak.

He adopts a relativistic perspective, speaking of a multi-agent system and asking whether the agents are really agentive or whether there are triggers of particular behaviors. He likes to consider language as a technology that evolved. At the end of the talk he also tackled the notion of communal complexity and communal patterns used by speakers (also known as norms).

Luc Steels explained his understanding of language complexity and how he simulates communication with robots. He thinks there is an alternative to the evolutionary framework: according to him, grammar is functional and not superficial, and complexity has grown step by step in a cultural evolution rather than a biological one.

His perception of self-organization is based most notably on alignment, structural coupling and linguistic selection. That’s what he builds models for, by letting robots find common words to describe a situation (for example the fact that a given …


Workshop on Complexity in Language - Day 1 (report)

Yesterday I attended the first day of a workshop organized by Salikoko Mufwene and held at the ENS Lyon. This “Workshop on Complexity in Language: Developmental and Evolutionary Perspectives” lasts two days (HTML version of the program).

Here is my personal report on what I heard during the first day and on what I found interesting.

Complexity and complexity science

First of all, William S.-Y. Wang referred to Herbert Simon and Melanie Mitchell in particular to define complexity, two approaches that I described on this blog.

Tom Schoenemann talked about the increasing richness, subtlety and complexity of hominin conceptual understanding, which created a need for syntax and grammar as characteristics resulting from it. In the course of history, brain areas appear less directly connected; they process information more independently. What he calls “conceptual complexity” is based on the idea of “grounded cognition” developed by Lawrence W. Barsalou.

Barbara L. Davis said of complexity science that it was another paradigm. Indeed, most of the debate took place on an abstract level, with many different (and not really compatible) notions of language and complexity. William Croft for instance said the whole context of language needed to be taken into account, and …


Halliday on complexity (1992)

Sometimes you just feel lucky: I was reading the famous article by Charles J. Fillmore, “Corpus linguistics” or “Computer-aided armchair linguistics”, in the proceedings of a Nobel symposium which took place in 1991 (it is known for its introductory descriptions of the armchair linguist and the corpus linguist, who don’t have anything to say to each other) when I decided to read the article that follows it. The title did not seem promising to me, but still, it was written by Halliday:

M.A.K. Halliday, Language as system and language as instance: The corpus as a theoretical construct, pp. 61-77.

The author gives a few insights into the questions one could ask of a given text to find a language model. One of the points has to do with “text dynamics”. Here is how Halliday defines it:

« It is a form of dynamic in which there is (or seems to be) an increase in complexity over time: namely, the tendency for complexity to increase in the course of the text. » (p. 69)

In fact, Halliday develops a very interesting idea from the textual dimension of complexity, also named the “unfolding of the text” (p. 69), its “individuation” or the …


Approaches to philosophy of technology

I gave a presentation last week at the Easterhegg conference in Hamburg. Its aim was to give a few insights into this topic and to introduce a few notions that could explain aspects of hacker culture.

My talk was entitled Denkansätze zur Philosophie der Technik, as it dealt with approaches to philosophy of technology.

I started with a historical description of technology as a given fact that no one calls into question, then I spoke about the contempt for technicians and the difficulty of considering philosophy of technology as a subfield of philosophy.

The main part of my presentation covered a few central themes, such as the critical perspective on technology and the political dimension of technology assessment. I also suggested a typology of tools and instruments/devices based on the work of Gilbert Simondon. Then I briefly described the notion of technoscience.

Finally, I presented a broader idea of technology, including for instance government technologies through apparatuses as described by Michel Foucault and more recently Giorgio Agamben, taking the position paper of the German CSU party as an example.

There is a paper in German regarding this talk that may be found online. Here are the references I used …


Simon, Gell-Mann and Lloyd on complex systems


Herbert A. Simon was one of the first to try to formalize the notion of a complex system:

H. A. Simon, “The Architecture of Complexity”, Proceedings of the American Philosophical Society, vol. 106, iss. 6, pp. 467-482, 1962.

First of all, here is how he defines it:

« Roughly, by a complex system I mean one made up of a large number of parts that interact in a nonsimple way. In such systems, the whole is more than the sum of the parts, not in an ultimate, metaphysical sense, but in the important pragmatic sense that, given the properties of the parts and the laws of their interaction, it is not a trivial matter to infer the properties of the whole. » (pp. 467-468)

According to Simon, the idea of hierarchy (and therefore of architecture) is central.

« By a hierarchic system, or hierarchy, I mean a system that is composed of interrelated subsystems, each of the latter being, in turn, hierarchic in structure until we reach some lowest level of elementary subsystem. » (p. 468)

Nowadays this definition can be considered a keystone of complex systems theory. To find the architecture, the dependencies between the subsystems, how they interact and interface …


Melanie Mitchell: defining and measuring complexity

I just read with particular attention the seventh chapter of Complexity: A Guided Tour, by Melanie Mitchell (Defining and measuring complexity, pages 94 to 111). She works with the Santa Fe Institute, a major institution for research on complex systems, and she gives a convincing overview of this field. Still, I did not read anything on the question of language as a complex adaptive system, although there are researchers who focus on this topic (e.g. in Santa Fe).

According to her, there are different sciences of complexity with different notions of what complexity means. The notion of complexity is itself complex. She chooses to refer to three questions formulated by Seth Lloyd in 2001 to approach the complexity of a system:

  1. How hard is it to describe?
  2. How hard is it to create?
  3. What is its degree of organization?

Then she details a few definitions which can be seen as sides of the problem. Beginning with a selection from a larger list by Seth Lloyd, she tries to explain where or if these approaches are used. Thus, according to her, these are possible definitions of complexity:

  • Size
  • Entropy
  • Algorithmic information content – Murray Gell-Mann speaks of « effective complexity »
  • Logical …

Renate Bartsch on linguistic complexity

I just found a seminal article on complexity written by Renate Bartsch in 1973 (in German). It is a very good summary of the perspective on this topic at the beginning of the ‘70s. The generative grammar background of research on language was starting to be criticized, but it was still a landmark and a framework (most notably the reflexion on surface and deep structure).

R. Bartsch, “Gibt es einen sinnvollen Begriff von linguistischer Komplexität?”, Zeitschrift für Germanistische Linguistik, vol. 1, iss. 1, pp. 6-31, 1973.

Bartsch focuses on three main aspects of the problem to answer this question: does the idea of linguistic complexity make sense?


The framework of transformational grammar alone cannot be trusted when it comes to measuring complexity, because the surface complexity does not account for a potential underlying complexity.
Bartsch quotes the interviews conducted by Labov and his conclusions stating that the dialect difference is to be found on the surface without having anything to do with the logic of a sentence.


This is by far the most interesting part of the article: many criteria for linguistic complexity are analyzed with examples (some in German).
Bartsch also writes about complexity metrics and claims …


Philosophy of technology, how things started: a typology

In my previous post, I presented a few references. I went on reading books and articles on this topic, and I am now able to sort them into several kinds of approaches.

This is mostly thanks to these books in French on philosophy of technology:

  • G. Simondon, L’invention dans les techniques : cours et conférences, Paris: Seuil, 2005.
  • G. Hottois, Philosophies des sciences, philosophies des techniques, Paris: Odile Jacob, 2004.
  • J. Goffi, La philosophie de la technique, Presses Universitaires de France, 1988.
  • G. Hottois, Le signe et la technique : la philosophie à l’épreuve de la technique, Paris: Aubier, 1984.

In his second lesson at the Collège de France (Philosophies des sciences, philosophies des techniques, pp. 94-118), Gilbert Hottois tries to provide a state of the art of philosophy of technology: he describes several traditions and backgrounds. Here is how things started:

  1. A German origin of the reflexion on technology (Ernst Kapp, Friedrich Dessauer), which is mostly analyzed by engineers who shed a new light on this topic and try to think of it as a system. The VDI (Verein Deutscher Ingenieure) continues this tradition. From 1956 onwards, this association organizes a series of meetings entitled Man and Technology which notably sees the question …

Philosophy of technology: a few resources

As I once studied philosophy (back in the classes préparatoires), I like to keep in touch with this kind of reflexion. Moreover, in this research field where everything is moving very fast, it is a way to find a few continuities and to ground the particular questions regarding the analysis of language in a more conceptual framework.

Here is a list of texts available on the Internet (some of them only in part) that seem important to me. Some are written in English, some in French or in German, as I chose the original versions.

It does not claim to be complete! Other references may follow.

  • Denis Diderot wrote the article Art in the Encyclopédie. It is a state of the art introducing the word and its different meanings (which at that time included arts, techniques and technology). Diderot speaks in favor of the techniques developed by craftsmen and gives an account of the ideas of the time about liberal arts, theory and usage.
    The whole text was made available by the ARTFL Encyclopédie Project.

    Les Artisans se sont crus méprisables, parce qu’on les a méprisés; apprenons-leur à mieux penser d’eux-mêmes: c’est le …
