A module to extract date information from web pages

Description

Metadata extraction

Diverse content extraction and scraping techniques are routinely used on web document collections by companies and research institutions alike. Being able to better qualify the contents allows for insights based on metadata (e.g. content type, authors or categories), better bandwidth control (e.g. by knowing when webpages have been updated), or optimization of indexing (e.g. language-based heuristics, LRU cache, etc.).

In short, metadata extraction is useful for different kinds of purposes ranging from knowledge extraction and business intelligence to classification and refined visualizations. It is often necessary to fully parse the document or apply robust …
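As a rough illustration of what parsing a document for date metadata can involve, here is a minimal Python sketch that reads date-like values from `<meta>` tags. It is not the module presented in this post: the function name `extract_meta_dates` and the list of attribute names in `DATE_KEYS` are assumptions chosen for the example, and real pages often need many more heuristics.

```python
from html.parser import HTMLParser

# Assumption: a small, non-exhaustive set of meta keys that often carry a date.
DATE_KEYS = {
    "article:published_time",
    "article:modified_time",
    "og:updated_time",
    "date",
    "dc.date",
}


class MetaDateParser(HTMLParser):
    """Collect date-like values found in <meta> tags."""

    def __init__(self):
        super().__init__()
        self.dates = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        # The key may be stored in either "property" or "name".
        key = (attributes.get("property") or attributes.get("name") or "").lower()
        content = attributes.get("content")
        if key in DATE_KEYS and content:
            self.dates[key] = content


def extract_meta_dates(html):
    """Return a dict mapping meta keys to the raw date strings found."""
    parser = MetaDateParser()
    parser.feed(html)
    return parser.dates
```

The values are returned as raw strings; normalizing them to a single date format would be a separate step.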


Guessing if a URL points to a WordPress blog

I am currently working on a project for which I need to identify WordPress blogs as fast as possible, given a list of URLs. I decided to write a review on this topic since I found relevant but sparse hints on how to do it.

First of all, let’s say that guessing if a website uses WordPress by analysing HTML code is straightforward if nothing has been done to hide it, which is almost always the case. As WordPress is one of the most popular content management systems, downloading every page and performing a check afterward is an option …
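A check on downloaded HTML could look like the following Python sketch. The markers used here (a `generator` meta tag mentioning WordPress, and `/wp-content/` or `/wp-includes/` paths) are assumptions based on what default installations usually expose, and the function name `looks_like_wordpress` is made up for the example. A site that hides these traces will not be detected.

```python
import re

# Assumption: default installs advertise themselves in a generator meta tag.
GENERATOR_RE = re.compile(
    r'<meta[^>]+name=["\']generator["\'][^>]+content=["\']WordPress',
    re.IGNORECASE,
)
# Assumption: themes, plugins and scripts are served from these directories.
PATH_RE = re.compile(r'/wp-(?:content|includes)/', re.IGNORECASE)


def looks_like_wordpress(html):
    """Return True if the HTML contains one of the typical WordPress markers."""
    return bool(GENERATOR_RE.search(html) or PATH_RE.search(html))
```

Both patterns are plain string searches, so the check stays cheap compared to the download itself.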


Introducing the Microblog Explorer

The Microblog Explorer project is about gathering URLs from social networks (FriendFeed, identi.ca, and Reddit) to use them as web crawling seeds. At least for the last two, a crawl appears to be manageable in terms of both API accessibility and corpus size, which is not the case for Twitter, for example.
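To make the idea of crawling seeds concrete, here is a minimal Python sketch that pulls URLs out of post texts. It does not reflect the actual Microblog Explorer code: the function name `collect_seeds`, the simple URL pattern, and the choice to keep only one URL per hostname are all assumptions made for the illustration.

```python
import re
from urllib.parse import urlsplit

# Assumption: a deliberately simple pattern, good enough for plain-text posts.
URL_RE = re.compile(r'https?://[^\s<>"\']+')


def collect_seeds(posts):
    """Return one URL per hostname, in order of first appearance."""
    seeds = {}
    for post in posts:
        for url in URL_RE.findall(post):
            # Drop punctuation that commonly trails a URL in running text.
            url = url.rstrip('.,;:!?)')
            host = urlsplit(url).netloc.lower()
            if host and host not in seeds:
                seeds[host] = url
    return list(seeds.values())
```

Keeping a single URL per hostname is one possible way to favor diversity of sources over volume.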

Hypotheses:

  1. These platforms account for a relative diversity of user profiles.
  2. Documents that are most likely to be important are being shared.
  3. It becomes possible to cover languages which are more rarely seen on the Internet, below the English-speaking spammer’s radar.
  4. Microblogging services are …

Data analysis and modeling in R: a crash course

Let’s pretend you recently installed R (software for statistical computing), you have a text collection you would like to analyze or classify, and you have some time to spare. Here are a few quick commands that could get you a little further. I also write this kind of cheat sheet to remember a set of useful tricks and packages I recently gathered and that I thought could help others too.

Letter frequencies

In this example I will use a series of characteristics (or features) extracted from a text collection, more precisely the frequency of each …
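For readers who want to see what such a feature looks like, here is a minimal sketch of computing relative letter frequencies. It is written in Python rather than R and is not taken from the post. The function name `letter_frequencies` and the choices to lowercase the text and ignore non-alphabetic characters are assumptions.

```python
from collections import Counter


def letter_frequencies(text):
    """Return the relative frequency of each alphabetic character in text."""
    # Assumption: case is ignored and digits, spaces and punctuation are dropped.
    letters = [char.lower() for char in text if char.isalpha()]
    total = len(letters)
    if total == 0:
        return {}
    counts = Counter(letters)
    return {letter: count / total for letter, count in counts.items()}
```

Applied to each text of a collection, this yields one row of features per document, which is the kind of table a statistical environment can then work on.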


Ludovic Tanguy on Visual Analysis of Linguistic Data

In his professorial thesis (or habilitation thesis), which is about to be made public (the defence takes place next week), Ludovic Tanguy explains why and under what conditions data visualization could help linguists. In a previous post, I showed a few examples of visualization applied to the field of readability assessment. Tanguy’s question is more general: it concerns what should be included in the disciplinary field of linguistics.

He gives a few reasons to use the methods from the emerging field of visual analytics and mentions some of its proponents (like Daniel Keim or Jean-Daniel Fekete …


On global vs. local visualization of readability

It is not only a matter of scale: the perspective one chooses is crucial when it comes to visualizing how difficult a text is. Two main options can be taken into consideration:

  • An overview in the form of a summary, which makes it possible to compare a series of phenomena for the whole text.
  • A visualization which takes the course of the text into account, as well as the possible evolution of parameters.

I already dealt with the first type of visualization on this blog when I discussed Amazon’s text stats. To sum up, their simplicity is also their main problem: they …


Microsoft to analyze social networks to determine comprehension level

I recently read that Microsoft was planning to analyze several social networks in order to know more about users, so that the search engine could deliver more appropriate results. See this article on geekwire.com: Microsoft idea: Analyze social networks posts to deduce mood, interests, education.

Among the variables that are considered, the ‘sophistication and education level’ of the posts is mentioned. This is highly interesting, because it assumes a double readability assessment, on the reader’s side and on the side of the search engine. More precisely, this could refer to a classification task.

Here is an extract of …


Resource links update

I recently updated the blogroll section and I also would like to share a few links:

As I will be teaching LaTeX soon, the LaTeX links section of the blog has expanded.

Last but not least, here is an e-book, Mining of Massive Datasets, by A. Rajaraman and J. D. Ullman. It grew out of classes taught at Stanford and is now free to use (available chapter …


Three series of recorded lectures

Here is my selection of introductory courses given by well-known specialists in Computer Science or Natural Language Processing and recorded so that they can be followed at home.

1. Artificial Intelligence | Natural Language Processing, Christopher D. Manning, Stanford University.
More than 20 hours, 18 lectures.
Introduction to the key topics of NLP, summary of existing models.
Lecture 12: Dan Jurafsky as a guest lecturer.
Requires the Silverlight plugin (no comment). Transcripts available.

2. Bits, Harry R. Lewis, Harvard University.
A general overview of information as quantity and quantitative methods.
Very comprehensive lecture (data theories, internet protocols, encryption, copyright issues, laws …
