Building a topic-specific corpus out of two different corpora

I have two corpora (say, from crawling two websites) which sometimes focus on the same topics, and I would like to try to merge them in order to build a balanced and coherent corpus. As this is a widely discussed research topic, there are plenty of subtle ways to do it.

Still, as I am only at the beginning of my research and do not know how far I will go with these corpora, I want to keep it simple.

TRICK

One of the appropriate techniques (if not the best)

I could do it using LSA (in this particular case Latent Semantic Analysis, not Lysergic acid amide!), or to be more precise Latent Semantic Indexing.

As this technical report shows, it can perform well on that kind of task: B. Pincombe, “Comparison of Human and Latent Semantic Analysis (LSA) Judgements of Pairwise Document Similarities for a News Corpus”, Australian Department of Defence, 2004 (full text available here or through any good search engine, see the previous post).
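In a nutshell, and as a sketch of the standard LSA machinery rather than of anything specific to that report: build a weighted term-document matrix X over the pooled documents of both corpora, reduce it with a truncated singular value decomposition, and compare documents by cosine similarity in the reduced space:

    X \approx U_k \Sigma_k V_k^T
    d_j = (\Sigma_k V_k^T)_{\cdot j}    % document j in the k-dimensional latent space
    \mathrm{sim}(d_i, d_j) = \frac{d_i \cdot d_j}{\|d_i\| \, \|d_j\|}

Pairs of documents from different corpora that score high would then point to the overlapping topics.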

This could be a topic for later research.

TRICK

The approach that I am working on (not quick and dirty but simpler and hopefully robust ...

more ...

Collecting academic papers

I would like to build a corpus from a variety of scientific papers of a given field in a given language (German).

The problems of crawling aside, I wonder whether there is a way to do this automatically: all the papers I have read deal with hand-collected corpora.

The Open Archives Initiative protocol might be a good workaround (see the Open Archives Initiative Protocol for Metadata Harvesting, OAI-PMH). As far as I know it is widespread, and there are search engines that look for academic papers and make use of this metadata; a harvesting request looks like the sketch below.
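As a sketch (the repository URL is a placeholder; verb, metadataPrefix and resumptionToken are standard OAI-PMH parameters):

    # ask a repository for its records in Dublin Core format
    curl 'http://repository.example.org/oai?verb=ListRecords&metadataPrefix=oai_dc'

    # responses are paged: resume with the resumptionToken returned
    # at the end of the previous XML response
    curl 'http://repository.example.org/oai?verb=ListRecords&resumptionToken=TOKEN'

The from, until and set parameters can further restrict the harvest to a date range or to a given collection.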

Among the most popular of these engines (Google Scholar, Scirus, OAIster), a few seem to deal with a lot of German texts: Scientific Commons (St. Gallen, CH) and BASE (Bielefeld).

I read an interesting article today about search engines in this particular field: Dirk Pieper and Sebastian Wolf (University Library of Bielefeld), “Wissenschaftliche Dokumente in Suchmaschinen” [Scientific documents in search engines], in Handbuch Internet-Suchmaschinen, D. Lewandowski (ed.), Heidelberg, 2009. PDF version here.

I could crawl the result pages of a given website and see what I get. I’ll see what I can do.

more ...

Bibliography

Here is the beginning of a bibliography generated from my Master’s thesis, converted between different formats, and parked here for further reference.

Complexity and Readability Assessment

Background

Complexity and Linguistic Complexity Theory

  • S. T. Piantadosi, H. Tily, and E. Gibson, “Word lengths are optimized for efficient communication”, Proceedings of the National Academy of Sciences, vol. 108, iss. 9, pp. 3526-3529, 2011.
  • L. Maurits, A. Perfors, and D. Navarro, “Why are some word orders more common than others? A uniform information density account”, in Proceedings of NIPS, 2010.
  • P. Blache, “Un modèle de caractérisation de la complexité syntaxique” [A model for characterizing syntactic complexity], in TALN 2010, Montréal, 2010.
  • T. Givón, The Genesis of Syntactic Complexity: diachrony, ontogeny, neuro-cognition, evolution, Amsterdam, New York: John Benjamins Publishing Co., 2009.
  • M. Mitchell, Complexity: A Guided Tour, Oxford, New York: Oxford University Press, 2009.
  • C. Beckner, N. C. Ellis, R. Blythe, J. Holland, J. Bybee, J. Ke, M. H. Christiansen, D. Larsen-Freeman, W. Croft, and T. Schoenemann, “Language Is a Complex Adaptive System ...
more ...

Why I don’t blog on hypotheses.org and why I might do so (someday…)

People around me at the lab keep talking about a French institutional blog platform named hypotheses.org. It is in fact well known, but no one around me is using it. The platform is still fairly new; according to them, it currently hosts about a hundred blogs.

The main benefits are visibility and durability as it is institutional, well-referenced and competently maintained.

It is what it claims to be; still, I hesitated and finally chose to set up a basic personal website instead, for the following reasons:

  • First, you need to fill out a form to get registered, which is good as a quality label, but I don’t know how long or how often I am going to blog, and I don’t want to request a service I might end up not using.
  • Second, the platform is designed for people who do not want to deal with layout issues: all the pages look much the same apart from background colors and a few images, presumably to maintain a global coherence across the website.
  • Third, it is not that international, nor is it meant to be: most of the articles are in French, and I ...
more ...

A fast bash pipe for TreeTagger

I have been working with TreeTagger, the part-of-speech tagger developed at the IMS Stuttgart, since my Master’s thesis. It performs well on German texts, as one might expect, since that was one of its primary purposes. One major problem is that it is poorly documented, so I would like to share the way I found to pass text to TreeTagger through a pipe.

The first thing to know is that TreeTagger does not take Unicode input, as it dates back to the nineties. So you have to convert whatever you pass to it into ISO-8859-1, which the iconv tool with the TRANSLIT option set does very well; TRANSLIT means “find an approximate equivalent if the character cannot be converted exactly”.
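For example (assuming UTF-8 input):

    # convert UTF-8 text to Latin-1, transliterating characters that
    # have no exact equivalent instead of failing on them
    iconv -f UTF-8 -t ISO-8859-1//TRANSLIT < input.txt > input.lat1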

Then you have to define the options that you want to use; I put the most frequent ones in the example below.
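A minimal invocation might look like this (a sketch: the installation paths and the german.par parameter file depend on your setup; -token prints the token, -lemma the lemma, and -sgml passes SGML tags through untagged):

    # tag Latin-1 input, printing token and lemma for each word
    ~/treetagger/bin/tree-tagger -token -lemma -sgml \
        ~/treetagger/lib/german.par input.lat1 > tagged.txt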

Benefits

The advantage of a pipe is that you can clean the text while passing it to the tagger. Here is one way of doing it, using the stream editor sed to: 1. remove empty (whitespace-only) lines, 2. collapse everything that is more than one space into a single space, and 3. replace spaces with newlines, since TreeTagger expects one token per line (see the sketch below).
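Put together, the pipe might look like this (a sketch with placeholder file names; the \n in the last replacement requires GNU sed, and splitting on spaces alone is a crude tokenization that leaves punctuation attached to words):

    # clean, transliterate and tag in a single pipe
    cat input.txt \
      | iconv -f UTF-8 -t ISO-8859-1//TRANSLIT \
      | sed -e '/^[[:space:]]*$/d' \
            -e 's/[[:space:]]\{2,\}/ /g' \
            -e 's/ /\n/g' \
      | ~/treetagger/bin/tree-tagger -token -lemma -sgml ~/treetagger/lib/german.par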

This way ...

more ...

Resources and links of interest

  1. Linguistics and NLP
  2. Corpus Linguistics
  3. Perl
  4. LaTeX
  5. R
  6. PhD related
  7. Misc.

Archive of links gathered during my PhD thesis.

1 – Linguistics and NLP

General Linguistics

Computational Linguistics

Online Articles and Conferences

Lists of CL Blogs

Resources for German

Computer Science

more ...