I have two corpora (say, from crawling two websites) which sometimes focus on the same topics. I would like to try and merge them in order to build a balanced and coherent corpus. As this is a widely discussed research topic, there are plenty of subtle ways to do it.

Still, as I am only at the beginning of my research and don’t know how far I will go with both corpora, I want to keep it simple.

One of the appropriate techniques (if not the best)

I could do it using LSA (in this context Latent Semantic Analysis, not lysergic acid amide!), or to be more precise Latent Semantic Indexing (LSI).

As this technical report shows, it can perform well in that kind of case: B. Pincombe, Comparison of Human and Latent Semantic Analysis (LSA) Judgements of Pairwise Document Similarities for a News Corpus, Australian Department of Defence, 2004 (full text available here or through any good search engine, see previous post).

This could be an avenue for later research.
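
To give an idea of what that route would look like, here is a minimal LSI sketch using scikit-learn. It is not part of my current toolchain: the documents are placeholders and the number of latent dimensions would have to be tuned on real data.

```python
# Minimal LSI sketch (illustration only): TF-IDF vectors reduced with
# truncated SVD, then pairwise cosine similarity between documents.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# placeholder documents; in practice these would come from the two crawls
documents = [
    "Der Bericht der Bundesregierung erscheint morgen.",
    "Die Bundesregierung stellt morgen einen neuen Bericht vor.",
    "Das Wetter wird am Wochenende besser.",
]

tfidf = TfidfVectorizer()
matrix = tfidf.fit_transform(documents)

# two latent dimensions are enough for this toy example
lsi = TruncatedSVD(n_components=2)
reduced = lsi.fit_transform(matrix)

# documents x documents similarity matrix in the reduced space
print(cosine_similarity(reduced))
```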

The approach that I am working on (not quick and dirty but simpler and hopefully robust)

For exercise purposes (and who knows? maybe it will prove efficient), I am using another approach. Instead of applying corpus-wide statistical methods, I plan to take advantage of local grammatical handling of the sentences, i.e. partial / surface / chunk / shallow parsing.

The two main steps (sketched in code below) are:

  1. a term weighting phase (to find important words in the documents)
  2. a vector space search through the corpus (to find documents that have something in common)
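
To make the plan concrete, here is a deliberately naive sketch of both steps. The term weighting below is plain frequency counting; in the pipeline I have in mind, the weights will come from the phrase heads described in the next section, and the document names are placeholders.

```python
import math
from collections import Counter

def term_weights(terms):
    """Step 1 (simplified): weight the important words of one document.
    Here plain counts are used; the real weighting relies on phrase heads."""
    return Counter(terms)

def cosine(weights_a, weights_b):
    """Step 2: cosine similarity between two sparse term-weight vectors."""
    shared = set(weights_a) & set(weights_b)
    dot = sum(weights_a[t] * weights_b[t] for t in shared)
    norm_a = math.sqrt(sum(w * w for w in weights_a.values()))
    norm_b = math.sqrt(sum(w * w for w in weights_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# toy usage: find documents from both corpora that have something in common
docs = {"a1": ["Korpus", "Analyse", "Korpus"], "b7": ["Analyse", "Webseite"]}
vectors = {name: term_weights(terms) for name, terms in docs.items()}
print(cosine(vectors["a1"], vectors["b7"]))   # about 0.32
```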

Where I am now: the first phase

So far, I have written a finite-state automaton which identifies phrase heads, as these words often carry more significant “weight”.

Taking as input the output of a part-of-speech tagger (the TreeTagger), it goes through its different states and records whether it has found a possible head and/or a possible extension, taking into account that German is rather a head-final language. It works for noun phrases, verb phrases and a few adpositional phrases.
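
To illustrate the mechanics, here is a reduced sketch (not my actual code) of how such an automaton can run over TreeTagger output: it changes state as it reads the STTS tags and, since German noun phrases tend to be head-final, keeps the last noun of a chunk as the candidate head. The tag sets and state names are simplified.

```python
# Reduced sketch of head finding on (token, STTS tag) pairs;
# the real automaton has more states and also covers verb and
# adpositional phrases.
NP_INTERNAL = {"ART", "ADJA", "CARD", "PPOSAT"}   # determiners, adjectives, ...
NOUN = {"NN", "NE"}

def noun_heads(tagged_sentence):
    """Return candidate noun-phrase heads: the last noun of each chunk."""
    heads = []
    state = "OUT"          # OUT = outside a phrase, NP = inside a noun chunk
    candidate = None
    for token, tag in tagged_sentence:
        if tag in NP_INTERNAL:
            state = "NP"                      # possible extension of a phrase
        elif tag in NOUN:
            state = "NP"
            candidate = token                 # head-final: keep the last noun
        else:
            if state == "NP" and candidate:
                heads.append(candidate)       # chunk closed, emit its head
            state, candidate = "OUT", None
    if state == "NP" and candidate:
        heads.append(candidate)
    return heads

# "Der alte Mann liest ein spannendes Buch."
sent = [("Der", "ART"), ("alte", "ADJA"), ("Mann", "NN"),
        ("liest", "VVFIN"), ("ein", "ART"), ("spannendes", "ADJA"),
        ("Buch", "NN"), (".", "$.")]
print(noun_heads(sent))   # ['Mann', 'Buch']
```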

I think this may be enough to feed a second automaton which would take advantage of this information and try to uncover the structure of the sentences, the two automata forming a kind of finite-state cascade (see Steven Abney, Partial parsing via finite-state cascades, in John Carroll (ed.), Workshop on Robust Parsing (ESSLLI ’96), pages 8–15, 1996). But for now this doesn’t seem to be necessary.

The heads that are found several times or at relevant places, such as a title, are stored as tags for a given document, with an indication of their strength/relevance. I am still refining the way these tags are obtained, and I am thinking about moving on to phase 2.
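
As a rough illustration of how such tags could be built (again a simplified sketch rather than my actual code; the boost factor and the threshold are arbitrary):

```python
from collections import Counter

def document_tags(title_heads, body_heads, title_boost=3, min_weight=2):
    """Turn phrase heads into weighted tags for one document.
    Heads occurring several times, or in the title, get a higher weight."""
    weights = Counter(body_heads)
    for head in title_heads:
        weights[head] += title_boost          # heads in the title count more
    # keep only reasonably strong heads, with the weight as relevance score
    return {head: w for head, w in weights.items() if w >= min_weight}

tags = document_tags(["Korpus"], ["Korpus", "Analyse", "Korpus", "Webseite"])
print(tags)   # {'Korpus': 5}; 'Analyse' and 'Webseite' are dropped (weight 1)
```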