I have two corpora (say, obtained by crawling two websites) which sometimes focus on the same topics. I would like to try to merge them in order to build a balanced and coherent corpus. As this is a widely discussed research topic, there are plenty of subtle ways to do it.
Still, as I am only at the beginning of my research and don't know how far I am going to go with both corpora, I want to keep it simple.
One of the appropriate techniques (if not the best)
As this technical report shows, it can perform well in that kind of case: B. Pincombe, Comparison of Human and Latent Semantic Analysis (LSA) Judgements of Pairwise Document Similarities for a News Corpus, Australian Department of Defence, 2004 (full text available here or through any good search engine; see previous post).
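To make the idea of LSA-based pairwise document similarity more concrete, here is a minimal sketch. It is not the setup used in the report: the function name `lsa_similarities`, the whitespace tokenisation, the raw term counts (no weighting) and the choice of `k` are all assumptions made for illustration, and it relies on NumPy being installed.

```python
import numpy as np


def lsa_similarities(docs, k=2):
    """Pairwise cosine similarities between documents in a k-dimensional LSA space.

    Hypothetical helper for illustration only.
    """
    # Naive whitespace tokenisation (assumption); real corpora need proper preprocessing.
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    index = {tok: i for i, tok in enumerate(vocab)}

    # Term-document count matrix: one row per term, one column per document.
    counts = np.zeros((len(vocab), len(docs)))
    for j, toks in enumerate(tokenized):
        for tok in toks:
            counts[index[tok], j] += 1

    # Truncated SVD: keep only the k largest singular values.
    _, s, vt = np.linalg.svd(counts, full_matrices=False)
    k = min(k, len(s))
    doc_vectors = (np.diag(s[:k]) @ vt[:k]).T  # one row per document

    # Cosine similarity between all pairs of reduced document vectors.
    norms = np.linalg.norm(doc_vectors, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for empty documents
    unit = doc_vectors / norms
    return unit @ unit.T


# Toy documents standing in for texts from the two corpora.
docs = [
    "stock market shares fall",
    "shares fall on stock market",
    "football team wins match",
    "team wins football cup",
]
sim = lsa_similarities(docs, k=2)
print(np.round(sim, 2))
```

Documents from the two sources whose similarity exceeds some threshold could then be treated as covering the same topic; what threshold is sensible would have to be determined on the actual data.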
This could be an issue for later research.
The approach that I am working on (not quick and dirty but simpler and hopefully robust)
For exercise …