I have two corpora (say, I crawled two websites) which sometimes focus on the same topics. I would like to try to merge them in order to build a balanced and coherent corpus. As this is a much-discussed research topic, there are plenty of subtle ways to do it.
Still, as I am only at the beginning of my research and don’t know how far I am going to go with both corpora, I want to keep it simple.
One of the appropriate techniques (if not the best)
- As this technical report shows, it can perform well in this kind of case.
- B. Pincombe, Comparison of Human and Latent Semantic Analysis (LSA) Judgements of Pairwise Document Similarities for a News Corpus, Australian Department of Defence, 2004. (Full text available here or through any good search engine; see previous post.)
This could be an issue for later research.