Documents are Chunked for Text Analysis

In information retrieval projects, which try to return the best answers for user queries, each individual document in a corpus is usually treated as a unit.

In text analysis in the digital humanities, however, it is more useful to parse documents into much smaller units or chunks for processing, which allows us to work at the level of passages.

In the Chymistry LSA project, we have parsed the documents into chunks of approximately 250 words and chunks of approximately 1000 words.
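
As a rough illustration of what chunking by word count can look like, here is a minimal Python sketch. It is not the project's actual chunking code, which is not described here. The function name chunk_words, the whitespace tokenization, and the placeholder document are all assumptions made for the example.

```python
def chunk_words(text, chunk_size=250):
    """Split text into consecutive chunks of at most chunk_size words.

    Illustrative sketch only: words are found by splitting on
    whitespace, which is an assumption, not the project's method.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be a positive integer")
    words = text.split()
    return [
        " ".join(words[start:start + chunk_size])
        for start in range(0, len(words), chunk_size)
    ]


# Hypothetical document made of 600 placeholder words.
document = " ".join(f"word{i}" for i in range(600))

# The same document chunked at the two sizes mentioned above.
small_chunks = chunk_words(document, chunk_size=250)
large_chunks = chunk_words(document, chunk_size=1000)
```

In this sketch the last chunk simply holds whatever words remain, so it may be shorter than the others. A real chunker might instead break at sentence or paragraph boundaries, which is one reason chunk sizes would be only approximate.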