About Latent Semantic Analysis

Latent semantic analysis (LSA) is a computational method that allows researchers to analyze the relationships between all terms and documents in a corpus.

You can find a brief introduction to LSA in the Wikipedia entry for Latent semantic analysis.

LSA calculates correlations between documents, correlations between terms, and correlations between terms and documents. The method begins with a term-document matrix: each row represents one term in the corpus, each column represents one document, and each cell records how often that term occurs in that document.
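As a rough illustration of the term-document matrix described above, the following Python sketch builds one from three tiny made-up documents. The function name term_document_matrix, the whitespace tokenization, and the sample documents are assumptions made for this example; they are not taken from the Newton LSA component.

```python
from collections import Counter

def term_document_matrix(docs):
    """Return (terms, matrix), where matrix[i][j] is the number of
    times terms[i] occurs in docs[j]. Tokenization is a naive
    lowercase whitespace split, chosen only for illustration."""
    counts = [Counter(doc.lower().split()) for doc in docs]
    # One row per distinct term in the corpus, in alphabetical order.
    terms = sorted(set().union(*counts))
    # One column per document; a Counter returns 0 for absent terms.
    matrix = [[c[t] for c in counts] for t in terms]
    return terms, matrix

# Hypothetical three-document corpus.
docs = ["gravity attracts bodies", "light bends light", "gravity bends light"]
terms, matrix = term_document_matrix(docs)

for term, row in zip(terms, matrix):
    print(f"{term:10s} {row}")
```

A real corpus would need proper tokenization and usually some weighting of the raw counts, which this sketch leaves out.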

LSA uses singular value decomposition to produce a vector for each term and for each document. Each vector is an n-dimensional description of a single term or a single document. LSA then applies rank reduction, which simplifies the calculations and exposes the most important dimensions: those that capture the latent semantic structure of the corpus.
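The decomposition and rank reduction can be sketched with NumPy's numpy.linalg.svd. This is a minimal sketch under stated assumptions, not the implementation used here: the function name lsa_vectors, the small example matrix, the choice of k = 2, and the convention of scaling the vectors by the singular values are all choices made for illustration.

```python
import numpy as np

def lsa_vectors(A, k):
    """Decompose a term-document matrix A (terms x documents) and
    keep only the k largest singular values (rank reduction)."""
    A = np.asarray(A, dtype=float)
    # Thin SVD: A = U @ diag(s) @ Vt, with s sorted in descending order.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Assumed convention: scale each retained dimension by its singular value.
    term_vectors = U[:, :k] * s[:k]      # one k-dimensional row per term
    doc_vectors = Vt[:k, :].T * s[:k]    # one k-dimensional row per document
    return term_vectors, doc_vectors, s

# Hypothetical 5-term x 3-document count matrix.
A = [[1, 0, 0],
     [0, 1, 1],
     [1, 0, 0],
     [1, 0, 1],
     [0, 2, 1]]

term_vecs, doc_vecs, singular_values = lsa_vectors(A, k=2)
print(term_vecs.shape, doc_vecs.shape)
```

Keeping fewer dimensions than the full rank of the matrix discards the smallest singular values, which is what lets related terms and documents end up close together.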

We use the term vectors and document vectors to calculate the correlations as cosine similarities: the result gives the cosine of the angle between two term vectors, between two document vectors, or between a term vector and a document vector. The closer the cosine is to 1.0, the greater the similarity between two documents or the greater the semantic relationship between two terms. As the cosine approaches zero, the relationship weakens until there is effectively none.
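The cosine measure itself is simple to compute. The sketch below is a plain Python illustration; the function name cosine_similarity and the example vectors are hypothetical, and returning 0.0 for a zero-length vector is an assumption made so that the function never divides by zero.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Assumed convention: a zero vector is unrelated to everything.
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical 2-dimensional vectors.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
perpendicular = cosine_similarity([1.0, 0.0], [0.0, 3.0])
print(same_direction, perpendicular)
```

Vectors that point in the same direction give a cosine at or very near 1.0 regardless of their lengths, and perpendicular vectors give 0.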

The Newton LSA component stores all significant term-term, document-document, and term-document correlations in a database and provides a suite of built-in query tools for investigating those relationships. There are several million correlations, and the tools allow us to make both broad and focused queries.

The LSA component also supports user queries based on a choice of terms or a choice of documents.