A common message found on the inside tab of many popular cereal boxes might well be appended to many of the search engines available on the World-Wide Web. As with your favorite cereal, you assume that the contents at the bottom are never as good as those at the top. Getting the right content or context of information is certainly an important goal for any information retrieval (IR) system.
Several approaches to retrieving textual information depend on a lexical match between words in users' requests and keywords or indices assigned to documents. However, due to the tremendous diversity in the words used by authors and readers, such lexical-matching methods are necessarily incomplete and imprecise. Recently, vector space IR models based on matrix decompositions from linear algebra have been used to estimate the implicit higher-order semantic structure or association of terms (or keywords) with documents. Using low-rank approximations to large sparse "term-by-document" matrices, both terms and documents can be encoded for concept-matching with users' queries in high-dimensional vector spaces. When based upon the singular value decomposition (SVD), this approach is commonly referred to as Latent Semantic Indexing (LSI), since the subspaces spanned by the approximate singular vectors represent important associative relationships between terms and documents that are not evident in individual documents.
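The SVD-based scheme described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the talk's implementation: the toy term-by-document matrix, the vocabulary, the rank k = 2, and the simple term-frequency weighting are all assumptions chosen for clarity. Documents are represented in the rank-k concept space, a query is folded into the same space via the truncated left singular vectors, and documents are ranked by cosine similarity.

```python
import numpy as np

# Toy term-by-document matrix A (rows = terms, columns = documents);
# entries here are raw term frequencies, chosen only for illustration.
terms = ["car", "auto", "engine", "flower", "petal"]
A = np.array([
    [1, 0, 1, 0],  # car
    [0, 1, 1, 0],  # auto
    [1, 1, 0, 0],  # engine
    [0, 0, 0, 1],  # flower
    [0, 0, 1, 1],  # petal
], dtype=float)

# Truncated SVD: A is approximated by U_k @ diag(s_k) @ Vt_k, a low-rank
# approximation that captures the dominant associative structure.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2  # assumed number of latent "concept" dimensions
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

# Documents as points in the k-dimensional concept space
# (columns of diag(s_k) @ Vt_k, taken as rows here).
doc_vecs = (np.diag(s_k) @ Vt_k).T  # shape: (num_docs, k)

# Fold a user's query into the same space: q_hat = q^T U_k diag(1/s_k).
q = np.array([1, 1, 0, 0, 0], dtype=float)  # query mentioning "car", "auto"
q_hat = q @ U_k @ np.diag(1.0 / s_k)

def cosine(a, b):
    # Cosine similarity with a small epsilon to avoid division by zero.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

# Rank documents by similarity of their concept-space vectors to the query.
scores = [cosine(q_hat, d) for d in doc_vecs]
ranking = np.argsort(scores)[::-1]
print("document ranking (best first):", ranking.tolist())
```

Note that because matching happens in the concept space rather than on raw keywords, a document can score well against the query even when it shares few literal terms with it, which is exactly the higher-order association LSI is designed to exploit.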
This talk will focus on the motivation and design of LSI-based IR models.