What is LSA?


The information on this page is based on:

Landauer, T. K., Foltz, P. W., & Laham, D. (1998). Introduction to Latent Semantic Analysis. Discourse Processes, 25, 259-284.

which is available for download on the Group Papers page.

Latent Semantic Analysis (LSA) is a theory and method for extracting and representing the contextual-usage meaning of words by statistical computations applied to a large corpus of text. The underlying idea is that the totality of information about all the word contexts in which a given word does and does not appear provides a set of mutual constraints that largely determines the similarity of meaning of words and sets of words to each other. The adequacy of LSA's reflection of human knowledge has been established in a variety of ways. For example, its scores overlap those of humans on standard vocabulary and subject matter tests; it mimics human word sorting and category judgments; it simulates word-word and passage-word lexical priming data; and, as reported in Group Papers, it accurately estimates passage coherence, the learnability of passages by individual students, and the quality and quantity of knowledge contained in an essay.

Research reported in, and applications available from, this website exploit a new method for determining and representing the similarity of meaning of words and passages by statistical analysis of large text corpora. After processing a large sample of machine-readable language, Latent Semantic Analysis (LSA) represents the words used in it, and any set of these words (such as those contained in a sentence, paragraph, or essay, either taken from the original corpus or new), as points in a very high-dimensional (e.g., 50 to 1,000 dimensions) semantic space. LSA is based on singular value decomposition, a mathematical matrix decomposition technique closely akin to factor analysis that has recently become applicable to databases approaching the volume of relevant language experienced by people. Word and discourse meaning representations derived by LSA have been found capable of simulating a variety of human cognitive phenomena, ranging from acquisition of recognition vocabulary to sentence-word semantic priming and judgments of essay quality.

LSA can be construed in two ways: (1) simply as a practical expedient for obtaining approximate estimates of the contextual usage substitutability of words in larger text segments, and of the kinds of (as yet incompletely specified) meaning similarities among words and text segments that such relations may reflect, or (2) as a model of the computational processes and representations underlying substantial portions of the acquisition and utilization of knowledge. We next sketch both views.

As a practical method for the statistical characterization of word usage, we know that LSA produces measures of word-word, word-passage and passage-passage relations that are reasonably well correlated with several human cognitive phenomena involving association or semantic similarity. Empirical evidence of this will be reviewed shortly. The correlation must be the result of the way people's representation of meaning is reflected in the word choice of writers, and/or vice versa, that people's representations of meaning reflect the statistics of what they have read and heard. LSA allows us to approximate human judgments of overall meaning similarity, estimates of which often figure prominently in research on discourse processing. It is important to note from the start, however, that the similarity estimates derived by LSA are not simple contiguity frequencies or co-occurrence contingencies; they depend on a deeper statistical analysis (thus the term "Latent Semantic") that is capable of correctly inferring relations beyond first-order co-occurrence and, as a consequence, is often a far better predictor of human meaning-based judgments and performance.
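This difference between first-order co-occurrence and the deeper relations LSA infers can be made concrete with a toy computation (a sketch in Python with NumPy; the miniature corpus and word list are invented for illustration). Two words that never appear in the same passage, but share the same contexts, end up with nearly identical vectors after dimension reduction:

```python
import numpy as np

# Toy word-by-passage count matrix. "car" and "automobile" never co-occur
# in the same passage, but both occur alongside "driver" and "road".
words = ["car", "automobile", "driver", "road", "banana", "fruit"]
X = np.array([
    [1, 0, 0],  # car        (passage 1 only)
    [0, 1, 0],  # automobile (passage 2 only)
    [1, 1, 0],  # driver
    [1, 1, 0],  # road
    [0, 0, 1],  # banana     (passage 3 only)
    [0, 0, 1],  # fruit
], dtype=float)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# In the raw counts, car and automobile look maximally dissimilar:
assert cosine(X[0], X[1]) == 0.0

# After SVD and truncation to 2 dimensions, their shared contexts
# pull their vectors together.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
vecs = U[:, :2] * s[:2]          # word vectors in the reduced space
print(cosine(vecs[0], vecs[1]))  # near 1.0 despite zero co-occurrence
```

The dropped third dimension is precisely the component that distinguished the two passages containing "car" and "automobile"; discarding it is what lets the second-order relation emerge.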

Of course, LSA, as currently practiced, induces its representations of the meaning of words and passages from analysis of text alone. None of its knowledge comes directly from perceptual information about the physical world, from instinct, or from experiential intercourse with bodily functions and feelings. Thus its representation of reality is bound to be somewhat sterile and bloodless. However, it does take in descriptions and verbal outcomes of all these juicy processes, and so far as people have put such things into words, or insofar as their words have reflected such matters unintentionally, LSA has at least potential access to knowledge about them. The representations of passages that LSA forms can be interpreted as abstractions of "episodes", sometimes episodes of purely verbal content such as logical arguments, and sometimes episodes from real or imagined life coded into verbal descriptions. Its representation of words is, in turn, intertwined with and mutually interdependent with its knowledge of episodes. Thus while LSA's potential knowledge is surely imperfect, we believe it can often offer a close enough approximation to people's knowledge to underwrite theories and tests of theories of cognition. (One might consider its maximal knowledge of the world to be analogous to a well-read nun's knowledge of sex, a level of knowledge often deemed a sufficient basis for advising the young.)

However, LSA as currently practiced has some additional limitations. It makes no use of word order, and thus none of syntactic relations, logic, or morphology. Remarkably, it manages to extract correct reflections of passage and word meanings quite well without these aids, but it must still be suspected of incompleteness or likely error on some occasions.

LSA differs from other statistical approaches in two significant respects. First, the LSA analysis (at least as currently practiced) uses as its initial data not just the summed contiguous pairwise (or tuple-wise) co-occurrences of words, but the detailed patterns of occurrences of words over very large numbers of local meaning-bearing contexts, such as sentences or paragraphs, treated as unitary wholes. Second, the LSA method assumes that the choice of dimensionality in which all of the local word-context relations are jointly represented is of great importance, and that reducing the dimensionality (the number of parameters by which a word or passage is described) of the observed data from the number of initial contexts to a much smaller, but still large, number will often produce much better approximations to human cognitive relations. Thus, an important component of applying the technique is finding the optimal dimensionality for the final representation. A possible interpretation of this step, in terms familiar to researchers in psycholinguistics, is that the resulting dimensions of the description are analogous to the semantic features often postulated as the basis of word meaning, although establishing concrete relations to mentalistically interpretable features poses daunting technical and conceptual problems and has not yet been seriously attempted. Finally, LSA, unlike many other methods, employs a preprocessing step in which the overall distribution of words over usage contexts, independent of their correlations, is taken into account; pragmatically, this step improves LSA's results considerably.

However, as stated above, there is another, quite different way to think about LSA. Landauer and Dumais (1996; 1997) have proposed that LSA constitutes a fundamental computational theory of the acquisition and representation of knowledge. They maintain that its underlying mechanism can account for a long-standing and important mystery, the inductive property of learning by which people acquire much more knowledge than appears to be available in experience, the famous problem of the poverty of the stimulus. The LSA mechanism that solves the problem consists simply of accommodating a very large number of local co-occurrence relations simultaneously in a space of the right dimensionality, hypothetically one in which there is a match of dimensionality between the semantic space of the source that generates discourse and that of the representation in which it is reconstructed, thereby extracting much indirect information from the myriad local constraints and entailments latently contained in the data of experience. The principal support for this claim has come from using LSA to derive measures of the similarity of meaning of words from text. The results have shown that: 1) the meaning similarities so derived closely match those of humans, 2) LSA's rate of acquisition of such knowledge from text approximates that of humans, and 3) these accomplishments depend strongly on the dimensionality of the representation. In this and other ways, LSA performs a powerful and, by the human-comparison standard, correct induction of knowledge. Using representations so derived, it simulates a variety of other cognitive phenomena that depend on word and passage meaning. Thus, we propose to researchers in discourse processing not only that they use LSA to expedite their investigations, but that they join in the project of testing, developing and exploring its fundamental theoretical implications and limits.

Preliminary Details about Creating LSA Semantic Spaces

Latent Semantic Analysis is a fully automatic mathematical/statistical technique for extracting and inferring relations of expected contextual usage of words in passages of discourse. It is not a traditional natural language processing or artificial intelligence program; it uses no humanly constructed dictionaries, knowledge bases, semantic networks, grammars, syntactic parsers, or morphologies, etc., and takes as its input only raw text parsed into words defined as unique character strings and separated into meaningful passages or samples such as sentences or paragraphs.

The first step is to represent the text as a matrix in which each row stands for a unique word and each column stands for a text passage or other context. Each cell contains the frequency with which the word of its row appears in the passage denoted by its column. Next, the cell entries are subjected to a preliminary transformation in which each cell frequency is weighted by a function that expresses both the word's importance in the particular passage and the degree to which the word type carries information in the domain of discourse in general.
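These two steps can be sketched in a few lines of Python (the passages are invented for illustration, and the log-entropy scheme below is one common choice for the weighting function, which the text above leaves unspecified):

```python
import math

# Toy corpus: each passage (sentence or paragraph) is one column of the matrix.
passages = [
    "human machine interface for computer applications",
    "a survey of user opinion of computer system response time",
    "relation of user perceived response time to error measurement",
    "the generation of random binary unordered trees",
]

docs = [p.split() for p in passages]
vocab = sorted({w for d in docs for w in d})
counts = [[d.count(w) for d in docs] for w in vocab]  # word x passage frequencies

# Log-entropy weighting: the local factor log(1 + frequency) expresses the
# word's importance in the particular passage; the global factor (one minus
# the word's normalized entropy over passages) expresses how much information
# the word type carries in the domain of discourse in general.
n_docs = len(docs)
weighted = []
for row in counts:
    total = sum(row)
    entropy = -sum((c / total) * math.log(c / total)
                   for c in row if c) / math.log(n_docs)
    g = 1.0 - entropy
    weighted.append([g * math.log(1 + c) for c in row])
```

A word concentrated in a single passage gets the maximal global weight of 1, while a word spread evenly over all passages gets a global weight of 0, so widely distributed function words contribute little to the final matrix.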

Next, LSA applies singular value decomposition (SVD) to the matrix. This is a form of factor analysis, or more properly the mathematical generalization of which factor analysis is a special case. In SVD a rectangular matrix is decomposed into the product of three other matrices. One component matrix describes the original row entities as vectors of derived orthogonal factor values, another describes the original column entities in the same way, and the third is a diagonal matrix containing scaling values such that when the three components are matrix-multiplied, the original matrix is reconstructed. There is a mathematical proof that any matrix can be so decomposed perfectly, using no more factors than the smallest dimension of the original matrix. When fewer than the necessary number of factors are used, the reconstructed matrix is a least-squares best fit. One can reduce the dimensionality of the solution simply by deleting coefficients in the diagonal matrix, ordinarily starting with the smallest. (In practice, for computational reasons, for very large corpora only a limited number of dimensions can be constructed.)
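The decomposition and dimension reduction just described can be demonstrated directly with NumPy's SVD routine (the matrix values here are invented for illustration):

```python
import numpy as np

# A small weighted word-by-passage matrix (rows: words, columns: passages).
X = np.array([
    [2.0, 0.0, 1.0, 0.0],
    [1.5, 1.0, 0.0, 0.0],
    [0.0, 1.0, 1.0, 0.5],
    [0.0, 0.0, 0.0, 2.0],
])

# Decompose: X = U @ diag(s) @ Vt. Rows of U describe the word entities as
# vectors of derived orthogonal factor values, rows of Vt describe the
# passages the same way, and s holds the scaling (singular) values in
# decreasing order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# With all factors kept, the reconstruction is exact.
assert np.allclose(U @ np.diag(s) @ Vt, X)

# Deleting the smallest scaling values yields the least-squares best fit
# to X at the reduced rank.
k = 2
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(np.linalg.matrix_rank(X_k))  # 2
```

Because NumPy returns the singular values already sorted in decreasing order, truncation amounts to slicing off the trailing columns, which is exactly the "deleting coefficients in the diagonal matrix, starting with the smallest" step described above.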

Landauer, T. K., & Dumais, S. T. (1996). How come you know so much? From practical problem to theory. In D. Hermann, C. McEvoy, M. Johnson, & P. Hertel (Eds.), Basic and applied memory: Memory in context. Mahwah, NJ: Erlbaum, 105-126.

Landauer, T. K., & Dumais, S. T. (1997). A solution to Plato's problem: The Latent Semantic Analysis theory of the acquisition, induction, and representation of knowledge. Psychological Review, 104, 211-240.