Word Embedding Analysis

Welcome lsa.colorado.edu users! This is an updated website that encompasses all of the old lsa.colorado.edu functionality and more.

Overview

Semantic analysis of language is commonly performed using word embeddings, which represent words as vectors in a high-dimensional space. These embeddings are generated under the premise of distributional semantics, whereby "a word is characterized by the company it keeps" (John R. Firth). Under this premise, words that appear in similar contexts are taken to be semantically related, and they consequently lie close to one another in the derived embedding space. This approach has served as the basis for a number of widely used word embedding methods.
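As a rough illustration of the distributional premise, the sketch below counts the words that occur near each word in a tiny made-up corpus and compares the resulting count vectors with cosine similarity. The corpus, the window size, and the function names (context_vectors, cosine) are assumptions chosen for illustration; this is not the procedure used by the tools on this website.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy corpus: "cat" and "dog" occur in similar contexts, "car" does not.
corpus = [
    "the cat chased the mouse",
    "the dog chased the ball",
    "the cat ate the food",
    "the dog ate the food",
    "she drove the car to work",
]

def context_vectors(sentences, window=2):
    """For each word, count the words found within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

vectors = context_vectors(corpus)
sim_cat_dog = cosine(vectors["cat"], vectors["dog"])
sim_cat_car = cosine(vectors["cat"], vectors["car"])
print(f"cat~dog: {sim_cat_dog:.3f}  cat~car: {sim_cat_car:.3f}")
```

In this toy corpus "cat" and "dog" share the same neighbors, so their similarity is higher than that of "cat" and "car". Raw counts like these are dominated by frequent words such as "the", which is one reason practical methods reweight or compress the counts.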

Approaches to generating word embeddings have evolved over the years. An early technique is Latent Semantic Analysis (LSA; Deerwester et al., 1990; Landauer, Foltz & Laham, 1998), and a more recent one is word2vec (Mikolov et al., 2013). LSA performs a singular value decomposition on a sparse word-type-by-document matrix to obtain a lower-dimensional vector for each type. Word2vec trains a neural network on a large corpus of text to predict either a word given its context (continuous bag of words; CBOW) or the context surrounding a given word (skip-gram). Contemporary examples of word embedding techniques include ELMo, BERT, GPT-3, and XLNet.
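The LSA step described above can be sketched with NumPy on a toy corpus. The documents, the choice of k = 2 dimensions, and the helper name cosine are assumptions made for illustration. Practical LSA pipelines typically also apply a term weighting to the counts and keep far more dimensions, and nothing here should be read as the implementation behind this website.

```python
import numpy as np

# Hypothetical toy corpus: three documents about pets, two about vehicles.
docs = [
    "cat dog pet",
    "dog pet food",
    "cat pet food",
    "car engine road",
    "car road truck",
]

vocab = sorted({w for d in docs for w in d.split()})
index = {w: i for i, w in enumerate(vocab)}

# Word-type-by-document count matrix: rows are types, columns are documents.
X = np.zeros((len(vocab), len(docs)))
for j, d in enumerate(docs):
    for w in d.split():
        X[index[w], j] += 1

# Singular value decomposition, then keep only the k largest singular values.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
word_vectors = U[:, :k] * s[:k]  # one k-dimensional vector per word type

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

sim_cat_food = cosine(word_vectors[index["cat"]], word_vectors[index["food"]])
sim_cat_car = cosine(word_vectors[index["cat"]], word_vectors[index["car"]])
print(f"cat~food: {sim_cat_food:.3f}  cat~car: {sim_cat_car:.3f}")
```

With this corpus the two retained dimensions separate the pet documents from the vehicle documents, so "cat" ends up close to "food" and essentially orthogonal to "car".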

The analysis tools available on this website harness LSA, word2vec, and BERT word embeddings. Others may be provided later.

Quick Links

Information

First-time user? See the informational page on word embedding analysis for an overview of word embeddings. For instructions on performing word embedding analyses with this website, see the how-to page.

See the papers page for recommended reading, technical information on the embedding techniques used on this website, and examples of semantic comparisons applied in various domains.

See the FAQ page for answers to frequently asked questions.