Semantic Spaces

A Semantic Space, as described in the What is LSA? section, is a mathematical representation of a large body of text. Every term, every text, and every novel combination of terms has a high-dimensional vector representation. To compare two terms, you compute the cosine of the angle between the vectors representing them. This comparison occurs within a single semantic space; you cannot directly compare the same word between semantic spaces.
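As a rough illustration of the cosine comparison described above, the following sketch computes the cosine between two vectors. The function name and the three-dimensional toy vectors are assumptions made for illustration only; real term vectors come from one of the spaces listed below and have several hundred dimensions.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical toy vectors, NOT taken from any real semantic space.
term_a = [0.2, 0.7, 0.1]
term_b = [0.4, 1.4, 0.2]   # same direction as term_a
term_c = [0.7, -0.2, 0.0]  # orthogonal to term_a

similar = cosine(term_a, term_b)
unrelated = cosine(term_a, term_c)
```

Both vectors passed to the function must come from the same space, since the dimensions of one space have no correspondence to those of another.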

At this time, these are the semantic spaces available and those coming soon:

literature

The literature space is composed of English and American literature from the 18th and 19th centuries (English = ~294 works; American = ~444 works). The texts were taken from Project Gutenberg. The space is composed of 104,852 terms and 942,425 documents, with 338 dimensions. The total number of words is 57,092,140.

literature with idioms

Literature with idioms is the same space, but parsed differently: each idiom is treated as a single token. This space was created with 500 factors.
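A minimal sketch of what treating an idiom as a single token could look like is given below. It is not the parser used to build this space: the function name, the idiom list, and the choice to join words with an underscore are all assumptions made for illustration.

```python
def merge_idioms(tokens, idioms):
    """Replace each known multi-word idiom with one joined token.

    tokens: list of words; idioms: set of word tuples (hypothetical list).
    The longest matching idiom at each position wins.
    """
    max_len = max((len(idiom) for idiom in idioms), default=1)
    merged = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first, down to two-word idioms.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = tuple(tokens[i:i + n])
            if candidate in idioms:
                merged.append("_".join(candidate))  # joining style is an assumption
                i += n
                break
        else:
            merged.append(tokens[i])
            i += 1
    return merged


# Hypothetical idiom list, for illustration only.
IDIOMS = {("kicked", "the", "bucket"), ("by", "and", "large")}
result = merge_idioms("he kicked the bucket yesterday".split(), IDIOMS)
```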

encyclopedia

This space contains the text from 30,473 encyclopedia articles. There are 60,768 unique terms. The actual text from these articles is not available in the text selection program; only the titles of the articles will be returned as the documents. There are 371 saved dimensions. Studies show that the optimum dimensionality for this collection is usually 275-350.

psychology

This space contains the text from three college level psychology textbooks with each paragraph used as a document. There are 13,902 documents and 30,119 unique terms. There are 398 saved dimensions. Optimum dimensionality appears to be approximately 300-400.

In addition, there are various spaces created from subsets of these documents. If you wish to use these smaller spaces, email Darrell Laham for details.

smallheart

This small space contains the text from a number of articles about the heart. Each document is a sentence of an article.

French Spaces

There are 8 French semantic spaces ("psychology" is not included among these 8), organized in 4 categories.

1- Newspapers "Le Monde"

Francais-Monde (300) contains six months (January to June 1993) of the "Le Monde" newspaper. It contains 20,208 documents, 150,756 unique words, and 8,675,391 total words.

Francais-Monde-Extended (300) contains the other six months (July to December 1993) of the "Le Monde" newspaper.

Francais-Total (300) is the concatenation of Francais-Monde (300) + Francais-Livres (300).

2- Literature

Francais-Livres (300) contains books published before 1920: 14,622 documents, 111,094 unique words, and 5,748,581 total words.

Francais-Livres1and2 (300) contains books published before 1920 plus recent books. Livres1and2 has approximately 119,000 documents.

Francais-Livres3 (100) is smaller and contains only recent literature with idioms. Livres3 has approximately 26,000 documents.

3- Tales (stories)

Francais-Contes-Total (300) contains all the traditional tales I have found in electronic format, plus recent tales found on web sites. This semantic space is used to study the recall or summary of stories by children and adolescents.

4- Children's production

Francais-Production-Total (300) contains texts written by children aged 7 to 12 in primary schools in Belgium and France.

There are 830 docs and 3034 unique terms. This space was created using a stop list of 439 common words. There are 94 saved dimensions.

heart

This is the smallheart space with additional documents folded in for use in some of the text selection demonstrations. For general purpose comparisons of heart texts, use smallheart.

tasaXX

These spaces use a variety of texts, novels, newspaper articles, and other information, from the TASA (Touchstone Applied Science Associates, Inc.) corpus used to develop The Educator's Word Frequency Guide. We are EXTREMELY THANKFUL to the kind folks at TASA for providing us with these samples.

This first incarnation of TASA-based spaces breaks out by grade level -- there are spaces for 3rd, 6th, 9th and 12th grades plus one for 'college' level. These are cumulative spaces, i.e. the 6th grade space includes all the 3rd grade docs, the 9th grade space includes all the 6th and 3rd, etc.

The judgment for inclusion in a grade level space comes from a readability score (DRP-Degrees of Reading Power Scale) assigned by TASA to each sample. DRP scores in the TASA corpus range from about 30 to about 73. TASA studies determined what ranges of difficulty are being used in different grade levels, e.g. the texts used in 3rd grade classes range from 45-51 DRP units. For the LSA spaces, all documents less than or equal to the maximum DRP score for a grade level are included, e.g. the 3rd grade corpus includes all text samples that score <= 51 DRP units.
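The cumulative inclusion rule above can be sketched as follows. The maximum DRP values are the ones listed for each space on this page; the function name and the use of a dictionary are assumptions made for illustration, not a description of how the spaces were actually built.

```python
# Maximum DRP score per space, as listed on this page.
MAX_DRP = {
    "tasa03": 51,
    "tasa06": 59,
    "tasa09": 62,
    "tasa12": 67,
    "tasaALL": 73,
}


def spaces_containing(drp_score):
    """Names of the spaces that would include a sample with this DRP score.

    A sample belongs to every space whose maximum DRP is >= its score,
    which is what makes the spaces cumulative.
    """
    return [name for name, limit in MAX_DRP.items() if drp_score <= limit]


easy_sample = spaces_containing(49.889142)  # at or below the 3rd grade maximum
harder_sample = spaces_containing(60.0)     # above the 6th grade maximum
```

Under this rule a sample scoring 49.89 appears in all five spaces, while one scoring 60 appears only in tasa09, tasa12, and tasaALL.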

Following are the specifics for each space:

name       grade      maxDRP    #docs     #terms    #dims
tasa03     3          51         6,974    29,315    432
tasa06     6          59        17,949    55,105    412
tasa09     9          62        22,211    63,582    407
tasa12     12         67        28,882    76,132    412
tasaALL    college    73        37,651    92,409    419

The documents are formatted like this:

[Aaron01.01.01] [P#=1] [DRP=49.889142] [SocialStudies=Yes] [S] who were the first americans? [S] many, many years ago, perhaps 35,000 years ago, life was very different than it is today. [S] at that time, the earth was in the grip of the last ice age. ...

The first tag is the ID for the sample, P# is the number of paragraphs in the sample, DRP is the readability score, and any remaining tags name the 'academic area' of the sample.
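For readers who want to process samples in this format, here is a minimal parsing sketch. It is based only on the single example line shown above, so several points are assumptions: that [S] introduces each text segment, that area tags always take the form Name=Yes, and that the header tags always precede the first [S]. The function and key names are hypothetical.

```python
import re

# Matches one bracketed tag, with an optional "=value" part.
TAG = re.compile(r"\[([^\]=]+)(?:=([^\]]*))?\]")


def parse_sample(line):
    """Split one sample line into its header tags and text segments."""
    # Everything before the first [S] is treated as the header (assumption).
    header, _, body = line.partition("[S]")
    tags = TAG.findall(header)
    if not tags:
        raise ValueError("no header tags found")
    sample_id = tags[0][0]
    fields = dict(tags[1:])
    areas = [
        name for name, value in fields.items()
        if name not in ("P#", "DRP") and value == "Yes"
    ]
    segments = [part.strip() for part in body.split("[S]") if part.strip()]
    return {
        "id": sample_id,
        "paragraphs": int(fields["P#"]),
        "drp": float(fields["DRP"]),
        "areas": areas,
        "segments": segments,
    }


example = (
    "[Aaron01.01.01] [P#=1] [DRP=49.889142] [SocialStudies=Yes] "
    "[S] who were the first americans? "
    "[S] at that time, the earth was in the grip of the last ice age."
)
parsed = parse_sample(example)
```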

The breakdown for samples by academic area (in tasaALL):

 

area              samples    paragraphs
LanguageArts       16,044        57,106
Health              1,359         3,396
HomeEconomics         283           403
IndustrialArts        142           462
Science             5,356        15,569
SocialStudies      10,501        29,280
Business            1,079         4,834
Miscellaneous         675         2,272
Unmarked            2,212         6,305
                  -------      --------
Total              37,651       119,627

 

COMING SOON

canparall

This space contains transcripts of Canadian parliamentary proceedings. Each document contains both the English and French language equivalents. There are 2,482 documents and 19,568 unique terms.

astronomy - text from a number of on-line sources about astronomy

biology - text from a college level biology textbook

disney - closed-captioning data from the Disney Channel television programming