Semantic Spaces
A Semantic Space, as described in the What is LSA? section, is a mathematical representation of a large body of text. Every term, every text, and every novel combination of terms has a high-dimensional vector representation. When you compare two terms, you compute the cosine of the angle between the vectors representing them. This comparison always occurs within a single semantic space. You cannot compare the same word directly between semantic spaces.
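As a minimal sketch of the comparison described above, the cosine between two vectors can be computed as follows. The three-dimensional vectors and the names `term_a`, `term_b`, and `term_c` are hypothetical values chosen for illustration; they are not taken from any of the spaces listed here.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (norm_u * norm_v)

# Hypothetical vectors for three terms in the SAME space.
# Real spaces have far more dimensions than three.
term_a = [1.0, 2.0, 0.0]
term_b = [2.0, 4.0, 0.0]   # points in the same direction as term_a
term_c = [0.0, 0.0, 3.0]   # orthogonal to term_a

print(cosine(term_a, term_b))
print(cosine(term_a, term_c))
```

Vectors drawn from two different spaces may differ in length and their dimensions do not correspond to each other, which is why the comparison is only meaningful inside one space.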
At this time, these are the semantic spaces available and those coming soon:
literature
The literature space is composed of English and American literature from the 18th and 19th centuries (English = ~294 works; American = ~444 works). The texts were taken from Project Gutenberg. The space is composed of 104,852 terms and 942,425 documents, with 338 dimensions. The total number of words is 57,092,140.
literature with idioms
Literature with idioms is the same space with a different parsing: each idiom is treated as a single token. This space was created with 500 factors.
encyclopedia
This space contains the text from 30,473 encyclopedia articles. There are 60,768 unique terms. The actual text from these articles is not available in the text selection program; only the titles of the articles will be returned as the documents. There are 371 saved dimensions. Studies show that the optimum dimensionality for this collection is usually 275-350.
psychology
This space contains the text from three college level psychology textbooks with each paragraph used as a document. There are 13,902 documents and 30,119 unique terms. There are 398 saved dimensions. Optimum dimensionality appears to be approximately 300-400.
In addition, there are various spaces created from subsets of these documents. If you wish to use these smaller spaces, email Darrell Laham for details.
smallheart
This small space contains the text from a number of articles about the heart. Each document is a sentence of an article.
French Spaces
There are 8 French semantic spaces ("psychology" is not included in these 8), organized in 4 categories.
1- Newspapers "Le Monde"
Francais-Monde (300) contains six months (January to June) of "Le Monde" newspapers (1993). It contains 20208 documents, 150756 different unique words, and 8675391 total words.
Francais-Monde-Extended (300) contains the other six months (July to December) of "Le Monde" newspapers (1993).
Francais-Total (300) is the concatenation of Francais-Monde (300) + Francais-Livres (300).
2- Literature
Francais-Livres (300) contains books published before 1920: 14622 documents, 111094 different unique words, and 5748581 total words.
Francais-Livres1and2 (300) contains books published before 1920 + recent books. Livres1and2 is <>119000 docs
Francais-Livres3 (100) is smaller and contains only recent literature with idioms. Livres3 is <>26000 docs.
3- Tales (stories)
Francais-Contes-Total (300) contains all traditional tales I have found in electronic format + recent tales also found on web sites. This semantic space is used to study recall or summary of stories by children and adolescents.
4- Children production
Francais-Production-Total (300) contains texts written by children from 7 to 12 years in primary school in Belgium and France.
There are 830 docs and 3034 unique terms. This space was created using a stop list of 439 common words. There are 94 saved dimensions.
heart
This is the smallheart space with additional documents folded in for use in some of the text selection demonstrations. For general purpose comparisons of heart texts, use smallheart.
tasaXX
These spaces use a variety of texts, novels, newspaper articles, and other information, from the TASA (Touchstone Applied Science Associates, Inc.) corpus used to develop The Educator's Word Frequency Guide. We are EXTREMELY THANKFUL to the kind folks at TASA for providing us with these samples.
This first incarnation of TASA-based spaces breaks out by grade level -- there are spaces for 3rd, 6th, 9th and 12th grades plus one for 'college' level. These are cumulative spaces, i.e. the 6th grade space includes all the 3rd grade docs, the 9th grade space includes all the 6th and 3rd, etc.
The judgment for inclusion in a grade level space comes from a readability score (DRP-Degrees of Reading Power Scale) assigned by TASA to each sample. DRP scores in the TASA corpus range from about 30 to about 73. TASA studies determined what ranges of difficulty are being used in different grade levels, e.g. the texts used in 3rd grade classes range from 45-51 DRP units. For the LSA spaces, all documents less than or equal to the maximum DRP score for a grade level are included, e.g. the 3rd grade corpus includes all text samples that score <= 51 DRP units.
Following are the specifics for each space:
name | grade | maxDRP | #docs | #terms | #dims
tasa03 | 3 | 51 | 6,974 | 29,315 | 432
tasa06 | 6 | 59 | 17,949 | 55,105 | 412
tasa09 | 9 | 62 | 22,211 | 63,582 | 407
tasa12 | 12 | 67 | 28,882 | 76,132 | 412
tasaALL | college | 73 | 37,651 | 92,409 | 419
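The cumulative inclusion rule can be sketched as a simple threshold check. The maxDRP values below are the ones listed above; the function name `spaces_for_sample` is hypothetical, and treating 73 as a hard cutoff for tasaALL is an assumption, since the text only says scores range up to "about 73".

```python
# Maximum DRP score per space, as listed in the specifics above.
MAX_DRP = {
    "tasa03": 51,
    "tasa06": 59,
    "tasa09": 62,
    "tasa12": 67,
    "tasaALL": 73,  # assumption: used here as a hard upper bound
}

def spaces_for_sample(drp):
    """Return every space whose maxDRP is at or above the sample's DRP score."""
    return [name for name, max_drp in MAX_DRP.items() if drp <= max_drp]

# A sample scored 49.889142 falls under every threshold,
# so it belongs to all five cumulative spaces.
print(spaces_for_sample(49.889142))

# A sample scored 60.0 is too hard for tasa03 and tasa06.
print(spaces_for_sample(60.0))
```

Because each space keeps everything at or below its threshold, a sample that qualifies for one grade level also appears in every higher one.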
The documents are formatted like this:
[Aaron01.01.01] [P#=1] [DRP=49.889142] [SocialStudies=Yes] [S] who were the first americans? [S] many, many years ago, perhaps 35,000 years ago, life was very different than it is today. [S] at that time, the earth was in the grip of the last ice age. ...
The first tag is the ID for the sample, P# is the number of paragraphs in the sample, DRP is the readability score, and any remaining tags mark the 'academic area'.
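A document in this format could be read with a short parser along the following lines. This is only a sketch: the function name `parse_sample` and the returned field names are hypothetical, it assumes every header tag after the ID has the form key=value, and it assumes each [S] marker starts a new sentence, which the description above does not state explicitly.

```python
import re

def parse_sample(line):
    """Split one TASA-style document line into header fields and text."""
    # Everything before the first [S] marker is treated as the header.
    header, _, body = line.partition("[S]")
    tags = re.findall(r"\[([^\]]+)\]", header)
    sample_id = tags[0]
    # Remaining header tags are assumed to look like key=value.
    fields = dict(tag.split("=", 1) for tag in tags[1:])
    areas = [key for key, value in fields.items()
             if key not in ("P#", "DRP") and value == "Yes"]
    # Assumption: each [S] marker introduces one sentence.
    sentences = [s.strip() for s in body.split("[S]") if s.strip()]
    return {
        "id": sample_id,
        "paragraphs": int(fields["P#"]),
        "drp": float(fields["DRP"]),
        "areas": areas,
        "sentences": sentences,
    }

sample = ("[Aaron01.01.01] [P#=1] [DRP=49.889142] [SocialStudies=Yes] "
          "[S] who were the first americans? "
          "[S] at that time, the earth was in the grip of the last ice age.")

parsed = parse_sample(sample)
print(parsed["id"], parsed["paragraphs"], parsed["drp"], parsed["areas"])
print(parsed["sentences"])
```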
The breakdown for samples by academic area (in tasaALL):
academic area | samples | paragraphs
LanguageArts | 16,044 | 57,106
Health | 1,359 | 3,396
HomeEconomics | 283 | 403
IndustrialArts | 142 | 462
Science | 5,356 | 15,569
SocialStudies | 10,501 | 29,280
Business | 1,079 | 4,834
Miscellaneous | 675 | 2,272
Unmarked | 2,212 | 6,305
------- | ------- | --------
Total | 37,651 | 119,627
COMING SOON
canparall
This space contains transcripts from the Canadian parliament proceedings. Each document contains both the English and French language equivalents. There are 2482 documents and 19,568 unique terms.
astronomy - text from a number of on-line sources about astronomy
biology - text from a college level biology textbook
disney - closed-captioning data from the Disney Channel television programming