Word Frequency List 60000 Englis
Counting words and lemmas: The following frequency lists count distinct orthographic words, including inflected and some capitalised forms. For example, the verb "to be" is represented by "is", "are", "were", and so on.
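To make the distinction concrete, here is a minimal sketch of counting orthographic words rather than lemmas. The sample sentence and the function name count_word_forms are invented for illustration, and the simple regular-expression tokeniser is an assumption, not the method used to build the lists.

```python
import re
from collections import Counter

def count_word_forms(text):
    """Count distinct orthographic forms, case-sensitively, with no lemmatisation."""
    # Letters and apostrophes only; a real tokeniser would handle more cases.
    tokens = re.findall(r"[A-Za-z']+", text)
    return Counter(tokens)

# Hypothetical sample text, for illustration only.
sample = "She is here. They are here. We were there, and he is too."
counts = count_word_forms(sample)

# "is", "are" and "were" each get their own entry; the lemma "be" gets none.
for form in ("is", "are", "were", "be"):
    print(form, counts[form])
```

A lemma-based count would instead add the three forms together under "be".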
Take a look at the Information Content section of the WordNet Similarity project at -similarity.sourceforge.net/. There you will find databases of word frequencies (or, rather, information content, which is derived from word frequency) of WordNet lemmas, calculated from several different corpora. The source code is in Perl, but the databases are provided independently and can be easily used with NLTK.
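As a rough illustration of how information content can be derived from word frequency, the sketch below uses the common definition IC(w) = -log p(w), where p(w) is the word's share of all counted tokens. The counts and the function name information_content are hypothetical. The real databases are, as far as I understand, computed over the WordNet hierarchy rather than over flat word counts, so treat this only as the basic idea.

```python
import math

def information_content(counts):
    """Return IC(w) = -log(p(w)) for each word, where p(w) = count / total."""
    total = sum(counts.values())
    return {word: -math.log(n / total) for word, n in counts.items()}

# Hypothetical counts, for illustration only.
toy_counts = {"the": 500, "dog": 40, "aardvark": 1}
ic = information_content(toy_counts)

# Rarer words carry more information, so they get higher values.
for word in sorted(ic, key=ic.get):
    print(f"{word}: {ic[word]:.3f}")
```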
Because some high-frequency words (e.g., the, and, is, was, for, are) are essential to learning how to read, teachers of kindergarten and grade 1 typically provide explicit instruction to help students automatically read some of these words. Students are taught to read them as whole words at the same time that they are being taught how to decode most other words. However, once students are able to orthographically map, they will start to store high-frequency words as sight words on their own.
THIS list was compiled by merging different word-lists. The British spelling was preferred and American versions deleted. We have used it in crossword compiling (together with a programme) with much success. A few word groups (e.g. RUN_OF_THE_MILL, written RUNOFTHEMILL) are therefore also included. In all hyphenated words the hyphen was deleted to form one word.
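The normalisation described above (hyphens deleted, word groups written as one word, all caps or all lower case) might look something like the following. This is only a sketch: the function name normalise_entry and the exact set of separator characters are assumptions, not the compilers' actual procedure.

```python
# Characters treated as separators inside a word group (an assumption).
SEPARATORS = "-_ "

def normalise_entry(entry, upper=True):
    """Join a hyphenated word or word group into one word in a single case."""
    joined = entry.translate(str.maketrans("", "", SEPARATORS))
    return joined.upper() if upper else joined.lower()

print(normalise_entry("run-of-the-mill"))        # all-caps version
print(normalise_entry("London", upper=False))    # lower-case version
```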
There are two versions of the list, one in all CAPS and the other in all lower case (even for normally capitalised words), both in txt format.
You can see a breakdown of the word frequencies in the infographic below. It uses information from the SubtlexUS American Word Frequency list. This is data taken from subtitles, so it matches spoken English patterns. You'll see how few words you really need to understand the majority of English, and how many you need to understand the rest of it.
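The idea of coverage can be worked out from any frequency list: sort the counts from highest to lowest and see how many words it takes to reach a given share of all tokens. The sketch below uses made-up counts, not SubtlexUS data, and the function name words_needed is hypothetical.

```python
from itertools import accumulate

def words_needed(frequencies, target):
    """Return how many of the most frequent words reach `target` (0..1) of all tokens."""
    ordered = sorted(frequencies, reverse=True)
    total = sum(ordered)
    for rank, running in enumerate(accumulate(ordered), start=1):
        if running >= target * total:
            return rank
    return len(ordered)

# Hypothetical token counts for nine words, for illustration only.
toy_frequencies = [50, 20, 10, 5, 5, 4, 3, 2, 1]

for target in (0.5, 0.75, 0.95):
    print(target, words_needed(toy_frequencies, target))
```

With real data the same pattern shows up on a larger scale: a small number of words covers most of the text, and the remaining coverage requires many more.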
The vocabulary of 10,000 words that this website offers was taken from the vocabulary list compiled in 2012 by Paul Nation and Mark Davies, using the "British National Corpus" (BNC) and "The Corpus of Contemporary American English" (COCA).
Some lists contain annotations, which are special characters appended to certain words. For instance, the ":" character is used in some lists to identify abbreviations which are ordinarily used without a terminating period. This annotation allows these abbreviations to be distinguished from possibly similar regular words. Another annotation, used in the 3of6game and 3of6all lists, is the "$" character, indicating a word that was placed in the list even though fewer than three of the sources mention it. The "+" and "!" annotations are used to identify signature words and neologisms, as described below. Note that it is possible for a word to have more than one annotation, though this is uncommon. For instance, in the 6of12 list, the word boldfaced= has both a "" and a "=" annotation, signifying that the word was an arbitrary choice between two equally attested forms (boldfaced and bold-faced), and that it was not given a separate definition in a majority of the sources listing it.
A number of the lists contain signature words. These are words (or phrases) which do not meet the formal criteria for inclusion in a list, but which I have chosen to add anyway, as words which "ought to be" present. Whether a list contains signature words depends on the specific list. Usually, but not always, a signature word is present in some of the sources used for a list, but not enough of them to qualify for inclusion on that basis. Some lists may "inherit" signature words from other lists from which they were assembled. For instance, the 6phrase list includes the signature words from the 3of6all list. In most cases, signature words are marked with the "+" annotation.
The neol2016 list contains neologisms, words which are not listed in some or all of the source dictionaries for 12dicts, generally for one of two reasons. First, many of the words are recent coinages which were not yet fully recognized by mainstream lexicographers when the 12dicts sources were published. Examples of such words are selfie, Obamacare, emoji and snarky. Other so-called neologisms are well-established, often well-known, words which are considered scandalous, such as sexual slang and ethnic slurs, and which are often deliberately omitted from dictionaries. (I will not give any examples of this sort of word here, but you will find some in the neol2016 list.) Note that the neologism list has been accumulating for about fifteen years now, and some of its words have become almost old-fashioned, such as spam and dotcom. The neologism list is provided so that some or all of its words can be added to the other lists where the intended usage makes that appropriate. However, I have added the single-word neologisms to the 2of12inf and 3of6game, as these lists are the most likely to be used in coding word games, where it is desirable to recognize the very latest hot vocabulary. In these lists, neologisms are annotated with the "!" character.
One other observation worth making is about diacritics. Some dictionaries will tell you that there are English words correctly spelled café, naïve, façade and piñata, and I do not wish to disagree with these authorities. But as a practical matter, Americans do not like to use diacritics. Furthermore they use keyboards which do not contain accented letters, and are often unfamiliar with the often clumsy techniques that their software provides to use such characters. For this reason, 12dicts drops all the accents from its English vocabulary. This is particularly valuable for coding word games, where expecting players to accent the e in cafe is not going to make them happy. (I cannot help pointing out that Scrabble contains no É tiles.) I apologize to those who consider it a matter of some emotional importance that resume and résumé should be differently spelled.
The organization of 12dicts
The 12dicts lists are organized into four directories, grouping lists with similar characteristics together. The remainder of this document follows this organization as well.
For each directory, a section of the documentation describes in detail the lists it contains. Most users of 12dicts end up using only a single list. If it is clear which directory will contain the list you need, you can go directly to the appropriate documentation.
The four directories are:
American. The lists in this directory contain primarily American English words.
International. The lists in this directory contain words from both American English and British English.
Lemmatized. The lists in this directory combine other lists, and are formatted in a way that clarifies word relationships.
Special. The lists in this directory are special-purpose lists that do not fit into the other directories.
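For anyone loading these lists in a program, the annotation and diacritic conventions described earlier suggest two small clean-up steps. The sketch below assumes one entry per line with annotation characters appended at the end of the word. The function names and the set of annotation characters (only those named in this document) are my own assumptions, and individual lists may use other markers.

```python
import unicodedata

# Annotation characters named in the text; other lists may use more.
ANNOTATION_CHARS = set(":$+!=")

def split_annotations(entry):
    """Split a list entry into (word, annotations) by peeling trailing markers."""
    word = entry.strip()
    marks = []
    while word and word[-1] in ANNOTATION_CHARS:
        marks.append(word[-1])
        word = word[:-1]
    return word, "".join(reversed(marks))

def strip_diacritics(word):
    """Remove accents by decomposing characters and dropping combining marks."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(split_annotations("selfie!"))
print(strip_diacritics("résumé"))
```

Peeling markers from the end of the word means that a word with more than one annotation is handled by the same loop.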
Picking a list to use
If you are not certain which directory might contain the kind of list you are looking for, here is a breakdown of the 12dicts lists by size and purpose which may be helpful. If it does not help you find what you are looking for, you might want to check out this table, which summarizes the characteristics of all the 12dicts files, put together by Kevin Atkinson. Also, I suggest reading the introduction to each directory presented in the previous paragraph, each of which contains a table summarizing exactly what you can expect from each list in that directory.
Lists for use in word games: 2of12inf (American), 3of6game (International).
A list ordered by word frequency: 2+2+3frq (Lemmatized).
Small lists of common words: 2of5core (Special, very small), 3esl (American), 2+2+3cmn (Lemmatized).
Medium-sized lists: 6of12 (American, smaller, includes phrases), 2of12 (American, larger, no phrases).
Large lists: 3of6all (International, includes phrases), 5d+2a (International, no phrases, many obscure words), 2+2+3lem (Lemmatized, very large).
A list of phrases: 6phrase (Special).
The classic (American) 12dicts lists
The 12dicts project began as the n-dicts projects, n being a variable whose value finally stabilized as 12. The purpose of the project was to create a list of words approximating the common core of the vocabulary of American English.
The methodology of the project was to record and correlate the words listed in a number of small dictionaries. The number of dictionaries so recorded ended up as 12, comprising 8 ESL (English as a Second Language) dictionaries and 4 "desk dictionaries". The dictionaries chosen varied widely by publisher, by style, by completeness and by depth. All of them were dictionaries of American English (three from British publishers). The smallest of them contained about 20,000 entries, and the largest 46,000. (All totaled, there are about 75,000 entries, many of which appeared in only a single dictionary.) All but two of the sources were published between 1992 and 1999, when 12dicts was first released.
I initially tried two different ways of winnowing the 12dicts data to produce lists of common words. Both produced interesting results. One list, the 6of12 list, contained all words and phrases listed in 6 of the 12 dictionaries. One way of describing this list is that it contains those words and phrases which a (seeming) majority of lexicographers believe are relevant to people learning English, and/or to everyday usage. This list contained about 32,000 words and phrases. The other list, the 2of12 list, was more inclusive in that it included words listed in as few as two of the source dictionaries, but less inclusive in that it excluded items of various sorts, including multi-word phrases, proper names and abbreviations. This list contained about 41,000 words. It was likely more suitable for use in areas like spell checking or word games than the 6of12 list. (Honesty compels me to admit that neither of these lists is, by itself, a good choice for spell checking, due to the absence of inflections, proper names, Roman numerals, etc.)
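The "k of n" winnowing behind names like 6of12 and 2of12 can be sketched as a simple tally of how many source dictionaries list each word. The three tiny word sets below are invented, and the function name words_in_at_least is hypothetical. The extra exclusions applied to 2of12 (phrases, proper names, abbreviations) are not modelled here.

```python
from collections import Counter

def words_in_at_least(dictionaries, k):
    """Return the words that appear in at least `k` of the given word collections."""
    # set() ensures a word counts once per dictionary even if listed twice.
    tally = Counter(word for entries in dictionaries for word in set(entries))
    return {word for word, n in tally.items() if n >= k}

# Three hypothetical "dictionaries", for illustration only.
toy_dictionaries = [
    {"cat", "dog", "emu"},
    {"cat", "dog"},
    {"cat", "yak"},
]

print(sorted(words_in_at_least(toy_dictionaries, 2)))
```

Raising k gives a smaller, more conservative core. Lowering it admits words that only a few sources attest.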