Poets that lasting marble seek,
Must carve in Latin or in Greek.
We write in sand, our language grows,
And like the tide, our work o'erflows.

-- Edmund Waller



  Spelling Standardizer
 
 

English texts of the past exhibit far greater spelling variance than contemporary texts. Texts from the seventeenth century and earlier follow conventions that differ from contemporary standards in the use of "u", "v", and "y" and in capitalization, among others. Often the same word is spelled differently even within the same work. By the eighteenth century, texts employ much more modern orthographic standards, except for capitalization.

MorphAdorner uses rules, word lists, and extended search techniques such as spelling correction methods and other heuristics to map variant spellings to their standard (usually modern) form. For obsolete words, a representative standard form is chosen, usually the Oxford English Dictionary headword form. MorphAdorner presently knows a couple of hundred thousand variant spellings. Using this list, MorphAdorner can in many cases automatically determine the correct standard form for previously unseen spellings.
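The combination of word lists and rules described above can be pictured with a minimal Python sketch. This is an illustration under stated assumptions, not MorphAdorner's implementation: the tiny word lists, the two u/v rules, and the function names `rule_candidates` and `standardize` are all hypothetical.

```python
# Hypothetical sample data; a real variant list would be far larger.
KNOWN_VARIANTS = {
    "loue": "love",
    "vnto": "unto",
    "haue": "have",
    "ioy": "joy",
}
STANDARD_WORDS = {"love", "unto", "have", "joy", "upon", "ever", "give"}

VOWELS = "aeiou"


def rule_candidates(spelling):
    """Yield candidate forms produced by two illustrative u/v rules."""
    # Rule 1 (assumed): word-initial "v" before a consonant may stand for "u".
    if len(spelling) > 1 and spelling[0] == "v" and spelling[1] not in VOWELS:
        yield "u" + spelling[1:]
    # Rule 2 (assumed): "u" between two vowels may stand for "v".
    for i in range(1, len(spelling) - 1):
        if (spelling[i] == "u"
                and spelling[i - 1] in VOWELS
                and spelling[i + 1] in VOWELS):
            yield spelling[:i] + "v" + spelling[i + 1:]


def standardize(spelling):
    """Return a standard form, or the lowercased input if nothing applies."""
    word = spelling.lower()
    if word in STANDARD_WORDS:        # already standard
        return word
    if word in KNOWN_VARIANTS:        # known variant: direct lookup
        return KNOWN_VARIANTS[word]
    for candidate in rule_candidates(word):
        if candidate in STANDARD_WORDS:   # accept only attested candidates
            return candidate
    return word                       # unknown: leave unchanged


print(standardize("loue"), standardize("vpon"), standardize("euer"))
```

In this sketch a rule-generated candidate is accepted only when it appears in the standard word list, which keeps the rules from inventing forms that do not exist.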

Sometimes a new spelling is just too different from any of the ones MorphAdorner already knows. Using the extended search facilities on such a spelling may result in a "standard spelling" which veers far from the correct form. As time goes on we hope to reduce the occurrence of such errors.
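The risk described above can be illustrated with a small fuzzy-matching sketch. It uses Python's standard `difflib` module as a stand-in for the extended search; the word list, the `cutoff` values, and the function name `fuzzy_standardize` are assumptions made for this example and say nothing about the similarity measure MorphAdorner actually uses.

```python
import difflib

# Hypothetical list of standard forms.
STANDARD_WORDS = ["knight", "night", "though", "thought", "through"]


def fuzzy_standardize(spelling, cutoff=0.8):
    """Return the closest standard form, or None if nothing is similar enough."""
    matches = difflib.get_close_matches(
        spelling.lower(), STANDARD_WORDS, n=1, cutoff=cutoff
    )
    return matches[0] if matches else None


# A close variant is recovered correctly.
print(fuzzy_standardize("knyght"))

# Middle English "thurgh" means "through", but by character similarity
# alone it sits closer to "though", so the search picks the wrong word.
print(fuzzy_standardize("thurgh"))

# A stricter cutoff declines to guess instead of guessing wrongly.
print(fuzzy_standardize("thurgh", cutoff=0.9))
```

Raising the cutoff trades coverage for precision: fewer unseen spellings get a standard form, but fewer of the assigned forms are wrong.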

Orthographic standardization improves the quality of part-of-speech tagging, name recognition, and text searching. However, standardization by itself isn't sufficient to fix some other problems. These include the lack of the apostrophe to mark the possessive case and the inconsistent practices of capitalization as markers of proper nouns.

In English before 1700 the apostrophe never indicates the genitive, and "her mother's daughter" is written "her mothers daughter". An even more problematic example is "her majesty's daughter", which appears in early texts as "her majesties daughter". The use of the apostrophe as a genitive marker gained ground during the eighteenth century, and it has been used as it is today since the early nineteenth century.

In the eighteenth century, the apostrophe is sometimes used as a plural marker in certain character combinations. Thus "canoe's" is much more likely to be a plural than a possessive form.
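To see why spelling standardization alone cannot settle these cases, it may help to enumerate the modern readings a single early form could stand for. The sketch below is purely illustrative: the function name `possible_readings` and its crude suffix tests are assumptions, and choosing among the readings would need context that this code does not look at.

```python
def possible_readings(word):
    """List plausible modern readings of an early form ending in "s" or "'s"."""
    readings = [word]  # the form taken at face value
    if word.endswith("'s"):
        # Eighteenth-century plural written with an apostrophe: canoe's -> canoes
        readings.append(word[:-2] + "s")
    elif word.endswith("ies"):
        # majesties -> majesty's (singular possessive) or majesties' (plural possessive)
        readings.append(word[:-3] + "y's")
        readings.append(word + "'")
    elif word.endswith("s"):
        # mothers -> mother's (singular possessive) or mothers' (plural possessive)
        readings.append(word[:-1] + "'s")
        readings.append(word + "'")
    return readings


for form in ["mothers", "majesties", "canoe's"]:
    print(form, "->", possible_readings(form))
```

Each early form maps to two or three candidates, so a later stage such as part-of-speech tagging has to decide which one fits the sentence.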

The modern practice of restricting capitalization to names, namelike entities, and certain emphatic uses is about two centuries old. In earlier English, nouns are freely capitalized, and capitalization is not a reliable way of picking out proper nouns. However, proper nouns have usually been capitalized in all forms of written English since about 1550. Before that, names can appear in lower case.

In poetry the first word of each line is often capitalized even when that word does not start a sentence. For purposes of part-of-speech tagging, a simple workaround is to use the lower case form of a word that does not start a sentence, except if the word appears in a list of known proper names.
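The workaround just described can be sketched in a few lines of Python. The proper-name list, the token representation, and the function name `normalize_case` are hypothetical; the sketch also assumes that sentence boundaries have already been found by some other step.

```python
# Hypothetical list of known proper names.
KNOWN_PROPER_NAMES = {"Latin", "Greek"}


def normalize_case(tokens, sentence_start_indices, proper_names=KNOWN_PROPER_NAMES):
    """Lowercase every token that neither starts a sentence nor is a known name."""
    result = []
    for i, token in enumerate(tokens):
        if i in sentence_start_indices or token in proper_names:
            result.append(token)          # keep the original capitalization
        else:
            result.append(token.lower())  # e.g. a capitalized line-initial word
    return result


# Two verse lines forming one sentence; "Must" opens a line, not a sentence.
tokens = ["Poets", "that", "lasting", "marble", "seek", ",",
          "Must", "carve", "in", "Latin", "or", "in", "Greek", "."]
print(normalize_case(tokens, sentence_start_indices={0}))
```

Here "Must" becomes "must" while "Latin" and "Greek" keep their capitals. A fuller version would need extra exceptions, for instance for the pronoun "I", which this sketch would lowercase.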

You can read a more detailed description of the spelling standardization process.

You can try MorphAdorner's spelling standardizer online.

 

Academic Technologies  NU Library 2East  1970 Campus Drive  Evanston, IL 60208
E-mail: pib@northwestern.edu
Last updated Sun Mar 15 05:52:56 2009. © 2007, 2008 Northwestern University