|
Lemmatization is the process of reducing
an inflected spelling to its lexical root, or lemma
form. The lemma form is the base form or headword
you would find in a dictionary. The combination of
the lemma form with its word class (noun, verb, etc.) is
called the lexeme.
In English, the base form for a verb is the simple
infinitive. For example, the gerund "striking" and the past
form "struck" are both forms of the lemma "(to) strike".
The base form for a noun is the singular form. For
example, the plural "mice" is a form of the lemma "mouse."
Most English spellings can be lemmatized using regular
rules of English grammar, as long as the word class is
known. MorphAdorner uses a list of about 200 such rules.
Some spellings require special handling because they don't
follow the rules. These irregular forms include "strong"
verbs like "to catch" and nouns like "mouse." MorphAdorner
recognizes over 3,000 irregular forms.
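The two-step lookup described above (irregular forms first, then regular rules keyed by word class) can be sketched roughly as follows. This is a minimal illustration, not MorphAdorner's implementation: the `IRREGULAR` table, the `RULES` table, and the `lemmatize` function are all hypothetical names, and the handful of rules shown stands in for the much larger rule and exception lists.

```python
# Illustrative data only: NOT MorphAdorner's actual irregular forms or rules.
IRREGULAR = {
    ("mice", "noun"): "mouse",
    ("struck", "verb"): "strike",
    ("caught", "verb"): "catch",
}

# (suffix, replacement) pairs per word class, tried in order.
RULES = {
    "noun": [("ies", "y"), ("es", "e"), ("s", "")],
    "verb": [("ing", "e"), ("ed", "e"), ("s", "")],
}


def lemmatize(spelling, word_class):
    """Return a lemma for a spelling, given its word class."""
    word = spelling.lower()
    # Irregular forms take priority over the regular rules.
    if (word, word_class) in IRREGULAR:
        return IRREGULAR[(word, word_class)]
    # Otherwise apply the first matching suffix rule for this word class.
    for suffix, replacement in RULES.get(word_class, []):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)] + replacement
    return word


print(lemmatize("striking", "verb"))
print(lemmatize("struck", "verb"))
print(lemmatize("mice", "noun"))
```

The toy rules are deliberately crude. For instance, the "ing" rule would turn "walking" into "walke", which is why a realistic rule set needs extra conditions on each rule.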
The lemma form of a spelling depends upon its word class.
Thus the noun "bee" has "bee" as a lemma form, while "bee"
as a verb (an old spelling of "be") has "(to) be" as a lemma
form. This ambiguity is a bigger problem in Early Modern English
than in contemporary English because spelling was not reasonably
standardized until the late eighteenth century. Using a standard
spelling helps in finding the lemma form. For example,
the gerund "strykynge" is an old spelling for "striking."
By transforming the old spelling to a standardized (usually
modern) spelling, we can apply the standard lemmatization
rules and obtain "(to) strike" as the lemma. MorphAdorner's
English lemmatizer works best with standardized spellings.
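As a rough sketch of why standardization comes first, the example below maps an old spelling to a standard one before applying a lemmatization rule. The `STANDARD_SPELLINGS` table and all function names are assumptions made for illustration, and `lemmatize_standard` is a one-rule stand-in for a real lemmatizer, not MorphAdorner's code.

```python
# Hypothetical variant-to-standard mapping, keyed by (spelling, word class).
STANDARD_SPELLINGS = {
    ("strykynge", "verb"): "striking",
    ("bee", "verb"): "be",
}


def standardize(spelling, word_class):
    """Map a variant spelling to its standard spelling, if one is known."""
    word = spelling.lower()
    return STANDARD_SPELLINGS.get((word, word_class), word)


def lemmatize_standard(spelling, word_class):
    """Stand-in lemmatizer with a single toy rule for '-ing' verb forms."""
    if word_class == "verb" and spelling.endswith("ing"):
        return spelling[:-3] + "e"
    return spelling


def lemmatize_variant(spelling, word_class):
    """Standardize the spelling first, then lemmatize the result."""
    return lemmatize_standard(standardize(spelling, word_class), word_class)


# Without standardization the '-ing' rule never fires on the old spelling.
print(lemmatize_standard("strykynge", "verb"))
# With standardization the regular rule applies.
print(lemmatize_variant("strykynge", "verb"))
```

Keying the mapping by word class also lets the same spelling resolve differently: here "bee" as a verb becomes "be", while "bee" as a noun is left alone.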
Another problem area is the use of "'s" as a
possessive marker. Sixteenth- and seventeenth-century English
texts generally did not use "'s" for the possessive
form. Thus a phrase like "his majesty's horses" might
appear as "his majesties horses." Handling this problem
requires part-of-speech tagging in tandem with spelling
standardization.
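To illustrate how a part-of-speech tag can drive the standardization of possessives, here is a small sketch. The tag labels ("noun-possessive" and so on) are invented for this example and are not MorphAdorner's tag set. The tags themselves are assumed to come from an upstream tagger.

```python
def standardize_possessive(spelling, tag):
    """Rewrite an old-style possessive spelling, guided by its tag."""
    if tag != "noun-possessive" or spelling.endswith("'s"):
        return spelling
    # "majesties" tagged as a possessive becomes "majesty's".
    if spelling.endswith("ies"):
        return spelling[:-3] + "y's"
    # "kings" tagged as a possessive becomes "king's".
    if spelling.endswith("s"):
        return spelling[:-1] + "'s"
    return spelling


# Hypothetical tagger output for "his majesties horses".
tagged = [
    ("his", "pronoun-possessive"),
    ("majesties", "noun-possessive"),
    ("horses", "noun-plural"),
]
standardized = [standardize_possessive(word, tag) for word, tag in tagged]
print(" ".join(standardized))
```

The spelling alone cannot settle the question: "majesties" tagged as a plural noun would be left unchanged, which is why the tagger and the standardizer have to work together.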
A harder problem is the disambiguation of homonyms like "lie"
or "bark." There are a few hundred (at most) such pairs in
English. In the future we may be able to distinguish which
homonym is meant in some situations using methods
collectively called word sense disambiguation, which
would allow more accurate lemmatization for homonyms.
You can read a more detailed description of the
English lemmatization process.
Stemming offers a simpler alternative to
lemmatization. Stemming also attempts to reduce a word to a
base form by removing affixes, but the resulting stem is
not necessarily a proper lemma. Such stems can be useful in
information retrieval applications.
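The difference between a stem and a lemma can be seen with a deliberately crude suffix stripper. This is not the Porter or Lancaster algorithm, only a toy written for illustration. The `SUFFIXES` list and `crude_stem` are hypothetical names.

```python
# Toy suffix list, checked in order; longer suffixes come first.
SUFFIXES = ("ingly", "ing", "ed", "ly", "es", "s")


def crude_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for form in ("striking", "strikes", "struck"):
    print(form, "->", crude_stem(form))
```

Here "striking" and "strikes" share the stem "strik", which is enough to group them for retrieval even though "strik" is not a dictionary word. The irregular form "struck" is left untouched, so it is not grouped with them as a lemmatizer would do.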
Two widely used stemmers are included in MorphAdorner.
- The Porter stemmer, created by Martin Porter.
- The Lancaster stemmer, created by Chris Paice and Gareth Husk.
You can try MorphAdorner's
English lemmatizer online.
|