A program like MorphAdorner assigns a part of speech tag to each token
in an input text, e.g., this word token is a noun or this token is a period.
This task is difficult because many words can take on more than one
part of speech. Which part of speech applies to a particular
word occurrence depends upon the context in which the word appears.
A set of training data specifies a large number of words along with their
potential parts of speech in actual reading contexts. This combination of
known words and parts of speech, along with statistical methods
and/or context rules, allows a program like MorphAdorner to assign
correct parts of speech to words in new texts about 97% of the time,
as long as all the words in the new texts are known. That is, the words
have been encountered in the training data with all their possible parts of
speech, or the words appear in supplemental dictionaries along with
their parts of speech.
Unfortunately, many words in new texts will not have been seen
in the training data and will not occur in a supplemental dictionary.
A program like MorphAdorner must therefore "guess" the possible
parts of speech for an unknown word before it can assign a proper
part of speech tag in context.
MorphAdorner uses a variety of techniques to guess the possible parts
of speech for an unknown word. The default MorphAdorner guesser
applies the following methods, in order, until at least one
potential part of speech is identified. A programmer can
modify or replace this default guesser, and several MorphAdorner
configuration settings also adjust the guessing process.
-
Is the word punctuation?
Examples: period, quote mark, question mark,
sequence of periods
Assign the punctuation or punctuation class as the part of speech.
-
Is the word a symbol?
Example: a paragraph mark
Assign the symbol class as the part of speech.
-
Is the word a cardinal number?
Examples: 12, 12.5
Assign the cardinal number class as the part of speech.
-
Is the word an ordinal number?
Examples: 1st, 12th
Assign the ordinal number class as the part of speech.
-
Is the word a currency amount?
Examples: $12.50, 1L, 1£, £10
Assign the cardinal number class as the part of speech.
-
Is the word a Roman numeral?
Examples: I, V, IX, .IX., .IX, MMM, IIIJ
Assign the cardinal number class as the part of speech.
For Roman numerals that can also be initials (I, V) or English
pronouns (I), add the proper noun and appropriate
pronoun classes as well.
Note that the definition of a Roman numeral is much looser in
older texts than in contemporary usage.
-
Is the word an ordinal Roman numeral?
Examples: xviith
Assign the ordinal number class as the part of speech.
-
Is the word hyphenated?
Examples: head-master, sea-serpent
MorphAdorner extracts the part of the word after the last hyphen.
If that is a known word, assign its part of speech classes.
The following cases are treated specially.
- a letter followed by ---'s is considered a
possessive noun.
- ---'s or ---'S is considered a
possessive noun.
- a letter followed by --- is considered a
proper or common possessive noun, or an exclamation.
-
Is a spelling standardizer defined?
If so, get the parts of speech for the standardized spelling.
Example: "vniversitie" regularizes to "university"
Assign the part of speech classes for "university" if known.
-
Is the word a proper name?
MorphAdorner defines some auxiliary word lists
containing proper names for people and places.
If the word appears on one of these "name" lists,
assign the proper noun class.
-
Is the word defined by an auxiliary word list?
MorphAdorner defines some auxiliary word lists
that pair words with their possible part of speech classes.
If the word appears on one of these lists, assign the
part of speech classes the list gives for it.
-
Is the word an abbreviation?
Examples: U.S., p.m.
If the word appears to be an abbreviation, assign a
proper noun class if it begins with a capital letter,
or a common noun class if it does not begin with a
capital letter.
-
Is a suffix lexicon defined?
If so, perform the following suffix analysis.
For each successively shorter ending substring of the
word, look up that substring in the suffix lexicon.
If the substring exists in the suffix lexicon, assign its
part of speech classes as those of the unknown word.
Example: reputedly
Look up the successively shorter terminal strings:
reputedly
eputedly
putedly
utedly
tedly
edly
dly
ly
y
and stop at the first of those suffix strings which appears
in the suffix lexicon, and use the associated part of
speech classes.
-
Is the word entirely in upper case?
Example: MCDOODLE
Assign the singular proper noun part of speech class.
-
If all else fails, assume the word is a noun.
If the word begins with a capital letter and ends with "s",
assume it is a plural proper noun.
If the word begins with a capital letter and does not
end with "s", assume it is a singular proper noun.
If the word does not begin with a capital letter and ends with "s",
assume it is a plural common noun.
If the word does not begin with a capital letter and does not
end with "s", assume it is a singular common noun.
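
The "apply each method in order until one succeeds" behavior of the
default guesser can be sketched as a chain of small functions, each
returning a set of candidate part of speech classes or an empty set.
This is a minimal Python illustration, not MorphAdorner's own Java code.
The function names, the class labels ("punctuation", "cardinal",
"ordinal"), and the regular expressions are all assumptions made for
the example, and only three of the steps are shown.

```python
import re
from typing import Callable, List, Set

# Hypothetical class labels; real tag names depend on the tag set in use.
PUNCT, CARDINAL, ORDINAL = "punctuation", "cardinal", "ordinal"


def guess_punctuation(word: str) -> Set[str]:
    # Simplification: any run of non-alphanumeric, non-space characters,
    # such as "." or "...", counts as punctuation.
    return {PUNCT} if re.fullmatch(r"[^\w\s]+", word) else set()


def guess_cardinal(word: str) -> Set[str]:
    # Plain integers and decimals such as "12" or "12.5".
    return {CARDINAL} if re.fullmatch(r"\d+(\.\d+)?", word) else set()


def guess_ordinal(word: str) -> Set[str]:
    # Digits followed by an English ordinal ending, such as "1st" or "12th".
    return {ORDINAL} if re.fullmatch(r"\d+(st|nd|rd|th)", word) else set()


# The order of this list is the order in which the checks are tried.
GUESSERS: List[Callable[[str], Set[str]]] = [
    guess_punctuation,
    guess_cardinal,
    guess_ordinal,
]


def guess_classes(word: str, guessers: List[Callable[[str], Set[str]]] = GUESSERS) -> Set[str]:
    # Try each guesser in turn and stop at the first non-empty result.
    for guess in guessers:
        classes = guess(word)
        if classes:
            return classes
    return set()
```

Keeping the checks in a list makes the order explicit, and replacing or
reordering a step only means editing that list.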
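
The suffix analysis step can be sketched in a few lines of Python.
The sketch assumes the suffix lexicon is a plain dictionary mapping an
ending string to a set of part of speech classes. The function name
guess_by_suffix, the toy lexicon, and its class labels are hypothetical
and do not reflect MorphAdorner's actual data or API.

```python
from typing import Dict, Optional, Set, Tuple


def guess_by_suffix(
    word: str, suffix_lexicon: Dict[str, Set[str]]
) -> Optional[Tuple[str, Set[str]]]:
    # Walk from the longest ending substring (the whole word) down to the
    # shortest (the last letter) and stop at the first one in the lexicon.
    for start in range(len(word)):
        suffix = word[start:]
        if suffix in suffix_lexicon:
            return suffix, suffix_lexicon[suffix]
    # No ending substring is known; the caller moves on to the next method.
    return None


# Toy lexicon for illustration only.
toy_lexicon = {"ly": {"adverb"}, "y": {"adjective", "noun"}}
match = guess_by_suffix("reputedly", toy_lexicon)
```

With this toy lexicon the lookup for "reputedly" stops at "ly", because
no longer ending string is present, and never reaches "y".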
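
The last two steps, the upper case check and the final noun fallback,
reduce to tests on capitalization and a trailing "s". The Python sketch
below mirrors those rules. The function name and the returned labels
are assumptions for illustration, not MorphAdorner identifiers or tags.

```python
def last_resort_class(word: str) -> str:
    # A word entirely in upper case is taken to be a singular proper noun.
    if word.isupper():
        return "singular proper noun"
    # Otherwise assume a noun and use capitalization and a final "s"
    # to choose among the four noun classes.
    proper = word[:1].isupper()
    plural = word.endswith("s")
    if proper:
        return "plural proper noun" if plural else "singular proper noun"
    return "plural common noun" if plural else "singular common noun"
```

Because the upper case check comes first, a word such as MCDOODLES is
still treated as singular in this sketch, which follows the order of the
steps listed above.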