This section describes the process by which MorphAdorner maps a variant
spelling to a standard (usually modern) form.
Spelling Map File Formats
Spelling maps are the key to MorphAdorner's methodology for standardizing
or modernizing spelling. A spelling map is a UTF-8 text file in which each
line contains two fields separated by a tab character. The first field is
a variant spelling. The second field is the standardized spelling for the
variant.
Currently MorphAdorner uses two maps. The first is culled primarily
from nineteenth century fiction texts and currently contains about
5,000 entries. The second is culled from Early Modern English texts
and contains over 350,000 known variants. There is also a
short list of about 400 variants which are known to vary by word class.
Here are some entries from the Early Modern English spelling map
showing standard spellings for forms of "advance."
The first column is the variant, the second column is the standard
spelling.
| aduauce | advance |
| aduauced | advanced |
| aduauceing | advancing |
| aduaucement | advancement |
| aduauceth | advanceth |
| aduaucing | advancing |
| aduaucyng | advancing |
| aduaucynge | advancing |
| aduaunc'd | advanced |
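As a minimal sketch of how such a map could be read, assuming only the two-field, tab-separated layout described above: the function name `load_spelling_map` is hypothetical and is not part of MorphAdorner.

```python
def load_spelling_map(lines):
    """Build a variant -> standard dictionary from tab-separated lines.

    `lines` may be any iterable of strings, for example an open file.
    Lines without a tab are skipped (an assumption, not documented behavior).
    """
    mapping = {}
    for line in lines:
        line = line.rstrip("\r\n")
        if "\t" not in line:
            continue
        variant, standard = line.split("\t", 1)
        mapping[variant] = standard
    return mapping


sample = ["aduauce\tadvance\n", "aduauced\tadvanced\n", "aduaucyng\tadvancing\n"]
spelling_map = load_spelling_map(sample)
print(spelling_map["aduaucyng"])
```

With a real file, the same function could be called as `load_spelling_map(open(path, encoding="utf-8"))`, where `path` is whatever location the map is stored at.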
The file of spellings by word class is similar except that
it contains multiple sections. Each is headed by a word class name
followed by a colon. Then comes the list of variant-to-standard
spellings for that word class. For example, the adjectives section
starts:
| adjective: |
| agean | again |
| bad | bad |
| blew | blue |
| browne | brown |
| chaste | chaste |
| christen | christian |
| clere | clear |
| cliver | clever |
| cold | cold |
| cross | cross |
| cumfbler | cumfortabler |
while the verb section starts:
| verb: |
| d' | do |
| 'm | am |
| 'old | hold |
| 's | is |
| aint | aren't |
| ain't | aren't |
| allays | allays |
| an't | aren't |
| ar | are |
| ar' | are |
| arena | aren't |
| bad | bade |
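The sectioned layout can be read with a small state machine. The sketch below assumes that a section header is a line ending in a colon with no tab, and that entries are tab-separated as in the other maps. The function name `load_word_class_map` is hypothetical.

```python
def load_word_class_map(lines):
    """Return {word_class: {variant: standard}} from a sectioned map."""
    maps = {}
    current = None
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip():
            continue
        if "\t" not in line and line.rstrip().endswith(":"):
            # A header such as "adjective:" opens a new section.
            current = line.strip()[:-1]
            maps.setdefault(current, {})
        elif current is not None and "\t" in line:
            variant, standard = line.split("\t", 1)
            maps[current][variant] = standard
    return maps


sample = ["adjective:", "bad\tbad", "blew\tblue", "verb:", "bad\tbade", "'s\tis"]
by_class = load_word_class_map(sample)
print(by_class["adjective"]["bad"], by_class["verb"]["bad"])
```

Keeping one dictionary per word class lets the same variant carry a different standard spelling in each section.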
Some spellings map to themselves when they have different
standard spellings for different word classes. The spelling
"bad" is an example.
Standardization Steps
MorphAdorner attempts to standardize a spelling as follows.
Load the list of known standard spellings. This is a combination
of entries from the 1911 Webster's Dictionary and entries verified
against the Oxford English Dictionary from ongoing work with the
Monk project texts.
Load maps of known variant spellings to modern spellings as
described above.
Create a ternary trie of all the standard and
variant spellings. A ternary trie allows very efficient extraction of
strings within a specified edit distance of a given string.
In other words, it allows efficient extraction of the list of words whose
spellings are near to any given word's spelling.
Load a list of modernization rules. Currently MorphAdorner defines
about 70 such rules which can transform many variant spellings to their
modern spellings, or come very close. The rules also provide for
correcting defective spellings that contain "gap" markers reflecting
illegible letters in the original text. Some sample rules include:
- Transform the ending "me~" to "men"
- Transform the ending "ynge" to "ing"
- Transform "uu" to "w"
- Transform "v" followed by a non-vowel to "u"
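The four sample rules can be expressed as ordered regular-expression substitutions. This is only a sketch: the rule order, the treatment of "y" as a non-vowel, and the name `apply_rules` are assumptions, and the full set of about 70 rules is not reproduced here.

```python
import re

# Only the four sample rules listed above; order is an assumption.
SAMPLE_RULES = [
    (re.compile(r"me~$"), "men"),         # ending "me~" -> "men"
    (re.compile(r"ynge$"), "ing"),        # ending "ynge" -> "ing"
    (re.compile(r"uu"), "w"),             # "uu" -> "w"
    (re.compile(r"v(?=[^aeiou])"), "u"),  # "v" before a non-vowel -> "u"
]


def apply_rules(spelling):
    """Apply each sample rule in turn and return the transformed spelling."""
    result = spelling
    for pattern, replacement in SAMPLE_RULES:
        result = pattern.sub(replacement, result)
    return result


for word in ["wome~", "strykynge", "uuyth", "vnder"]:
    print(word, "->", apply_rules(word))
```

Because only four rules are included, `strykynge` becomes `stryking` here rather than `striking`; the remaining step would need rules that are not shown in this section.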
Now for each old spelling, perform the following steps.
Apply all the applicable transformation rules to produce an
improved spelling. If this spelling appears in the standard spellings
list, we're done. For example, applying the rules to strykynge
directly produces the modern standard spelling striking.
See if the transformed spelling appears in the variant spellings map.
If so, assign the mapped spelling value as the standard spelling.
We're done. For example, applying the rules
to vniuersitie produces universitie.
This is not the modern spelling, but it is close. The mapped spelling
list for Early Modern English provides an entry for universitie,
giving the modern spelling as university.
Compile a list of words whose spellings are "close to" the
transformed spelling by using the ternary trie to search quickly for all
words within a specified edit distance of the transformed word.
Compute a measure of string similarity between each found
spelling and the transformed spelling. String similarity measures
how similar two strings of characters are. A similarity of 0.0 indicates
two strings are completely different, while a similarity of 1.0 indicates
two strings are identical. MorphAdorner uses a weighted similarity
score based upon letter pair similarity, phonetic distance, and edit
distance.
Choose the found spelling with the highest similarity as the most
probable correct/standard spelling. If this spelling appears in the
standard spellings list, we're done. If not, see if it appears in the
mapped spellings list. If so, take the mapped spelling value as the
standard spelling, and we're done. Otherwise, accept the transformed
spelling as the standard spelling, with the proviso that it may not be
a proper standard spelling, and requires further review.
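The per-spelling steps above can be sketched as a lookup cascade. Several stand-ins are used and should be read as assumptions: a brute-force Levenshtein scan replaces the ternary trie, `difflib.SequenceMatcher` replaces the weighted similarity score (whose weights are not given here), and `demo_rules` simply hard-codes the two rule outcomes quoted above. The names `standardize`, `demo_rules`, and `needs_review` are hypothetical.

```python
from difflib import SequenceMatcher


def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(cur[j - 1] + 1, prev[j] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def standardize(old, rules, standard, variants, max_dist=2):
    """Return (spelling, needs_review) for one old spelling."""
    transformed = rules(old)
    if transformed in standard:          # rules alone gave a standard form
        return transformed, False
    if transformed in variants:          # known variant, use its mapping
        return variants[transformed], False
    # Stand-in for the trie search: scan every known spelling.
    known = sorted(set(standard) | set(variants))
    near = [w for w in known if edit_distance(w, transformed) <= max_dist]
    if near:
        best = max(near, key=lambda w: SequenceMatcher(None, w, transformed).ratio())
        if best in standard:
            return best, False
        return variants[best], False
    return transformed, True             # nothing close: flag for review


def demo_rules(s):
    # Hard-coded outcomes from the text, not a real rule engine.
    return {"strykynge": "striking", "vniuersitie": "universitie"}.get(s, s)


standard = {"striking", "university", "advance"}
variants = {"universitie": "university"}
for word in ["strykynge", "vniuersitie", "advaunce", "qwzxv"]:
    print(word, standardize(word, demo_rules, standard, variants))
```

The last case shows the fallback: when no known spelling is close enough, the transformed spelling is returned with the review flag set.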
Interactions with Part Of Speech
The standard spelling for some words cannot be determined until the
part of speech for the word is known. Examples of such words include
doe, bee, poor, marie, and wast. Thus "doe" is most likely "doe" (a
female deer) when it appears as a noun, while "doe" is most likely "do"
when it appears as a verb. When "marie" appears as an adjective it is
probably "merry", but most likely "marry" when used as a verb.
MorphAdorner keeps a short list of variant spellings by general word
class. The final standardized spelling is not assigned until a part
of speech has been assigned, so these special cases can usually be
disambiguated properly.
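A deferred lookup of this kind might look like the following sketch. The table holds only the "doe" and "marie" examples given above; the word class label "noun" and the names `BY_WORD_CLASS` and `final_spelling` are assumptions.

```python
# Hypothetical per-word-class table built from the examples in the text.
BY_WORD_CLASS = {
    "noun": {"doe": "doe"},
    "verb": {"doe": "do", "marie": "marry"},
    "adjective": {"marie": "merry"},
}


def final_spelling(spelling, word_class, fallback):
    """Prefer the word-class entry; otherwise keep the general result."""
    return BY_WORD_CLASS.get(word_class, {}).get(spelling, fallback)


print(final_spelling("doe", "verb", "doe"))
print(final_spelling("marie", "adjective", "marie"))
```

The `fallback` argument stands for whatever spelling the general standardization steps produced before the part of speech was known.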
Standardizing Proper Names
Proper names can appear with a bewildering variety of spellings even
within a single work. Some variants can be transformed to their modern
standard forms by using the general standardization rules presented above.
For example, the spellings Syracvse and Vlysses, which are the commonest
variants of those proper name spellings in the TCP/EEBO version
of Plutarch's Lives, both transform by rule to their modern spellings
Syracuse and Ulysses.
Other variants are not so easily rectified. The place name Cappadocia
appears in Plutarch's Lives as
| CPADOCIA | 1 |
| Cappadocia | 21 |
| OHPPADOCIA | 1 |
| Coppadocia | 1 |
| CAPRADOCIA | 1 |
where the frequency of occurrence follows each variant.
MorphAdorner currently uses the following algorithm to look for standard
spelling candidates for proper names. This is a variant of the extended
search algorithm for standard spellings described above. Because we
know we are looking for proper names, we can do a better job by limiting
the search space to known proper names.
Proper name search algorithm
Collect the list of known spellings of proper names (tagged with NUPOS
parts of speech np1 and np2) in the early modern English lexicon.
Currently there are around 66,000 such spellings.
Construct a "name" ternary trie of the lowercase versions of
all these names. A ternary trie allows very efficient extraction of
strings within a specified edit distance of a given string.
Construct a "consonant" ternary trie of the lowercase
versions of the names with all vowels removed. For each unique
combination of consonants (in order), store the list of spellings
which reduce to that consonant string.
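The two structures can be sketched as follows, assuming that "ternary trie" refers to a ternary search tree and that vowels are the letters a, e, i, o, u. The class and function names are hypothetical. The search walks the tree while carrying one row of the Levenshtein table per prefix, so whole subtrees are skipped once every cell in the row exceeds the limit.

```python
from collections import defaultdict


class Node:
    def __init__(self, ch):
        self.ch = ch
        self.lo = self.eq = self.hi = None
        self.is_word = False


class TernaryTrie:
    def __init__(self, words=()):
        self.root = None
        for word in words:
            if word:
                self.root = self._add(self.root, word, 0)

    def _add(self, node, word, i):
        ch = word[i]
        if node is None:
            node = Node(ch)
        if ch < node.ch:
            node.lo = self._add(node.lo, word, i)
        elif ch > node.ch:
            node.hi = self._add(node.hi, word, i)
        elif i + 1 < len(word):
            node.eq = self._add(node.eq, word, i + 1)
        else:
            node.is_word = True
        return node

    def near(self, target, max_dist):
        """Return stored strings within max_dist edits of target."""
        found = []
        self._near(self.root, "", list(range(len(target) + 1)), target, max_dist, found)
        return sorted(found)

    def _near(self, node, prefix, prev_row, target, max_dist, found):
        if node is None:
            return
        # lo/hi siblings sit at the same position, so they share prev_row.
        self._near(node.lo, prefix, prev_row, target, max_dist, found)
        self._near(node.hi, prefix, prev_row, target, max_dist, found)
        row = [prev_row[0] + 1]
        for j in range(1, len(target) + 1):
            cost = 0 if target[j - 1] == node.ch else 1
            row.append(min(row[j - 1] + 1, prev_row[j] + 1, prev_row[j - 1] + cost))
        word = prefix + node.ch
        if node.is_word and row[-1] <= max_dist:
            found.append(word)
        if min(row) <= max_dist:  # prune branches that cannot recover
            self._near(node.eq, word, row, target, max_dist, found)


def consonant_key(name):
    return "".join(c for c in name if c not in "aeiou")


names = ["cappadocia", "syracuse", "ulysses", "capua"]
name_trie = TernaryTrie(names)
by_consonants = defaultdict(list)
for name in names:
    by_consonants[consonant_key(name)].append(name)
consonant_trie = TernaryTrie(by_consonants)
print(name_trie.near("cpadocia", 2), by_consonants["cppdc"])
```

The dictionary `by_consonants` plays the role of the stored list of spellings for each consonant string.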
For each unknown name, perform the following steps.
1. Find all strings in the "name" trie within a specified edit
   distance of the unknown name. An edit distance of 2 seems to be a
   good choice.
2. If any names were found in step 1, compute a measure of string
   similarity between each found name and the unknown name. Choose the
   found name with the highest similarity as the most probable
   correct/standard spelling. Letter-pair similarity seems
   to work well as a measure of string similarity, but there are many
   other possible choices.
3. If no names were found in step 1, find all strings in the
   "consonant" trie within a specified edit distance of the unknown name
   with vowels removed. An edit distance of 3 seems to be a good choice.
4. If any consonant strings were found in step 3, perform the
   following steps for each consonant string.
   a. Pick up all the names which reduce to this consonant string.
   b. For each of those names, compute a measure of string
      similarity between the name and the unknown name (that is, between
      the full spellings).
   c. Keep a list of those found names with a similarity score above a
      reasonable threshold. 0.75 seems to be a good choice.
   Choose the found name with the highest similarity as the most probable
   correct/standard spelling.
5. If no names were found by either lookup procedure, leave the unknown
   name alone.
Here is an example of the algorithm applied to the list of names
above. In each case, only one candidate spelling (the correct one,
it turns out) was found.
Names near CPADOCIA
cappadocia (0.75)
Names near Cappadocia
cappadocia (1.0)
Names near OHPPADOCIA
cappadocia (0.7777777777777778)
Names near Coppadocia
cappadocia (0.7777777777777778)
Names near CAPRADOCIA
cappadocia (0.7777777777777778)
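The search can be sketched end to end as below. Assumptions are labeled: letter-pair similarity is taken to be the Dice coefficient over adjacent letter pairs, which happens to reproduce the scores listed above. A brute-force scan stands in for both tries, vowels are a, e, i, o, u, and the name list and the function `find_name` are hypothetical.

```python
from collections import Counter


def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(cur[j - 1] + 1, prev[j] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def pair_similarity(a, b):
    """Dice coefficient over adjacent letter pairs (assumed definition)."""
    pa = Counter(a[i:i + 2] for i in range(len(a) - 1))
    pb = Counter(b[i:i + 2] for i in range(len(b) - 1))
    total = sum(pa.values()) + sum(pb.values())
    if total == 0:
        return 1.0 if a == b else 0.0
    return 2.0 * sum((pa & pb).values()) / total


def skeleton(name):
    return "".join(c for c in name if c not in "aeiou")


def find_name(unknown, known_names, name_dist=2, cons_dist=3, threshold=0.75):
    """Return (best_name, similarity), or None to leave the name alone."""
    target = unknown.lower()
    names = sorted({n.lower() for n in known_names})
    # Step 1: full spellings within name_dist edits.
    near = [n for n in names if edit_distance(n, target) <= name_dist]
    if not near:
        # Steps 3 and 4: consonant strings within cons_dist edits,
        # keeping only names whose full spelling scores above threshold.
        key = skeleton(target)
        near = [n for n in names
                if edit_distance(skeleton(n), key) <= cons_dist
                and pair_similarity(n, target) > threshold]
    if not near:
        return None  # step 5
    best = max(near, key=lambda n: pair_similarity(n, target))
    return best, pair_similarity(best, target)


known = ["Cappadocia", "Syracuse", "Ulysses", "Constantinople"]
for variant in ["CPADOCIA", "Cappadocia", "OHPPADOCIA", "Coppadocia", "CAPRADOCIA"]:
    print(variant, find_name(variant, known))
```

All five variants are resolved in step 1, since each lies within two edits of "cappadocia". The consonant fallback only comes into play for a spelling such as the invented "Constantinopolis", which is three edits away from "constantinople" but shares most of its letter pairs.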