MergeEnhancedBrillLexicon
merges the contents of an enhanced Brill format lexicon with a MorphAdorner
format lexicon into a combined MorphAdorner lexicon.
Usage:
mergeenhancedbrilllexicon lexicon.lex enhancedbrilllexicon.txt mergedlexicon.lex
where
- lexicon.lex -- input MorphAdorner format word lexicon.
- enhancedbrilllexicon.txt -- input enhanced Brill format word
lexicon to be merged with MorphAdorner word lexicon.
- mergedlexicon.lex -- output merged lexicon in MorphAdorner format.
An enhanced Brill lexicon is a simple UTF-8 encoded text file containing
words and their possible part of speech tags along with the lemma for
each part of speech. Each word appears on a separate line.
The first token on each line is the word. The remaining tokens are a
set of pairs, each consisting of a potential part of speech for the word,
followed by a blank, followed by the lemma for that word and part of speech.
The most commonly occurring part of speech should be the first one listed.
word pos1 lemma1 pos2 lemma2 pos3 lemma3 ...
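As an illustration of the line format above, here is a minimal Python sketch that splits one enhanced Brill line into the word and its (part of speech, lemma) pairs. The function name parse_enhanced_brill_line is hypothetical and not part of MorphAdorner; the sketch assumes tokens are separated by whitespace and that no token contains an embedded blank.

```python
def parse_enhanced_brill_line(line):
    """Split one enhanced Brill lexicon line into (word, [(pos, lemma), ...]).

    Hypothetical helper for illustration only; not MorphAdorner code.
    Assumes whitespace-separated tokens.
    """
    tokens = line.split()
    # A valid line has the word plus at least one complete pos/lemma pair,
    # so the token count must be odd and at least three.
    if len(tokens) < 3 or len(tokens) % 2 == 0:
        raise ValueError("expected: word pos1 lemma1 [pos2 lemma2 ...]")
    word, rest = tokens[0], tokens[1:]
    # Pair up alternating tokens; the first pair is the most common
    # part of speech for the word.
    pairs = list(zip(rest[0::2], rest[1::2]))
    return word, pairs


if __name__ == "__main__":
    print(parse_enhanced_brill_line("word pos1 lemma1 pos2 lemma2"))
```

A line with a dangling part of speech and no lemma raises ValueError rather than being silently truncated.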
This type of lexicon is an enhancement over the simple lexicon format
popularized by Eric Brill's part of speech tagger in the early 1990s.
The original Brill lexicon did not provide for specifying the lemmata.
The enhanced Brill entries are merged with the input MorphAdorner lexicon
to produce an updated output MorphAdorner format lexicon. The first
part of speech for each word is added with a count of two, while the
remaining parts of speech are added with a count of one. When a word to be added
already exists in the MorphAdorner lexicon, only the new parts of speech
are added to the existing lexicon entry.
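The merge rule described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the lexicon is modeled as a nested dictionary (word -> part of speech -> lemma and count), which is not the actual MorphAdorner lexicon file format, and the name merge_entry and the tags in the usage lines are hypothetical. The sketch also assumes that a part of speech already present in an entry keeps its existing count and lemma.

```python
def merge_entry(lexicon, word, pairs):
    """Merge one enhanced Brill entry into an in-memory lexicon.

    Hypothetical sketch; `lexicon` maps word -> {pos: {"lemma", "count"}},
    which is an assumed structure, not MorphAdorner's on-disk format.
    `pairs` is a list of (pos, lemma) tuples, most common first.
    """
    entry = lexicon.setdefault(word, {})
    for index, (pos, lemma) in enumerate(pairs):
        if pos in entry:
            # Existing parts of speech are left untouched.
            continue
        # First listed part of speech gets a count of two, the rest one.
        entry[pos] = {"lemma": lemma, "count": 2 if index == 0 else 1}
    return lexicon


if __name__ == "__main__":
    lexicon = {"word": {"pos2": {"lemma": "lemma2", "count": 7}}}
    merge_entry(lexicon, "word", [("pos1", "lemma1"), ("pos2", "lemma2")])
    print(lexicon)
```

In the usage lines, pos2 already exists for the word and keeps its count of 7, while pos1 is new and, being listed first, is added with a count of two.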
Enhanced Brill lexicons are convenient for adding large lists of words
such as proper and place names, foreign language words, and
so on. Here is a small section of a sample enhanced Brill lexicon.
Chippewas np2 Chippewa
mor'n d|cs more|than
quicker'n jc|cs quick|than
y'r po22 you
you'se pn22|vbb you|be
youv'e pn22|vhb you|have
MorphAdorner also allows you to merge a simple Brill lexicon into a
MorphAdorner lexicon. A simple Brill lexicon only provides the list of
parts of speech for each word, not the lemmata.