|
The article
Finding text boundaries in Java
by Rich Gillam describes the Java BreakIterator, which underlies
the ICU4JBreakIterator class used by MorphAdorner to obtain an initial
deconstruction of text into sentences.
MorphAdorner only uses ICU4JBreakIterator to provide initial sentence
boundaries.
MorphAdorner's word tokenizer uses its own methods for determining token
boundaries within a sentence.
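
As a rough illustration of that initial pass, the sketch below uses the JDK's own java.text.BreakIterator to propose sentence boundaries. It is not MorphAdorner's code: the class name InitialSentenceSplit and its split method are made up for this example, and MorphAdorner's ICU4JBreakIterator wrapper is not shown here.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: proposes initial sentence boundaries with the JDK
// BreakIterator. MorphAdorner's own wrapper class is NOT reproduced here.
class InitialSentenceSplit {

    /** Returns the trimmed candidate sentences found by the sentence iterator. */
    static List<String> split(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
        it.setText(text);
        List<String> sentences = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String candidate = text.substring(start, end).trim();
            if (!candidate.isEmpty()) {
                sentences.add(candidate);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        for (String s : split("This is one. This is two.")) {
            System.out.println(s);
        }
    }
}
```

The boundaries produced this way are only a starting point; the rules described in the following sections would then merge or keep the candidate sentences.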
Abbreviations
The period ending an abbreviation may act as both a part of the abbreviation
and the end of a sentence. MorphAdorner maintains a list of common
abbreviations along with a flag indicating whether the abbreviation usually
can end a sentence. MorphAdorner will not split a sentence after an
abbreviation which is not designated as a potential sentence ender.
For example, the abbreviation Mrs. rarely ends a sentence, so
MorphAdorner does not issue sentence splits following Mrs.
Thus
Mrs. Smith was here earlier.
is correctly considered a single sentence, while
I will leave it up to the Mrs. She will know what to do.
which should be two sentences (with a split after Mrs.)
is also treated as a single sentence by MorphAdorner.
This could be handled by recognizing that Mrs. can end a
sentence when followed by something other than a proper name.
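
The abbreviation list with its sentence-ender flag could be modeled along these lines. This is a minimal sketch under assumptions: the class name AbbreviationTable, the method names, and the three entries are illustrative only, and the real list and its storage format are not shown in this document.

```java
import java.util.Map;

// Hypothetical table: abbreviation -> whether it usually can end a sentence.
// The entries and flags are illustrative, taken from the examples in the text.
class AbbreviationTable {

    static final Map<String, Boolean> CAN_END_SENTENCE = Map.of(
        "Mrs.", false,
        "a.m.", true,
        "p.m.", true);

    static boolean isAbbreviation(String token) {
        return CAN_END_SENTENCE.containsKey(token);
    }

    /** True if the token is a known abbreviation that is not a potential sentence ender. */
    static boolean blocksSplit(String token) {
        Boolean canEnd = CAN_END_SENTENCE.get(token);
        return canEnd != null && !canEnd;
    }

    public static void main(String[] args) {
        System.out.println(blocksSplit("Mrs."));  // flagged as not ending a sentence
        System.out.println(blocksSplit("p.m."));  // may end a sentence, needs further checks
    }
}
```

With a table like this, a split after Mrs. is always suppressed, which reproduces both the correct and the incorrect result in the two examples above.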
When an abbreviation can end a sentence, MorphAdorner tries to determine
if a particular use ends a sentence or not by looking for possible verbs
before and after the abbreviation. MorphAdorner does not split the
sentence after the abbreviation unless it has found a possible verb in the
sentence preceding the abbreviation. MorphAdorner does not use detailed
part of speech information during sentence splitting. However, the parts
of speech for any word can be looked up in the word lexicon or determined
using a part of speech guesser. That is sufficient to guide the sentence
splitting algorithm in many but not all cases.
MorphAdorner splits the text
I mailed the letter early in the a.m. The next step is to wait for a reply.
correctly into two sentences following a.m., while
I mailed the letter early in the a.m. the next day too.
is left unsplit.
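
The verb check can be sketched roughly as follows. This is a simplification, not MorphAdorner's algorithm: the small POSSIBLE_VERBS set stands in for the word lexicon or part of speech guesser, the class and method names are invented, and the sketch assumes a split requires a possible verb on both sides of the abbreviation.

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch of a verb-based split decision after an abbreviation.
class VerbCheck {

    // Stand-in for a lexicon lookup or part of speech guesser (assumption).
    static final Set<String> POSSIBLE_VERBS = Set.of("mailed", "is", "wait");

    /** True if any word in the text is listed as a possible verb. */
    static boolean hasPossibleVerb(String text) {
        return Arrays.stream(text.toLowerCase(Locale.ROOT).split("[^a-z]+"))
                     .anyMatch(POSSIBLE_VERBS::contains);
    }

    /** Allows a split only when both sides contain a possible verb. */
    static boolean shouldSplit(String before, String after) {
        return hasPossibleVerb(before) && hasPossibleVerb(after);
    }

    public static void main(String[] args) {
        String before = "I mailed the letter early in the a.m.";
        System.out.println(shouldSplit(before, "The next step is to wait for a reply."));
        System.out.println(shouldSplit(before, "the next day too."));
    }
}
```

A check this coarse only asks whether a word can be a verb, not whether it acts as one in context, so it can be expected to misfire on some inputs.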
MorphAdorner correctly leaves unsplit the following sentences.
She needs her car by 5 p.m. Saturday evening.
At 5 p.m. I had to go to the bank.
She has an appointment at 5 p.m. Saturday afternoon.
By 5 p.m. Sunday I have to be at home.
MorphAdorner correctly splits the following text into two sentences
following p.m.:
It was due Friday at 5 p.m. Saturday afternoon would be too late.
The text
She has an appointment at 5 p.m. Saturday afternoon to get her car fixed.
should be left as a single sentence, but MorphAdorner splits it into
two sentences with the split occurring after p.m. While both get and
fixed can be verbs, neither appears in context as the right kind of
verb form to allow the text following p.m. to be considered a sentence.
MorphAdorner does not recognize abbreviations containing blanks, such as
"U. S." for United States. However, "U.S." without the blank is recognized.
Characters not allowed to start a sentence
MorphAdorner does not allow a sentence to start with a comma, a period,
or a percent sign. These characters will be attached to the previous token
and/or sentence, if any. Dashes and hyphens are joined preferentially
to the end of a sentence rather than the start of a sentence.
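
One way to picture the rule for commas, periods, and percent signs is the sketch below, which moves any such leading characters onto the end of the previous candidate sentence. The class SentenceStartFixer and its methods are hypothetical, and the preference for dashes and hyphens is not modeled.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical post-processing step over candidate sentences.
class SentenceStartFixer {

    static boolean mayNotStartSentence(char c) {
        return c == ',' || c == '.' || c == '%';
    }

    /** Moves leading commas, periods, and percent signs to the previous sentence, if any. */
    static List<String> fix(List<String> sentences) {
        List<String> out = new ArrayList<>();
        for (String sentence : sentences) {
            String s = sentence;
            int i = 0;
            while (i < s.length() && mayNotStartSentence(s.charAt(i))) {
                i++;
            }
            if (i > 0 && !out.isEmpty()) {
                int last = out.size() - 1;
                out.set(last, out.get(last) + s.substring(0, i));
                s = s.substring(i).trim();
            }
            if (!s.isEmpty()) {
                out.add(s);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(fix(List.of("It rose by 5", "% That was unexpected.")));
    }
}
```

When there is no previous sentence, the sketch leaves the text unchanged, matching the "if any" qualification above.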
Interjections
MorphAdorner maintains a list of common interjections. These are words
typically used for emphasis, and generally followed by an exclamation
mark or question mark. MorphAdorner does not split the sentence following
the interjection, and it leaves the question mark or exclamation point
attached to the interjection word. The situation can become ambiguous when
quote marks are involved.
MorphAdorner treats the following lines as single sentences.
What! That's bad!
"What! That's bad!"
On the other hand, the following line is treated as two sentences.
"What!" is the first sentence and "That's bad!" is the second
sentence.
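
The interjection behavior might be approximated as in the sketch below, which joins a candidate sentence consisting only of an interjection plus an exclamation or question mark to the sentence that follows it. The interjection list, class name, and method names are assumptions made for illustration. A closing quote after the punctuation stops the join, which mirrors the quoted case described above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch: keep an interjection together with the following sentence.
class InterjectionJoiner {

    // Illustrative entries only; the real list is not shown in this document.
    static final Set<String> INTERJECTIONS = Set.of("what", "oh", "alas");

    /** True for an interjection plus ! or ?, optionally preceded by an opening quote. */
    static boolean isBareInterjection(String piece) {
        String s = piece.startsWith("\"") ? piece.substring(1) : piece;
        if (s.length() < 2) {
            return false;
        }
        char last = s.charAt(s.length() - 1);
        if (last != '!' && last != '?') {
            return false;
        }
        String word = s.substring(0, s.length() - 1).toLowerCase(Locale.ROOT);
        return INTERJECTIONS.contains(word);
    }

    /** Joins each bare interjection to the candidate sentence that follows it. */
    static List<String> join(List<String> pieces) {
        List<String> out = new ArrayList<>();
        String pending = null;
        for (String piece : pieces) {
            String current = (pending == null) ? piece : pending + " " + piece;
            if (isBareInterjection(piece)) {
                pending = current;
            } else {
                out.add(current);
                pending = null;
            }
        }
        if (pending != null) {
            out.add(pending);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(join(List.of("What!", "That's bad!")));
        System.out.println(join(List.of("\"What!\"", "\"That's bad!\"")));
    }
}
```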
Numbers
A period following a number may act as both a decimal point and the end
of a sentence (in English). In general, MorphAdorner ends a sentence
following a number ending in a period when the next word begins with a
capital letter. The following text is considered one sentence by
MorphAdorner.
MorphAdorner splits each of the following two lines into two sentences
following 12.
There are 12. More would be unnecessary.
There are 12. "More would be unnecessary."
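
The general rule for numbers can be sketched as a check on the text that follows the number. This is a minimal sketch with invented names (NumberPeriodRule, endsSentence). Skipping leading quote marks and other punctuation before testing for a capital letter is an assumption made so that the quoted example above is also split.

```java
// Hypothetical sketch of the "number ending in a period" rule.
class NumberPeriodRule {

    /**
     * True if numberToken is digits followed by a period and the first
     * letter or digit of the following text is an uppercase letter.
     */
    static boolean endsSentence(String numberToken, String following) {
        if (!numberToken.matches("\\d+\\.")) {
            return false;
        }
        int i = 0;
        // Skip whitespace, quote marks, and other punctuation (assumption).
        while (i < following.length() && !Character.isLetterOrDigit(following.charAt(i))) {
            i++;
        }
        return i < following.length() && Character.isUpperCase(following.charAt(i));
    }

    public static void main(String[] args) {
        System.out.println(endsSentence("12.", "More would be unnecessary."));
        System.out.println(endsSentence("12.", "\"More would be unnecessary.\""));
        System.out.println(endsSentence("12.", "more would be unnecessary."));
    }
}
```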
|