Package org.apache.lucene.analysis.util
Class SegmentingTokenizerBase
java.lang.Object
  org.apache.lucene.util.AttributeSource
    org.apache.lucene.analysis.TokenStream
      org.apache.lucene.analysis.Tokenizer
        org.apache.lucene.analysis.util.SegmentingTokenizerBase
- All Implemented Interfaces: Closeable, AutoCloseable
- Direct Known Subclasses: ThaiTokenizer
Breaks text into sentences with a BreakIterator and allows subclasses to decompose these sentences into words.
This can be used by subclasses that need sentence context for tokenization, such as CJK segmenters.
It can also be used by subclasses that want to mark sentence boundaries (with a custom attribute, an extra token, a position increment, etc.) for downstream processing.
- WARNING: This API is experimental and might change in incompatible ways in the next release.
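As an illustration of marking sentence boundaries with a position increment, here is a minimal standalone sketch. It uses only java.text.BreakIterator and does not extend the Lucene class; the names SentenceGapSketch, Token and SENTENCE_GAP are hypothetical, and keeping only segments that start with a letter or digit is a simplifying assumption.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class SentenceGapSketch {
    /** Hypothetical token holder: a term plus its position increment. */
    static final class Token {
        final String term;
        final int positionIncrement;

        Token(String term, int positionIncrement) {
            this.term = term;
            this.positionIncrement = positionIncrement;
        }

        @Override
        public String toString() {
            return term + "(+" + positionIncrement + ")";
        }
    }

    /** Hypothetical extra gap given to the first word of every sentence after the first. */
    static final int SENTENCE_GAP = 10;

    static List<Token> tokenize(String text) {
        BreakIterator sentences = BreakIterator.getSentenceInstance(Locale.ROOT);
        BreakIterator words = BreakIterator.getWordInstance(Locale.ROOT);
        sentences.setText(text);
        List<Token> out = new ArrayList<>();
        boolean firstSentence = true;
        int sStart = sentences.first();
        for (int sEnd = sentences.next(); sEnd != BreakIterator.DONE;
                sStart = sEnd, sEnd = sentences.next()) {
            // Outer level: one sentence at a time.
            String sentence = text.substring(sStart, sEnd);
            words.setText(sentence);
            boolean firstWord = true;
            int wStart = words.first();
            for (int wEnd = words.next(); wEnd != BreakIterator.DONE;
                    wStart = wEnd, wEnd = words.next()) {
                // Inner level: skip whitespace and punctuation segments.
                if (!Character.isLetterOrDigit(sentence.codePointAt(wStart))) {
                    continue;
                }
                int inc = (firstWord && !firstSentence) ? 1 + SENTENCE_GAP : 1;
                out.add(new Token(sentence.substring(wStart, wEnd), inc));
                firstWord = false;
            }
            firstSentence = false;
        }
        return out;
    }
}
```

Calling `SentenceGapSketch.tokenize(text)` returns the words in order; the first word of each later sentence carries the larger increment, so a downstream consumer can detect the boundary without an extra token.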
Nested Class Summary
Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource
- AttributeSource.State
Field Summary
Fields (Modifier and Type, Field, Description)
- protected final char[] buffer
- protected static final int BUFFERMAX
- protected int offset
  accumulated offset of previous buffers for this reader, for offsetAtt
Fields inherited from class org.apache.lucene.analysis.TokenStream
- DEFAULT_TOKEN_ATTRIBUTE_FACTORY
Constructor Summary
Constructors (Constructor, Description)
- SegmentingTokenizerBase(BreakIterator iterator)
  Construct a new SegmentingTokenizerBase, using the provided BreakIterator for sentence segmentation.
- SegmentingTokenizerBase(AttributeFactory factory, BreakIterator iterator)
  Construct a new SegmentingTokenizerBase, also supplying the AttributeFactory.
Method Summary
Methods (Modifier and Type, Method, Description)
- final void end()
- final boolean incrementToken()
- protected abstract boolean incrementWord()
  Returns true if another word is available.
- protected boolean isSafeEnd(char ch)
  For sentence tokenization, these are the unambiguous break positions.
- void reset()
- protected abstract void setNextSentence(int sentenceStart, int sentenceEnd)
  Provides the next input sentence for analysis.
Methods inherited from class org.apache.lucene.analysis.Tokenizer
- close, correctOffset, setReader, setReaderTestPoint
Methods inherited from class org.apache.lucene.util.AttributeSource
- addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, endAttributes, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, removeAllAttributes, restoreState, toString
Field Details
- BUFFERMAX
  protected static final int BUFFERMAX
- buffer
  protected final char[] buffer
- offset
  protected int offset
  accumulated offset of previous buffers for this reader, for offsetAtt
Constructor Details
- SegmentingTokenizerBase
  SegmentingTokenizerBase(BreakIterator iterator)
  Construct a new SegmentingTokenizerBase, using the provided BreakIterator for sentence segmentation.
  Never share a BreakIterator across different TokenStreams; always pass a newly created or cloned instance to this constructor.
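One way to follow the rule about not sharing BreakIterators is to keep a prototype that is used only for cloning and to hand each stream its own copy. The sketch below is a standalone illustration with java.text.BreakIterator; the class name BreakIteratorPerStream and the helper newSentenceIterator() are hypothetical, not part of Lucene.

```java
import java.text.BreakIterator;
import java.util.Locale;

final class BreakIteratorPerStream {
    // Prototype kept only as a cloning source; it is never used to iterate text.
    private static final BreakIterator PROTOTYPE =
            BreakIterator.getSentenceInstance(Locale.ROOT);

    private BreakIteratorPerStream() {}

    /** Returns an independent copy, suitable for exactly one token stream. */
    static BreakIterator newSentenceIterator() {
        return (BreakIterator) PROTOTYPE.clone();
    }
}
```

Each clone holds its own text and iteration state, so calling setText on one copy does not disturb another. A subclass would then pass `BreakIteratorPerStream.newSentenceIterator()` to the constructor instead of a shared instance.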
- SegmentingTokenizerBase
  SegmentingTokenizerBase(AttributeFactory factory, BreakIterator iterator)
  Construct a new SegmentingTokenizerBase, also supplying the AttributeFactory.
Method Details
- incrementToken
  Specified by: incrementToken in class TokenStream
  Throws: IOException
- reset
  Overrides: reset in class Tokenizer
  Throws: IOException
- end
  Overrides: end in class TokenStream
  Throws: IOException
- isSafeEnd
  protected boolean isSafeEnd(char ch)
  For sentence tokenization, these are the unambiguous break positions.
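The description does not list which characters count as unambiguous break positions. The standalone sketch below assumes they are the line and paragraph terminators (CR, LF, NEL, LINE SEPARATOR, PARAGRAPH SEPARATOR); that character set, the class name SafeEndSketch and the helper findSafeEnd are assumptions made for illustration. The helper shows why such positions are useful when input is read in fixed-size buffers: cutting the buffer at a safe end avoids splitting a sentence across two buffers.

```java
final class SafeEndSketch {
    private SafeEndSketch() {}

    /** Assumed set of unambiguous break characters: line and paragraph terminators. */
    static boolean isSafeEnd(char ch) {
        switch (ch) {
            case 0x000D: // carriage return
            case 0x000A: // line feed
            case 0x0085: // next line (NEL)
            case 0x2028: // line separator
            case 0x2029: // paragraph separator
                return true;
            default:
                return false;
        }
    }

    /**
     * Hypothetical helper: index of the last safe end within buffer[0, length),
     * or -1 if the buffer contains none.
     */
    static int findSafeEnd(char[] buffer, int length) {
        for (int i = length - 1; i >= 0; i--) {
            if (isSafeEnd(buffer[i])) {
                return i;
            }
        }
        return -1;
    }
}
```

A subclass for a language with other unambiguous terminators could override isSafeEnd to accept additional characters.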
- setNextSentence
  protected abstract void setNextSentence(int sentenceStart, int sentenceEnd)
  Provides the next input sentence for analysis.
- incrementWord
  protected abstract boolean incrementWord()
  Returns true if another word is available.
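The two abstract methods suggest a calling pattern in which the base class hands over one sentence range and then asks for words until none remain. The sketch below mimics that pattern with a stand-in base class so it runs without Lucene; SegmenterContractSketch, its run() driver and the LetterRunWords subclass are hypothetical, the whole input sits in a single buffer, and offset therefore stays 0. It is not the library's implementation.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Stand-in for the base class; it mimics only the calling pattern, not Lucene's buffering. */
abstract class SegmenterContractSketch {
    protected final char[] buffer; // the whole input, held in one buffer for simplicity
    protected int offset = 0;      // offset of earlier buffers; always 0 with a single buffer
    private final BreakIterator sentences;

    SegmenterContractSketch(String text, BreakIterator sentences) {
        this.buffer = text.toCharArray();
        this.sentences = sentences;
        this.sentences.setText(text);
    }

    protected abstract void setNextSentence(int sentenceStart, int sentenceEnd);

    protected abstract boolean incrementWord();

    /** Hypothetical driver: hand over each sentence, then drain its words. */
    final void run() {
        int start = sentences.first();
        for (int end = sentences.next(); end != BreakIterator.DONE;
                start = end, end = sentences.next()) {
            setNextSentence(start, end);
            while (incrementWord()) {
                // each successful call has produced exactly one word
            }
        }
    }
}

/** Hypothetical subclass that treats runs of letters and digits as words. */
final class LetterRunWords extends SegmenterContractSketch {
    final List<String> terms = new ArrayList<>();
    final List<Integer> startOffsets = new ArrayList<>();
    private int pos;
    private int end;

    LetterRunWords(String text) {
        super(text, BreakIterator.getSentenceInstance(Locale.ROOT));
    }

    @Override
    protected void setNextSentence(int sentenceStart, int sentenceEnd) {
        // Remember the sentence range; words are produced lazily by incrementWord().
        pos = sentenceStart;
        end = sentenceEnd;
    }

    @Override
    protected boolean incrementWord() {
        while (pos < end && !Character.isLetterOrDigit(buffer[pos])) {
            pos++; // skip separators
        }
        if (pos >= end) {
            return false; // sentence exhausted
        }
        int wordStart = pos;
        while (pos < end && Character.isLetterOrDigit(buffer[pos])) {
            pos++;
        }
        terms.add(new String(buffer, wordStart, pos - wordStart));
        startOffsets.add(offset + wordStart); // absolute offset = earlier buffers + position in buffer
        return true;
    }
}
```

After `new LetterRunWords(text).run()`, the terms and startOffsets lists hold each word and its absolute start offset. In a real subclass, incrementWord would set token attributes rather than append to lists.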
 