Class ICUTokenizer
java.lang.Object
org.apache.lucene.util.AttributeSource
org.apache.lucene.analysis.TokenStream
org.apache.lucene.analysis.Tokenizer
org.apache.lucene.analysis.icu.segmentation.ICUTokenizer
- All Implemented Interfaces: Closeable, AutoCloseable
Breaks text into words according to UAX #29: Unicode Text Segmentation (http://www.unicode.org/reports/tr29/).

Text is first split at script boundaries; each run is then segmented according to the BreakIterator and token typing provided by the ICUTokenizerConfig.
WARNING: This API is experimental and might change in incompatible ways in the next release.
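As a minimal sketch of the behaviour described above, the following collects the term text of each token produced by a default-constructed ICUTokenizer. The class name `IcuTokenizeSketch` and the helper `tokenize` are hypothetical, and the code assumes the Lucene ICU analysis module and ICU4J are on the classpath.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.icu.segmentation.ICUTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Hypothetical helper class, not part of Lucene.
final class IcuTokenizeSketch {

  /** Collects the term text of every token ICUTokenizer emits for the given text. */
  static List<String> tokenize(String text) throws IOException {
    List<String> terms = new ArrayList<>();
    // No-argument constructor: default script-specific handling, default attribute factory.
    try (ICUTokenizer tokenizer = new ICUTokenizer()) {
      CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
      // The constructors take no Reader; input is supplied through setReader.
      tokenizer.setReader(new StringReader(text));
      tokenizer.reset();
      while (tokenizer.incrementToken()) {
        terms.add(term.toString());
      }
      tokenizer.end();
    }
    return terms;
  }

  public static void main(String[] args) throws IOException {
    System.out.println(tokenize("Hello, world!"));
  }
}
```

The tokenizer only segments text: punctuation is not emitted as tokens, and case is left unchanged unless a later filter alters it.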
Nested Class Summary
Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource:
- AttributeSource.State
Field Summary
Fields inherited from class org.apache.lucene.analysis.TokenStream:
- DEFAULT_TOKEN_ATTRIBUTE_FACTORY
Constructor Summary
Constructors
- ICUTokenizer(): Construct a new ICUTokenizer that breaks text into words from the given Reader.
- ICUTokenizer(ICUTokenizerConfig config): Construct a new ICUTokenizer that breaks text into words from the given Reader, using a tailored BreakIterator configuration.
- ICUTokenizer(AttributeFactory factory, ICUTokenizerConfig config): Construct a new ICUTokenizer that breaks text into words from the given Reader, using a tailored BreakIterator configuration.
Method Summary
Methods inherited from class org.apache.lucene.analysis.Tokenizer:
- close, correctOffset, setReader, setReaderTestPoint
Methods inherited from class org.apache.lucene.util.AttributeSource:
- addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, endAttributes, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, removeAllAttributes, restoreState, toString
Constructor Details
ICUTokenizer
public ICUTokenizer()
Construct a new ICUTokenizer that breaks text into words from the given Reader.
The default script-specific handling is used. The default attribute factory is used.
 
ICUTokenizer
public ICUTokenizer(ICUTokenizerConfig config)
Construct a new ICUTokenizer that breaks text into words from the given Reader, using a tailored BreakIterator configuration.
The default attribute factory is used.
- Parameters:
- config - Tailored BreakIterator configuration
 
ICUTokenizer
public ICUTokenizer(AttributeFactory factory, ICUTokenizerConfig config)
Construct a new ICUTokenizer that breaks text into words from the given Reader, using a tailored BreakIterator configuration.
- Parameters:
- factory - AttributeFactory to use
- config - Tailored BreakIterator configuration
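The two configurable constructors might be used roughly as follows. This sketch assumes `DefaultICUTokenizerConfig` from the same package with its two-boolean constructor (`cjkAsWords`, `myanmarAsWords`), which this page does not document and which may differ between Lucene versions. The class name `IcuTokenizerConstructionSketch` and its helpers are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.icu.segmentation.DefaultICUTokenizerConfig;
import org.apache.lucene.analysis.icu.segmentation.ICUTokenizer;
import org.apache.lucene.analysis.icu.segmentation.ICUTokenizerConfig;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Hypothetical helper class, not part of Lucene.
final class IcuTokenizerConstructionSketch {

  /** Uses ICUTokenizer(ICUTokenizerConfig); the default attribute factory applies. */
  static ICUTokenizer withConfig() {
    // Assumption: DefaultICUTokenizerConfig(boolean cjkAsWords, boolean myanmarAsWords).
    ICUTokenizerConfig config = new DefaultICUTokenizerConfig(true, true);
    return new ICUTokenizer(config);
  }

  /** Uses ICUTokenizer(AttributeFactory, ICUTokenizerConfig) with an explicit factory. */
  static ICUTokenizer withFactoryAndConfig() {
    ICUTokenizerConfig config = new DefaultICUTokenizerConfig(true, true);
    // DEFAULT_TOKEN_ATTRIBUTE_FACTORY is the field inherited from TokenStream.
    return new ICUTokenizer(TokenStream.DEFAULT_TOKEN_ATTRIBUTE_FACTORY, config);
  }

  /** Runs the tokenizer over the text, collects term text, then closes the tokenizer. */
  static List<String> terms(ICUTokenizer tokenizer, String text) throws IOException {
    List<String> out = new ArrayList<>();
    try (tokenizer) {
      CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
      tokenizer.setReader(new StringReader(text));
      tokenizer.reset();
      while (tokenizer.incrementToken()) {
        out.add(term.toString());
      }
      tokenizer.end();
    }
    return out;
  }
}
```

A custom ICUTokenizerConfig would be passed in the same way; only the BreakIterator and token typing it supplies would change.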
 
 
Method Details
incrementToken
- Specified by: incrementToken in class TokenStream
- Throws: IOException
 
reset
- Overrides: reset in class Tokenizer
- Throws: IOException
 
end
- Overrides: end in class TokenStream
- Throws: IOException
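The three methods above are normally called in a fixed order: reset, then incrementToken until it returns false, then end. The sketch below follows that order and records each token's offsets plus the final offset reported after end. The class name `IcuTokenizerLifecycleSketch` and the helper `describe` are hypothetical, and the code assumes the Lucene ICU analysis module is on the classpath.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.icu.segmentation.ICUTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

// Hypothetical helper class, not part of Lucene.
final class IcuTokenizerLifecycleSketch {

  /** Returns "term[start,end]" for each token, followed by "end=<final offset>". */
  static List<String> describe(String text) throws IOException {
    List<String> out = new ArrayList<>();
    try (ICUTokenizer tokenizer = new ICUTokenizer()) {
      CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
      OffsetAttribute offset = tokenizer.addAttribute(OffsetAttribute.class);

      tokenizer.setReader(new StringReader(text));
      tokenizer.reset();                      // must precede the first incrementToken()
      while (tokenizer.incrementToken()) {    // false once the input is exhausted
        out.add(term.toString() + "[" + offset.startOffset() + "," + offset.endOffset() + "]");
      }
      tokenizer.end();                        // sets the final offset for the whole input
      out.add("end=" + offset.endOffset());
    }                                         // close() releases the Reader
    return out;
  }

  public static void main(String[] args) throws IOException {
    System.out.println(describe("foo bar  "));
  }
}
```

The final offset covers trailing characters that produced no token, which is why end is called even after incrementToken has returned false.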
 
 