Class ICUTokenizerConfig
- java.lang.Object
  - org.apache.lucene.analysis.icu.segmentation.ICUTokenizerConfig
- Direct Known Subclasses: DefaultICUTokenizerConfig
public abstract class ICUTokenizerConfig extends Object
Class that allows for tailored Unicode Text Segmentation on a per-writing system basis.
- WARNING: This API is experimental and might change in incompatible ways in the next release.
-
Field Summary
Fields
Modifier and Type | Field | Description
static int | EMOJI_SEQUENCE_STATUS | Rule status for emoji sequences
-
Constructor Summary
Constructors
Constructor | Description
ICUTokenizerConfig() | Sole constructor.
-
Method Summary
Modifier and Type | Method | Description
abstract boolean | combineCJ() | true if Han, Hiragana, and Katakana scripts should all be returned as Japanese
abstract com.ibm.icu.text.RuleBasedBreakIterator | getBreakIterator(int script) | Return a breakiterator capable of processing a given script.
abstract String | getType(int script, int ruleStatus) | Return a token type value for a given script and BreakIterator rule status.
-
Field Detail
-
EMOJI_SEQUENCE_STATUS
public static final int EMOJI_SEQUENCE_STATUS
Rule status for emoji sequences.
- See Also: Constant Field Values
-
Method Detail
-
getBreakIterator
public abstract com.ibm.icu.text.RuleBasedBreakIterator getBreakIterator(int script)
Return a break iterator capable of processing the given script.
-
getType
public abstract String getType(int script, int ruleStatus)
Return a token type value for a given script and BreakIterator rule status.
-
combineCJ
public abstract boolean combineCJ()
Returns true if Han, Hiragana, and Katakana scripts should all be returned as Japanese.
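The effect of this flag can be illustrated without Lucene or ICU4J on the classpath. The sketch below is not the Lucene implementation: it uses the JDK's Character.UnicodeScript, and the class CombineCJSketch, the helper scriptLabel, and the label string "JAPANESE" are hypothetical names chosen for illustration. It assumes the flag does nothing more than relabel the three scripts under a single Japanese label.

```java
import java.lang.Character.UnicodeScript;

public class CombineCJSketch {

    // Hypothetical helper (not part of Lucene): returns a script label for a
    // code point. When combineCJ is true, Han, Hiragana, and Katakana are all
    // reported under the single label "JAPANESE"; otherwise the JDK's own
    // script name is returned unchanged.
    public static String scriptLabel(int codePoint, boolean combineCJ) {
        UnicodeScript script = UnicodeScript.of(codePoint);
        boolean isCJ = script == UnicodeScript.HAN
                || script == UnicodeScript.HIRAGANA
                || script == UnicodeScript.KATAKANA;
        return (combineCJ && isCJ) ? "JAPANESE" : script.name();
    }

    public static void main(String[] args) {
        // One Han, one Hiragana, one Katakana, and one Latin code point.
        int[] samples = {0x6F22, 0x3042, 0x30A2, 0x0041};
        for (int cp : samples) {
            System.out.println(String.format("U+%04X", cp)
                    + " separate=" + scriptLabel(cp, false)
                    + " combined=" + scriptLabel(cp, true));
        }
    }
}
```

With the flag off, a run of mixed Han and kana text would change script label at each script boundary; with it on, the three scripts share one label, so a script-based segmenter would have no reason to split between them.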
-