Class DefaultICUTokenizerConfig
java.lang.Object
  org.apache.lucene.analysis.icu.segmentation.ICUTokenizerConfig
    org.apache.lucene.analysis.icu.segmentation.DefaultICUTokenizerConfig
-
public class DefaultICUTokenizerConfig extends ICUTokenizerConfig
Default ICUTokenizerConfig that is generally applicable to many languages.
Generally tokenizes Unicode text according to UAX#29 (BreakIterator.getWordInstance(ULocale.ROOT)), but with the following tailorings:
- Thai, Lao, Myanmar, Khmer, and CJK text is broken into words with a dictionary.
WARNING: This API is experimental and might change in incompatible ways in the next release.
-
-
Field Summary
Fields
Modifier and Type   Field           Description
static String       WORD_EMOJI      Token type for words that appear to be emoji sequences
static String       WORD_HANGUL     Token type for words containing Korean hangul
static String       WORD_HIRAGANA   Token type for words containing Japanese hiragana
static String       WORD_IDEO       Token type for words containing ideographic characters
static String       WORD_KATAKANA   Token type for words containing Japanese katakana
static String       WORD_LETTER     Token type for words that contain letters
static String       WORD_NUMBER     Token type for words that appear to be numbers
Fields inherited from class org.apache.lucene.analysis.icu.segmentation.ICUTokenizerConfig
EMOJI_SEQUENCE_STATUS
-
-
Constructor Summary
Constructors
Constructor                                                             Description
DefaultICUTokenizerConfig(boolean cjkAsWords, boolean myanmarAsWords)   Creates a new config.
-
Method Summary
Modifier and Type                         Method                                Description
boolean                                   combineCJ()                           true if Han, Hiragana, and Katakana scripts should all be returned as Japanese
com.ibm.icu.text.RuleBasedBreakIterator   getBreakIterator(int script)          Return a breakiterator capable of processing a given script.
String                                    getType(int script, int ruleStatus)   Return a token type value for a given script and BreakIterator rule status.
-
-
-
Field Detail
-
WORD_IDEO
public static final String WORD_IDEO
Token type for words containing ideographic characters
-
WORD_HIRAGANA
public static final String WORD_HIRAGANA
Token type for words containing Japanese hiragana
-
WORD_KATAKANA
public static final String WORD_KATAKANA
Token type for words containing Japanese katakana
-
WORD_HANGUL
public static final String WORD_HANGUL
Token type for words containing Korean hangul
-
WORD_LETTER
public static final String WORD_LETTER
Token type for words that contain letters
-
WORD_NUMBER
public static final String WORD_NUMBER
Token type for words that appear to be numbers
-
WORD_EMOJI
public static final String WORD_EMOJI
Token type for words that appear to be emoji sequences
-
-
Constructor Detail
-
DefaultICUTokenizerConfig
public DefaultICUTokenizerConfig(boolean cjkAsWords, boolean myanmarAsWords)
Creates a new config. This object is lightweight, but the first time the class is referenced, break iterators will be initialized.
Parameters:
cjkAsWords - true if CJK text should undergo dictionary-based segmentation; otherwise text will be segmented according to UAX#29 defaults. If this is true, all Han+Hiragana+Katakana words will be tagged as IDEOGRAPHIC.
myanmarAsWords - true if Myanmar text should undergo dictionary-based segmentation; otherwise it will be tokenized as syllables.
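The effect of the two constructor flags can be explored with a minimal sketch along the following lines. It assumes the Lucene ICU analysis module and its ICU4J dependency are on the classpath. The class name IcuConfigDemo and the helper tokenize are hypothetical names chosen for illustration and are not part of Lucene.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.icu.segmentation.DefaultICUTokenizerConfig;
import org.apache.lucene.analysis.icu.segmentation.ICUTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

public class IcuConfigDemo {

  /** Hypothetical helper: tokenizes text and returns one "term|type" string per token. */
  static List<String> tokenize(String text, boolean cjkAsWords, boolean myanmarAsWords) {
    List<String> out = new ArrayList<>();
    // The config is lightweight, so creating one per tokenizer is fine.
    DefaultICUTokenizerConfig config =
        new DefaultICUTokenizerConfig(cjkAsWords, myanmarAsWords);
    try (Tokenizer tokenizer = new ICUTokenizer(config)) {
      CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
      TypeAttribute type = tokenizer.addAttribute(TypeAttribute.class);
      tokenizer.setReader(new StringReader(text));
      tokenizer.reset(); // required before the first incrementToken()
      while (tokenizer.incrementToken()) {
        out.add(term.toString() + "|" + type.type());
      }
      tokenizer.end();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return out;
  }

  public static void main(String[] args) {
    // Same input, with and without dictionary-based CJK segmentation.
    String text = "hello world 42 日本語のテキスト";
    System.out.println(tokenize(text, true, true));
    System.out.println(tokenize(text, false, true));
  }
}
```

Comparing the two printed lists shows how cjkAsWords changes both the token boundaries and the reported token types for the Japanese portion of the input, while the Latin words and the number are unaffected.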
-
-
Method Detail
-
combineCJ
public boolean combineCJ()
Description copied from class: ICUTokenizerConfig
true if Han, Hiragana, and Katakana scripts should all be returned as Japanese
Specified by:
combineCJ in class ICUTokenizerConfig
-
getBreakIterator
public com.ibm.icu.text.RuleBasedBreakIterator getBreakIterator(int script)
Description copied from class: ICUTokenizerConfig
Return a breakiterator capable of processing a given script.
Specified by:
getBreakIterator in class ICUTokenizerConfig
-
getType
public String getType(int script, int ruleStatus)
Description copied from class: ICUTokenizerConfig
Return a token type value for a given script and BreakIterator rule status.
Specified by:
getType in class ICUTokenizerConfig
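ICUTokenizer normally calls these methods itself, but they can also be exercised directly to see how they fit together. The sketch below is illustrative only and assumes the Lucene ICU analysis module and ICU4J are on the classpath. The class IcuConfigMethodsDemo and the helper segmentLatin are hypothetical names. The script code UScript.LATIN is passed by hand here, whereas ICUTokenizer determines the script of each run of text on its own.

```java
import java.util.ArrayList;
import java.util.List;

import com.ibm.icu.lang.UScript;
import com.ibm.icu.text.BreakIterator;
import com.ibm.icu.text.RuleBasedBreakIterator;
import org.apache.lucene.analysis.icu.segmentation.DefaultICUTokenizerConfig;

public class IcuConfigMethodsDemo {

  /** Hypothetical helper: segments Latin-script text and returns "word|type" strings. */
  static List<String> segmentLatin(String text) {
    DefaultICUTokenizerConfig config = new DefaultICUTokenizerConfig(true, true);

    // Ask the config for a break iterator suited to the given script.
    RuleBasedBreakIterator breaks = config.getBreakIterator(UScript.LATIN);
    breaks.setText(text);

    List<String> out = new ArrayList<>();
    int start = breaks.first();
    for (int end = breaks.next(); end != BreakIterator.DONE; start = end, end = breaks.next()) {
      int status = breaks.getRuleStatus();
      // Statuses in [WORD_NONE, WORD_NONE_LIMIT) mark spaces and punctuation; skip them.
      if (status >= RuleBasedBreakIterator.WORD_NONE
          && status < RuleBasedBreakIterator.WORD_NONE_LIMIT) {
        continue;
      }
      // Map the (script, rule status) pair to a token type string.
      out.add(text.substring(start, end) + "|" + config.getType(UScript.LATIN, status));
    }
    return out;
  }

  public static void main(String[] args) {
    System.out.println(segmentLatin("hello world 42"));

    // combineCJ() is expected to mirror the cjkAsWords flag; this matches the
    // constructor description but is an assumption, not stated on this page.
    System.out.println(new DefaultICUTokenizerConfig(true, true).combineCJ());
    System.out.println(new DefaultICUTokenizerConfig(false, true).combineCJ());
  }
}
```

A custom ICUTokenizerConfig subclass would implement the same three methods, so a loop like this is also a convenient way to check a custom configuration before plugging it into ICUTokenizer.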
-
-