Package org.apache.lucene.analysis.ko
Class KoreanTokenizer
java.lang.Object
org.apache.lucene.util.AttributeSource
org.apache.lucene.analysis.TokenStream
org.apache.lucene.analysis.Tokenizer
org.apache.lucene.analysis.ko.KoreanTokenizer
- All Implemented Interfaces:
- Closeable, AutoCloseable
Tokenizer for Korean that uses morphological analysis.
 
This tokenizer sets a number of additional attributes:
- PartOfSpeechAttribute containing part-of-speech.
- ReadingAttribute containing reading.
This tokenizer uses a rolling Viterbi search to find the least-cost segmentation (path) of the incoming characters.
- WARNING: This API is experimental and might change in incompatible ways in the next release.
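To make the least-cost idea concrete, here is a minimal, self-contained sketch of Viterbi-style segmentation over a toy dictionary. It is not the tokenizer's implementation: the class name, the `segment` helper, the word costs, and the `UNKNOWN_CHAR_COST` fallback are all assumptions made for illustration. It also ignores transition costs between adjacent tokens and processes the whole input at once rather than in a rolling window. Latin letters are used only to keep the example encoding-neutral.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; not part of Lucene.
class ViterbiSegmentationSketch {
    // Hypothetical cost for a single character that is not in the dictionary.
    static final int UNKNOWN_CHAR_COST = 10;

    // Returns the segmentation of text whose summed word costs are lowest.
    static List<String> segment(String text, Map<String, Integer> wordCosts) {
        int n = text.length();
        int[] best = new int[n + 1]; // best[i]: lowest cost of segmenting text[0, i)
        int[] back = new int[n + 1]; // back[i]: start offset of the last token on that path
        Arrays.fill(best, Integer.MAX_VALUE);
        best[0] = 0;

        for (int end = 1; end <= n; end++) {
            for (int start = 0; start < end; start++) {
                if (best[start] == Integer.MAX_VALUE) {
                    continue; // position not reachable
                }
                Integer cost = wordCosts.get(text.substring(start, end));
                if (cost == null) {
                    if (end - start != 1) {
                        continue; // only single characters may be unknown
                    }
                    cost = UNKNOWN_CHAR_COST;
                }
                int total = best[start] + cost;
                if (total < best[end]) {
                    best[end] = total;
                    back[end] = start;
                }
            }
        }

        // Walk the back pointers from the end of the text to recover the path.
        List<String> tokens = new ArrayList<>();
        for (int pos = n; pos > 0; pos = back[pos]) {
            tokens.add(text.substring(back[pos], pos));
        }
        Collections.reverse(tokens);
        return tokens;
    }

    public static void main(String[] args) {
        Map<String, Integer> costs = Map.of("seoul", 3, "station", 3, "seo", 2, "ul", 4);
        // "seoul" + "station" costs 6; "seo" + "ul" + "station" costs 9.
        System.out.println(segment("seoulstation", costs)); // prints [seoul, station]
    }
}
```

In this sketch the dictionary alone decides the path. In the tokenizer, the system dictionary, unknown dictionary, and connection costs passed to the constructors play that role.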
- 
Nested Class Summary
Nested Classes
- static enum KoreanTokenizer.DecompoundMode: Decompound mode: this determines how the tokenizer handles POS.Type.COMPOUND, POS.Type.INFLECT and POS.Type.PREANALYSIS tokens.
Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource
- AttributeSource.State
- 
Field Summary
Fields
- static final KoreanTokenizer.DecompoundMode DEFAULT_DECOMPOUND: Default mode for the decompound of tokens (KoreanTokenizer.DecompoundMode.DISCARD).
Fields inherited from class org.apache.lucene.analysis.TokenStream
- DEFAULT_TOKEN_ATTRIBUTE_FACTORY
- 
Constructor Summary
Constructors
- KoreanTokenizer(): Creates a new KoreanTokenizer with default parameters.
- KoreanTokenizer(AttributeFactory factory, TokenInfoDictionary systemDictionary, UnknownDictionary unkDictionary, ConnectionCosts connectionCosts, UserDictionary userDictionary, KoreanTokenizer.DecompoundMode mode, boolean outputUnknownUnigrams, boolean discardPunctuation): Creates a new KoreanTokenizer, supplying a custom system dictionary and unknown dictionary.
- KoreanTokenizer(AttributeFactory factory, UserDictionary userDictionary, KoreanTokenizer.DecompoundMode mode, boolean outputUnknownUnigrams): Creates a new KoreanTokenizer using the system and unknown dictionaries shipped with Lucene.
- KoreanTokenizer(AttributeFactory factory, UserDictionary userDictionary, KoreanTokenizer.DecompoundMode mode, boolean outputUnknownUnigrams, boolean discardPunctuation): Creates a new KoreanTokenizer using the system and unknown dictionaries shipped with Lucene.
- 
Method Summary
Methods inherited from class org.apache.lucene.analysis.Tokenizer
- correctOffset, setReader, setReaderTestPoint
Methods inherited from class org.apache.lucene.util.AttributeSource
- addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, endAttributes, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, removeAllAttributes, restoreState, toString
- 
Field Details
DEFAULT_DECOMPOUND
Default mode for the decompound of tokens (KoreanTokenizer.DecompoundMode.DISCARD).
 
- 
- 
Constructor Details
KoreanTokenizer
public KoreanTokenizer()
Creates a new KoreanTokenizer with default parameters. Uses the default AttributeFactory.
- 
KoreanTokenizer
public KoreanTokenizer(AttributeFactory factory, UserDictionary userDictionary, KoreanTokenizer.DecompoundMode mode, boolean outputUnknownUnigrams)
Creates a new KoreanTokenizer using the system and unknown dictionaries shipped with Lucene.
- Parameters:
- factory - the AttributeFactory to use
- userDictionary - optional: the user dictionary, used if non-null.
- mode - decompound mode.
- outputUnknownUnigrams - if true, outputs unigrams for unknown words.
 
- 
KoreanTokenizer
public KoreanTokenizer(AttributeFactory factory, UserDictionary userDictionary, KoreanTokenizer.DecompoundMode mode, boolean outputUnknownUnigrams, boolean discardPunctuation)
Creates a new KoreanTokenizer using the system and unknown dictionaries shipped with Lucene.
- Parameters:
- factory - the AttributeFactory to use
- userDictionary - optional: the user dictionary, used if non-null.
- mode - decompound mode.
- outputUnknownUnigrams - if true, outputs unigrams for unknown words.
- discardPunctuation - true if punctuation tokens should be dropped from the output.
 
- 
KoreanTokenizer
public KoreanTokenizer(AttributeFactory factory, TokenInfoDictionary systemDictionary, UnknownDictionary unkDictionary, ConnectionCosts connectionCosts, UserDictionary userDictionary, KoreanTokenizer.DecompoundMode mode, boolean outputUnknownUnigrams, boolean discardPunctuation)
Creates a new KoreanTokenizer, supplying a custom system dictionary and unknown dictionary. This constructor provides an entry point for users that want to construct custom language models that can be used as input to DictionaryBuilder.
- Parameters:
- factory - the AttributeFactory to use
- systemDictionary - a custom known token dictionary
- unkDictionary - a custom unknown token dictionary
- connectionCosts - custom token transition costs
- userDictionary - optional: the user dictionary, used if non-null.
- mode - decompound mode.
- outputUnknownUnigrams - if true, outputs unigrams for unknown words.
- discardPunctuation - true if punctuation tokens should be dropped from the output.
- WARNING: This API is experimental and might change in incompatible ways in the next release.
 
 
- 
- 
Method Details
close
- Specified by:
- close in interface AutoCloseable
- Specified by:
- close in interface Closeable
- Overrides:
- close in class Tokenizer
- Throws:
- IOException
 
- 
reset
- Overrides:
- reset in class Tokenizer
- Throws:
- IOException
 
- 
end
- Overrides:
- end in class TokenStream
- Throws:
- IOException
 
- 
incrementToken
- Specified by:
- incrementToken in class TokenStream
- Throws:
- IOException
 
- 
setGraphvizFormatter
Expert: set this to produce Graphviz (dot) output of the Viterbi lattice.
 
-