Package org.apache.lucene.analysis.ja
Class JapaneseTokenizer
- java.lang.Object
  - org.apache.lucene.util.AttributeSource
    - org.apache.lucene.analysis.TokenStream
      - org.apache.lucene.analysis.Tokenizer
        - org.apache.lucene.analysis.ja.JapaneseTokenizer
- All Implemented Interfaces:
  Closeable, AutoCloseable
public final class JapaneseTokenizer extends Tokenizer
Tokenizer for Japanese that uses morphological analysis.
This tokenizer sets a number of additional attributes:
- BaseFormAttribute containing base form for inflected adjectives and verbs.
- PartOfSpeechAttribute containing part-of-speech.
- ReadingAttribute containing reading and pronunciation.
- InflectionAttribute containing additional part-of-speech information for inflected forms.
This tokenizer uses a rolling Viterbi search to find the least-cost segmentation (path) of the incoming characters. For tokens that appear to be compound (longer than 2 characters for all-Kanji tokens, or longer than 7 characters for non-Kanji tokens), it checks whether there is a second-best segmentation of that token after applying penalties to the long tokens. If so, and the Mode is JapaneseTokenizer.Mode.SEARCH, the alternate segmentation is output as well.
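The idea of a least-cost segmentation can be illustrated with a small dynamic-programming sketch. This is not the tokenizer's implementation: the class name, the word list, and every cost below are hypothetical, ASCII text stands in for Japanese to keep the example short, and the sketch ignores connection costs between tokens, the rolling (bounded-memory) aspect of the search, and the search-mode penalties described above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

final class LeastCostSegmentationSketch {

  // Hypothetical dictionary: word -> cost (lower means more likely).
  static final Map<String, Integer> WORD_COSTS =
      Map.of("no", 3, "now", 2, "where", 3, "here", 2, "nowhere", 8);

  // Hypothetical cost for a single character that is not in the dictionary.
  static final int UNKNOWN_CHAR_COST = 10;

  /** Returns the segmentation of text with the smallest total word cost. */
  static List<String> segment(String text) {
    int n = text.length();
    int[] best = new int[n + 1]; // best[i]: least cost of segmenting text[0, i)
    int[] back = new int[n + 1]; // back[i]: start offset of the last token on that path

    for (int end = 1; end <= n; end++) {
      best[end] = Integer.MAX_VALUE;
      for (int start = 0; start < end; start++) {
        Integer cost = WORD_COSTS.get(text.substring(start, end));
        if (cost == null) {
          if (end - start != 1) {
            continue; // unknown multi-character spans are not candidates
          }
          cost = UNKNOWN_CHAR_COST; // fall back to a single unknown character
        }
        int total = best[start] + cost;
        if (total < best[end]) {
          best[end] = total;
          back[end] = start;
        }
      }
    }

    // Follow the back pointers from the end of the text to recover the path.
    List<String> tokens = new ArrayList<>();
    for (int end = n; end > 0; end = back[end]) {
      tokens.add(text.substring(back[end], end));
    }
    Collections.reverse(tokens);
    return tokens;
  }

  public static void main(String[] args) {
    System.out.println(segment("nowhere"));
  }
}
```

With these made-up costs, "now" + "here" (total 4) beats both "no" + "where" (total 6) and the single long entry "nowhere" (cost 8), so the long token is split.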
-
-
Nested Class Summary
Nested Classes
- static class JapaneseTokenizer.Mode - Tokenization mode: this determines how the tokenizer handles compound and unknown words.
- static class JapaneseTokenizer.Type - Token type reflecting the original source of this token.
Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource
AttributeSource.State
-
-
Field Summary
Fields
- static JapaneseTokenizer.Mode DEFAULT_MODE - Default tokenization mode.
Fields inherited from class org.apache.lucene.analysis.TokenStream
DEFAULT_TOKEN_ATTRIBUTE_FACTORY
-
-
Constructor Summary
Constructors
- JapaneseTokenizer(UserDictionary userDictionary, boolean discardPunctuation, boolean discardCompoundToken, JapaneseTokenizer.Mode mode) - Create a new JapaneseTokenizer.
- JapaneseTokenizer(UserDictionary userDictionary, boolean discardPunctuation, JapaneseTokenizer.Mode mode) - Create a new JapaneseTokenizer.
- JapaneseTokenizer(AttributeFactory factory, TokenInfoDictionary systemDictionary, UnknownDictionary unkDictionary, ConnectionCosts connectionCosts, UserDictionary userDictionary, boolean discardPunctuation, boolean discardCompoundToken, JapaneseTokenizer.Mode mode) - Create a new JapaneseTokenizer, supplying a custom system dictionary and unknown dictionary.
- JapaneseTokenizer(AttributeFactory factory, UserDictionary userDictionary, boolean discardPunctuation, boolean discardCompoundToken, JapaneseTokenizer.Mode mode) - Create a new JapaneseTokenizer using the system and unknown dictionaries shipped with Lucene.
- JapaneseTokenizer(AttributeFactory factory, UserDictionary userDictionary, boolean discardPunctuation, JapaneseTokenizer.Mode mode) - Create a new JapaneseTokenizer using the system and unknown dictionaries shipped with Lucene.
-
Method Summary
All Methods (instance, concrete)
- int calcNBestCost(String examples)
- void close()
- void end()
- boolean incrementToken()
- void reset()
- void setGraphvizFormatter(GraphvizFormatter dotOut) - Expert: set this to produce graphviz (dot) output of the Viterbi lattice.
- void setNBestCost(int value)
Methods inherited from class org.apache.lucene.analysis.Tokenizer
correctOffset, setReader, setReaderTestPoint
-
Methods inherited from class org.apache.lucene.util.AttributeSource
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, endAttributes, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, removeAllAttributes, restoreState, toString
-
-
-
-
Field Detail
-
DEFAULT_MODE
public static final JapaneseTokenizer.Mode DEFAULT_MODE
Default tokenization mode. Currently this is JapaneseTokenizer.Mode.SEARCH.
-
-
Constructor Detail
-
JapaneseTokenizer
public JapaneseTokenizer(UserDictionary userDictionary, boolean discardPunctuation, JapaneseTokenizer.Mode mode)
Create a new JapaneseTokenizer. Uses the default AttributeFactory.
- Parameters:
  - userDictionary - Optional: if non-null, user dictionary.
  - discardPunctuation - true if punctuation tokens should be dropped from the output.
  - mode - tokenization mode.
-
JapaneseTokenizer
public JapaneseTokenizer(UserDictionary userDictionary, boolean discardPunctuation, boolean discardCompoundToken, JapaneseTokenizer.Mode mode)
Create a new JapaneseTokenizer. Uses the default AttributeFactory.
- Parameters:
  - userDictionary - Optional: if non-null, user dictionary.
  - discardPunctuation - true if punctuation tokens should be dropped from the output.
  - discardCompoundToken - true if compound tokens should be dropped from the output when tokenization mode is not NORMAL.
  - mode - tokenization mode.
-
JapaneseTokenizer
public JapaneseTokenizer(AttributeFactory factory, UserDictionary userDictionary, boolean discardPunctuation, JapaneseTokenizer.Mode mode)
Create a new JapaneseTokenizer using the system and unknown dictionaries shipped with Lucene.
- Parameters:
  - factory - the AttributeFactory to use.
  - userDictionary - Optional: if non-null, user dictionary.
  - discardPunctuation - true if punctuation tokens should be dropped from the output.
  - mode - tokenization mode.
-
JapaneseTokenizer
public JapaneseTokenizer(AttributeFactory factory, UserDictionary userDictionary, boolean discardPunctuation, boolean discardCompoundToken, JapaneseTokenizer.Mode mode)
Create a new JapaneseTokenizer using the system and unknown dictionaries shipped with Lucene.
- Parameters:
  - factory - the AttributeFactory to use.
  - userDictionary - Optional: if non-null, user dictionary.
  - discardPunctuation - true if punctuation tokens should be dropped from the output.
  - discardCompoundToken - true if compound tokens should be dropped from the output when tokenization mode is not NORMAL.
  - mode - tokenization mode.
-
JapaneseTokenizer
public JapaneseTokenizer(AttributeFactory factory, TokenInfoDictionary systemDictionary, UnknownDictionary unkDictionary, ConnectionCosts connectionCosts, UserDictionary userDictionary, boolean discardPunctuation, boolean discardCompoundToken, JapaneseTokenizer.Mode mode)
Create a new JapaneseTokenizer, supplying a custom system dictionary and unknown dictionary. This constructor provides an entry point for users that want to construct custom language models that can be used as input to DictionaryBuilder.
- Parameters:
  - factory - the AttributeFactory to use.
  - systemDictionary - a custom known token dictionary.
  - unkDictionary - a custom unknown token dictionary.
  - connectionCosts - custom token transition costs.
  - userDictionary - Optional: if non-null, user dictionary.
  - discardPunctuation - true if punctuation tokens should be dropped from the output.
  - discardCompoundToken - true if compound tokens should be dropped from the output when tokenization mode is not NORMAL.
  - mode - tokenization mode.
- WARNING: This API is experimental and might change in incompatible ways in the next release.
-
-
Method Detail
-
setGraphvizFormatter
public void setGraphvizFormatter(GraphvizFormatter dotOut)
Expert: set this to produce graphviz (dot) output of the Viterbi lattice
-
close
public void close() throws IOException
- Specified by:
  close in interface AutoCloseable
- Specified by:
  close in interface Closeable
- Overrides:
  close in class Tokenizer
- Throws:
  IOException
-
reset
public void reset() throws IOException
- Overrides:
  reset in class Tokenizer
- Throws:
  IOException
-
end
public void end() throws IOException
- Overrides:
  end in class TokenStream
- Throws:
  IOException
-
incrementToken
public boolean incrementToken() throws IOException
- Specified by:
  incrementToken in class TokenStream
- Throws:
  IOException
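The lifecycle methods above (reset, incrementToken, end, close) are normally called in that order, with setReader supplying the input first. The following is a minimal sketch of that sequence, assuming the Lucene Kuromoji analysis module is on the classpath. The class name, the helper method, and the sample input are illustrative only and not part of the API; the attribute classes are assumed to live in org.apache.lucene.analysis.ja.tokenattributes.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.ja.JapaneseTokenizer;
import org.apache.lucene.analysis.ja.tokenattributes.BaseFormAttribute;
import org.apache.lucene.analysis.ja.tokenattributes.PartOfSpeechAttribute;
import org.apache.lucene.analysis.ja.tokenattributes.ReadingAttribute;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

final class JapaneseTokenizerUsageSketch {

  /** Hypothetical helper: returns one tab-separated row per emitted token. */
  static List<String> describeTokens(String text) {
    List<String> rows = new ArrayList<>();
    // No user dictionary, punctuation discarded, search mode.
    try (JapaneseTokenizer tokenizer =
        new JapaneseTokenizer(null, true, JapaneseTokenizer.Mode.SEARCH)) {
      CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
      BaseFormAttribute baseForm = tokenizer.addAttribute(BaseFormAttribute.class);
      PartOfSpeechAttribute pos = tokenizer.addAttribute(PartOfSpeechAttribute.class);
      ReadingAttribute reading = tokenizer.addAttribute(ReadingAttribute.class);

      tokenizer.setReader(new StringReader(text));
      tokenizer.reset(); // must be called before the first incrementToken()
      while (tokenizer.incrementToken()) {
        // Some attribute values may be null, e.g. the base form of an uninflected token.
        rows.add(term + "\t" + pos.getPartOfSpeech() + "\t"
            + baseForm.getBaseForm() + "\t" + reading.getReading());
      }
      tokenizer.end(); // finalizes end-of-stream state such as the final offset
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return rows;
  }

  public static void main(String[] args) {
    describeTokens("関西国際空港に行った").forEach(System.out::println);
  }
}
```

The try-with-resources block takes care of close(), which is possible because the class implements Closeable and AutoCloseable.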
-
calcNBestCost
public int calcNBestCost(String examples)
-
setNBestCost
public void setNBestCost(int value)
-
-