Package org.apache.lucene.tests.analysis
Class MockTokenizer
java.lang.Object
org.apache.lucene.util.AttributeSource
org.apache.lucene.analysis.TokenStream
org.apache.lucene.analysis.Tokenizer
org.apache.lucene.tests.analysis.MockTokenizer
- All Implemented Interfaces:
Closeable, AutoCloseable
Tokenizer for testing.
This tokenizer is a replacement for the WHITESPACE, SIMPLE, and KEYWORD tokenizers. If you are writing a component such as a TokenFilter, it is a good idea to test it by wrapping this tokenizer instead, for the extra checks. This tokenizer has the following behavior (a usage sketch follows the list):
- An internal state machine checks consumer consistency. These checks can be disabled with setEnableChecks(boolean).
- For convenience, it optionally lowercases the terms it outputs.
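A minimal usage sketch, with an illustrative class name, sample text, and a hypothetical filter mentioned only in a comment (none of these come from the test framework itself): it drives MockTokenizer through the consumer workflow that the state machine verifies; a TokenFilter test would wrap the tokenizer in the filter under test at the marked line.

import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.tests.analysis.MockTokenizer;

public class MockTokenizerSketch {
  public static void main(String[] args) throws Exception {
    // WHITESPACE automaton, lowercasing enabled
    MockTokenizer tokenizer = new MockTokenizer(MockTokenizer.WHITESPACE, true);
    tokenizer.setReader(new StringReader("Some Test Text"));
    // a TokenFilter test would wrap the tokenizer here, e.g. new MyFilter(tokenizer)
    TokenStream stream = tokenizer;
    CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
    // normal consumer workflow: reset, incrementToken until false, end, close
    stream.reset();
    while (stream.incrementToken()) {
      System.out.println(term.toString());
    }
    stream.end();
    stream.close();
  }
}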
-
Field Summary
Fields
- static final int DEFAULT_MAX_TOKEN_LENGTH: Limit the default token length to a size that doesn't cause random analyzer failures on unpredictable data like the enwiki data set.
- static final CharacterRunAutomaton KEYWORD: Acts similarly to KeywordTokenizer.
- static final CharacterRunAutomaton SIMPLE: Acts like LetterTokenizer.
- static final CharacterRunAutomaton WHITESPACE: Acts similarly to WhitespaceTokenizer.
Fields inherited from class org.apache.lucene.analysis.TokenStream
DEFAULT_TOKEN_ATTRIBUTE_FACTORY
-
Constructor Summary
Constructors
- MockTokenizer()
- MockTokenizer(AttributeFactory factory)
- MockTokenizer(AttributeFactory factory, CharacterRunAutomaton runAutomaton, boolean lowerCase)
- MockTokenizer(AttributeFactory factory, CharacterRunAutomaton runAutomaton, boolean lowerCase, int maxTokenLength)
- MockTokenizer(CharacterRunAutomaton runAutomaton, boolean lowerCase)
- MockTokenizer(CharacterRunAutomaton runAutomaton, boolean lowerCase, int maxTokenLength)
-
Method Summary
Methods
- void close()
- void end()
- final boolean incrementToken()
- protected boolean isTokenChar(int c)
- protected int normalize(int c)
- protected int readChar()
- protected int readCodePoint()
- void reset()
- void setEnableChecks(boolean enableChecks): Toggle consumer workflow checking: if your test consumes token streams normally, you should leave this enabled.
- protected void setReaderTestPoint()
Methods inherited from class org.apache.lucene.analysis.Tokenizer
correctOffset, setReader
Methods inherited from class org.apache.lucene.util.AttributeSource
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, endAttributes, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, removeAllAttributes, restoreState, toString
-
Field Details
-
WHITESPACE
public static final CharacterRunAutomaton WHITESPACE
Acts similarly to WhitespaceTokenizer. -
KEYWORD
public static final CharacterRunAutomaton KEYWORD
Acts similarly to KeywordTokenizer. TODO: Keyword returns an "empty" token for an empty reader... -
SIMPLE
public static final CharacterRunAutomaton SIMPLE
Acts like LetterTokenizer. -
DEFAULT_MAX_TOKEN_LENGTH
public static final int DEFAULT_MAX_TOKEN_LENGTH
Limit the default token length to a size that doesn't cause random analyzer failures on unpredictable data like the enwiki data set. This value defaults to CharTokenizer.DEFAULT_MAX_WORD_LEN (255).
- See Also:
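A minimal sketch (class, method name, and sample strings are illustrative, not from the Lucene docs) contrasting two of the automata above: KEYWORD keeps the entire input as a single token, while SIMPLE splits on non-letter characters, much like LetterTokenizer.

import java.io.StringReader;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.tests.analysis.MockTokenizer;
import org.apache.lucene.util.automaton.CharacterRunAutomaton;

public class AutomatonChoiceSketch {
  // tokenize the text with the given automaton and print each term
  static void dump(CharacterRunAutomaton automaton, String text) throws Exception {
    MockTokenizer tokenizer = new MockTokenizer(automaton, true);
    tokenizer.setReader(new StringReader(text));
    CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
    tokenizer.reset();
    while (tokenizer.incrementToken()) {
      System.out.println(term.toString());
    }
    tokenizer.end();
    tokenizer.close();
  }

  public static void main(String[] args) throws Exception {
    dump(MockTokenizer.KEYWORD, "kept as one token");       // single token: "kept as one token"
    dump(MockTokenizer.SIMPLE, "letters only, split-here"); // tokens: letters, only, split, here
  }
}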
-
-
Constructor Details
-
MockTokenizer
public MockTokenizer(AttributeFactory factory, CharacterRunAutomaton runAutomaton, boolean lowerCase, int maxTokenLength) -
MockTokenizer
public MockTokenizer(CharacterRunAutomaton runAutomaton, boolean lowerCase, int maxTokenLength) -
MockTokenizer
public MockTokenizer(CharacterRunAutomaton runAutomaton, boolean lowerCase) -
MockTokenizer
public MockTokenizer() -
MockTokenizer
public MockTokenizer(AttributeFactory factory, CharacterRunAutomaton runAutomaton, boolean lowerCase) -
MockTokenizer
public MockTokenizer(AttributeFactory factory) -
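A minimal construction sketch for the fully-parameterized constructor; the class name and the 8-character token length cap are chosen only for illustration, and the attribute factory used is the default one inherited from TokenStream.

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.tests.analysis.MockTokenizer;

public class ConstructorSketch {
  public static void main(String[] args) {
    MockTokenizer tokenizer =
        new MockTokenizer(
            TokenStream.DEFAULT_TOKEN_ATTRIBUTE_FACTORY, // default AttributeFactory
            MockTokenizer.KEYWORD,                       // entire input becomes a single token
            false,                                       // keep original case
            8);                                          // illustrative max token length
    // ... wrap and consume the tokenizer in a test as usual
  }
}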
-
Method Details
-
incrementToken
public final boolean incrementToken() throws IOException
- Specified by:
incrementToken in class TokenStream
- Throws:
IOException
-
readCodePoint
protected int readCodePoint() throws IOException
- Throws:
IOException
-
readChar
protected int readChar() throws IOException
- Throws:
IOException
-
isTokenChar
protected boolean isTokenChar(int c) -
normalize
protected int normalize(int c) -
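Because isTokenChar(int) and normalize(int) are protected hooks, per-codepoint handling can be adjusted by subclassing; a minimal sketch with a hypothetical subclass name and mapping, assuming the class is open for subclassing, that folds underscores to hyphens before terms are emitted:

import org.apache.lucene.tests.analysis.MockTokenizer;

// hypothetical subclass: maps '_' to '-' in emitted terms, otherwise defers to
// the default normalization (which lowercases when lowerCase is true)
public class UnderscoreFoldingMockTokenizer extends MockTokenizer {
  public UnderscoreFoldingMockTokenizer() {
    super(MockTokenizer.WHITESPACE, true); // whitespace splitting, lowercased terms
  }

  @Override
  protected int normalize(int c) {
    return c == '_' ? '-' : super.normalize(c);
  }
}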
reset
public void reset() throws IOException
- Overrides:
reset in class Tokenizer
- Throws:
IOException
-
close
public void close() throws IOException
- Specified by:
close in interface AutoCloseable
- Specified by:
close in interface Closeable
- Overrides:
close in class Tokenizer
- Throws:
IOException
-
setReaderTestPoint
protected void setReaderTestPoint()
- Overrides:
setReaderTestPoint in class Tokenizer
-
end
public void end() throws IOException
- Overrides:
end in class TokenStream
- Throws:
IOException
-
setEnableChecks
public void setEnableChecks(boolean enableChecks)
Toggle consumer workflow checking: if your test consumes token streams normally, you should leave this enabled.
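A minimal sketch of an assumed scenario where disabling the checks can be useful: a test that deliberately abandons the stream after one token, which the workflow state machine would otherwise flag. The class name and sample text are illustrative.

import java.io.StringReader;
import org.apache.lucene.tests.analysis.MockTokenizer;

public class DisabledChecksSketch {
  public static void main(String[] args) throws Exception {
    MockTokenizer tokenizer = new MockTokenizer(MockTokenizer.WHITESPACE, true);
    // this test intentionally violates the normal consumer workflow below,
    // so the consistency checks are switched off for this one instance
    tokenizer.setEnableChecks(false);
    tokenizer.setReader(new StringReader("one two three"));
    tokenizer.reset();
    tokenizer.incrementToken(); // consume only the first token
    tokenizer.close();          // close without end() or full consumption
  }
}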
-