Package org.apache.lucene.analysis.email
Class UAX29URLEmailTokenizer
java.lang.Object
org.apache.lucene.util.AttributeSource
org.apache.lucene.analysis.TokenStream
org.apache.lucene.analysis.Tokenizer
org.apache.lucene.analysis.email.UAX29URLEmailTokenizer
All Implemented Interfaces:
Closeable, AutoCloseable
This class implements the Word Break rules from the Unicode Text Segmentation algorithm, as specified
in Unicode Standard Annex #29. URLs and email
addresses are also tokenized according to the relevant RFCs.
-
Nested Class Summary
Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource
AttributeSource.State
Field Summary
Fields
Modifier and Type       Field                     Description
static final int        ALPHANUM                  Alpha/numeric token type
static final int        EMAIL                     Email token type
static final int        EMOJI                     Emoji token type.
static final int        HANGUL                    Hangul token type
static final int        HIRAGANA                  Hiragana token type
static final int        IDEOGRAPHIC               Ideographic token type
static final int        KATAKANA                  Katakana token type
static final int        MAX_TOKEN_LENGTH_LIMIT    Absolute maximum sized token
static final int        NUM                       Numeric token type
static final int        SOUTHEAST_ASIAN           Southeast Asian token type
static final String[]   TOKEN_TYPES               String token types that correspond to token type int constants
static final int        URL                       URL token type
Fields inherited from class org.apache.lucene.analysis.TokenStream
DEFAULT_TOKEN_ATTRIBUTE_FACTORY
Constructor Summary
Constructors
Constructor                                        Description
UAX29URLEmailTokenizer()                           Creates a new instance of the UAX29URLEmailTokenizer.
UAX29URLEmailTokenizer(AttributeFactory factory)   Creates a new UAX29URLEmailTokenizer with a given AttributeFactory
Method Summary
Modifier and Type   Method                          Description
void                close()
final void          end()
int                 getMaxTokenLength()
final boolean       incrementToken()
void                reset()
void                setMaxTokenLength(int length)   Set the max allowed token length.
Methods inherited from class org.apache.lucene.analysis.Tokenizer
correctOffset, setReader, setReaderTestPoint
Methods inherited from class org.apache.lucene.util.AttributeSource
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, endAttributes, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, removeAllAttributes, restoreState, toString
-
Field Details
-
ALPHANUM
public static final int ALPHANUM
Alpha/numeric token type
-
NUM
public static final int NUM
Numeric token type
-
SOUTHEAST_ASIAN
public static final int SOUTHEAST_ASIAN
Southeast Asian token type
-
IDEOGRAPHIC
public static final int IDEOGRAPHIC
Ideographic token type
-
HIRAGANA
public static final int HIRAGANA
Hiragana token type
-
KATAKANA
public static final int KATAKANA
Katakana token type
-
HANGUL
public static final int HANGUL
Hangul token type
-
URL
public static final int URL
URL token type
-
EMAIL
public static final int EMAIL
Email token type
-
EMOJI
public static final int EMOJI
Emoji token type.
-
TOKEN_TYPES
public static final String[] TOKEN_TYPES
String token types that correspond to token type int constants
MAX_TOKEN_LENGTH_LIMIT
public static final int MAX_TOKEN_LENGTH_LIMIT
Absolute maximum sized token
-
-
Constructor Details
-
UAX29URLEmailTokenizer
public UAX29URLEmailTokenizer()
Creates a new instance of the UAX29URLEmailTokenizer. Attaches the input to the newly created JFlex scanner.
UAX29URLEmailTokenizer
public UAX29URLEmailTokenizer(AttributeFactory factory)
Creates a new UAX29URLEmailTokenizer with a given AttributeFactory.
-
-
Method Details
-
setMaxTokenLength
public void setMaxTokenLength(int length)
Set the max allowed token length. Tokens larger than this will be chopped up at this token length and emitted as multiple tokens. If you need to skip such large tokens, you could increase this max length, and then use LengthFilter to remove long tokens. The default is UAX29URLEmailAnalyzer.DEFAULT_MAX_TOKEN_LENGTH.
Throws:
IllegalArgumentException - if the given length is outside of the range [1, 1048576].
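The difference between chopping over-long tokens and dropping them can be sketched in plain Java. This is an illustrative model only, not Lucene code: the class MaxTokenLengthSketch and its methods checkLength, chop, and dropLongerThan are hypothetical names, and the sketch assumes length is counted in Java chars. Only the range [1, 1048576] and the chop-versus-filter behavior come from the description above.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative model of the documented behavior; hypothetical, not Lucene's implementation. */
public class MaxTokenLengthSketch {
    // Upper bound taken from the documented range [1, 1048576].
    public static final int LIMIT = 1048576;

    /** Rejects lengths outside the documented range, as setMaxTokenLength is described to do. */
    public static void checkLength(int length) {
        if (length < 1 || length > LIMIT) {
            throw new IllegalArgumentException(
                "length must be in [1, " + LIMIT + "], got " + length);
        }
    }

    /** Splits one token into consecutive pieces of at most maxLength chars (assumed unit). */
    public static List<String> chop(String token, int maxLength) {
        checkLength(maxLength);
        List<String> pieces = new ArrayList<>();
        for (int start = 0; start < token.length(); start += maxLength) {
            int end = Math.min(token.length(), start + maxLength);
            pieces.add(token.substring(start, end));
        }
        return pieces;
    }

    /** Models the alternative: keep tokens whole, then drop the long ones (the role LengthFilter plays). */
    public static List<String> dropLongerThan(List<String> tokens, int maxLength) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (token.length() <= maxLength) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(chop("abcdefghij", 4));                          // [abcd, efgh, ij]
        System.out.println(dropLongerThan(List.of("ok", "abcdefghij"), 4)); // [ok]
    }
}
```

With chopping, a long token still reaches the index as several fragments. With the raise-the-limit-then-filter approach, it does not reach the index at all.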
-
getMaxTokenLength
public int getMaxTokenLength()
See Also:
setMaxTokenLength(int)
-
incrementToken
public final boolean incrementToken() throws IOException
Specified by:
incrementToken in class TokenStream
Throws:
IOException
-
end
public final void end() throws IOException
Overrides:
end in class TokenStream
Throws:
IOException
-
close
public void close() throws IOException
Specified by:
close in interface AutoCloseable
Specified by:
close in interface Closeable
Overrides:
close in class Tokenizer
Throws:
IOException
-
reset
public void reset() throws IOException
Overrides:
reset in class Tokenizer
Throws:
IOException
-