Package org.apache.lucene.analysis.email
Class UAX29URLEmailTokenizerImpl
java.lang.Object
org.apache.lucene.analysis.email.UAX29URLEmailTokenizerImpl
This class implements Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29. URLs and email addresses are also tokenized according to the relevant RFCs.
 
Tokens produced are of the following types:
- <ALPHANUM>: A sequence of alphabetic and numeric characters
- <NUM>: A number
- <URL>: A URL
- <EMAIL>: An email address
- <SOUTHEAST_ASIAN>: A sequence of characters from South and Southeast Asian languages, including Thai, Lao, Myanmar, and Khmer
- <IDEOGRAPHIC>: A single CJKV ideographic character
- <HIRAGANA>: A single hiragana character
- <KATAKANA>: A sequence of katakana characters
- <HANGUL>: A sequence of Hangul characters
- <EMOJI>: A sequence of Emoji characters

Field Summary
Fields (modifier and type, field, description):
- static final int AVOID_BAD_URL
- static final int EMAIL_TYPE: Email token type
- static final int EMOJI_TYPE: Emoji token type
- static final int HANGUL_TYPE: Hangul token type
- static final int HIRAGANA_TYPE: Hiragana token type
- static final int IDEOGRAPHIC_TYPE: Ideographic token type
- static final int KATAKANA_TYPE: Katakana token type
- static final int NUMERIC_TYPE: Numbers
- static final int SOUTH_EAST_ASIAN_TYPE: Chars in class \p{Line_Break = Complex_Context} are from South East Asian scripts (Thai, Lao, Myanmar, Khmer, etc.).
- static final int URL_TYPE: URL token type
- static final int WORD_TYPE: Alphanumeric sequences
- static final int YYEOF: This character denotes the end of file.
- static final int YYINITIAL: Lexical States.

Constructor Summary
Constructors:
- UAX29URLEmailTokenizerImpl(Reader in): Creates a new scanner

Method Summary
Methods (modifier and type, method, description):
- int getNextToken(): Resumes scanning until the next regular expression is matched, the end of input is encountered, or an I/O error occurs.
- final void getText(CharTermAttribute t): Fills CharTermAttribute with the current token text.
- final void setBufferSize(int numChars): Sets the scanner buffer size in chars.
- final boolean yyatEOF(): Returns whether the scanner has reached the end of the reader it reads from.
- final void yybegin(int newState): Enters a new lexical state.
- final int yychar(): Character count processed so far.
- final char yycharat(int position): Returns the character at the given position from the matched text.
- final void yyclose(): Closes the input reader.
- final int yylength(): How many characters were matched.
- void yypushback(int number): Pushes the specified number of characters back into the input stream.
- final void yyreset(Reader reader): Resets the scanner to read from a new input stream.
- final int yystate(): Returns the current lexical state.
- final String yytext(): Returns the text matched by the current regular expression.

Field Details

YYEOF
public static final int YYEOF
This character denotes the end of file.
 
YYINITIAL
public static final int YYINITIAL
Lexical States.
 
AVOID_BAD_URL
public static final int AVOID_BAD_URL
 
WORD_TYPE
public static final int WORD_TYPE
Alphanumeric sequences
 
NUMERIC_TYPE
public static final int NUMERIC_TYPE
Numbers
 
SOUTH_EAST_ASIAN_TYPE
public static final int SOUTH_EAST_ASIAN_TYPE
Chars in class \p{Line_Break = Complex_Context} are from South East Asian scripts (Thai, Lao, Myanmar, Khmer, etc.). Sequences of these are kept together as a single token rather than broken up, because the logic required to break them at word boundaries is too complex for UAX#29.
See Unicode Line Breaking Algorithm: http://www.unicode.org/reports/tr14/#SA
 
IDEOGRAPHIC_TYPE
public static final int IDEOGRAPHIC_TYPE
Ideographic token type
 
HIRAGANA_TYPE
public static final int HIRAGANA_TYPE
Hiragana token type
 
KATAKANA_TYPE
public static final int KATAKANA_TYPE
Katakana token type
 
HANGUL_TYPE
public static final int HANGUL_TYPE
Hangul token type
 
EMAIL_TYPE
public static final int EMAIL_TYPE
Email token type
 
URL_TYPE
public static final int URL_TYPE
URL token type
 
EMOJI_TYPE
public static final int EMOJI_TYPE
Emoji token type
 
 
Constructor Details

UAX29URLEmailTokenizerImpl
public UAX29URLEmailTokenizerImpl(Reader in)
Creates a new scanner.
Parameters:
- in - the java.io.Reader to read input from.
 
 
Method Details

yychar
public final int yychar()
Character count processed so far.

getText
public final void getText(CharTermAttribute t)
Fills CharTermAttribute with the current token text.

setBufferSize
public final void setBufferSize(int numChars)
Sets the scanner buffer size in chars.

yyclose
public final void yyclose() throws IOException
Closes the input reader.
Throws:
- IOException - if the reader could not be closed.
 
yyreset
public final void yyreset(Reader reader)
Resets the scanner to read from a new input stream. Does not close the old reader.
All internal variables are reset, and the old input stream cannot be reused (the internal buffer is discarded and lost). The lexical state is set to YYINITIAL.
The internal scan buffer is resized down to its initial length if it has grown.
Parameters:
- reader - the new input stream.
 
yyatEOF
public final boolean yyatEOF()
Returns whether the scanner has reached the end of the reader it reads from.
Returns:
- whether the scanner has reached EOF.
 
yystate
public final int yystate()
Returns the current lexical state.
Returns:
- the current lexical state.
 
yybegin
public final void yybegin(int newState)
Enters a new lexical state.
Parameters:
- newState - the new lexical state.
 
yytext
public final String yytext()
Returns the text matched by the current regular expression.
Returns:
- the matched text.
 
yycharat
public final char yycharat(int position)
Returns the character at the given position from the matched text. It is equivalent to yytext().charAt(position), but faster.
Parameters:
- position - the position of the character to fetch. A value from 0 to yylength()-1.
Returns:
- the character at position.
 
yylength
public final int yylength()
How many characters were matched.
Returns:
- the length of the matched text region.
 
yypushback
public void yypushback(int number)
Pushes the specified number of characters back into the input stream. They will be read again by the next call of the scanning method.
Parameters:
- number - the number of characters to be read again. This number must not be greater than yylength().
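
The relationship between yytext(), yylength(), yycharat() and yypushback() can be illustrated with a simplified model of the match region. This is only a sketch: MatchRegionSketch, its fields, and its match helper are hypothetical names invented for the illustration, and the generated scanner's internal buffer handling is not reproduced here.

```java
/** Hypothetical, simplified model of a scanner's match region (not the generated Lucene code). */
final class MatchRegionSketch {
    private final char[] buffer;
    private int startRead;  // index of the first character of the current match
    private int markedPos;  // index one past the last character of the current match

    MatchRegionSketch(String input) {
        this.buffer = input.toCharArray();
    }

    /** Stand-in for the scanning method: pretends the next `length` characters were matched. */
    void match(int length) {
        if (length < 0 || markedPos + length > buffer.length) {
            throw new IllegalArgumentException("match length out of range: " + length);
        }
        startRead = markedPos;
        markedPos += length;
    }

    int yylength() {
        return markedPos - startRead;
    }

    String yytext() {
        return new String(buffer, startRead, yylength());
    }

    /** Reads directly from the buffer, avoiding the String allocation done by yytext(). */
    char yycharat(int position) {
        return buffer[startRead + position];
    }

    /** Gives back the last `number` matched characters so the next match reads them again. */
    void yypushback(int number) {
        if (number < 0 || number > yylength()) {
            throw new IllegalArgumentException("pushback must be between 0 and yylength()");
        }
        markedPos -= number;
    }
}
```

In this model, matching four characters of "foo.bar" and then pushing back one shortens the current match from "foo." to "foo", and the pushed-back "." is the first character consumed by the following match.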
 
getNextToken
public int getNextToken() throws IOException
Resumes scanning until the next regular expression is matched, the end of input is encountered, or an I/O error occurs.
Returns:
- the next token.
Throws:
- IOException - if any I/O error occurs.
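
The calling pattern implied by getNextToken(), YYEOF and yytext() can be sketched as a loop that scans until YYEOF is returned and reads the matched text after each call. The example below is self-contained and deliberately simplified: ScannerSketch and its whitespace-splitting rule are hypothetical stand-ins for the generated scanner, the constant values are assumed for the illustration, and none of the UAX#29, URL, or email rules are implemented.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in scanner that splits on whitespace only. */
final class ScannerSketch {
    static final int YYEOF = -1;     // assumed value, for this sketch only
    static final int WORD_TYPE = 0;  // assumed value, for this sketch only

    private final Reader in;
    private final StringBuilder matched = new StringBuilder();

    ScannerSketch(Reader in) {
        this.in = in;
    }

    /** Returns WORD_TYPE for each whitespace-delimited run of characters, then YYEOF. */
    int getNextToken() throws IOException {
        matched.setLength(0);
        int c;
        while ((c = in.read()) != -1) {
            if (Character.isWhitespace(c)) {
                if (matched.length() > 0) {
                    return WORD_TYPE;
                }
            } else {
                matched.append((char) c);
            }
        }
        return matched.length() > 0 ? WORD_TYPE : YYEOF;
    }

    String yytext() {
        return matched.toString();
    }

    /** The calling pattern: scan until YYEOF, reading yytext() after each match. */
    static List<String> collect(String text) {
        ScannerSketch scanner = new ScannerSketch(new StringReader(text));
        List<String> tokens = new ArrayList<>();
        try {
            while (scanner.getNextToken() != YYEOF) {
                tokens.add(scanner.yytext());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return tokens;
    }
}
```

With the real scanner, the returned int would be compared against the token type fields listed above (for example URL_TYPE or EMAIL_TYPE) rather than a single word type.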
 
 