Class UnifiedHighlighter
A Highlighter that can get offsets from either postings (IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS), term vectors (FieldType.setStoreTermVectorOffsets(boolean)), or via re-analyzing text.
This highlighter treats the single original document as the whole corpus, and then scores
individual passages as if they were documents in this corpus. It uses a BreakIterator to
find passages in the text; by default it breaks using getSentenceInstance(Locale.ROOT). It then iterates in
parallel (merge sorting by offset) through the positions of all terms from the query, coalescing
those hits that occur in a single passage into a Passage, and then scores each Passage
using a separate PassageScorer. Passages are finally formatted into highlighted snippets
with a PassageFormatter.
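The passage idea described above can be sketched with the JDK alone. This is only an illustration under stated assumptions, not the highlighter's implementation: the class `PassageSketch` and method `bestPassage` are hypothetical names, term matching is a plain case-insensitive substring search rather than a merge over term positions, and the raw hit count stands in for the separate PassageScorer.

```java
import java.text.BreakIterator;
import java.util.List;
import java.util.Locale;

final class PassageSketch {
    /**
     * Returns the sentence containing the most term occurrences (the first one wins ties),
     * or null when no sentence contains any term. Matching is plain case-insensitive
     * substring search, an assumption made to keep the sketch short.
     */
    static String bestPassage(String text, List<String> terms) {
        // Lower-casing is assumed not to change string length (true for ASCII text).
        String lower = text.toLowerCase(Locale.ROOT);
        BreakIterator sentences = BreakIterator.getSentenceInstance(Locale.ROOT);
        sentences.setText(text);

        String best = null;
        int bestHits = 0;
        int start = sentences.first();
        for (int end = sentences.next(); end != BreakIterator.DONE; start = end, end = sentences.next()) {
            int hits = 0;
            for (String term : terms) {
                String t = term.toLowerCase(Locale.ROOT);
                if (t.isEmpty()) {
                    continue;
                }
                // Count occurrences that lie entirely inside this passage.
                for (int at = lower.indexOf(t, start);
                        at >= 0 && at + t.length() <= end;
                        at = lower.indexOf(t, at + 1)) {
                    hits++;
                }
            }
            // Hit count is a stand-in for a real passage score.
            if (hits > bestHits) {
                bestHits = hits;
                best = text.substring(start, end).trim();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        String text = "Lucene is a search library. It highlights query terms in text. Nothing here.";
        System.out.println(bestPassage(text, List.of("highlights", "terms")));
    }
}
```

Treating each sentence as its own small document is what lets a passage be ranked independently of the rest of the text.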
You can customize the behavior by calling some of the setters, or by subclassing and overriding some methods. Some important hooks:
- getBreakIterator(String): Customize how the text is divided into passages.
- getScorer(String): Customize how passages are ranked.
- getFormatter(String): Customize how snippets are formatted.
- getPassageSortComparator(String): Customize how passages are finally sorted.
This is thread-safe, notwithstanding the setters.
- WARNING: This API is experimental and might change in incompatible ways in the next release.
-
Nested Class Summary
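Nested Classes
Modifier and Type | Class | Description
static class | UnifiedHighlighter.Builder | Builder for UnifiedHighlighter.
static enum | UnifiedHighlighter.HighlightFlag | Flags for controlling highlighting behavior.
protected static class | UnifiedHighlighter.LimitedStoredFieldVisitor | Fetches stored fields for highlighting.
static enum | UnifiedHighlighter.OffsetSource | Source of term offsets; essential for highlighting.
-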
Field Summary
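Fields
Modifier and Type | Field
static final int | DEFAULT_CACHE_CHARS_THRESHOLD
static final int | DEFAULT_MAX_LENGTH
protected FieldInfos | fieldInfos
protected final Analyzer | indexAnalyzer
protected static final char | MULTIVAL_SEP_CHAR
protected final IndexSearcher | searcher
protected static final LabelledCharArrayMatcher[] | ZERO_LEN_AUTOMATA_ARRAY
-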
Constructor Summary
Constructors
Constructor | Description
UnifiedHighlighter(IndexSearcher indexSearcher, Analyzer indexAnalyzer) | Deprecated.
UnifiedHighlighter(UnifiedHighlighter.Builder builder) | Constructs the highlighter with the given UnifiedHighlighter.Builder.
-
Method Summary
Modifier and Type | Method | Description
static UnifiedHighlighter.Builder | builder(IndexSearcher searcher, Analyzer indexAnalyzer) |
static UnifiedHighlighter.Builder | builderWithoutSearcher(Analyzer indexAnalyzer) | Creates a UnifiedHighlighter.Builder object in which you can only use highlightWithoutSearcher(String, Query, String, int) for highlighting.
protected Set<UnifiedHighlighter.HighlightFlag> | evaluateFlags(boolean shouldHandleMultiTermQuery, boolean shouldHighlightPhrasesStrictly, boolean shouldPassageRelevancyOverSpeed, boolean shouldEnableWeightMatches) | This method returns the set of UnifiedHighlighter.HighlightFlags, which will be applied to the UH object.
protected Set<UnifiedHighlighter.HighlightFlag> | evaluateFlags(UnifiedHighlighter uh) | Deprecated.
protected Set<UnifiedHighlighter.HighlightFlag> | evaluateFlags(UnifiedHighlighter.Builder uhBuilder) | Evaluate the highlight flags and set the flags variable.
protected static Set<Term> | extractTerms(Query query) | Extracts matching terms
protected static BytesRef[] | filterExtractedTerms(Predicate<String> fieldMatcher, Set<Term> queryTerms) |
protected LabelledCharArrayMatcher[] | getAutomata(String field, Query query, Set<UnifiedHighlighter.HighlightFlag> highlightFlags) |
protected BreakIterator | getBreakIterator(String field) | Returns the BreakIterator to use for dividing text into passages.
int | getCacheFieldValCharsThreshold() | Limits the amount of field value pre-fetching until this threshold is passed.
protected FieldHighlighter | getFieldHighlighter(String field, Query query, Set<Term> allTerms, int maxPassages) |
protected FieldInfo | getFieldInfo(String field) | Called by the default implementation of getOffsetSource(String).
protected Predicate<String> | getFieldMatcher(String field) | Returns the predicate to use for extracting the query part that must be highlighted.
protected Set<UnifiedHighlighter.HighlightFlag> | getFlags(String field) | Returns the UnifiedHighlighter.HighlightFlags applicable for the current UH instance.
protected PassageFormatter | getFormatter(String field) | Returns the PassageFormatter to use for formatting passages into highlighted snippets.
protected UHComponents | getHighlightComponents(String field, Query query, Set<Term> allTerms) |
Analyzer | getIndexAnalyzer() | ...
IndexSearcher | getIndexSearcher() | ...
protected Set<String> | getMaskedFields(String field) |
int | getMaxLength() | The maximum content size to process.
protected int | getMaxNoHighlightPassages(String field) | Returns the number of leading passages (as delineated by the BreakIterator) when no highlights could be found.
protected UnifiedHighlighter.OffsetSource | getOffsetSource(String field) | Determine the offset source for the specified field.
protected FieldOffsetStrategy | getOffsetStrategy(UnifiedHighlighter.OffsetSource offsetSource, UHComponents components) |
protected UnifiedHighlighter.OffsetSource | getOptimizedOffsetSource(UHComponents components) |
protected Comparator<Passage> | getPassageSortComparator(String field) | Returns the Comparator to use for finally sorting passages.
protected PhraseHelper | getPhraseHelper(String field, Query query, Set<UnifiedHighlighter.HighlightFlag> highlightFlags) |
protected PassageScorer | getScorer(String field) | Returns the PassageScorer to use for ranking passages.
protected boolean | hasUnrecognizedQuery(Predicate<String> fieldMatcher, Query query) |
String[] | highlight(String field, Query query, TopDocs topDocs) | Highlights the top passages from a single field.
String[] | highlight(String field, Query query, TopDocs topDocs, int maxPassages) | Highlights the top-N passages from a single field.
Map<String,String[]> | highlightFields(String[] fieldsIn, Query query, int[] docidsIn, int[] maxPassagesIn) | Highlights the top-N passages from multiple fields, for the provided int[] docids.
Map<String,String[]> | highlightFields(String[] fields, Query query, TopDocs topDocs) | Highlights the top passages from multiple fields.
Map<String,String[]> | highlightFields(String[] fields, Query query, TopDocs topDocs, int[] maxPassages) | Highlights the top-N passages from multiple fields.
protected Map<String,Object[]> | highlightFieldsAsObjects(String[] fieldsIn, Query query, int[] docIdsIn, int[] maxPassagesIn) | Expert: highlights the top-N passages from multiple fields, for the provided int[] docids, to custom Object as returned by the PassageFormatter.
Object | highlightWithoutSearcher(String field, Query query, String content, int maxPassages) | Highlights text passed as a parameter.
protected List<CharSequence[]> | loadFieldValues(String[] fields, DocIdSetIterator docIter, int cacheCharsThreshold) | Loads the String values for each docId by field to be highlighted.
protected FieldHighlighter | newFieldHighlighter(String field, FieldOffsetStrategy fieldOffsetStrategy, BreakIterator breakIterator, PassageScorer passageScorer, int maxPassages, int maxNoHighlightPassages, PassageFormatter passageFormatter, Comparator<Passage> passageSortComparator) |
protected UnifiedHighlighter.LimitedStoredFieldVisitor | newLimitedStoredFieldsVisitor(String[] fields) |
protected Collection<Query> | preSpanQueryRewrite(Query query) | When highlighting phrases accurately, we may need to handle custom queries that aren't supported in the WeightedSpanTermExtractor as called by the PhraseHelper.
protected Boolean | requiresRewrite(SpanQuery spanQuery) | When highlighting phrases accurately, we need to know which SpanQuery instances need to have Query.rewrite(IndexSearcher) called on them.
void | setBreakIterator(Supplier<BreakIterator> breakIterator) | Deprecated.
void | setCacheFieldValCharsThreshold(int cacheFieldValCharsThreshold) | Deprecated.
void | setFieldMatcher(Predicate<String> predicate) | Deprecated.
void | setFormatter(PassageFormatter formatter) | Deprecated.
void | setHandleMultiTermQuery(boolean handleMtq) | Deprecated.
void | setHighlightPhrasesStrictly(boolean highlightPhrasesStrictly) | Deprecated.
void | setMaxLength(int maxLength) | Deprecated.
void | setMaxNoHighlightPassages(int defaultMaxNoHighlightPassages) | Deprecated.
void | setPassageRelevancyOverSpeed(boolean passageRelevancyOverSpeed) | Deprecated.
void | setScorer(PassageScorer scorer) | Deprecated.
void | setWeightMatches(boolean weightMatches) | Deprecated.
protected boolean | shouldHandleMultiTermQuery(String field) | Deprecated.
protected boolean | shouldHighlightPhrasesStrictly(String field) | Deprecated.
protected boolean | shouldPreferPassageRelevancyOverSpeed(String field) | Deprecated.
-
Field Details
-
MULTIVAL_SEP_CHAR
protected static final char MULTIVAL_SEP_CHAR
-
DEFAULT_MAX_LENGTH
public static final int DEFAULT_MAX_LENGTH
-
DEFAULT_CACHE_CHARS_THRESHOLD
public static final int DEFAULT_CACHE_CHARS_THRESHOLD
-
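ZERO_LEN_AUTOMATA_ARRAY
protected static final LabelledCharArrayMatcher[] ZERO_LEN_AUTOMATA_ARRAY
-
searcher
protected final IndexSearcher searcher
-
indexAnalyzer
protected final Analyzer indexAnalyzer
-
fieldInfos
protected FieldInfos fieldInfos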
-
-
Constructor Details
-
UnifiedHighlighter
Deprecated. Constructs the highlighter with the given index searcher and analyzer.
Parameters:
indexSearcher - Usually required, unless highlightWithoutSearcher(String, Query, String, int) is used, in which case this needs to be null.
indexAnalyzer - Required, even if in some circumstances it isn't used.
-
UnifiedHighlighter
Constructs the highlighter with the given UnifiedHighlighter.Builder.
Parameters:
builder - a UnifiedHighlighter.Builder object.
-
-
Method Details
-
setHandleMultiTermQuery
Deprecated. -
setHighlightPhrasesStrictly
Deprecated. -
setPassageRelevancyOverSpeed
Deprecated. -
setMaxLength
Deprecated. -
setBreakIterator
Deprecated. -
setScorer
Deprecated. -
setFormatter
Deprecated. -
setMaxNoHighlightPassages
Deprecated. -
setCacheFieldValCharsThreshold
Deprecated. -
setFieldMatcher
Deprecated. -
setWeightMatches
Deprecated. -
shouldHandleMultiTermQuery
Deprecated. Returns whether MultiTermQuery derivatives will be highlighted. By default it's enabled. MTQ highlighting can be expensive, particularly when using offsets in postings. -
shouldHighlightPhrasesStrictly
Deprecated. Returns whether position-sensitive queries (e.g. phrases and SpanQuery instances) should be highlighted strictly based on query matches (slower) versus any/all occurrences of the underlying terms. By default it's enabled, but there's no overhead if such queries aren't used. -
shouldPreferPassageRelevancyOverSpeed
Deprecated. -
builder
Parameters:
searcher - an IndexSearcher object.
indexAnalyzer - an Analyzer object.
Returns:
a UnifiedHighlighter.Builder object
-
builderWithoutSearcher
Creates a UnifiedHighlighter.Builder object in which you can only use highlightWithoutSearcher(String, Query, String, int) for highlighting.
Parameters:
indexAnalyzer - an Analyzer object.
Returns:
a UnifiedHighlighter.Builder object
-
extractTerms
Extracts matching terms -
evaluateFlags
protected Set<UnifiedHighlighter.HighlightFlag> evaluateFlags(boolean shouldHandleMultiTermQuery, boolean shouldHighlightPhrasesStrictly, boolean shouldPassageRelevancyOverSpeed, boolean shouldEnableWeightMatches)
Returns the set of UnifiedHighlighter.HighlightFlags that will be applied to the UH object. The output depends on the values provided to UnifiedHighlighter.Builder.withHandleMultiTermQuery(boolean), UnifiedHighlighter.Builder.withHighlightPhrasesStrictly(boolean), UnifiedHighlighter.Builder.withPassageRelevancyOverSpeed(boolean) and UnifiedHighlighter.Builder.withWeightMatches(boolean), or to setHandleMultiTermQuery(boolean), setHighlightPhrasesStrictly(boolean), setPassageRelevancyOverSpeed(boolean) and setWeightMatches(boolean).
Parameters:
shouldHandleMultiTermQuery - flag for adding multi-term query highlighting
shouldHighlightPhrasesStrictly - flag for adding phrase highlighting
shouldPassageRelevancyOverSpeed - flag for adding passage relevancy
shouldEnableWeightMatches - flag for enabling weight matches
Returns:
a set of UnifiedHighlighter.HighlightFlags.
-
evaluateFlags
Evaluate the highlight flags and set the flags variable. This is called only once when the Builder object is used to create a UH object.
Parameters:
uhBuilder - UnifiedHighlighter.Builder object.
Returns:
UnifiedHighlighter.HighlightFlags.
-
evaluateFlags
Deprecated. Evaluate the highlight flags and set the flags variable. This is called every time the getFlags(String) method is called. This is used in the builder and has been marked deprecated since it is used only for the mutable initialization of a UH object.
Parameters:
uh - UnifiedHighlighter object.
Returns:
UnifiedHighlighter.HighlightFlags.
-
getFieldMatcher
Returns the predicate to use for extracting the query part that must be highlighted. By default only queries that target the current field are kept. (AKA requireFieldMatch) -
getMaskedFields
-
getFlags
Returns the UnifiedHighlighter.HighlightFlags applicable for the current UH instance. -
getMaxLength
public int getMaxLength()
The maximum content size to process. Content will be truncated to this size before highlighting. Typically snippets closer to the beginning of the document better summarize its content.
-
getBreakIterator
Returns the BreakIterator to use for dividing text into passages. This returns BreakIterator.getSentenceInstance(Locale) by default; subclasses can override to customize.
Note: this highlighter will call BreakIterator.preceding(int) and BreakIterator.next() many times on it. The default generic JDK implementation of preceding performs poorly.
-
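For orientation on getBreakIterator, here is a small JDK-only sketch of what the default sentence iterator yields and what the two calls named in its note return. The class `SentenceBreaks` and its method names are hypothetical; only `java.text.BreakIterator` itself is real API.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class SentenceBreaks {
    /** Returns each sentence found by the JDK sentence BreakIterator, in order. */
    static List<String> sentences(String text) {
        BreakIterator bi = BreakIterator.getSentenceInstance(Locale.ROOT);
        bi.setText(text);
        List<String> out = new ArrayList<>();
        int start = bi.first();
        // next() advances to the following boundary, or returns DONE at the end.
        for (int end = bi.next(); end != BreakIterator.DONE; start = end, end = bi.next()) {
            out.add(text.substring(start, end));
        }
        return out;
    }

    /** Returns the start offset of the sentence containing the given offset (0 <= offset < text.length()). */
    static int passageStart(String text, int offset) {
        BreakIterator bi = BreakIterator.getSentenceInstance(Locale.ROOT);
        bi.setText(text);
        // preceding(n) returns the last boundary strictly before n, so ask about offset + 1.
        return bi.preceding(offset + 1);
    }

    public static void main(String[] args) {
        String text = "First one. Second one.";
        System.out.println(sentences(text));
        System.out.println(passageStart(text, 15));
    }
}
```

Note that each sentence keeps its trailing whitespace, so passage boundaries tile the text without gaps.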
getScorer
Returns the PassageScorer to use for ranking passages. -
getFormatter
Returns the PassageFormatter to use for formatting passages into highlighted snippets. -
getPassageSortComparator
Returns the Comparator to use for finally sorting passages. -
getMaxNoHighlightPassages
Returns the number of leading passages (as delineated by the BreakIterator) to return when no highlights could be found. If it's less than 0 (the default) then this defaults to the maxPassages parameter given for each request. If this is 0 then the resulting highlight is null (not formatted).
-
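The three cases for getMaxNoHighlightPassages can be written out as a short sketch. The class and method names are hypothetical, and joining the leading passages by plain concatenation is an assumption; in the highlighter the formatting is the PassageFormatter's job.

```java
import java.util.List;

final class NoHighlightFallback {
    /**
     * Picks the text returned when no highlights were found:
     * a negative setting falls back to maxPassages, zero yields null,
     * and a positive setting takes that many leading passages.
     */
    static String fallback(List<String> passages, int maxNoHighlightPassages, int maxPassages) {
        int n = maxNoHighlightPassages < 0 ? maxPassages : maxNoHighlightPassages;
        if (n == 0) {
            return null; // nothing is formatted
        }
        // Assumption: leading passages are simply concatenated.
        return String.join("", passages.subList(0, Math.min(n, passages.size())));
    }

    public static void main(String[] args) {
        List<String> passages = List.of("A. ", "B. ", "C.");
        System.out.println(fallback(passages, -1, 2));
    }
}
```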
getCacheFieldValCharsThreshold
public int getCacheFieldValCharsThreshold()
Limits the amount of field value pre-fetching until this threshold is passed. The highlighter internally highlights in batches of documents sized on the sum of the field value lengths (in chars) of the fields to be highlighted (bounded by getMaxLength() for each field). By setting this to 0, you can force documents to be fetched and highlighted one at a time, which you usually shouldn't do. The default is 524288 chars, which at two bytes per char translates to about a megabyte. However, note that the highlighter sometimes ignores this and highlights one document at a time (without caching a bunch of documents in advance) when it can detect there's no point in it, such as when all fields will be highlighted via re-analysis.
-
getIndexSearcher
... as passed in from constructor. -
getIndexAnalyzer
... as passed in from constructor. -
getOffsetSource
Determine the offset source for the specified field. The default algorithm is as follows:
- This calls getFieldInfo(String). Note this returns null if there is no searcher or if the field isn't found there.
- If there's a field info and it has IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS then UnifiedHighlighter.OffsetSource.POSTINGS is returned.
- If there's a field info and FieldInfo.hasTermVectors() then UnifiedHighlighter.OffsetSource.TERM_VECTORS is returned (note we can't check here if the TV has offsets; if it doesn't then an exception will get thrown down the line).
- Fall-back: UnifiedHighlighter.OffsetSource.ANALYSIS is returned.
Note that the highlighter sometimes switches to something else based on the query, such as if you have UnifiedHighlighter.OffsetSource.POSTINGS_WITH_TERM_VECTORS but in fact don't need term vectors.
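The documented steps can be mirrored in a few lines. This is a sketch of the listed steps only, not the library's code: `OffsetSourceSketch`, `FieldFacts`, and `choose` are hypothetical names, and the two booleans stand in for what would be read from a FieldInfo. The mention of POSTINGS_WITH_TERM_VECTORS in the note suggests the real method distinguishes more cases than are listed here.

```java
final class OffsetSourceSketch {
    /** Only the three sources named in the documented steps. */
    enum Source { POSTINGS, TERM_VECTORS, ANALYSIS }

    /** Hypothetical stand-in for the two facts the steps read from a field info. */
    static final class FieldFacts {
        final boolean offsetsInPostings;
        final boolean hasTermVectors;

        FieldFacts(boolean offsetsInPostings, boolean hasTermVectors) {
            this.offsetsInPostings = offsetsInPostings;
            this.hasTermVectors = hasTermVectors;
        }
    }

    /** Applies the steps in order; a null argument models "no searcher or field not found". */
    static Source choose(FieldFacts info) {
        if (info == null) {
            return Source.ANALYSIS;
        }
        if (info.offsetsInPostings) {
            return Source.POSTINGS;
        }
        if (info.hasTermVectors) {
            // Whether the term vectors actually carry offsets cannot be checked at this point.
            return Source.TERM_VECTORS;
        }
        return Source.ANALYSIS;
    }

    public static void main(String[] args) {
        System.out.println(choose(new FieldFacts(false, true)));
    }
}
```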
-
getFieldInfo
Called by the default implementation of getOffsetSource(String). If there is no searcher then we simply always return null. -
highlight
Highlights the top passages from a single field.
Parameters:
field - field name to highlight. Must have a stored string value and also be indexed with offsets.
query - query to highlight.
topDocs - TopDocs containing the summary result documents to highlight.
Returns:
Array of formatted snippets corresponding to the documents in topDocs. If no highlights were found for a document, the first sentence for the field will be returned.
Throws:
IOException - if an I/O error occurred during processing
IllegalArgumentException - if field was indexed without IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS
-
highlight
public String[] highlight(String field, Query query, TopDocs topDocs, int maxPassages) throws IOException
Highlights the top-N passages from a single field.
Parameters:
field - field name to highlight. Must have a stored string value.
query - query to highlight.
topDocs - TopDocs containing the summary result documents to highlight.
maxPassages - The maximum number of top-N ranked passages used to form the highlighted snippets.
Returns:
Array of formatted snippets corresponding to the documents in topDocs. If no highlights were found for a document, the first maxPassages sentences from the field will be returned.
Throws:
IOException - if an I/O error occurred during processing
IllegalArgumentException - if field was indexed without IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS
-
highlightFields
public Map<String,String[]> highlightFields(String[] fields, Query query, TopDocs topDocs) throws IOException
Highlights the top passages from multiple fields.
Conceptually, this behaves as a more efficient form of:

    Map<String,String[]> m = new HashMap<>();
    for (String field : fields) {
      m.put(field, highlight(field, query, topDocs));
    }
    return m;

Parameters:
fields - field names to highlight. Must have a stored string value.
query - query to highlight.
topDocs - TopDocs containing the summary result documents to highlight.
Returns:
Map keyed on field name, containing the array of formatted snippets corresponding to the documents in topDocs. If no highlights were found for a document, the first sentence from the field will be returned.
Throws:
IOException - if an I/O error occurred during processing
IllegalArgumentException - if field was indexed without IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS
-
highlightFields
public Map<String,String[]> highlightFields(String[] fields, Query query, TopDocs topDocs, int[] maxPassages) throws IOException
Highlights the top-N passages from multiple fields.
Conceptually, this behaves as a more efficient form of:

    Map<String,String[]> m = new HashMap<>();
    for (int i = 0; i < fields.length; i++) {
      // maxPassages holds one limit per field
      m.put(fields[i], highlight(fields[i], query, topDocs, maxPassages[i]));
    }
    return m;

Parameters:
fields - field names to highlight. Must have a stored string value.
query - query to highlight.
topDocs - TopDocs containing the summary result documents to highlight.
maxPassages - The maximum number of top-N ranked passages per-field used to form the highlighted snippets.
Returns:
Map keyed on field name, containing the array of formatted snippets corresponding to the documents in topDocs. If no highlights were found for a document, the first maxPassages sentences from the field will be returned.
Throws:
IOException - if an I/O error occurred during processing
IllegalArgumentException - if field was indexed without IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS
-
highlightFields
public Map<String,String[]> highlightFields(String[] fieldsIn, Query query, int[] docidsIn, int[] maxPassagesIn) throws IOException
Highlights the top-N passages from multiple fields, for the provided int[] docids.
Parameters:
fieldsIn - field names to highlight. Must have a stored string value.
query - query to highlight.
docidsIn - array containing the document IDs to highlight.
maxPassagesIn - The maximum number of top-N ranked passages per-field used to form the highlighted snippets.
Returns:
Map keyed on field name, containing the array of formatted snippets corresponding to the documents in docidsIn. If no highlights were found for a document, the first maxPassages sentences from the field will be returned.
Throws:
IOException - if an I/O error occurred during processing
IllegalArgumentException - if field was indexed without IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS
-
highlightFieldsAsObjects
protected Map<String,Object[]> highlightFieldsAsObjects(String[] fieldsIn, Query query, int[] docIdsIn, int[] maxPassagesIn) throws IOException
Expert: highlights the top-N passages from multiple fields, for the provided int[] docids, to a custom Object as returned by the PassageFormatter. Use this API to render to something other than String.
Parameters:
fieldsIn - field names to highlight. Must have a stored string value.
query - query to highlight.
docIdsIn - array containing the document IDs to highlight.
maxPassagesIn - The maximum number of top-N ranked passages per-field used to form the highlighted snippets.
Returns:
Map keyed on field name, containing the array of formatted snippets corresponding to the documents in docIdsIn. If no highlights were found for a document, the first maxPassages sentences from the field will be returned.
Throws:
IOException - if an I/O error occurred during processing
IllegalArgumentException - if field was indexed without IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS
-
highlightWithoutSearcher
public Object highlightWithoutSearcher(String field, Query query, String content, int maxPassages) throws IOException
Highlights text passed as a parameter. This requires that the IndexSearcher provided to this highlighter is null. This use case is rarer. Naturally, the mode of operation will be UnifiedHighlighter.OffsetSource.ANALYSIS. The result of this method is whatever the PassageFormatter returns. For the DefaultPassageFormatter, and assuming content has non-zero length, the result will be a non-null string, so it's safe to call Object.toString() on it in that case.
Parameters:
field - field name to highlight (as found in the query).
query - query to highlight.
content - text to highlight.
maxPassages - The maximum number of top-N ranked passages used to form the highlighted snippets.
Returns:
result of the PassageFormatter, probably a String. Might be null.
Throws:
IOException - if an I/O error occurred during processing
-
getFieldHighlighter
protected FieldHighlighter getFieldHighlighter(String field, Query query, Set<Term> allTerms, int maxPassages) -
newFieldHighlighter
protected FieldHighlighter newFieldHighlighter(String field, FieldOffsetStrategy fieldOffsetStrategy, BreakIterator breakIterator, PassageScorer passageScorer, int maxPassages, int maxNoHighlightPassages, PassageFormatter passageFormatter, Comparator<Passage> passageSortComparator) -
getHighlightComponents
-
hasUnrecognizedQuery
-
filterExtractedTerms
-
getPhraseHelper
protected PhraseHelper getPhraseHelper(String field, Query query, Set<UnifiedHighlighter.HighlightFlag> highlightFlags) -
getAutomata
protected LabelledCharArrayMatcher[] getAutomata(String field, Query query, Set<UnifiedHighlighter.HighlightFlag> highlightFlags) -
getOptimizedOffsetSource
-
getOffsetStrategy
protected FieldOffsetStrategy getOffsetStrategy(UnifiedHighlighter.OffsetSource offsetSource, UHComponents components) -
requiresRewrite
When highlighting phrases accurately, we need to know which SpanQuery instances need to have Query.rewrite(IndexSearcher) called on them. It helps performance to avoid it if it's not needed. This method will be invoked on all SpanQuery instances recursively. If you have custom SpanQuery queries then override this to check instanceof and provide a definitive answer. If the query isn't your custom one, simply return null to have the default rules apply, which govern the ones included in Lucene.
-
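The nullable Boolean contract of requiresRewrite (a definitive answer for your own types, null for everything else) can be illustrated without Lucene on the classpath. Everything here is hypothetical: `MyCustomQuery` stands in for a user-defined SpanQuery subclass, and the `decide` method is only an assumption about how a caller might consume the three-valued answer.

```java
final class TriStateRewriteSketch {
    /** Hypothetical custom query type, standing in for a user-defined SpanQuery subclass. */
    static final class MyCustomQuery {}

    /** Override hook: a definitive answer for known custom types, null otherwise. */
    static Boolean requiresRewrite(Object query) {
        if (query instanceof MyCustomQuery) {
            return Boolean.FALSE; // this custom type is assumed not to need rewriting
        }
        return null; // not ours: let the default rules decide
    }

    /** Assumed caller side: fall back to a default rule when the hook returns null. */
    static boolean decide(Object query, boolean defaultAnswer) {
        Boolean answer = requiresRewrite(query);
        return answer != null ? answer : defaultAnswer;
    }

    public static void main(String[] args) {
        System.out.println(decide(new MyCustomQuery(), true));
        System.out.println(decide("some other query", true));
    }
}
```

Returning null rather than false is what distinguishes "no opinion" from "definitely no rewrite needed".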
preSpanQueryRewrite
When highlighting phrases accurately, we may need to handle custom queries that aren't supported in the WeightedSpanTermExtractor as called by the PhraseHelper. Should custom query types be needed, this method should be overridden to return a collection of queries if appropriate, or null if there is nothing to do. If the query is not custom, simply returning null will allow the default rules to apply.
Parameters:
query - Query to be highlighted
Returns:
A Collection of Query object(s) if the query needs to be rewritten, otherwise null.
-
loadFieldValues
protected List<CharSequence[]> loadFieldValues(String[] fields, DocIdSetIterator docIter, int cacheCharsThreshold) throws IOException
Loads the String values for each docId by field to be highlighted. By default this loads from stored fields by the same name as given, but a subclass can change the source. The returned Strings must be identical to what was indexed (at least for postings or term-vectors offset sources). This method must load fields for at least one document from the given DocIdSetIterator but need not return all of them; by default the character lengths are summed and this method will return early when cacheCharsThreshold is exceeded. Specifically if that number is 0, then only one document is fetched no matter what. Values in the array of CharSequence will be null if no value was found.
Throws:
IOException
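The early-return rule described for loadFieldValues can be sketched over a plain iterator of strings. This is an illustration under assumptions, not the method's implementation: `BatchLoadSketch` and `nextBatch` are hypothetical names, each string stands in for one document's field value, and a Java Iterator stands in for the DocIdSetIterator.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

final class BatchLoadSketch {
    /**
     * Pulls values until the summed character length exceeds the threshold.
     * At least one value is always taken, and a threshold of 0 takes exactly one.
     */
    static List<String> nextBatch(Iterator<String> values, int cacheCharsThreshold) {
        List<String> batch = new ArrayList<>();
        int sumChars = 0;
        while (values.hasNext()) {
            String value = values.next();
            batch.add(value); // may be null when no value was found
            if (value != null) {
                sumChars += value.length();
            }
            if (cacheCharsThreshold == 0 || sumChars > cacheCharsThreshold) {
                break;
            }
        }
        return batch;
    }

    public static void main(String[] args) {
        Iterator<String> docs = List.of("aaaa", "bbbb", "cc", "d").iterator();
        System.out.println(nextBatch(docs, 6));
        System.out.println(nextBatch(docs, 6));
    }
}
```

Calling the method repeatedly on the same iterator walks the documents in batches, which is the caching behavior the threshold controls.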
-
newLimitedStoredFieldsVisitor
protected UnifiedHighlighter.LimitedStoredFieldVisitor newLimitedStoredFieldsVisitor(String[] fields) - NOTE: This API is for internal purposes only and might change in incompatible ways in the next release.
-