Class DFISimilarity
java.lang.Object
org.apache.lucene.search.similarities.Similarity
org.apache.lucene.search.similarities.SimilarityBase
org.apache.lucene.search.similarities.DFISimilarity
Implements the Divergence from Independence (DFI) model based on Chi-square statistics
 (i.e., standardized Chi-squared distance from independence in term frequency tf).
 
DFI is both parameter-free and non-parametric:
- parameter-free: it does not require any parameter tuning or training.
- non-parametric: it does not make any assumptions about word frequency distributions on document collections.
It is highly recommended not to remove stopwords (very common terms: the, of, and, to, a, in, for, is, on, that, etc.) with this similarity.
For more information see: A nonparametric term weighting method for information retrieval based on measuring the divergence from independence
WARNING: This API is experimental and might change in incompatible ways in the next release.
Nested Class Summary
Nested classes/interfaces inherited from class org.apache.lucene.search.similarities.Similarity
- Similarity.SimScorer
Constructor Summary
- DFISimilarity(Independence independenceMeasure): Create DFI with the specified divergence from independence measure and using default discountOverlaps value
- DFISimilarity(Independence independenceMeasure, boolean discountOverlaps): Create DFI with the specified parameters
Method Summary
- protected Explanation explain(BasicStats stats, Explanation freq, double docLen): Explains the score.
- Independence getIndependence(): Returns the measure of independence
- protected double score(BasicStats stats, double freq, double docLen): Scores the document doc.
- String toString(): Subclasses must override this method to return the name of the Similarity and preferably the values of parameters (if any) as well.
Methods inherited from class org.apache.lucene.search.similarities.SimilarityBase
- explain, fillBasicStats, log2, newStats, scorer
Methods inherited from class org.apache.lucene.search.similarities.Similarity
- computeNorm, getDiscountOverlaps
Constructor Details
DFISimilarity(Independence independenceMeasure)
Create DFI with the specified divergence from independence measure and using default discountOverlaps value
- Parameters:
- independenceMeasure - measure of divergence from independence

DFISimilarity(Independence independenceMeasure, boolean discountOverlaps)
Create DFI with the specified parameters
- Parameters:
- independenceMeasure - measure of divergence from independence
- discountOverlaps - true if overlap tokens should not impact document length for scoring.

Method Details
score
Description copied from class: SimilarityBase
Scores the document doc. Subclasses must apply their scoring formula in this class.
- Specified by:
- score in class SimilarityBase
- Parameters:
- stats - the corpus level statistics.
- freq - the term frequency.
- docLen - the document length.
- Returns:
- the score.
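
The idea behind the score can be illustrated with a small standalone sketch. It is not Lucene's source and does not use Lucene classes: the class name DfiSketch and its methods are hypothetical. It assumes that the expected term frequency under independence is totalTermFreq * docLen / numberOfFieldTokens, that the standardized measure is (freq - expected) / sqrt(expected), and that the final score is log2(measure + 1). Query boost and any smoothing that a given Lucene release may apply are left out.

```java
/** Hypothetical, Lucene-free sketch of a standardized DFI score. */
public final class DfiSketch {

    /** Frequency the term would have in this document if it were independent of documents (assumed formula). */
    public static double expectedFrequency(long totalTermFreq, long numberOfFieldTokens, double docLen) {
        return totalTermFreq * docLen / numberOfFieldTokens;
    }

    /** Standardized distance from independence: (observed - expected) / sqrt(expected). */
    public static double standardized(double freq, double expected) {
        return (freq - expected) / Math.sqrt(expected);
    }

    /** Returns 0 when the term is no more frequent than expected, else log2(measure + 1). */
    public static double score(long totalTermFreq, long numberOfFieldTokens, double freq, double docLen) {
        double expected = expectedFrequency(totalTermFreq, numberOfFieldTokens, docLen);
        if (freq <= expected) {
            return 0.0; // no evidence that the term is over-represented in this document
        }
        double measure = standardized(freq, expected);
        return Math.log(measure + 1) / Math.log(2); // log base 2
    }

    public static void main(String[] args) {
        // Term occurs 400 times in a 10,000-token field; a 100-token document expects 4 occurrences.
        System.out.println(score(400, 10_000, 10.0, 100.0)); // observed 10 > expected 4
        System.out.println(score(400, 10_000, 4.0, 100.0));  // observed == expected, prints 0.0
    }
}
```

Because a term that occurs at or below its expected frequency contributes nothing in this sketch, very common terms mostly receive a zero score in typical documents, which is consistent with the recommendation above to keep stopwords in the index.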
 
getIndependence
Returns the measure of independence
explain
Description copied from class: SimilarityBase
Explains the score. The implementation here provides a basic explanation in the format score(name-of-similarity, doc=doc-id, freq=term-frequency), computed from:, and attaches the score (computed via the SimilarityBase.score(BasicStats, double, double) method) and the explanation for the term frequency. Subclasses content with this format may add additional details in SimilarityBase.explain(List, BasicStats, double, double).
- Overrides:
- explain in class SimilarityBase
- Parameters:
- stats - the corpus level statistics.
- freq - the term frequency and its explanation.
- docLen - the document length.
- Returns:
- the explanation.

toString
Description copied from class: SimilarityBase
Subclasses must override this method to return the name of the Similarity and preferably the values of parameters (if any) as well.
- Specified by:
- toString in class SimilarityBase