TextProfileSignature calculates a fuzzy hash of textual fields for Deduplication, and may be incorporated using a SignatureUpdateProcessorFactory definition including the following parameters:




Default value



The minimum token length to consider




When multiplied by the maximum token frequency, this determines count quantization


The signature calculation proceeds as follows:

Tokenization and normalization

Tokens are then counted, tracking the frequency maxFreq of the most frequent token.

Count quantization

A value quant is calculated as follows:


if maxFreq <= 1

quant :=


if round(maxFreq * quantRate) < 2

round(maxFreq * quantRate)


Token frequencies are then rounded down to the nearest multiple of quant, and any token occurring less than quant times is discarded.


The set of frequencies is transformed to a string as a space-delimited sequence of tokens and their frequencies, in descending frequency order. This is then MD5-hashed.

See also TextProfileSignature's javadoc

Implications and limitations

Though this matches two texts approximately, it is still based on exactly matching a single hash. It may fail to match documents that differ by exactly one word, if that word's frequency changes from k * quant - 1 to k * quant.

Words appearing once are ignored unless the text consists only of words appearing once. Hence, "the cat sat on a mat" will hash distinctly to "the cat sat on the mat".

For the default quantRate (0.01), quant will exceed 2 only if the most frequent word occurs maxFreq >= 251 times.

These properties all suggest that TextProfileSignature is brittle for short texts.

TextProfileSignature operates on raw text, without the filtering provided by Analyzers, and hence will fail to ignore HTML, normalize for diacritics, word stem/semantics, or incorporate the relative importance of different tokens, etc. It also considers only the bag of words, ignoring any word order.



Example settings:

  <!-- An example dedup update processor that creates the "id" field on the fly
       based on the hash code of some other fields.  This example has overwriteDupes
       set to false since we are using the id field as the signatureField and Solr
       will maintain uniqueness based on that anyway. -->
  <updateRequestProcessorChain name="dedupe">
    <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
      <bool name="enabled">true</bool>
      <bool name="overwriteDupes">false</bool>
      <str name="signatureField">id</str>
      <str name="fields">name,features,cat</str>
      <str name="signatureClass">org.apache.solr.update.processor.TextProfileSignature</str>
      <str name="quantRate">.2</str>
    <processor class="solr.LogUpdateProcessorFactory" />
    <processor class="solr.RunUpdateProcessorFactory" />

TextProfileSignature (last edited 2013-02-20 22:00:39 by EustacheFelenc)