
...

  1. The advanced query parser always splits on whitespace, so the whitespace tokenizer is used at index time to ensure corresponding tokens. 
  2. PatternTypingFilterFactory matches incoming tokens against the second (whitespace-delimited) column of patterns.txt in sequence, so the token C++ would match the last line and 401(k) would match the first line.
  3. Upon matching 401(k), PatternTypingFilter adds the type attribute __TAS__legal2_401_k and sets the second bit of the flags attribute (determined by the first column of patterns.txt). The purpose of the __TAS__ prefix is to avoid any case in which a token from the text might coincide with a token from the synonyms when this type is converted into a token later on. The ::: in patterns.txt is just a separator that makes it easier to see where patterns end and replacements begin.
  4. TokenAnalyzerFilterFactory conducts the text_general analysis on the tokens provided and is instructed to add the existing token type to any tokens produced. At this point 401(k) is broken into 401 and k, each with type __TAS__legal2_401_k and flags = 2.
  5. TypeAsSynonymFilterFactory converts the type into a synonym token, but a new ignore attribute allows it to skip the standard "word" type that every token gets by default. Note that this new token will NOT bear the flag set by PatternTypingFilterFactory.
  6. DropIfFlaggedFilterFactory drops every token on which all of the specified flags are set. With dropFlags set to 2, a token arriving with a flags value of 5 is not dropped, but values of 2, 6, 10, etc. are. If dropFlags were set to 3, tokens would be dropped only when both the first and second bits are set, that is, for flags values of 3, 7, 11, etc.
  7. In addition to text_general there is a text_general_lit type that can be used for a text_aqp_literal type, which would be identical to text_aqp except for the field type configured in TokenAnalyzerFilterFactory (omitted for brevity).
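
The flag test in step 6 can be sketched as a bitmask comparison. This is only an illustration of the "all specified flags set" rule as worded in that step, not the filter's actual source. The function name is_dropped is hypothetical.

```python
def is_dropped(flags: int, drop_flags: int) -> bool:
    """Return True when every bit set in drop_flags is also set in flags."""
    return (flags & drop_flags) == drop_flags


# With drop_flags = 2 (binary 10), only values whose second bit is set are dropped.
dropped = [f for f in range(1, 12) if is_dropped(f, 2)]
print(dropped)  # [2, 3, 6, 7, 10, 11]

# A flags value of 5 (binary 101) does not have the second bit set, so it is kept.
print(is_dropped(5, 2))  # False
```

Under this reading, a mask with several bits set (for example 3) drops a token only when all of those bits are present on it.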

Thus you would configure fields like this:

Code Block
<field name="bill_text" type="text_aqp" indexed="true" stored="false" multiValued="true"/

...

>
<field name="bill_text_lit" type="text_aqp_literal" indexed="true" stored="false" multiValued="true"/>

The net result is that both 401k and 401(k) produce __TAS__legal2_401_k and match the same documents, but the analysis does not produce tokens for '401' or 'k', so Rhode Island phone numbers and documents pertaining to Vitamin K do not match.
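
To make that outcome concrete, here is a toy simulation of the chain in Python. It is a rough sketch under stated assumptions, not the Solr implementation. The regular expression and type template stand in for a patterns.txt entry that is not shown here. The sub-analysis is reduced to lowercasing and splitting on non-alphanumerics. The function name analyze is hypothetical.

```python
import re

# Hypothetical stand-in for one patterns.txt entry: (flag bits, regex, type template).
# The real file's contents are not shown in this document.
PATTERNS = [(2, re.compile(r"(\d+)\(?([a-z])\)?"), r"legal2_\g<1>_\g<2>")]
TYPE_PREFIX = "__TAS__"
DROP_FLAGS = 2


def analyze(text):
    """Return the set of terms a simplified version of the chain would emit."""
    emitted = set()
    for raw in text.split():                      # step 1: whitespace tokenization
        token = raw.lower()
        typ, flags = "word", 0
        for bits, regex, template in PATTERNS:    # steps 2-3: pattern typing
            m = regex.fullmatch(token)
            if m:
                typ, flags = TYPE_PREFIX + m.expand(template), bits
                break
        # Step 4: crude sub-analysis; each part inherits the type and flags.
        parts = [(p, flags) for p in re.findall(r"[a-z0-9]+", token)]
        # Step 5: add the type as an unflagged synonym, ignoring the default "word".
        if typ != "word":
            parts.append((typ, 0))
        # Step 6: drop every token that carries all of the DROP_FLAGS bits.
        emitted.update(t for t, f in parts if (f & DROP_FLAGS) != DROP_FLAGS)
    return emitted


print(analyze("401(k)"))  # {'__TAS__legal2_401_k'}
print(analyze("401k"))    # {'__TAS__legal2_401_k'}
```

In this sketch, analyze("Vitamin K") yields only vitamin and k, which share no term with the output for 401(k), matching the behavior described above.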

...