
...

It has the following constructs:

Name             | Symbol           | Example          | Explanation
SHOULD           | ~                | ~foo             | Explicitly overrides the default operator to enforce SHOULD logic.
MUST_NOT         | !                | !foo             | Chosen over '-' to reduce conflicts with hyphenated words; because it is not an operator at the end of a token, there is no conflict with exclamations (Spanish uses an inverted ! in front).
MUST             | +                | +foo             | Similar to the standard query parser.
ANALYZED_PHRASE  | ""               | "foo"            | Phrase search with full analysis, including synonyms.
LITERAL_PHRASE   | ''               | 'foo'            | Phrase search with reduced analysis (see below for details).
GROUP            | ()               | (foo bar)        | Applies the default operator (or another specified operator) to the terms within the parentheses and causes them to be considered as a unit.
DISTANCE         | n/#()            | n/3(foo bar)     | Specifies a span query where foo and bar occur (in either order) within 3 tokens of each other.
ORDERED_DISTANCE | w/#()            | w/4(foo bar)     | Specifies a span query where foo and bar occur within 4 tokens of each other, with foo occurring before bar.
PREFIX           | *                | foo*             | Specifies a prefix search matching any token starting with 'foo'; default settings require at least 3 prefix characters.
FIELD            | :                | title:foo        | Searches the title field for foo.
RANGE            | :[ TO ], :{ TO } | votes:[0 TO 10}  | Typical Lucene range searches on text, date, or numeric data; inclusive and exclusive bounds are supported as in the standard parser.
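
As a quick illustration of how these constructs might combine in a single query (the field names and terms here are invented, and the assumption that constructs compose freely, as in the standard parser, is mine rather than stated above):

Code Block
+title:foo !bar n/3(foo baz) votes:[0 TO 10}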





Several elements of other syntaxes are intentionally omitted:

...

"Literal" searches are performed by appending appending _lit to the field for the literal search. This is treated as a fielded phrase search on an alternate field (i.e. _text__lit or title_lit) so the following two searches are equivalent:

Code Block
title:'foo and bar'
title_lit:"foo and bar"

This does impose requirements on the indexing strategy, but this is an "Advanced" feature (hence the name!) so that's OK. The result is that "literal" search can be as literal or as analyzed as desired, depending on the configuration of the corresponding _lit field.

...

One of the major goals of this parser is to enable a configuration that can apply synonyms to punctuated constructs that have significance to the user but are typically destroyed by the existing parsers. An example configuration of a field type to achieve this (anticipating the use of this parser) looks like this:

Code Block
<fieldType name="text_aqp" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.PatternTypingFilterFactory" patternFile="patterns.txt"/>
    <filter class="solr.TokenAnalyzerFilterFactory" asType="text_general" preserveType="true"/>
    <filter class="solr.TypeAsSynonymFilterFactory" prefix="__TAS__" synFlagsMask="0" ignore="word"/>
    <filter class="solr.DropIfFlaggedFilterFactory" dropFlags="2"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/> <!-- query parser already handles splitting -->
    <filter class="solr.PatternTypingFilterFactory" patternFile="patterns.txt"/>
    <filter class="solr.TokenAnalyzerFilterFactory" asType="text_en_aqp" preserveType="true"/>
    <filter class="solr.TypeAsSynonymFilterFactory" prefix="__TAS__" synFlagsMask="0" ignore="word"/>
    <filter class="solr.DropIfFlaggedFilterFactory" dropFlags="2"/>
  </analyzer>
</fieldType>
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>
<fieldType name="text_general_lit" class="solr.TextField" positionIncrementGap="100" multiValued="true">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

---- patterns.txt ----

2 (\d+)\(?([a-z])\)? ::: legal2_$1_$2
2 (\d+)\(?([a-z])\)?\(?(\d+)\)? ::: legal3_$1_$2_$3
2 C\+\+ ::: c_plus_plus
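
Working those patterns by hand, some hypothetical inputs and the replacements they would produce (assuming each pattern must match the entire token and the first matching line wins; 501(c)(3) is an invented example for the 3-capture pattern):

Code Block
401(k)     ->  legal2_401_k
501(c)(3)  ->  legal3_501_c_3
C++        ->  c_plus_plus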


There's a lot to unpack there, so starting from the top:

  1. The advanced query parser always splits on whitespace, so the whitespace tokenizer is used at index time to ensure that index-time and query-time tokens correspond.
  2. PatternTypingFilterFactory matches incoming tokens against the 2nd (whitespace-delimited) column of patterns.txt in sequence, so the token C++ would match the last line and 401(k) would match the first line.
  3. Upon matching 401(k), PatternTypingFilter adds the type attribute __TAS__legal2_401_k and sets the second bit of the flags attribute (as determined by the first column of patterns.txt). The purpose of the __TAS__ prefix is to avoid any case in which a token from the text might coincide with a synonym token when this type is converted into a token later on. The ::: in patterns.txt is just a separator, to make it easier to see where patterns end and replacements begin.
  4. TokenAnalyzerFilterFactory conducts the text_general analysis on the tokens provided and is instructed to add the existing token type to any tokens produced. At this point 401(k) is broken into 401 and k, each with type __TAS__legal2_401_k and flags = 2.
  5. TypeAsSynonymFilterFactory injects the type as a synonym token, and a new ignore attribute allows it to skip the standard "word" type that every token gets by default. Note that this new synonym token will NOT bear the flag set by PatternTypingFilterFactory (its flags are masked to 0 via synFlagsMask="0").
  6. DropIfFlaggedFilterFactory drops every token that has all of the specified flags set. So with dropFlags="2", a token arriving with a flags value of 5 will not be dropped, but values of 2, 6, 10, etc. would be dropped. If dropFlags were set to 3, only tokens whose flags have both low bits set (3, 7, 11, 15, etc.) would be dropped.
  7. In addition to text_general there is a text_general_lit type, used by a text_aqp_literal field type that is identical to text_aqp except for the type configured in TokenAnalyzerFilterFactory (text_aqp_literal itself is omitted above for brevity; a sketch follows this list).
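
A minimal sketch of that omitted text_aqp_literal type, assuming it mirrors text_aqp line for line with TokenAnalyzerFilterFactory pointed at text_general_lit. The original text_aqp uses text_en_aqp on its query side, so using text_general_lit on both sides here is an assumption, not part of the original design:

Code Block
<!-- Hypothetical: the text_aqp_literal type omitted above for brevity.
     Assumed identical to text_aqp except that TokenAnalyzerFilterFactory
     delegates to text_general_lit instead of text_general. -->
<fieldType name="text_aqp_literal" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.PatternTypingFilterFactory" patternFile="patterns.txt"/>
    <filter class="solr.TokenAnalyzerFilterFactory" asType="text_general_lit" preserveType="true"/>
    <filter class="solr.TypeAsSynonymFilterFactory" prefix="__TAS__" synFlagsMask="0" ignore="word"/>
    <filter class="solr.DropIfFlaggedFilterFactory" dropFlags="2"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/> <!-- query parser already handles splitting -->
    <filter class="solr.PatternTypingFilterFactory" patternFile="patterns.txt"/>
    <filter class="solr.TokenAnalyzerFilterFactory" asType="text_general_lit" preserveType="true"/>
    <filter class="solr.TypeAsSynonymFilterFactory" prefix="__TAS__" synFlagsMask="0" ignore="word"/>
    <filter class="solr.DropIfFlaggedFilterFactory" dropFlags="2"/>
  </analyzer>
</fieldType>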

Thus you would configure fields like this:

Code Block
<field name="bill_text" type="text_aqp" indexed="true" stored="false" multiValued="true"/

...

>
<field name="bill_text_lit" type="text_aqp_literal" indexed="true" stored="false" multiValued="true"/>

The net result is that both 401k and 401(k) produce __TAS__legal2_401_k and match the same documents, but the analysis does not produce tokens for '401' or 'k', so Rhode Island phone numbers and documents pertaining to vitamin K do not match.
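
Put together, and assuming the pipeline behaves exactly as described in steps 1 through 6 above, a hypothetical trace of the index-time analysis of 401(k) would look like:

Code Block
WhitespaceTokenizer:   401(k)
PatternTypingFilter:   401(k)                       [type=__TAS__legal2_401_k, flags=2]
TokenAnalyzerFilter:   401  k                       [each: type=__TAS__legal2_401_k, flags=2]
TypeAsSynonymFilter:   401  k  __TAS__legal2_401_k  [synonym token: flags=0]
DropIfFlaggedFilter:   __TAS__legal2_401_k          [401 and k dropped: flags & 2 == 2]

(401k matches the same pattern, so its analysis reduces to the same single token.)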

...