RE: Re: Replacing FAST functionality at sesam.no - ShingleFilter+exact matching

steve_rowe
Hi mck,

On 09/09/2008 at 12:58 PM, Mck wrote:

> > *ShortVersion*
> >  is there a way to make the ShingleFilter perform exact matching via
> > inserting ^ $ begin/end markers?
>
> Reading through the mailing list i see how exact matching can
> be done, a la STFW to myself...
>
> So the ShortVersion now stands:
>
> For my query "abcd efgh ijkl"
> Why does a (perfect looking) MultiPhraseQuery with
> termArrays[0] = { list_entry_shingles:abcd
>  list_entry_shingles:abcd efgh
>  list_entry_shingles:abcd efgh ijkl
> }
> termArrays[1] = { list_entry_shingles:efgh
>  list_entry_shingles:efgh ijkl
> }
> termArrays[2] = { list_entry_shingles:ijkl }
>
> return only "abcd efgh ijkl" !?
>
> (when the field is indexed as TextField this is the only hit i get)
> (when the field is indexed as StrField i get zero hits!)
>
> When the index contains 9 entries:
>  "abcd efgh ijkl", "abcd efgh", "efgh ijkl", "abcd", "efgh",
> "ijkl", "ijkl efgh", "efgh abcd", and "ijkl efgh abcd".
>
> Does this MultiPhraseQuery actually require a match against
> *every* item in each termArray on any document?

I've never used MultiPhraseQuery, but I *think* (based on the Javadocs) that it requires one match from each termArrays[] entry, at consecutive positions and in the same order as the termArrays[] entries (unless you add slop, which I don't think you're doing).

A TextField index would have ("abcd", "efgh", "ijkl") for the "abcd efgh ijkl" document (assuming you used WhitespaceAnalyzer, which I believe you showed in one of your emails); unlike all of the other documents, one member from each of your query's termArrays[] entries is sequentially present, so I think that the behavior you're seeing is expected.  If you add "abcd efgh ijkl mnop" as a document, it should also be matched.
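
To make that rule concrete, here is a minimal sketch in plain Java. It is a toy model of the matching rule as described above, not Lucene code: the class MultiPhraseSketch and its methods are made up for illustration, and every document is assumed to be split on single spaces.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;

// Toy model of the rule described above; it does not use Lucene.
class MultiPhraseSketch {
    // The query from the quoted message, one set of alternatives per position.
    static final List<Set<String>> QUERY = List.of(
        Set.of("abcd", "abcd efgh", "abcd efgh ijkl"),
        Set.of("efgh", "efgh ijkl"),
        Set.of("ijkl"));

    // The nine index entries from the quoted message.
    static final List<String> DOCS = List.of(
        "abcd efgh ijkl", "abcd efgh", "efgh ijkl", "abcd", "efgh",
        "ijkl", "ijkl efgh", "efgh abcd", "ijkl efgh abcd");

    // True if, for some start offset, the token at start + i is one of the
    // alternatives in termArrays.get(i), for every i (no slop).
    static boolean matches(List<String> docTokens, List<Set<String>> termArrays) {
        for (int start = 0; start + termArrays.size() <= docTokens.size(); start++) {
            boolean all = true;
            for (int i = 0; i < termArrays.size(); i++) {
                if (!termArrays.get(i).contains(docTokens.get(start + i))) {
                    all = false;
                    break;
                }
            }
            if (all) {
                return true;
            }
        }
        return false;
    }

    // Whitespace-tokenizes each document and keeps the ones that match.
    static List<String> hits(List<String> docs, List<Set<String>> termArrays) {
        List<String> out = new ArrayList<>();
        for (String doc : docs) {
            if (matches(Arrays.asList(doc.split(" ")), termArrays)) {
                out.add(doc);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(hits(DOCS, QUERY)); // prints [abcd efgh ijkl]
    }
}
```

Under the same toy model, a field indexed as one untokenized term has only a single position, so a three-position query cannot match it, which would be consistent with the zero hits reported for StrField.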

Looks to me like MultiPhraseQuery is getting in the way.  Shingles that begin at the same word are given the same position by ShingleFilter, and Solr's FieldQParserPlugin creates a MultiPhraseQuery when it encounters tokens in a query with the same position.  I think what you want is to convert queries into shingle disjunctions (*any* matching shingle results in a hit), right?
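
The position sharing can be illustrated with a small sketch in plain Java. It only imitates the grouping of shingles by start word and is not ShingleFilter itself; the class ShinglePositionSketch, the method shinglesByPosition and the maxSize parameter are hypothetical names chosen for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: groups shingles by the word they start at,
// which is the position they would share in the query.
class ShinglePositionSketch {
    // For each start word, returns the shingles of 1..maxSize words
    // that begin at that word.
    static List<List<String>> shinglesByPosition(String text, int maxSize) {
        String[] words = text.split(" ");
        List<List<String>> byPosition = new ArrayList<>();
        for (int start = 0; start < words.length; start++) {
            List<String> here = new ArrayList<>();
            StringBuilder shingle = new StringBuilder();
            for (int n = 0; n < maxSize && start + n < words.length; n++) {
                if (n > 0) {
                    shingle.append(' ');
                }
                shingle.append(words[start + n]);
                here.add(shingle.toString());
            }
            byPosition.add(here);
        }
        return byPosition;
    }

    public static void main(String[] args) {
        // One inner list per position, mirroring termArrays[0..2] in the quoted query.
        System.out.println(shinglesByPosition("abcd efgh ijkl", 3));
    }
}
```

Each inner list corresponds to one termArrays[] entry of the quoted query, which is why a phrase-style query with three positions comes out of a three-word input.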

Any Solr cognoscenti know how to arrange for Solr's query parser to avoid invoking MultiPhraseQuery?

Steve

Re: Replacing FAST functionality at sesam.no - ShingleFilter+exact matching

michaelsembwever

> Looks to me like MultiPhraseQuery is getting in the way.  Shingles
> that begin at the same word are given the same position by
> ShingleFilter, and Solr's FieldQParserPlugin creates a
> MultiPhraseQuery when it encounters tokens in a query with the same
> position.  I think what you want is to convert queries into shingle
> disjunctions (*any* matching shingle results in a hit),  right?

Yes, you're right, Steve. Thank you.

One way, I see now, to get the behaviour I want is to set the unigrams'
positionIncrement to zero instead of one.

For example, in ShingleFilter.fillOutputBuffer(..), replace the two
occurrences of
> .setPositionIncrement(1);
with
> .setPositionIncrement(0);

Then I end up with a MultiPhraseQuery with
        termArrays[0] = { list_entry_shingles:abcd
                          list_entry_shingles:abcd efgh
                          list_entry_shingles:abcd efgh ijkl
                          list_entry_shingles:efgh
                          list_entry_shingles:efgh ijkl
                          list_entry_shingles:ijkl }

and it works perfectly :-)
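
The effect of that change can be sketched in plain Java. This is a toy model that groups query tokens by position and does not call Lucene or Solr; the class UnigramIncrementSketch, its methods and the unigramIncrement parameter are hypothetical. The exactHits step additionally assumes that each list entry is indexed as one untokenized term, which is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only; it imitates the grouping of query tokens by position.
class UnigramIncrementSketch {
    // Emits shingles of 1..maxSize words for each start word, unigram first.
    // The unigram advances the position by unigramIncrement (hypothetical
    // parameter); longer shingles advance it by 0. Tokens that end up at
    // the same position form one term array.
    static List<List<String>> termArrays(String text, int maxSize, int unigramIncrement) {
        String[] words = text.split(" ");
        Map<Integer, List<String>> byPosition = new LinkedHashMap<>();
        int position = -1;
        for (int start = 0; start < words.length; start++) {
            StringBuilder shingle = new StringBuilder();
            for (int n = 0; n < maxSize && start + n < words.length; n++) {
                if (n > 0) {
                    shingle.append(' ');
                } else {
                    position += unigramIncrement;
                }
                shingle.append(words[start + n]);
                byPosition.computeIfAbsent(position, k -> new ArrayList<>())
                          .add(shingle.toString());
            }
        }
        return new ArrayList<>(byPosition.values());
    }

    // Assumption: each entry is indexed as one untokenized term, so a single
    // term array matches an entry exactly when it contains that entry.
    static List<String> exactHits(List<String> entries, List<String> termArray) {
        List<String> hits = new ArrayList<>();
        for (String entry : entries) {
            if (termArray.contains(entry)) {
                hits.add(entry);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> entries = List.of(
            "abcd efgh ijkl", "abcd efgh", "efgh ijkl", "abcd", "efgh",
            "ijkl", "ijkl efgh", "efgh abcd", "ijkl efgh abcd");
        // Increment 1: three term arrays. Increment 0: one array of six shingles.
        System.out.println(termArrays("abcd efgh ijkl", 3, 1));
        List<List<String>> single = termArrays("abcd efgh ijkl", 3, 0);
        System.out.println(single);
        System.out.println(exactHits(entries, single.get(0)));
    }
}
```

With a single term array, the query in this model behaves like a disjunction over all six shingles.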

I see no way of configuring this behaviour, though.
If it is possible and someone can say how, that would be a real godsend.

Otherwise, would a patch to ShingleFilter that offers an option
"unigramPositionIncrement" (defaulting to 1) be likely to be accepted
into trunk?

~mck

--
"Between two evils, I always pick the one I never tried before." Mae West
| semb.wever.org | sesat.no | sesam.no |
