Avoiding false-positives in multivalued field search with intervals?

Dawid Weiss-2
Hi Alan,

You're the expert here so I thought I'd ask before I jump in deep. Do
you think it's feasible to solve the following multivalued-field
problem:

doc: field=["foo", "bar"]
query: field:(foo AND bar)

I'd like the above to return zero hits (no single value contains both
foo and bar), but since multi-valued fields are logically indexed as a
single field, it returns the doc. I recognize this as a well-known problem
but subdocuments are not fun to deal with so I'd like to avoid them at
all costs.

Would it be possible to solve the above with intervals? Say, something
like this:

Intervals.containing(valuePositionRanges(), query).

I assume the containment relationship would get rid of false positives
crossing a value boundary here. The problem is in how to construct those
value position ranges... Store them at index-construction time
somehow? Compute them on the fly for anything that has a chance to
match the query? Your thoughts would be much appreciated.

Dawid
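
The false positive described above can be reproduced with a small plain-Java model. This is only a sketch with no Lucene dependency: the class and method names are made up for illustration, and the model ignores analysis and scoring. It contrasts a document-level AND (every term occurs somewhere in the field) with the desired per-value semantics.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative model only, not Lucene code.
final class FlattenedFieldModel {
    // Concatenate all values of a multi-valued field into one token list,
    // which is roughly how the field looks to a term-level AND query.
    static List<String> flatten(List<List<String>> values) {
        List<String> tokens = new ArrayList<>();
        for (List<String> value : values) {
            tokens.addAll(value);
        }
        return tokens;
    }

    // Document-level AND: every term occurs somewhere in the field.
    static boolean matchesAnywhere(List<List<String>> values, String... terms) {
        Set<String> present = new HashSet<>(flatten(values));
        return present.containsAll(Arrays.asList(terms));
    }

    // Desired semantics: some single value contains every term.
    static boolean matchesWithinOneValue(List<List<String>> values, String... terms) {
        for (List<String> value : values) {
            if (new HashSet<>(value).containsAll(Arrays.asList(terms))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<List<String>> doc = Arrays.asList(Arrays.asList("foo"), Arrays.asList("bar"));
        System.out.println(matchesAnywhere(doc, "foo", "bar"));       // true (the false positive)
        System.out.println(matchesWithinOneValue(doc, "foo", "bar")); // false (the wanted answer)
    }
}
```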

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Re: Avoiding false-positives in multivalued field search with intervals?

Alan Woodward-3
I’ve solved this sort of thing in the past by indexing boundary tokens, and wrapping the queries with the equivalent of Intervals.notContaining(query, boundary-query); you could also put a very large position increment gap and use a width filter, but that’s a bit more error prone if you could conceivably have lots of text in the individual field entries.
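
A rough plain-Java model of the boundary-token idea follows. It is a sketch under assumptions, not Lucene code: the marker token `#boundary#` and all names are hypothetical, and the loop models only the outcome of the query (is there a run of tokens that holds every term and no boundary?), not how interval iterators work. With the real API the equivalent would presumably be along the lines of `Intervals.notContaining(Intervals.unordered(...), Intervals.term(boundary))` wrapped in an interval query.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative model only, not Lucene code. BOUNDARY is a hypothetical marker token.
final class BoundaryTokenModel {
    static final String BOUNDARY = "#boundary#";

    // Index-time step: join the values, inserting a boundary token between them.
    static List<String> indexWithBoundaries(List<List<String>> values) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i < values.size(); i++) {
            if (i > 0) {
                tokens.add(BOUNDARY);
            }
            tokens.addAll(values.get(i));
        }
        return tokens;
    }

    // Query-time step: true if some run of tokens contains every term
    // and no boundary token in between.
    static boolean matchesWithoutCrossingBoundary(List<String> tokens, String... terms) {
        Set<String> wanted = new HashSet<>(Arrays.asList(terms));
        Set<String> seen = new HashSet<>();
        for (String token : tokens) {
            if (token.equals(BOUNDARY)) {
                seen.clear(); // a match may not span this position
            } else if (wanted.contains(token)) {
                seen.add(token);
                if (seen.size() == wanted.size()) {
                    return true;
                }
            }
        }
        return false;
    }
}
```

The marker has to be a token that can never occur in real field content, otherwise values would be split in the wrong places.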

> On 10 Sep 2020, at 10:38, Dawid Weiss <[hidden email]> wrote:
>
> Would it be possible to solve the above with intervals? Say, something
> like this:
>
> Intervals.containing(valuePositionRanges(), query).

Re: Avoiding false-positives in multivalued field search with intervals?

Dawid Weiss-2
Yeah... I was thinking about adding synthetic boundaries but this
seems... impure. :) Another quick reflection is that I'd have to
somehow translate the original query (which can be arbitrarily
complex) into an interval query. Tough.

D.

On Thu, Sep 10, 2020 at 12:22 PM Alan Woodward <[hidden email]> wrote:

>
> I’ve solved this sort of thing in the past by indexing boundary tokens, and wrapping the queries with the equivalent of Intervals.notContaining(query, boundary-query); you could also put a very large position increment gap and use a width filter, but that’s a bit more error prone if you could conceivably have lots of text in the individual field entries.

Re: Avoiding false-positives in multivalued field search with intervals?

jim ferenczi
You could set a very high position increment gap for multi-valued fields (Analyzer#getPositionIncrementGap) and perform something
like Intervals.maxwidth(pos_gap - 1, Intervals.unordered(...))?
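
The gap-plus-width-filter idea can be modelled in plain Java as follows. This is a sketch under stated assumptions rather than Lucene code: all names are hypothetical, the gap is assumed to be added on top of the normal increment between values, and width is counted here as `end - start + 1`. The last case in `main` illustrates the caveat Alan raised: a value longer than the width limit produces a false negative.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative model only, not Lucene code.
final class PositionGapModel {
    static final class Token {
        final String term;
        final int position;

        Token(String term, int position) {
            this.term = term;
            this.position = position;
        }
    }

    // Assign positions; every value after the first is shifted by an extra `gap`.
    static List<Token> assignPositions(List<List<String>> values, int gap) {
        List<Token> tokens = new ArrayList<>();
        int position = -1;
        for (int v = 0; v < values.size(); v++) {
            if (v > 0) {
                position += gap;
            }
            for (String term : values.get(v)) {
                position++;
                tokens.add(new Token(term, position));
            }
        }
        return tokens;
    }

    // Width of the narrowest window covering all terms, or -1 if a term is missing.
    static int minimalWidth(List<Token> tokens, String... terms) {
        Set<String> wanted = new HashSet<>(Arrays.asList(terms));
        int best = -1;
        for (int i = 0; i < tokens.size(); i++) {
            Set<String> seen = new HashSet<>();
            for (int j = i; j < tokens.size(); j++) {
                if (wanted.contains(tokens.get(j).term)) {
                    seen.add(tokens.get(j).term);
                }
                if (seen.size() == wanted.size()) {
                    int width = tokens.get(j).position - tokens.get(i).position + 1;
                    if (best < 0 || width < best) {
                        best = width;
                    }
                    break;
                }
            }
        }
        return best;
    }

    // Width filter: accept only if the narrowest window fits into maxWidth.
    static boolean matchesWithinWidth(List<Token> tokens, int maxWidth, String... terms) {
        int width = minimalWidth(tokens, terms);
        return width >= 0 && width <= maxWidth;
    }

    public static void main(String[] args) {
        int gap = 100;
        List<Token> twoValues = assignPositions(
            Arrays.asList(Arrays.asList("foo"), Arrays.asList("bar")), gap);
        System.out.println(matchesWithinWidth(twoValues, gap - 1, "foo", "bar")); // false

        List<String> longValue = new ArrayList<>();
        longValue.add("foo");
        for (int i = 0; i < 148; i++) {
            longValue.add("x");
        }
        longValue.add("bar");
        List<Token> oneLongValue = assignPositions(Arrays.asList(longValue), gap);
        // Both terms are in the same value, but 150 positions apart: rejected.
        System.out.println(matchesWithinWidth(oneLongValue, gap - 1, "foo", "bar")); // false
    }
}
```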


On Thu, 10 Sep 2020 at 12:32, Dawid Weiss <[hidden email]> wrote:
> Yeah... I was thinking about adding synthetic boundaries but this
> seems... impure. :) Another quick reflection is that I'd have to
> somehow translate the original query (which can be arbitrarily
> complex) into an interval query. Tough.
>
> D.

Re: Avoiding false-positives in multivalued field search with intervals?

Dawid Weiss-2
Yup - similar to what Alan suggested. I'd have to rewrite the (general
text-to-query) query parser to only use intervals though. Still
thinking about possible approaches to this.

D.

On Thu, Sep 10, 2020 at 3:58 PM jim ferenczi <[hidden email]> wrote:

>
> You could set a very high position increment gap for multi-valued fields (Analyzer#getPositionIncrementGap) and perform something
> like Intervals.maxwidth(pos_gap - 1, Intervals.unordered(...))?

Re: Avoiding false-positives in multivalued field search with intervals?

jim ferenczi
Right, I misunderstood Alan's answer. The boundary option is not "impure" in my opinion. It solves this issue nicely but maybe it needs something more packaged to add the boundaries and build queries easily.

On Thu, 10 Sep 2020 at 16:16, Dawid Weiss <[hidden email]> wrote:
> Yup - similar to what Alan suggested. I'd have to rewrite the (general
> text-to-query) query parser to only use intervals though. Still
> thinking about possible approaches to this.
>
> D.

Re: Avoiding false-positives in multivalued field search with intervals?

Dawid Weiss-2
I am fine with the boundary token suggestion, actually. What I don't
see at the moment is how I can marry it with the output of a general
query parser (which returns any Query). I could attempt to
process the query node tree from the standard query parser (which we're
using at the moment anyway), but if the tree becomes complex there is
no guarantee I can extract subtrees that can be parsed into
IntervalsSource instances (and then in turn into an IntervalQuery).

Dawid

On Thu, Sep 10, 2020 at 4:28 PM jim ferenczi <[hidden email]> wrote:

>
> Right, I misunderstood Alan's answer. The boundary option is not "impure" in my opinion. It solves this issue nicely but maybe it needs something more packaged to add the boundaries and build queries easily.

Re: Avoiding false-positives in multivalued field search with intervals?

jim ferenczi
Ok so the more general question is whether we need an interval query parser

On Thu, 10 Sep 2020 at 17:28, Dawid Weiss <[hidden email]> wrote:
> I am fine with the boundary token suggestion, actually. What I don't
> see at the moment is how I can marry it with the output of a general
> query parser (which returns any Query). I could attempt to
> process the query node tree from the standard query parser (which we're
> using at the moment anyway), but if the tree becomes complex there is
> no guarantee I can extract subtrees that can be parsed into
> IntervalsSource instances (and then in turn into an IntervalQuery).
>
> Dawid

Re: Avoiding false-positives in multivalued field search with intervals?

Dawid Weiss-2
> Ok so the more general question is whether we need an interval query parser

Oh, to this I'd say: yes, yes, yes.

I didn't have much prior experience writing frontend apps on top of
Solr/Lucene, but once I had to go that route it quickly turned out that
several things that are readily available at the code level are so darn
difficult to achieve and integrate from the outside. Specifically:

- Field expansion in query parsers is a must (so that unqualified terms
are expanded over multiple fields). Any query parser that doesn't
support this is, in my opinion, of zero use. The "default" copy-to sink
field known from Solr brings more problems than it solves.

- Exact match-region hit highlighting is a strong expectation. I solved
this with the matches API (see LUCENE-9461) and the flexible query
parser's multifield expansion. Works like a charm.

- Multivalued fields are common and sub-document handling is a pain. The
problem I raised here is a result of direct user feedback. In real life
multivalued fields are omnipresent and searches over those fields can be
complex. Users see hits that just should not be there and are confused.

- People do use complex queries. Maybe not all people, but there are
people out there who do... Just recently I extended the flexible query
parser with a handcrafted min-should-match operator because it is
otherwise not accessible in any Lucene query parser (!). I can make this
code available (it's not terribly complex), although, since you asked, I
think a query parser that exposes all sorts of "higher level"
functionality of intervals would be very, very useful.

It may end up that I'll have to write something for intervals anyway,
so we can work on this together if you like. The syntax especially is an
open question: should it be operator-based (like the current boost or
fuzzy operators) or meta-function-based (so that pseudo-functions would
be available)? Or maybe a mix of both? I don't know, really. :)

Dawid

Re: Avoiding false-positives in multivalued field search with intervals?

Michael Sokolov-4
A slightly different but related topic is how to manage lots of fields.

I agree that sub-fields are a pain and that mashing everything
together in an all-field is a mess, but for best performance with a
large number of fields/sub-fields, it is the only workable option I
can see? Expanding a query over numerous fields grows combinatorially
in the number of fields (if I want my query to match when all terms
match in *some* field), doesn't it?

I would like to see a mechanism for defining sub-fields using
positions. Together with an absolute positional query this would
enable both match-any-field as well as field-specific matching with
each token indexed only once (multi-values are possible within this
with boundary tokens or big enough position ranges, as Alan
suggested). It does mean that the sub-field boundaries have to be
managed somehow. Without index support, you can set an arbitrarily large
size for your sub-field and insert position gaps at the boundaries,
but maybe we could detect the largest sub-field at flush time and
write that metadata somewhere in the index to enable smaller gaps?
Another issue is differing analysis for the sub-fields, and properly
updating the positions during analysis: at the boundaries (you don't
want to insert a gap, but rather advance to a fixed position, and you
have to index sub-fields in order). Maybe we could make it less horrible
by adding better support for it.
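
The fixed-position-range variant of this idea comes down to simple arithmetic, sketched below in plain Java. Everything here is hypothetical: the class name, the notion of a fixed `blockSize` per sub-field, and the numbering of sub-fields in indexing order are assumptions made for illustration. Nothing like this exists in Lucene as far as this thread indicates.

```java
// Illustrative model only: each sub-field owns a fixed block of positions.
final class SubFieldPositions {
    private final int blockSize;

    SubFieldPositions(int blockSize) {
        if (blockSize <= 0) {
            throw new IllegalArgumentException("blockSize must be positive");
        }
        this.blockSize = blockSize;
    }

    // First position reserved for the given sub-field (numbered in indexing order).
    int startOf(int subField) {
        return subField * blockSize;
    }

    // Absolute position of the token at `offset` inside the given sub-field.
    int positionOf(int subField, int offset) {
        if (offset < 0 || offset >= blockSize) {
            throw new IllegalArgumentException("sub-field does not fit into its block");
        }
        return startOf(subField) + offset;
    }

    // Which sub-field a non-negative absolute position belongs to.
    int subFieldOf(int position) {
        return position / blockSize;
    }

    // Field-specific matching: the interval lies entirely inside one given sub-field.
    boolean within(int subField, int start, int end) {
        return 0 <= start && start <= end
            && subFieldOf(start) == subField && subFieldOf(end) == subField;
    }

    // Match-any-field: the interval does not straddle two sub-fields.
    boolean withinAnySingleSubField(int start, int end) {
        return 0 <= start && start <= end && subFieldOf(start) == subFieldOf(end);
    }
}
```

Choosing `blockSize` is the hard part in practice: too small and a long sub-field overflows its block, too large and the position space is wasted, which is why recording the largest sub-field at flush time is suggested above.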

Re: query parsing; wasn't there at one time an interval query parser?
It had operators like w() and n() IIRC

On Thu, Sep 10, 2020 at 4:20 PM Dawid Weiss <[hidden email]> wrote:

>
> > Ok so the more general question is whether we need an interval query parser
>
> Oh, to this I'd say: yes, yes, yes.

Re: Avoiding false-positives in multivalued field search with intervals?

Gus Heck
You're thinking of the Surround query parser for span queries, I think...
and the Advanced Query Parser will have a similar syntax.

On Thu, Sep 10, 2020 at 4:40 PM Michael Sokolov <[hidden email]> wrote:
Re: query parsing; wasn't there at one time an interval query parser?
It had operators like w() and n() IIRC

On Thu, Sep 10, 2020 at 4:20 PM Dawid Weiss <[hidden email]> wrote:
>
> > Ok so the more general question is whether we need an interval query parser
>
> Oh, to this I'd say: yes, yes, yes.
>
> I didn't have much prior experience writing frontend apps on top of
> Solr/Lucene but once I did have
> to go that route it quickly turns out that several things that are
> readily available from code-level
> are so darn difficult to achieve and integrate from the outside. Specifically:
>
> - Field expansion in query parsers is a must (so that unqualified
> terms are expanded over multiple fields).
> Any query parser that doesn't support this is in my opinion of zero
> use. The "default" copy-to sink field known
> from Solr brings more problems than it solves.
>
> - Exact match-region hit highlighting is a strong expectation. I
> solved this with matches API (see LUCENE-9461)
> and flexible query parser's multifield expansion. Works like a charm.
>
> - Multivalued fields are common and sub-document handling is a pain.
> The problem I raised here is a result of
> direct user feedback. In real life multivalued fields are omnipresent
> and searches over those fields can be complex.
> Users see hits that just should not be there and are confused.
>
> - People do use complex queries. Maybe not all people but there are
> people out there who do... Just recently I extended
> flexible query parser with a handcrafted min-should-match operator
> because it is otherwise not accessible in any Lucene
> query parser (!). I can make this code available (it's not terribly
> complex), although, since you asked, I think a query parser that
> exposes all sorts of "higher level" functionality of intervals would
> be very, very useful.
>
> It may end up that I'll have to write something for intervals anyway
> so we can work on this together if you like.
> Especially the syntax is an open question - should it be
> operator-based (like the current boost of fuzzy operators) or
> meta-function-based (so that pseudo-functions would be available). Or
> maybe a mix of both? I don't know, really. :)
>
> Dawid
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>


Re: Avoiding false-positives in multivalued field search with intervals?

Dawid Weiss-2
In reply to this post by Michael Sokolov-4
> Expanding a query over numerous fields grows combinatorically
> in the number of fields (if I want my query to match when all terms
> match in *some* field), doesn't it?

I don't think it does? It grows linearly with the number of fields? In
my experience the number of fields searchable "by default" is typically
limited - it's not *all* fields - it's just a subset that constitutes
the "text body" of a document. Of course everyone's experience will vary
depending on the application.
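
To make the linear-growth argument concrete, here is a minimal sketch in plain Java (not Lucene's query parser API). It assumes one reading of "all terms match in *some* field": an OR over fields of an AND over terms. The class and method names (FieldExpansion, expand, termClauseCount) are made up for illustration.

```java
import java.util.List;
import java.util.stream.Collectors;

/** Plain-Java model of expanding unqualified terms over several fields. */
final class FieldExpansion {

    /** Builds "(f1:t1 AND f1:t2) OR (f2:t1 AND f2:t2) ..." as a query string. */
    static String expand(List<String> fields, List<String> terms) {
        return fields.stream()
                .map(f -> terms.stream()
                        .map(t -> f + ":" + t)
                        .collect(Collectors.joining(" AND ", "(", ")")))
                .collect(Collectors.joining(" OR "));
    }

    /** Every field contributes one clause per term, so the total is fields x terms. */
    static int termClauseCount(List<String> fields, List<String> terms) {
        return fields.size() * terms.size();
    }

    public static void main(String[] args) {
        List<String> fields = List.of("title", "body");
        List<String> terms = List.of("foo", "bar");
        System.out.println(expand(fields, terms));
        System.out.println(termClauseCount(fields, terms));
    }
}
```

Under this reading, adding a field adds one more parenthesised group with as many clauses as there are terms, so the expansion grows with the product of the two counts rather than combinatorially.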

> Re: query parsing; wasn't there at one time an interval query parser? It had operators like w() and n() IIRC

I've tried that but it's really unusable unless the queries are
automated - the syntax is difficult to use; mistakes cause cryptic
parse errors and are hard to recover from.

Dawid



Re: Avoiding false-positives in multivalued field search with intervals?

Michael Gibney
This might be a little outside the spirit of this discussion (in that
it's not really "off-the-shelf") -- but I implemented a
proof-of-concept for a different use case that I think could be
adapted here:

For a given doc, for each term in your multivalued field, you could
record a bitset representation of the indexes of the individual field
values in which that term appears; then, in the conjunction DISI
(DocIdSetIterator) over different terms, intersect the bitsets of those
terms to speed up determining whether the terms appear in the same
field value. You could put the bitset representation, e.g., in the
Payload for the first position of each term, or for more general-purpose
use, in polyField/subfield DocValues, or whatever.
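
A minimal sketch of the bitset idea, in plain Java rather than against Lucene's API. It assumes each value of the multivalued field gets an index (0, 1, 2, ...) and that every term keeps a java.util.BitSet of the value indexes it occurs in. The names (ValueBitsets, sameValue) are hypothetical, and where the bitsets would actually be stored (payloads, DocValues) is left out.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java model of per-term "which values contain me" bitsets for one document. */
final class ValueBitsets {
    // term -> bitset of value indexes (within the multivalued field) containing the term
    private final Map<String, BitSet> termToValues = new HashMap<>();

    /** Assumption: values are already analyzed into whitespace-separated tokens. */
    ValueBitsets(List<String> values) {
        for (int i = 0; i < values.size(); i++) {
            for (String term : values.get(i).split("\\s+")) {
                if (!term.isEmpty()) {
                    termToValues.computeIfAbsent(term, t -> new BitSet()).set(i);
                }
            }
        }
    }

    /** True only if at least one single value contains every one of the given terms. */
    boolean sameValue(String... terms) {
        BitSet common = null;
        for (String term : terms) {
            BitSet bits = termToValues.get(term);
            if (bits == null) {
                return false;                    // term is absent from the document
            }
            if (common == null) {
                common = (BitSet) bits.clone();  // copy so the stored bitset is not mutated
            } else {
                common.and(bits);                // keep only values shared by all terms so far
            }
            if (common.isEmpty()) {
                return false;                    // no single value can satisfy the conjunction
            }
        }
        return common != null;                   // false when no terms were given
    }

    public static void main(String[] args) {
        ValueBitsets doc = new ValueBitsets(List.of("foo", "bar", "foo baz"));
        System.out.println(doc.sameValue("foo", "bar"));
        System.out.println(doc.sameValue("foo", "baz"));
    }
}
```

The intersection only answers the conjunction question (do all terms share a value), which is the case this thread is about; positional constraints inside a value would still need a separate check.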

It seems like everyone's on the same page more-or-less, but I'll
explicitly note: this feels superficially a little like a "special
case", as it addresses only the "conjunction" case ... but for
avoiding false-positives in the multivalued-field case, arguably the
conjunction case *is* the general case.

Michael



Re: Avoiding false-positives in multivalued field search with intervals?

Dawid Weiss-2
Thanks Michael. The clear outcome of this discussion seems to be that
everyone is trying to reinvent the wheel somehow. ;) I think it really
should become part of core Lucene functionality. It seems like a corner
case people are not aware of until they hit it (and then it's not
clear what to do about it).

Dawid



Re: Avoiding false-positives in multivalued field search with intervals?

Chris Hostetter-3
In reply to this post by Dawid Weiss-2

(caveat: I don't really understand what Intervals are at the Lucene
feature-set level)

: Yup - similar to what Alan suggested. I'd have to rewrite the (general
: text-to-query) query parser to only use intervals though. Still
: thinking about possible approaches to this.
        ...
: > You could set a very high position increment gap for multi-valued
: > fields (Analyzer#getPositionIncrementGap) and perform something
: > like Intervals.maxWidth(Intervals.unordered(...), pos_gap-1) ?

I'm assuming from your response that the issue here is really that you
want to *directly* support the syntax you mentioned...

: >> > > doc: field=["foo", "bar"]
: >> > > query: field:(foo AND bar)

...and identify *when* the parser encounters a "boolean" expression
preceded by the "fieldName:" syntax, and *then* treat that specially.

ie: this seems 100% like a query parser question, and not at all like a
"what does the query structure look like after parsing" question.

Because if you can adjust your parser syntax, this literally just
becomes:  ' field:"foo bar"~N '   ...  where N is the positionIncrementGap
on your analyzer ... OR ... ' field:"foo bar" ' ... if you call
setPhraseSlop on your QueryParser.
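
A rough plain-Java model of the gap-plus-width idea (it does not call Lucene). It assumes the first token of each later value lands at the previous value's last position + gap + 1, and it keeps the allowed span strictly below the gap, in the spirit of the quoted maxWidth(..., pos_gap-1) suggestion. The names (GapPositions, minSpan, withinOneValue) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java model: token positions of a multivalued field with a position gap. */
final class GapPositions {

    /** Assigns positions to whitespace-separated tokens, adding 'gap' between values. */
    static Map<String, List<Integer>> index(List<String> values, int gap) {
        Map<String, List<Integer>> positions = new HashMap<>();
        int pos = -1;
        for (int v = 0; v < values.size(); v++) {
            if (v > 0) {
                pos += gap;                           // modelled gap between two values
            }
            for (String term : values.get(v).split("\\s+")) {
                if (term.isEmpty()) {
                    continue;
                }
                pos++;
                positions.computeIfAbsent(term, t -> new ArrayList<>()).add(pos);
            }
        }
        return positions;
    }

    /** Smallest (last - first) position span covering one occurrence of every term, or -1. */
    static int minSpan(Map<String, List<Integer>> positions, String... terms) {
        List<int[]> hits = new ArrayList<>();         // {position, index into 'terms'}
        for (int t = 0; t < terms.length; t++) {
            List<Integer> ps = positions.get(terms[t]);
            if (ps == null) {
                return -1;                            // a term never occurs
            }
            for (int p : ps) {
                hits.add(new int[] {p, t});
            }
        }
        hits.sort((a, b) -> Integer.compare(a[0], b[0]));
        int[] count = new int[terms.length];
        int covered = 0;
        int best = -1;
        int left = 0;
        for (int right = 0; right < hits.size(); right++) {
            if (count[hits.get(right)[1]]++ == 0) {
                covered++;
            }
            while (covered == terms.length) {         // shrink the window from the left
                int span = hits.get(right)[0] - hits.get(left)[0];
                if (best < 0 || span < best) {
                    best = span;
                }
                if (--count[hits.get(left)[1]] == 0) {
                    covered--;
                }
                left++;
            }
        }
        return best;
    }

    /** Accepts a match only if its span is smaller than the gap, so it cannot cross values. */
    static boolean withinOneValue(Map<String, List<Integer>> positions, int gap, String... terms) {
        int span = minSpan(positions, terms);
        return span >= 0 && span < gap;
    }
}
```

In this model, field=["foo", "bar"] with a gap of 100 puts foo at 0 and bar at 101, so the pair is rejected, while a single value "foo bar" is accepted. A value longer than the gap could be rejected wrongly, which is the weakness of this approach that was pointed out earlier in the thread.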

i *THINK* the crux of your question/problem is that -- from the point of
view of the QueryParserBase/BooleanQueryNodeBuilder, these 2 input strings
are treated identically by the time any "subclass" has a chance to do anything
interesting with them...

        field:(foo AND bar)
        field:foo AND field:bar

...so you can't for instance, build an Interval / sloppy Phrase query from
the first, while building a 2 clause boolean query from the second.

So maybe the "solution" (at least for the flexible parser ... IIUC, I
haven't used it much) would be for BooleanQueryNode to carry some metadata
indicating that there was a "fieldName:" prefix on it, so that the
BooleanQueryNodeBuilder can choose to use that information to do something
"special" if the "List<QueryNode> clauses" are all simple TermNodes (in
the same field)

        ?


-Hoss
http://www.lucidworks.com/


Re: Avoiding false-positives in multivalued field search with intervals?

Dawid Weiss-2
Hi Chris,

> Because if you can adjust your parser syntax, this literallyly just
> becomes:  ' field:"foo bar"~N '   ...  where N is the positionIncrementGap
> on your analyzer ... OR ... ' field:"foo bar" ' ... if you call
> setPhraseSlop on your QueryParser.

Yes - correct. This would be equivalent to what others suggested with
intervals (search for a fixed-length phrase and filter out false
positives by leveraging position increments between values).

I think the second solution is somewhat more flexible - index a
sentinel token between values and ensure it's not part of the hit
range. This allows you to use any type of interval query underneath,
which is nice.
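
The sentinel variant can be sketched in plain Java as follows. This is only a model of the idea, not Lucene code: the sentinel string and the names (SentinelBoundaries, tokens, matchesWithoutCrossing) are hypothetical, and the check covers just the "all terms, no boundary inside the hit" case.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Plain-Java model: a sentinel token is indexed between the values of a field. */
final class SentinelBoundaries {
    // Hypothetical sentinel; assumed never to occur in real text
    static final String BOUNDARY = "__boundary__";

    /** Flattens the values into one token list with a sentinel between values. */
    static List<String> tokens(List<String> values) {
        List<String> out = new ArrayList<>();
        for (int v = 0; v < values.size(); v++) {
            if (v > 0) {
                out.add(BOUNDARY);
            }
            for (String term : values.get(v).split("\\s+")) {
                if (!term.isEmpty()) {
                    out.add(term);
                }
            }
        }
        return out;
    }

    /** True if some run of tokens containing no sentinel covers all the given terms. */
    static boolean matchesWithoutCrossing(List<String> tokens, String... terms) {
        Set<String> wanted = new HashSet<>(Arrays.asList(terms));
        if (wanted.isEmpty()) {
            return false;
        }
        Set<String> seen = new HashSet<>();
        for (String token : tokens) {
            if (token.equals(BOUNDARY)) {
                seen.clear();                        // a hit may not span the sentinel
            } else if (wanted.contains(token)) {
                seen.add(token);
                if (seen.size() == wanted.size()) {
                    return true;                     // all terms found inside one value
                }
            }
        }
        return false;
    }
}
```

Unlike the gap-based check, nothing here depends on how long the individual values are, which is part of why the sentinel approach feels more flexible.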

> So maybe the "solution" (at least for the flexible parser ... IIUC, I
> haven't used it much) would be for BooleanQueryNode to carry some metadata [...]

I haven't reached the phase of modifying the flexible query parser for
my use case but it's definitely going to work something like you
suggest. I think I'm going to rewrite the node tree from the syntax
parser into interval queries though (either in full or in part). I'll
see what can be done there.

Dawid
