multiple dateranges/timeslots per doc: modeling openinghours.

multiple dateranges/timeslots per doc: modeling openinghours.

britske
Sorry for the somewhat lengthy post; I would like to make clear that I have covered my bases here and am looking for an alternative solution, because the more trivial solutions don't seem to work for my use-case.

Consider bars, museums, etc.

These places have multiple opening hours that can depend on:
REQ 1. day of week
REQ 2. special days on which they are closed, or otherwise have different opening hours than their regular 'day of week' hours

Now, I want to model these 'places' in a way that lets me do temporal queries like:
- which bars are open NOW (and stay open for at least another 3 hours)
- which museums are (already) open at 25-12-2011 10AM and stay open until (at least) 3PM

I believe having opening/closing hours available for each day at least gives me the data needed to query the above. (Note that having dayOfWeek * openinghours is not enough, because of the special cases in REQ 2.)

Okay, knowing I need openinghours * dates for each place, how would I model this in documents?

OPTION A)
-----------
Considering granularity: I want documents to represent Places and not Places * dates. Although the latter would trivially allow me to do the querying mentioned above, it has these disadvantages:

- The same place is returned multiple times (each with a different date) when queries are not constrained to a date.
- Lots of data needs to be duplicated, all for the conceptually 'simple' functionality of needing multiple date-ranges. It feels bad, and a simpler solution should exist.
- It explodes the resultset (documents = say, 100 dates * 1,000,000 places = 100,000,000). Suddenly the size of the resultset goes from 'easily doable' to 'hmmm, I have to think about this'. Given that places also have some other fields to sort on, Lucene FieldCache mem-usage would explode by a factor of 100.

OPTION B)
----------
Another, faulty, option would be to model opening/closing hours in 2 multivalued date fields, i.e. 'open' and 'close', and insert an open/close pair for each day, e.g.:

open: 2011-11-08:1800 - close: 2011-11-09:0300
open: 2011-11-09:1700 - close: 2011-11-10:0500
open: 2011-11-10:1700 - close: 2011-11-11:0300

And queries would be of the form:

'open < now && close > now+3h'

But since there is no way to indicate that 'open' and 'close' are pairwise related, I will get a lot of false positives; e.g. the above document would be returned for:

open < 2011-11-09:0100 && close > 2011-11-09:0600
because SOME open date is before 2011-11-09:0100 (i.e. 2011-11-08:1800) and SOME close date is after 2011-11-09:0600 (for example 2011-11-11:0300), but these open and close dates are not pairwise related.
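To make this concrete, that faulty filter in Solr query syntax would be something like (a sketch; same field names as above):

  fq=+open:[* TO 2011-11-09T01:00:00Z] +close:[2011-11-09T06:00:00Z TO *]

With multivalued fields each clause can be satisfied by a *different* value, so the document above matches even though no single open/close pair covers that window.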

OPTION C) The best of what I have now:
---------------------------------------
I have been thinking about a totally different approach using Solr dynamic fields, in which each and every opening and closing date gets its own dynamic field, e.g.:

_date_2011-11-08_open: 1800
_date_2011-11-09_close: 0300
_date_2011-11-09_open: 1700
_date_2011-11-10_close: 0500
_date_2011-11-10_open: 1700
_date_2011-11-11_close: 0300

Then the client would know the date to query, and thus the correct fields to query. This would solve the problem, since open and close values are now pairwise related (tied together by the field name), but I fear this can be a big issue from a performance standpoint (especially memory consumption of the Lucene FieldCache).
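As a sketch, this could look something like the following in schema.xml and at query time (field and type names are just placeholders, e.g. a trie int type like 'tint' from the example schema; note a dynamicField wildcard has to sit at the start or end of the pattern, hence '*_open'):

  <dynamicField name="*_open"  type="tint" indexed="true" stored="false"/>
  <dynamicField name="*_close" type="tint" indexed="true" stored="false"/>

  fq=+_date_2011-12-25_open:[* TO 1000] +_date_2011-12-25_close:[1500 TO *]

i.e. for the museum example, the client picks the fields of the requested date and checks open <= 10:00 and close >= 15:00.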


IDEAL OPTION D)
----------------
I'm pretty sure this does not exist out of the box, but Solr could perhaps be extended.
Okay, Solr has a fieldtype 'date', but what if it also had a fieldtype 'Daterange'? A Daterange would be modeled as <DateTimeA,DateTimeB> or <DateTimeA,delta from DateTimeA>.

Then this problem would be really easily modelled as a multivalued field 'openinghours' of type 'Daterange'.
However, I have the feeling that the standard range-query implementation can't be used on this fieldtype, or would perhaps have to be run for each of the N daterange values in 'openinghours'.
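Just to illustrate what I mean, a document and query could then look roughly like this (completely hypothetical syntax, since the fieldtype doesn't exist):

  openinghours: [2011-11-08T18:00 TO 2011-11-09T03:00]
  openinghours: [2011-11-09T17:00 TO 2011-11-10T05:00]

  fq=openinghours:"contains [NOW TO NOW+3HOURS]"

i.e. match documents where at least one stored range fully contains the requested window.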

To make matters worse (I didn't want to introduce this above):
REQ 3: Certain places may have multiple opening-hour ranges / timeslots per day. Consider a museum in Spain which closes around noon for siesta.
OPTION D) would be able to handle this natively; the other options can't.

I would very much appreciate any pointers on:
- how to start with option D, and whether this approach is at all feasible.
- whether option C would suffice (excluding REQ 3), and whether I'm likely to run into performance / memory trouble.
- any other possible solutions I haven't thought of to tackle this.

Thanks a lot.

Cheers,
Geert-Jan




Re: multiple dateranges/timeslots per doc: modeling openinghours.

David Smiley
In case anyone is curious, I responded to him with a solution using either SOLR-2155 (geohash prefix query filter) or LSP (Lucene Spatial Playground): https://issues.apache.org/jira/browse/SOLR-2155?focusedCommentId=13115244#comment-13115244

~ David Smiley
Re: multiple dateranges/timeslots per doc: modeling openinghours.

Chris Hostetter-3
In reply to this post by britske

: Another, faulty, option would be to model opening/closing hours in 2
: multivalued date-fields, i.e: open, close. and insert open/close for each
: day, e.g:
:
: open: 2011-11-08:1800 - close: 2011-11-09:0300
: open: 2011-11-09:1700 - close: 2011-11-10:0500
: open: 2011-11-10:1700 - close: 2011-11-11:0300
:
: And queries would be of the form:
:
: 'open < now && close > now+3h'
:
: But since there is no way to indicate that 'open' and 'close' are pairwise
: related I will get a lot of false positives, e.g the above document would be
: returned for:

This isn't possible out of the box, but the general idea of "position
linked" queries is possible using the same approach as the
FieldMaskingSpanQuery...

https://lucene.apache.org/java/3_4_0/api/core/org/apache/lucene/search/spans/FieldMaskingSpanQuery.html
https://issues.apache.org/jira/browse/LUCENE-1494

..implementing something like this that would work with
(Numeric)RangeQueries however would require some additional work, but it
should certainly be doable -- I've suggested this before but no one has
taken me up on it...
http://markmail.org/search/?q=hoss+FieldMaskingSpanQuery
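
For reference, the term-level building blocks already exist; a rough sketch
along the lines of the FieldMaskingSpanQuery javadoc example (making this
work for *range* queries is the missing piece):

  // org.apache.lucene.search.spans.*; "open" and "close" must be indexed
  // with matching term positions for this to line up
  SpanQuery open  = new SpanTermQuery(new Term("open",  "wed_14_00"));
  SpanQuery close = new FieldMaskingSpanQuery(
                        new SpanTermQuery(new Term("close", "wed_17_30")), "open");
  // slop = -1, unordered: both terms must occur at the same position
  Query samePosition = new SpanNearQuery(new SpanQuery[]{ open, close }, -1, false);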

If we take it as a given that you can do multiple ranges "at the same
position", then you can imagine supporting all of your "regular" hours
using just two fields ("open" and "close") by encoding the day+time of
each range of open hours into them -- even if a store is open for multiple
sets of ranges per day (ie: closed for siesta)...

  open: mon_12_30, tue_12_30, wed_07_30, wed_13_30, ...
  close: mon_20_00, tue_20_30, wed_12_30, wed_22_30, ...

then asking for "stores open now and for the next 3 hours" on "wed" at
"2:13PM" becomes a query for...

sameposition(open:[* TO wed_14_13], close:[wed_17_13 TO *])

For the special case part of your problem, where there are certain dates
that a store will be open for atypical hours, i *think* that could be solved
using some special docs and the new "join" QParser in a filter query...

        https://wiki.apache.org/solr/Join

imagine you have your "regular" docs with all the normal data about a
store, and the open/close fields i describe above.  but in addition to
those, for any store that you know is "closed on dec 25" or "only open
12:00-15:00 on Jan 01" you add an additional small doc encapsulating
the information about the store's closures on that special date - so that
each special case would be its own doc, even if one store had 5 days
where there was a special case...

  specialdoc1:
    store_id: 42
    special_date: Dec-25
    status: closed
  specialdoc2:
    store_id: 42
    special_date: Jan-01
    status: irregular
    open: 09_30
    close: 13_00

then when you are executing your query, you use an "fq" to constrain to
stores that are (normally) open right now (like i mentioned above) and you
use another fq to find all docs *except* those resulting from a join
against these special case docs based on the current date.

so if your query is "open now and for the next 3 hours" and "now" ==
"sunday, 2011-12-25 @ 10:17AM" your query would be something like...

q=...user input...
time=sameposition(open:[* TO sun_10_17], close:[sun_13_17 TO *])
fq={!v=$time}
fq={!join from=store_id to=unique_key v=$vv}
vv=-(+special_date:Dec-25 +(status:closed OR _query_:"{!v=$time}"))

That join based approach for dealing with the special dates should work
regardless of whether someone implements a way to do pairwise
"sameposition()" range queries ... so if you can live w/o the multiple
open/close pairs per day, you can just use the "one field per day of the
week" type approach you mentioned combined with the "join" for special
case days of the year and everything you need should already work w/o any
code (on trunk).

(disclaimer: obviously i haven't tested that query, the exact syntax may
be off but the principle for modeling the "special docs" and using
them in a join should work)


-Hoss
Re: multiple dateranges/timeslots per doc: modeling openinghours.

Mikhail Khludnev
I agree about SpanQueries. It's a viable measure against "false-positive
matches on multivalued fields". We implemented this approach some time ago.
Please find details at
http://blog.griddynamics.com/2011/06/solr-experience-search-parent-child.html

and
http://blog.griddynamics.com/2011/07/solr-experience-search-parent-child.html
We are going to publish a third post about implementation approaches.

--
Mikhail Khludnev


Re: multiple dateranges/timeslots per doc: modeling openinghours.

britske
Interesting! Reading your previous blogposts, I gather that the to-be-posted
'implementation approaches' post includes a way of making the SpanQueries
available within Solr?
Also, with your approach, would (numeric) RangeQueries be possible, as
Hoss suggests?

Looking forward to that 'implementation post'
Cheers,
Geert-Jan

Re: multiple dateranges/timeslots per doc: modeling openinghours.

britske
In reply to this post by Chris Hostetter-3
Thanks Hoss for that in-depth walkthrough.

I like your solution of using (something akin to) FieldMaskingSpanQuery
<https://lucene.apache.org/java/3_4_0/api/core/org/apache/lucene/search/spans/FieldMaskingSpanQuery.html>.
Conceptually the Join-approach looks like it would work on paper, although
I'm not a big fan of introducing a lot of complexity to the frontend /
querying part of the solution.

As an alternative, what about using your FieldMaskingSpanQuery approach
on its own (without the Join approach) and encoding open/close on a per-day
basis?
I didn't mention it, but I 'only' need 100 days of data, which would lead to
100 open and 100 close values, not counting the POIs with multiple
opening-hour ranges per day, which are pretty rare.
The index is rebuilt each night, refreshing the date data.

I'm not sure what the performance implications would be like, but somehow
that feels doable. Perhaps it even offsets the extra time needed for doing
the Joins; only one way to find out, I guess.
A disadvantage would be fewer cache hits when using fq.

Data then becomes:

open: 20111020_12_30, 20111021_12_30, 20111022_07_30, ...
close: 20111020_20_00, 20111021_26_30, 20111022_12_30, ...

Notice the 20111021_26_30, which indicates closing at 2AM the next day;
this keeps the open/close pair on the same day (in contrast to encoding it
as 20111022_02_30).
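
Roughly, the encoding would be something like this (a hypothetical helper,
just to illustrate the >24h trick):

  // encode a close time that falls after midnight as hours >= 24 on the
  // *open* day, so each open/close pair stays comparable within one day
  static String encodeClose(String openDay, int hour, int minute, boolean closesNextDay) {
      int h = closesNextDay ? hour + 24 : hour;          // 02:30 -> 26:30
      return String.format("%s_%02d_%02d", openDay, h, minute);
  }
  // encodeClose("20111021", 2, 30, true)  -> "20111021_26_30"
  // encodeClose("20111020", 20, 0, false) -> "20111020_20_00"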

Alternatively, how would you compare your suggested approach with the
approach by David Smiley using either SOLR-2155 (Geohash prefix query
filter) or LSP:
https://issues.apache.org/jira/browse/SOLR-2155?focusedCommentId=13115244#comment-13115244.
That would work right now, and the LSP-approach seems pretty elegant to me.
FQ-style caching is probably not possible though.

Geert-Jan

Re: multiple dateranges/timeslots per doc: modeling openinghours.

Mikhail Khludnev
In reply to this post by britske
On Mon, Oct 3, 2011 at 3:09 PM, Geert-Jan Brits <[hidden email]> wrote:

> Interesting! Reading your previous blogposts, I gather that the to be
> posted
> 'implementation approaches' includes a way of making the SpanQueries
> available within SOLR?
>

It's going to be posted in two days. But please don't expect much from it;
it's just a proof of concept. It's not production code, nor a contribution;
e.g. we chose a 'quick hack' way of converting boolean queries instead of
XmlQuery, the SurroundParser, a contrib query parser, etc. That is, we can
only share the core ideas, and some of them are possibly wrong.


> Also, would with your approach would (numeric) RangeQueries be possible as
> Hoss suggests?
>

Basically, for numbers, a range query just expands into the set of matching
terms (sometimes that's not great at all). If you encode your terms in a
sortable manner, e.g. A0715 for Monday 07:15, you'll be able to build the
span equivalent of a range by merging those terms: new
SpanOrQuery(new SpanTermQuery(..), ....).
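
For example, something like this (sketch only; enumerating the terms that
fall inside the range is assumed to be done by the caller):

  // org.apache.lucene.search.spans.*
  String[] termsInRange = { "A0700", "A0705", "A0710", "A0715" };  // enumerated range terms
  List<SpanQuery> clauses = new ArrayList<SpanQuery>();
  for (String t : termsInRange) {
      clauses.add(new SpanTermQuery(new Term("open", t)));
  }
  SpanQuery openRange = new SpanOrQuery(clauses.toArray(new SpanQuery[clauses.size()]));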

Regards

Mikhail


Re: multiple dateranges/timeslots per doc: modeling openinghours.

Chris Hostetter-3
In reply to this post by britske

: Conceptually
: the Join-approach looks like it would work from paper, although I'm not a
: big fan of introducing a lot of complexity to the frontend / querying part
: of the solution.

you lost me there -- i don't see how using join would impact the front end
/ query side at all.  your query clients would never even know that a join
had happened (your indexing code would certainly have to know about
creating those special case docs to join against, obviously)

: As an alternative, what about using your fieldMaskingSpanQuery-approach
: solely (without the JOIN-approach)  and encode open/close on a per day
: basis?
: I didn't mention it, but I 'only' need 100 days of data, which would lead to
: 100 open and 100 close values, not counting the pois with multiple
        ...
: Data then becomes:
:
: open: 20111020_12_30, 20111021_12_30, 20111022_07_30, ...
: close: 20111020_20_00, 20111021_26_30, 20111022_12_30, ...

aw hell ... i assumed you needed to support an arbitrarily large number
of special case open+close pairs per doc.

if you only have to support a fixed number (N=100) of open+close values you
could just have N*2 date fields and a BooleanQuery containing N 2-clause
BooleanQueries, each containing range queries against one pair of your date
fields. ie...

  ((+open00:[* TO NOW] +close00:[NOW+3HOURS TO *])
   (+open01:[* TO NOW] +close01:[NOW+3HOURS TO *])
   (+open02:[* TO NOW] +close02:[NOW+3HOURS TO *])
   ...etc...
   (+open99:[* TO NOW] +close99:[NOW+3HOURS TO *]))

...for a lot of indexes, 100 clauses is small potatoes as far as number of
boolean clauses go, especially if many of them are going to short circuit
out because there won't be any matches at all.
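
generating that filter from the client is trivial; a sketch, assuming SolrJ
and fields named open00..open99 / close00..close99:

  SolrQuery solrQuery = new SolrQuery("...user input...");
  StringBuilder fq = new StringBuilder();
  for (int i = 0; i < 100; i++) {
      // one 2-clause sub-query per open/close pair
      fq.append(String.format("(+open%02d:[* TO NOW] +close%02d:[NOW+3HOURS TO *]) ", i, i));
  }
  solrQuery.addFilterQuery(fq.toString().trim());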

: Alternatively, how would you compare your suggested approach with the
: approach by David Smiley using either SOLR-2155 (Geohash prefix query
: filter) or LSP:
: https://issues.apache.org/jira/browse/SOLR-2155?focusedCommentId=13115244#comment-13115244.
: That would work right now, and the LSP-approach seems pretty elegant to me.

I'm afraid i'm totally ignorant of how the LSP stuff works so i can't
really comment there.

If i understand what you mean about mapping the open/close concepts to
lat/lon concepts, then i can see how it would be useful for multiple pairwise
(absolute) date ranges, but i'm not really sure how you would deal
with the diff open+close pairs per day (or on diff days of the week, or
special days of the year) using the lat+lon conceptual model ... I guess
if the LSP stuff supports arbitrary N-dimensional spaces then you could
model day of week as a dimension .. but it still seems like you'd need
multiple fields for the special case days, right?

How it would compare performance wise: no idea.


-Hoss
Re: multiple dateranges/timeslots per doc: modeling openinghours.

britske
On 11 October 2011 03:21, Chris Hostetter
<[hidden email]> wrote:

>
> : Conceptually
> : the Join-approach looks like it would work from paper, although I'm not a
> : big fan of introducing a lot of complexity to the frontend / querying
> part
> : of the solution.
>
> you lost me there -- i don't see how using join would impact the front end
> / query side at all.  your query clients would never even know that a join
> had happened (your indexing code would certianly have to know about
> creating those special case docs to join against obviuosly)
>
> : As an alternative, what about using your fieldMaskingSpanQuery-approach
> : solely (without the JOIN-approach)  and encode open/close on a per day
> : basis?
> : I didn't mention it, but I 'only' need 100 days of data, which would lead
> to
> : 100 open and 100 close values, not counting the pois with multiple
>         ...
> : Data then becomes:
> :
> : open: 20111020_12_30, 20111021_12_30, 20111022_07_30, ...
> : close: 20111020_20_00, 20111021_26_30, 20111022_12_30, ...
>
> aw hell ... i assumed you needed to suport an arbitrarily large number
> of special case open+close pairs per doc.
>

I didn't express myself well. A POI can have multiple open+close pairs per
day, but each night I only index the coming 100 days. So MOST POIs will have
100 open+close pairs (one opening-hours range per day), but some have more.

>
> if you only have to support a fix value (N=100) open+close values you
> could just have N*2 date fields and a BooleanQuery containing N 2-clause
> BooleanQueries contain ranging queries against each pair of your date
> fields. ie...
>
>  ((+open00:[* TO NOW] +close00:[NOW+3HOURS TO *])
>   (+open01:[* TO NOW] +close01:[NOW+3HOURS TO *])
>   (+open02:[* TO NOW] +close02:[NOW+3HOURS TO *])
>   ...etc...
>   (+open99:[* TO NOW] +close99:[NOW+3HOURS TO *]))
>
> ...for a lot of indexes, 100 clauses is small potatoes as far as number of
> boolean clauses go, especially if many of them are going to short circut
> out because there won't be any matches at all.
>

Given that I need multiple open+close pairs per day, this can't be used
directly.

However, when setting a logical upper bound on the maximum number of
opening-hour ranges per day (say 3), which would be possible, this could be
extended to: open00 = day 0 --> open00-0 = day 0 timeslot 0, open00-1 =
day 0 timeslot 1, etc.

So,

  ((+open00-0:[* TO NOW] +close00-0:[NOW+3HOURS TO *])
   (+open00-1:[* TO NOW] +close00-1:[NOW+3HOURS TO *])
   (+open00-2:[* TO NOW] +close00-2:[NOW+3HOURS TO *])
   (+open01-0:[* TO NOW] +close01-0:[NOW+3HOURS TO *])
   (+open01-1:[* TO NOW] +close01-1:[NOW+3HOURS TO *])
   (+open01-2:[* TO NOW] +close01-2:[NOW+3HOURS TO *])
   ...etc...
   (+open99-2:[* TO NOW] +close99-2:[NOW+3HOURS TO *]))

This would need 2*3*100 = 600 dynamic fields to cover the opening hours. You
mention this is peanuts for constructing a BooleanQuery, but how about
memory consumption?
I'm particularly concerned about the Lucene FieldCache getting populated for
each of the 600 fields. (I had some nasty OOM experiences with that in the
past; 2-3 years ago the memory consumption of the Lucene FieldCache couldn't
be controlled. I'm not sure how that is now, to be honest.)

I will not be sorting on any of the 600 dynamic fields, btw. Instead I will
only use them as part of the above BooleanQuery, which I will likely define
as a filter query.
Just to be sure: in this situation the Lucene FieldCache won't be touched,
correct? If so, this will probably be a good, workable solution!


> : Alternatively, how would you compare your suggested approach with the
> : approach by David Smiley using either SOLR-2155 (Geohash prefix query
> : filter) or LSP:
> :
> https://issues.apache.org/jira/browse/SOLR-2155?focusedCommentId=13115244#comment-13115244
> .
> : That would work right now, and the LSP-approach seems pretty elegant to
> me.
>
> I'm afraid i'm totally ignorant of how the LSP stuff works so i can't
> really comment there.
>
> If i understand what you mean about mapping the open/close concepts to
> lat/lon concepts, then i can see how it would be useful for multiple pair
> wise (absolute) date ranges, but i'm not really sure how you would deal
> with the diff open+close pairs per day (or on diff days of hte week, or
> special days of the year) using the lat+lon conceptual model ... I guess
> if the LSP stuff supports arbitrary N-dimensional spaces then you could
> model day or week as a dimension .. but it still seems like you'd need
> multiple fields for the special case days, right?
>

I planned to do the following using LSP (with help from David):

Each <open,close> tuple would be modeled as a point (x,y), with x = open
and y = close.
So a POI can have many (100 or more) points, each representing
an <open,close> tuple.

Given a 100-day lookahead and a granularity of 5 minutes, we can map
dimensions x and y to [0,30000].

E.g:
- indexing starts at / baseline is at: 2011-11-01:0000
- poi open: 2011-11-08:1800 - poi close: 2011-11-09:0300
- (query): user visit: 2011-11-08:2300 - user depart: 2011-11-09:0200

Would map to:
- poi open: 2520 - poi close: 2628 =  point(x,y) = (2520,2628)
- (query):user visit: 2580 - user depart: 2616 = bbox filter with the
ranges x:[0 TO 2580], y:[2616 TO 30000]

All POIs that have one or more points within the bbox are returned.
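
The mapping itself is just integer arithmetic, something like (sketch):

  // 5-minute slots since the baseline; 100 days * 288 slots/day = 28800 < 30000
  static int toSlot(long baselineMillis, long timeMillis) {
      return (int) ((timeMillis - baselineMillis) / (5L * 60L * 1000L));
  }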

Both approaches seem pretty good to me. I'll be testing both soon.

Thanks!
Geert-Jan




Re: multiple dateranges/timeslots per doc: modeling openinghours.

Chris Hostetter-3

: This would need 2*3*100 = 600 dynamicfields to cover the openinghours. You
: mention this is peanuts for constructing a booleanquery, but how about
: memory consumption?
: I'm particularly concerned about the Lucene FieldCache getting populated for
: each of the 600 fields. (Since I had some nasty OOM experiences with that in
: the past. 2-3 years ago memory consumption of Lucene FieldCache couldn't be
: controlled, I'm not sure how that is now to be honest)
:
: I will not be sorting on any of the 600 dynamicfields btw. Instead I will
: only use them as part of the above booleanquery, which I will likely define
: as a Filter Query.
: Just to be sure, in this situation, Lucene FieldCache won't be touched,
: correct? If so, this will probably be a good workable solution!

correct.  searching on fields doesn't use the FieldCache (unless you are
doing a function query - you aren't in this case) so the memory usage of
FieldCache wouldn't be a factor here at all.


-Hoss