solr as nosql - pulling all docs vs deep paging limitations


solr as nosql - pulling all docs vs deep paging limitations

Petersen, Robert-2
Hi solr users,

We have a new use case where we need to make a pile of data available as XML to a client, and I was thinking we could easily put all this data into a Solr collection and the client could just do a star search (q=*:*) and page through all the results to obtain the data we need to give them.  Then I remembered we currently don't allow deep paging in our current search indexes, as performance declines the deeper you go.  Is this still the case?

If so, is there another approach to make all the data in a collection easily available for retrieval?  The only thing I can think of is to query our DB for all the unique IDs of all the documents in the collection and then pull the documents out in small groups with successive queries like 'UniqueIdField:(id1 OR id2 OR ... OR idn)' 'UniqueIdField:(idn+1 OR idn+2 OR ... etc)'.  That doesn't seem like a very good approach, because the DB might have been updated with new data which hasn't been indexed yet, so all the IDs might not be in there (which may or may not matter, I suppose).

Then I was thinking we could have a field with an incrementing numeric value which could be used to perform range queries as a substitute for paging through everything, i.e. queries like 'IncrementalField:[1 TO 100]' 'IncrementalField:[101 TO 200]'.  But this would be difficult to maintain as we update the index, unless we reindex the entire collection every time we update any docs at all.

Is this perhaps not a good use case for Solr?  Should I use something else, or is there another approach that would work here to allow a client to pull groups of docs in a collection through the REST API until the client has gotten them all?

Thanks
Robi


Re: solr as nosql - pulling all docs vs deep paging limitations

Chris Hostetter-3

: Then I remembered we currently don't allow deep paging in our current
: search indexes as performance declines the deeper you go.  Is this still
: the case?

Coincidentally, I'm working on a new cursor-based API to make this much more
feasible as we speak...

https://issues.apache.org/jira/browse/SOLR-5463

I did some simple perf testing of the strawman approach and posted the
results last week...

http://searchhub.org/coming-soon-to-solr-efficient-cursor-based-iteration-of-large-result-sets/

...current iterations on the patch are to eliminate the
strawman code to improve performance even more and beef up the test
cases.

: If so, is there another approach to make all the data in a collection
: easily available for retrieval?  The only thing I can think of is to
        ...
: Then I was thinking we could have a field with an incrementing numeric
: value which could be used to perform range queries as a substitute for
: paging through everything.  Ie queries like 'IncrementalField:[1 TO
: 100]' 'IncrementalField:[101 TO 200]' but this would be difficult to
: maintain as we update the index unless we reindex the entire collection
: every time we update any docs at all.

As I mentioned in the blog above, as long as you have a uniqueKey field
that supports range queries, bulk exporting of all documents is fairly
trivial: sort on your uniqueKey field and use an fq that also filters on
your uniqueKey field, modifying the fq each time to change the lower
bound to match the highest ID you got on the previous "page".

This approach works really well in simple cases where you want to "fetch
all" documents matching a query and then process/sort them by some other
criteria on the client -- but it's not viable if it's important to you
that the documents come back from Solr in score order before your client
gets them, because you want to "stop fetching" once some criterion is met
in your client.  Example: you have billions of documents matching a query,
you want to fetch all of them sorted by score desc and crunch them on your
client to compute some stats, and once your client-side stat crunching
tells you you have enough results (which might be after the 1000th result,
or might be after the millionth result) then you want to stop.

SOLR-5463 will help even in that latter case.  The bulk of the patch should
be easy to use in the next day or so (having other people try it out and
test it in their applications would be *very* helpful) and will hopefully
show up in Solr 4.7.

-Hoss
http://www.lucidworks.com/
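A minimal sketch of the range-filter export loop Hoss describes, simulated over an in-memory dict standing in for the index. The field name `id`, the `batch_size` parameter, and the data are all illustrative; a real client would issue each "page" as a q/sort/rows/fq request against Solr rather than slicing a local dict:

```python
def export_all(docs_by_id, batch_size):
    """Bulk-export every document by repeatedly raising the lower bound
    on the uniqueKey -- mimics sort=id asc plus fq=id:{last_id TO *]."""
    last_id = None
    out = []
    while True:
        # Simulates one request: q=*:*&sort=id asc&rows=batch_size
        # with fq=id:{last_id TO *] on every request after the first.
        page = sorted(i for i in docs_by_id
                      if last_id is None or i > last_id)[:batch_size]
        if not page:
            break
        out.extend(docs_by_id[i] for i in page)
        last_id = page[-1]  # highest ID on this "page" is the new bound
    return out

catalog = {"sku-%03d" % n: {"sku": "sku-%03d" % n} for n in range(10)}
exported = export_all(catalog, batch_size=3)
```

Note that the sorted-by-id order of the output is, as Hoss says below, just an implementation detail of the batching; the client is free to re-sort.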

Re: solr as nosql - pulling all docs vs deep paging limitations

Mikhail Khludnev
Hoss,

What about SELECT * FROM ... WHERE ...-style "misuse" of Solr? I'm sure
you've been asked about that many times.
What if the client doesn't need to rank results at all, but just wants an
unordered filtered result set, like they are used to in an RDBMS?
Do you feel that will never be considered a reasonable use case for Solr,
or is there a well-known approach for dealing with it?





--
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

<http://www.griddynamics.com>
 <[hidden email]>

RE: solr as nosql - pulling all docs vs deep paging limitations

Petersen, Robert-2
My use case is basically to do a dump of all the contents of the index, with no ordering needed.  It's actually to be a product data export for third parties.  The unique key is the product SKU.  I could take the min SKU and range-query up to the max SKU, but the SKUs are not contiguous, because some get turned off and only some are valid for export, so each range would return a different number of products (which may or may not be acceptable, and I might be able to kind of hide that with some code).



Re: solr as nosql - pulling all docs vs deep paging limitations

Otis Gospodnetić
In reply to this post by Petersen, Robert-2
Hoss is working on it. Search for deep paging or cursor in JIRA.

Otis
Solr & ElasticSearch Support
http://sematext.com/

Re: solr as nosql - pulling all docs vs deep paging limitations

Joel Bernstein
SOLR-5244 is also working in this direction. This focuses on efficient
binary extract of entire search results.





--
Joel Bernstein
Search Engineer at Heliosearch

Re: solr as nosql - pulling all docs vs deep paging limitations

Otis Gospodnetić
Joel - can you please elaborate a bit on how this compares with Hoss'
approach?  Complementary?

Thanks,
Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/



Re: solr as nosql - pulling all docs vs deep paging limitations

Joel Bernstein
They are for different use cases. Hoss's approach, I believe, focuses on
deep paging of ranked search results. SOLR-5244 focuses on the batch export
of an entire unranked search result in binary format. It's basically a very
efficient bulk extract for Solr.






Re: solr as nosql - pulling all docs vs deep paging limitations

Jens Grivolla
In reply to this post by Petersen, Robert-2
You can do range queries without an upper bound and just limit the
number of results. Then you look at the last result to obtain the new
lower bound.

-- Jens





Re: solr as nosql - pulling all docs vs deep paging limitations

Mikhail Khludnev
In reply to this post by Joel Bernstein
Aha! SOLR-5244 is exactly the particular case I'm asking about. I wonder
who else considers it useful?
(I'm sorry if I hijacked the thread.)

Re: solr as nosql - pulling all docs vs deep paging limitations

Chris Hostetter-3
In reply to this post by Mikhail Khludnev
:
: What about SELECT * FROM ... WHERE ...-style "misuse" of Solr? I'm sure
: you've been asked about that many times.
: What if the client doesn't need to rank results at all, but just wants an
: unordered filtered result set, like they are used to in an RDBMS?
: Do you feel that will never be considered a reasonable use case for Solr,
: or is there a well-known approach for dealing with it?

If you don't care about ordering, then the approach I described (either
using SOLR-5463, or just using a sort by uniqueKey with increasing
range filters on the id) should work fine -- the fact that they come back
sorted by id is just an implementation detail that makes it possible to
batch the records (the same way most SQL databases will likely give you
back the docs based on whatever primary key index you have).

I think the key difference between approaches like SOLR-5244 vs the cursor
work in SOLR-5463 is that SOLR-5244 is really targeted at dumping all
data about all docs from a core (matching the query) in a single
request/response -- for something like SolrCloud, the client would
manually need to hit each shard (but as I understand it from the
description, that's kind of the point: it's aiming to be a very low-level
bulk export).  With the cursor approach in SOLR-5463, we do
aggregation across all shards, we support arbitrary sorts, and you can
control the batch size from the client and iterate over multiple
request/responses of that size.  If there are any network hiccups, you can
re-do a request.  If you process half the docs that match (in a
particular order) and then decide "I've got all the docs I need for my
purposes", you can stop requesting the continuation of that cursor.



-Hoss
http://www.lucidworks.com/

Re: solr as nosql - pulling all docs vs deep paging limitations

Chris Hostetter-3
In reply to this post by Jens Grivolla

: You can do range queries without an upper bound and just limit the number of
: results. Then you look at the last result to obtain the new lower bound.

Exactly.  Instead of this:

   First: q=foo&start=0&rows=$ROWS
   After: q=foo&start=$X&rows=$ROWS

...where $ROWS is how big a batch of docs you can handle at one time,
and you increase the value of $X by the value of $ROWS on each successive
request, you can just do this...

   First: q=foo&start=0&rows=$ROWS&sort=id+asc
   After: q=foo&start=0&rows=$ROWS&sort=id+asc&fq=id:{$X TO *]

...where $X is whatever the "last" id you got on the previous page.

Or you can try out the patch in SOLR-5463 and do something like this...

   First: q=foo&start=0&rows=$ROWS&sort=id+asc&cursorMark=*
   After: q=foo&start=0&rows=$ROWS&sort=id+asc&cursorMark=$X

...where $X is whatever "nextCursorMark" you got from the previous page.
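
A minimal client loop for that cursorMark pattern might look like the sketch
below (it assumes the SOLR-5463 patch or a release that ships cursorMark, a
collection at SOLR_URL, and "id" as the uniqueKey -- all placeholders):

```python
import json
import urllib.parse
import urllib.request

# Sketch of a fetch-everything loop over the cursorMark pattern above.
# SOLR_URL, the collection name, and the "id" uniqueKey are placeholders.
SOLR_URL = "http://localhost:8983/solr/collection1/select"

def http_fetch(params):
    """Issue one /select request and return the parsed JSON response."""
    qs = urllib.parse.urlencode(params)
    with urllib.request.urlopen(SOLR_URL + "?" + qs) as resp:
        return json.load(resp)

def fetch_all(q="*:*", rows=500, fetch=http_fetch):
    """Yield every doc matching q, one cursor page per request."""
    cursor = "*"  # "*" starts a fresh cursor
    while True:
        data = fetch({"q": q, "rows": rows, "sort": "id asc",
                      "cursorMark": cursor, "wt": "json"})
        for doc in data["response"]["docs"]:
            yield doc
        if data["nextCursorMark"] == cursor:  # cursor stopped moving: done
            return
        cursor = data["nextCursorMark"]
```

Since each request carries the full cursor state, a failed request can be
retried with the same cursorMark, and the loop can stop early at any point.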



-Hoss
http://www.lucidworks.com/

Re: solr as nosql - pulling all docs vs deep paging limitations

Michael Della Bitta-2
In reply to this post by Mikhail Khludnev
Us too. That's going to be huge for us!

Michael Della Bitta

Applications Developer

o: +1 646 532 3062  | c: +1 917 477 7906

appinions inc.

"The Science of Influence Marketing"

18 East 41st Street

New York, NY 10017

t: @appinions <https://twitter.com/Appinions> | g+:
plus.google.com/appinions<https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts>
w: appinions.com <http://www.appinions.com/>


On Wed, Dec 18, 2013 at 9:55 AM, Mikhail Khludnev <
[hidden email]> wrote:

> Aha! SOLR-5244 is a particular case which I'm asking about. I wonder who
> else consider it useful?
> (I.m sorry if I hijacked the thread)
> 18.12.2013 5:41 пользователь "Joel Bernstein" <[hidden email]>
> написал:
>
> > [ ... earlier messages in the thread trimmed ... ]
>

Re: solr as nosql - pulling all docs vs deep paging limitations

Jonathan Rochkind
In reply to this post by Chris Hostetter-3
On 12/17/13 1:16 PM, Chris Hostetter wrote:
> As I mentioned in the blog above, as long as you have a uniqueKey field
> that supports range queries, bulk exporting of all documents is fairly
> trivial: sort on your uniqueKey field and use an fq that also filters on
> your uniqueKey field, modifying the fq each time to change the lower
> bound to match the highest ID you got on the previous "page".

Aha, very nice suggestion -- I hadn't thought of this when trying to
figure out decent ways to 'fetch all documents matching a query' for
some bulk offline processing.

One question that I was never sure about when trying to do things like
this -- is this going to end up blowing the query and/or document caches
if used on a live Solr?  By filling up those caches with the results of
the 'bulk' export?  If so, is there any way to avoid that? Or does it
probably not really matter?

Jonathan

Re: solr as nosql - pulling all docs vs deep paging limitations

Chris Hostetter-3

: One question that I was never sure about when trying to do things like this --
: is this going to end up blowing the query and/or document caches if used on a
: live Solr?  By filling up those caches with the results of the 'bulk' export?
: If so, is there any way to avoid that? Or does it probably not really matter?

  q={!cache=false}...
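
In request terms, that local param just gets prepended to the query string
for each page of the bulk scan; a sketch (field names and values are
placeholders, and putting the same hint on the fq is a common companion,
not something prescribed above):

```python
import urllib.parse

# Build one page of a bulk-scan request that bypasses the query caches.
# {!cache=false} on q is from the reply above; using it on the fq as
# well is a common companion for the id-range scan.
params = urllib.parse.urlencode({
    "q": "{!cache=false}*:*",
    "fq": "{!cache=false}id:{LAST_SEEN_ID TO *]",
    "sort": "id asc",
    "rows": 500,
})
```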


-Hoss
http://www.lucidworks.com/

Re: solr as nosql - pulling all docs vs deep paging limitations

Mikhail Khludnev
In reply to this post by Chris Hostetter-3
On Wed, Dec 18, 2013 at 8:03 PM, Chris Hostetter
<[hidden email]>wrote:

> :
> : What about SELECT * FROM ... WHERE ... style misuse of Solr? I'm sure
> : you've been asked about that many times.
> : What if the client doesn't need ranked results, but just requests an
> : unordered filtered result set, like they're used to in an RDBMS?
> : Do you feel that will never be considered a reasonable use case for Solr,
> : or is there a well-known approach for dealing with it?
>
> If you don't care about ordering, then the approach I described (either
> using SOLR-5463, or just sorting by uniqueKey with increasing
> range filters on the id) should work fine -- the fact that they come back
> sorted by id is just an implementation detail that makes it possible to
> batch the records

From the functional standpoint that's true, but performance might matter in
these edge cases; e.g., I wonder why a priority queue is needed even if we
request sort=_docid_.

> (the same way most SQL databases will likely give you
> back the docs based on whatever primary key index you have)
>
> I think the key difference between approaches like SOLR-5244 vs the cursor
> work in SOLR-5463 is that SOLR-5244 is really targeted at dumping all
> data about all docs from a core (matching the query) in a single
> request/response -- for something like SolrCloud, the client would
> manually need to hit each shard (but as I understand it from the
> description, that's kind of the point: it's aiming to be a very low-level
> bulk export).  With the cursor approach in SOLR-5463, we do
> aggregation across all shards, we support arbitrary sorts, and you can
> control the batch size from the client and iterate over multiple
> request/responses of that size.  If there are any network hiccups, you can
> re-do a request.  If you process half the docs that match (in a
> particular order) and then decide "I've got all the docs I need for my
> purposes", you can stop requesting the continuation of that cursor.
>
>
>
> -Hoss
> http://www.lucidworks.com/
>



--
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

<http://www.griddynamics.com>
 <[hidden email]>