querying using filter query and lots of possible values

Daniel Brügge-2
Hi,

I am facing the following issue:

I have a couple of million documents, each with a field called "source_id".
I want to retrieve all the documents that have a source_id in a specific
range of values. This range can be pretty big, for example a list of 200
to 2000 source IDs.

I was thinking that a filter query could be used, like
fq=source_id:(1 2 3 4 5 6 .....), but this reminds me of SQL's
WHERE IN (...), which was always a bit slow for a huge number of values.

Another solution that came to mind was to assign a new kind of "filter id"
to all the documents I want to retrieve, so that all the documents I want
to analyse get a new ID. But I would need to update all the millions of
documents to assign them this ID, which could take some time.

Can you think of a nicer way to solve this issue?

Regards & greetings

Daniel

Re: querying using filter query and lots of possible values

Chantal Ackermann-2
Hi Daniel,

Index the ID into a field of type tint or tlong and use a range query (http://wiki.apache.org/solr/SolrQuerySyntax?highlight=%28rangequery%29):

fq=id:[200 TO 2000]

If you want to exclude certain IDs, it might be wiser to add an exclusion query alongside the range query instead of listing all the single values; otherwise you will run into problems with overly long request URLs. If you cannot avoid long lists of values, you may also need to increase maxBooleanClauses (see http://wiki.apache.org/solr/SolrConfigXml/#The_Query_Section), which caps the number of clauses in a boolean query.
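
As a rough sketch of the range-plus-exclusion idea, the Python below builds two fq values and URL-encodes them as a form body, so that they can be sent with POST instead of in a long GET URL. The helper names, the `id` field and the excluded values are assumptions made for illustration; nothing here comes from the thread beyond the range filter itself.

```python
from urllib.parse import urlencode


def range_with_exclusions(field, low, high, excluded=()):
    """Return fq values: one range filter plus an optional exclusion filter."""
    filters = [f"{field}:[{low} TO {high}]"]
    if excluded:
        # A separate, purely negative fq removes the listed values again.
        values = " ".join(str(v) for v in excluded)
        filters.append(f"-{field}:({values})")
    return filters


def post_body(q, filters):
    """URL-encode q and every fq as a form body suitable for a POST request."""
    params = [("q", q)] + [("fq", f) for f in filters]
    return urlencode(params)


# Hypothetical example values.
filters = range_with_exclusions("id", 200, 2000, excluded=[250, 300])
body = post_body("*:*", filters)
print(filters)
print(body)
```

Sending the parameters in a POST body sidesteps the URL length limit, but it does not lift the maxBooleanClauses limit on long value lists.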

Cheers,
Chantal


Re: querying using filter query and lots of possible values

Daniel Brügge-2
Hey Chantal,

Thanks for your answer.

Range queries would not work, because the values are not consecutive.
They can be randomly ordered, with gaps. The above was just an example.

Excluding is also not a solution, because the list of excluded IDs would be
even longer.

To be more specific: the IDs are not even integers, but UUIDs. There are
tens of thousands of them, and the document pool contains hundreds of
millions of documents.
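
To make the size problem concrete, here is a minimal sketch that builds the explicit-list filter for UUID-like values and refuses to build it once the list exceeds a clause limit. The limit of 1024 is the value shipped in Solr's example solrconfig.xml and may differ on a given installation; the function name is hypothetical.

```python
# Assumed default; check maxBooleanClauses in your own solrconfig.xml.
DEFAULT_MAX_BOOLEAN_CLAUSES = 1024


def id_list_filter(field, ids, max_clauses=DEFAULT_MAX_BOOLEAN_CLAUSES):
    """Build field:("a" OR "b" ...) and fail early if the list is too long."""
    ids = list(ids)
    if len(ids) > max_clauses:
        raise ValueError(
            f"{len(ids)} values exceed the clause limit of {max_clauses}"
        )
    # Quote each value so that hyphens in UUIDs are not parsed as operators.
    quoted = " OR ".join(f'"{value}"' for value in ids)
    return f"{field}:({quoted})"


print(id_list_filter("source_id", [
    "0f8fad5b-d9cb-469f-a165-70867728950e",
    "7c9e6679-7425-40de-944b-e07fc1f90ae7",
]))
```

With tens of thousands of UUIDs, this check fails against the assumed default, which is why the list approach needs either a raised limit or one of the alternatives discussed in this thread.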

Thanks. Daniel




Re: querying using filter query and lots of possible values

Alexandre Rafalovitch
In reply to this post by Daniel Brügge-2
You can't update the original documents except by reindexing them, so
there is no easy group assignment option.

If you create this 'collection' once but query it multiple times, you
may be able to use the Solr 4 join, with the IDs stored separately and
joined on. That is still not great, because performance is an issue when
joining on IDs:
http://www.lucidimagination.com/blog/2012/06/20/solr-and-joins/ .
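
A join-based filter along these lines might look like the following sketch. It assumes a hypothetical side core named `id_lists` whose documents each carry a `source_id` and a `list_name` field; only the `{!join from=... to=... fromIndex=...}` local-params syntax belongs to Solr, everything else is made up for illustration.

```python
def join_filter(list_name, from_index="id_lists",
                from_field="source_id", to_field="source_id",
                list_field="list_name"):
    """Build an fq that keeps main-index docs whose to_field matches the
    from_field of side-core docs belonging to the named list."""
    local_params = (
        f"{{!join from={from_field} to={to_field} fromIndex={from_index}}}"
    )
    return f"{local_params}{list_field}:{list_name}"


# Hypothetical list name.
print(join_filter("campaign_42"))
```

The appeal of this layout is that changing the ID list only means reindexing the small side core, not the main documents; the join cost mentioned above still applies at query time.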

If the list is some sort of combination of smaller lists, you could
probably precompute those fragments at index time and run a compound
query over them.

But if you have to query every time and the list is different every
time, that could be complicated.

Regards,
   Alex.

Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all
at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
book)



Re: querying using filter query and lots of possible values

Daniel Brügge-2
Thanks Alexandre,

The list of IDs is constant for a longer time. I will take a look at
the join approach.
Maybe another solution would be to create a whole new collection, or set
of documents, containing the aggregated documents (selected by the IDs)
from scratch, and to execute queries on this collection. Building it
would take some time, but that may be worth it for the easier querying.

Daniel


Re: querying using filter query and lots of possible values

Chantal Ackermann-2
In reply to this post by Daniel Brügge-2
Hi Daniel,

Depending on how you decide on the list of IDs in the first place, you could also create a new index (core) and populate it with DIH, selecting only the documents from your main index (core) that are in this list of IDs. For updates, you could try a delta import.

Of course, this is only worth the effort if that core will exist for some time, but you've written that the subset of IDs stays constant for a longer time.

Just another idea on top ;-)
Chantal

Re: querying using filter query and lots of possible values

Daniel Brügge-2
Exactly. Creating a new index from the aggregated documents is the plan
I described above. I don't really know how long this will take for each
new index. Hopefully under an hour or so; that would be tolerable.

Thanks. Daniel


Re: querying using filter query and lots of possible values

Chris Hostetter-3
In reply to this post by Daniel Brügge-2

: The list of IDs is constant for a longer time. I will take a look at
: the join approach.
: Maybe another solution would be to create a whole new collection, or set
: of documents, containing the aggregated documents (selected by the IDs)
: from scratch, and to execute queries on this collection. Building it
: would take some time, but that may be worth it for the easier querying.

Another avenue to consider...

http://lucene.apache.org/solr/api-4_0_0-ALPHA/org/apache/solr/schema/ExternalFileField.html

...would allow you to map values in your "source_id" to some numeric
values (many to many), and these numeric values would then be accessible in
functions -- so you could use something like fq={!frange ...} to select
all docs with value 67, where your external file field says that value 67
is mapped to the following thousand source_id values.

The external file can then be modified at any time; the changes are picked
up just by doing a commit on your index.
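
A minimal sketch of this idea, under several assumptions: an ExternalFileField named `filter_group` is declared in the schema with `source_id` as its key field, its data file holds one `key=value` line per source_id, and the file lives in the index data directory under the name `external_filter_group`. The field name, the group numbers and the short IDs are hypothetical; as far as I understand the file format, each key carries a single numeric value, so a source_id that belongs to several groups would need one such field per grouping.

```python
def external_file_content(group_by_source_id):
    """Render the key=value lines for an external file, sorted by key."""
    return "".join(
        f"{source_id}={group}\n"
        for source_id, group in sorted(group_by_source_id.items())
    )


def frange_filter(field, value):
    """Build an fq selecting docs whose function value equals `value`."""
    return f"{{!frange l={value} u={value}}}{field}"


# Hypothetical mapping: two source IDs in group 67, one in group 12.
mapping = {"uuid-b": 67, "uuid-a": 67, "uuid-c": 12}
print(external_file_content(mapping), end="")
print(frange_filter("filter_group", 67))
```

Switching to a different ID list then means rewriting the file and committing, without reindexing any documents.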



-Hoss

Re: querying using filter query and lots of possible values

Daniel Brügge-2
Hi,

Thanks for this hint. I will check it out; it sounds promising.

Daniel
