Cross index join query performance

Peter Keegan
I'm doing a cross-core join query and the join query is 30X slower than
each of the 2 individual queries. Here are the queries:

Main query: http://localhost:8983/solr/mainindex/select?q=title:java
QTime: 5 msec
hit count: 1000

Sub query: http://localhost:8983/solr/subindex/select?q=+fld1:[0.1 TO 0.3]
QTime: 4 msec
hit count: 25K

Join query:
http://localhost:8983/solr/mainindex/select?q=title:java&fq={!join from=docid to=docid fromIndex=subindex}fld1:[0.1 TO 0.3]
QTime: 160 msec
hit count: 205
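
The join URL above contains spaces, braces and brackets, which need URL-encoding when the request is sent from code rather than pasted into a browser. Here is a minimal sketch using only Python's standard library. It assumes the parameter form echoed in the debug output ({!join from=docid to=docid fromIndex=subindex}); the helper name build_join_url is hypothetical and not part of any Solr client.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_join_url(base, main_query, from_index, from_field, to_field, sub_query):
    """Return a /select URL whose fq is a {!join} filter, fully URL-encoded."""
    # Local params for the join parser, in the form shown by debugQuery.
    local_params = f"{{!join from={from_field} to={to_field} fromIndex={from_index}}}"
    params = {
        "q": main_query,
        "fq": local_params + sub_query,
        "debugQuery": "true",  # ask Solr for the join timing statistics
    }
    return f"{base}/select?{urlencode(params)}"


url = build_join_url(
    "http://localhost:8983/solr/mainindex",
    "title:java", "subindex", "docid", "docid", "fld1:[0.1 TO 0.3]",
)
print(url)
```

Decoding the query string again should give back the original fq text, which is a quick way to check that nothing was lost in encoding.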

Here are the index specs:

mainindex size: 117K docs, 1 segment
mainindex schema:
   <field name="docid" type="int" indexed="true" stored="true"
required="true" multiValued="false" />
   <field name="title" type="text_en_splitting" indexed="true"
stored="true" multiValued="false" />
   <uniqueKey>docid</uniqueKey>

subindex size: 117K docs, 1 segment
subindex schema:
   <field name="docid" type="int" indexed="true" stored="true"
required="true" multiValued="false" />
   <field name="fld1" type="float" indexed="true" stored="true"
required="false" multiValued="false" />
   <uniqueKey>docid</uniqueKey>

With debugQuery=true I see:
  "debug":{
    "join":{
      "{!join from=docid to=docid fromIndex=subindex}fld1:[0.1 TO 0.3]":{
        "time":155,
        "fromSetSize":24742,
        "toSetSize":24742,
        "fromTermCount":117810,
        "fromTermTotalDf":117810,
        "fromTermDirectCount":117810,
        "fromTermHits":24742,
        "fromTermHitsTotalDf":24742,
        "toTermHits":24742,
        "toTermHitsTotalDf":24742,
        "toTermDirectCount":24627,
        "smallSetsDeferred":115,
        "toSetDocsAdded":24742}},

Via profiler and debugger, I see 150 msec spent in the outer
'while(term!=null)' loop in: JoinQueryWeight.getDocSet(). This seems like a
lot of time to join the bitsets. Does this seem right?
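
One reading of the counters above: fromTermCount (117810) is close to the number of docs in the index, while fromTermHits (24742) matches the sub query hit count. That would fit a loop that walks every term of the 'from' field, not just the terms of matching docs. The toy model below is a sketch of that idea under this assumption. It is not Solr's code, and the dict-of-sets "postings" are a stand-in for the real index structures.

```python
def join_by_term_enumeration(from_terms, to_terms, from_set):
    """Toy model of a term-enumerating join.

    from_terms / to_terms: dict mapping join key -> set of doc ids (toy postings).
    from_set: doc ids matched by the sub query in the 'from' index.
    """
    stats = {"fromTermCount": 0, "fromTermHits": 0, "toTermHits": 0}
    result = set()
    for term in sorted(from_terms):          # visits every term, matched or not
        stats["fromTermCount"] += 1
        if from_terms[term] & from_set:      # does a matching doc carry this key?
            stats["fromTermHits"] += 1
            to_docs = to_terms.get(term)     # look the same key up on the 'to' side
            if to_docs:
                stats["toTermHits"] += 1
                result |= to_docs
    return result, stats


# Toy data: 10 unique keys per index; the sub query matches docs 2, 3 and 4.
from_terms = {k: {k} for k in range(10)}
to_terms = {k: {100 + k} for k in range(10)}
result, stats = join_by_term_enumeration(from_terms, to_terms, {2, 3, 4})
print(sorted(result), stats)
```

In this model the loop cost follows the number of unique keys in the 'from' field, so a unique-key join field is the worst case regardless of how few docs the sub query matches.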

Peter
Re: Cross index join query performance

Peter Keegan
I forgot to mention - this is Solr 4.3

Peter



(In reply to Peter Keegan's message of Wed, Sep 25, 2013 at 3:38 PM.)
Re: Cross index join query performance

Joel Bernstein
It looks like you are using int join keys so you may want to check out
SOLR-4787, specifically the hjoin and bjoin.

These perform well when you have a large number of results from the
fromIndex. If you have a small number of results in the fromIndex the
standard join will be faster.
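
For readers unfamiliar with the idea, a generic hash join on integer keys can be sketched as below. This is the textbook technique only, offered as a guess at why int keys and large 'from' result sets suit it. It is not the SOLR-4787 patch code, and the function name and data shapes are hypothetical.

```python
def hash_join(from_keys, to_candidates):
    """Generic hash join sketch.

    from_keys: iterable of int join keys taken from docs matching the sub query.
    to_candidates: iterable of (doc_id, join_key) pairs matching the main query.
    """
    keys = set(from_keys)  # build phase: one pass over the 'from' hits
    # Probe phase: one constant-time membership test per candidate doc.
    return [doc_id for doc_id, key in to_candidates if key in keys]


matches = hash_join([3, 5, 7], [(0, 1), (1, 3), (2, 7), (3, 9)])
print(matches)
```

The work here grows with the number of 'from' hits plus the number of main-query candidates, with no walk over the term dictionary.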


(In reply to Peter Keegan's message of Wed, Sep 25, 2013 at 3:39 PM.)



--
Joel Bernstein
Professional Services LucidWorks
Re: Cross index join query performance

Peter Keegan
Hi Joel,

I tried this patch and it is quite a bit faster. Using the same query on a
larger index (500K docs), the 'join' QTime was 1500 msec, and the 'hjoin'
QTime was 100 msec! This was true for both large and small result sets.

A few notes: the patch didn't compile with 4.3 because of the
SolrCore.getLatestSchema call (which I worked around), and the package name
should be:
<queryParser name="hjoin"
class="org.apache.solr.search.joins.HashSetJoinQParserPlugin"/>

Unfortunately, I just learned that our uniqueKey may have to be an
alphanumeric string instead of an int, so I'm not out of the woods yet.

Good stuff - thanks.

Peter


(In reply to Joel Bernstein's message of Thu, Sep 26, 2013 at 6:49 PM.)
Re: Cross index join query performance

Malcolm Upayavira Holmes
The thing here is to understand how a join works.

Effectively, it does the inner query first, which results in a list of
terms. It then effectively does a multi-term query with those values.

q=size:large {!join fromIndex=other from=someid to=someotherid}type:shirt

Imagine the inner query returned the values A, B and C. That inner query,
run on core 'other', is q=type:shirt&fl=someid.

Then your outer query becomes size:large someotherid:(A B C)
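
The rewrite described above can be sketched in a few lines of Python. This is a toy model with in-memory lists standing in for the two cores. The function name, the sample docs and the choice to treat the join as a filter (AND) on the outer query are all assumptions made for illustration.

```python
from collections import defaultdict


def pseudo_join(from_docs, inner_match, from_field, to_docs, outer_match, to_field):
    # Step 1: the inner query on the 'from' core yields the distinct join values.
    values = sorted({d[from_field] for d in from_docs if inner_match(d)})

    # Toy inverted index for the 'to' field: value -> positions of docs holding it.
    postings = defaultdict(set)
    for pos, d in enumerate(to_docs):
        postings[d[to_field]].add(pos)

    # Step 2: the multi-term lookup, one postings fetch per value from step 1.
    joined = set()
    for v in values:
        joined |= postings.get(v, set())

    # Step 3: intersect with the outer query (the join acts as a filter here).
    hits = [d for pos, d in enumerate(to_docs) if pos in joined and outer_match(d)]
    return values, hits


other = [{"someid": "A", "type": "shirt"}, {"someid": "B", "type": "shirt"},
         {"someid": "C", "type": "shirt"}, {"someid": "D", "type": "hat"}]
main = [{"someotherid": "A", "size": "large"}, {"someotherid": "B", "size": "small"},
        {"someotherid": "D", "size": "large"}]

values, hits = pseudo_join(other, lambda d: d["type"] == "shirt", "someid",
                           main, lambda d: d["size"] == "large", "someotherid")
print(values, hits)
```

Step 2 is the part that grows with the inner result: with 25k values it means 25k separate term lookups before the outer query is even applied.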

Your inner query returns 25k values. You're having to do a multi-term
query for 25k terms. That is *bound* to be slow.

The pseudo-joins in Solr 4.x are intended for a small to medium number
of values returned by the inner query, otherwise performance degrades as
you are seeing.

Is there a way you can reduce the number of values returned by the inner
query?

As Joel mentions, those other joins are attempts to find other ways to
work with this limitation.

Upayavira

(In reply to Peter Keegan's message of Fri, Sep 27, 2013 at 09:44 PM.)
Re: Cross index join query performance

Peter Keegan
Ah, got it now - thanks for the explanation.


(In reply to Upayavira's message of Sat, Sep 28, 2013 at 3:33 AM.)