Query Regarding SOLR cross collection join


Doss
Hi,

SOLR version 8.3.1 (10 nodes), ZooKeeper ensemble (3 nodes)

One of our use cases requires joins; we are joining 2 large indexes. As
required by SOLR, one index (2GB) has one shard and 10 replicas, and the
other has 10 shards (40GB per shard).

The query takes too much time, sometimes minutes. How can we improve
this?

The debug query output contains one or more blocks like the following,
depending on the number of shards (I believe):

        "time":303442,
        "fromSetSize":0,
        "toSetSize":81653955,
        "fromTermCount":0,
        "fromTermTotalDf":0,
        "fromTermDirectCount":0,
        "fromTermHits":0,
        "fromTermHitsTotalDf":0,
        "toTermHits":0,
        "toTermHitsTotalDf":0,
        "toTermDirectCount":0,
        "smallSetsDeferred":0,
        "toSetDocsAdded":0},

What does toSetSize mean here? Does it read 81MB of data from the
index? How can we reduce this?

I read somewhere that the score join parser is faster, but for me it
produces no results. I am using string type fields for from and to.


Thanks!

Re: Query Regarding SOLR cross collection join

Alessandro Benedetti
From the Join Query Parser code:

"// most of these statistics are only used for the enum method

int fromSetSize;          // number of docs in the fromSet (that match the from query)
long resultListDocs;      // total number of docs collected
int fromTermCount;
long fromTermTotalDf;
int fromTermDirectCount;  // number of fromTerms that were too small to use the filter cache
int fromTermHits;         // number of fromTerms that intersected the from query
long fromTermHitsTotalDf; // sum of the df of the matching terms
int toTermHits;           // num if intersecting from terms that match a term in the to field
long toTermHitsTotalDf;   // sum of the df for the toTermHits
int toTermDirectCount;    // number of toTerms that we set directly on a bitset rather than doing set intersections
int smallSetsDeferred;    // number of small sets collected to be used later to intersect w/ bitset or create another small set
"

toSetSize has nothing to do with megabytes of data read from the index; it is
the number of documents in the resulting set.

Improving this would require a much deeper analysis, I reckon,
starting from your query and your data model and extending to the architecture involved.
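
To make the reading of those numbers concrete, here is a minimal Python sketch that parses the debug block from the original post and reports toSetSize as a document count. It assumes that "time" is reported in milliseconds, which is not stated anywhere in this thread; the function name summarize is only illustrative.

```python
import json

# Debug statistics copied from the original post (trailing comma removed
# so that the fragment parses as standalone JSON).
debug_block = json.loads("""
{
  "time": 303442,
  "fromSetSize": 0,
  "toSetSize": 81653955,
  "fromTermCount": 0,
  "fromTermTotalDf": 0,
  "fromTermDirectCount": 0,
  "fromTermHits": 0,
  "fromTermHitsTotalDf": 0,
  "toTermHits": 0,
  "toTermHitsTotalDf": 0,
  "toTermDirectCount": 0,
  "smallSetsDeferred": 0,
  "toSetDocsAdded": 0
}
""")


def summarize(stats):
    """Return the join statistics in more readable units."""
    return {
        # Assumption: "time" is in milliseconds.
        "elapsed_seconds": stats["time"] / 1000.0,
        # Both set sizes are document counts, not bytes.
        "from_docs": stats["fromSetSize"],
        "to_docs": stats["toSetSize"],
    }


summary = summarize(debug_block)
print(f"{summary['to_docs']:,} documents in the to-set, "
      f"{summary['elapsed_seconds']:.1f} s elapsed")
```

Under that assumption the block describes a join that took roughly five minutes and produced a result set of about 81.6 million documents.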

Cheers
--------------------------
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
www.sease.io



Re: Query Regarding SOLR cross collection join

Mikhail Khludnev-2
In reply to this post by Doss
On Wed, Jan 22, 2020 at 4:27 PM Doss <[hidden email]> wrote:

> Read somewhere that the score join parser will be faster, but for me it
> produces no results. I am using string type fields for from and to.
>

That's odd. Can you try enabling docValues on the from side and reindexing a small
portion of the data, just to check whether it works?
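
One possible way to flip that flag, sketched in Python with only the standard library: build a Schema API replace-field request that turns on docValues for the join field. The collection name small_collection, the field name join_key, and the host are hypothetical. The sketch also assumes a managed (mutable) schema with a field type named "string", and that the field is indexed and stored. The request is only built here, not sent, and existing documents still need to be reindexed after the change.

```python
import json
import urllib.request


def build_replace_field_request(base_url, collection, field_name):
    """Build (but do not send) a Schema API request enabling docValues."""
    payload = {
        "replace-field": {
            "name": field_name,
            "type": "string",    # assumption: a "string" fieldType exists
            "indexed": True,     # assumption: keep the field indexed
            "stored": True,      # assumption: keep the field stored
            "docValues": True,   # the change suggested above
        }
    }
    return urllib.request.Request(
        url=f"{base_url}/{collection}/schema",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical host, collection and field names.
request = build_replace_field_request(
    "http://localhost:8983/solr", "small_collection", "join_key"
)
print(request.full_url)
print(request.data.decode("utf-8"))
# urllib.request.urlopen(request) would send it; omitted so the sketch
# runs without a Solr server.
```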




--
Sincerely yours
Mikhail Khludnev

Re: Query Regarding SOLR cross collection join

Doss
@ Alessandro Benedetti, thanks for your input!

@ Mikhail Khludnev, I set docValues="true" on the from and to fields and did an index
rotation; now the score join works perfectly! I saw a 7x performance increase.
Thanks!
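
For anyone comparing the two variants, here is a rough Python sketch of how the plain join and the score join differ in the query string. As far as I know, adding the score local parameter to {!join} is what selects the score join implementation. The collection name small_collection, the field name join_key, and the inner query status:active are hypothetical placeholders, not the real ones from this setup.

```python
from urllib.parse import urlencode


def join_filter(from_index, from_field, to_field, inner_query, score=None):
    """Build a {!join} filter; passing `score` requests the score join."""
    local_params = [
        f"fromIndex={from_index}",
        f"from={from_field}",
        f"to={to_field}",
    ]
    if score is not None:
        local_params.append(f"score={score}")
    return "{!join " + " ".join(local_params) + "}" + inner_query


# Hypothetical collection, field and inner query.
plain = join_filter("small_collection", "join_key", "join_key", "status:active")
scored = join_filter("small_collection", "join_key", "join_key",
                     "status:active", score="none")

# URL-encoded parameters, ready to append to a /select request.
params = urlencode({"q": "*:*", "fq": scored})
print(plain)
print(scored)
print(params)
```

The only difference between the two filters is the score=none local parameter; both the from and to fields need docValues for the second form, as found above.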


On Thu, Jan 23, 2020 at 9:53 PM Mikhail Khludnev <[hidden email]> wrote:

> That's odd. Can you try enabling docValues on the from side and reindexing a small
> portion of the data, just to check whether it works?

Re: Query Regarding SOLR cross collection join

Mikhail Khludnev-2
It's time to enforce and document the field type constraints:
https://issues.apache.org/jira/browse/SOLR-14230

On Mon, Jan 27, 2020 at 4:12 PM Doss <[hidden email]> wrote:

> @ Alessandro Benedetti , Thanks for your input!
>
> @ Mikhail Khludnev , I set docValues="true" on the from and to fields and did an index
> rotation; now the score join works perfectly! I saw a 7x performance increase.
> Thanks!