Use stream result like a query (alternative to innerJoin)


ufuk yılmaz
Hi all,

I’m looking for a way to query two collections and find the documents that exist in both. I know this can be done with the innerJoin streaming expression, but I want to avoid it, since one of the collection streams can have billions of results.

Let’s say the two collections are:

deletedItems = [{deletedItemId: 1}, {deletedItemId: 2}...]
items = [
        {
                id: 1,
                name: "a"
        },
        {
                id: 2,
                name: "b"
        },
        {
                id: 3,
                name: "c"
        }.....
]

“deletedItems” contains few documents compared to the “items” collection (about 1 million vs. 2-3 billion). If I run a typical query from our system against both, deletedItems returns a few thousand results, while items returns tens or hundreds of millions. To use innerJoin, I have to stream the whole items result to a worker node over the network.
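
For reference, the innerJoin being avoided would look roughly like the expression built below. This is a minimal sketch, not a tested production query: the `q="*:*"` queries and the `fl` lists are placeholders, and the helper names `search_expr` and `inner_join_expr` are hypothetical. Both inner streams are sorted on the join key, which innerJoin requires, and that is why the full items result has to be exported.

```python
def search_expr(collection, q, fl, sort):
    """Build a search() expression that reads from the /export handler."""
    return (f'search({collection}, q="{q}", fl="{fl}", '
            f'sort="{sort}", qt="/export")')


def inner_join_expr(left, right, on):
    """Wrap two streams in innerJoin(); both must be sorted on the join key."""
    return f'innerJoin({left}, {right}, on="{on}")'


# Placeholder queries; a real system would use its own q and fl values.
deleted = search_expr("deletedItems", "*:*", "deletedItemId", "deletedItemId asc")
items = search_expr("items", "*:*", "id,name", "id asc")

expr = inner_join_expr(deleted, items, on="deletedItemId=id")
print(expr)
```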

Is there a way to avoid this, for example by using the “deletedItems” result as a query against the “items” stream?

Thanks in advance for the help

Re: Use stream result like a query (alternative to innerJoin)

Joel Bernstein
There are two streams that behave like that.

One is the "nodes" expression, which is not going to work for this use case
because it does everything in memory.

The second one is the "fetch" expression, which behaves like a nested loop
join with some limitations. Unfortunately, the main limitation is likely to
be a blocker for you: it doesn't support one-to-many joins yet.
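
To illustrate what "behaves like a nested loop join" means here, the sketch below simulates the idea in plain Python: tuples from the small stream are read in batches, one lookup is issued per batch, and matching fields are copied onto each tuple. It is an illustration under stated assumptions, not Solr's implementation. The names `fetch_like` and `lookup` are hypothetical, and the in-memory `items` dict stands in for the items collection. Because the lookup result keeps a single document per key, the sketch also shows why a one-to-many join does not fit this model. The `FETCH_EXPR` string shows the general shape of the corresponding expression, with placeholder `q` and `fl` values.

```python
from itertools import islice

# General shape of the expression; q and fl values are placeholders.
FETCH_EXPR = (
    'fetch(items, '
    'search(deletedItems, q="*:*", fl="deletedItemId", '
    'sort="deletedItemId asc", qt="/export"), '
    'fl="name", on="deletedItemId=id", batchSize=50)'
)


def fetch_like(left_tuples, lookup, left_key, fields, batch_size=50):
    """Enrich left tuples with fields from matching right-side docs, batch by batch."""
    it = iter(left_tuples)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        # One lookup per batch; the result maps key -> a single document.
        found = lookup([t[left_key] for t in batch])
        for t in batch:
            out = dict(t)
            doc = found.get(t[left_key])
            if doc is not None:
                for f in fields:
                    if f in doc:
                        out[f] = doc[f]
            yield out


# Stand-in for the items collection, keyed by id.
items = {1: {"id": 1, "name": "a"}, 2: {"id": 2, "name": "b"}, 3: {"id": 3, "name": "c"}}


def lookup(keys):
    """Hypothetical lookup: return the items whose id is in keys."""
    return {k: items[k] for k in keys if k in items}


deleted = [{"deletedItemId": 1}, {"deletedItemId": 2}, {"deletedItemId": 99}]
result = list(fetch_like(deleted, lookup, "deletedItemId", ["name"], batch_size=2))
print(result)
```

In this simulation, a tuple with no match (id 99) passes through without the extra field. Only the small stream is read in full; the large collection is only probed by key.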

Joel Bernstein
http://joelsolr.blogspot.com/


On Sun, Nov 22, 2020 at 10:37 AM ufuk yılmaz <[hidden email]>
wrote:

> Is there a way to avoid this, something like using “deletedItems” result
> as a query to “items” stream?

Re: Use stream result like a query (alternative to innerJoin)

Joel Bernstein
Here is the documentation for fetch:

https://lucene.apache.org/solr/guide/8_4/stream-decorator-reference.html#fetch


On Mon, Nov 23, 2020 at 3:22 PM Joel Bernstein <[hidden email]> wrote:

> The second one is the "fetch" expression which behaves like a nested loop
> join with some limitations. Unfortunately the main limitation is likely to
> be a blocker for you which is that it doesn't support one-to-many joins yet.

RE: Use stream result like a query (alternative to innerJoin)

ufuk yılmaz
Fetch would work for my specific case (since I’m working with IDs, there is no one-to-many relationship) if I were able to restrict fetch’s target domain with a query. I would first get all possible deleted IDs, then use fetch against the items collection. But the current fetch implementation would find all deleted items, not something like “deleted items with these names” or “deleted items within this time range”.

I came upon your video while researching this stuff: https://www.youtube.com/watch?v=kTNe3TaqFvo

I’m trying to use the “let” expression to feed one stream’s result to another as a query, using the string concat function and the eval stream. So far I haven’t been able to write a working example, but it’s an idea that I’m playing with.
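
The same idea can be sketched on the client side, outside streaming expressions: collect the deleted IDs first, then turn them into a restricted query against items. This is a rough sketch of that alternative, not the let/eval expression described above. The helper names `chunks` and `items_query_params` are hypothetical, the `name:a` filter is only an example restriction, and the chunk size is an arbitrary assumption. It relies on Solr's `{!terms}` query parser to match a list of IDs.

```python
def chunks(seq, size):
    """Split a list of IDs into consecutive chunks of at most `size` elements."""
    for start in range(0, len(seq), size):
        yield seq[start:start + size]


def items_query_params(deleted_ids, extra_filter=None, rows=100):
    """Build query parameters selecting items whose id is in deleted_ids."""
    if not deleted_ids:
        raise ValueError("deleted_ids must not be empty")
    params = {
        # The terms query parser matches any of the comma-separated values.
        "q": "{!terms f=id}" + ",".join(str(i) for i in deleted_ids),
        "fl": "id,name",
        "rows": rows,
    }
    if extra_filter:
        # Restricts the target domain, e.g. by name or by a time range.
        params["fq"] = extra_filter
    return params


deleted_ids = [1, 2, 3, 4, 5]
requests_to_send = [
    items_query_params(chunk, extra_filter="name:a")
    for chunk in chunks(deleted_ids, 2)
]
for params in requests_to_send:
    print(params)
```

Each parameter set would be sent to the items collection as an ordinary query, so only the matching items travel over the network. Whether this scales depends on how many deleted IDs the first query returns.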


From: Joel Bernstein
Sent: 23 November 2020 23:23
To: [hidden email]
Subject: Re: Use stream result like a query (alternative to innerJoin)

H