parent/child rows in solr

classic Classic list List threaded Threaded
12 messages Options
Reply | Threaded
Open this post in threaded view
|

parent/child rows in solr

John Smith-5
Hi, I have a document structure like this (this is a made up schema, my
data has nothing to do with departments and employees, but the structure
holds true to my real data):

department 1
    employee 11
    employee 12
    employee 13
    room 11
    room 12
    room 13

department 2
    employee 21
    employee 22
    room 21

... etc

I'm trying to figure out the best way to index this, and perform queries.
Due to the sheer volume of data, I cannot do a simple "flat file" approach,
repeating the header data for each child entry.

So that leaves me with "graph traversal" or "block joins". I've played with
both of those, but I'm running into various issues with each approach.

I need to be able to run filters on any or all of the header + child rows
in the same query, at once (can't seem to get that working in either graph
or block join). One problem I had with graph is that I can't force solr to
return the header, then all the children for that header, then the next
header + all it's children, it just spits them out without keeping them
together. block join seems to return the children nested under the parents,
which is great, but then I can't seem to filter on parent + children in the
same query: I get the dreaded error message "Parent query must not match
any docs besides parent filter"

Kinda lost here, any tips/suggestions?
Reply | Threaded
Open this post in threaded view
|

Re: parent/child rows in solr

Shawn Heisey-2
On 9/7/2018 3:06 PM, John Smith wrote:

> Hi, I have a document structure like this (this is a made up schema, my
> data has nothing to do with departments and employees, but the structure
> holds true to my real data):
>
> department 1
>      employee 11
>      employee 12
>      employee 13
>      room 11
>      room 12
>      room 13
>
> department 2
>      employee 21
>      employee 22
>      room 21
>
> ... etc
>
> I'm trying to figure out the best way to index this, and perform queries.
> Due to the sheer volume of data, I cannot do a simple "flat file" approach,
> repeating the header data for each child entry.

Why not?

For the precise use case you have outlined, Solr will work better if you
only have the child documents and simply have every document contain a
"department" field which contains an identifier for the department. 
Since this precise structure is not what you are doing, you'll need to
adapt what I'm saying to your actual data.

The volume of data should be irrelevant to this decision. Solr will
always work best with a flat document structure.

I have never used the parent/child document feature in Solr, so I cannot
offer any advice on it.  Somebody else will need to help you if you
choose to use that feature.

Thanks,
Shawn

Reply | Threaded
Open this post in threaded view
|

Re: parent/child rows in solr

John Smith-5
Thanks Shawn, for your comments. The reason why I don't want to go flat
file structure, is due to all the wasted/duplicated data. If a department
has 100 employees, then it's very wasteful in terms of disk space to repeat
the header data over and over again, 100 times. In this example there is
only a few doc types, but my real-life data is much larger, and the problem
is a "scaling" problem; with just a little bit of data, no problem in
duplicating header fields, but with massive amounts of data it's a large
problem.

My understanding of both graph traversal and block joins, is that the
header data would only be present once, so that's why I'm gravitating
towards those solutions. I just can't seem to line up the "fq" and queries
correctly such that I am able to join 3+ document types together, filter on
them, and return my requested columns.

On Fri, Sep 7, 2018 at 9:32 PM Shawn Heisey <[hidden email]> wrote:

> On 9/7/2018 3:06 PM, John Smith wrote:
> > Hi, I have a document structure like this (this is a made up schema, my
> > data has nothing to do with departments and employees, but the structure
> > holds true to my real data):
> >
> > department 1
> >      employee 11
> >      employee 12
> >      employee 13
> >      room 11
> >      room 12
> >      room 13
> >
> > department 2
> >      employee 21
> >      employee 22
> >      room 21
> >
> > ... etc
> >
> > I'm trying to figure out the best way to index this, and perform queries.
> > Due to the sheer volume of data, I cannot do a simple "flat file"
> approach,
> > repeating the header data for each child entry.
>
> Why not?
>
> For the precise use case you have outlined, Solr will work better if you
> only have the child documents and simply have every document contain a
> "department" field which contains an identifier for the department.
> Since this precise structure is not what you are doing, you'll need to
> adapt what I'm saying to your actual data.
>
> The volume of data should be irrelevant to this decision. Solr will
> always work best with a flat document structure.
>
> I have never used the parent/child document feature in Solr, so I cannot
> offer any advice on it.  Somebody else will need to help you if you
> choose to use that feature.
>
> Thanks,
> Shawn
>
>
Reply | Threaded
Open this post in threaded view
|

Re: parent/child rows in solr

Shawn Heisey-2
On 9/7/2018 7:44 PM, John Smith wrote:
> Thanks Shawn, for your comments. The reason why I don't want to go flat
> file structure, is due to all the wasted/duplicated data. If a department
> has 100 employees, then it's very wasteful in terms of disk space to repeat
> the header data over and over again, 100 times. In this example there is
> only a few doc types, but my real-life data is much larger, and the problem
> is a "scaling" problem; with just a little bit of data, no problem in
> duplicating header fields, but with massive amounts of data it's a large
> problem.

If your goal is data storage, then you are completely correct.  All that
data duplication is something to avoid for a data storage situation. 
Normalizing your data so it's relational makes perfect sense, because
most database software is designed to efficiently deal with those
relationships.

Solr is not designed as a data storage platform, and does not handle
those relationships efficiently.  Solr's design goals are all about
*search*.  It often gets touted as filling a NoSQL role ... but it's not
something I would personally use as a primary data repository.  Search
is a space where data duplication is expected and completely normal. 
This is something that people often have a hard time accepting.

Thanks,
Shawn

Reply | Threaded
Open this post in threaded view
|

Re: parent/child rows in solr

John Smith-5
>
> On 9/7/2018 7:44 PM, John Smith wrote:
> > Thanks Shawn, for your comments. The reason why I don't want to go flat
> > file structure, is due to all the wasted/duplicated data. If a department
> > has 100 employees, then it's very wasteful in terms of disk space to
> repeat
> > the header data over and over again, 100 times. In this example there is
> > only a few doc types, but my real-life data is much larger, and the
> problem
> > is a "scaling" problem; with just a little bit of data, no problem in
> > duplicating header fields, but with massive amounts of data it's a large
> > problem.
>
> If your goal is data storage, then you are completely correct.  All that
> data duplication is something to avoid for a data storage situation.
> Normalizing your data so it's relational makes perfect sense, because
> most database software is designed to efficiently deal with those
> relationships.
>
> Solr is not designed as a data storage platform, and does not handle
> those relationships efficiently.  Solr's design goals are all about
> *search*.  It often gets touted as filling a NoSQL role ... but it's not
> something I would personally use as a primary data repository.  Search
> is a space where data duplication is expected and completely normal.
> This is something that people often have a hard time accepting.
>
>
I'm not actually trying to use solr as a data storage platform; all our
data is stored in an sql database, we are using solr strictly for the
search features, not storage features.

Here is a good example from a test I ran today. I have a header table, and
8 child tables which link directly to the header table. The children link
only to 1 header row, and they do not link to other children. So a 1:many
between header and each child. Some row counts:

header:      223,580

child1:      124,978
child2:      254,045
child3:      127,917
child4:    1,009,030
child5:      225,311
child6:      381,561
child7:      438,315
child8:       18,850


Trying to index that into solr with a flatfile schema, blows up into
5,475,316,072 rows. Yes, 5.5 billion rows. I calculated that by running a
left outer join between header and each child and getting a row count in
the database. That's not going to scale, at all, considering the small size
of the source input tables. Some of our indexes would require 50 million
header rows alone, never mind the child tables.

So solr has no way of indexing something like this? I can't believe I would
be the first person to run into this issue, I have a feeling I'm missing
something obvious somewhere.
Reply | Threaded
Open this post in threaded view
|

Re: parent/child rows in solr

Shawn Heisey-2
On 9/11/2018 7:07 PM, John Smith wrote:

> header:      223,580
>
> child1:      124,978
> child2:      254,045
> child3:      127,917
> child4:    1,009,030
> child5:      225,311
> child6:      381,561
> child7:      438,315
> child8:       18,850
>
>
> Trying to index that into solr with a flatfile schema, blows up into
> 5,475,316,072 rows. Yes, 5.5 billion rows. I calculated that by running a

I think you're not getting what I'm suggesting.  Or maybe there's an
aspect of your data that I'm not understanding.

If we add up all those numbers for the child docs, there are 2.5 million
of them.  So you would have 2.5 million docs in Solr.  I have created
Solr indexes far larger than this, and I do not consider my work to be
"big data".  Solr can handle 2.5 million docs easily, as long as the
hardware resources are sufficient.

Where the data duplication will come in is in additional fields in those
2.5 million docs.  Each one will contain some (or maybe all) of the data
that WOULD have been in the parent document.  The amount of data
balloons, but the number of documents (rows) doesn't.

That kind of arrangement is usually enough to accomplish whatever is
needed.  I cannot assume that it will work for your use case, but it
does work for most.

Thanks,
Shawn

Reply | Threaded
Open this post in threaded view
|

Re: parent/child rows in solr

John Smith-5
On Tue, Sep 11, 2018 at 9:32 PM Shawn Heisey <[hidden email]> wrote:

> On 9/11/2018 7:07 PM, John Smith wrote:
> > header:      223,580
> >
> > child1:      124,978
> > child2:      254,045
> > child3:      127,917
> > child4:    1,009,030
> > child5:      225,311
> > child6:      381,561
> > child7:      438,315
> > child8:       18,850
> >
> >
> > Trying to index that into solr with a flatfile schema, blows up into
> > 5,475,316,072 rows. Yes, 5.5 billion rows. I calculated that by running a
>
> I think you're not getting what I'm suggesting.  Or maybe there's an
> aspect of your data that I'm not understanding.
>
> If we add up all those numbers for the child docs, there are 2.5 million
> of them.  So you would have 2.5 million docs in Solr.  I have created
> Solr indexes far larger than this, and I do not consider my work to be
> "big data".  Solr can handle 2.5 million docs easily, as long as the
> hardware resources are sufficient.
>
> Where the data duplication will come in is in additional fields in those
> 2.5 million docs.  Each one will contain some (or maybe all) of the data
> that WOULD have been in the parent document.  The amount of data
> balloons, but the number of documents (rows) doesn't.
>
> That kind of arrangement is usually enough to accomplish whatever is
> needed.  I cannot assume that it will work for your use case, but it
> does work for most.
>
> Thanks,
> Shawn
>
>
The problem is that the math isn't a simple case of adding up all the row
counts. These are "left outer join"s. In sql, it would be this query:

select * from header h
left outer join child1 c1 on c1.hid = h.id
left outer join child2 c2 on c2.hid = h.id
...
left outer join child8 c8 on c8.hid = h.id


If there are 10 rows in child1 linked to 1 header with id "abc", and 10
rows in child2 linked to that same header, then we end up with 10 * 10 rows
in solr, not 20. Considering there are 8 child tables in this example,
there is simply an explosion of data.

I can't describe it much better than that (abstractly), though perhaps I
could put together a simple example with live data. Suffice it to say, in
my example row counts above, that is all "live data" in a relatively small
database of ours, the row counts are real, and the final row count of 5.5
billion was calculated inside sql using that query above:

select count(*) from (
    select id from header h
    left outer join child1 c1 on c1.hid = h.id
    left outer join child2 c2 on c2.hid = h.id
    ...
    left outer join child8 c8 on c8.hid = h.id
) tmp;
Reply | Threaded
Open this post in threaded view
|

Re: parent/child rows in solr

Shawn Heisey-2
On 9/11/2018 8:35 PM, John Smith wrote:
> The problem is that the math isn't a simple case of adding up all the row
> counts. These are "left outer join"s. In sql, it would be this query:

I think we'll just have to conclude that I do not understand what you
are doing.  I have no idea what "left outer join" even means, how it's
different than a join that's NOT "left outer".

I will say this:  Solr is not very efficient at joins, and there are a
bunch of caveats involved.  It's usually better to go with a flat
document space for a search engine.

Thanks,
Shawn

Reply | Threaded
Open this post in threaded view
|

Re: parent/child rows in solr

Walter Underwood
In reply to this post by John Smith-5
Have you tried modeling it with multivalued fields?

Also, why do you think Solr is a good solution? What is the problem?

wunder
Walter Underwood
[hidden email]
http://observer.wunderwood.org/  (my blog)

> On Sep 11, 2018, at 7:35 PM, John Smith <[hidden email]> wrote:
>
> On Tue, Sep 11, 2018 at 9:32 PM Shawn Heisey <[hidden email]> wrote:
>
>> On 9/11/2018 7:07 PM, John Smith wrote:
>>> header:      223,580
>>>
>>> child1:      124,978
>>> child2:      254,045
>>> child3:      127,917
>>> child4:    1,009,030
>>> child5:      225,311
>>> child6:      381,561
>>> child7:      438,315
>>> child8:       18,850
>>>
>>>
>>> Trying to index that into solr with a flatfile schema, blows up into
>>> 5,475,316,072 rows. Yes, 5.5 billion rows. I calculated that by running a
>>
>> I think you're not getting what I'm suggesting.  Or maybe there's an
>> aspect of your data that I'm not understanding.
>>
>> If we add up all those numbers for the child docs, there are 2.5 million
>> of them.  So you would have 2.5 million docs in Solr.  I have created
>> Solr indexes far larger than this, and I do not consider my work to be
>> "big data".  Solr can handle 2.5 million docs easily, as long as the
>> hardware resources are sufficient.
>>
>> Where the data duplication will come in is in additional fields in those
>> 2.5 million docs.  Each one will contain some (or maybe all) of the data
>> that WOULD have been in the parent document.  The amount of data
>> balloons, but the number of documents (rows) doesn't.
>>
>> That kind of arrangement is usually enough to accomplish whatever is
>> needed.  I cannot assume that it will work for your use case, but it
>> does work for most.
>>
>> Thanks,
>> Shawn
>>
>>
> The problem is that the math isn't a simple case of adding up all the row
> counts. These are "left outer join"s. In sql, it would be this query:
>
> select * from header h
> left outer join child1 c1 on c1.hid = h.id
> left outer join child2 c2 on c2.hid = h.id
> ...
> left outer join child8 c8 on c8.hid = h.id
>
>
> If there are 10 rows in child1 linked to 1 header with id "abc", and 10
> rows in child2 linked to that same header, then we end up with 10 * 10 rows
> in solr, not 20. Considering there are 8 child tables in this example,
> there is simply an explosion of data.
>
> I can't describe it much better than that (abstractly), though perhaps I
> could put together a simple example with live data. Suffice it to say, in
> my example row counts above, that is all "live data" in a relatively small
> database of ours, the row counts are real, and the final row count of 5.5
> billion was calculated inside sql using that query above:
>
> select count(*) from (
>    select id from header h
>    left outer join child1 c1 on c1.hid = h.id
>    left outer join child2 c2 on c2.hid = h.id
>    ...
>    left outer join child8 c8 on c8.hid = h.id
> ) tmp;

Reply | Threaded
Open this post in threaded view
|

Re: parent/child rows in solr

John Smith-5
In reply to this post by Shawn Heisey-2
On Tue, Sep 11, 2018 at 11:00 PM Shawn Heisey <[hidden email]> wrote:

> On 9/11/2018 8:35 PM, John Smith wrote:
> > The problem is that the math isn't a simple case of adding up all the row
> > counts. These are "left outer join"s. In sql, it would be this query:
>
> I think we'll just have to conclude that I do not understand what you
> are doing.  I have no idea what "left outer join" even means, how it's
> different than a join that's NOT "left outer".
>
> I will say this:  Solr is not very efficient at joins, and there are a
> bunch of caveats involved.  It's usually better to go with a flat
> document space for a search engine.
>
> Thanks,
> Shawn
>
>
A "left outer join" in sql is a join such that if there is no match in the
child table for a given header id, then the child cells are returned as
"null" values, instead of the header row being removed from the result set
(which is what happens in "inner join" or standard sql join).

A good rundown on the various sql joins:
https://stackoverflow.com/questions/38549/what-is-the-difference-between-inner-join-and-outer-join
Reply | Threaded
Open this post in threaded view
|

Re: parent/child rows in solr

John Smith-5
In reply to this post by Walter Underwood
On Tue, Sep 11, 2018 at 11:05 PM Walter Underwood <[hidden email]>
wrote:

> Have you tried modeling it with multivalued fields?
>
>
That's an interesting idea, but I don't think that would work. We would
lose the concept of "rows". So let's say child1 has col "a" and col "b",
both are turned into multi-value fields in the solr index. Normally in sql
we can query for a specific value in col "a", and then see what the
associated value in col "b" would be, but we can't do that if we stuff the
col values in multi-value; we can no longer see which value from col "a"
corresponds to which value in col "b". I'm probably explaining that poorly,
but I just don't see how that would work.
Reply | Threaded
Open this post in threaded view
|

Re: parent/child rows in solr

Rahul Singh-3
What’s your SLA? It seems that you have two problems - finding correlated information that’s in a hierarchy and potentially displaying it.

I feel your desire to conflate the two is forcing you down a specific path. Often times in complex scenarios I’ve found that an index like Solr is better for the search and not necessarily the storage or display.

The question I have is : what’s your application workflow? Who is querying this data? How are they expecting to see it? How fast do they need that data?

I understand what you’ve described (which seems to be a non functional requirement ) of what you want to do but in order to help it would be helpful for me at least to know how the data is ingested,
Enhanced, and retrieved.

In terms of data volume, you may consider indexing all data , but not storing all of it. This makes sure you aren’t duplicating data that isn’t an awful waste of space.

Rahul Singh
Chief Executive Officer
m 202.905.2818

Anant Corporation
1010 Wisconsin Ave NW, Suite 250
Washington, D.C. 20007

We build and manage digital business technology platforms.
On Sep 11, 2018, 11:23 PM -0400, John Smith <[hidden email]>, wrote:

> On Tue, Sep 11, 2018 at 11:05 PM Walter Underwood <[hidden email]>
> wrote:
>
> > Have you tried modeling it with multivalued fields?
> >
> >
> That's an interesting idea, but I don't think that would work. We would
> lose the concept of "rows". So let's say child1 has col "a" and col "b",
> both are turned into multi-value fields in the solr index. Normally in sql
> we can query for a specific value in col "a", and then see what the
> associated value in col "b" would be, but we can't do that if we stuff the
> col values in multi-value; we can no longer see which value from col "a"
> corresponds to which value in col "b". I'm probably explaining that poorly,
> but I just don't see how that would work.