Collating results from multiple indexes

Collating results from multiple indexes

Aaron McKee

Is there any somewhat convenient way to collate/integrate fields from
separate indices during result writing, if the indices use the same
unique keys? Basically, some sort of cross-index JOIN?

As a bit of background, I have a rather heavyweight dataset of every US
business (~25M records, an on-disk index footprint of ~30 GB, and 5-10
hours to fully index on a decent box). Given the size and relative
stability of the dataset, I generally only update it monthly. However,
I have separate advertising-related datasets that need to be updated
either hourly or daily (e.g. today's coupon, click revenue remaining,
etc.). These advertiser feeds reference the same keyspace that I use in
the main index, but are otherwise significantly lighter weight;
importing and indexing them separately takes only a couple of minutes.
Since Solr/Lucene can't update a field without dropping and re-adding
the entire document, it doesn't seem practical to integrate this data
into the main index (with constant document re-inserts the system would
be in a perpetual state of churn, and the performance impact would
probably be debilitating). It would be nice if this data could
participate in filtering (e.g. only show advertisers), but it doesn't
need to participate in scoring/ranking.

I'm guessing that someone else has had a similar need, at some point?  I
can have our front-end query the smaller indices separately, using the
keys returned by the primary index, but would prefer to avoid the extra
sequential roundtrips. I'm hoping to also avoid a coding solution, if
only to avoid the maintenance overhead as we drop in new builds of Solr,
but that's also feasible.

Thank you for your insight,
Aaron

Re: Collating results from multiple indexes

Jan Høydahl / Cominvent
Hi,

There is no JOIN functionality in Solr. The common solution is either to accept the high volume update churn, or to add client side code to build a "join" layer on top of the two indices. I know that Attivio (www.attivio.com) have built some kind of JOIN functionality on top of Solr in their AIE product, but do not know the details or the actual performance.
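
A client-side "join" layer of the kind described above could look roughly like the following. This is only a minimal sketch, not an existing Solr feature: the function names and the `id` key field are assumptions made for illustration, and the HTTP calls to the two indices are left out. It builds one batched lookup query for the secondary index (so the extra cost is a single round trip rather than one per document) and then merges the secondary fields into the primary results by unique key.

```python
from typing import Dict, Iterable, List

Doc = Dict[str, object]


def build_key_query(keys: Iterable[str], key_field: str = "id") -> str:
    """Build one Lucene-syntax query that matches all of the given keys."""
    # Quote each key so characters such as ':' or '-' are not parsed as syntax.
    quoted = [
        '"{}"'.format(k.replace("\\", "\\\\").replace('"', '\\"')) for k in keys
    ]
    return "{}:({})".format(key_field, " OR ".join(quoted))


def merge_by_key(
    primary_docs: List[Doc], secondary_docs: List[Doc], key_field: str = "id"
) -> List[Doc]:
    """Left-join secondary fields onto primary docs, keeping primary order."""
    by_key = {doc[key_field]: doc for doc in secondary_docs}
    merged = []
    for doc in primary_docs:
        combined = dict(doc)
        # Primary fields win if both indices define the same field name.
        for name, value in by_key.get(doc[key_field], {}).items():
            combined.setdefault(name, value)
        merged.append(combined)
    return merged


if __name__ == "__main__":
    # Stand-ins for the documents returned by the two indices.
    businesses = [{"id": "b1", "name": "Acme"}, {"id": "b2", "name": "Bolt"}]
    ads = [{"id": "b2", "coupon": "10% off"}]
    print(build_key_query(doc["id"] for doc in businesses))
    print(merge_by_key(businesses, ads))
```

Because the merge happens after the primary query has been ranked, the secondary data can be displayed alongside each hit but does not influence scoring.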

Why not open a JIRA issue, if one doesn't already exist, to request this as a feature?

--
Jan Høydahl  - search architect
Cominvent AS - www.cominvent.com

On 25. jan. 2010, at 22.01, Aaron McKee wrote:

>
> Is there any somewhat convenient way to collate/integrate fields from separate indices during result writing, if the indices use the same unique keys? Basically, some sort of cross-index JOIN?
>
> As a bit of background, I have a rather heavyweight dataset of every US business (~25M records, an on-disk index footprint of ~30 GB, and 5-10 hours to fully index on a decent box). Given the size and relative stability of the dataset, I generally only update it monthly. However, I have separate advertising-related datasets that need to be updated either hourly or daily (e.g. today's coupon, click revenue remaining, etc.). These advertiser feeds reference the same keyspace that I use in the main index, but are otherwise significantly lighter weight; importing and indexing them separately takes only a couple of minutes. Since Solr/Lucene can't update a field without dropping and re-adding the entire document, it doesn't seem practical to integrate this data into the main index (with constant document re-inserts the system would be in a perpetual state of churn, and the performance impact would probably be debilitating). It would be nice if this data could participate in filtering (e.g. only show advertisers), but it doesn't need to participate in scoring/ranking.
>
> I'm guessing that someone else has had a similar need, at some point?  I can have our front-end query the smaller indices separately, using the keys returned by the primary index, but would prefer to avoid the extra sequential roundtrips. I'm hoping to also avoid a coding solution, if only to avoid the maintenance overhead as we drop in new builds of Solr, but that's also feasible.
>
> Thank you for your insight,
> Aaron
>

Re: Collating results from multiple indexes

Otis Gospodnetic-2
Minor correction re Attivio - their stuff runs on top of Lucene, not Solr.  I *think* they are trying to patent this.

 Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Hadoop ecosystem search :: http://search-hadoop.com/



----- Original Message ----

> From: Jan Høydahl / Cominvent <[hidden email]>
> To: [hidden email]
> Sent: Mon, February 8, 2010 3:33:41 PM
> Subject: Re: Collating results from multiple indexes
>
> Hi,
>
> There is no JOIN functionality in Solr. The common solution is either to accept
> the high volume update churn, or to add client side code to build a "join" layer
> on top of the two indices. I know that Attivio (www.attivio.com) have built some
> kind of JOIN functionality on top of Solr in their AIE product, but do not know
> the details or the actual performance.
>
> Why not open a JIRA issue, if one doesn't already exist, to request this as a
> feature?
>
> --
> Jan Høydahl  - search architect
> Cominvent AS - www.cominvent.com

Re: Collating results from multiple indexes

Jan Høydahl / Cominvent
Really? The last time I looked at AIE, I am pretty sure there were Solr core messages in the logs, so I assumed it used EmbeddedSolr or something. But I may be mistaken. Anyone from Attivio here who can elaborate? Is the join stuff at the Lucene level, or on top of multiple Solr cores, or what?

--
Jan Høydahl  - search architect
Cominvent AS - www.cominvent.com

On 11. feb. 2010, at 23.02, Otis Gospodnetic wrote:

> Minor correction re Attivio - their stuff runs on top of Lucene, not Solr.  I *think* they are trying to patent this.
>
> Otis
> ----
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Hadoop ecosystem search :: http://search-hadoop.com/

Re: Collating results from multiple indexes

Will Johnson-2
Jan Høydahl / Otis,

First off, thanks for mentioning us. We do use some utility functions from
Solr, but our index engine is built on top of Lucene only; there are no Solr
cores involved. We do have a JOIN operator that allows us to perform
relational searches while still acting like a search engine in terms of
performance, ranking, faceting, etc. Our CTO wrote a blog article about it
a month ago that does a pretty good job of explaining how it’s used:
http://www.attivio.com/blog/55-industry-insights/507-can-a-search-engine-replace-a-relational-database.html

The join functionality and most of our other higher-level features use
separate data structures and don’t really have much to do with Lucene/Solr
except where they integrate with the query execution. If you want to learn
more, feel free to check out www.attivio.com.

- [hidden email]


On Fri, Feb 12, 2010 at 10:35 AM, Jan Høydahl / Cominvent <
[hidden email]> wrote:

> Really? The last time I looked at AIE, I am pretty sure there were Solr core
> messages in the logs, so I assumed it used EmbeddedSolr or something. But I may
> be mistaken. Anyone from Attivio here who can elaborate? Is the join stuff
> at the Lucene level, or on top of multiple Solr cores, or what?
>
> --
> Jan Høydahl  - search architect
> Cominvent AS - www.cominvent.com

Re: Collating results from multiple indexes

Jan Høydahl / Cominvent
Thanks for your clarification and link, Will.

Back to Aaron's question. There is some ongoing work to try to support updating single fields within documents (http://issues.apache.org/jira/browse/SOLR-139 and http://issues.apache.org/jira/browse/SOLR-828) which could perhaps be part of a future solution.

Is it an option for you to write a smart "join" component which can live on top of multiple cores, do multiple sub-queries in an efficient way, and transparently return the final result? Forking the shards query code could be a starting point. Donating this component back to Solr may free you of the maintenance burden, as I'm sure it would be useful to a larger audience.
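
To make the idea of such a "join" component a little more concrete, here is a rough sketch of the orchestration logic only. Everything in it is an assumption for illustration: `joined_search`, the two injected callables (which would wrap requests to the main core and the advertiser core), the `id` key field, and the chunk size. Keys from the primary result are split into chunks so that each sub-query stays below Lucene's boolean clause limit (1024 by default), the chunks are looked up concurrently, and the results are stitched together before being returned.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Sequence

Doc = Dict[str, object]
# Hypothetical signatures: each callable wraps a request to one core.
PrimarySearch = Callable[[str], List[Doc]]
KeyLookup = Callable[[Sequence[str]], List[Doc]]


def chunked(items: Sequence, size: int) -> List[Sequence]:
    """Split a sequence into consecutive slices of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def joined_search(
    query: str,
    search_primary: PrimarySearch,
    lookup_secondary: KeyLookup,
    key_field: str = "id",
    chunk_size: int = 500,
    only_matched: bool = False,
) -> List[Doc]:
    """Query the primary core, then attach secondary fields by unique key."""
    primary = search_primary(query)
    keys = [str(doc[key_field]) for doc in primary]

    # Run the secondary sub-queries concurrently, one per chunk of keys.
    extras: Dict[str, Doc] = {}
    chunks = chunked(keys, chunk_size)
    if chunks:
        with ThreadPoolExecutor(max_workers=min(4, len(chunks))) as pool:
            for docs in pool.map(lookup_secondary, chunks):
                for doc in docs:
                    extras[str(doc[key_field])] = doc

    results = []
    for doc in primary:
        extra = extras.get(str(doc[key_field]))
        if extra is None and only_matched:
            continue  # e.g. "only show advertisers"
        # Primary fields win when both cores define the same field name.
        results.append({**(extra or {}), **doc})
    return results
```

Note that `only_matched` filters after the primary query has run, so it only trims the page of hits that was fetched; the hit count and paging reported by the primary core will not reflect the filter. Filtering before ranking would need the join to happen inside the search engine itself.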

--
Jan Høydahl  - search architect
Cominvent AS - www.cominvent.com

On 17. feb. 2010, at 03.27, Will Johnson wrote:

> Jan Høydahl / Otis,
>
> First off, thanks for mentioning us. We do use some utility functions from
> Solr, but our index engine is built on top of Lucene only; there are no Solr
> cores involved. We do have a JOIN operator that allows us to perform
> relational searches while still acting like a search engine in terms of
> performance, ranking, faceting, etc. Our CTO wrote a blog article about it
> a month ago that does a pretty good job of explaining how it’s used:
> http://www.attivio.com/blog/55-industry-insights/507-can-a-search-engine-replace-a-relational-database.html
>
> The join functionality and most of our other higher-level features use
> separate data structures and don’t really have much to do with Lucene/Solr
> except where they integrate with the query execution. If you want to learn
> more, feel free to check out www.attivio.com.
>
> - [hidden email]