Near real-time search of user data

Mark Ferguson
Hi,

I am trying to come up with a strategy for a Solr setup in which a user's
indexed data is available to them for search almost immediately. My
current strategy (which is starting to cause problems) is as follows:

  - each user has their own personal index (core), which gets committed
after each update
  - there is a main index which is basically an aggregate of all user
indexes. This index gets committed every 5 minutes or so.

This way, I can search a user's personal index to get real-time results
and concatenate them with the world results from the main index, which
don't need to be as fresh.
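
For illustration, the concatenation step might look roughly like the sketch below. It is only a sketch: the result lists are plain dictionaries standing in for whatever the two Solr queries return, and `merge_results`, the `id` field, and the sample documents are hypothetical names rather than anything taken from the actual setup.

```python
from typing import Dict, List


def merge_results(personal: List[Dict], world: List[Dict], limit: int = 10) -> List[Dict]:
    """Put personal hits ahead of world hits, dropping repeated ids."""
    seen = set()
    merged = []
    for doc in personal + world:
        doc_id = doc["id"]
        if doc_id in seen:
            # The main index is an aggregate of the user indexes, so it may
            # hold an older copy of a document already found in the personal one.
            continue
        seen.add(doc_id)
        merged.append(doc)
    return merged[:limit]


# Hypothetical hits: the personal core is fresh, the main index lags behind.
personal_hits = [{"id": "u1-7", "title": "draft saved a second ago"}]
world_hits = [
    {"id": "u1-3", "title": "older note"},
    {"id": "u1-7", "title": "stale copy"},
]
merged_hits = merge_results(personal_hits, world_hits)
```

Keeping the personal hits first means the freshest copy of a document wins whenever both indexes return the same id.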

This multicore strategy worked well in test scenarios but as the user
indexes get larger it is starting to fall apart as I run into memory issues
in maintaining too many cores. It's not realistic to dedicate a new machine
to every 5K-10K users and I think this is what I will have to do to maintain
the multicore strategy.

So I am hoping someone can offer some tips on how to accomplish this. One
option is simply to send a commit to the main index every couple of seconds,
but I was hoping someone with experience could say whether that is viable
before I attempt it (i.e. can commits be sent that frequently on a large
index?). The indexes are distributed, but they could still be in the 2-100GB
range.
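
To make the "commit every couple of seconds" option concrete, here is a minimal sketch of a throttle that coalesces updates into at most one commit per interval. It says nothing about whether a large index can absorb commits at that rate; it only bounds how often they are sent. `CommitThrottle`, `send_commit`, and the 2-second default are assumptions for illustration, and `send_commit` stands in for whatever call actually issues the commit.

```python
import time
from typing import Callable, Optional


class CommitThrottle:
    """Send at most one commit per `min_interval` seconds."""

    def __init__(
        self,
        send_commit: Callable[[], None],
        min_interval: float = 2.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._send_commit = send_commit  # hypothetical hook that issues the commit
        self._min_interval = min_interval
        self._clock = clock
        self._last_commit: Optional[float] = None
        self._dirty = False

    def document_added(self) -> None:
        """Record an uncommitted update and commit if the interval allows."""
        self._dirty = True
        self.maybe_commit()

    def maybe_commit(self) -> bool:
        """Commit pending updates unless one was sent too recently."""
        if not self._dirty:
            return False
        now = self._clock()
        if self._last_commit is not None and now - self._last_commit < self._min_interval:
            return False
        self._send_commit()
        self._last_commit = now
        self._dirty = False
        return True
```

In this sketch something still has to call `maybe_commit()` periodically (a timer, for instance), otherwise the last update in a burst stays uncommitted until the next document arrives.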

Thanks very much for any suggestions!

Mark

Re: Near real-time search of user data

Otis Gospodnetic-2

I've used a similar strategy for Simpy.com, but with raw Lucene and not Solr.  The crucial piece is to close (inactive) user indices periodically and thus free the memory.  Are you doing the same with your per-user Solr cores and still running into memory issues?
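
A rough sketch of the periodic-close idea, assuming each user index can be closed through some callback: `IdleIndexCloser`, `close_index`, and the 300-second idle limit are hypothetical names and values for illustration, not taken from Simpy or from Lucene.

```python
import time
from typing import Callable, Dict, List


class IdleIndexCloser:
    """Track last use per user index and close those idle for too long."""

    def __init__(
        self,
        close_index: Callable[[str], None],
        max_idle: float = 300.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._close_index = close_index  # hypothetical hook that frees the index
        self._max_idle = max_idle
        self._clock = clock
        self._last_used: Dict[str, float] = {}

    def touch(self, user_id: str) -> None:
        """Call whenever a user's index is searched or updated."""
        self._last_used[user_id] = self._clock()

    def sweep(self) -> List[str]:
        """Close every index idle for at least `max_idle` seconds."""
        now = self._clock()
        idle = [u for u, t in self._last_used.items() if now - t >= self._max_idle]
        for user_id in idle:
            self._close_index(user_id)
            del self._last_used[user_id]
        return idle
```

Running `sweep()` from a background timer would give the periodic behaviour; an index closed this way has to be reopened on the user's next request.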

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



----- Original Message ----

> From: Mark Ferguson <[hidden email]>
> To: [hidden email]
> Sent: Friday, February 20, 2009 1:14:15 AM
> Subject: Near real-time search of user data


Re: Near real-time search of user data

Noble Paul നോബിള്‍  नोब्ळ्
We have a similar use case, and I have raised an issue for it (SOLR-880).
Currently we are using an internal patch, and we hope to submit one soon.

We also use an LRU-based automatic loading/unloading feature. If a
request comes in for a core that is 'STOPPED', the core is 'STARTED'
and the request is served.

We keep an upper limit on the number of cores kept loaded, and if
the limit is crossed, the least recently used core is 'STOPPED'.
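
The load/unload policy described above can be sketched as follows. This is not the internal patch or the SOLR-880 code, just a minimal model of the same idea; `LruCoreManager`, `start_core`, and `stop_core` are hypothetical names, and the two callbacks stand in for whatever actually starts and stops a core.

```python
from collections import OrderedDict
from typing import Callable, List


class LruCoreManager:
    """Keep at most `max_loaded` cores started, stopping the least recently used."""

    def __init__(
        self,
        start_core: Callable[[str], None],
        stop_core: Callable[[str], None],
        max_loaded: int,
    ) -> None:
        if max_loaded < 1:
            raise ValueError("max_loaded must be at least 1")
        self._start_core = start_core
        self._stop_core = stop_core
        self._max_loaded = max_loaded
        # Keys are core names, ordered from least to most recently used.
        self._loaded: "OrderedDict[str, None]" = OrderedDict()

    def acquire(self, name: str) -> None:
        """Make sure `name` is started before a request is served from it."""
        if name in self._loaded:
            self._loaded.move_to_end(name)  # mark as most recently used
            return
        if len(self._loaded) >= self._max_loaded:
            victim, _ = self._loaded.popitem(last=False)  # least recently used
            self._stop_core(victim)
        self._start_core(name)
        self._loaded[name] = None

    def loaded(self) -> List[str]:
        """Names of started cores, least recently used first."""
        return list(self._loaded)
```

Each request handler would call `acquire(core_name)` first, so a stopped core is started on demand and the cap on loaded cores holds.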

--Noble


On Fri, Feb 20, 2009 at 8:53 AM, Otis Gospodnetic
<[hidden email]> wrote:

>
> I've used a similar strategy for Simpy.com, but with raw Lucene and not Solr.  The crucial piece is to close (inactive) user indices periodically and thus free the memory.  Are you doing the same with your per-user Solr cores and still running into memory issues?
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



--
--Noble Paul

Re: Near real-time search of user data

Mark Ferguson
Thanks Noble and Otis for your suggestions.

After reading more messages on the mailing list about this problem, I
decided to implement one suggestion: keep an archive index and a
smaller delta index containing only recent updates, then do a distributed
search across them. The delta index is small, so it can handle rapid commits
(every 1-2 seconds). This setup works well for my architecture because it is
easy to keep track of recent changes in the database, send those to
the archive index every hour or so, and then clear out the delta.
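
A small sketch of the archive/delta arrangement, under stated assumptions. `distributed_query_url` builds a request using Solr's `shards` parameter; the host names and core paths are made up. `DeltaArchive` is a toy in-memory model of the flush cycle only: it lets the delta copy win when both hold the same id, which is an assumption of the model and not a statement about how Solr treats duplicate ids across shards.

```python
from typing import Dict, List
from urllib.parse import urlencode


def distributed_query_url(base_url: str, shards: List[str], query: str) -> str:
    """Build a select URL that fans the query out to every listed shard."""
    params = {"q": query, "shards": ",".join(shards)}
    return base_url + "/select?" + urlencode(params)


class DeltaArchive:
    """Toy model: recent updates live in `delta` until flushed to `archive`."""

    def __init__(self) -> None:
        self.archive: Dict[str, str] = {}
        self.delta: Dict[str, str] = {}

    def update(self, doc_id: str, text: str) -> None:
        # New and changed documents only touch the small, rapidly committed delta.
        self.delta[doc_id] = text

    def search(self, term: str) -> List[str]:
        # Search both; the delta copy wins because it is the newer version.
        merged = {**self.archive, **self.delta}
        return sorted(doc_id for doc_id, text in merged.items() if term in text)

    def flush(self) -> int:
        """Move recent changes into the archive, then clear out the delta."""
        moved = len(self.delta)
        self.archive.update(self.delta)
        self.delta.clear()
        return moved


# Hypothetical hosts and core names, for illustration only.
url = distributed_query_url(
    "http://localhost:8983/solr/archive",
    ["localhost:8983/solr/archive", "localhost:8983/solr/delta"],
    "title:report",
)
```

In the real setup `flush()` corresponds to the hourly job that sends the recent database changes to the archive index and then empties the delta.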

I really like your ideas about closing inactive indexes when using a
multicore setup; having too many indexes open was definitely the issue
plaguing me. Thanks for your great ideas and the time you take on this
project!

Mark



On Thu, Feb 19, 2009 at 9:31 PM, Noble Paul നോബിള്‍ नोब्ळ् <
[hidden email]> wrote:

> we have a similar usecase and I have raised an issue for the same
> (SOLR-880)
> currently we are using an internal patch and we hope to submit one soon.
>
> we also use an LRU based automatic loading unloading feature. if a
> request comes up for a core that is 'STOPPED' . the core is 'STARTED'
> and the request is served.
>
> We  keep an upper limit of the no:of cores to be kept loaded and if
> the limit is crossed, a least recently used core is 'STOPPED' .
>
> --Noble