Federated Search


Federated Search

Tim Patton
I just downloaded Solr to try out; it seems like it will replace a ton
of code I've written.  I saw a few posts about FederatedSearch and
skimmed the ideas at http://wiki.apache.org/solr/FederatedSearch.  The
project I am working on has several Lucene indexes 20-40GB in size
spread among a few machines.  I've also run into problems figuring out
how to work with Lucene in a distributed fashion, though all of my
difficulties were in indexing; searching with MultiSearcher and a few
custom classes on top of the hits was not that difficult.

Indexing involved using a SQL database as a master DB, so you could find
documents by their unique ID, and a JMS server to distribute additions,
deletions, and updates to each of the indexing servers.  I eventually
replaced the JMS server with something custom I wrote that is much more
lightweight and less prone to bogging down.

I'd be curious whether Yonik is still on the list and whether he or anyone
else has any new ideas for federated searching.

Tim P.


Re: Federated Search

kkrugler

I'm also interested in this. For me, I don't need sorted output,
faceted browsing, or alternative output formats - so something along
the lines of the "Merge XML responses w/o Schema" proposal would be
just fine.
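
For concreteness, here is a minimal sketch of what that merge step might
look like: query each sub-searcher over plain HTTP, collect the <doc>
elements from each Solr XML response, and re-sort by score. The shard URLs
and the reliance on a returned "score" field are assumptions for
illustration, not part of the proposal itself.

```java
// Hedged sketch of a "merge XML responses w/o schema" broker: fetch each
// shard's XML response, pull out the <doc> elements, and re-sort by score.
import java.net.URL;
import java.util.*;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.*;

public class XmlMergeBroker {
    public static void main(String[] args) throws Exception {
        // Hypothetical shard URLs; any number of sub-searchers works.
        String[] shards = {
            "http://shard1:8983/solr/select?q=foo&fl=id,score",
            "http://shard2:8983/solr/select?q=foo&fl=id,score"
        };
        DocumentBuilder db =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        List<Element> docs = new ArrayList<Element>();
        for (String shard : shards) {
            Document response = db.parse(new URL(shard).openStream());
            NodeList hits = response.getElementsByTagName("doc");
            for (int i = 0; i < hits.getLength(); i++) {
                docs.add((Element) hits.item(i));
            }
        }
        // Sort the merged list by the float "score" field, descending.
        Collections.sort(docs, new Comparator<Element>() {
            public int compare(Element a, Element b) {
                return Float.compare(score(b), score(a));
            }
        });
        for (Element doc : docs.subList(0, Math.min(10, docs.size()))) {
            System.out.println(score(doc));
        }
    }

    // Reads the <float name="score"> child of a <doc> element, if present.
    static float score(Element doc) {
        NodeList fields = doc.getElementsByTagName("float");
        for (int i = 0; i < fields.getLength(); i++) {
            Element f = (Element) fields.item(i);
            if ("score".equals(f.getAttribute("name"))) {
                return Float.parseFloat(f.getTextContent());
            }
        }
        return 0f;
    }
}
```

One caveat with this naive merge: raw Lucene scores from different shards
are not strictly comparable, since each shard computes IDF from its own
partition.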

Open issues:

1. How much better (if at all) would it be to use Hadoop RPC (versus
HTTP) to call the sub-searchers? I'm assuming it has better
performance, and there might be fewer connectivity issues, but then
you aren't leveraging the work being done on embedded Jetty, for
example. Anybody have data points on relative performance?

2. Is there one master schema on the "main" search server that could
get distributed to the remote searchers, or would that be part of a
snappuller-ish update mechanism?

Thanks,

-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"

Re: Federated Search

Mike Klaas
On 2/27/07, Ken Krugler <[hidden email]> wrote:

> I'm also interested in this. For me, I don't need sorted output,
> faceted browsing, or alternative output formats - so something along
> the lines of the "Merge XML responses w/o Schema" proposal would be
> just fine.
>
> Open issues:

3. Highlighting as a separate step.

Currently a bit of work needs to be done to do this efficiently with
Solr.  The way I set it up is roughly:
 - turn on lazy field loading; for best effect, compress the main text field.
 - create a new request handler, similar to dismax, that uses the query
for highlighting only.  A separate parameter allows specifying the
document keys to highlight.
 - highlighting requires the internal Lucene document id, not the
document key, and it can be slow to execute queries to get the ids.  I
created a custom cache that maps doc keys to doc ids, populate it
during the main query, and grab ids from the cache during the
highlighting step (a sketch of that lookup follows).
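
As a rough illustration of that last step (not Mike's actual code), a
key-to-id lookup against a Lucene 2.x IndexReader might look like the
following; the "id" field name and the plain HashMap are assumptions:

```java
// Sketch: resolve a unique-key value to an internal Lucene doc id with
// TermDocs rather than re-running the query, caching the result.
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;

public class KeyToDocId {
    private final Map<String, Integer> cache = new HashMap<String, Integer>();

    /** Returns the internal doc id for a unique key, or -1 if absent. */
    public synchronized int lookup(IndexReader reader, String key)
            throws IOException {
        Integer cached = cache.get(key);
        if (cached != null) return cached.intValue();
        TermDocs td = reader.termDocs(new Term("id", key));
        try {
            int id = td.next() ? td.doc() : -1;
            // Note: cached ids are only valid for the reader that
            // produced them; clear the cache when the index reopens.
            cache.put(key, new Integer(id));
            return id;
        } finally {
            td.close();
        }
    }
}
```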

regards,
-Mike

Re: Federated Search

Tim Patton


Venkatesh Seetharam wrote:

> Hi Tim,
>
> Howdy. I saw your post on the Solr newsgroup and it caught my attention.
> I'm working on a similar problem for searching a vault of over 100 million
> XML documents. I already have the encoding part done using Hadoop and
> Lucene. It works like a charm. I create N index partitions and have
> been trying to wrap Solr to search each partition, with a search broker
> that merges the results and returns them.
>
> I'm curious how you have solved the distribution of additions,
> deletions and updates to each of the indexing servers. I use a
> partitioner based on a hash of the document id. Do you broadcast to the
> slaves as to who owns a document?
>
> Also, I'm looking at Hadoop RPC and ICE (www.zeroc.com)
> for distributing the search across these Solr servers. I'm not using HTTP.
>
> Any ideas are greatly appreciated.
>
> PS: I did subscribe to the solr newsgroup just now but did not receive a
> confirmation, hence I am sending this to you directly.
>
> --
> Thanks,
> Venkatesh
>
> "Perfection (in design) is achieved not when there is nothing more to
> add, but rather when there is nothing more to take away."
> - Antoine de Saint-Exupéry


I used a SQL database to keep track of which server had which document.
Originally I used JMS with a selector for which server number the
document should go to (sketched below).  I switched over to a home-grown,
lightweight message server, since JMS behaves really badly when it backs
up and I couldn't find a server that would simply pause the producers if
there was a problem with the consumers.  Additions are pretty much
assigned randomly to whichever server gets them first.  At this point I
am up to around 20 million documents.
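
For readers unfamiliar with JMS selectors, here is a hedged sketch of the
routing Tim describes: producers stamp each document message with a
server-number property, and each indexing server consumes only its own
messages. The queue name and the serverNumber property are hypothetical.

```java
// Sketch of selector-based routing with the standard JMS 1.1 API.
import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.MessageConsumer;
import javax.jms.Queue;
import javax.jms.Session;
import javax.jms.TextMessage;

public class SelectorConsumer {
    public static void listen(ConnectionFactory factory, int serverNumber)
            throws Exception {
        Connection conn = factory.createConnection();
        Session session = conn.createSession(false, Session.AUTO_ACKNOWLEDGE);
        Queue queue = session.createQueue("documents");
        // Only receive messages routed to this indexing server.
        MessageConsumer consumer =
            session.createConsumer(queue, "serverNumber = " + serverNumber);
        conn.start();
        while (true) {
            TextMessage msg = (TextMessage) consumer.receive();
            // Apply the add/update/delete to the local Lucene index here.
            System.out.println("Got document update: " + msg.getText());
        }
    }
}
```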

The hash idea sounds really interesting, and if I had a fixed number of
indexes it would be perfect.  But I don't know how big the index will
grow, and I wanted to be able to add servers at any point.  I would like
to eliminate any outside dependencies (SQL, JMS), which is why a
distributed Solr would let me focus on other areas.

How did you work around not being able to update a Lucene index that is
stored in Hadoop?  I know there were changes in Lucene 2.1 to support
this, but I haven't looked that far into it yet; I've just been testing
the new IndexWriter.  As an aside, I hope those features can be used by
Solr soon (if they aren't already in the nightlies).

Tim


Re: Federated Search

Jack L
This is a very interesting discussion. I have a few questions from
reading Tim's and Venkatesh's emails:

To Tim:
1. Is there any reason you don't want to use HTTP? Since Solr has
   an HTTP interface already, I suppose using HTTP is the simplest
   way to talk to the Solr servers from the merger/search broker.
   Hadoop and ICE would both require some additional work; this is
   if you are using Solr and not Lucene directly.

2. "Do you broadcast to the slaves as to who owns a document?"
   Do the searchers need to know who has what document?

To Venkatesh:
1. I suppose Solr is OK to handle 20 million documents - I hope I'm
   right, because that's what I'm planning on doing :) Is it because
   of storage capacity that you chose to use multiple Solr
   servers?

An open question: what's the best way to manage server addition?
- If hash-based partitioning is used, re-indexing all the
  documents will be needed (a toy illustration follows).
- Otherwise, a database seems to be required to track the documents.
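
A toy illustration of the re-indexing point, with hypothetical numbers:
under naive modulo partitioning, going from 4 to 5 servers remaps the
partition for most existing document ids.

```java
// Shows why adding a server under modulo partitioning forces a re-index:
// most documents' assigned partition changes when the divisor changes.
public class ModuloRemap {
    static int partition(String docId, int numServers) {
        // Mask the sign bit so the modulo result is never negative.
        return (docId.hashCode() & 0x7fffffff) % numServers;
    }

    public static void main(String[] args) {
        int moved = 0, total = 100000;
        for (int i = 0; i < total; i++) {
            String docId = "doc-" + i;
            if (partition(docId, 4) != partition(docId, 5)) moved++;
        }
        // Expect roughly 80% of documents to land on a different server.
        System.out.println(moved + " of " + total + " documents would move");
    }
}
```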

--
Best regards,
Jack


Re: Federated Search

Tim Patton



Jack,

My big stumbling blocks were with indexing more so than searching.  I
did put together an RMI-based system to search multiple Lucene servers
(see the sketch below), and the searchers don't need to know where
everything is.  However, with indexing, at some point something needs to
know where to send documents for updating, or whom to tell to delete a
document, whether that is the server that does the processing or some
sort of broker.  The processing machines could do the DB lookup and talk
to Solr over HTTP, no problem, and this is part of what I am considering
doing.  However, I have some extra code on the indexing machines to
handle DB updates and so on, though I might find a way to move this
elsewhere in the system so I can have pretty much a pure Solr server with
just a few custom items (like my own Similarity or QueryParser).

I suppose the DB could be moved from SQL to Lucene in the future as well.
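
For reference, a sketch of what such an RMI-based search layer can look
like with Lucene 2.x's contrib RemoteSearchable and MultiSearcher; the
registry names here are placeholders, not Tim's actual setup.

```java
// Broker-side sketch: look up remote Searchables over RMI and search them
// as one logical index. Each index server is assumed to have exported its
// IndexSearcher, e.g.
//   Naming.rebind("//indexer1/searcher",
//                 new RemoteSearchable(new IndexSearcher("/path/to/index")));
import java.rmi.Naming;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.MultiSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Searchable;

public class RmiBrokerSketch {
    public static void main(String[] args) throws Exception {
        Searchable[] shards = {
            (Searchable) Naming.lookup("//indexer1/searcher"),
            (Searchable) Naming.lookup("//indexer2/searcher")
        };
        MultiSearcher searcher = new MultiSearcher(shards);
        Query q = new QueryParser("text", new StandardAnalyzer())
            .parse("federated search");
        Hits hits = searcher.search(q);
        System.out.println(hits.length() + " hits across all shards");
        searcher.close();
    }
}
```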


Re: Federated Search

Venkatesh Seetharam
Hi Tim,

Thanks for your response. Interesting idea. Does the DB scale? Do you have
one single index which you plan to use Solr for, or do you have multiple
indexes?

> But I don't know how big the index will grow and I wanted to be able to
> add servers at any point.
I'm thinking of having N partitions with a max of 10 million documents per
partition. Adding a server should not be a problem, but the newly added
server would take time to grow so that the distribution of documents is
equal across the cluster. I've tested with 50 million documents of size 10
each, and it looks very promising.

> The hash idea sounds really interesting and if I had a fixed number of
> indexes it would be perfect.
I'm in fact looking around for a reverse-hash algorithm where, given a
docId, I can find which partition contains the document, so I can save
cycles on broadcasting to the slaves.

I mean, even if you use a DB, how have you solved the problem of
distribution when a new server is added into the mix?

We have the same problem since we get daily updates to documents and
document metadata.

> How did you work around not being able to update a lucene index that is
> stored in Hadoop?
I do not use HDFS. I use a NetApp mounted on all the nodes in the cluster
and hence did not need any changes to Lucene.

I plan to index using Lucene/Hadoop, use Solr as the partition searcher,
and have a broker that merges the results and returns them.

Thanks,
Venkatesh


Re: Federated Search

Venkatesh Seetharam
Hi Jack,

Howdy. Comments are inline.

> is there any reason you don't want to use HTTP?
I've seen that Hadoop RPC is faster than HTTP. Also, since Solr returns an
XML response, you incur overhead in parsing it and then merging. I haven't
done scale testing with HTTP and XML responses.

> Do the searchers need to know who has what document?
This is necessary if you are doing updates to documents in the index.

> I suppose solr is ok to handle 20 million document
Storage is not an issue. If the index is huge, searches take time, and
when you want 100 searches/second, that's really hard. I've read on the
Lucene newsgroup that Lucene works well with an index around 8-10GB; it
slows down when it's bigger than that. Since my index can run into many
GB, I'd partition it.

> - If a hash value-based partitioning is used, re-indexing all the
documents will be needed.
Why is that necessary? If a document has to be updated, you can broadcast
to the slaves as to who owns it and then send an update to that slave.

Venkatesh


Re: Federated Search

Jed Reynolds-2
Venkatesh Seetharam wrote:

>> The hash idea sounds really interesting and if I had a fixed number of
>> indexes it would be perfect.
> I'm in fact looking around for a reverse-hash algorithm where, given a
> docId, I should be able to find which partition contains the document,
> so I can save cycles on broadcasting to the slaves.

Many large databases partition their data either by load or in another
logical manner, like by alphabet. I hear that Hotmail, for instance,
partitions its users alphabetically. Having a broker will certainly
abstract this mechanism, and of course your application(s) will want to
be able to bypass the broker when necessary.

> I mean, even if you use a DB, how have you solved the problem of
> distribution when a new server is added into the mix.

http://www8.org/w8-papers/2a-webserver/caching/paper2.html

I saw this link on the memcached list, and the thread surrounding it
covered some similar ground. Ideas that were discussed include:
- high availability of memcached, redundant entries
- scaling out clusters and facing the need to rebuild the entire cache
on all nodes, depending on your bucketing.

I see some similarities between maintaining multiple indices/Lucene
partitions and running a memcached deployment: if you are hashing your
keys to partitions (or buckets, or machines), then you may be faced with
(a) availability issues if there is a machine/partition outage, and (b)
rebuilding partitions if adding a partition/bucket changes the hash
mapping.

I can think of two ways to scale out new indexes: have your application
maintain two sets of bucket mappings from ids to indexes, or key your
documents and partition them by date. The former would let you rebuild a
second set of repartitioned indexes and buckets, then switch your
application to the new bucket mapping once all the indexes have been
rebuilt. The latter would only apply if you could organize your document
ids by date and only added new documents at the 'now' end, or evenly
across most dates; you'd have to add a new partition onto the end as time
progressed, and rarely rebuild old indexes unless your documents grow
unevenly. (The consistent-hashing approach from the paper above is
sketched below.)
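
The paper linked above describes consistent hashing, which addresses
exactly this remapping problem; here is a minimal sketch, where the
virtual-node count and the use of String.hashCode() are simplifications:

```java
// Consistent hashing: servers are placed on a hash ring, each document
// goes to the first server clockwise from its hash, and adding a server
// only moves the keys that fall between it and its predecessor.
import java.util.SortedMap;
import java.util.TreeMap;

public class ConsistentHashRing {
    private final SortedMap<Integer, String> ring =
        new TreeMap<Integer, String>();
    private static final int VNODES = 64;

    public void addServer(String server) {
        // Place several virtual nodes per server to even out the load.
        for (int i = 0; i < VNODES; i++) {
            ring.put(hash(server + "#" + i), server);
        }
    }

    public String serverFor(String docId) {
        int h = hash(docId);
        SortedMap<Integer, String> tail = ring.tailMap(h);
        // Wrap around the ring if we fell past the last server.
        return tail.isEmpty() ? ring.get(ring.firstKey())
                              : tail.get(tail.firstKey());
    }

    private static int hash(String s) {
        return s.hashCode() & 0x7fffffff;
    }
}
```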

Interesting topic! I don't yet need to run multiple Lucene partitions,
but I have a few memcached servers, and I expect that increasing their
number will force my site to take a performance hit as I am forced to
rebuild the caches. Similarly, if I had multiple Lucene partitions and
had to split some of them, rebuilding the resulting partitions would be
time-intensive, and I'd want procedures in place for availability,
scaling out, and changing application code as necessary. Just having one
fail-over Solr index is so easy in comparison.
 
Jed

Re: Federated Search

Tim Patton
I have several indexes now (4 at the moment, 20GB each), and I want to be
able to drop in a new machine easily.  I'm using SQL Server as the DB and
it scales well.  The DB doesn't get hit too hard, mostly doing location
lookups, and the app does some checking to make sure a document has
really changed before updating it in the DB or the index.  When a new
server is added, it randomly picks up additions from the message server
(it's approximately round-robin), and the rest of the system really
doesn't even need to know about it.

I've realized partitioned indexing is a difficult but solvable problem.
It could be a big project, though; we have all solved it in our own way,
but no one has a general solution.  Distributed searching might be a
better area to add to Solr, since that should basically be the same for
everyone.  I'm going to mess around with Jini on my own indexes; there's
finally a new book out to go with the newer versions.

How were you planning on using Solr with Hadoop?  Maybe I don't fully
understand how Hadoop works.

Tim


Re: Federated Search

Venkatesh Seetharam
Hi Jed,

Thanks for sharing your thoughts and the link.

Venkatesh
