solr+hadoop = next solr


James liu-2
Does anyone agree?

What is the development plan for the next version of Solr? Does anyone know?


--
regards
jl

Re: solr+hadoop = next solr

Yonik Seeley-2
On 6/6/07, James liu <[hidden email]> wrote:
> anyone agree?

No ;-)

At least not if you mean using map-reduce for queries.

When I started looking at distributed search, I immediately went and
read the map-reduce paper (easier concept than it first appeared), and
realized it's really more for the indexing side of things (big batch
jobs, making data from data, etc).  Nutch uses map reduce for
crawling/indexing, but not for querying.

-Yonik
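
To make the "making data from data" point concrete, here is a minimal single-process sketch of how a map-reduce style batch job can build an inverted index. It is not code from Solr, Nutch, or Hadoop; the function names (`map_phase`, `reduce_phase`, `build_index`) and the toy corpus are assumptions made up for illustration.

```python
from collections import defaultdict
from itertools import chain


def map_phase(doc_id, text):
    """Map step: emit one (term, doc_id) pair per distinct term in a document."""
    return [(term, doc_id) for term in set(text.lower().split())]


def reduce_phase(pairs):
    """Reduce step: group doc ids by term into sorted posting lists."""
    postings = defaultdict(list)
    for term, doc_id in pairs:
        postings[term].append(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def build_index(docs):
    """Run map over every document, then one reduce over all emitted pairs."""
    mapped = chain.from_iterable(
        map_phase(doc_id, text) for doc_id, text in docs.items()
    )
    return reduce_phase(mapped)


# Hypothetical toy corpus, only for illustration.
DOCS = {
    1: "solr distributed search",
    2: "hadoop batch indexing",
    3: "solr indexing",
}
INDEX = build_index(DOCS)
```

The whole job is a batch transformation from documents to postings, which fits indexing well. A query, by contrast, needs a low-latency lookup against an index that already exists, which is why the batch model is a poor match on the search side.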

Re: solr+hadoop = next solr

jrodenburg
I've been exploring distributed search, as of late.  I don't know about the
"next solr" but I could certainly see a "distributed solr" grow out of such
an expansion.

In terms of the FederatedSearch wiki entry (updated last year), has there
been any progress made this year on this topic, at least something worthy of
being added or updated to the wiki page?  Not to splinter efforts here, but
maybe a working group that was focused on that topic could help to move
things forward a bit.

- j

On 6/6/07, Yonik Seeley <[hidden email]> wrote:

>
> On 6/6/07, James liu <[hidden email]> wrote:
> > anyone agree?
>
> No ;-)
>
> At least not if you mean using map-reduce for queries.
>
> When I started looking at distributed search, I immediately went and
> read the map-reduce paper (easier concept than it first appeared), and
> realized it's really more for the indexing side of things (big batch
> jobs, making data from data, etc).  Nutch uses map reduce for
> crawling/indexing, but not for querying.
>
> -Yonik
>

Re: solr+hadoop = next solr

James liu-2
In reply to this post by Yonik Seeley-2
2007/6/7, Yonik Seeley <[hidden email]>:

>
> On 6/6/07, James liu <[hidden email]> wrote:
> > anyone agree?
>
> No ;-)
>
> At least not if you mean using map-reduce for queries.
>
> When I started looking at distributed search, I immediately went and
> read the map-reduce paper (easier concept than it first appeared), and
> realized it's really more for the indexing side of things (big batch
> jobs, making data from data, etc).  Nutch uses map reduce for
> crawling/indexing, but not for querying.


Yes, Nutch uses map-reduce only for crawling/indexing, not for querying.


http://www.nabble.com/something-i-think-important-and-should-be-added-tf3813838.html#a10796136

Map-reduce would be used just for indexing, to decrease the index size on
the master Solr query instance and increase query speed.

It will take a lot of time to index and merge, but it will improve query
accuracy.

The index and the data would not be on the same box, so we only need to make
sure the master query server has powerful hardware; the hardware of the
slave query servers is not very important.

The master index server should support multiple indexes.

If Solr supported this, I think Solr users could set up their search
quickly.


It's just my thought.

What do you think, Yonik? And what do you think the next Solr should be?





--
regards
jl

Re: solr+hadoop = next solr

Yonik Seeley-2
In reply to this post by jrodenburg
On 6/6/07, Jeff Rodenburg <[hidden email]> wrote:
> In terms of the FederatedSearch wiki entry (updated last year), has there
> been any progress made this year on this topic, at least something worthy of
> being added or updated to the wiki page?

Priorities shifted, and I dropped it for a while.
I recently started working with a CNET group that may need it, so I
could start working on it again in the next few months.  Don't wait
for me if you have ideas though... I'll try to follow along and chime
in.

-Yonik

Re: solr+hadoop = next solr

Ian Holsman (Lists)
Yonik Seeley wrote:

> On 6/6/07, Jeff Rodenburg <[hidden email]> wrote:
>> In terms of the FederatedSearch wiki entry (updated last year), has
>> there
>> been any progress made this year on this topic, at least something
>> worthy of
>> being added or updated to the wiki page?
>
> Priorities shifted, and I dropped it for a while.
> I recently started working with a CNET group that may need it, so I
> could start working on it again in the next few months.  Don't wait
> for me if you have ideas though... I'll try to follow along and chime
> in.
>
> -Yonik
>
Hi Yonik,

We also have a need for federated search where I work, and are planning
to get going in the next week or two, hopefully.

The team will post to the list when they have something more concrete to
add.

Re: solr+hadoop = next solr

Mike Klaas
In reply to this post by jrodenburg
On 6-Jun-07, at 7:44 PM, Jeff Rodenburg wrote:

> I've been exploring distributed search, as of late.  I don't know  
> about the
> "next solr" but I could certainly see a "distributed solr" grow out  
> of such
> an expansion.

I've implemented a highly-distributed search engine using Solr (200m  
docs and growing, 60+ servers).   It is not a Solr-based solution in  
the vein of FederatedSearch--it is a higher-level architecture that  
uses Solr as indexing nodes.  I'll note that it is a lot of work and  
would be even more work to develop in the generic extensible  
philosophy that Solr espouses.

It is not really suitable for contribution, unfortunately (being  
written in python and proprietary).
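
The post does not say how documents are assigned to indexing nodes, so the following is only a rough sketch of one common approach, hash-based routing, and not a description of the system above. The node URLs and the `pick_node` and `partition` helpers are hypothetical.

```python
import hashlib


def pick_node(doc_id, nodes):
    """Choose an indexing node for a document by hashing its id (an assumed scheme)."""
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]


def partition(doc_ids, nodes):
    """Group document ids by the node that would index them."""
    batches = {node: [] for node in nodes}
    for doc_id in doc_ids:
        batches[pick_node(doc_id, nodes)].append(doc_id)
    return batches


# Hypothetical node URLs; a real deployment would list its own Solr instances.
NODES = [
    "http://solr-node-0:8983/solr",
    "http://solr-node-1:8983/solr",
    "http://solr-node-2:8983/solr",
]
BATCHES = partition([f"doc-{i}" for i in range(100)], NODES)
```

A stable hash means the same document always lands on the same node, so updates and deletes can be sent to one place. The trade-off is that changing the number of nodes remaps most documents unless a scheme such as consistent hashing is used instead.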

> In terms of the FederatedSearch wiki entry (updated last year), has  
> there
> been any progress made this year on this topic, at least something  
> worthy of
> being added or updated to the wiki page?  Not to splinter efforts  
> here, but
> maybe a working group that was focused on that topic could help to  
> move
> things forward a bit.

I don't believe that absence of organization has been the cause of  
lack of forward progress on this issue, but simply that there has  
been no-one sufficiently interested and committed to prioritizing  
this huge task to work on it.  There is no need to form a working  
group (not when there are only a handful of active committers to  
begin with)--all interested people could just use solr-dev@ for  
discussion.

Solr is an open-source project, so huge features will get implemented  
when there is a person or group of people devoted to leading the  
charge on the issue.  If you're interested in being that person,  
that's great!

-Mike

Re: solr+hadoop = next solr

jrodenburg
Mike - thanks for the comments.  Some responses added below.

On 6/7/07, Mike Klaas <[hidden email]> wrote:
>
>
> I've implemented a highly-distributed search engine using Solr (200m
> docs and growing, 60+ servers).   It is not a Solr-based solution in
> the vein of FederatedSearch--it is a higher-level architecture that
> uses Solr as indexing nodes.  I'll note that it is a lot of work and
> would be even more work to develop in the generic extensible
> philosophy that Solr espouses.


Yeah, we've done the same thing in the .Net world, and it's a tough slog.
We're in the same situation -- making our solution generically extensible is
pretty much a non-starter.

> > In terms of the FederatedSearch wiki entry (updated last year), has there
> > been any progress made this year on this topic, at least something worthy of
> > being added or updated to the wiki page?  Not to splinter efforts here, but
> > maybe a working group that was focused on that topic could help to move
> > things forward a bit.
>
> I don't believe that absence of organization has been the cause of
> lack of forward progress on this issue, but simply that there has
> been no-one sufficiently interested and committed to prioritizing
> this huge task to work on it.  There is no need to form a working
> group (not when there are only a handful of active committers to
> begin with)--all interested people could just use solr-dev@ for
> discussion.


That makes sense, I just didn't want to bombard the list with the subject if
it was a distraction from the core project, i.e. keep Lucene messages on
lucene, Solr messages on solr, etc.  The good-community-participant
approach, if you will.

> Solr is an open-source project, so huge features will get implemented
> when there is a person or group of people devoted to leading the
> charge on the issue.  If you're interested in being that person,
> that's great!
>
>
Glad to jump in, not sure I qualify as such for that, but certainly a big
cheerleader nonetheless.

Re: solr+hadoop = next solr

rossini
Hi, Jeff and Mike.

   Would you mind telling us a little bit about the architecture of your
solutions? Mike, you said that you implemented a highly-distributed search
engine using Solr as indexing nodes. What does that mean? Did you
implement a master, multi-slave solution for replication? Or is the whole
index sharded for high availability and failover?


On 6/7/07, Jeff Rodenburg <[hidden email]> wrote:

>
> Mike - thanks for the comments.  Some responses added below.
>
> On 6/7/07, Mike Klaas <[hidden email]> wrote:
> >
> >
> > I've implemented a highly-distributed search engine using Solr (200m
> > docs and growing, 60+ servers).   It is not a Solr-based solution in
> > the vein of FederatedSearch--it is a higher-level architecture that
> > uses Solr as indexing nodes.  I'll note that it is a lot of work and
> > would be even more work to develop in the generic extensible
> > philosophy that Solr espouses.
>
>
> Yeah, we've done the same thing in the .Net world, and it's a tough slog.
> We're in the same situation -- making our solution generically extensible
> is
> pretty much a non-starter.
>
> > In terms of the FederatedSearch wiki entry (updated last year), has
> > > there
> > > been any progress made this year on this topic, at least something
> > > worthy of
> > > being added or updated to the wiki page?  Not to splinter efforts
> > > here, but
> > > maybe a working group that was focused on that topic could help to
> > > move
> > > things forward a bit.
> >
> > I don't believe that absence of organization has been the cause of
> > lack of forward progress on this issue, but simply that there has
> > been no-one sufficiently interested and committed to prioritizing
> > this huge task to work on it.  There is no need to form a working
> > group (not when there are only a handful of active committers to
> > begin with)--all interested people could just use solr-dev@ for
> > discussion.
>
>
> That makes sense, just didn't want to bombard the list with the subject if
> it was a detractor from the core project, i.e. keep lucene messages on
> lucene, solr messages on solr, etc.  The good-community-participant
> approach, if you will.
>
> Solr is an open-source project, so huge features will get implemented
> > when there is a person or group of people devoted to leading the
> > charge on the issue.  If you're interested in being that person,
> > that's great!
> >
> >
> Glad to jump in, not sure I qualify as such for that, but certainly a big
> cheerleader nonetheless.
>

Re: solr+hadoop = next solr

jrodenburg
On 6/7/07, Rafael Rossini <[hidden email]> wrote:

>
> Hi, Jeff and Mike.
>
>    Would you mind telling us about the architecture of your solutions a
> little bit? Mike, you said that you implemented a highly-distributed
> search
> engine using Solr as indexing nodes. What does that mean? You guys
> implemented a master, multi-slave solution for replication? Or the whole
> index shards for high availability and fail over?
>

Our solution doesn't use Solr, but goes directly to Lucene.  It's built on
Windows, so the interop communication service is built on .NET Remoting
(TCP-based).  Microsoft has deprecated ongoing development of .NET Remoting
in favor of other, more standard mechanisms, i.e. HTTP.  So, we're looking
to migrate our solution to a more community-supported model.

The underlying structure sounds similar to what others have done: index
shards distributed to various servers, each responsible for a subset of the
index.  A merging server handles coordination of concurrent thread requests
and synchronizes the results as they're returned.  The thread coordination
and search-results interleaving process is functional but not really
scalable.  It works for our user model, where users tend not to page deeply
through results.  We want to change that so we can use Solr as the primary
read mechanism for our site's data.

-- j
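
The shard-and-merge structure described above can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the .NET implementation from the post: the shards are in-memory dictionaries of made-up scores, and `search_shard` and `merged_search` are hypothetical names standing in for remote calls and the merging server.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory shards: each maps doc id -> relevance score
# for one fixed query. Real shards would be remote index servers.
SHARDS = [
    {"a1": 0.9, "a2": 0.4, "a3": 0.1},
    {"b1": 0.8, "b2": 0.7},
    {"c1": 0.95, "c2": 0.2},
]


def search_shard(shard, top_n):
    """Stand-in for a remote call: return the shard's best top_n (score, doc_id) hits."""
    return heapq.nlargest(top_n, ((score, doc_id) for doc_id, score in shard.items()))


def merged_search(shards, offset, limit):
    """Query all shards concurrently, then interleave the hits by score."""
    # Each shard must return offset + limit hits, because any single shard
    # could hold every one of the globally best results. This is why deep
    # paging gets more expensive as the offset grows.
    top_n = offset + limit
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        per_shard = list(pool.map(lambda s: search_shard(s, top_n), shards))
    merged = heapq.nlargest(top_n, (hit for hits in per_shard for hit in hits))
    return [doc_id for _score, doc_id in merged[offset:offset + limit]]


PAGE_1 = merged_search(SHARDS, offset=0, limit=3)
PAGE_2 = merged_search(SHARDS, offset=3, limit=3)
```

The merging step has to collect and sort up to `(offset + limit) * number_of_shards` hits per request, so the cost grows with page depth. That is consistent with the observation that this design works when users rarely page deeply.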