anyone use hadoop+solr?

classic Classic list List threaded Threaded
30 messages Options
12
Reply | Threaded
Open this post in threaded view
|

Re: SolrCloud distributed indexing (Re: anyone use hadoop+solr?)

Yonik Seeley-2-2
On Mon, Sep 6, 2010 at 10:18 AM, MitchK <[hidden email]> wrote:
[...consistent hashing...]
> But it doesn't solve the problem at all, correct me if I am wrong, but: If
> you add a new server, let's call him IP3-1, and IP3-1 is nearer to the
> current ressource X, than doc x will be indexed at IP3-1 - even if IP2-1
> holds the older version.
> Am I right?

Right.  You still need code to handle migration.

Consistent hashing is a way for everyone to be able to agree on the
mapping, and for the mapping to change incrementally.  i.e. you add a
node and it only changes the docid->node mapping of a limited percent
of the mappings, rather than changing the mappings of potentially
everything, as a simple MOD would do.

For SolrCloud, I don't think we'll end up using consistent hashing -
we don't need it (although some of the concepts may still be useful).

-Yonik
http://lucenerevolution.org  Lucene/Solr Conference, Boston Oct 7-8
Reply | Threaded
Open this post in threaded view
|

Re: SolrCloud distributed indexing (Re: anyone use hadoop+solr?)

Andrzej Białecki-2
On 2010-09-06 16:41, Yonik Seeley wrote:

> On Mon, Sep 6, 2010 at 10:18 AM, MitchK<[hidden email]>  wrote:
> [...consistent hashing...]
>> But it doesn't solve the problem at all, correct me if I am wrong, but: If
>> you add a new server, let's call him IP3-1, and IP3-1 is nearer to the
>> current ressource X, than doc x will be indexed at IP3-1 - even if IP2-1
>> holds the older version.
>> Am I right?
>
> Right.  You still need code to handle migration.
>
> Consistent hashing is a way for everyone to be able to agree on the
> mapping, and for the mapping to change incrementally.  i.e. you add a
> node and it only changes the docid->node mapping of a limited percent
> of the mappings, rather than changing the mappings of potentially
> everything, as a simple MOD would do.

Another strategy to avoid excessive reindexing is to keep splitting the
largest shards, and then your mapping becomes a regular MOD plus a list
of these additional splits. Really, there's an infinite number of ways
you could implement this...

>
> For SolrCloud, I don't think we'll end up using consistent hashing -
> we don't need it (although some of the concepts may still be useful).

I imagine there could be situations where a simple MOD won't do ;) so I
think it would be good to hide this strategy behind an
interface/abstract class. It costs nothing, and gives you flexibility in
how you implement this mapping.

--
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

Reply | Threaded
Open this post in threaded view
|

Re: SolrCloud distributed indexing (Re: anyone use hadoop+solr?)

gearond
What is a 'simple MOD'?

Dennis Gearon

Signature Warning
----------------
EARTH has a Right To Life,
  otherwise we all die.

Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php


--- On Mon, 9/6/10, Andrzej Bialecki <[hidden email]> wrote:

> From: Andrzej Bialecki <[hidden email]>
> Subject: Re: SolrCloud distributed indexing (Re: anyone use hadoop+solr?)
> To: [hidden email]
> Date: Monday, September 6, 2010, 11:30 AM
> On 2010-09-06 16:41, Yonik Seeley
> wrote:
> > On Mon, Sep 6, 2010 at 10:18 AM, MitchK<[hidden email]
> wrote:
> > [...consistent hashing...]
> >> But it doesn't solve the problem at all, correct
> me if I am wrong, but: If
> >> you add a new server, let's call him IP3-1, and
> IP3-1 is nearer to the
> >> current ressource X, than doc x will be indexed at
> IP3-1 - even if IP2-1
> >> holds the older version.
> >> Am I right?
> >
> > Right.  You still need code to handle migration.
> >
> > Consistent hashing is a way for everyone to be able to
> agree on the
> > mapping, and for the mapping to change
> incrementally.  i.e. you add a
> > node and it only changes the docid->node mapping of
> a limited percent
> > of the mappings, rather than changing the mappings of
> potentially
> > everything, as a simple MOD would do.
>
> Another strategy to avoid excessive reindexing is to keep
> splitting the largest shards, and then your mapping becomes
> a regular MOD plus a list of these additional splits.
> Really, there's an infinite number of ways you could
> implement this...
>
> >
> > For SolrCloud, I don't think we'll end up using
> consistent hashing -
> > we don't need it (although some of the concepts may
> still be useful).
>
> I imagine there could be situations where a simple MOD
> won't do ;) so I think it would be good to hide this
> strategy behind an interface/abstract class. It costs
> nothing, and gives you flexibility in how you implement this
> mapping.
>
> -- Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _
> _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic
> Web
> ___|||__||  \|  ||  |  Embedded Unix,
> System Integration
> http://www.sigram.com  Contact: info at sigram dot
> com
>
>
Reply | Threaded
Open this post in threaded view
|

Re: SolrCloud distributed indexing (Re: anyone use hadoop+solr?)

Andrzej Białecki-2
On 2010-09-06 22:03, Dennis Gearon wrote:
> What is a 'simple MOD'?

md5(docId) % numShards

--
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

Reply | Threaded
Open this post in threaded view
|

RE: Re: SolrCloud distributed indexing (Re: anyone use hadoop+solr?)

Markus Jelsma
In reply to this post by gearond
The remainder of an arithmetic division

http://en.wikipedia.org/wiki/Modulo_operation
-----Original message-----
From: Dennis Gearon <[hidden email]>
Sent: Mon 06-09-2010 22:04
To: [hidden email];
Subject: Re: SolrCloud distributed indexing (Re: anyone use hadoop+solr?)

What is a 'simple MOD'?

Dennis Gearon

Signature Warning
----------------
EARTH has a Right To Life,
 otherwise we all die.

Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php


--- On Mon, 9/6/10, Andrzej Bialecki <[hidden email]> wrote:

> From: Andrzej Bialecki <[hidden email]>
> Subject: Re: SolrCloud distributed indexing (Re: anyone use hadoop+solr?)
> To: [hidden email]
> Date: Monday, September 6, 2010, 11:30 AM
> On 2010-09-06 16:41, Yonik Seeley
> wrote:
> > On Mon, Sep 6, 2010 at 10:18 AM, MitchK<[hidden email]
> wrote:
> > [...consistent hashing...]
> >> But it doesn't solve the problem at all, correct
> me if I am wrong, but: If
> >> you add a new server, let's call him IP3-1, and
> IP3-1 is nearer to the
> >> current ressource X, than doc x will be indexed at
> IP3-1 - even if IP2-1
> >> holds the older version.
> >> Am I right?
> >
> > Right.  You still need code to handle migration.
> >
> > Consistent hashing is a way for everyone to be able to
> agree on the
> > mapping, and for the mapping to change
> incrementally.  i.e. you add a
> > node and it only changes the docid->node mapping of
> a limited percent
> > of the mappings, rather than changing the mappings of
> potentially
> > everything, as a simple MOD would do.
>
> Another strategy to avoid excessive reindexing is to keep
> splitting the largest shards, and then your mapping becomes
> a regular MOD plus a list of these additional splits.
> Really, there's an infinite number of ways you could
> implement this...
>
> >
> > For SolrCloud, I don't think we'll end up using
> consistent hashing -
> > we don't need it (although some of the concepts may
> still be useful).
>
> I imagine there could be situations where a simple MOD
> won't do ;) so I think it would be good to hide this
> strategy behind an interface/abstract class. It costs
> nothing, and gives you flexibility in how you implement this
> mapping.
>
> -- Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _
> _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic
> Web
> ___|||__||  \|  ||  |  Embedded Unix,
> System Integration
> http://www.sigram.com  Contact: info at sigram dot
> com
>
>
Reply | Threaded
Open this post in threaded view
|

RE: Re: SolrCloud distributed indexing (Re: anyone use hadoop+solr?)

gearond
Oh, THAT MOD! LOL!

I thought it was some search engine specific acronym.
Dennis Gearon

Signature Warning
----------------
EARTH has a Right To Life,
  otherwise we all die.

Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php


--- On Mon, 9/6/10, Markus Jelsma <[hidden email]> wrote:

> From: Markus Jelsma <[hidden email]>
> Subject: RE: Re: SolrCloud distributed indexing (Re: anyone use hadoop+solr?)
> To: [hidden email]
> Date: Monday, September 6, 2010, 2:53 PM
> The remainder of an arithmetic
> division
>
> http://en.wikipedia.org/wiki/Modulo_operation

> -----Original message-----
> From: Dennis Gearon <[hidden email]>
> Sent: Mon 06-09-2010 22:04
> To: [hidden email];
>
> Subject: Re: SolrCloud distributed indexing (Re: anyone use
> hadoop+solr?)
>
> What is a 'simple MOD'?
>
> Dennis Gearon
>
> Signature Warning
> ----------------
> EARTH has a Right To Life,
>  otherwise we all die.
>
> Read 'Hot, Flat, and Crowded'
> Laugh at http://www.yert.com/film.php

>
>
> --- On Mon, 9/6/10, Andrzej Bialecki <[hidden email]>
> wrote:
>
> > From: Andrzej Bialecki <[hidden email]>
> > Subject: Re: SolrCloud distributed indexing (Re:
> anyone use hadoop+solr?)
> > To: [hidden email]
> > Date: Monday, September 6, 2010, 11:30 AM
> > On 2010-09-06 16:41, Yonik Seeley
> > wrote:
> > > On Mon, Sep 6, 2010 at 10:18 AM, MitchK<[hidden email]
> > wrote:
> > > [...consistent hashing...]
> > >> But it doesn't solve the problem at all,
> correct
> > me if I am wrong, but: If
> > >> you add a new server, let's call him IP3-1,
> and
> > IP3-1 is nearer to the
> > >> current ressource X, than doc x will be
> indexed at
> > IP3-1 - even if IP2-1
> > >> holds the older version.
> > >> Am I right?
> > >
> > > Right.  You still need code to handle
> migration.
> > >
> > > Consistent hashing is a way for everyone to be
> able to
> > agree on the
> > > mapping, and for the mapping to change
> > incrementally.  i.e. you add a
> > > node and it only changes the docid->node
> mapping of
> > a limited percent
> > > of the mappings, rather than changing the
> mappings of
> > potentially
> > > everything, as a simple MOD would do.
> >
> > Another strategy to avoid excessive reindexing is to
> keep
> > splitting the largest shards, and then your mapping
> becomes
> > a regular MOD plus a list of these additional splits.
> > Really, there's an infinite number of ways you could
> > implement this...
> >
> > >
> > > For SolrCloud, I don't think we'll end up using
> > consistent hashing -
> > > we don't need it (although some of the concepts
> may
> > still be useful).
> >
> > I imagine there could be situations where a simple
> MOD
> > won't do ;) so I think it would be good to hide this
> > strategy behind an interface/abstract class. It costs
> > nothing, and gives you flexibility in how you
> implement this
> > mapping.
> >
> > -- Best regards,
> > Andrzej Bialecki     <><
> >  ___. ___ ___ ___ _
> > _   __________________________________
> > [__ || __|__/|__||\/|  Information Retrieval,
> Semantic
> > Web
> > ___|||__||  \|  ||  |  Embedded Unix,
> > System Integration
> > http://www.sigram.com  Contact: info at sigram dot
> > com
> >
> >
>
Reply | Threaded
Open this post in threaded view
|

Re: SolrCloud distributed indexing (Re: anyone use hadoop+solr?)

MitchK
In reply to this post by Andrzej Białecki-2
What if we do not care about the version of a document at index-time?

When it comes to distributed search, we currently decide aggregating documents based on their uniqueKey. But what would be, if we decide additionally decide on uniqueKey plus indexingDate, so that we only aggregate the last indexed version of a document?

The concept could look like this:
When Solr aggregated the documents for a response, it could store what shard responsed an older version of document x.

Now a crawler can crawl through our SolrCloud and asking each shard whether it noticed something like "shard y got an older version of doc x"-case.
The crawler aggregates those information. After he finished crawling, he sends delete-by-query-requests to those shards which have older versions of documents than they should have.

I will call these "stores document versions that are older than the newest version" ODV (Old Document Versions) for better understanding.

So, what can happen:
Before the crawler can visit shard A - who noticed that shard y stores an ODV of doc x - shard A can go down. That's okay, because either another shard noticed the same, or shard A will be available later on. If those information will we stored at HD, it will also be available. If it was stored in RAM the information is lost... however, you could replicate those information over more than one shard, right? :-)

Another case:
Shard y can go down - so someone has to care for storing the noticed ODV-information, so that one can delete the document when Shard Y comes back.

Pros:
- You can do something like consistent hashing in connection with a concept where each node has to care for its neighbour-nodes. This is because only the neighbour nodes can store ODVs.

- using the described concept, you can do nightly batches, looking for ODVs in the neigbour-nodes.

- ODVs will be found at requesting time, so we can avoid to response ODVs over newer versions.

Cons:
- We are wasting disc space.

- This works only for smaller clusters, not for large ones where the number of machines changes very frequently

... this is just another idea - and it is very very lazy.

I must emphasize, that I assume that neighbour-machines do not go down very frequently. Of course, it is not a question whether a machine crashes, but when it crashes - but I assume that the same server does not crash every hour. :-)

Thoughts?

Kind regards

Andrzej Bialecki wrote
On 2010-09-06 16:41, Yonik Seeley wrote:
> On Mon, Sep 6, 2010 at 10:18 AM, MitchK<mitch91@web.de>  wrote:
> [...consistent hashing...]
>> But it doesn't solve the problem at all, correct me if I am wrong, but: If
>> you add a new server, let's call him IP3-1, and IP3-1 is nearer to the
>> current ressource X, than doc x will be indexed at IP3-1 - even if IP2-1
>> holds the older version.
>> Am I right?
>
> Right.  You still need code to handle migration.
>
> Consistent hashing is a way for everyone to be able to agree on the
> mapping, and for the mapping to change incrementally.  i.e. you add a
> node and it only changes the docid->node mapping of a limited percent
> of the mappings, rather than changing the mappings of potentially
> everything, as a simple MOD would do.

Another strategy to avoid excessive reindexing is to keep splitting the
largest shards, and then your mapping becomes a regular MOD plus a list
of these additional splits. Really, there's an infinite number of ways
you could implement this...

>
> For SolrCloud, I don't think we'll end up using consistent hashing -
> we don't need it (although some of the concepts may still be useful).

I imagine there could be situations where a simple MOD won't do ;) so I
think it would be good to hide this strategy behind an
interface/abstract class. It costs nothing, and gives you flexibility in
how you implement this mapping.

--
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
Reply | Threaded
Open this post in threaded view
|

Re: SolrCloud distributed indexing (Re: anyone use hadoop+solr?)

MitchK
I must add something to my last post:

When saying it could be used together with techniques like consistent hashing, I mean it could be used at indexing time for indexing documents, since I assumed that the number of shards does not change frequently and therefore an ODV-case becomes relatively infrequent. Furthermore the overhead of searching for and removing those ODV-documents is relatively low.
Reply | Threaded
Open this post in threaded view
|

Re: SolrCloud distributed indexing (Re: anyone use hadoop+solr?)

Grant Ingersoll-2
In reply to this post by Yonik Seeley-2-2

On Sep 6, 2010, at 10:41 AM, Yonik Seeley wrote:

>
> For SolrCloud, I don't think we'll end up using consistent hashing -
> we don't need it (although some of the concepts may still be useful).

Can you elaborate on why we don't need it?
Reply | Threaded
Open this post in threaded view
|

Re: SolrCloud distributed indexing (Re: anyone use hadoop+solr?)

Yonik Seeley-2-2
On Thu, Sep 9, 2010 at 11:51 AM, Grant Ingersoll <[hidden email]> wrote:
> On Sep 6, 2010, at 10:41 AM, Yonik Seeley wrote:
>> For SolrCloud, I don't think we'll end up using consistent hashing -
>> we don't need it (although some of the concepts may still be useful).
>
> Can you elaborate on why we don't need it?

I guess because I can't think of a reason why we would need it - hence
it seems we don't?

Random node placement and virtual nodes would seem to be a
disadvantage for us since we aren't just a key-value store and care
about more than one key at a time.  Larger partitions in conjunction
with user-based/directed partitioning will allow doing things like
querying a specific user's email box (for example) by hitting a single
(or very few) nodes in the complete cluster.

-Yonik
http://lucenerevolution.org  Lucene/Solr Conference, Boston Oct 7-8
12