Does lucene support distributed indexing?

Does lucene support distributed indexing?

Samuel Guo-2
Hi all,

I am a Lucene newbie :)

It seems that Lucene doesn't support distributed indexing :(
As some IR research papers mention, when the document collection becomes
large, the index becomes large as well. When a single machine can't hold
the whole index, some strategy is needed to handle it. For example, we can
partition the whole collection into several smaller sub-collections.
Depending on how we partition, we get different strategies: document
partitioning and term partitioning. But I don't know why Lucene doesn't
support these approaches :(
Can anyone explain it? Or maybe Lucene uses other ways to solve this
problem?
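
To make the two strategies concrete, here is a toy sketch that uses plain in-memory maps rather than Lucene itself. The class and method names (`PartitionSketch`, `byDocument`, `byTerm`) are made up for illustration, and the modulo and hash assignments are just one simple assumption about how a split could be chosen.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy illustration only: real engines keep postings on disk, not in maps.
class PartitionSketch {

    // Document partitioning: each shard indexes a subset of the documents
    // and holds a complete (small) inverted index for that subset.
    static List<Map<String, TreeSet<Integer>>> byDocument(List<String> docs, int shards) {
        List<Map<String, TreeSet<Integer>>> out = new ArrayList<>();
        for (int s = 0; s < shards; s++) out.add(new TreeMap<>());
        for (int docId = 0; docId < docs.size(); docId++) {
            Map<String, TreeSet<Integer>> shard = out.get(docId % shards);
            for (String term : docs.get(docId).toLowerCase().split("\\s+")) {
                shard.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
        return out;
    }

    // Term partitioning: each shard owns a subset of the vocabulary
    // and holds the full postings list for every term it owns.
    static List<Map<String, TreeSet<Integer>>> byTerm(List<String> docs, int shards) {
        List<Map<String, TreeSet<Integer>>> out = new ArrayList<>();
        for (int s = 0; s < shards; s++) out.add(new TreeMap<>());
        for (int docId = 0; docId < docs.size(); docId++) {
            for (String term : docs.get(docId).toLowerCase().split("\\s+")) {
                int owner = Math.floorMod(term.hashCode(), shards);
                out.get(owner).computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> docs = List.of("lucene index", "distributed index", "lucene search");
        System.out.println("by document: " + byDocument(docs, 2));
        System.out.println("by term:     " + byTerm(docs, 2));
    }
}
```

With document partitioning a query has to visit every shard, but each shard can answer on its own. With term partitioning a query only visits the shards that own its terms, but multi-term queries have to combine postings lists held on different machines.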

Hope for your replies :)
Best Wishes

Samuel Guo

Re: Does lucene support distributed indexing?

Grant Ingersoll-2

On Apr 26, 2008, at 2:33 AM, Samuel Guo wrote:

> Hi all,
>
> I am a Lucene newbie :)
>
> It seems that Lucene doesn't support distributed indexing :(
> As some IR research papers mention, when the document collection becomes
> large, the index becomes large as well. When a single machine can't hold
> the whole index, some strategy is needed to handle it. For example, we can
> partition the whole collection into several smaller sub-collections.
> Depending on how we partition, we get different strategies: document
> partitioning and term partitioning. But I don't know why Lucene doesn't
> support these approaches :(
> Can anyone explain it?

Because no one has donated the code to do it.  You can do distributed
indexing via Nutch and some (albeit non-fault-tolerant) distributed
search in Lucene.  Solr also now has distributed search.

-Grant
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: Does lucene support distributed indexing?

Samuel Guo-2
Thanks a lot :)


Re: Does lucene support distributed indexing?

Otis Gospodnetic-2
In reply to this post by Samuel Guo-2
There are actually several distributed indexing or searching projects in Lucene (the top-level ASF Lucene project, not Lucene Java), and it's time to start thinking about the possibility of bringing them together, finding commonalities, etc.

Here is the summary:
- Lucene - distributed search via ParallelMultiSearcher.  How you split indices/shards is up to you.
- Solr - distributed search via SOLR-303 (see DistributedSearch on its Wiki).  How you split indices/shards is up to you.
- Nutch - see its org.apache.nutch.ipc (I think).  How you split indices/segments is up to you.
- Nutch - see the bottom of http://wiki.apache.org/nutch/Nutch2Architecture

There is also Hadoop:
- Using MapReduce + HDFS to build a single Lucene index in a distributed fashion (see contrib/ in Hadoop)

There is also a GridLucene project somewhere on the web...

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


RE: Does lucene support distributed indexing?

Fang_Li
In reply to this post by Samuel Guo-2
Solr does not do distributed indexing, only index replication. All copies are identical.
Lucene has some built-in support for distributed search; please take a look at RemoteSearchable. For indexing, a naive approach is to put a load balancer in front.

Regards,


RE: Does lucene support distributed indexing?

Stu Hood
In reply to this post by Samuel Guo-2
Solr does not do distributed indexing, but the development version _does_ do distributed search, in addition to replication. Currently, you can manually shard your data across a set of Solr instances, and then query them by adding a 'shards=localhost:8080/solr_1,localhost:8080/solr_2' parameter.

See https://issues.apache.org/jira/browse/SOLR-303
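
As a rough sketch of what such a request could look like, the snippet below builds the query URL in plain Java. It assumes the parameter is spelled `shards` (as in released Solr versions) and that the instance answers on the usual `/select` handler with a `q` parameter. The class name `ShardedQueryUrl` and the choice of which instance receives the request are made up for illustration.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.List;

class ShardedQueryUrl {

    // Builds a URL that asks one Solr instance (the "coordinator", an
    // assumed name) to send the query to every listed shard and merge
    // the results.
    static String build(String coordinator, String query, List<String> shards) {
        String q = URLEncoder.encode(query, StandardCharsets.UTF_8);
        String shardList = URLEncoder.encode(String.join(",", shards), StandardCharsets.UTF_8);
        return "http://" + coordinator + "/select?q=" + q + "&shards=" + shardList;
    }

    public static void main(String[] args) {
        List<String> shards = List.of("localhost:8080/solr_1", "localhost:8080/solr_2");
        System.out.println(build("localhost:8080/solr_1", "title:lucene", shards));
    }
}
```

Which documents live in `solr_1` and which in `solr_2` is still decided by whatever code did the indexing; the parameter only affects the query side.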

Thanks,
Stu



Re: Does lucene support distributed indexing?

hossman
In reply to this post by Otis Gospodnetic-2

: There are actually several distributed indexing or searching projects in
: Lucene (the top-level ASF Lucene project, not Lucene Java), and it's
: time to start thinking about the possibility of bringing them together,
: finding commonalities, etc.

I would actually argue that almost all of the examples you listed describe
"distributed searching" to query multiple shards.

As far as I know, none of them address the "distributed indexing" aspect:
throw some raw data at the system and trust that it will be indexed by
one (or more) shard(s) in a way that "evenly" distributes the indexing
"load".

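One common way to get that kind of even spread is to hash each document id and take the result modulo the number of shards. The sketch below is only an illustration of that idea, not something any of the projects above is known to ship; `HashRouter`, `shardFor` and `distribution` are hypothetical names, and CRC32 is an arbitrary choice of hash.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.CRC32;

class HashRouter {

    // Maps a document id to a shard number in [0, numShards).
    // The same id always lands on the same shard.
    static int shardFor(String docId, int numShards) {
        CRC32 crc = new CRC32();
        crc.update(docId.getBytes(StandardCharsets.UTF_8));
        // getValue() is a non-negative long, so the remainder is too.
        return (int) (crc.getValue() % numShards);
    }

    // Routes the ids "doc-0" .. "doc-(n-1)" and counts documents per shard.
    static int[] distribution(int n, int numShards) {
        int[] counts = new int[numShards];
        for (int i = 0; i < n; i++) {
            counts[shardFor("doc-" + i, numShards)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(distribution(10_000, 4)));
    }
}
```

The drawback of plain modulo routing is that changing the number of shards moves most documents to a different shard, so it suits a fixed shard count better than a growing cluster.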


-Hoss



Re: Does lucene support distributed indexing?

Otis Gospodnetic-2
In reply to this post by Samuel Guo-2
That's right - most of them are about distributed searching (hence my notes about sharding being up to the app).  Hadoop's contrib/index is about distributed indexing:

"This contrib package provides a utility to build or update an index
using Map/Reduce.

A distributed "index" is partitioned into "shards". Each shard corresponds
to a Lucene instance."

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


Re: Does lucene support distributed indexing?

Vaijanath N. Rao-2
Hi all,

How about adding Hadoop support for distributed indexing? If required, I
can start working on this, if Hadoop is a feasible option.

Also, what other techniques can one think of for doing distributed indexing?
Currently I am planning on extending SolrJ to keep a map of where
each document has gone, to try to get distributed indexing.
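
A minimal sketch of such a document-to-shard map, independent of SolrJ, might look like the following. `DocumentLocator`, `assign` and `lookup` are hypothetical names, round-robin placement is just one assumed policy, and the map is neither thread-safe nor persistent.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

class DocumentLocator {
    private final Map<String, Integer> location = new HashMap<>();
    private final int numShards;
    private int next = 0;

    DocumentLocator(int numShards) {
        if (numShards <= 0) {
            throw new IllegalArgumentException("numShards must be positive");
        }
        this.numShards = numShards;
    }

    // Returns the shard an add or update should be sent to: the recorded
    // shard for a known id, otherwise the next shard in round-robin order.
    int assign(String docId) {
        Integer known = location.get(docId);
        if (known != null) {
            return known;
        }
        int shard = next;
        next = (next + 1) % numShards;
        location.put(docId, shard);
        return shard;
    }

    // Returns where a document was sent, e.g. to route a delete.
    Optional<Integer> lookup(String docId) {
        return Optional.ofNullable(location.get(docId));
    }

    public static void main(String[] args) {
        DocumentLocator locator = new DocumentLocator(3);
        for (String id : new String[] {"a", "b", "c", "d", "a"}) {
            System.out.println(id + " -> shard " + locator.assign(id));
        }
    }
}
```

The map grows with the number of documents and has to survive restarts, which is the main cost of this approach compared with computing the shard from the id.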

--Thanks and Regards
Vaijanath



Re: Does lucene support distributed indexing?

Andrzej Białecki-2
Vaijanath N. Rao wrote:
> Hi all,
>
> How about adding Hadoop support for distributed indexing? If required, I
> can start working on this, if Hadoop is a feasible option.
>
> Also, what other techniques can one think of for doing distributed indexing?
> Currently I am planning on extending SolrJ to keep a map of where
> each document has gone, to try to get distributed indexing.

DistributedFileSystem performance for random seeks is several times
worse than that of LocalFileSystem. This directly impacts Lucene
response time.

One solution would be to implement the searching as an application that
executes in a distributed fashion (not sure if map-reduce is the best
model here), but first copies the indexes to LocalFileSystem.
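
The "copy first, then search locally" step could be sketched as below. This is only a stand-in: it copies between ordinary directories with `java.nio.file`, whereas a real HDFS source would need Hadoop's own file system API. `LocalIndexCopy` and `copyToLocal` are hypothetical names, and the sketch assumes the index directory is flat (files only, no subdirectories).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.stream.Stream;

class LocalIndexCopy {

    // Copies every regular file of a flat index directory into a fresh
    // local temp directory and returns that directory, so the searcher
    // can be opened on fast local storage.
    static Path copyToLocal(Path remoteIndexDir) throws IOException {
        Path local = Files.createTempDirectory("index-copy-");
        try (Stream<Path> files = Files.list(remoteIndexDir)) {
            for (Path src : (Iterable<Path>) files::iterator) {
                if (Files.isRegularFile(src)) {
                    Path dst = local.resolve(src.getFileName().toString());
                    Files.copy(src, dst, StandardCopyOption.REPLACE_EXISTING);
                }
            }
        }
        return local;
    }

    public static void main(String[] args) throws IOException {
        Path source = Files.createTempDirectory("remote-index-");
        Files.write(source.resolve("segments_1"), new byte[] {1, 2, 3});
        System.out.println("copied to " + copyToLocal(source));
    }
}
```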

--
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Does lucene support distributed indexing?

Adrian Tarau
In reply to this post by Otis Gospodnetic-2
A year ago I started a different implementation of ParallelMultiSearcher using a ThreadPoolExecutor, where everything is parallelized.
Unfortunately, I had to interrupt this and work on something else, but this month I'll start working on it again. Right now there are some dependencies (like discovering new nodes and notifications between nodes), so it cannot be used outside my infrastructure, but I'm thinking of extracting it as a separate project (maybe later) so it can be used as a Lucene extension.

I will post some code as soon as I have something to show :)
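
In the meantime, the general shape of a thread-pool based fan-out search can be sketched with only the JDK. This is not the implementation described above and it does not use Lucene's classes: `FanOutSearch`, `Hit` and the `Function`-based shard type are hypothetical stand-ins, and there is no node discovery or failure handling.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

class FanOutSearch {

    // A scored hit; a hypothetical stand-in for a real hit type.
    static final class Hit {
        final String docId;
        final double score;

        Hit(String docId, double score) {
            this.docId = docId;
            this.score = score;
        }
    }

    // Runs the query on every shard in parallel, then keeps the k
    // highest-scoring hits across all shards.
    static List<Hit> search(List<Function<String, List<Hit>>> shards, String query, int k)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, shards.size()));
        try {
            List<Callable<List<Hit>>> tasks = new ArrayList<>();
            for (Function<String, List<Hit>> shard : shards) {
                tasks.add(() -> shard.apply(query));
            }
            List<Hit> merged = new ArrayList<>();
            // invokeAll blocks until every shard has answered or failed.
            for (Future<List<Hit>> result : pool.invokeAll(tasks)) {
                merged.addAll(result.get());
            }
            merged.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
            return new ArrayList<>(merged.subList(0, Math.min(k, merged.size())));
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<Function<String, List<Hit>>> shards = List.of(
                q -> List.of(new Hit("a", 0.9), new Hit("b", 0.2)),
                q -> List.of(new Hit("c", 0.5)));
        for (Hit hit : search(shards, "lucene", 2)) {
            System.out.println(hit.docId + " " + hit.score);
        }
    }
}
```

Merging by raw score only makes sense if the shards score documents comparably, which is one of the harder parts of a real distributed searcher.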

Thanks.




