Distributed Lucene..


Distributed Lucene..

Prasenjit Mukherjee-3
I already have an implementation of a distributed crawler farm, where
crawler instances are running on different boxes. I want to come up with
a distributed indexing scheme using Lucene that takes advantage of the
distributed nature of my crawlers. Here is what I am thinking.

The crawlers will analyze and tokenize the content of every URL (aka
document) and create the following data for each URL document:
<url-id, <field-1, <term-f1-t1, term-f1-t2, term-f1-t3, ...>>, <field-2,
<term-f2-t1, term-f2-t2, term-f2-t3, ...>>, ...>

Based on some partitioning function, the crawlers can then send a
subset of the tokens (aka terms) to each indexing server. The partitioning
function can be as simple as one based on the starting character of the
term. Let's say we have 5 indexers; we would distribute the indexing
data in the following manner (a rough sketch of such a partitioning
function follows the list):

Indexer1 - a-e
Indexer2 - f-j
Indexer3 - k-o
Indexer4 - p-t
Indexer5 - u-z
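
For illustration, such a partitioner could look roughly like this (just a
sketch of the idea; the 5-indexer split is the one above, the class and
method names are made up):

public class TermPartitioner {

    private static final int NUM_INDEXERS = 5;

    // Maps a term to the indexer that owns it:
    // a-e -> 0, f-j -> 1, k-o -> 2, p-t -> 3, u-z (and anything else) -> 4.
    public static int indexerFor(String term) {
        char c = Character.toLowerCase(term.charAt(0));
        if (c < 'a' || c > 'z') {
            return NUM_INDEXERS - 1;   // digits, punctuation etc. go to the last indexer
        }
        int bucket = (c - 'a') / 5;    // 26 letters over 5 buckets; 'z' lands in bucket 5
        return Math.min(bucket, NUM_INDEXERS - 1);
    }

    public static void main(String[] args) {
        System.out.println(indexerFor("apache"));  // 0
        System.out.println(indexerFor("lucene"));  // 2
        System.out.println(indexerFor("zebra"));   // 4
    }
}

Every crawler would run the same function, so all occurrences of a given
term end up on the same indexer.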

Does this make any sense? I would also like to know if there are other
ways to distribute Lucene's indexing/searching.

thanks,
prasen



Re: Distributed Lucene..

Samuru Jackson
> Does this make any sense? I would also like to know if there are other
> ways to distribute Lucene's indexing/searching.

I'm interested in such a distributed architecture too.

What I have in mind is some kind of Lucene index cluster where several
machines each hold a sub-index in memory. A search query should then
perform fast on those machines, because the index is in memory and no
hard disk access is needed.

Is there anything like this available?
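
For a single machine, I imagine something along these lines (just a
sketch; the path and field names are made up, and I am assuming the
sub-index fits comfortably into RAM):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.RAMDirectory;

public class InMemorySubIndex {
    public static void main(String[] args) throws Exception {
        // Copy the on-disk sub-index into RAM so queries avoid disk seeks.
        RAMDirectory ramDir =
            new RAMDirectory(FSDirectory.getDirectory("/data/subindex1", false));
        IndexSearcher searcher = new IndexSearcher(ramDir);

        // Each machine answers queries against its own in-memory sub-index.
        Hits hits = searcher.search(new TermQuery(new Term("contents", "lucene")));
        for (int i = 0; i < hits.length(); i++) {
            System.out.println(hits.doc(i).get("url") + " score=" + hits.score(i));
        }
        searcher.close();
    }
}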



RE: Distributed Lucene..

Andrew Schetinin
In reply to this post by Prasenjit Mukherjee-3
Hello,

We are implementing a distributed searcher and indexer based on Lucene.
I cannot share its code, but I can provide hints based on our
experience.

What we did, basically, is have several machines indexing documents and
creating small Lucene indexes.
We hacked :-) Lucene's IndexWriter so that all segment names start with a
prefix unique to each small index part.
Then, when adding a part to the actual index, we simply copy the new
segments into the folder with the other segments, and add them in such a
way that the optimize() function is not called.
This way, adding a new segment is very unintrusive for the searcher.
Optimization is scheduled to happen at night.
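
For reference, the stock merge we started from is roughly the following
(an illustration with made-up paths, not our actual code); the point of
our modification is that the merge no longer triggers optimize():

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class MergeSmallIndexes {
    public static void main(String[] args) throws Exception {
        // The big index that the searcher uses.
        Directory mainDir = FSDirectory.getDirectory("/data/main-index", false);
        // A small index produced by one of the indexing machines.
        Directory partDir = FSDirectory.getDirectory("/data/part-0001", false);

        IndexWriter writer = new IndexWriter(mainDir, new StandardAnalyzer(), false);
        // Stock addIndexes() merges the part in, but it also optimizes,
        // which is the expensive step we defer to a nightly job.
        writer.addIndexes(new Directory[] { partDir });
        writer.close();
    }
}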

The index is divided into parts located on different physical machines,
and here is where the complexity begins.
We do not try to normalize term weights across the machines, assuming
that with large data quantities they will be more or less evenly
distributed.
But the problem exists, and we will probably think about how to handle
it in the future.
The documents are distributed across the machines randomly, and
merging the results becomes a bit of a headache :-)

Best Regards,

Andrew Schetinin




Re: Distributed Lucene..

Samuru Jackson
Do you plan to release some kind of commercial product, including an API?

I ask because I'm evaluating different technologies for a prototype
which is part of my diploma thesis.

The problem is that I have to deal with really huge amounts of data, and
one machine is simply not enough to handle them.

Lucene seems to be a good choice, but a single machine won't scale up to
really big amounts of data. So I thought about splitting the indexes over
several machines in chunks, so that each chunk fits into the memory of
its machine.

One machine should collect the results and calculate some kind of
combined score from the hits delivered by the other machines (roughly
like the sketch below).

As I'm not familiar with the concrete mechanisms of Lucene this is
just a naive thought, but I think that such a clustering mechanism
could become a killer app.
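
In Lucene terms, I picture the collecting machine doing something like
this (purely a sketch; I am pretending the sub-indexes are reachable as
local directories, which of course glosses over the actual remoting):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MultiSearcher;
import org.apache.lucene.search.Searchable;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.FSDirectory;

public class ScatterGatherSearch {
    public static void main(String[] args) throws Exception {
        // One searcher per sub-index; in a real cluster these would live
        // on different machines.
        Searchable[] shards = new Searchable[] {
            new IndexSearcher(FSDirectory.getDirectory("/data/shard1", false)),
            new IndexSearcher(FSDirectory.getDirectory("/data/shard2", false)),
        };

        // MultiSearcher merges the hits from all shards into one ranked list.
        MultiSearcher searcher = new MultiSearcher(shards);
        Hits hits = searcher.search(new TermQuery(new Term("contents", "lucene")));
        for (int i = 0; i < hits.length(); i++) {
            System.out.println(hits.doc(i).get("url") + " score=" + hits.score(i));
        }
        searcher.close();
    }
}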



RE: Distributed Lucene..

Andrew Schetinin
In reply to this post by Prasenjit Mukherjee-3
Hi Samuru,

No, it is part of a bigger project (quite a small part), and nobody is
going to sell parts of it, at least not for less than $X00,000 :-)

Best Regards,

Andrew Schetinin




Re: Distributed Lucene..

Prasenjit Mukherjee-3
In reply to this post by Samuru Jackson
I think Nutch has a distributed Lucene implementation. I could have used
Nutch straight away, but I have a different crawler, and I also don't want
to use NDFS (which is used by Nutch). What I proposed earlier is
basically based on the MapReduce paradigm, which Nutch uses as well; the
toy sketch below shows roughly what I mean.

It would be nice to see some articles specifically detailing the
distributed architecture used in Nutch.
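
To make the MapReduce analogy concrete, here is a toy, single-process
sketch of the idea (no Nutch/Hadoop code; the class and method names are
mine): map emits (term, url) pairs from each crawled document, and reduce
groups each term's pairs into a posting list that one of the indexers
would own.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

public class ToyMapReduceIndexing {

    // "map" phase: a crawler tokenizes one document and emits (term, url) pairs.
    static void map(String url, String content, List emitted) {
        String[] terms = content.toLowerCase().split("\\W+");
        for (int i = 0; i < terms.length; i++) {
            if (terms[i].length() > 0) {
                emitted.add(new String[] { terms[i], url });
            }
        }
    }

    // "reduce" phase: group the emitted pairs by term into posting lists.
    // A partitioning function on the term would decide which indexer owns it.
    static Map reduce(List emitted) {
        Map postings = new HashMap();  // term -> List of urls
        for (Iterator it = emitted.iterator(); it.hasNext();) {
            String[] pair = (String[]) it.next();
            List urls = (List) postings.get(pair[0]);
            if (urls == null) {
                urls = new ArrayList();
                postings.put(pair[0], urls);
            }
            urls.add(pair[1]);
        }
        return postings;
    }

    public static void main(String[] args) {
        List emitted = new ArrayList();
        map("http://example.com/a", "distributed lucene indexing", emitted);
        map("http://example.com/b", "distributed crawling", emitted);
        System.out.println(reduce(emitted));
    }
}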

prasen





Re: Distributed Lucene..

Andrzej Białecki-2
In reply to this post by Prasenjit Mukherjee-3

A few comments:

* You can use your own crawler, and then only write some glue code to
convert the output of that crawler to the format that Nutch uses.

* Nutch can be run in a so-called "local" mode, without using NDFS.

* The core map-reduce and I/O functionality has been split out into its own
project, Hadoop, where development is taking place at a furious rate
;-) This code is completely independent of Nutch or Lucene. You can
implement your own data processing using this framework.

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com





Re: Distributed Lucene..

Otis Gospodnetic-2
In reply to this post by Andrew Schetinin
Hi,
Just curious about this:

> We hacked :-) IndexWriter of Lucene to start all segment names with a
> prefix unique for each small index part.
> Then, when adding it to the actual index, we simply copy the new segment
> to the folder with the other segments, and add it in such a way so that
> the optimize() function cannot be called.
> This way adding a new segment is very unintrusive for the searcher.
> Optimization is scheduled to happen at night.


You just copy your uniquely-named segments into the index directory and manually modify the "segments" file to list all the copied segments?

Thanks,
Otis






RE: Distributed Lucene..

Andrew Schetinin
In reply to this post by Prasenjit Mukherjee-3
Hi,

No, we don't. We created another IndexWriter class and modified its
addIndexes function (if I remember the function name correctly) so that it
does not call optimize() at the end - that's all.
Having unique segment names was necessary because the segment name is
used inside the file itself and cannot be changed on the fly.

Best Regards,

Andrew


RE: Distributed Lucene.. - clustering as a requirement

Dmitry Goldenberg
In reply to this post by Samuru Jackson
I firmly believe that clustering support should be a part of Lucene.  We've tried implementing it ourselves and so far have been unsuccessful.  We tried storing Lucene indices in a database that is the back-end repository for our app in a clustered environment and could not overcome the indexing exceptions in our custom Directory implementation.
 
I think it'd be perfect if some of the Lucene gurus were to implement an RDBMS-backed Directory and post it (in addition to the sleepycat.db package that's currently in contrib).  The nitty-gritty of dealing with Lucene indexing structures at the single-byte level is just way too much trouble for an application integrator like myself.
 
- Dmitry


Re: Distributed Lucene.. - clustering as a requirement

Chris Lamprecht
What about using Lucene just for searching (i.e., no stored fields
except maybe one "ID" primary key field), and using an RDBMS for
storing the actual "documents" (roughly like the sketch below)?  This way
you're using Lucene for what Lucene is best at, and the database for what
it's good at.  At least up to a point -- RDBMSs have their limits too.  Or,
if you have a huge dataset, you might want to check out Nutch.
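
To make that concrete, the Lucene side could look roughly like this (a
sketch only; the field names and analyzer are placeholders, and the
Field.Store/Field.Index flags assume a recent Lucene):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.RAMDirectory;

public class SearchOnlyIndex {
    public static void main(String[] args) throws Exception {
        RAMDirectory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);

        // Store only the primary key; the full "document" lives in the RDBMS.
        Document doc = new Document();
        doc.add(new Field("id", "42", Field.Store.YES, Field.Index.UN_TOKENIZED));
        doc.add(new Field("contents", "distributed lucene indexing",
                          Field.Store.NO, Field.Index.TOKENIZED));
        writer.addDocument(doc);
        writer.close();

        IndexSearcher searcher = new IndexSearcher(dir);
        Hits hits = searcher.search(new TermQuery(new Term("contents", "lucene")));
        for (int i = 0; i < hits.length(); i++) {
            // Use the returned id for a "SELECT ... WHERE id = ?" against the RDBMS.
            System.out.println("matched id=" + hits.doc(i).get("id"));
        }
        searcher.close();
    }
}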



RE: Distributed Lucene.. - clustering as a requirement

Dmitry Goldenberg
I think it's a good idea.  For an enterprise-level application, Lucene appears to be too file-system- and byte-sequence-centric a technology.  Just my opinion.  The Directory API is just too low-level.
 
I'd be OK with an RDBMS-based Directory implementation I could take and use.  But generally, I think the Lucene authors might like to take a step back and consider splitting off the repository and making it more extensible and high-level.  Perhaps something like JSR-170 (Java repository API) may be a good route to go....


Re: Distributed Lucene.. - clustering as a requirement

Doug Cutting
Dmitry Goldenberg wrote:
> For an enterprise-level application, Lucene appears too file-system and
> too byte-sequence-centric a technology.  Just my opinion.  The Directory
> API is just too low-level.

There are good reasons why Lucene is not built on top of an RDBMS.  An
inverted index is not efficiently maintained in a B-Tree, and B-Trees
are the foundation of RDBMSes.

http://www.haifa.ibm.com/Workshops/ir2005/papers/DougCutting-Haifa05.pdf

> I'd be OK with an RDBMS-based Directory implementation I could take and use.  But generally, I think the Lucene authors might like to take a step back and consider splitting off the repository and making it more extensible and high-level.  Perhaps something like JSR-170 (Java repository API) may be a good route to go....

If you have concrete ideas for improvements to Lucene's Directory
interface, please propose them on the java-dev mailing list, ideally as
a patch.

Cheers,

Doug



Re: Distributed Lucene.. - clustering as a requirement

Prasenjit Mukherjee-3
Agreed, an inverted index cannot be efficiently maintained in a
B-tree (hence not in an RDBMS).  But I think we can (or should) have the
option of B-tree-based storage for unindexed fields, whereas for indexed
fields we can use Lucene's existing architecture.

prasen





Clusterization of searching

Anton Potekhin
What would be the way to cluster (distribute) searching?






RE: Distributed Lucene.. - clustering as a requirement

Dmitry Goldenberg
In reply to this post by Prasenjit Mukherjee-3
I guess Compass is probably the way to go - http://www.opensymphony.com/compass/
