Is Lucene a good choice for PB scale mailbox search?

Is Lucene a good choice for PB scale mailbox search?

fulin tang
We are going to add full-text search to our mailbox service.

The problem is that we have more than 1 PB of mail, and obviously we
don't want to add another PB of storage for the search service, so we hope
the index data can be kept small enough to store while search stays
fast.

Luckily, every user only searches their own mail, so
we can split the data into many small indexes instead of keeping it in
one big one.

So, given these concerns, the question is: is Lucene a good
choice for this? If not, what is the right way to do it? Has anyone
done this before?

All opinions and comments are welcome!

fulin


--
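The start of a dream struggles at the edge of the city
The heart's far horizon persists in the instant of a footstep
My fate has buried an eternity of loneliness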

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Re: Is Lucene a good choice for PB scale mailbox search?

Shashi Kant-2
Hi, I have not worked at petabyte scale (yet!) - mostly on the scale of tens of
terabytes - but I do think Lucene would be very helpful for such a use case. I
would indeed suggest partitioning the index by user (it seems the most
logical, straightforward way, and it also offers the security of insulating one
user's emails from others').
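
As a rough illustration of partitioning by user, here is a minimal Python sketch that maps each mailbox owner to a shard and a per-user index directory. The shard count, the `/indexes` root, and both function names are assumptions made up for this example; nothing here comes from Lucene, Solr, or Compass.

```python
import hashlib

NUM_SHARDS = 1024  # hypothetical shard count, chosen only for illustration


def shard_for_user(user_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a mailbox owner to a shard number in a stable, deterministic way."""
    digest = hashlib.sha1(user_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % num_shards


def index_path_for_user(user_id: str, root: str = "/indexes") -> str:
    """Build a per-user index directory beneath the user's shard directory."""
    return f"{root}/shard-{shard_for_user(user_id):04d}/{user_id}"
```

Because the mapping depends only on the user ID, indexing and searching can both locate a user's index without a lookup table. Changing the shard count would move users between shards, so a real deployment would need a migration plan or a consistent-hashing scheme.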

Take a look at Compass and Solr (both based on Lucene); they might be better
suited to your needs.

HTH,
Shashi


On Mon, Nov 23, 2009 at 9:35 PM, fulin tang <[hidden email]> wrote:

> We are going to add full-text search for our mailbox service .

Re: Is Lucene a good choice for PB scale mailbox search?

Jason Rutherglen
In reply to this post by fulin tang
A sharded architecture (i.e. many smaller indexes), as used by Google, for
example, and implemented in open source by the Katta project, may be
best for scaling to this size. Katta is also useful for
redundancy and fault tolerance.

Re: Is Lucene a good choice for PB scale mailbox search?

Kay Kay-2-3
In reply to this post by fulin tang
fulin tang wrote:

> Luckily, every user only searches their own mail, so
> we can split the data into many small indexes instead of keeping it in
> one big one.

If the index is sharded by the 'To' or 'Cc' list, then
the mail data will potentially be duplicated in proportion to the
number of people in an email thread. Selecting some other dimension, such as
time, for sharding might be a useful starting point.
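
To make the duplication concern concrete, here is a small back-of-the-envelope sketch in Python. The recipient counts are invented sample data, not measurements from any real mailbox service, and the function names are only illustrative.

```python
def indexed_copies_per_recipient(recipient_counts):
    """Total copies indexed when every recipient's index holds its own copy."""
    return sum(recipient_counts)


def indexed_copies_shared(recipient_counts):
    """Total copies when each message is indexed once (e.g. in a time-based shard)."""
    return len(recipient_counts)


# Hypothetical sample: number of local recipients for each of five messages.
sample = [1, 1, 3, 10, 5]
per_recipient = indexed_copies_per_recipient(sample)
shared = indexed_copies_shared(sample)
duplication_factor = per_recipient / shared
```

The trade-off is that an index shared across users has to restrict every query to the mails the searching user is allowed to see, whereas per-user indexes get that isolation for free.
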

> So, given these concerns, the question is: is Lucene a good
> choice for this? If not, what is the right way to do it? Has anyone
> done this before?

With PBs of storage, check out Solr sharding and Katta for prior work in
this area.

Re: Is Lucene a good choice for PB scale mailbox search?

Otis Gospodnetic-2
In reply to this post by fulin tang
For what it's worth, AOL uses a Solr cluster to handle searches for @aol users.  Each user has his own index.

Otis

Re: Is Lucene a good choice for PB scale mailbox search?

fulin tang
In reply to this post by Kay Kay-2-3
Thanks, all, for the good suggestions!

But any ideas about storage? How can we make the indexes as small as possible?

We know compression is the only way, but when and where is it best to
compress for search?

Thanks again, all!


2009/11/24 Kay Kay <[hidden email]>:

> With PB of storage - check out solr sharding / katta for prior work in this
> arena.

Re: Is Lucene a good choice for PB scale mailbox search?

Ian Lea
If you are planning on using lucene only for searching then you don't
need to store much data at all - just the message id or whatever you
use to identify messages.  And there won't be much point in
compressing that.
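
The search-only idea can be sketched with a toy inverted index in Python. This is not Lucene and does not use its API; the class and method names are invented for illustration. The point is only that the index keeps terms and message IDs, while the message text itself stays in the mail store.

```python
import re
from collections import defaultdict


class SearchOnlyIndex:
    """Toy inverted index: terms map to message IDs; message text is not kept."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, message_id, text):
        """Index the words of a message under its ID, then discard the text."""
        for term in re.findall(r"\w+", text.lower()):
            self._postings[term].add(message_id)

    def search(self, query):
        """Return the IDs of messages that contain every term in the query."""
        terms = re.findall(r"\w+", query.lower())
        if not terms:
            return set()
        result = set(self._postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self._postings.get(term, set())
        return result
```

A hit is just an ID, so displaying a result means fetching the message from the existing mail store by that ID.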

If on the other hand you plan on storing data in lucene, perhaps for
displaying hits on a web page, you might want to compress it.  That
will save some space but at the cost of some performance at indexing
and retrieval time.  If you are storing, say, From:, To: and Subject:
for display in search results and message body only displayed when
they want to view the message, you could leave the first three
uncompressed and compress the message body.
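
Here is a minimal Python sketch of that selective approach, using the standard `zlib` module as a stand-in for whatever compression the index actually provides. The field names and both helper functions are assumptions for illustration, not Lucene's stored-field API.

```python
import zlib


def make_stored_fields(sender, recipients, subject, body):
    """Keep short display fields as plain text; compress only the large body."""
    return {
        "from": sender,
        "to": recipients,
        "subject": subject,
        "body_z": zlib.compress(body.encode("utf-8")),
    }


def load_body(stored):
    """Decompress the body only when the user opens the message."""
    return zlib.decompress(stored["body_z"]).decode("utf-8")
```

Rendering a result list touches only the uncompressed fields, so the decompression cost is paid per opened message rather than per hit.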

Personally, I only use compression in indexes that store large fields but
have a low search/retrieval rate.  But my indexes are only a few GB in
size.

Lucene's handling of compressed fields is changing in 3.0 - see the
release notes or the 2.9 javadocs for Field.Store.COMPRESS.


--
Ian.

On Thu, Nov 26, 2009 at 1:34 AM, fulin tang <[hidden email]> wrote:

> Thanks all for the good suggestions !
>
> But any idea of the storage? How can we make the indexes as small as possible?
>
> We know compressing is the only way, but when and where to compress is
> best for search?