Using Lucene partly as DB and 'joining' search results.

Using Lucene partly as DB and 'joining' search results.

adb
We're planning to archive email over many years and have been looking at using a
DB to store mail metadata and Lucene for the indexed mail data, or just Lucene
on its own, with email data and structure stored as XML and the raw message
stored in the file system.

For some customers, the volumes are likely to be well over 1 billion mails over
10 years, so some partitioning of data is needed.  At the moment the thinking
is moving away from using a DB + Lucene to just Lucene along with a file system
representation of the complete message.  All searches will be against the index,
and the XML mail metadata is then loaded from the file system.

The archive is read-only apart from bulk deletes, but one of the requirements is
for users to be able to label their own mail.  Given that a Lucene Document
cannot be updated, I have thought about having a separate Lucene index that has
just the 3 terms (or some combination of) userId + mailId + label.

That of course would mean joining searches from the main mail data index and the
label index.

Does anyone have any experience of using Lucene this way, and is it realistic to
avoid the DB entirely?  I'd rather have the headache of scaling just Lucene,
which is a simple beast, than the whole bundle of 'stuff' that comes with a
database as well.

Antony




---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: Using Lucene partly as DB and 'joining' search results.

Mathieu Lecarme
Antony Bowesman wrote:

> We're planning to archive email over many years and have been looking
> at using DB to store mail meta data and Lucene for the indexed mail
> data, or just Lucene on its own with email data and structure stored
> as XML and the raw message stored in the file system.
>
> For some customers, the volumes are likely to be well over 1 billion
> mails over
> 10 years, so some  partitioning of data is needed.  At the moment the
> thoughts
> are moving away from using a DB + Lucene to just Lucene along with a
> file system
> representation of the complete message.  All searches will be against
> the index then the XML mail meta data is loaded from the file system.
>
> The archive is read only apart from bulk deletes, but one of the
> requirements is for users to be able to label their own mail.  Given
> that a Lucene Document cannot be updated, I have thought about having
> a separate Lucene index that has just the 3 terms (or some combination
> of) userId + mailId + label.
>
> That of course would mean joining searches from the main mail data
> index and the label index.
>
> Does anyone have any experience of using Lucene this way and is it a
> realistic option of avoiding the DB at all?  I'd rather the headache
> of scaling just Lucene, which is a simple beast, than the whole bundle
> of 'stuff' that comes with the database as well.
Use a Filter and BitSet.
From the personal data, you build a Filter
(http://lucene.apache.org/java/2_3_1/api/org/apache/lucene/search/Filter.html)
which is then used against the main index.
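A rough sketch of that join, using plain java.util.BitSet as a stand-in for a Lucene Filter over the main index (all names here are invented for illustration; in real Lucene the doc ids would come from the IndexReader, not a Map):

```java
import java.util.BitSet;
import java.util.Map;
import java.util.Set;

// Illustrative simulation of the Filter/BitSet suggestion: the label
// index yields the mail ids a user has labelled; those are translated
// to main-index doc ids and recorded in a BitSet used as a filter.
public class LabelFilterSketch {

    // Build a per-user filter: one bit per doc id in the main mail
    // index, set for every mail the user has labelled.
    static BitSet buildFilter(Set<String> labelledMailIds,
                              Map<String, Integer> mailIdToDocId,
                              int maxDoc) {
        BitSet bits = new BitSet(maxDoc);
        for (String mailId : labelledMailIds) {
            Integer docId = mailIdToDocId.get(mailId);
            if (docId != null) {
                bits.set(docId);   // random access: any order works
            }
        }
        return bits;
    }

    public static void main(String[] args) {
        Map<String, Integer> mailIdToDocId =
                Map.of("m1", 0, "m2", 3, "m3", 7);
        BitSet filter = buildFilter(Set.of("m2", "m3"),
                                    mailIdToDocId, 8);
        System.out.println(filter);   // {3, 7}
    }
}
```

The filter is then passed alongside the main query, so only labelled mails can match.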

M.



Re: Using Lucene partly as DB and 'joining' search results.

Paul Elschot
On Friday 11 April 2008 13:49:59, Mathieu Lecarme wrote:

> Antony Bowesman wrote:
> > We're planning to archive email over many years and have been
> > looking at using DB to store mail meta data and Lucene for the
> > indexed mail data, or just Lucene on its own with email data and
> > structure stored as XML and the raw message stored in the file
> > system.
> >
> > For some customers, the volumes are likely to be well over 1
> > billion mails over
> > 10 years, so some  partitioning of data is needed.  At the moment
> > the thoughts
> > are moving away from using a DB + Lucene to just Lucene along with
> > a file system
> > representation of the complete message.  All searches will be
> > against the index then the XML mail meta data is loaded from the
> > file system.
> >
> > The archive is read only apart from bulk deletes, but one of the
> > requirements is for users to be able to label their own mail.
> > Given that a Lucene Document cannot be updated, I have thought
> > about having a separate Lucene index that has just the 3 terms (or
> > some combination of) userId + mailId + label.
> >
> > That of course would mean joining searches from the main mail data
> > index and the label index.
> >
> > Does anyone have any experience of using Lucene this way and is it
> > a realistic option of avoiding the DB at all?  I'd rather the
> > headache of scaling just Lucene, which is a simple beast, than the
> > whole bundle of 'stuff' that comes with the database as well.
>
> Use Filter and BitSet.
>  From the personnal data, you build a Filter
> (http://lucene.apache.org/java/2_3_1/api/org/apache/lucene/search/Filter.html)
> wich is used in the main index.

With 1 billion mails, and possibly a Filter per user, you may want to
use more compact filters than BitSets, which is currently possible
in the development trunk of Lucene.

Regards,
Paul Elschot



Re: Using Lucene partly as DB and 'joining' search results.

adb
Paul Elschot wrote:
> On Friday 11 April 2008 13:49:59, Mathieu Lecarme wrote:

>> Use Filter and BitSet.
>>  From the personnal data, you build a Filter
>> (http://lucene.apache.org/java/2_3_1/api/org/apache/lucene/search/Filter.html)
>> wich is used in the main index.
>
> With 1 billion mails, and possibly a Filter per user, you may want to
> use more compact filters than BitSets, which is currently possible
> in the development trunk of lucene.

Thanks for the pointers.  I've already used Solr's DocSet interface in my
implementation, which I think is where the ideas for the current Lucene
enhancements came from.  They work well to reduce the filter's footprint.  I'm
also caching filters.

The intention is that there is a user data index and the mail index(es).  The
search against the user data index will return a set of mail Ids, which is the
common key between the two.  Doc Ids are no good across the indexes, so that
means a potentially large boolean OR query to create the filter of labelled
mails in the mail indexes.  I know it's a theoretical question, but will this
perform?

The read-only data and the modifiable user data need to be kept separate,
because the RO data can easily be re-created, which means I can't just create
the filter as part of the base search.

Regards
Antony







Re: Using Lucene partly as DB and 'joining' search results.

Paul Elschot
On Saturday 12 April 2008 00:03:13, Antony Bowesman wrote:

> Paul Elschot wrote:
> > On Friday 11 April 2008 13:49:59, Mathieu Lecarme wrote:
> >> Use Filter and BitSet.
> >>  From the personnal data, you build a Filter
> >> (http://lucene.apache.org/java/2_3_1/api/org/apache/lucene/search/Filter.html)
> >> wich is used in the main index.
> >
> > With 1 billion mails, and possibly a Filter per user, you may want
> > to use more compact filters than BitSets, which is currently
> > possible in the development trunk of lucene.
>
> Thanks for the pointers.  I've already used Solr's DocSet interface
> in my implementation, which I think is where the ideas for the
> current Lucene enhancements came from.

The ideas came from quite a few sources. They can be traced
starting from changes.txt in the sources.

> They work well to reduce the
> filter's footprint.  I'm also caching filters.
>
> The intention is that there is a user data index and the mail
> index(es).  The search against user data index will return a set of
> mail Ids, which is the common key between the two. Doc Ids are no
> good between the indexes, so that means a potentially large boolean
> OR query to create the filter of labelled mails in the mail indexes.
> I know it's a theoretical question, but will this perform?

The normal way to collect doc ids for a filter is into a bitset
iterating over the indexed ids (mail ids in your case). A bitset
has random access, so there is no need to do this in doc id order.
An OR query has to work in doc id order so it can compute a score
per doc id, and the ordering loses some performance.
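The point about random access can be simulated in plain Java: a BitSet filter comes out identical no matter what order the doc ids arrive in, which is why no sort or doc-id-ordered merge is needed (the doc ids below are made up for illustration):

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.Collections;
import java.util.List;

// Sketch of filling a filter by random access: doc ids collected per
// indexed mail id (e.g. from TermDocs) can be set in any order,
// unlike an OR query, which must merge postings in doc-id order so it
// can compute a score per doc.
public class RandomAccessFill {

    static BitSet fill(List<Integer> docIds, int maxDoc) {
        BitSet bits = new BitSet(maxDoc);
        for (int id : docIds) {
            bits.set(id);   // order of arrival is irrelevant
        }
        return bits;
    }

    public static void main(String[] args) {
        List<Integer> ids = new ArrayList<>(List.of(42, 7, 99, 3));
        BitSet inOrder = fill(ids, 128);
        Collections.shuffle(ids);   // simulate unordered term iteration
        BitSet shuffled = fill(ids, 128);
        System.out.println(inOrder.equals(shuffled));   // true
    }
}
```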

Regards,
Paul Elschot



Re: Using Lucene partly as DB and 'joining' search results.

hossman

: The archive is read only apart from bulk deletes, but one of the requirements
: is for users to be able to label their own mail.  Given that a Lucene Document
: cannot be updated, I have thought about having a separate Lucene index that
: has just the 3 terms (or some combination of) userId + mailId + label.
:
: That of course would mean joining searches from the main mail data index and
: the label index.

tangential to the existing follow-ups about ways to use Filters efficiently
to get some of the behavior, take a look at ParallelReader ... your use
case sounds like it might be perfect for it: one really large main dataset
that changes fairly infrequently, and what changes do occur are mainly
about adding new records; plus a small "parallel" set of fields about
each record in the main set which do change fairly frequently.

you build up an index for the main data, and then you periodically build up
a second index with the docs in the exact same order as the main index.

additions to the main index don't need to block on rebuilding the secondary
index.  deletes do (since you need to delete from both indexes in parallel
to keep the ids in sync) ... but that's ok since you said you only need
occasional bulk deletes (you could process them as an initial step of your
recurring rebuild of the smaller index).
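The alignment requirement can be sketched with plain lists standing in for the two indexes (field and variable names here are hypothetical): the small label index is rebuilt with exactly one document per main-index document, in the same order, so doc id N lines up across both readers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sketch of the ParallelReader rebuild: walk the main index in doc-id
// order and emit one parallel doc per main doc, empty when the mail
// has no annotations, so doc ids stay aligned between the two indexes.
public class ParallelRebuildSketch {

    // mainOrder: mail ids in main-index doc-id order.
    // labels: current annotations, keyed by mail id.
    static List<String> rebuildParallel(List<String> mainOrder,
                                        Map<String, String> labels) {
        List<String> parallel = new ArrayList<>(mainOrder.size());
        for (String mailId : mainOrder) {
            // Unannotated mails still get a (mostly empty) doc to
            // keep doc ids aligned; these should index very fast.
            parallel.add(labels.getOrDefault(mailId, ""));
        }
        return parallel;
    }

    public static void main(String[] args) {
        List<String> main = List.of("m1", "m2", "m3");
        List<String> small = rebuildParallel(main, Map.of("m2", "todo"));
        System.out.println(small);   // [, todo, ]
    }
}
```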



-Hoss




Re: Using Lucene partly as DB and 'joining' search results.

adb
Thanks all for the suggestions - there was also another thread "Lucene index on
relational data" which had crossover here.

That's an interesting idea about using ParallelReader for the changeable index.
I had thought to just have a triplet indexed 'owner:mailId:label' in each Doc
and have multiple Documents for the same mailId, e.g. if each recipient adds
labels for the same mail, or if multiple labels are added by one recipient.  I
would then have to make a join using mailId against the core.  However, if I
want to use PR, I could have a single Document with multiple fields and, using
stored fields, 'modify' that Document.  However, what happens to the DocId
when the delete+add occurs, and how do I ensure it stays the same?

I'm on 2.3.1.  I seem to recall a discussion on this in another thread, but
cannot find it.

Antony



Chris Hostetter wrote:

> : The archive is read only apart from bulk deletes, but one of the requirements
> : is for users to be able to label their own mail.  Given that a Lucene Document
> : cannot be updated, I have thought about having a separate Lucene index that
> : has just the 3 terms (or some combination of) userId + mailId + label.
> :
> : That of course would mean joining searches from the main mail data index and
> : the label index.
>
> tangential to the existing follow-ups about ways to use Filters efficiently
> to get some of the behavior, take a look at ParallelReader ... your use
> case sounds like it might be perfect for it: one really large main dataset
> that changes fairly infrequently, and what changes do occur are mainly
> about adding new records; plus a small "parallel" set of fields about
> each record in the main set which do change fairly frequently.
>
> you build up an index for the main data, and then you periodically build up
> a second index with the docs in the exact same order as the main index.
>
> additions to the main index don't need to block on rebuilding the secondary
> index.  deletes do (since you need to delete from both indexes in parallel
> to keep the ids in sync) ... but that's ok since you said you only need
> occasional bulk deletes (you could process them as an initial step of your
> recurring rebuild of the smaller index).
>
>
>
> -Hoss




Re: Using Lucene partly as DB and 'joining' search results.

hossman

: would then have to make a join using mailId against the core.  However, if I
: want to use PR, I could have a single Document with multiple fields and, using
: stored fields, 'modify' that Document.  However, what happens to the DocId
: when the delete+add occurs, and how do I ensure it stays the same?

you can't ... that's why i said you'd need to rebuild the smaller index
completely on a periodic basis (going in the same order as the docs in the
big index) ... it might not be feasible if the rate at which you need to
surface annotations has to be "near instantaneous", but assuming most
emails won't ever get annotations, they'll just be "empty" docs that
should index lightning fast.

i can also imagine a situation where you break both indexes up into lots
of pieces (shards) and use a MultiReader over lots of ParallelReaders ...
that way you have much smaller "small" indexes to rebuild when someone
annotates an email -- and if the shards are organized by date, you're less
likely to ever need to rebuild many of them, since people will tend to
focus on annotating more recent mail, and if queries focus on a specific
date range (which i'm guessing most email searches will) you can use
MultiReaders over a subset of all the ParallelReaders to save time on
scanning through older docs you know won't match.
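The shard-pruning step can be sketched independently of Lucene: shards carry a date range, and a query's range selects only the overlapping ones (shard bounds and names below are invented; in Lucene the selected shards would back a MultiReader over ParallelReaders).

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of date-based shard selection: keep only shards whose year
// range overlaps the query's range, skipping older data entirely.
public class ShardSelectSketch {

    record Shard(String name, int fromYear, int toYear) {}

    static List<Shard> select(List<Shard> shards, int from, int to) {
        List<Shard> hit = new ArrayList<>();
        for (Shard s : shards) {
            // Standard interval-overlap test on the year ranges.
            if (s.fromYear() <= to && s.toYear() >= from) {
                hit.add(s);
            }
        }
        return hit;
    }

    public static void main(String[] args) {
        List<Shard> shards = List.of(
                new Shard("2005", 2005, 2005),
                new Shard("2006", 2006, 2006),
                new Shard("2007", 2007, 2007));
        System.out.println(select(shards, 2006, 2007).size());   // 2
    }
}
```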

Disclaimer: all of this is purely brainstorming, i've never actually tried
anything like this; it may be more trouble than it's worth.



-Hoss




Re: Using Lucene partly as DB and 'joining' search results.

adb
Chris Hostetter wrote:
> you can't ... that's why i said you'd need to rebuild the smaller index
> completely on a periodic basis (going in the same order as the docs in the

Mmm, the annotations would only be stored in the index.  It would be possible
to store them elsewhere, so I can investigate that, in which case the rebuild
would be possible.

> i can also imagine a situation where you break both indexes up into lots
> of pieces (shards) and use a MultiReader over lots of ParallelReaders ...
> that way you have much smaller "small" indexes to rebuild when someone
> annotates an email -- and if the shards are organized by date, you're less
> likely to ever need to rebuild many of them since people will tend to

Data will be 'sharded' anyway, by date at some granularity.  Looking at the
source for MultiReader/MultiSearcher, they are single threaded.  Is there a
performance trade-off between single-thread/many small indexes and
single-thread/some large indexes?  Can a MultiReader work with one..n readers
per thread, something like a thread pool of IndexReaders?  I expect it would
be faster to run the searches in parallel.
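The thread-pool idea can be sketched with an ExecutorService running one task per shard and merging the per-shard hits at the end; the per-shard "search" below is faked with string matching, where in Lucene each task would wrap one IndexSearcher. (Lucene also ships a ParallelMultiSearcher that searches its sub-indexes in separate threads, which may be worth a look.)

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch of searching shards in parallel: one pool task per shard,
// results merged once every task completes.
public class ParallelShardSearch {

    static List<String> searchAll(List<List<String>> shards, String term)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(shards.size());
        try {
            List<Future<List<String>>> futures = new ArrayList<>();
            for (List<String> shard : shards) {
                Callable<List<String>> task = () -> {
                    // Fake per-shard search: substring match stands in
                    // for an IndexSearcher query against one shard.
                    List<String> hits = new ArrayList<>();
                    for (String doc : shard) {
                        if (doc.contains(term)) hits.add(doc);
                    }
                    return hits;
                };
                futures.add(pool.submit(task));
            }
            List<String> merged = new ArrayList<>();
            for (Future<List<String>> f : futures) {
                merged.addAll(f.get());   // blocks until that shard is done
            }
            return merged;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<List<String>> shards = List.of(
                List.of("mail about lucene", "mail about tax"),
                List.of("lucene filters"));
        System.out.println(searchAll(shards, "lucene").size());   // 2
    }
}
```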

> Disclaimer: all of this is purely brainstorming, i've never actually tried
> anything like this, it may be more trouble then it's worth.

:) Thanks for the sounding board - it's always useful to get new ideas!
Antony


