GData Server - Lucene storage


GData Server - Lucene storage

Simon Willnauer
Hello folks,
As I'm the only developer on the project (due to the Summer of Code
program), it is quite a tough task to discuss all of the architecture
with you on the mailing list. For this reason I decided to create UML
diagrams to discuss the main components. I will not attach the UML to
the mails but rather upload it to a server so you can download and
study it.
Well, the next thing I have to implement is a storage to store the
entries in. I will provide two kinds of storage (Lucene and Berkeley
DB based). The first will be a Lucene index that stores the entries,
identified by the entry ID and feed ID stored in the index as Keyword
fields (what used to be Field.Keyword). The underlying Lucene storage
will only be used to store the entries in compressed form. Which feed
entries to retrieve from the Lucene storage will be determined by the
results of the indexing/search component, as every client request to a
GData server is a query to the index. So the results of a search are
entry IDs and a corresponding feed. These entries will be retrieved
from the storage and sent back to the client. The storage component
also provides delete, update, and insert functionality (it wouldn't be
a storage without these).
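
To make this concrete, here is a rough sketch of how such a storage
document could look (Lucene 1.9-era API; the field names "entryId",
"feedId" and "content" are just illustrative):

    import java.io.IOException;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    class EntryStore {
        // One stored entry per Lucene Document: the IDs are keyword
        // fields (indexed as-is, not tokenized), the entry body is
        // stored compressed and never indexed.
        static void storeEntry(IndexWriter writer, String feedId,
                               String entryId, String entryXml)
                throws IOException {
            Document doc = new Document();
            doc.add(new Field("entryId", entryId,
                    Field.Store.YES, Field.Index.UN_TOKENIZED));
            doc.add(new Field("feedId", feedId,
                    Field.Store.YES, Field.Index.UN_TOKENIZED));
            doc.add(new Field("content", entryXml,
                    Field.Store.COMPRESS, Field.Index.NO));
            writer.addDocument(doc);
        }
    }
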
The biggest problem with the Lucene storage is achieving a
transactional state. Imagine the following scenario:
An update request comes in, and the entry to update is handed to the
Lucene writer, which writes the update. But another delete request
has locked the index, and an IOException is thrown. So the update
request queues the entry and retries to obtain the lock. No problem
so far. But if the index writer cannot open the index due to some
other error (e.g. the index could not be found), the exception will
also be an IOException. Is there any way to figure out whether the
IOException was caused by a lock (which would be all right) or by
something more serious?
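
To illustrate what I am after (just a sketch; I am not sure whether
IndexReader.isLocked is the right probe for this):

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.store.Directory;

    class LockProbe {
        // After opening an IndexWriter fails with an IOException: if a
        // write lock is currently held, queueing the entry and retrying
        // is reasonable; anything else would be treated as fatal.
        static boolean shouldRetry(Directory dir) throws IOException {
            return IndexReader.isLocked(dir);
        }
    }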

I added some comments to the UML to describe the architecture in more
detail. So please download the file and have a look at it.

http://www.javawithchopsticks.de/webaccess/lucenestorage.pdf

I would appreciate all your comments!

regards Simon


Re: GData Server - Lucene storage

Ian Holsman (Lists)

On 02/06/2006, at 9:37 AM, Simon Willnauer wrote:

> The biggest problem with the Lucene storage is achieving a
> transactional state. Imagine the following scenario:
> An update request comes in, and the entry to update is handed to the
> Lucene writer, which writes the update. But another delete request
> has locked the index, and an IOException is thrown. So the update
> request queues the entry and retries to obtain the lock. No problem
> so far. But if the index writer cannot open the index due to some
> other error (e.g. the index could not be found), the exception will
> also be an IOException. Is there any way to figure out whether the
> IOException was caused by a lock (which would be all right) or by
> something more serious?

Hi Simon.
Here are my 2c. I am in no way, shape, or form a Lucene expert, but I
have seen a server/service design once or twice.


Am I reading this a bit incorrectly?

Are you saying you will have a set of threads that handle the
interaction with the client, and that these will then queue each
request up to another set of threads that actually write to the
Lucene backend?

I'm not sure that this is a good way to go; in most designs I've
seen, such a queue is the cause of a lot of design/operational
issues. But I'll leave it to the Lucene experts to comment on
this... Personally, I would think just having the client thread do
the write to Lucene is easier (and if you need to queue, do it
outside of the app via JMS or something).


I also think you're focusing on something here which is too low-level
at this stage.
Right now I suggest you log an error and return an error back to the
client (and make it their problem), as long as you can guarantee that
you will either:
* write the whole thing properly on success, or
* fail and leave the server in the same state as it was before the
update (i.e. leave the request in the queue so it will be retried
later, or, if you choose a simpler route, just return it straight to
the client).

You can worry about queuing and retrying later on if you like.
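
Something like this, schematically (all names invented, and a real
version would need some care around partially flushed documents):

    import java.io.IOException;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;

    class DirectStore {
        private final Directory dir;
        private final Analyzer analyzer;

        DirectStore(Directory dir, Analyzer analyzer) {
            this.dir = dir;
            this.analyzer = analyzer;
        }

        // The client thread writes directly; nothing is retried here,
        // the IOException propagates so the caller can answer with
        // e.g. an HTTP 500 and the failure stays the client's problem.
        synchronized void store(Document doc) throws IOException {
            IndexWriter writer = new IndexWriter(dir, analyzer, false);
            try {
                writer.addDocument(doc);
            } finally {
                writer.close();
            }
        }
    }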


regards
Ian.


Re: GData Server - Lucene storage

Simon Willnauer
On 6/2/06, Ian Holsman <[hidden email]> wrote:

>
> On 02/06/2006, at 9:37 AM, Simon Willnauer wrote:
> > ...
>
> Hi Simon.
> Here are my 2c. I am in no way, shape, or form a Lucene expert, but
> I have seen a server/service design once or twice.
>
> Am I reading this a bit incorrectly?

You did.

> Are you saying you will have a set of threads that handle the
> interaction with the client, and that these will then queue each
> request up to another set of threads that actually write to the
> Lucene backend?
>
> I'm not sure that this is a good way to go; in most designs I've
> seen, such a queue is the cause of a lot of design/operational
> issues. But I'll leave it to the Lucene experts to comment on
> this... Personally, I would think just having the client thread do
> the write to Lucene is easier (and if you need to queue, do it
> outside of the app via JMS or something).


> I also think you're focusing on something here which is too
> low-level at this stage.
I guess you are right with your assumption; I was already two steps
ahead. I should first go for the simplest way to use Lucene as a
storage.
Using the client thread as the indexing thread might just cause some
performance drawback, but that's acceptable at this stage of
development. I will provide a second implementation anyway, and a
public interface for customizing the storage.

> Right now I suggest you log an error and return an error back to the
> client (and make it their problem), as long as you can guarantee
> that you will either:
> * write the whole thing properly on success, or
> * fail and leave the server in the same state as it was before the
> update (i.e. leave the request in the queue so it will be retried
> later, or, if you choose a simpler route, just return it straight to
> the client).
>
> You can worry about queuing and retrying later on if you like.
>
Using the client threads gives me the possibility to send an HTTP 500
back to the client if something happens during the store process. But
the index has to be synchronized to prevent concurrent requests from
altering the index at the same time. That way, no IOException can be
caused by unreleased locks.
If I went for the structure I showed in the UML, it would be quite
tough to achieve a kind of transactional state.

I mean, a performance drawback would be all right for this storage;
users could still switch to the Berkeley DB implementation or
customize the storage component.

I will keep the interface and an abstract implementation of the
storage component protected for a while to let the interface mature.


Simon


Re: GData Server - Lucene storage

Yonik Seeley
In reply to this post by Simon Willnauer
On 6/1/06, Simon Willnauer <[hidden email]> wrote:
> So the results of a search are entry IDs and a corresponding feed.
> These entries will be retrieved from the storage and sent back to
> the client.

In the simplest case of using a Lucene stored field to store the
original entry, it's a single operation, right?  You do a search, get
back Lucene Documents, and all the info is right there.
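
For example (a sketch against the Hits API, with the field names
invented):

    import java.io.IOException;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;

    class FeedReader {
        // The hit already carries the stored entry, so no second
        // lookup against a separate store is needed.
        static void renderFeed(IndexSearcher searcher, String feedId)
                throws IOException {
            Hits hits = searcher.search(
                    new TermQuery(new Term("feedId", feedId)));
            for (int i = 0; i < hits.length(); i++) {
                Document doc = hits.doc(i);
                String entryXml = doc.get("content"); // stored field
                // ... write the entry into the feed response
            }
        }
    }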

> An update request comes in, and the entry to update is handed to the
> Lucene writer, which writes the update. But another delete request
> has locked the index, and an IOException is thrown.

Normally for Lucene, some batching of updates and deletes is needed
for decent performance.

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server


Re: GData Server - Lucene storage

Simon Willnauer
On 6/2/06, Yonik Seeley <[hidden email]> wrote:
>
> On 6/1/06, Simon Willnauer <[hidden email]> wrote:
> > So the results of a search are entry IDs and a corresponding feed.
> > These entries will be retrieved from the storage and sent back to
> > the client.
>
> In the simplest case of using a Lucene stored field to store the
> original entry, it's a single operation, right?  You do a search,
> get back Lucene Documents, and all the info is right there.


It is a single operation, that's right.


> > An update request comes in, and the entry to update is handed to
> > the Lucene writer, which writes the update. But another delete
> > request has locked the index, and an IOException is thrown.
>
> Normally for Lucene, some batching of updates and deletes is needed
> for decent performance.


This is also true. The problem is still the server response: if I
queue some updates/inserts or index them into a RAMDirectory, I still
have the problem of concurrent indexing. The client should wait for
the writing process to finish correctly; otherwise the response should
be some Error 500. If the client does not wait (is not held), there is
a risk of a lost update.
The same problem appears in indexing entries into the search index.
There won't be a lot of concurrent inserts and updates, so I can't
wait for other inserts to do batch indexing. I could index them into
RAMDirectories and search multiple indexes, but what happens if the
server crashes with a certain number of entries indexed into a
RAMDirectory?

Any solutions for that in the Solr project?

Another approach would be storing entries in a per-feed-instance
index. It has the same batching/performance problem, but that is
better than making the client wait for entries of other feeds to be
indexed (stored).

simon


Re: GData Server - Lucene storage

Yonik Seeley
On 6/2/06, Simon Willnauer <[hidden email]> wrote:

> This is also true. The problem is still the server response: if I
> queue some updates/inserts or index them into a RAMDirectory, I
> still have the problem of concurrent indexing. The client should
> wait for the writing process to finish correctly; otherwise the
> response should be some Error 500. If the client does not wait (is
> not held), there is a risk of a lost update.
> The same problem appears in indexing entries into the search index.
> There won't be a lot of concurrent inserts and updates, so I can't
> wait for other inserts to do batch indexing. I could index them into
> RAMDirectories and search multiple indexes, but what happens if the
> server crashes with a certain number of entries indexed into a
> RAMDirectory?
>
> Any solutions for that in the Solr project?

But the problem is twofold:
 1) You can't freely mix adds and deletes in Lucene.
 2) Changes are not immediately visible... you need to close the
current writer and open a new IndexSearcher, which are relatively
heavyweight operations.

Solr solved (1) by adding all documents immediately as they come in
(using the same thread as the client request).  Deletes are replied to
immediately, but are deferred.  When a "commit" happens, the writer is
closed, a new reader is opened, and all the deletes are processed.
Then a new IndexSearcher is opened, making all the adds and deletes
visible.
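
A rough sketch of that shape (this is not Solr's actual code; the
class name and details are invented):

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.store.Directory;

    class BufferedUpdater {
        private final Directory dir;
        private final Analyzer analyzer;
        private IndexWriter writer;
        private final List deferredDeletes = new ArrayList(); // of Term

        BufferedUpdater(Directory dir, Analyzer analyzer)
                throws IOException {
            this.dir = dir;
            this.analyzer = analyzer;
            this.writer = new IndexWriter(dir, analyzer, false);
        }

        synchronized void add(Document doc) throws IOException {
            writer.addDocument(doc);     // adds go in immediately
        }

        synchronized void delete(Term idTerm) {
            deferredDeletes.add(idTerm); // acked now, applied at commit
        }

        synchronized IndexSearcher commit() throws IOException {
            writer.close();              // flush the pending adds
            IndexReader reader = IndexReader.open(dir);
            for (int i = 0; i < deferredDeletes.size(); i++) {
                reader.deleteDocuments((Term) deferredDeletes.get(i));
            }
            reader.close();
            deferredDeletes.clear();
            writer = new IndexWriter(dir, analyzer, false);
            return new IndexSearcher(dir); // sees all adds and deletes
        }
    }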

Solr doesn't do anything to solve (2).  Its main focus has been on
providing high-throughput, low-latency queries, not on the
"freshness" of updates.

Decoupling the indexing from storage might help if new additions don't
need to be searchable (but do need to be retrievable by id)... you
could make storage synchronous, but batch the adds/deletes in some
manner and open a new IndexSearcher less frequently.

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server


RE: GData Server - Lucene storage

Robert Engels
What we've done is: if the number of incoming documents is less than
some threshold, we serialize the documents to a "pending" file instead
of using the IndexWriter. If it is greater than the threshold, it is
assumed an index rebuild is occurring, and the updates are passed
directly to the IndexWriter.

We always process the pending file before any queries. This allows for
rapid index updates and transactional control, since the updates can
be batched, and if the server crashes, the pending updates are still
available in the "pending" file. The pending file is deleted after
successful processing.

A possible improvement would be to create a RAMDirectory from the
pending file as well, and then perform the query on the RAMDirectory
in addition to the main directory. Since our documents are uniquely
"keyed", we can eliminate matches in the main directory for those that
exist in the RAMDirectory (or have been deleted in the RAMDirectory)
fairly efficiently. If the number of updates is low, this would
improve the search latency for the readers.
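
Schematically, the pending file could look like this (names invented;
plain java.io serialization stands in for our actual format, and the
whole small batch is rewritten on each append to keep the sketch
simple):

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.ObjectInputStream;
    import java.io.ObjectOutputStream;
    import java.io.Serializable;
    import java.util.ArrayList;
    import java.util.List;

    class PendingLog {
        private final File file;
        private final List pending = new ArrayList(); // updates

        PendingLog(File file) throws IOException, ClassNotFoundException {
            this.file = file;
            if (file.exists()) { // recover updates left from a crash
                ObjectInputStream in =
                        new ObjectInputStream(new FileInputStream(file));
                pending.addAll((List) in.readObject());
                in.close();
            }
        }

        // record one update durably before acknowledging it
        synchronized void append(Serializable update) throws IOException {
            pending.add(update);
            ObjectOutputStream out =
                    new ObjectOutputStream(new FileOutputStream(file));
            out.writeObject(pending);
            out.close();
        }

        // called before any query: the batch goes to the IndexWriter
        // and the file is discarded
        synchronized List drain() {
            List batch = new ArrayList(pending);
            pending.clear();
            file.delete();
            return batch;
        }
    }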


Re: GData Server - Lucene storage

Simon Willnauer
In reply to this post by Yonik Seeley
On 6/2/06, Yonik Seeley <[hidden email]> wrote:

> But the problem is twofold:
> 1) You can't freely mix adds and deletes in Lucene.
> 2) Changes are not immediately visible... you need to close the
> current writer and open a new IndexSearcher, which are relatively
> heavyweight operations.
>
> Solr solved (1) by adding all documents immediately as they come in
> (using the same thread as the client request).  Deletes are replied
> to immediately, but are deferred.  When a "commit" happens, the
> writer is closed, a new reader is opened, and all the deletes are
> processed.
> Then a new IndexSearcher is opened, making all the adds and deletes
> visible.


The problem here is that there is no action comparable to commit. The
entry comes in and will be added to the storage. The delete will be
queued, but when should the delete operation start? Waiting for the
writer to idle? We could do it that way, but if a search request comes
in, the old entries will be found and can be retrieved from the
storage. In that case I have to hold the already added but not yet
deleted entries in a storage cache, to prevent the storage from
retrieving outdated entries that have the same ID as their updated
versions.

You use multiple IndexSearcher instances to serve searches, right? So
when all the deletes are done, you have to reopen all the
IndexSearchers again, right?! This would happen quite often due to
updates and inserts.
Hmm, it seems more and more like a bad idea to use a Lucene index as a
storage. Rather go straight to a database.



> Solr doesn't do anything to solve (2).  Its main focus has been on
> providing high-throughput, low-latency queries, not on the
> "freshness" of updates.
>
> Decoupling the indexing from storage might help if new additions
> don't need to be searchable (but do need to be retrievable by id)...
> you could make storage synchronous, but batch the adds/deletes in
> some manner and open a new IndexSearcher less frequently.


The indexing will be decoupled from the storage anyway; otherwise I
could not provide a pluggable storage. But I know I confused you by
using the word "indexing" in previous mails.


simon

Re: GData Server - Lucene storage

Yonik Seeley
On 6/2/06, Simon Willnauer <[hidden email]> wrote:

> The problem here is that there is no action comparable to commit.
> The entry comes in and will be added to the storage. The delete will
> be queued, but when should the delete operation start? Waiting for
> the writer to idle? We could do it that way, but if a search request
> comes in, the old entries will be found and can be retrieved from
> the storage. In that case I have to hold the already added but not
> yet deleted entries in a storage cache, to prevent the storage from
> retrieving outdated entries that have the same ID as their updated
> versions.
>
> You use multiple IndexSearcher instances to serve searches, right?

There is normally only a single searcher open at a time (except when
warming).
A main "registered" searcher handles all live requests.  When a new
searcher is opened, it is warmed up (requests are run against it in
the background, caches are pre-populated, etc.), and then it is
registered.
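
Schematically (names invented; a real version must also make sure no
request is still using the old searcher before closing it):

    import java.io.IOException;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.store.Directory;

    class SearcherRegistry {
        private volatile IndexSearcher registered;

        IndexSearcher current() { return registered; }

        void reopenAndWarm(Directory dir, Query[] warmers)
                throws IOException {
            IndexSearcher fresh = new IndexSearcher(dir);
            for (int i = 0; i < warmers.length; i++) {
                fresh.search(warmers[i]); // pre-populate caches
            }
            IndexSearcher old = registered;
            registered = fresh;           // live requests switch here
            if (old != null) old.close(); // unsafe if requests in flight
        }
    }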

> So when all the deletes are done, you have to reopen all the
> IndexSearchers again, right?!

Just one.

> This would happen quite often due to updates and inserts.
> Hmm, it seems more and more like a bad idea to use a Lucene index as
> a storage. Rather go straight to a database.

Yes, if you need to be able to *instantly* retrieve (but not search)
updates you just inserted, and you need to support a high volume of
updates and queries.

You could also do that in memory by supporting retrieval by ID from
your "batch" of documents not yet "committed" to Lucene.  The only
downside is that it's volatile.
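
Schematically (names invented; the Map is the volatile part):

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;

    class BatchedRetrieval {
        private final IndexSearcher searcher;          // committed index
        private final Map uncommitted = new HashMap(); // id -> entry

        BatchedRetrieval(IndexSearcher searcher) {
            this.searcher = searcher;
        }

        synchronized void insert(String id, String entry) {
            uncommitted.put(id, entry); // durable at the next commit
        }

        synchronized String retrieveById(String id) throws IOException {
            String fresh = (String) uncommitted.get(id);
            if (fresh != null) return fresh; // not yet in Lucene
            Hits hits = searcher.search(
                    new TermQuery(new Term("entryId", id)));
            return hits.length() > 0 ? hits.doc(0).get("content") : null;
        }
    }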

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server


Re: GData Server - Lucene storage

Tatu Saloranta
In reply to this post by Simon Willnauer
--- Simon Willnauer <[hidden email]> wrote:

...
> Using the client thread as the indexing thread might just cause some
> performance drawback, but that's acceptable at this stage of

Actually, I would not even assume that: handing tasks over between
threads causes context switches and more cache misses. In general,
doing everything as a 'batch' from the same thread may well have
higher throughput than handing things over. Or it might be just as
fast. Or, for a specific subset of problems, you may be right that
it'd be slower. ;-)
You can consider this a subset of the old 'co-operative vs.
pre-emptive scheduling' debate.

The point, obviously, is that this is the kind of thing that needs to
be measured if you care deeply about performance. Intuition often
leads one in the wrong direction in cases like this.
So you often start with the simplest workable solution, and consider
more sophisticated approaches when you have time and/or find out that
this part is the main performance bottleneck.

-+ Tatu +-



Re: GData Server - Lucene storage

Simon Willnauer
In reply to this post by Yonik Seeley
On 6/2/06, Yonik Seeley <[hidden email]> wrote:

>
> On 6/2/06, Simon Willnauer <[hidden email]> wrote:
>
> > This would happen quite often due to updates and inserts.
> > Hmm, it seems more and more like a bad idea to use a Lucene index
> > as a storage. Rather go straight to a database.
>
> Yes, if you need to be able to *instantly* retrieve (but not search)
> updates you just inserted, and you need to support a high volume of
> updates and queries.
>
> You could also do that in memory by supporting retrieval by ID from
> your "batch" of documents not yet "committed" to Lucene.  The only
> downside is that it's volatile.


So that's actually what I expected. You can't have everything with
this approach; I always have to lower my sights somewhere. But I want
to be prepared to serve a high volume of inserts and updates. I used
Lucene a couple of times with version 1.4, and concurrent write access
hasn't changed since. I will use Berkeley DB Java Edition as the
default storage; I can still implement a Lucene-based storage if there
is enough time. I had a look at the license of the Sleepycat Berkeley
DB for Java distribution, and in my opinion it is all right to use it
and distribute it with the GData server.
Are there any experts on licensing? Is it OK for the ASF to use that?

simon

Re: GData Server - Lucene storage

Erik Hatcher

On Jun 2, 2006, at 12:56 PM, Simon Willnauer wrote:
> I had a look at the license of the Sleepycat Berkeley DB for Java
> distribution, and in my opinion it is all right to use it and
> distribute it with the GData server.
> Are there any experts on licensing? Is it OK for the ASF to use that?

It's OK to use it, just like the DbDirectory does (in contrib/db).
But we cannot, as far as I know, distribute BDB itself.  Those who
download the GData server will have to download BDB separately.
Distributions from Apache should be entirely under the ASL, I
believe.  Though IANAL.

It'd be fine for someone to aggregate the projects and distribute
them from elsewhere, though.

        Erik



Re: GData Server - Lucene storage

jason rutherglen-2
In reply to this post by Simon Willnauer
Yonik,

It might be interesting to merge using BDB into Solr as an option to
provide better realtime updates.  Perhaps the replication could be
used as well, in place of rsync?  I don't have any experience with
BDB replication; does anyone have thoughts on the matter?

Jason


Re: GData Server - Lucene storage

Yonik Seeley
On 6/2/06, jason rutherglen <[hidden email]> wrote:
> It might be interesting to merge using BDB into Solr as an option to
> provide better realtime updates.  Perhaps the replication could be
> used as well, in place of rsync?  I don't have any experience with
> BDB replication; does anyone have thoughts on the matter?

It only matters if you need immediate retrieval by ID, but not
immediate searchability.

In most of these scenarios, Lucene/Solr is not used as the primary
data store.  Updates go directly to the database, and a separate
process periodically pulls changes from the database and indexes the
content.

An interesting idea for integration on the search side might be to
allow a hook to retrieve certain stored fields from other data sources
(like a DB), so that the info need not be duplicated in the Lucene
index if it's big.  That ties you more tightly to the DB, though (not
a good thing in general).

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server


Re: GData Server - Lucene storage

Andi Vajda
In reply to this post by jason rutherglen-2

On Fri, 2 Jun 2006, jason rutherglen wrote:

> It might be interesting to merge using BDB into Solr as an option to
> provide better realtime updates.  Perhaps the replication could be
> used as well, in place of rsync?  I don't have any experience with
> BDB replication; does anyone have thoughts on the matter?

If you're thinking of using Berkeley DB as the store behind the Lucene
index via the DbDirectory Directory implementation, here are a few
things to keep in mind:

   - always setUseCompoundFile(false)
     don't use compound Lucene index files on top of Berkeley DB:
      . there is a bug that prevents this from working correctly
      . it makes no sense anyway, since it duplicates what DbDirectory
        is already doing (all index files are stored in the same
        Berkeley DB file)
      . it slows things down

   - if you are using a transaction around all the index updates, you
     may want to consider doing all the index updates in a
     RAMDirectory first and then adding the RAMDirectory wholesale to
     the DbDirectory in that transaction. This makes indexing
     considerably faster (3x for me) and does a LOT less thrashing
     around in Berkeley DB, which can otherwise lead to a large number
     of transactional log files rapidly filling up your hard drive.

I'm not really sure if and how index merging works. For my use, having
no merging is good enough, since I never update existing documents but
always add a new version of them instead. The concept of a version is
tied to my application, and each transaction corresponds to a new
version.
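
The second tip, as a sketch (the Berkeley DB transaction and the
DbDirectory construction are left to the caller, since those depend on
the contrib/db setup):

    import java.io.IOException;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.RAMDirectory;

    class BulkDbAdd {
        // dbDir is assumed to be a DbDirectory opened inside the
        // surrounding Berkeley DB transaction.
        static void bulkAdd(Directory dbDir, Analyzer analyzer,
                            Document[] docs) throws IOException {
            RAMDirectory ram = new RAMDirectory();
            IndexWriter ramWriter = new IndexWriter(ram, analyzer, true);
            ramWriter.setUseCompoundFile(false); // tip 1: no compound
            for (int i = 0; i < docs.length; i++) {
                ramWriter.addDocument(docs[i]);
            }
            ramWriter.close();

            IndexWriter dbWriter = new IndexWriter(dbDir, analyzer, false);
            dbWriter.setUseCompoundFile(false);  // tip 1, BDB side too
            dbWriter.addIndexes(new Directory[] { ram }); // tip 2
            dbWriter.close();
        }
    }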

Andi..


Re: GData Server - Lucene storage

Otis Gospodnetic-2
In reply to this post by Simon Willnauer
Simon,

I took a quick look at the UML PDF.  It seems to me that the various
*Services are overly complicated.  Since you can have only one thread
modifying the Lucene index, perhaps you should go the same route as
IndexModifier (I have never used it, but it looks like people are
using it to manage write/delete/search concurrency).  So perhaps all
you need are an IndexStorageService and a SearchService for the
searchable Lucene index(es), and a DataStorageService for storing and
reading data from the BDB store or whatever you end up using.
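
For reference, the IndexModifier route looks roughly like this (Lucene
1.9 API, with the field name invented); it serializes adds and deletes
internally:

    import java.io.IOException;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexModifier;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.store.Directory;

    class SingleWriterUpdate {
        static void update(Directory dir, String entryId,
                           Document newDoc) throws IOException {
            IndexModifier modifier =
                    new IndexModifier(dir, new StandardAnalyzer(), false);
            modifier.deleteDocuments(new Term("entryId", entryId));
            modifier.addDocument(newDoc); // the updated version
            modifier.flush();             // make the change durable
            modifier.close();
        }
    }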

Regarding the naming of StorageCache - this confused me at first.
Seeing "cache" makes me think "previously retrieved/found data stored
in a cache for faster subsequent requests/searches".  But from what I
can tell, that is not what StorageCache is about.  It looks like
StorageCache is really a buffer of entries that are scheduled to be
written to or deleted from the index+storage.  If that's so, I would
consider renaming it "StorageBuffer" or some such.

Otis



Re: GData Server - Lucene storage

Simon Willnauer
On 6/2/06, Otis Gospodnetic <[hidden email]> wrote:

>
> Simon,
>
> I took a quick look at the UML PDF.  It seems to me that the various
> *Services are overly complicated.  Since you can have only one
> thread modifying the Lucene index, perhaps you should go the same
> route as IndexModifier (I have never used it, but it looks like
> people are using it to manage write/delete/search concurrency).  So
> perhaps all you need are an IndexStorageService and a SearchService
> for the searchable Lucene index(es), and a DataStorageService for
> storing and reading data from the BDB store or whatever you end up
> using.


The UML is just about the storage; it has nothing to do with the
search index.  The search index will be a different index.
Thank you for the hint about IndexModifier. I changed the UML and
uploaded it again, if you want to have a look at it.
I guess the performance drawback won't be too big, due to the size of
the entries I will store. A feed server also mainly serves GET
requests. I will implement two storage types anyway, but I'm not sure
yet which one will be first ;) I guess I'll go for Lucene.



> Regarding the naming of StorageCache - this confused me at first.
> Seeing "cache" makes me think "previously retrieved/found data
> stored in a cache for faster subsequent requests/searches".  But
> from what I can tell, that is not what StorageCache is about.  It
> looks like StorageCache is really a buffer of entries that are
> scheduled to be written to or deleted from the index+storage.  If
> that's so, I would consider renaming it "StorageBuffer" or some
> such.


This is true.  That should be changed. :)


Re: GData Server - Lucene storage

jason rutherglen-2
In reply to this post by Andi Vajda
Is it possible to turn off directory locking with BDB?  How is the performance compared to regular FSDirectory for queries?


Re: GData Server - Lucene storage

Andi Vajda

On Fri, 2 Jun 2006, jason rutherglen wrote:

> Is it possible to turn off directory locking with BDB?  How is the
> performance compared to regular FSDirectory for queries?

The DbLock class in the org.apache.lucene.store.db package (to which
DbDirectory belongs) does absolutely nothing. This is because Berkeley
DB will do the locking it needs to keep the transactions isolated
anyway.

If you run several transactions concurrently, be ready to delve into the
delicacies of recovering from aborted-by-deadlock transactions and/or avoiding
hard deadlocks.

More about this here: http://www.sleepycat.com/docs/ref/lock/am_conv.html

Andi..



Re: GData Server - Lucene storage

Ian Boston
In reply to this post by Simon Willnauer
Simon,
I'm picking this thread up from the web archive, so this message may
not be threaded correctly, but there was some talk of replication of
indexes. I've just completed a custom FSDirectory implementation that
is designed to work in a cluster with replication.

The anatomy of this cluster is a shared database (MySQL or Oracle) and
stateless nodes with local disk storage. The index load is not that
high (compared to big Nutch installations), but not tiny either: maybe
1TB of raw data, with an index of 10GB (a guess).

I would have used rsync, but ideally I wanted it to work with no
sysadmin setup (a pure Java install). I looked at, and really liked,
NDFS, but decided it was too much admin overhead to set up. The
deployers like to do "maven build deploy; tomcat/catalina.sh start" to
get up and running (easy life!).

Indexing is performed using a queue (persisted in the DB), with a
distributed lock manager allowing one of the nodes in the cluster to
take responsibility for indexing, notifying all other nodes when done
(then they reload the index). This happens every few minutes in
production.

FSDirectory is efficient and fast, and I wanted that in the cluster. I
looked at JDBCDirectory (from the Compass framework) but found that
even with a non-compound index, the DB overhead was just too great (on
average 1/10 the performance on MySQL compared to local disk; Oracle
might be better), the problem mainly being seeks into BLOBs. I guess
the Berkeley DB Directory is going to be similar in some ways, except
that the seeks may be more efficient.

Eventually I borrowed some concepts from Nutch. The index writer
writes a new segment with FSDirectory, then merges it into the current
segment; that segment is compressed and checksummed (MD5) and sent to
the database. Current segments are rotated when they grow over 2MB.
When a node receives an index reload event, it syncs its local
segments with the DB and loads them with a MultiReader using
FSDirectory.
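
The compress-and-checksum step is roughly this, in plain JDK code (the
DB write itself is elided):

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.security.MessageDigest;
    import java.util.zip.GZIPOutputStream;

    class SegmentPacker {
        // Gzip the raw segment bytes into the stream that goes to the
        // DB BLOB, and MD5 the raw bytes so a node can validate its
        // local copy after downloading; returns the checksum.
        static byte[] pack(File segment, OutputStream toDb)
                throws Exception {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            GZIPOutputStream zip = new GZIPOutputStream(toDb);
            InputStream in = new FileInputStream(segment);
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                md5.update(buf, 0, n); // checksum the uncompressed bytes
                zip.write(buf, 0, n);  // compress them for shipping
            }
            in.close();
            zip.finish();
            return md5.digest();       // stored alongside the BLOB
        }
    }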

The sweet-spot features are:

Performance is almost the same as FSDirectory, except that the end of
the IndexWriter operation and the start of the IndexReader operation
have slightly more overhead.

When nodes are added to the cluster, they can validate their local
segment copies and bring them up to date against the cluster.

There is a real-time backup of the index.

The segments are validated prior to being sent to the DB.


You could easily use a SAN/NAS in place of the DB to ship the
segments.

-

I haven't done real heavy production tests, but I have had it running,
indexing the contents of my hard disk flat out, for over 48 hours,
with 200 2MB segments in the DB.

There is probably some housekeeping (e.g. merging) that should be
done, and, not being a Lucene expert, I am bound to have missed
something.

If anyone spots anything, please let me know :)


Ian


If you're interested, you can find the code at
https://source.sakaiproject.org/svn//search/trunk/search-impl/impl/src/java/org/sakaiproject/search/

The Distributed Lock manager is at
https://source.sakaiproject.org/svn//search/trunk/search-impl/impl/src/java/org/sakaiproject/search/component/service/impl/SearchIndexBuilderWorkerImpl.java

The Indexer is at
https://source.sakaiproject.org/svn//search/trunk/search-impl/impl/src/java/org/sakaiproject/search/component/dao/impl/SearchIndexBuilderWorkerDaoImpl.java

and the JDBC Index shipper is at
https://source.sakaiproject.org/svn//search/trunk/search-impl/impl/src/java/org/sakaiproject/search/index/ClusterFilesystem.java
https://source.sakaiproject.org/svn//search/trunk/search-impl/impl/src/java/org/sakaiproject/search/index/impl/ClusterFSIndexStorage.java
https://source.sakaiproject.org/svn//search/trunk/search-impl/impl/src/java/org/sakaiproject/search/index/impl/JDBCClusterIndexStore.java
