MemoryIndex


MemoryIndex

Robert Engels
Along the lines of LUCENE-550, what about having a MemoryIndex that accepts
multiple documents, then writes the index once, at close, in the Lucene file
format (so it could be merged)?

When adding documents using an IndexWriter, a new segment is created for
each document, and the segments are then periodically merged in memory
and/or with disk segments. When constructing an index or updating a "lot"
of documents in an existing index, this write/read/merge cycle seems
inefficient; if the document/field information were maintained in sorted
order (e.g. in TreeMaps), greater efficiency could be realized.

With a memory index, the memory needed during an update will increase
dramatically, but it could still be bounded: a disk-based index segment
would be written whenever too many documents accumulate in the memory index
(max buffered documents).

Does this "sound" like an improvement? Has anyone else tried something like
this?
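As a rough sketch of the proposal (hypothetical class and method names, no Lucene dependency), postings for several documents could be buffered in sorted TreeMaps and flushed as a segment once the max-buffered-documents bound is hit:

```java
import java.util.*;

// Hypothetical sketch of a multi-document in-memory index: postings are
// kept in sorted order (TreeMaps) so a segment could be written out in
// term order on flush. flush() is a stand-in for writing a real
// Lucene-format segment to disk.
class MultiDocMemoryIndex {
    // term -> (docId -> term frequency), both kept sorted
    private final TreeMap<String, TreeMap<Integer, Integer>> postings = new TreeMap<>();
    private final int maxBufferedDocs;
    private int numDocs = 0;
    private int flushCount = 0;

    MultiDocMemoryIndex(int maxBufferedDocs) {
        this.maxBufferedDocs = maxBufferedDocs;
    }

    void addDocument(String[] tokens) {
        int docId = numDocs++;
        for (String t : tokens) {
            postings.computeIfAbsent(t, k -> new TreeMap<>())
                    .merge(docId, 1, Integer::sum);
        }
        // bound memory: spill a "segment" once the buffer is full
        if (numDocs >= maxBufferedDocs) flush();
    }

    void flush() {
        if (numDocs == 0) return;
        flushCount++;      // a real implementation would write a segment here
        postings.clear();
        numDocs = 0;
    }

    int flushCount()   { return flushCount; }
    int bufferedDocs() { return numDocs; }
}
```

Because the maps are sorted, a flush can stream terms and postings in the order the Lucene file format expects, avoiding the per-document segment creation and repeated merging described above.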

Re: MemoryIndex

Wolfgang Hoschek-2
MemoryIndex was designed to maximize performance for a specific use case:
a pure in-memory data structure, at most one document per MemoryIndex
instance, any number of fields, high-frequency reads, high-frequency index
writes, no thread-safety required, and optional support for storing offsets.

I briefly considered extending it to the multi-document case, but  
eventually refrained from doing so, because I didn't really need such  
functionality myself (no itch). Here are some issues to consider when  
attempting such an extension:

- The internal data structure would probably look quite different.
- There are data structure/algorithmic trade-offs regarding time vs. space,
read vs. write frequency, and common vs. less common use cases.
- Hence, it may well turn out that there's not much to reuse.
- A priori, it isn't clear whether a new solution would be significantly
faster than normal RAMDirectory usage. Thus...
- A benchmark suite is needed to evaluate the chosen trade-offs.
- Tests are needed to ensure correctness (in practice, meaning it behaves
just like the existing alternative).

I'd say it's a non-trivial undertaking. Right now, for example, I don't
have time for such an effort. That doesn't mean it's impossible or
shouldn't be done, of course. If someone would like to run with it that
would be great, but in light of the above issues, I'd suggest doing it in
a new class (say MultiMemoryIndex or similar).

I believe Mark has done some initial work in that direction, based on an
independent (and different) implementation strategy.

Wolfgang.

On May 2, 2006, at 12:25 AM, Robert Engels wrote:

> [...]

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

IndexWriter mergeSegments

Karel Tejnora
In reply to this post by Robert Engels
Hi,
    I found a small issue when adding a 10GB index to a 20GB index using
addIndexes with useCompoundFile == true.

Before the compound file is created, the segments info is written but
points to a non-existing compound file; only then is a new .tmp created
and renamed to .cfs. Between the time the new segments file is written
and the new .cfs is created, an IndexReader in a different application
that does not honor the lock can pick up a segment with a missing .cfs,
resulting in empty hits.

The code in question is in IndexWriter.mergeSegments(int,int), between
lines 696 and 712:
/* segments written, new cfs has not been created */
synchronized (directory) {                 // in- & inter-process sync
  new Lock.With(directory.makeLock(COMMIT_LOCK_NAME), COMMIT_LOCK_TIMEOUT) {
      public Object doBody() throws IOException {
        segmentInfos.write(directory);     // commit before deleting
        deleteSegments(segmentsToDelete);  // delete now-unused segments
        return null;
      }
    }.run();
}

/* create cfs */
if (useCompoundFile) {
  final Vector filesToDelete = merger.createCompoundFile(mergedName + ".tmp");
  synchronized (directory) {               // in- & inter-process sync
    new Lock.With(directory.makeLock(COMMIT_LOCK_NAME), COMMIT_LOCK_TIMEOUT) {
        public Object doBody() throws IOException {
          // make compound file visible for SegmentReaders
          directory.renameFile(mergedName + ".tmp", mergedName + ".cfs");
          // delete now unused files of segment
          deleteFiles(filesToDelete);
          return null;
        }
      }.run();
  }
}

I see a solution in swapping those two blocks: first create the final
.cfs, then write the segments file and delete the old segments.

Has anybody seen a problem with that?
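To make the window concrete, here is a toy model (pure Java, hypothetical names, no Lucene involved) of the two orderings. The "segments" pointer and the set of on-disk files are plain variables, and the reader's lookup is placed exactly in the gap between the two commit steps:

```java
import java.util.*;

// Toy model of the race: a "segments" pointer is published, and a reader
// may look up the .cfs file it references at any moment in between the
// two steps of the merge commit.
class CommitOrderDemo {
    Set<String> files = new HashSet<>();   // files on "disk"
    String segments = null;                // currently published segment name

    // Original order: publish the segments file first, create .cfs after.
    boolean originalOrderReaderSeesFile() {
        segments = "_1";                                  // commit segments
        boolean seen = files.contains(segments + ".cfs"); // reader runs here
        files.add(segments + ".cfs");                     // .tmp -> .cfs (too late)
        return seen;
    }

    // Proposed order: create the .cfs first, then publish the segments file.
    boolean swappedOrderReaderSeesFile() {
        files.add("_2.cfs");                              // .tmp -> .cfs first
        segments = "_2";                                  // then commit segments
        return files.contains(segments + ".cfs");         // reader runs here
    }
}
```

In the original order the reader observes a segments file that references a compound file that does not exist yet; in the swapped order the compound file always exists by the time the segments file points at it.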

I can send a patch, but first I need to find an svn client in Gentoo :)
and it's too late here. Could somebody be so kind as to give me a link
describing how to generate a patch the Lucene/Apache way?

Thx,
Karel


Re: IndexWriter mergeSegments

Nadav Har'El-2
Karel Tejnora <[hidden email]> wrote on 03/05/2006 02:29:06 AM:

> ...
> Before the compound file is created, the segments info is written but
> points to a non-existing compound file; only then is a new .tmp created
> and renamed to .cfs. Between the time the new segments file is written
> and the new .cfs is created, an IndexReader in a different application
> that does not honor the lock can pick up a segment with a missing .cfs,
> resulting in empty hits.

Hi Karel, I've also been bothered by Lucene crash-intolerance bugs,
where killing Lucene at a specific (ill-)chosen moment leaves you
with an unusable index.

I filed a bug report at http://issues.apache.org/jira/browse/LUCENE-554
about another case where a crash can leave the index in an inconsistent
state, but it is likely that the problem you reported is even more
serious: the window of vulnerability appears to be relatively long in
your case, whereas the bug I reported requires extraordinary bad luck
to trigger.

> I see a solution in swapping those two blocks: first create the final
> .cfs, then write the segments file and delete the old segments.
> Has anybody seen a problem with that?

Looking at the code, I see no reason why your suggested fix wouldn't
work, and I don't see any negative side effects. But I'm new to
Lucene, so I hope someone with more experience can take a look.

By the way, I wonder why the current code first creates a file called
mergedName+".tmp" and only later renames it to ".cfs". What is the
point of doing that when directory.renameFile() is not atomic (see
LUCENE-554), and in some cases even resorts to copying (see
FSDirectory.renameFile()'s comments about "some jvms")? I also wonder
why the existing code does the renameFile() with the commit lock held,
rather than outside it.
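For comparison, here is a sketch of the write-to-temp-then-rename pattern using the modern java.nio.file API (which postdates this thread and is not Lucene's Directory API). The point of the pattern is that a reader never observes a partially written target file; it only pays off when the final rename really is atomic:

```java
import java.io.IOException;
import java.nio.file.*;

// Sketch: publish a file by writing it under a temporary name and then
// renaming it into place, attempting an atomic rename first.
class AtomicPublish {
    static Path publish(Path dir, String name, byte[] data) throws IOException {
        Path tmp = dir.resolve(name + ".tmp");
        Path dst = dir.resolve(name);
        Files.write(tmp, data);                // readers never see this name
        try {
            Files.move(tmp, dst, StandardCopyOption.ATOMIC_MOVE);
        } catch (AtomicMoveNotSupportedException e) {
            // Some filesystems cannot rename atomically -- the same caveat
            // FSDirectory.renameFile notes for "some jvms", where the rename
            // degrades to a copy and the window reappears.
            Files.move(tmp, dst, StandardCopyOption.REPLACE_EXISTING);
        }
        return dst;
    }
}
```

When the rename falls back to a copy, the .tmp indirection no longer protects readers, which is exactly the concern raised above about FSDirectory.renameFile().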

> I can send a patch, but first I need to find an svn client in Gentoo :)
> and it's too late here. Could somebody be so kind as to give me a link
> describing how to generate a patch the Lucene/Apache way?

I'm sorry I can't really help you with that; I'm also new to Lucene
and am now working on my first patch for it...
Good luck,
Nadav.

--
Nadav Har'El

