Forcing specific index file names

Forcing specific index file names

Earl Hood
Is it possible to always have Lucene end up with the
same set of index filenames for each index generation
process?

I have an application that creates an index for a
set of files, and generally, the index files created
are the following:

_0.cfs  segments_2  segments.gen

Sometimes, however, I get the following:

_2.cfs  segments_4  segments.gen

I'm not sure what triggers the difference. Is there a
"rename" operation so I can rename the files to match
the first list?

Thanks,

--ewh

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Re: Forcing specific index file names

Chris Hostetter-3

: Is it possible to always have Lucene end up with the
: same set of index filenames for each index generation
: process?

this smells like an XY problem .... why do you care what the file names
are? that's an implementation detail of lucene -- the directory as a whole
is the index -- what are you trying to do that you are concerned about
wanting to "rename" the files?

http://people.apache.org/~hossman/#xyproblem
XY Problem

Your question appears to be an "XY Problem" ... that is: you are dealing
with "X", you are assuming "Y" will help you, and you are asking about "Y"
without giving more details about the "X" so that we can understand the
full issue.  Perhaps the best solution doesn't involve "Y" at all?
See Also: http://www.perlmonks.org/index.pl?node_id=542341



-Hoss

Re: Forcing specific index file names

Earl Hood
On Tue, Dec 14, 2010 at 12:53 AM, Chris Hostetter
<[hidden email]> wrote:
>
> : Is it possible to always have Lucene end up with the
> : same set of index filenames for each index generation
> : process?
>
> this smells like an XY problem .... why do you care what the file names
> are? that's an implementation detail of lucene -- the directory as a whole
> is the index -- what are you trying to do that you are concerned about
> wanting to "rename" the files?

I have to create patch sets against two versions of a data
set in a directory tree structure, and the data set contains
a lucene index.

However, if the filenames are not consistent for the index,
then the delta program thinks they are completely new
files vs just doing an xdelta on the index data.

If renaming is not possible, the delta program will
have to have lucene awareness about variations in
the filenames between two versions of a data set
tree.  I guess I will have to do this if I am going
to be lectured about how to develop software.

From a design perspective, I figured if the process
that builds the data sets and the lucene index can
be modified to make sure the lucene index files are
consistently named, the delta computation program
can stay agnostic about lucene and just do a basic
tree differencing algorithm.
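
To illustrate the problem, here is a minimal sketch of a name-based
tree comparison, assuming the delta program pairs files purely by
relative path. The function names (list_files, classify) are
hypothetical and not part of any real delta tool. A file that keeps its
name becomes a delta candidate; a renamed file shows up as one removal
plus one addition and would be shipped whole.

```python
import os


def list_files(root):
    """Return the set of file paths under root, relative to root."""
    found = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            found.add(os.path.relpath(full, root))
    return found


def classify(old_root, new_root):
    """Pair files by relative path only (hypothetical delta-tool logic)."""
    old_files = list_files(old_root)
    new_files = list_files(new_root)
    return {
        # Same path in both trees: candidates for a binary delta.
        "delta": sorted(old_files & new_files),
        # Path only in the new tree: shipped in full.
        "added": sorted(new_files - old_files),
        # Path only in the old tree: deleted by the patch.
        "removed": sorted(old_files - new_files),
    }
```

With the two file listings from my first message, only segments.gen
would land in "delta"; the .cfs file, which holds most of the data,
would land in "added".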

--ewh

Re: Forcing specific index file names

Erick Erickson
I'm missing something here. You mention "two versions of
a data set in a directory tree structure". The Lucene indexes
will have different names if they have been merged. Usually
this is a result of changing the data, issuing an optimize, etc.
That is, the data *is* different so it seems perfectly appropriate
to consider them new...

Lucene never changes an existing segments file once it is committed.
It only merges segments then deletes the old ones. So if the file names
are different, then it seems that renaming them wouldn't be what you
really want.

So either it really is an XY problem (as in "I really don't think you want
to do that") or I completely misunderstand what
you're trying to do.

Best
Erick

Re: Forcing specific index file names

Earl Hood
On Tue, Dec 14, 2010 at 9:45 AM, Erick Erickson <[hidden email]> wrote:
> Lucene never changes an existing segments file once it is committed.
> It only merges segments then deletes the old ones. So if the file names
> are different, then it seems that renaming them wouldn't be what you
> really want.
>
> So either it really is an XY problem (as in "I really don't think you want
> to do that") or I completely misunderstand what
> you're trying to do.

In my testing, when the filenames are the same, doing an xdelta on the
files (mainly the file that contains most of the data, the .cfs file),
there is a significant reduction in the size of the patch file created.

Since bandwidth is a critical factor in the project I'm on, the
reduction in size is very beneficial.  The changes in the data
set are of a nature that the search index data itself should not
be drastically different, and hence xdelta is able to provide
a patch file that is smaller than the entire new .cfs file.

I could make an exception in the patch creation program to detect
that there is a lucene directory, and diff the .cfs files, even if
they have different names, but was seeing if I can avoid that
so the patch program can be agnostic about the contents of the
directory tree.

--ewh

Re: Forcing specific index file names

Doron Cohen-2
> I could make an exception in the patch creation program to detect
> that there is a lucene directory, and diff the .cfs files, even if
> they have different names, but was seeing if I can avoid that
> so the patch program can be agnostic about the contents of the
> directory tree.
>

Doing only this is insufficient - .cfs files are referred to (by name) by
segments files. There can be multiple .cfs files. There can be multiple
segments files. See Lucene's File Format documentation - e.g.
http://lucene.apache.org/java/3_0_3/fileformats.html#Segments%20File.

As Erick pointed out, when exactly the same indexing scenario takes place
you should end up with the same index files (content and name). So if
running into different file names is something that happens only in your
test env, better make sure that the test behavior indeed reflects actual
"field behavior" - just to make sure you are not spending too much time on
optimizing a scenario that can happen in your tests but will never happen in
"production".

Assuming you check this and find that the scenario that creates identical
indexes with different file names is possible and common and should be
optimized, then a more involved solution would be required to make sure that
the decision not to copy a certain file is correct.

Perhaps I'll change my mind after understanding the scenario that creates
this, but for now I'd rather not ignore the file name differences.

Doron

Re: Forcing specific index file names

Earl Hood
On Wed, Dec 15, 2010 at 7:49 AM, Doron Cohen wrote:
> Perhaps I'll change my mind after understanding the scenario that creates
> this, but for now I'd rather not ignore the file name differences.

It may be possible to control the data generation process, so
the filenames are consistent.  Changes in the filenames seem
to happen when doing incremental builds of the data sets, but
we may be allowed to require full builds of the data sets
for purposes of creating patch sets.

I do realize that the segment file is important also, so I would
take the collection of files (3 files that I know of) into account
if I have to deal with filename differences.
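
As a rough sketch of what that Lucene awareness might look like, the
following pairs files across two index directories by role rather than
by exact name. The role names and patterns are assumptions based only on
the three file names seen in this thread (_N.cfs, segments_N,
segments.gen), and pair_by_role is a hypothetical helper. Because there
can be multiple .cfs and segments files, it declines to pair anything
when a role is ambiguous or a file is unrecognized.

```python
import re

# Assumed roles, derived only from the file names seen in this thread.
ROLE_PATTERNS = [
    ("compound", re.compile(r"^_.+\.cfs$")),
    ("segments", re.compile(r"^segments_.+$")),
    ("gen", re.compile(r"^segments\.gen$")),
]


def role_of(name):
    """Return the assumed role for a file name, or None if unrecognized."""
    for role, pattern in ROLE_PATTERNS:
        if pattern.match(name):
            return role
    return None


def roles_table(names):
    """Map role -> name, or None if any name is unknown or a role repeats."""
    table = {}
    for name in names:
        role = role_of(name)
        if role is None or role in table:
            return None
        table[role] = name
    return table


def pair_by_role(old_names, new_names):
    """Return {role: (old_name, new_name)}, or None when pairing is unsafe."""
    old = roles_table(old_names)
    new = roles_table(new_names)
    if old is None or new is None or old.keys() != new.keys():
        return None
    return {role: (old[role], new[role]) for role in old}
```

A None result would mean falling back to treating the files as new,
which is the conservative choice when the index has more than one
segment.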

Since it appears that index file renaming is not readily available,
I will wait and see how some real-world scenarios work out to
determine if the different index filenames will be a major concern for
the project.

Thanks for everyone's feedback,

--ewh

Re: Forcing specific index file names

Chris Hostetter-3
In reply to this post by Earl Hood

: In my testing, when the filenames are the same, doing an xdelta on the
: files (mainly the file that contains most of the data, the .cfs file),
: there is a significant reduction in the size of the patch file created.

As noted elsewhere in this thread, the filenames themselves are
significant because they are tracked in the segments file and indicate
generational information.

files with the same names should be the same, files with different names
should be very different -- but if your binary diff tool is finding
commonalities between files in new segments as the index grows over time,
and you feel like you can take advantage of this, then i would suggest
using a simple tool like "tar" to combine all of the index files into a
single file with a predictable name before running your diff tool.
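
A minimal sketch of that tar idea, using Python's standard tarfile
module rather than the tar command itself. The function name
bundle_index and the archive name are hypothetical. It assumes the index
directory is flat, writes an uncompressed archive (compression would
likely hide the similarities a binary diff tool looks for), and
normalizes ordering and metadata so that identical index content yields
an identical archive.

```python
import os
import tarfile


def bundle_index(index_dir, out_path):
    """Pack every regular file in index_dir into one uncompressed tar."""
    # Sort names so member order does not depend on the filesystem.
    names = sorted(
        n for n in os.listdir(index_dir)
        if os.path.isfile(os.path.join(index_dir, n))
    )
    with tarfile.open(out_path, "w") as tar:
        for name in names:
            full = os.path.join(index_dir, name)
            info = tar.gettarinfo(full, arcname=name)
            # Normalize metadata so only names and contents affect the bytes.
            info.mtime = 0
            info.uid = info.gid = 0
            info.uname = info.gname = ""
            info.mode = 0o644
            with open(full, "rb") as fh:
                tar.addfile(info, fh)
```

Something like bundle_index("data/index", "data/index.tar") on each
version of the data set would give the diff tool one file with the same
name on both sides. The output path should sit outside the index
directory so a later run does not pack the previous archive.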


-Hoss

Re: Forcing specific index file names

Earl Hood
On Wed, Dec 15, 2010 at 1:41 PM, Chris Hostetter
<[hidden email]> wrote:
> files with the same names should be the same, files with different names
> should be very different -- but if your binary diff tool is finding
> commonalities between files in new segments as the index grows over time,
> and you feel like you can take advantage of this, then i would suggest
> using a simple tool like "tar" to combine all of the index files into a
> single file with a predictable name before running your diff tool.

I've considered that tar-style approach.

My initial query to the list was to see if I can avoid making
the patch creation program aware of the set of files.

Anyway, the tar-style approach seems the way to go if testing
confirms it is needed.

Thanks,

--ewh
