|
Is it possible to always have Lucene end up with the same set of index
filenames for each index generation process?

I have an application that creates an index for a set of files, and
generally the index files created are the following:

  _0.cfs
  segments_2
  segments.gen

However, it appears that sometimes I get the following:

  _2.cfs
  segments_4
  segments.gen

I'm not sure what triggers the difference. Is there a "rename" operation
so I can rename the files to match the first list?

Thanks,

--ewh

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]
|
: Is it possible to always have Lucene end up with the
: same set of index filenames for each index generation
: process?

this smells like an XY problem ... why do you care what the file names
are? that's an implementation detail of lucene -- the directory as a whole
is the index -- what are you trying to do that you are concerned about
wanting to "rename" the files?

http://people.apache.org/~hossman/#xyproblem
XY Problem

Your question appears to be an "XY Problem" ... that is: you are dealing
with "X", you are assuming "Y" will help you, and you are asking about "Y"
without giving more details about the "X" so that we can understand the
full issue. Perhaps the best solution doesn't involve "Y" at all?
See Also: http://www.perlmonks.org/index.pl?node_id=542341

-Hoss
|
On Tue, Dec 14, 2010 at 12:53 AM, Chris Hostetter
<[hidden email]> wrote:
>
> : Is it possible to always have Lucene end up with the
> : same set of index filenames for each index generation
> : process?
>
> this smells like an XY problem ... why do you care what the file names
> are? that's an implementation detail of lucene -- the directory as a whole
> is the index -- what are you trying to do that you are concerned about
> wanting to "rename" the files?

I have to create patch sets against two versions of a data
set in a directory tree structure, and the data set contains
a Lucene index.

However, if the filenames are not consistent for the index,
then the delta program thinks they are completely new
files instead of just doing an xdelta on the index data.

If renaming is not possible, the delta program will
have to have Lucene awareness about variations in
the filenames between two versions of a data set
tree. I guess I will have to do this if I am going
to be lectured about how to develop software.

From a design perspective, I figured that if the process
that builds the data sets and the Lucene index can
be modified to make sure the Lucene index files are
consistently named, the delta computation program
can stay agnostic about Lucene and just do a basic
tree differencing algorithm.

--ewh
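
To make the problem concrete, here is a minimal sketch in Python of the kind of name-based tree differencing described above. The function names (`list_files`, `diff_trees`) are hypothetical and are not taken from any patch tool mentioned in this thread. The sketch only illustrates that a file whose name changes between two versions shows up as one removal plus one addition, never as a candidate for a binary delta.

```python
import os


def list_files(root):
    """Return the set of file paths under root, relative to root, with '/' separators."""
    paths = set()
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            rel = os.path.relpath(os.path.join(dirpath, name), root)
            paths.add(rel.replace(os.sep, "/"))
    return paths


def diff_trees(old_root, new_root):
    """Classify files purely by relative path (hypothetical helper).

    'common' files exist in both trees and could be handed to a binary
    delta tool; 'added' and 'removed' files have no counterpart by name.
    """
    old, new = list_files(old_root), list_files(new_root)
    return {
        "common": sorted(old & new),
        "added": sorted(new - old),
        "removed": sorted(old - new),
    }
```

With the two file listings from the first post, `segments.gen` would land in `common`, while `_0.cfs` and `_2.cfs` would appear under `removed` and `added` respectively, so the new `.cfs` file would be shipped in full.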
|
I'm missing something here. You mention "two versions of a data set in a
directory tree structure". The Lucene indexes will have different names if
they have been merged. Usually this is a result of changing the data,
issuing an optimize, etc. That is, the data *is* different, so it seems
perfectly appropriate to consider them new...

Lucene never changes an existing segments file once it is committed.
It only merges segments then deletes the old ones. So if the file names
are different, then it seems that renaming them wouldn't be what you
really want.

So either it really is an XY problem (as in "I really don't think you want
to do that") or I completely misunderstand what you're trying to do.

Best
Erick

On Tue, Dec 14, 2010 at 9:46 AM, Earl Hood <[hidden email]> wrote:

> On Tue, Dec 14, 2010 at 12:53 AM, Chris Hostetter
> <[hidden email]> wrote:
> >
> > : Is it possible to always have Lucene end up with the
> > : same set of index filenames for each index generation
> > : process?
> >
> > this smells like an XY problem ... why do you care what the file names
> > are? that's an implementation detail of lucene -- the directory as a
> > whole is the index -- what are you trying to do that you are concerned
> > about wanting to "rename" the files?
>
> I have to create patch sets against two versions of a data
> set in a directory tree structure, and the data set contains
> a Lucene index.
>
> However, if the filenames are not consistent for the index,
> then the delta program thinks they are completely new
> files instead of just doing an xdelta on the index data.
>
> If renaming is not possible, the delta program will
> have to have Lucene awareness about variations in
> the filenames between two versions of a data set
> tree. I guess I will have to do this if I am going
> to be lectured about how to develop software.
>
> From a design perspective, I figured that if the process
> that builds the data sets and the Lucene index can
> be modified to make sure the Lucene index files are
> consistently named, the delta computation program
> can stay agnostic about Lucene and just do a basic
> tree differencing algorithm.
>
> --ewh
|
On Tue, Dec 14, 2010 at 9:45 AM, Erick Erickson <[hidden email]> wrote:
> Lucene never changes an existing segments file once it is committed.
> It only merges segments then deletes the old ones. So if the file names
> are different, then it seems that renaming them wouldn't be what you
> really want.
>
> So either it really is an XY problem (as in "I really don't think you want
> to do that") or I completely misunderstand what you're trying to do.

In my testing, when the filenames are the same, doing an xdelta on the
files (mainly the file that contains most of the data, the .cfs file)
gives a significant reduction in the size of the patch file created.
Since bandwidth is a critical factor in the project I'm on, the
reduction in size is very beneficial.

The changes in the data set are of a nature that the search index data
itself should not be drastically different, hence xdelta is able to
provide a patch file that is smaller than the entire new .cfs file.

I could make an exception in the patch creation program to detect
that there is a Lucene directory, and diff the .cfs files even if
they have different names, but I was seeing if I can avoid that
so the patch program can stay agnostic about the contents of the
directory tree.

--ewh
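
As an illustration of the exception described above, here is a rough Python sketch of pairing differently named index files across two versions so they can be handed to a binary delta tool. Everything in it is an assumption made for illustration: the helper names (`shape`, `pair_renamed`) are hypothetical, and the pattern assumes names of the form seen in this thread (`_0.cfs`, `segments_2`), where an underscore is followed by a short lowercase alphanumeric counter. It pairs files only when the match is unambiguous, and it makes no claim that the paired files are actually similar.

```python
import re
from collections import defaultdict

# Assumed naming pattern: "_<counter>" directly before the extension or at
# the end of the name, e.g. "_0.cfs" or "segments_2". This is a heuristic.
_COUNTER = re.compile(r"_[0-9a-z]+(?=\.|$)")


def shape(name):
    """Replace the counter part of a file name with a wildcard marker."""
    return _COUNTER.sub("_*", name)


def pair_renamed(old_names, new_names):
    """Pair files that exist on only one side but share the same name shape.

    Returns a sorted list of (old_name, new_name) tuples. A shape is paired
    only when exactly one old file and one new file have it, so directories
    holding several segments are left unpaired rather than guessed at.
    """
    old_only = set(old_names) - set(new_names)
    new_only = set(new_names) - set(old_names)

    old_by_shape = defaultdict(list)
    new_by_shape = defaultdict(list)
    for name in old_only:
        old_by_shape[shape(name)].append(name)
    for name in new_only:
        new_by_shape[shape(name)].append(name)

    pairs = []
    for key, olds in old_by_shape.items():
        news = new_by_shape.get(key, [])
        if len(olds) == 1 and len(news) == 1:
            pairs.append((olds[0], news[0]))
    return sorted(pairs)
```

For the two listings in the first post, this would pair `_0.cfs` with `_2.cfs` and `segments_2` with `segments_4`, and leave `segments.gen` to the ordinary same-name comparison. The patch applier would also need to know about each pairing, which is exactly the Lucene-specific coupling this approach introduces.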
|
> I could make an exception in the patch creation program to detect
> that there is a Lucene directory, and diff the .cfs files even if
> they have different names, but I was seeing if I can avoid that
> so the patch program can stay agnostic about the contents of the
> directory tree.

Doing only this is insufficient: .cfs files are referred to (by name) by
segments files. There can be multiple .cfs files. There can be multiple
segments files. See Lucene's File Format documentation, e.g.
http://lucene.apache.org/java/3_0_3/fileformats.html#Segments%20File.

As Erick pointed out, when exactly the same indexing scenario takes place
you should end up with the same index files (content and name). So if
running into different file names is something that happens only in your
test environment, better make sure that the test behavior really reflects
actual "field behavior", so that you are not spending too much time on
optimizing a scenario that can happen in your tests but will never happen
in "production".

Assuming you check this and find that the scenario that creates identical
indexes with different file names is possible and common and should be
optimized, then a more involved solution would be required to make sure
that the decision not to copy a certain file is correct.

Perhaps I'll change my mind after understanding the scenario that creates
this, but for now I'd rather not ignore the file name differences.

Doron
|
On Wed, Dec 15, 2010 at 7:49 AM, Doron Cohen wrote:
> Perhaps I'll change my mind after understanding the scenario that creates
> this, but for now I'd rather not ignore the file name differences.

It may be possible to control the data generation process so the
filenames are consistent. Changes in the filenames seem to happen when
doing incremental builds of the data sets, but we may be allowed to
require full builds of the data sets for purposes of creating patch sets.

I do realize that the segments file is important also, so I would take
the collection of files (3 files that I know of) into account if I have
to deal with filename differences.

Since it appears that index file renaming is not readily available,
I will wait and see how some real-world scenarios work out to determine
if the different index filenames will be a major concern for the project.

Thanks for everyone's feedback,

--ewh
|
In reply to this post by Earl Hood

: In my testing, when the filenames are the same, doing an xdelta on the
: files (mainly the file that contains most of the data, the .cfs file)
: gives a significant reduction in the size of the patch file created.

As noted elsewhere in this thread, the filenames themselves are
significant because they are tracked in the segments file and indicate
generational information.

files with the same names should be the same, files with different names
should be very different -- but if your binary diff tool is finding
commonalities between files in new segments as the index grows over time,
and you feel like you can take advantage of this, then i would suggest
using a simple tool like "tar" to combine all of the index files into a
single file with a predictable name before running your diff tool.

-Hoss
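
The "tar" suggestion above could be sketched in Python roughly as follows, using the standard `tarfile` module in place of the command-line tool. The function name `bundle_index` and the choice to normalize timestamps, ownership, and permissions are assumptions added here so that two builds with identical content produce byte-identical archives; none of this comes from the thread itself.

```python
import os
import tarfile


def bundle_index(index_dir, out_path):
    """Pack every entry of index_dir into one uncompressed tar at out_path.

    Entries are added in sorted order and their metadata is normalized, so
    the archive bytes depend only on file names and file contents.
    """
    def normalize(info):
        # Zero out metadata that would differ between otherwise equal builds.
        info.mtime = 0
        info.uid = info.gid = 0
        info.uname = info.gname = ""
        info.mode = 0o644
        return info

    # Mode "w" writes a plain tar with no compression.
    with tarfile.open(out_path, "w") as tar:
        for name in sorted(os.listdir(index_dir)):
            tar.add(os.path.join(index_dir, name), arcname=name, filter=normalize)
    return out_path
```

The archive is left uncompressed on purpose, on the assumption that compression would hide the byte-level similarities a binary delta tool looks for. The receiving side would apply the patch to its own bundle of the old index and then unpack the result, so the segment names inside the archive travel with the data and never need renaming.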
|
On Wed, Dec 15, 2010 at 1:41 PM, Chris Hostetter
<[hidden email]> wrote:
> files with the same names should be the same, files with different names
> should be very different -- but if your binary diff tool is finding
> commonalities between files in new segments as the index grows over time,
> and you feel like you can take advantage of this, then i would suggest
> using a simple tool like "tar" to combine all of the index files into a
> single file with a predictable name before running your diff tool.

I've considered that tar-style approach. My initial query to the list
was to see if I could keep the patch creation program agnostic to the
set of files.

Anyway, the tar-style approach seems the way to go if testing confirms
it is needed.

Thanks,

--ewh
