[lucy-user] C library - RAM index serialization/deserialization

serkanmulayim@gmail.com
Hi,

For security reasons we need to keep the index in memory, so we use RAMFolder instead of FSFolder. We have tested it and it seems to work well as long as no OOM issue occurs.

The problem is that at some point we need to encrypt the index (RAMFolder) and save it to disk in encrypted form. Adding the in-memory index to an FS index, encrypting the FS index, and then deleting it is not an option for security reasons.

So my question is how we can achieve this. Is it possible to serialize/deserialize the whole RAMFolder to a buffer, so that we only have to encrypt the buffer (or decrypt into a buffer)? This would be the ideal solution for us, but I could not find anything about it in the code.

If this is not an option, is it possible to traverse the RAMFolder and find the files in its folders? Then we could recreate the same structure with the file contents encrypted individually.

Please let me know if you need any clarification.

Thanks,
 Serkan

Re: [lucy-user] C library - RAM index serialization/deserialization

serkanmulayim@gmail.com
Hi again,

I realized that it is not possible to serialize the RAM index directly to bytes. So I ran some tests that copy all files in the RAM folder to an FS folder by iterating over the RAM folder's contents. With this approach I can create an FS folder holding the index files. However, the deserialization part, where I need to create a RAM folder directly from the FS folder (again by iterating over its contents), has a problem.
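
The per-file iteration described above could also target a single byte buffer instead of an FS folder, which would then be encrypted as one unit. The following is only a minimal sketch under that assumption: it uses no Lucy calls, and `MemFile`, `pack_files` and `find_packed` are hypothetical names. Reading the names and contents out of the RAM folder through the Store API is not shown.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical record for one file already read into memory (not a Lucy type). */
typedef struct {
    const char    *name;
    const uint8_t *data;
    uint32_t       len;
} MemFile;

/* Store a 32-bit value in little-endian order and return the next write position. */
static uint8_t *put_u32(uint8_t *p, uint32_t v) {
    p[0] = (uint8_t)(v & 0xFF);
    p[1] = (uint8_t)((v >> 8) & 0xFF);
    p[2] = (uint8_t)((v >> 16) & 0xFF);
    p[3] = (uint8_t)((v >> 24) & 0xFF);
    return p + 4;
}

static uint32_t get_u32(const uint8_t *p) {
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Layout: [count] then, per file, [name_len][name][data_len][data].
 * Returns a malloc'ed buffer the caller must free, or NULL on failure. */
uint8_t *pack_files(const MemFile *files, uint32_t count, size_t *out_len) {
    assert(files != NULL && out_len != NULL);
    size_t total = 4;
    for (uint32_t i = 0; i < count; i++) {
        total += 4 + strlen(files[i].name) + 4 + files[i].len;
    }
    uint8_t *buf = malloc(total);
    if (buf == NULL) { return NULL; }
    uint8_t *p = put_u32(buf, count);
    for (uint32_t i = 0; i < count; i++) {
        uint32_t name_len = (uint32_t)strlen(files[i].name);
        p = put_u32(p, name_len);
        memcpy(p, files[i].name, name_len);
        p += name_len;
        p = put_u32(p, files[i].len);
        if (files[i].len > 0) { memcpy(p, files[i].data, files[i].len); }
        p += files[i].len;
    }
    *out_len = total;
    return buf;
}

/* Look up one file in a packed (already decrypted) buffer.
 * Returns a pointer into buf, or NULL if missing or the buffer is truncated. */
const uint8_t *find_packed(const uint8_t *buf, size_t buf_len,
                           const char *name, uint32_t *len_out) {
    if (buf_len < 4) { return NULL; }
    size_t   want  = strlen(name);
    uint32_t count = get_u32(buf);
    size_t   pos   = 4;
    for (uint32_t i = 0; i < count; i++) {
        if (buf_len - pos < 4) { return NULL; }
        uint32_t name_len = get_u32(buf + pos); pos += 4;
        if (buf_len - pos < name_len) { return NULL; }
        const uint8_t *entry_name = buf + pos; pos += name_len;
        if (buf_len - pos < 4) { return NULL; }
        uint32_t data_len = get_u32(buf + pos); pos += 4;
        if (buf_len - pos < data_len) { return NULL; }
        if (name_len == want && memcmp(entry_name, name, name_len) == 0) {
            *len_out = data_len;
            return buf + pos;
        }
        pos += data_len;
    }
    return NULL;
}
```

The buffer returned by `pack_files` is what would be handed to the encryption routine; after decryption, `find_packed` (or a loop over all entries) recovers each file so it can be written back into a fresh RAM folder.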

It seems that the RAM folder contains virtual files whose names match the entries of the cfmeta.json file. When I try to open an indexer or a searcher over the RAM folder, I get the following error: "File not found: 'mainindex/seg_2/sort-14.ord'". So the RAM folder does not work exactly like the FS folder, since it requires the virtual files to exist in the RAM folder.

Why do these virtual files exist? (I suspect they hold the cfmeta.json values as an optimization, e.g. to avoid reading cfmeta.json over and over.) Is it possible to create the virtual files from the cfmeta.json values with an API call? Or do you have any other suggestions?

Thanks,
Serkan

Re: [lucy-user] C library - RAM index serialization/deserialization

Nick Wellnhofer
On 02/04/2018 19:52, [hidden email] wrote:
> I realized that it is not possible to serialize the RAM index directly to bytes. This is why I made some tests to copy all files in the ram folder to FS folder by iterating over the contents for the RAM folder.

Yes, you have to inspect the RAMFolder using the private API of Folder,
FileHandle, InStream, etc. It's documented in the .cfh files in
core/Lucy/Store but it seems that you already figured this out.

     https://git1-us-west.apache.org/repos/asf?p=lucy.git;a=tree;f=core/Lucy/Store

> Why are there these virtual files? (I suspect that for optimization purposes (e.g. in order not to read the cfmeta.json file over again), virtual files hold the cfmeta.json values). So my question is, is it possible to create the virtual files from cfmeta.json value with an API call? Or do you have any other suggestions.

These are so-called "compound files" used to consolidate multiple files into a
single one and reduce the number of open file handles. On the filesystem
level, there are two files, cfmeta.json and cf.dat, but Lucy's Store API
automatically returns information about the virtual files. If you want to
treat compound files as regular files, you have to check the Folder objects
returned by Folder_Find_Folder. If it's a Lucy::Store::CompoundFileReader,
call CFReader_Get_Real_Folder to get the actual RAMFolder or FSFolder:

     if (Folder_is_a(subfolder, COMPOUNDFILEREADER)) {
         CompoundFileReader *cf_reader = (CompoundFileReader*)subfolder;
         subfolder = CFReader_Get_Real_Folder(cf_reader);
     }
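
To make the idea of virtual files concrete, here is a minimal, self-contained sketch of the concept only, not of Lucy's implementation. It assumes that the metadata maps each virtual file name to an offset and length inside one data blob; `VirtualEntry`, `CompoundView` and `compound_lookup` are hypothetical names, and parsing of cfmeta.json is not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* One virtual file: a named byte range inside the data blob.
 * Hypothetical stand-in for one parsed metadata entry. */
typedef struct {
    const char *name;
    size_t      offset;
    size_t      length;
} VirtualEntry;

/* The data blob plus the table that makes virtual files visible. */
typedef struct {
    const uint8_t      *dat;       /* stands in for the contents of cf.dat */
    size_t              dat_len;
    const VirtualEntry *entries;   /* stands in for parsed cfmeta.json     */
    size_t              num_entries;
} CompoundView;

/* Resolve a virtual file name to its bytes inside the blob.
 * Returns NULL if the name is unknown or the range is out of bounds. */
const uint8_t *compound_lookup(const CompoundView *cv, const char *name,
                               size_t *len_out) {
    assert(cv != NULL && name != NULL && len_out != NULL);
    for (size_t i = 0; i < cv->num_entries; i++) {
        const VirtualEntry *e = &cv->entries[i];
        if (strcmp(e->name, name) != 0) { continue; }
        if (e->offset > cv->dat_len || e->length > cv->dat_len - e->offset) {
            return NULL;
        }
        *len_out = e->length;
        return cv->dat + e->offset;
    }
    return NULL;
}
```

With an empty entry table the lookup fails even though the blob is present, which is analogous to the "File not found" error reported above when the two real files are copied but no reader is set up on top of them.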

After deserializing cfmeta.json and cf.dat into a RAMFolder, you'll have to
recreate the CFReaders and replace the entry in the enclosing folder. Have a
look at Folder_Consolidate to get the idea. But the `entries` hash isn't
exposed, so you probably can't do that without changes to the Lucy source code.

As an alternative, you could try to change Lucy's behavior to not create
compound files for RAMFolders at all. Subclassing RAMFolder and making
Folder_Consolidate a no-op should work.

Nick

Re: [lucy-user] C library - RAM index serialization/deserialization

serkanmulayim@gmail.com
Hi Nick,

Thank you very much for your response. I was stuck on exactly what you mentioned. I need to create an entry for the folder with the CFReader, but there are no internal functions supporting that. The Folder implementation fails since it checks for cfmeta.json.

Secondly, it does not seem that I can link a CFReader (Folder*) as an entry to the enclosing folder, since as far as I can see there is no function to link a folder inside a folder. Can you confirm?

If the above is not an option, I was thinking of subclassing RAMFolder. I do not think I need to create a .cfh file for that; instead I will create my own Folder implementation with a static header file. I think this is the right approach (unless there is a better one which does not require subclassing, e.g. using InStreams).

Thanks again,
Serkan

Re: [lucy-user] C library - RAM index serialization/deserialization

serkanmulayim@gmail.com
I ran a test with a new method, Folder_Consolidate2, which does NOT call CFWriter_Consolidate. This works, but it was only for testing purposes (I am not changing the source code). Is there a way to get the ivars struct for a folder somehow? This would make things much easier for me.

If this is not feasible, extending RAMFolder with a new consolidate method which does not call CFWriter_Consolidate might be an option. If I go this way, what is the best way to create the new subclass without messing up the pointer casts?

I am still open to other suggestions though.

Thanks,
Serkan