Encrypted index?

Adam Retter

I wondered if there is any facility already existing in Lucene for encrypting the values stored into the index and still being able to search them?

If not, I wondered if anyone could tell me if this is impossible to implement, and if not to point me perhaps in the right direction?

I imagine that just the text values and document fields to index (and optionally store) in the index would be either encrypted on the fly by Lucene using perhaps a public/private key mechanism. When a user issues a search query to Lucene they would also provide a key so that Lucene can decrypt the values as necessary to try and answer their query.

Thanks Adam.

--
Adam Retter

skype: adam.retter
tweet: adamretter
http://www.adamretter.org.uk

Re: Encrypted index?

Shawn Heisey-2
On 9/5/2015 5:06 AM, Adam Retter wrote:

> I wondered if there is any facility already existing in Lucene for
> encrypting the values stored into the index and still being able to
> search them?
>
> If not, I wondered if anyone could tell me if this is impossible to
> implement, and if not to point me perhaps in the right direction?
>
> I imagine that just the text values and document fields to index (and
> optionally store) in the index would be either encrypted on the fly by
> Lucene using perhaps a public/private key mechanism. When a user issues
> a search query to Lucene they would also provide a key so that Lucene
> can decrypt the values as necessary to try and answer their query.

I think you could probably add transparent encryption/decryption at the
Lucene level in a custom codec.  That probably has implications for
being able to read the older index when it's time to upgrade Lucene,
with a complete reindex being the likely solution.  Others will need to
confirm ... I'm not very familiar with Lucene code, I'm here for Solr.

Any verification of user identity/permission is probably best done in
your own code, before it makes the Lucene query, and wouldn't
necessarily be related to the encryption.

Requirements like this are usually driven by paranoid customers or
product managers.  I think that when you really start to examine what an
attacker has to do to actually reach the unencrypted information (Lucene
index in this case), they already have acquired so much access that the
system is completely breached and it won't matter what kind of
encryption is added.

I find many of these requirements to be silly, and put an incredible
burden on admin and developer resources with little or no benefit.
Here's an example of a similar customer encryption requirement that I
encountered recently:

We have a web application that has three "hops" involved.  A user talks
to a load balancer, which talks to Apache, where the connection is then
proxied to a Tomcat server with the AJP protocol.  The customer wanted
all three hops encrypted.  The first hop was already encrypted, the
second was easy, but the third proved to be very difficult.  Finally we
decided that we did not need load balancing on that last hop, and it
could simply talk to localhost, eliminating the need to encrypt it.

The customer was worried about an attacker sniffing the traffic on the
LAN and seeing details like passwords.  I consider this to be an insane
requirement.  In order to sniff that traffic, the attacker would need
one of two things:  Root access on a server, or physical access to the
infrastructure.  Physical access can be escalated to root access if you
know what you're doing.  Once someone has either of those things,
encrypted traffic won't matter, they will be able to learn anything they
need or do any damage they desire, without even needing to sniff the
traffic.

Thanks,
Shawn


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Re: Encrypted index?

Erick Erickson
The easiest way to do this is put the index over
an encrypted file system. Encrypting the actual
_tokens_ has a few problems, not the least of
which is that any encryption algorithm worth
its salt is going to make most searching totally
impossible.

Consider run, runner, running and runs with
simple wildcards. Searching for run* requires that all 4
variants have 'run' as a prefix, and any decent
encryption algorithm will not do that. Any
encryption that _does_ make that search possible
is trivially broken. I usually stop my thinking there,
but ngrams, casing, WordDelimiterFilterFactory
all come immediately to mind as "interesting".
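
To make the prefix problem concrete, here is a minimal Python sketch. It uses HMAC-SHA256 as a stand-in for deterministic token "encryption"; that choice, the key, and the helper names are assumptions for illustration only (a keyed hash is not a reversible cipher, and none of this is a Lucene API). Exact-match lookup survives, but the protected forms of run, runner, running and runs share no common prefix, so run* has nothing to scan.

```python
import hashlib
import hmac

KEY = b"demo-key"  # hypothetical key, for illustration only


def protect(token: str) -> str:
    """Deterministically map a token to an opaque hex string (keyed hash)."""
    return hmac.new(KEY, token.encode("utf-8"), hashlib.sha256).hexdigest()


def prefix_search(index, prefix):
    """Return every indexed term that starts with the given prefix."""
    return [term for term in index if term.startswith(prefix)]


terms = ["run", "runner", "running", "runs"]
plain_index = sorted(terms)
protected_index = sorted(protect(t) for t in terms)

# On plaintext terms, run* finds all four variants.
plain_hits = prefix_search(plain_index, "run")

# On protected terms, the protected form of "run" is only a prefix of itself.
protected_hits = prefix_search(protected_index, protect("run"))

# Exact-match lookup still works, because the mapping is deterministic.
exact_hit = protect("runner") in protected_index
```

Note that the determinism that keeps exact match working is also what leaks term frequencies to anyone who can read the index.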

But what about stored data you ask? Yes, the
stored fields are compressed but stored verbatim,
so I've seen arguments for encrypting _that_ stream,
but that's really a "feel good" fig-leaf. If I get access to the
index and it has position information, I can reconstruct
documents without the stored data as Luke does. The
process is a bit lossy, but the reconstructed document
has enough fidelity that it'll give people seriously
concerned about encryption conniption fits.
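
As a rough illustration of that reconstruction, the following Python sketch builds a toy in-memory positional index and then rebuilds a document from it without any stored field. The data structure and function names are hypothetical, not Lucene's on-disk format or Luke's code, and lowercasing stands in for whatever analysis chain is in use.

```python
from collections import defaultdict


def build_positional_index(doc_id, text, index):
    """Record, for each token, the positions at which it occurs in the document."""
    for pos, token in enumerate(text.lower().split()):
        index[token][doc_id].append(pos)


def reconstruct(doc_id, index):
    """Rebuild a document from term positions alone, with no stored text."""
    slots = {}
    for term, postings in index.items():
        for pos in postings.get(doc_id, []):
            slots[pos] = term
    return " ".join(slots[p] for p in sorted(slots))


# term -> doc_id -> list of positions
index = defaultdict(lambda: defaultdict(list))
build_positional_index(1, "The quick brown fox", index)

rebuilt = reconstruct(1, index)
```

The result is lossy in the way described above: here the original casing is gone, and in a real index stop words or stemming would remove more, but the word order and content survive.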

So all in all I have to back up Shawn's comments: You're
better off isolating your Solr/Lucene system, putting
authorization to view _documents_ at that level, and possibly
using an encrypted filesystem.

FWIW,
Erick


Re: Encrypted index?

Walter Underwood
Alternatively, do not store values in the Solr fields. Return a key and fetch encrypted data from a database or other repository.
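
A minimal Python sketch of that split, under stated assumptions: the dictionaries below are hypothetical stand-ins for the search index and for the external repository, and the ciphertext bytes are placeholders produced elsewhere. The index answers the query with document keys only; the caller then fetches the encrypted values by key.

```python
# Search index: term -> set of document keys. No field values are stored here.
search_index = {
    "invoice": {"doc-1", "doc-2"},
    "overdue": {"doc-2"},
}

# Separate repository: document key -> opaque ciphertext (placeholder bytes).
blob_store = {
    "doc-1": b"\x8f\x02\x1c",
    "doc-2": b"\x41\xd9\x07",
}


def search(term):
    """Return the keys of matching documents, never their content."""
    return sorted(search_index.get(term.lower(), set()))


def fetch_encrypted(keys):
    """Fetch the encrypted values for the given keys from the repository."""
    return {key: blob_store[key] for key in keys}


keys = search("overdue")
blobs = fetch_encrypted(keys)
```

The indexed terms themselves are still in the clear in this arrangement; only the stored values move out of the index.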

wunder
Walter Underwood
[hidden email]
http://observer.wunderwood.org/  (my blog)



Re: Encrypted index?

Adam Retter
In reply to this post by Shawn Heisey-2
> I think you could probably add transparent encryption/decryption at the
> Lucene level in a custom codec.  That probably has implications for
> being able to read the older index when it's time to upgrade Lucene,
> with a complete reindex being the likely solution.  Others will need to
> confirm ... I'm not very familiar with Lucene code, I'm here for Solr.

Thanks, that sounds interesting, and an avenue I will investigate further...
 

> Any verification of user identity/permission is probably best done in
> your own code, before it makes the Lucene query, and wouldn't
> necessarily be related to the encryption.

Okay, but somehow my codec is going to need to know the key to use to encrypt and decrypt the data. Only the user has that key, so I imagine they will need to pass it in somehow.
 

> Requirements like this are usually driven by paranoid customers or
> product managers.  I think that when you really start to examine what an
> attacker has to do to actually reach the unencrypted information (Lucene
> index in this case), they already have acquired so much access that the
> system is completely breached and it won't matter what kind of
> encryption is added.
>
> I find many of these requirements to be silly, and put an incredible
> burden on admin and developer resources with little or no benefit.

You're preaching to the converted ;-) I already tried pointing out the futility of this approach and that it brings little, if anything, to the security of the system. I also suggested just using an encrypted filesystem. Unfortunately, as you have most likely experienced, customers' requirements, whether wrong or right, often have to be fulfilled if you want to get paid by them.



Re: Encrypted index?

Adam Retter
In reply to this post by Erick Erickson

> The easiest way to do this is put the index over
> an encrypted file system. Encrypting the actual
> _tokens_ has a few problems, not the least of
> which is that any encryption algorithm worth
> its salt is going to make most searching totally
> impossible.

I already suggested an encrypted filesystem to the customer but unfortunately that was rejected.
 

> Consider run, runner, running and runs with
> simple wildcards. Searching for run* requires that all 4
> variants have 'run' as a prefix, and any decent
> encryption algorithm will not do that. Any
> encryption that _does_ make that search possible
> is trivially broken. I usually stop my thinking there,
> but ngrams, casing, WordDelimiterFilterFactory
> all come immediately to mind as "interesting".

I was rather hoping that I could do the encryption and subsequent decryption at a level below the search itself, so that when the query examines the data it sees the decrypted values and things like prefix scans would still work. Earlier in this thread, Shawn suggested writing a custom codec; I wonder if that would enable querying?
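
The idea of encrypting below the search layer can be sketched in Python as follows. Everything here is an assumption for illustration: `EncryptedStore` is a hypothetical storage wrapper, not a Lucene codec or directory, and `keystream_xor` is a toy cipher used only because the standard library has no AES; it is NOT secure. The point is only that the search code sees decrypted bytes, so a prefix scan behaves as usual while the bytes at rest are not plaintext.

```python
import hashlib
import json


def keystream_xor(key: bytes, data: bytes) -> bytes:
    """Toy cipher: XOR with a SHA-256 counter keystream. NOT secure; a stand-in for a real cipher."""
    out = bytearray()
    for offset in range(0, len(data), 32):
        pad = hashlib.sha256(key + offset.to_bytes(8, "big")).digest()
        chunk = data[offset:offset + 32]
        out.extend(b ^ p for b, p in zip(chunk, pad))
    return bytes(out)


class EncryptedStore:
    """Hypothetical storage layer: bytes are encrypted on write and decrypted on read."""

    def __init__(self, key: bytes):
        self._key = key
        self.files = {}  # name -> ciphertext, i.e. what would sit on disk

    def write(self, name: str, data: bytes) -> None:
        self.files[name] = keystream_xor(self._key, data)

    def read(self, name: str) -> bytes:
        return keystream_xor(self._key, self.files[name])


store = EncryptedStore(b"user-supplied-key")
terms = ["run", "runner", "running", "runs", "walk"]
plaintext = json.dumps(sorted(terms)).encode("utf-8")
store.write("terms", plaintext)

# The search layer only ever sees decrypted bytes, so prefix scans still work.
loaded = json.loads(store.read("terms").decode("utf-8"))
hits = [t for t in loaded if t.startswith("run")]
```

This also shows where the user's key has to enter: the storage layer cannot read anything until the key is supplied, so it must be passed in before any query runs.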
 
> But what about stored data you ask? Yes, the
> stored fields are compressed but stored verbatim,
> so I've seen arguments for encrypting _that_ stream,
> but that's really a "feel good" fig-leaf. If I get access to the
> index and it has position information, I can reconstruct
> documents without the stored data as Luke does. The
> process is a bit lossy, but the reconstructed document
> has enough fidelity that it'll give people seriously
> concerned about encryption conniption fits.

Exactly!
 


Re: Encrypted index?

Adam Retter
In reply to this post by Walter Underwood
Thanks Walter, that would be a neat solution if we just wanted to store values, but we also want full-text query capabilities.


Re: Encrypted index?

Jack Krupansky-3
In reply to this post by Adam Retter
Here's an old Lucene issue/patch for an AES encrypted Lucene directory class that might give you some ideas:
https://issues.apache.org/jira/browse/LUCENE-2228

No idea what happened to it.

An even older issue attempting to add encryption for specific fields:
https://issues.apache.org/jira/browse/LUCENE-737

-- Jack Krupansky

On Tue, Sep 8, 2015 at 11:07 AM, Adam Retter <[hidden email]> wrote:

The easiest way to do this is put the index over
an encrypted file system. Encrypting the actual
_tokens_ has a few problems, not the least of
which is that any encryption algorithm worth
its salt is going to make most searching totally
impossible.

I already suggested an encrypted filesystem to the customer but unfortunately that was rejected.
 

Consider run, runner, running and runs with
simple wildcards. Searching for run* requires that all 4
variants have 'run' as a prefix, and any decent
encryption algorithm will not do that. Any
encryption that _does_ make that search possible
is trivially broken. I usually stop my thinking there,
but ngrams, casing, WordDelimiterFilterFactory
all come immediately to mind as "interesting".

I was rather hoping that I could do the encryption and subsequent decryption at a level below the search itself, so that when the query examines the data it sees the decrypted values so that things like prefix scans etc would indeed still work. Previously in this thread, Shawn suggested writing a custom codec, I wonder if that would enable querying?
 
But what about stored data you ask? Yes, the
stored fields are compressed but stored verbatim,
so I've seen arguments for encrypting _that_ stream,
but that's really a "feel good" fig-leaf. If I get access to the
index and it has position information, I can reconstruct
documents without the stored data as Luke does. The
process is a bit lossy, but the reconstructed document
has enough fidelity that it'll give people seriously
concerned about encryption conniption fits.

Exactly!
 

So all in all I have to back up Shawn's comments: You're
better off isolating your Solr/Lucene system, putting
authorization to view _documents_ at that level, and possibly
using an encrypted filesystem.

FWIW,
Erick

On Sat, Sep 5, 2015 at 7:27 AM, Shawn Heisey <[hidden email]> wrote:
> On 9/5/2015 5:06 AM, Adam Retter wrote:
>> I wondered if there is any facility already existing in Lucene for
>> encrypting the values stored into the index and still being able to
>> search them?
>>
>> If not, I wondered if anyone could tell me if this is impossible to
>> implement, and if not to point me perhaps in the right direction?
>>
>> I imagine that just the text values and document fields to index (and
>> optionally store) in the index would be either encrypted on the fly by
>> Lucene using perhaps a public/private key mechanism. When a user issues
>> a search query to Lucene they would also provide a key so that Lucene
>> can decrypt the values as necessary to try and answer their query.
>
> I think you could probably add transparent encryption/decryption at the
> Lucene level in a custom codec.  That probably has implications for
> being able to read the older index when it's time to upgrade Lucene,
> with a complete reindex being the likely solution.  Others will need to
> confirm ... I'm not very familiar with Lucene code, I'm here for Solr.
>
> Any verification of user identity/permission is probably best done in
> your own code, before it makes the Lucene query, and wouldn't
> necessarily be related to the encryption.
>
> Requirements like this are usually driven by paranoid customers or
> product managers.  I think that when you really start to examine what an
> attacker has to do to actually reach the unencrypted information (Lucene
> index in this case), they already have acquired so much access that the
> system is completely breached and it won't matter what kind of
> encryption is added.
>
> I find many of these requirements to be silly, and put an incredible
> burden on admin and developer resources with little or no benefit.
> Here's an example of similar customer encryption requirement which I
> encountered recently:
>
> We have a web application that has three "hops" involved.  A user talks
> to a load balancer, which talks to Apache, where the connection is then
> proxied to a Tomcat server with the AJP protocol.  The customer wanted
> all three hops encrypted.  The first hop was already encrypted, the
> second was easy, but the third proved to be very difficult.  Finally we
> decided that we did not need load balancing on that last hop, and it
> could simply talk to localhost, eliminating the need to encrypt it.
>
> The customer was worried about an attacker sniffing the traffic on the
> LAN and seeing details like passwords.  I consider this to be an insane
> requirement.  In order to sniff that traffic, the attacker would need
> one of two things:  Root access on a server, or physical access to the
> infrastructure.  Physical access can be escalated to root access if you
> know what you're doing.  Once someone has either of those things,
> encrypted traffic won't matter, they will be able to learn anything they
> need or do any damage they desire, without even needing to sniff the
> traffic.
>
> Thanks,
> Shawn
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]




--
Adam Retter

skype: adam.retter
tweet: adamretter
http://www.adamretter.org.uk


Re: Encrypted index?

Erick Erickson
Adam:

Yeah, I've seen client requirements that cause me to scratch
my head. I suppose, though, some argument can be made
that having a separate encrypting key for the index itself that's
completely separate from any more widely-known encryption
key for a disk is a valid argument. You could even have different
encryption keys for, say, each user's index or something.
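
For what it's worth, a key per index does not have to mean storing a separate secret for every index. A minimal sketch, assuming a single master key that is kept away from the index host and a hypothetical helper derive_index_key (neither is part of Lucene): each index key is derived from the master key and the index name with HMAC-SHA256.

```python
import hashlib
import hmac


def derive_index_key(master_key: bytes, index_name: str) -> bytes:
    """Derive a 32-byte key for one index from a master key (hypothetical helper)."""
    # The fixed label keeps these keys separate from other uses of the master key.
    message = b"lucene-index:" + index_name.encode("utf-8")
    return hmac.new(master_key, message, hashlib.sha256).digest()


# Illustrative values only; a real master key would come from a key store.
master = b"master-key-kept-outside-the-index-host"
alice_key = derive_index_key(master, "alice")
bob_key = derive_index_key(master, "bob")
```

The derivation is repeatable, so nothing per-user has to be stored, and knowing one user's index key does not reveal another's or the master key.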

bq: I was rather hoping that I could do the encryption and subsequent
decryption at a level below the search itself

Aside from the different encryption key per index (or whatever), why
does the client think this is any more secure than an encrypted disk?

Just askin'....

Erick

On Tue, Sep 8, 2015 at 8:21 AM, Jack Krupansky <[hidden email]> wrote:

> Here's an old Lucene issue/patch for an AES encrypted Lucene directory class
> that might give you some ideas:
> https://issues.apache.org/jira/browse/LUCENE-2228
>
> No idea what happened to it.
>
> An even older issue attempting to add encryption for specific fields:
> https://issues.apache.org/jira/browse/LUCENE-737
>
> -- Jack Krupansky
>
> On Tue, Sep 8, 2015 at 11:07 AM, Adam Retter <[hidden email]>
> wrote:
>>
>>
>>> The easiest way to do this is put the index over
>>> an encrypted file system. Encrypting the actual
>>> _tokens_ has a few problems, not the least of
>>> which is that any encryption algorithm worth
>>> its salt is going to make most searching totally
>>> impossible.
>>
>>
>> I already suggested an encrypted filesystem to the customer but
>> unfortunately that was rejected.
>>
>>
>>> Consider run, runner, running and runs with
>>> simple wildcards. Searching for run* requires that all 4
>>> variants have 'run' as a prefix, and any decent
>>> encryption algorithm will not do that. Any
>>> encryption that _does_ make that search possible
>>> is trivially broken. I usually stop my thinking there,
>>> but ngrams, casing, WordDelimiterFilterFactory
>>> all come immediately to mind as "interesting".
>>
>>
>> I was rather hoping that I could do the encryption and subsequent
>> decryption at a level below the search itself, so that when the query
>> examines the data it sees the decrypted values so that things like prefix
>> scans etc would indeed still work. Previously in this thread, Shawn
>> suggested writing a custom codec, I wonder if that would enable querying?
>>
>>>
>>> But what about stored data you ask? Yes, the
>>> stored fields are compressed but stored verbatim,
>>> so I've seen arguments for encrypting _that_ stream,
>>> but that's really a "feel good" fig-leaf. If I get access to the
>>> index and it has position information, I can reconstruct
>>> documents without the stored data as Luke does. The
>>> process is a bit lossy, but the reconstructed document
>>> has enough fidelity that it'll give people seriously
>>> concerned about encryption conniption fits.
>>
>>
>> Exactly!
>>
>>>
>>>
>>> So all in all I have to back up Shawn's comments: You're
>>> better off isolating your Solr/Lucene system, putting
>>> authorization to view _documents_ at that level, and possibly
>>> using an encrypted filesystem.
>>>
>>> FWIW,
>>> Erick
>>>
>>> On Sat, Sep 5, 2015 at 7:27 AM, Shawn Heisey <[hidden email]> wrote:
>>> > On 9/5/2015 5:06 AM, Adam Retter wrote:
>>> >> I wondered if there is any facility already existing in Lucene for
>>> >> encrypting the values stored into the index and still being able to
>>> >> search them?
>>> >>
>>> >> If not, I wondered if anyone could tell me if this is impossible to
>>> >> implement, and if not to point me perhaps in the right direction?
>>> >>
>>> >> I imagine that just the text values and document fields to index (and
>>> >> optionally store) in the index would be either encrypted on the fly by
>>> >> Lucene using perhaps a public/private key mechanism. When a user
>>> >> issues
>>> >> a search query to Lucene they would also provide a key so that Lucene
>>> >> can decrypt the values as necessary to try and answer their query.
>>> >
>>> > I think you could probably add transparent encryption/decryption at the
>>> > Lucene level in a custom codec.  That probably has implications for
>>> > being able to read the older index when it's time to upgrade Lucene,
>>> > with a complete reindex being the likely solution.  Others will need to
>>> > confirm ... I'm not very familiar with Lucene code, I'm here for Solr.
>>> >
>>> > Any verification of user identity/permission is probably best done in
>>> > your own code, before it makes the Lucene query, and wouldn't
>>> > necessarily be related to the encryption.
>>> >
>>> > Requirements like this are usually driven by paranoid customers or
>>> > product managers.  I think that when you really start to examine what
>>> > an
>>> > attacker has to do to actually reach the unencrypted information
>>> > (Lucene
>>> > index in this case), they already have acquired so much access that the
>>> > system is completely breached and it won't matter what kind of
>>> > encryption is added.
>>> >
>>> > I find many of these requirements to be silly, and put an incredible
>>> > burden on admin and developer resources with little or no benefit.
>>> > Here's an example of similar customer encryption requirement which I
>>> > encountered recently:
>>> >
>>> > We have a web application that has three "hops" involved.  A user talks
>>> > to a load balancer, which talks to Apache, where the connection is then
>>> > proxied to a Tomcat server with the AJP protocol.  The customer wanted
>>> > all three hops encrypted.  The first hop was already encrypted, the
>>> > second was easy, but the third proved to be very difficult.  Finally we
>>> > decided that we did not need load balancing on that last hop, and it
>>> > could simply talk to localhost, eliminating the need to encrypt it.
>>> >
>>> > The customer was worried about an attacker sniffing the traffic on the
>>> > LAN and seeing details like passwords.  I consider this to be an insane
>>> > requirement.  In order to sniff that traffic, the attacker would need
>>> > one of two things:  Root access on a server, or physical access to the
>>> > infrastructure.  Physical access can be escalated to root access if you
>>> > know what you're doing.  Once someone has either of those things,
>>> > encrypted traffic won't matter, they will be able to learn anything
>>> > they
>>> > need or do any damage they desire, without even needing to sniff the
>>> > traffic.
>>> >
>>> > Thanks,
>>> > Shawn
>>> >
>>> >
>>> > ---------------------------------------------------------------------
>>> > To unsubscribe, e-mail: [hidden email]
>>> > For additional commands, e-mail: [hidden email]
>>> >
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: [hidden email]
>>> For additional commands, e-mail: [hidden email]
>>>
>>
>>
>>
>> --
>> Adam Retter
>>
>> skype: adam.retter
>> tweet: adamretter
>> http://www.adamretter.org.uk
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: Encrypted index?

Shai Erera

The problem with encrypted file systems is that if someone gets access to the file system (not the disk, the file system, e.g. via ssh), it is wide open to them. It's like my work laptop: the disk is encrypted, but after I've entered my password, all files are readable to me. However, files that are password protected aren't, and that's what security experts want: that even if an attacker stole the machine and has all the passwords and all the time in the world, without the public/private key of the encrypted index, he won't be able to read it. I'm not justifying it, just repeating what I was told. I do think it's silly, though: if someone managed to get hold of the machine, the login password, root access... what are the chances he doesn't already have the other keys?

Anyway, we're here to solve the technical problem, and we obviously aren't the ones making these decisions, and it's futile attempting to argue with security folks, so let's address the question of how to achieve encryption.

I wouldn't go with a Codec, personally, to achieve encryption. It's overcomplicated IMO. An encrypted Directory is a simpler solution. You will need to implement an EncryptingIndexOutput and a matching DecryptingIndexInput, but that's more or less it. The encryption/decryption happens in buffers, so you will want to extend the respective BufferedIO classes. The issues mentioned above should give you a head start: the patches are old and likely don't compile against new versions, but they contain the gist of it.
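
To illustrate the buffer-level idea without tying it to any Lucene version, here is a minimal sketch in Python rather than Java. Everything in it is an assumption made for illustration: the helper names keystream_block and xor_at are hypothetical, and the keystream is a toy built from HMAC-SHA256 in counter mode, standing in for a real cipher such as AES in CTR mode. The point it shows is that a counter-based keystream lets a reader decrypt any buffer at any file offset, which is what a seeking index reader needs.

```python
import hashlib
import hmac

BLOCK = 32  # bytes per keystream block (SHA-256 digest size)


def keystream_block(key: bytes, file_name: str, counter: int) -> bytes:
    """Toy keystream block for one file; a stand-in for AES-CTR, not for production."""
    message = file_name.encode("utf-8") + b"\x00" + counter.to_bytes(8, "big")
    return hmac.new(key, message, hashlib.sha256).digest()


def xor_at(key: bytes, file_name: str, offset: int, data: bytes) -> bytes:
    """Encrypt or decrypt `data` as if it sat at byte `offset` of the file."""
    out = bytearray()
    pos = offset
    while len(out) < len(data):
        block = keystream_block(key, file_name, pos // BLOCK)
        start = pos % BLOCK
        # Take at most the bytes left in the current keystream block.
        chunk = data[len(out):len(out) + BLOCK - start]
        out.extend(b ^ k for b, k in zip(chunk, block[start:]))
        pos += len(chunk)
    return bytes(out)


key = b"demo-key-not-for-production-use!"
plain = b"Lucene reads index files in buffers and seeks freely."

# "EncryptingIndexOutput": the whole file is written through the keystream.
cipher = xor_at(key, "_0.cfs", 0, plain)

# "DecryptingIndexInput": seek to offset 7 and decrypt just five bytes.
middle = xor_at(key, "_0.cfs", 7, cipher[7:12])
```

Because XOR with the same keystream is its own inverse, one function serves both directions. A real implementation would also need a unique nonce per file so that a keystream is never reused with different content, plus some form of integrity check.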

Just make sure your application, or actually the process running Lucene, receives the public/private key in a non-obvious way, so that if someone does get hold of the machine, he can't obtain that information!

Also, as for encrypting the terms themselves, beyond the problems mentioned above about wildcard queries, there is the risk of someone guessing the terms based on their statistics. If the attacker knows the corpus domain, I assume it shouldn't be hard for him to guess that a certain word with a high DF and TF is probably "the" and proceed from there.
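
A small sketch of both problems, under loud assumptions: encrypt_term is a hypothetical stand-in for any deterministic per-term encryption (here a truncated HMAC-SHA256, which is not reversible, but it has the one property that matters for this argument: the same term always yields the same token), and the three documents are made up.

```python
import hashlib
import hmac
from collections import Counter


def encrypt_term(key: bytes, term: str) -> str:
    """Deterministic stand-in for per-term encryption: same term, same token."""
    return hmac.new(key, term.encode("utf-8"), hashlib.sha256).hexdigest()[:16]


key = b"secret"
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "a dog and the bird",
]

# The attacker sees only tokens, but document frequency survives encryption.
doc_freq = Counter()
for doc in docs:
    doc_freq.update({encrypt_term(key, term) for term in doc.split()})

most_common_token, df = doc_freq.most_common(1)[0]
guess = {most_common_token: "the"}  # an English corpus makes this a safe bet

# Prefix structure, on the other hand, does not survive, so run* cannot work.
shares_prefix = encrypt_term(key, "runner").startswith(encrypt_term(key, "run"))
```

In this toy corpus the token with the highest document frequency is the one for "the", found without ever touching the key, while the tokens for "run" and "runner" have nothing in common for a prefix query to use.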

Again, I'm no security expert and I've learned it's sometimes futile trying to argue with them. If you can convince them, though, that the system as a whole is protected enough, and that if it is breached an encrypted index is likely breached too, you can avoid the complexity. From my experience, encryption hurts performance, but you can improve that by e.g. buffering parts unencrypted, but then you also need to prove your program's memory is protected...

Hope this helps.

Shai



Re: Encrypted index?

Adam Retter
In reply to this post by Jack Krupansky-3
Thanks very much Jack, I will take a look into those.


--
Adam Retter

skype: adam.retter
tweet: adamretter
http://www.adamretter.org.uk

Re: Encrypted index?

Adam Retter
In reply to this post by Erick Erickson
> bq: I was rather hoping that I could do the encryption and subsequent
> decryption at a level below the search itself


I am not sure what "bq" stands for?


> Aside from the different encryption key per index (or whatever), why
> does the client think this is any more secure than an encrypted disk?
>
> Just askin'....

Well, I never said that the client was reasonable or even wanted to explain their thought process in any logical manner ;-) The client wants it because they think they need it, and they think they need it quite likely because they don't understand what it means. When you try to explain why they don't need it, or suggest possibly better solutions, they are not interested, because... they *know* they need it!


--
Adam Retter

skype: adam.retter
tweet: adamretter
http://www.adamretter.org.uk

Re: Encrypted index?

Adam Retter
In reply to this post by Shai Erera

> The problem with encrypted file systems is that if someone gets access to the file system (not the disk, the file system, e.g. via ssh), it is wide open to them. It's like my work laptop: the disk is encrypted, but after I've entered my password, all files are readable to me. However, files that are password protected aren't, and that's what security experts want: that even if an attacker stole the machine and has all the passwords and all the time in the world, without the public/private key of the encrypted index, he won't be able to read it. I'm not justifying it, just repeating what I was told. I do think it's silly, though: if someone managed to get hold of the machine, the login password, root access... what are the chances he doesn't already have the other keys?


I was rather assuming an encrypted filesystem (a partition if you like) that is only available to a specific system user which our application runs under. This filesystem would only hold the Lucene indexes, it would not be a general purpose system boot filesystem as you are describing.
 

> Anyway, we're here to solve the technical problem, and we obviously aren't the ones making these decisions, and it's futile attempting to argue with security folks, so let's address the question of how to achieve encryption.


I'm not a security folk, some of the responders might be. I am just trying to deliver a requirement, and have been told by the client that the suggested encrypted filesystem etc is not good enough.
 

> I wouldn't go with a Codec, personally, to achieve encryption. It's overcomplicated IMO. An encrypted Directory is a simpler solution. You will need to implement an EncryptingIndexOutput and a matching DecryptingIndexInput, but that's more or less it. The encryption/decryption happens in buffers, so you will want to extend the respective BufferedIO classes. The issues mentioned above should give you a head start: the patches are old and likely don't compile against new versions, but they contain the gist of it.


Thanks, I will take a look. At the moment I am predominantly just trying to understand if it is even possible. It is unlikely the client will sign off any real development work on this until the New Year; if they do sign off, expect some more questions to the list from me :-p
 

> Just make sure your application, or actually the process running Lucene, receives the public/private key in a non-obvious way, so that if someone does get hold of the machine, he can't obtain that information!

OK, of course I will try to protect my app and the paths to and from it. However, I assume that if someone gets root access to the server, they can just dump the server's RAM to a disk file and have access to all the keys that happen to be in RAM anyway, and that I can't really protect against that.

> Also, as for encrypting the terms themselves, beyond the problems mentioned above about wildcard queries, there is the risk of someone guessing the terms based on their statistics. If the attacker knows the corpus domain, I assume it shouldn't be hard for him to guess that a certain word with a high DF and TF is probably "the" and proceed from there.


Given that my client doesn't seem to understand that this is probably not a good idea, I think the fact that someone might use statistical analysis to guess and potentially decrypt the index will be of little worry to them (even if I explain it).
 

> Again, I'm no security expert and I've learned it's sometimes futile trying to argue with them. If you can convince them, though, that the system as a whole is protected enough, and that if it is breached an encrypted index is likely breached too, you can avoid the complexity. From my experience, encryption hurts performance, but you can improve that by e.g. buffering parts unencrypted, but then you also need to prove your program's memory is protected...

Mainly understood, but can you elaborate on "prove your program's memory is protected"?


Thanks

--
Adam Retter

skype: adam.retter
tweet: adamretter
http://www.adamretter.org.uk

Re: Encrypted index?

Shai Erera
> I'm not a security folk, some of the responders might be. I am just trying to deliver a requirement, and have been told by the client that the suggested encrypted filesystem etc is not good enough.

By 'we' I meant both you and the rest of us. I consider you on our side, the Lucene/Solr folks, and not the annoying side, the security folks :).

> At the moment I am predominantly just trying to understand if it is even possible

Sure, understood. It is possible, at least to the extent that I've tested this in the past. The AESDirectory on one of those issues does what you (or your client) want.

> However, I assume that if someone gets root access to the server, they can just dump the server's RAM to a disk file and have access to all the keys that happen to be in RAM anyway and that I can't really protect against that.

This is what I meant (it relates to your last question too) by protecting the program's RAM. If the attacker can dump the process's RAM and derive the encryption keys (or the unencrypted cached index content), you're back to square one. This is why I believe most of us think that encrypting the index is not THE solution for protecting the data; protecting the system itself is. Once someone has already broken in, the assumption should be that there's very little (if anything) you can do to prevent data theft.

> I think the fact that someone might use statistical analysis to guess and potentially decrypt the index will be of little worry to them (even if I explain it).

Just in case I wasn't clear, let me clarify this. Using an EncryptingDirectory *does not* allow one to use statistical analysis to guess the index's content. It is a low-level solution, one level above encrypted file system. To an attacker who doesn't have the encryption keys the index will look like a series of garbage bytes. Even if you know where to look for the terms (i.e. which files), the bytes will not conform to the regular Lucene index file format.

The Codec works at the same level BTW, and would achieve the same results. I just believe a Codec is overkill. However, if all you want to do is encrypt some parts of the index, e.g. only the terms, you could explore the Codec approach. But as I wrote before, I don't believe it's a good solution: it's better to encrypt everything into one giant blob, to avoid decryption by statistical analysis.

Shai

On Wed, Sep 9, 2015 at 2:28 AM, Adam Retter <[hidden email]> wrote:

The problem with encrypted file systems is that if someone gets access to the file system (not the disk, the file system e.g via ssh), it is wide open to it. It's like my work laptop's disk is encrypted, but after I've entered my password, all files are readable to me. However, files that are password protected, aren't, and that's what security experts want - that even if an attacker stole the machine and has all the passwords and the time in the world, without the public/private key of the encrypted index, he won't be able to read it. I'm not justifying it, just repeating what I was told. Even though I think it's silly - if someone managed to get a hold of the machine, the login password, root access... what are the chance he doesn't already have the other keys?


I was rather assuming an encrypted filesystem (a partition if you like) that is only available to a specific system user which our application runs under. This filesystem would only hold the Lucene indexes, it would not be a general purpose system boot filesystem as you are describing.
 

Anyway, we're here to solve the technical problem, and we obviously aren't the ones making these decisions, and it's futile attempting to argue with security folks, so let's address the question of how to achieve encryption.


I'm not a security folk, some of the responders might be. I am just trying to deliver a requirement, and have been told by the client that the suggested encrypted filesystem etc is not good enough.
 

I wouldn't go with a Codec, personally, to achieve encryption. It's overcomplicated IMO. Rather, an encrypted Directory is a simpler solution. You will need to implement an EncryptingIndexOutput and a matching DecryptingIndexInput, but that's more or less it. The encryption/decryption happens in buffers, so you will want to extend the respective BufferedIO classes. The issues mentioned above should give you a head start: the patches are old and likely don't compile against new versions, but they contain the gist of it.
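A minimal sketch of the buffer-level idea, written in plain Python rather than against Lucene's API. Everything named here (`keystream_block`, `xor_at`, `BLOCK_SIZE`, the file name) is hypothetical, and the HMAC-SHA256 counter-mode keystream is only a stand-in for a vetted cipher such as AES in CTR mode. The point it tries to show is why a counter-style mode suits an index reader: the keystream for any file offset can be computed directly, so a seek does not require decrypting everything before it.

```python
import hashlib
import hmac

BLOCK_SIZE = 32  # keystream bytes per counter value (SHA-256 digest size)


def keystream_block(key: bytes, file_name: bytes, counter: int) -> bytes:
    """Keystream for one block: PRF(key, file name || block counter).

    Stand-in for a real cipher; an actual design would use e.g. AES-CTR
    with a unique nonce per file, plus integrity protection.
    """
    msg = file_name + b"\x00" + counter.to_bytes(8, "big")
    return hmac.new(key, msg, hashlib.sha256).digest()


def xor_at(key: bytes, file_name: bytes, offset: int, data: bytes) -> bytes:
    """Encrypt or decrypt `data` located at absolute byte `offset` of a file.

    XOR with the keystream is its own inverse, so the same function serves
    the writing side and the reading side.
    """
    out = bytearray()
    pos = offset
    end = offset + len(data)
    while pos < end:
        counter, skip = divmod(pos, BLOCK_SIZE)
        ks = keystream_block(key, file_name, counter)[skip:]
        start = pos - offset
        chunk = data[start:start + len(ks)]
        out += bytes(a ^ b for a, b in zip(chunk, ks))
        pos += len(chunk)
    return bytes(out)


# Hypothetical usage: write a whole "file", then read back only a slice.
key = b"k" * 32
plain = b"posting list bytes " * 10
cipher = xor_at(key, b"example.bin", 0, plain)

# Random access: decrypt bytes 50..69 without touching the rest.
window = xor_at(key, b"example.bin", 50, cipher[50:70])
```

In a Directory wrapper the same offset arithmetic would sit behind the buffered read and write calls, so the rest of the index code never sees ciphertext.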


Thanks, I will take a look. At the moment I am predominantly just trying to understand if it is even possible; it is unlikely the client will sign off any real development work on this until the New Year. If they do sign off, expect some more questions to the list from me :-p
 

Just make sure your application, or actually the process running Lucene, receives the public/private key in a non-obvious way, so that if someone does get hold of the machine, he can't obtain that information!

OK, of course I will try to protect my app and the paths to and from it. However, I assume that if someone gets root access to the server, they can just dump the server's RAM to a disk file and have access to all the keys that happen to be in RAM anyway, and that I can't really protect against that.

Also, as for encrypting the terms themselves, beyond the problems mentioned above about wildcard queries, there is the risk of someone guessing the terms based on their statistics. If the attacker knows the corpus domain, I assume it shouldn't be hard for him to guess that a certain word with a high DF and TF is probably "the" and proceed from there.
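A small illustration of that risk, assuming each term is replaced by a deterministic token so that exact-match search still works. The helper names and the tiny corpus are made up, and HMAC stands in for whatever deterministic encryption would be used. Since equal terms always map to equal tokens, the document frequencies survive encryption unchanged, and an attacker who knows the language can start matching tokens to words by rank.

```python
import hashlib
import hmac
from collections import Counter


def encrypt_term(key: bytes, term: str) -> str:
    """Deterministic token for a term (stand-in for deterministic encryption)."""
    return hmac.new(key, term.encode("utf-8"), hashlib.sha256).hexdigest()[:12]


def document_frequencies(docs):
    """Count in how many documents each (possibly encrypted) term occurs."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    return df


# Made-up corpus; a real index would be far larger, which only helps the attacker.
docs = [
    "the cat sat on the mat".split(),
    "the dog ate the bone".split(),
    "a cat chased the dog".split(),
    "the end".split(),
]
key = b"secret key known only to the indexer"
encrypted_docs = [[encrypt_term(key, t) for t in doc] for doc in docs]

plain_df = document_frequencies(docs)
enc_df = document_frequencies(encrypted_docs)

# The attacker sees only enc_df and guesses that the most frequent token is "the".
top_token, top_count = enc_df.most_common(1)[0]
guess_is_right = top_token == encrypt_term(key, "the")
```

Encrypting whole files or buffers, instead of individual terms, hides these per-term counts because the attacker can no longer tell where one term ends and the next begins.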


Given that my client doesn't seem to understand that this is probably not a good idea, I think the fact that someone might use statistical analysis to guess and potentially decrypt the index will be of little worry to them (even if I explain it).
 

Again, I'm no security expert and I've learned it's sometimes futile trying to argue with them. If you can convince them, though, that the system as a whole is protected enough, and that if it is breached an encrypted index is likely already breached too, you can avoid the complexity. In my experience, encryption hurts performance. You can improve that by e.g. buffering parts unencrypted, but then you also need to prove your program's memory is protected...

Mostly understood, but can you elaborate on "prove your program's memory is protected"?


Thanks

--
Adam Retter

skype: adam.retter
tweet: adamretter
http://www.adamretter.org.uk