indexing documents (or pieces of a document) by access controls

natjohns
Hi all,

Can anyone give me some advice on breaking a document up and indexing it
by access control lists?  What we have are XML documents that are
transformed based on the user viewing them.  Some users might see all of
the document, while others may see a few fields, and yet others see
nothing at all.  The access control list may be a role the user belongs
to, a list of groups, or even a combination of the two.

I can transform the XML to the plain text that I want to index, key
it off of the ACLs, and then pass along the list of ACLs that the user
issuing a query belongs to when searching.  But I guess I'm not really
sure of the best way to do this.

Anyone have any thoughts?

Thanks!
Nate


RE: indexing documents (or pieces of a document) by access controls

Ard Schrijvers
Hello Nate,

IMHO, you will not be able to do this in Solr unless you accept pretty hard constraints on your ACLs (I will get back to this in a moment). IMO, it is not possible to index documents along with ACLs. ACLs can be very fine-grained, and the thing you describe, ACL-specific parts of a document... well, I wouldn't know how you would index this. (Imagine you change the ACL for a specific user: how do you know what to re-index and what not? Suppose you add a user? I really do not think it is possible with fine-grained ACLs.)

You should also realize you are trying to find an answer to an extremely complex problem: authorisation in an index. (I am trying to develop faceted navigation in combination with authorisation in a Lucene index in Jackrabbit, but I think this is not the place to discuss it.)

So, in your case, if you want to use Solr with some form of ACLs, I think you can basically only manage this if:

1) your ACLs are some sort of paths in a hierarchical structure, where you index the hierarchical structure along with the content. Then when querying you have to include the folders that the user is allowed to see

2) you keep a bitset for each user recording which documents are allowed (but you even have ACLs inside documents). Also, keeping bitsets up to date for many users is almost impossible, because
- Lucene document ids may change after merging segments
- updating documents might mean updating many, many bitsets if you have many users

For these reasons, I do not think you can achieve with solar what you want, unless you are going to work with something like: updating the index and ACL bitsets once a day.
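
Option 1 above (ACLs expressed as folders in a hierarchy) might be sketched roughly as follows. This is only an illustration under assumptions: the field name `folder`, the example paths, and both helper functions are hypothetical, and folder names are assumed to contain no double quotes. At index time each document would store every folder above it; at query time the user's allowed folders become an OR clause.

```python
from pathlib import PurePosixPath


def ancestor_folders(doc_path: str) -> list[str]:
    """Return every folder above doc_path, from the root down.

    These are the values that would be indexed in the (hypothetical)
    multi-valued `folder` field of the document.
    """
    parents = [str(p) for p in PurePosixPath(doc_path).parents]
    return list(reversed(parents))


def folder_filter(allowed_folders: list[str], field: str = "folder") -> str:
    """Build an OR clause matching documents under any allowed folder."""
    if not allowed_folders:
        # A user with no readable folder should not be sent to the index at all.
        raise ValueError("user has no readable folders")
    clauses = [f'{field}:"{folder}"' for folder in allowed_folders]
    return "(" + " OR ".join(clauses) + ")"


indexed_values = ancestor_folders("/content/hr/salaries/doc1.xml")
query_clause = folder_filter(["/content/public", "/content/hr"])
```

Because `/content/hr` is one of the indexed values for `doc1.xml`, a user allowed to see `/content/hr` would match it. Note that this only covers whole documents, not ACLs inside a document.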

Regards Ard



RE: indexing documents (or pieces of a document) by access controls

Ard Schrijvers
Excuse me, I meant Solr of course :-)

> For these reasons, I do not think you can achieve with solar

Re: indexing documents (or pieces of a document) by access controls

kkrugler
In reply to this post by natjohns

Given the requirement to break down a document into separately
controlled pieces, I'd create a servlet that "fronts" the Solr
servlet and handles this conversion. I could think of ways to do it
using Solr, but they feel like unnatural acts.

As a general comment on ACLs, one relatively easy way to handle this
is via group ids that you use to restrict the query. Each document
has a groupid field with the list of group ids that are authorized to access
it. Each user query is converted into a (query) AND (groupid:xx OR
groupid:yy), where xx/yy (and so on) are the groups that the user
belongs to.
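
A minimal sketch of that query rewriting, assuming group ids are plain tokens that need no escaping. The function name `restrict_to_groups` is hypothetical; the `groupid` field name is taken from the description above.

```python
def restrict_to_groups(user_query: str, group_ids: list[str],
                       field: str = "groupid") -> str:
    """Wrap the user's query so it only matches documents whose
    group-id field contains at least one of the user's groups."""
    if not group_ids:
        # No groups means nothing is visible; refuse rather than run
        # an unrestricted query by accident.
        raise ValueError("user belongs to no groups")
    group_clause = " OR ".join(f"{field}:{gid}" for gid in group_ids)
    return f"({user_query}) AND ({group_clause})"


restricted = restrict_to_groups("title:budget", ["xx", "yy"])
print(restricted)
```

The front-end servlet would apply this to every incoming query before passing it on, so the client never gets to choose the group clause itself.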

-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"If you can't find it, you can't fix it"

Re: indexing documents (or pieces of a document) by access controls

dma_bamboo
Hi

As for the fields: if their presence in the
responses depends on the user's group, you can handle it in many different ways
(using an XML transformation to remove the undesirable fields, implementing
your own RequestHandler able to process your group information, filtering
the data and showing only what should be shown to the user, ...)

Regards,
Daniel


On 12/6/07 16:14, "Ken Krugler" <[hidden email]> wrote:

> Given the requirement to break down a document into separately
> controlled pieces, I'd create a servlet that "fronts" the Solr
> servlet and handles this conversion. I could think of ways to do it
> using Solr, but they feel like unnatural acts.
>
> As a general comment on ACLs, one relatively easy way to handle this
> is via group ids that you use to restrict the query. Each document
> has a groupid with a list of group ids that are authorized to access
> it. Each user query is converted into a (query) AND (groupid:xx OR
> groupid:yy), where xx/yy (and so on) are the groups that the user
> belongs to.
>
> -- Ken



RE: indexing documents (or pieces of a document) by access controls

Ard Schrijvers
In reply to this post by natjohns
Hello,


> Given the requirement to break down a document into separately
> controlled pieces, I'd create a servlet that "fronts" the Solr
> servlet and handles this conversion. I could think of ways to do it
> using Solr, but they feel like unnatural acts.
>
> As a general comment on ACLs, one relatively easy way to handle this
> is via group ids that you use to restrict the query. Each document
> has a groupid with a list of group ids that are authorized to access
> it. Each user query is converted into a (query) AND (groupid:xx OR
> groupid:yy), where xx/yy (and so on) are the groups that the user
> belongs to.

With all due respect, I really think the problem is largely underestimated here, and is far more complex than these suggestions... unless we are talking about 100,000 documents, a couple of users, and updating once a day. If you want millions of documents, faceted authorized navigation including counting, a new document indexed every second that should be reflected in the results instantly, and changing authorisations... the problem isn't relatively easy to solve anymore :-)

Regards Ard


RE: indexing documents (or pieces of a document) by access controls

Ard Schrijvers
In reply to this post by natjohns
Hello,

> Hi
>
> And about the fields, if they are/aren't going to be present on the
> responses based on the user group, you can do it in many
> different ways
> (using XML transformation to remove the undesirable fields,
> implementing
> your own RequestHandler able to process your group
> information, filtering
> the data and showing only what should be shown to the user, ...)

So suppose you want to see 10 documents, but on average you are authorized to see 1 in 100 docs. Then on average, you need to fetch 1000 docs to find 10 results... 1000 XML transformations... that will be slow. And I left out the fact that you still do not know the number of pages the user is allowed to see, the counting if you want faceted navigation, etc.
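
To put a number on the cost of filtering after the search: at a 1-in-100 authorization rate, filling a page of 10 takes about 10 × 100 = 1000 fetches. The sketch below is a toy model with hypothetical names (`fetches_needed`, `post_filter_page`); the authorization check stands in for whatever per-document work (such as an XML transformation) the real system does.

```python
def fetches_needed(page_size: int, one_in: int) -> int:
    """Expected number of documents to fetch when only one document
    in `one_in` is authorized and `page_size` results are wanted."""
    return page_size * one_in


def post_filter_page(doc_ids, is_authorized, page_size: int):
    """Walk the ranked hits, keep the authorized ones, and stop once
    the page is full. Returns the page and how many docs were fetched."""
    page, fetched = [], 0
    for doc_id in doc_ids:
        fetched += 1
        if is_authorized(doc_id):
            page.append(doc_id)
            if len(page) == page_size:
                break
    return page, fetched


# Toy data: ids 1..100000, every 100th document is authorized.
page, fetched = post_filter_page(range(1, 100_001),
                                 lambda doc_id: doc_id % 100 == 0, 10)
print(len(page), fetched)
```

The loop also shows why total hit counts and facet counts are unknown with this approach: it stops as soon as one page is full and never looks at the remaining hits.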

Regards Ard


Re: indexing documents (or pieces of a document) by access controls

Frédéric Glorieux
In reply to this post by Ard Schrijvers


Hello,

> With all due respect, I really think the problem is largely
> underestimated here, and is far more complex than these
> suggestions... unless we are talking about 100,000 documents, a couple of
> users, and updating once a day. If you want millions of documents,
> faceted authorized navigation including counting, a new document
> indexed every second that should be reflected in the results instantly, and
> changing authorisations... the problem isn't relatively easy to solve
> anymore :-)

When I had that kind of problem (less complex) with Lucene, the only
idea was to filter from the front-end, according to the ACL policy.
Lucene docs and fields weren't protected, but tagged. Every search was
restricted on a field "audience", with hierarchical values like
"public, reserved, protected, secret", so that a "public" document also
has the "secret" value and is found by "audience:secret", according
to the rights of the user who searches. As for the fields, the ones not
allowed for a given user were stripped.
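
The "audience" tagging described above could look roughly like this. The level names and the `audience` field come from the description; the function names are hypothetical and the ordering (public lowest, secret highest) is an assumption read from the example.

```python
# Ordered from least to most restricted.
LEVELS = ["public", "reserved", "protected", "secret"]


def audience_tags(doc_level: str) -> list[str]:
    """Values to index in the `audience` field: the document's own
    level plus every higher clearance that may also read it."""
    return LEVELS[LEVELS.index(doc_level):]


def audience_clause(user_clearance: str) -> str:
    """Clause added to every search for a user with this clearance."""
    if user_clearance not in LEVELS:
        raise ValueError(f"unknown clearance: {user_clearance}")
    return f"audience:{user_clearance}"


def is_visible(doc_level: str, user_clearance: str) -> bool:
    """True if a document tagged for doc_level matches the user's clause."""
    return user_clearance in audience_tags(doc_level)


print(audience_tags("public"))    # all four levels
print(audience_clause("secret"))  # audience:secret
```

A "public" document carries all four values, so it matches `audience:secret` as well as `audience:public`, while a "secret" document carries only `secret` and is invisible to everyone else.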

Maybe you could have a look at the XML database eXist? Its search engine,
XQuery based, is not focused on the same goals as Lucene, but I can
promise you that no query will ever return results from documents
you are not allowed to read.


--
Frédéric Glorieux
École nationale des chartes
direction des nouvelles technologies et de l'informatique

RE: indexing documents (or pieces of a document) by access controls

Ard Schrijvers
In reply to this post by natjohns
Hello,


> When I had those kind of problems (less complex) with lucene,
> the only
> idea was to filter from the front-end, according to the ACL policy.
> Lucene docs and fields weren't protected, but tagged. Searching was
> always applied with a field "audience", with hierarchical values like
> "public, reserved, protected, secret", so that a "public"
> document has
> the "secret" value also, to be found with a
> "audience:secret", according
> to the rights of the user who searchs. For the fields, the
> not allowed
> ones for some users where striped.

Yes, I know this is a possibility... but we happen to want our authorisation to be facet based. I am attacking the problem by keeping data derived from Lucene in memory, all translated into byte/int values. The hardest part is keeping the derived data in sync with Lucene *and* with the different Jackrabbit users (some have changes in their session but have not yet saved their data).

Anyway, I can do faceted authorisation + counting in less than 20 ms for 1,000,000 documents (normal PC), so hopefully I can succeed. I must admit, OTOH, that I did not find some sort of ingenious algorithm, but merely depend on the speed of the processor: doubling the number of documents means doubling the response time and the memory needed (though 1,000,000 docs fit in 25 MB, so 40,000,000 in a GB... that is fine by me).
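
The general idea of counting facets over compact in-memory arrays can be illustrated as below. This is not the implementation described above, whose details are not given in this thread; it is a guess at the shape of such a scan, with hypothetical names and a simplifying assumption of one facet value and one ACL id per document.

```python
from array import array
from collections import Counter


def faceted_counts(facet_ids, acl_ids, allowed_acls: set) -> dict:
    """Count documents per facet value, considering only documents
    whose ACL id is in allowed_acls. One linear pass over all docs,
    so time grows in proportion to the number of documents."""
    counts = Counter()
    for facet, acl in zip(facet_ids, acl_ids):
        if acl in allowed_acls:
            counts[facet] += 1
    return dict(counts)


# Position i in both arrays describes document i.
facet_ids = array("H", [0, 1, 1, 2, 0, 1])  # 2 bytes per document
acl_ids = array("B", [1, 1, 2, 3, 2, 1])    # 1 byte per document

counts = faceted_counts(facet_ids, acl_ids, {1, 2})
print(counts)
```

A pure-Python loop like this would be far slower than the figures quoted above; it only shows the access pattern. For scale, 25 MB for 1,000,000 documents works out to about 25 bytes per document, which leaves room for several such small integer arrays.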

>
> May be you can have a look to the xmldb Exist ? The search engine,
> xquery based, is not focused on the same goals as lucene, but I can
> promise you that all queries will never return results from documents
> you are not allowed to read.

I did not look at it, but my feeling is that it is not fast enough.

Regards Ard
