searching only within allowed documents


Stephen Weiss-2
Hi,

I'm new to Solr (and Lucene) and I'm trying to work out just how I  
could fit this technology into my app (I'm moving over from using  
MySQL fulltext indexes).  Things are actually going really well - the  
facet functionality fits in just perfectly, and the basic full-text  
searching is working very well for me as well, especially considering  
that I'm trying to index several languages at once.  It's really much,  
much faster than MySQL.  Somehow, I thought that would be the hard  
part!  Unfortunately, I'm getting tripped up on something that seems  
far more complicated...

So, there are two kinds of searches you can do in this application.  
There's an "Advanced Search" and a basic "Text Search".  For the  
Advanced Search, users pick out one or more sets of documents which  
they are allowed to see, and some set of tags to filter by, and they  
get a list of documents.  This part is easy, I can do all of this with  
the functionality I picked up reading the docs and tutorials, and  
since my application is handling what sets of documents that my users  
can choose, Solr doesn't need to know anything about the permissions  
model.

The text search is where I'm running into trouble.  Right now, the  
application automatically filters the documents to search through with  
a join in MySQL.  In order to do this through Solr, I need to figure  
out a good way for Solr to know what sets of documents in which to  
search.

Here's what I have so far:

1)  Each document has a field folder_id, which contains one value,  
which is the ID of the folder to which the document belongs.  There  
are right now about 6000 different folders altogether.

2)  Each user is permitted to see documents from a particular subset  
of folders.  Some users can see only 100-200 folders, some users can  
see 4000-5000 folders (all depends on what they have subscribed to).

In the advanced search, in order to restrict the available documents,  
I use a filter query:  fq=folder_id:1 OR folder_id:2 etc...  In the  
advanced search, the user is only ever searching through a max of 80  
or 90 folders (and usually more like 1 or 2), so this seems quite  
workable.

However, in the plain text search, the user automatically searches  
through *all* of the folders to which they have subscribed.  This  
means, for (good!) users who have subscribed to a large (1000+) number  
of folders, the filter query would be quite long, and would exceed the  
default number of boolean parameters allowed.  Of course, I could just  
increase the limit, but the fact that a limit is there in the first  
place leads me to believe this is probably not the most scalable  
solution.
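
For illustration, here is a minimal Python sketch of the filter query described above and of why it runs into the clause limit. The helper names are hypothetical; the only Solr-specific facts assumed are the q and fq parameters and the default maxBooleanClauses value of 1024 from solrconfig.xml.

```python
from urllib.parse import urlencode

# Default value of Solr's maxBooleanClauses setting (solrconfig.xml).
DEFAULT_MAX_BOOLEAN_CLAUSES = 1024


def build_folder_filter(folder_ids):
    """Return an fq value such as 'folder_id:1 OR folder_id:2'."""
    return " OR ".join(f"folder_id:{fid}" for fid in folder_ids)


def build_query_string(user_query, folder_ids):
    """URL-encode q and fq into a query string for a Solr select request."""
    return urlencode({"q": user_query, "fq": build_folder_filter(folder_ids)})


def exceeds_clause_limit(folder_ids, limit=DEFAULT_MAX_BOOLEAN_CLAUSES):
    """Each folder contributes one boolean clause, so compare the count."""
    return len(folder_ids) > limit
```

With 80 or 90 folders the check passes comfortably; with 4000-5000 folders it fails unless the limit is raised.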

Now, I'm reading on this tutorial page for Lucene:
http://www.lucenetutorial.com/techniques/permission-filtering.html
that the best way to do this would involve some combination of
HitCollector & FieldCache.  From what the author is saying, this
sounds like exactly what I need.  Unfortunately, I am almost
completely Java-illiterate, and on top of that, I'm not really
finding any explanation of:

a) What exactly I would do with the HitCollector & FieldCache objects  
that would help me achieve this goal - even just at the level of  
Lucene, there's no real explanation in the tutorial
or
b) Where exactly these classes fit in to Solr (if they do at all)


So far I have already written my own (tiny, tiny) Tokenizer and  
TokenizerFactory for correctly parsing the tags that come in from the  
database, and that works great, so I'm thinking, if there's something  
I can sub-class or modify somewhere to get this working, even with my  
meager Java knowledge I could do it...  But I have no clue even where  
to start with this.  Do I need my own custom version of  
SolrIndexSearcher, or SolrRequestHandler... or some other class I  
haven't even gotten to yet?

If it helps, I am using version 1.2, and trying to integrate this with  
a LAMP-based application.  I already have hooks set up to allow PHP to  
index documents, query solr, and parse responses.  Since everything  
else is already working so well, and it's just a matter of getting  
permissions working, I would really, really like to stick with Solr.  
Has anyone done anything like this or can point me in the right  
direction?  I can figure out the mechanics of getting the list of  
allowed folder_ids to Solr, all I really need to know is what kind of  
modifications I would need to make, where, to get Solr to limit the  
search to a particular subset of documents without using a gigantic  
filter query.

Many thanks for any advice.  My apologies if this has been asked a
million times before; I am new to the list, but I did read and
search through the archives and didn't really find anything on this
subject.

Best regards,
Steve

Re: searching only within allowed documents

Yonik Seeley-2
On Mon, Jun 9, 2008 at 7:44 PM, Stephen Weiss <[hidden email]> wrote:
> However, in the plain text search, the user automatically searches through
> *all* of the folders to which they have subscribed.  This means, for (good!)
> users who have subscribed to a large (1000+) number of folders, the filter
> query would be quite long,

This is not a well-solved problem in Lucene & Solr in general.

> and would exceed the default number of boolean
> parameters allowed.

Solr allows you to specify filters in separate parameters that are
applied to the main query, but cached separately.

q=the user query&fq=folder:f13&fq=folder:f24
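
As a rough sketch of how a client might produce that request, the Python below URL-encodes one q and several fq parameters. The function name is hypothetical. One caveat worth stating: Solr intersects the results of separate fq parameters, so alternatives on a single-valued field (for example, documents in either of two folders) would normally go into one fq joined with OR rather than into two.

```python
from urllib.parse import urlencode


def build_params(user_query, filters):
    """Encode q plus one fq parameter per entry in filters."""
    # doseq=True turns the list into repeated fq=... parameters.
    return urlencode({"q": user_query, "fq": list(filters)}, doseq=True)


query_string = build_params("the user query", ["folder:f13", "folder:f24"])
```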

The other option is to have a user field and index the users that have
access to the specific document.  The downside to this is that the
document must be re-indexed to reflect permission changes (like a new
user that now has access to it).  This may or may not be feasible,
depending on how many users you have to support and how fast
permissions must change.

> Now, I'm reading on this tutorial page for Lucene:
>  http://www.lucenetutorial.com/techniques/permission-filtering.html that the
> best way to do this would involve some combination of HitCollector &
> FieldCache.  From what the author is saying, this sounds like exactly what I
> need.  Unfortunately, I am almost completely Java-illiterate, and on top of
> that, I'm  not really finding any explanation of:
>
> a) What exactly I would do with the HitCollector & FieldCache objects that
> would help me achieve this goal - even just at the level of Lucene, there's
> no real explanation in the tutorial
> or

I think he's saying that with the FieldCache, you can get the external
String id of each matching document and then through some other
external mechanism, determine if that document should be allowed.  So
that still leaves that application-specific part to be solved.
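
To make that idea concrete, here is a small Python simulation; it is not Lucene code. The list stands in for a FieldCache array (internal document number to external ID), the collector mirrors the shape of HitCollector's collect(doc, score) callback, and is_allowed is a hypothetical placeholder for whatever external permission check the application provides.

```python
# Stand-in for a FieldCache array: index = internal document number,
# value = external string ID stored in the index. Hypothetical data.
EXTERNAL_IDS = ["doc-a", "doc-b", "doc-c", "doc-d"]


def is_allowed(external_id, allowed_ids):
    """Application-specific permission check (here: set membership)."""
    return external_id in allowed_ids


class PermissionCollector:
    """Keeps only the hits whose external ID passes the permission check."""

    def __init__(self, external_ids, allowed_ids):
        self.external_ids = external_ids
        self.allowed_ids = allowed_ids
        self.hits = []

    def collect(self, doc, score):
        # Look up the external ID by document number, then check access.
        if is_allowed(self.external_ids[doc], self.allowed_ids):
            self.hits.append((doc, score))


def search(matching_docs, collector):
    """Feed every (doc, score) match to the collector, as a searcher would."""
    for doc, score in matching_docs:
        collector.collect(doc, score)
    return collector.hits
```

The permission check runs once per matching document, so its cost grows with the number of matches rather than with the number of folders.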

> b) Where exactly these classes fit in to Solr (if they do at all)

A custom request handler or a custom query component would be the
likely place to add/change behavior.

> So far I have already written my own (tiny, tiny) Tokenizer and
> TokenizerFactory for correctly parsing the tags that come in from the
> database, and that works great,

What's the format of the tags... you might be able to use an existing
tokenizer (a regex one perhaps).

-Yonik

Re: searching only within allowed documents

Stephen Weiss-2
Thanks for the advice Yonik.

We have new users at least every few hours so it would be kinda  
difficult to maintain the indexes this way.  However, we do have a  
smaller set of tokens describing the different subscription sets  
available (<100).  Basically, each folder_id is attached to a certain  
number of subscription sets, and these associations don't change  
much.  With MySQL using this field would have taken too many joins,  
but with solr this may actually end up being better overall.
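
A minimal Python sketch of that approach, under stated assumptions: the folder-to-set mapping is made-up sample data, and subscription_set is a hypothetical multi-valued field name. Each document is indexed with the tokens of its folder's subscription sets, and the filter query then needs one clause per set the user holds (fewer than 100) instead of one per folder.

```python
# Hypothetical data: which subscription sets each folder belongs to.
FOLDER_SETS = {
    101: ["news", "sports"],
    102: ["sports"],
    103: ["archive"],
}


def subscription_tokens(folder_id, folder_sets=FOLDER_SETS):
    """Values to index in a multi-valued field for a document in folder_id."""
    return sorted(folder_sets.get(folder_id, []))


def build_subscription_filter(user_sets, field="subscription_set"):
    """fq value with one clause per subscription set the user holds."""
    return " OR ".join(f"{field}:{name}" for name in sorted(user_sets))
```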

Only problem is, right now the current workflow would require indexing  
once to put the images in the system, and then a second time to set  
the permissions on them.  We'll have to change the order of some  
processes around, which means retraining, but in the end I think this  
is going to be the most workable solution.

I didn't realize there was a regular expression tokenizer, but now I  
see there's PatternAnalyzer.  I'll give it a shot.

Regards,

Steve



Re: searching only within allowed documents

Geoffrey Young
In reply to this post by Yonik Seeley-2


> Solr allows you to specify filters in separate parameters that are
> applied to the main query, but cached separately.
>
> q=the user query&fq=folder:f13&fq=folder:f24

I've been wanting more explanation around this for a while, so maybe now
is a good time to ask :)

the "cached separately" verbiage here is the same as in the twiki, but I
don't really understand what it means.  more precisely, I'm wondering
what the real performance, caching, etc differences are between

   q=fielda:foo+fieldb:bar&mm=100%

and

   q=fielda:foo&fq=fieldb:bar

my situation is similar to the original poster's in that the set of
documents matching fielda is very large and common (say theaters across
the world) while fieldb would narrow it considerably (one by country,
then one by zipcode, etc).

thanks

--Geoff



Re: searching only within allowed documents

climbingrose
It depends on your query. The second query is better if you know that
the fieldb:bar filter query will be reused often, since it will be cached
separately from the query. The first query occupies one cache entry while
the second one occupies two cache entries, one in queryResultCache and one
in filterCache. Therefore, if you're not going to reuse fieldb:bar, the
second query is better.


--
Regards,

Cuong Hoang

Re: searching only within allowed documents

climbingrose
Just to correct myself: in the last sentence, the first query is better if
fieldb:bar isn't reused often.
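
The reuse argument can be illustrated with a toy model in Python. This is a sketch of the caching idea only, not of Solr's internals: the class, its exact-match "query parser" and the sample index are all hypothetical. The filter's document set is computed once and then shared by every query that uses the same fq, while each distinct (q, fq) combination still gets its own result entry.

```python
class ToySearcher:
    """Toy model of separate caches for filters and for query results."""

    def __init__(self, index):
        self.index = index             # {doc_id: {field: value}}
        self.filter_cache = {}         # fq string -> set of doc ids
        self.result_cache = {}         # (q, fqs) -> sorted list of doc ids
        self.filter_computations = 0   # how often a filter was evaluated

    def _match(self, clause):
        # Exact match on a single 'field:value' clause.
        field, value = clause.split(":", 1)
        return {d for d, f in self.index.items() if f.get(field) == value}

    def _filter(self, fq):
        if fq not in self.filter_cache:
            self.filter_computations += 1
            self.filter_cache[fq] = self._match(fq)
        return self.filter_cache[fq]

    def search(self, q, fqs=()):
        key = (q, tuple(fqs))
        if key not in self.result_cache:
            docs = self._match(q)
            for fq in fqs:
                docs = docs & self._filter(fq)  # filters are intersected
            self.result_cache[key] = sorted(docs)
        return self.result_cache[key]
```

If fieldb:bar is only ever used once, the extra filter entry buys nothing, which is the case where folding it into the main query is the cheaper choice.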


--
Regards,

Cuong Hoang

Re: searching only within allowed documents

Geoffrey Young
In reply to this post by climbingrose


climbingrose wrote:
> It depends on your query. The second query is better if you know that
> the fieldb:bar filter query will be reused often, since it will be cached
> separately from the query. The first query occupies one cache entry while
> the second one occupies two cache entries, one in queryResultCache and one
> in filterCache. Therefore, if you're not going to reuse fieldb:bar, the
> second query is better.

ok, that makes more sense.  thanks.

--Geoff