Per user data store


Ganesh - yahoo
Hello all,

Documents corresponding to multiple users are to be indexed. Each user will search only his own documents; only the Administrator can search all users' data.

Is it better to have one database for each user, or a single database for all users?

My inclination is to have one database for all users, with a 'Username' field. Search results would be filtered on this field before being served to the user. In this approach, is it better to make Username part of the boolean query, or to apply it as a TermFilter?

One more technical question: the Username field will contain the same user names over and over. Will space for this field be consumed for every document / record, or will the data be tokenized once, with only a pointer to each document stored?

Regards
Ganesh
Re: Per user data store

Erick Erickson
I'd start out with one index, if for no other reason
than keeping track of one index for each user would
be a royal pain in the neck. You haven't told us
how many users or documents you expect,
so that's just a guess. There's one answer perhaps
if you wind up with a 10M index, another if it's 10T.....

Filtering on the username is a fine idea, although
I'd start by simply ANDing the username into
the query. Then measure your response
time. Note that the first search after you open a reader
will be slow, so measure queries 2-n
instead.

I don't know the guts of Lucene, but my indexes do NOT
grow linearly with the data. After a very few docs, adding,
say, 1M of data does not cause the index to grow by 1M (or
even close to that) for fields that are NOT stored. I've
learned to just trust that the very bright people who work
on Lucene have "done the right thing" <G>...

Best
Erick
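
On the storage question: in an inverted index, each distinct term of a field is generally recorded once, followed by the list of documents that contain it, so a repeated user name costs roughly one list entry per document rather than a full copy of the string. The toy class below only illustrates that idea. It is a sketch with invented names, not Lucene's actual data structures or file format.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy inverted index for a single untokenized, unstored field.
// Hypothetical illustration only; this is not how Lucene lays out its files.
class ToyInvertedIndex {
    // Each distinct term is kept once and maps to the ids of the documents containing it.
    private final Map<String, List<Integer>> postings = new TreeMap<>();
    private int nextDocId = 0;

    // Adds a document whose field holds the given term; returns the new document id.
    int addDocument(String term) {
        int docId = nextDocId++;
        postings.computeIfAbsent(term, t -> new ArrayList<>()).add(docId);
        return docId;
    }

    // Number of distinct terms, independent of how many documents were added.
    int distinctTerms() {
        return postings.size();
    }

    // Document ids recorded for a term, or an empty list if the term is unknown.
    List<Integer> docsFor(String term) {
        return postings.getOrDefault(term, Collections.emptyList());
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        for (int i = 0; i < 1000; i++) {
            index.addDocument(i % 2 == 0 ? "ganesh" : "antony");
        }
        System.out.println("distinct terms: " + index.distinctTerms());
        System.out.println("docs for ganesh: " + index.docsFor("ganesh").size());
    }
}
```

A thousand documents spread over two user names still leave only two terms in the dictionary; what grows is the list of document ids per term.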

On Tue, Aug 5, 2008 at 8:36 AM, Ganesh - yahoo <[hidden email]> wrote:

> [snip]
Re: Per user data store

adb
In reply to this post by Ganesh - yahoo
Ganesh - yahoo wrote:
> Hello all,
>
> Documents corresponding to multiple users are to be indexed. Each user is
> going to search only his documents. Only Administrator could search all users
> data.
>
> Is it good to have one database for each User or to have only one database
> for all Users? Which will be better?

I created a hybrid approach that supported 1..n databases based on a hash of the
user's user Id.  This was to allow for the situation where a single database
would not scale - at the time there was not good information about Lucene's
performance with large data sets.
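
A hash-based choice among 1..n databases, as described above, might look roughly like the following. This is only a sketch: the class name, the use of String.hashCode(), and the "index-N" directory naming are assumptions made for illustration, not details from the original setup.

```java
// Hypothetical helper that maps a user id onto one of n index directories.
class UserShardSelector {
    private final int shardCount;

    UserShardSelector(int shardCount) {
        if (shardCount < 1) {
            throw new IllegalArgumentException("shardCount must be >= 1");
        }
        this.shardCount = shardCount;
    }

    // floorMod keeps the result in [0, shardCount) even when hashCode() is negative.
    int shardFor(String userId) {
        return Math.floorMod(userId.hashCode(), shardCount);
    }

    // The "index-N" naming scheme is invented for this example.
    String indexDirectoryFor(String userId) {
        return "index-" + shardFor(userId);
    }
}
```

With shardCount set to 1 every user lands in the same directory, which matches the single-database case.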

In practice, we are now using a single database with data for all users.  There
is an 'ownerId' field with the unique user Id in every document.

 > My opinion is to have one database for all users and to have field
 > 'Username'. Using this field data will get filtered out and the search
 > results will be served to the User. In this approach, whether Username should
 > be part of boolean query or TermFilter will be the better approach?

The ownerId is applied as a cached filter rather than added to every query, so
the ownership restriction itself does not contribute to the score and the
filter can be reused across that user's searches. If it is part of the query,
the ownerId clause takes part in scoring every hit for this user.

Antony
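
The cached-filter idea can be pictured as one bit set per ownerId, built once and then intersected with the hits of each query. The sketch below uses plain java.util collections rather than Lucene's own filter classes, and every class, field, and method name in it is invented for illustration.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Hypothetical per-owner filter cache; not Lucene's API.
class OwnerFilterCache {
    // ownerOfDoc[docId] holds the ownerId of that document.
    private final String[] ownerOfDoc;
    private final Map<String, BitSet> cache = new HashMap<>();
    // Counts how often a bit set had to be built from scratch.
    int buildCount = 0;

    OwnerFilterCache(String[] ownerOfDoc) {
        this.ownerOfDoc = ownerOfDoc.clone();
    }

    // Returns the cached bit set for an owner, scanning the documents only on first use.
    BitSet filterFor(String ownerId) {
        return cache.computeIfAbsent(ownerId, id -> {
            buildCount++;
            BitSet bits = new BitSet(ownerOfDoc.length);
            for (int doc = 0; doc < ownerOfDoc.length; doc++) {
                if (ownerOfDoc[doc].equals(id)) {
                    bits.set(doc);
                }
            }
            return bits;
        });
    }

    // Keeps only the query hits that belong to the owner; the input bit set is left unchanged.
    BitSet restrict(BitSet queryHits, String ownerId) {
        BitSet result = (BitSet) queryHits.clone();
        result.and(filterFor(ownerId));
        return result;
    }
}
```

The restriction here is a pure yes/no mask applied after matching, which is why it adds nothing to the score, and repeated searches by the same user reuse the same bit set.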



Re: Per user data store

Karsten F.-2
In reply to this post by Erick Erickson
Hi,

I agree with the advice to use only one index.

And I want to add two reasons:
1. Sorting and caching work on Lucene document numbers.
In Lucene, "warming up" means that a lot of int arrays and bit sets are held in main memory.
If you use a different MultiReader for each user, all of that caching is also kept separately.

2. You should think about what happens when you get new users:
Most likely you will get a user "with the same permissions as XY".
You don't want to copy an index file or insert a new value into an existing document field for that.
But you can easily copy the filter of an existing user.
(This also means I suggest not using a field like "userids with read-permission". It is better to decouple user IDs from the index.)

But these reasons hold only for my own assumptions about the number of users, the ratio of deleted to added documents, and how long documents remain valid.

So I again agree with Erick that you should tell us more about your use case.
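
Point 2 can be sketched as a registry that keeps each user's filter outside the index, so that giving a new user "the same permissions as XY" is a copy of a bit set and touches no document. The class and method names below are invented for illustration, and the bit-set representation is an assumption.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Hypothetical registry of per-user visibility filters kept apart from the index.
class UserFilterRegistry {
    private final Map<String, BitSet> filterByUser = new HashMap<>();

    // Stores a private copy so later changes to the caller's bit set have no effect.
    void setFilter(String userId, BitSet visibleDocs) {
        filterByUser.put(userId, (BitSet) visibleDocs.clone());
    }

    // Gives newUser the same visibility as existingUser without modifying any document.
    void grantSameAs(String newUser, String existingUser) {
        BitSet source = filterByUser.get(existingUser);
        if (source == null) {
            throw new IllegalArgumentException("unknown user: " + existingUser);
        }
        filterByUser.put(newUser, (BitSet) source.clone());
    }

    // True if the user has a filter and the document's bit is set in it.
    boolean canSee(String userId, int docId) {
        BitSet bits = filterByUser.get(userId);
        return bits != null && bits.get(docId);
    }
}
```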

Best regards

  Karsten
