Multiple index performance

Multiple index performance

Cyndy
Hello, I am new to Lucene and I want to make sure what I am trying to do will not cause a performance hit. My scenario is the following:

I want to keep each user's text files indexed separately. I will have about 10,000 users, each of whom may have about 20,000 short files, and I need to preserve privacy. The idea is to have one folder per user holding that user's text files and index, so that when users search for their documents, the search points at the corresponding directory. Would this approach hurt performance? Is this a good solution? Any recommendations?

Thanks in advance.


Re: Multiple index performance

adb
Cyndy wrote:
>
> I want to keep user text files indexed separately, I will have about 10,000
> users and each user may have about 20,000 short files, and I need to keep
> privacy. So the idea is to have one folder with the text files and  index
> for each user, so when search will be done, it will be pointing to the
> corresponding file directory. Would this approach hit performance? is this a
> good solution? Any recommendation?

For access control, we use an ownerId field in Lucene which indexes the owning
user.  We filter all searches using ownerId.  This allows all Documents to be
kept in a single index.

We also support sharding across multiple index files for performance/scaling
considerations, via a hash of the ownerId, but in practice have not needed it.
Much will depend on your search usage.
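
As a rough illustration of the sharding idea above, here is a minimal sketch that maps an ownerId to one of a fixed number of index shards. The class name, the directory naming scheme, and the choice of CRC32 as the hash are assumptions for illustration only; the thread does not say how the hash is computed. Each shard still holds many owners, so searches would still need to be filtered on ownerId.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

/** Hypothetical helper: maps an ownerId to one of a fixed number of index shards. */
final class OwnerShardRouter {
    private final int shardCount;

    OwnerShardRouter(int shardCount) {
        if (shardCount <= 0) {
            throw new IllegalArgumentException("shardCount must be positive");
        }
        this.shardCount = shardCount;
    }

    /** The same ownerId always yields the same shard number in [0, shardCount). */
    int shardFor(String ownerId) {
        // CRC32 is an assumed choice; it is stable across JVM runs,
        // which matters because the shard must be found again at search time.
        CRC32 crc = new CRC32();
        crc.update(ownerId.getBytes(StandardCharsets.UTF_8));
        return (int) (crc.getValue() % shardCount);
    }

    /** Hypothetical directory naming scheme for the shard's index files. */
    String indexDirFor(String ownerId) {
        return String.format("index-shard-%03d", shardFor(ownerId));
    }
}
```

Note that changing the shard count moves owners between shards, so a scheme like this implies reindexing if the number of shards changes.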

YMMV
Antony



---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: Multiple index performance

adb
In reply to this post by Cyndy
[hidden email] wrote:
> Thanks Antony for your response, I did not know about that field.

You define your own fields in Lucene; ownerId is not something Lucene gives you.


> But still I have a problem and it is about privacy. The users are concerned
> about privacy and so, we thought we could have all their files in a folder
> and encrypt the whole folder and index with a user key, so then when user
> logs in, decrypt the folder with the key and so Lucene can reach the
> documents, so that is why I am concerned about efficiency, since I do not
> know if Lucene could handle the 10,000 indexes.


It seems like you may be confusing what Lucene will give you.  The original file
content and the Lucene indexes are two different things.  It sounds like you
want to protect access to the original content on some shared storage, but that
is unrelated to the searching provided by your Lucene app.  Or maybe I have
misunderstood your use case.

Antony





Re: Multiple index performance

Erick Erickson
In reply to this post by Cyndy
Another issue is opening/closing your indexes. When you open an
index for searching, the first few queries you fire incur considerable
overhead as caches warm up, etc. Plus, with many small indexes you don't
get any economies of scale (that is, once an index is reasonably large,
doubling the amount of text grows the index by considerably less than 2X
if you're not storing the text).

So, you either have to keep 10,000 indexes open for efficient searching,
or open/close each one on demand and live with the consequent hit to
your searching performance.
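
One possible middle ground between those two extremes is to keep only a bounded number of indexes open and close the least recently used one when the limit is reached. The sketch below shows that idea with plain Java collections. OpenIndexCache, the opener function, and the handle type are all hypothetical; in a real application the handle would be whatever object you use to search one user's index.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical bounded cache of open per-user index handles with LRU eviction. */
final class OpenIndexCache<H extends AutoCloseable> {
    private final Function<String, H> opener;
    private final LinkedHashMap<String, H> open;

    OpenIndexCache(int maxOpen, Function<String, H> opener) {
        this.opener = opener;
        // accessOrder = true keeps entries ordered least recently used first.
        this.open = new LinkedHashMap<String, H>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, H> eldest) {
                if (size() <= maxOpen) {
                    return false;
                }
                closeQuietly(eldest.getValue()); // close the evicted index
                return true;
            }
        };
    }

    /** Returns the open handle for this user, opening it on first use. */
    synchronized H get(String userId) {
        H handle = open.get(userId);
        if (handle == null) {
            handle = opener.apply(userId); // pays the open and warm-up cost
            open.put(userId, handle);
        }
        return handle;
    }

    synchronized int openCount() {
        return open.size();
    }

    private static void closeQuietly(AutoCloseable c) {
        try {
            c.close();
        } catch (Exception e) {
            // Ignored in this sketch; real code should at least log it.
        }
    }
}
```

This sketch closes an evicted handle immediately, which is unsafe if another thread is still searching with it; a real implementation would need reference counting or a similar guard. Every cache miss still pays the warm-up cost described above.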

I'd think about keeping it all in a large index, storing the user's name
as a field and appending something like "AND user:cyndy" to each
search. You could also assemble a filter for your user and tack that on
to the query. But the above clause is conceptually simplest.
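
To make the "AND user:cyndy" suggestion concrete, here is a minimal sketch that builds the restricted query string. The class and method names are hypothetical, and the restriction of user names to letters, digits, and underscores is an assumption made so that no query-syntax escaping is needed.

```java
import java.util.regex.Pattern;

/** Hypothetical helper that scopes a user's query string to that user's documents. */
final class UserScopedQuery {
    // Assumption: user names contain only these characters, so they cannot
    // inject query syntax such as "OR user:someoneElse".
    private static final Pattern SAFE_USER = Pattern.compile("[A-Za-z0-9_]+");

    private UserScopedQuery() {}

    static String restrictToUser(String userQuery, String userName) {
        if (userQuery == null || userQuery.trim().isEmpty()) {
            throw new IllegalArgumentException("query must not be empty");
        }
        if (userName == null || !SAFE_USER.matcher(userName).matches()) {
            throw new IllegalArgumentException("unsupported user name");
        }
        // The parentheses stop an OR inside the user's query from
        // bypassing the user clause.
        return "(" + userQuery + ") AND user:" + userName;
    }
}
```

For example, restrictToUser("budget OR invoice", "cyndy") returns "(budget OR invoice) AND user:cyndy". String concatenation like this is only as safe as the query text it wraps (unbalanced parentheses in the user's input could still break it), so combining parsed query objects, or using a filter as mentioned above, is likely the more robust route.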

Best
Erick

On Mon, Aug 18, 2008 at 10:34 PM, Cyndy <[hidden email]> wrote:

>
> Hello, I am new into Lucene and I want to make sure what I am trying to do
> will not hit performance. My scenario is the following:
>
> I want to keep user text files indexed separately, I will have about 10,000
> users and each user may have about 20,000 short files, and I need to keep
> privacy. So the idea is to have one folder with the text files and  index
> for each user, so when search will be done, it will be pointing to the
> corresponding file directory. Would this approach hit performance? is this a
> good solution? Any recommendation?
>
> Thanks in advance.
>
>
> --
> View this message in context:
> http://www.nabble.com/Multiple-index-performance-tp19043404p19043404.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.

Re: Multiple index performance

Cyndy
In reply to this post by adb
Thanks Antony,

I understand your comment, and I think it makes sense. The only issue is that I need to guarantee privacy to the users: if I am able to read the indexes (because they are not encrypted), then I can pretty much work out what a user's documents say. That is why I was thinking of encrypting each user's whole directory of text files as well as the index files, so that the user, by giving a password, can decrypt all the files and then Lucene can do its job. In that case I will have to open/close the indexes on demand. So my concern was: if I have 1,000 indexes open at a given moment, would that hurt performance?

Thanks again for your answer.



