Will storing docs affect Lucene's search performance?


Prasenjit Mukherjee-3
I have a requirement (use of the highlighter) to store the doc content
somewhere, and I am not allowed to use an RDBMS. I am thinking of using
Lucene's Field with (Field.Store.YES and Field.Index.NO) to store the
doc content. Will it have any negative effect on my search performance?

I think I have read somewhere that Lucene shouldn't be used (or
misused) to provide RDBMS-like storage.

--prasen

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]
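
One way to picture what the question is asking: a field that is stored but not indexed contributes nothing to the term lookup structures, so a term search never touches it. The sketch below is a toy model in plain Java, not Lucene code. ToyIndex, newDoc, addField, search and storedValue are hypothetical names, and the two boolean flags only loosely mirror Field.Store and Field.Index.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

/** Toy model (not Lucene): search reads postings only; stored values are fetched by doc id. */
final class ToyIndex {
    // term -> ids of documents whose indexed fields contain that term
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();
    // doc id -> (field name -> stored value); never consulted by search()
    private final List<Map<String, String>> stored = new ArrayList<>();

    /** Starts a new document and returns its id. */
    int newDoc() {
        stored.add(new HashMap<>());
        return stored.size() - 1;
    }

    /** Adds a field; "store" and "index" loosely mirror Field.Store and Field.Index. */
    void addField(int docId, String name, String value, boolean store, boolean index) {
        if (store) {
            stored.get(docId).put(name, value);
        }
        if (index) {
            for (String term : value.toLowerCase().split("\\W+")) {
                if (!term.isEmpty()) {
                    postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
                }
            }
        }
    }

    /** Looks up a single term; touches the postings map only. */
    List<Integer> search(String term) {
        return new ArrayList<>(postings.getOrDefault(term.toLowerCase(), new TreeSet<>()));
    }

    /** Returns a stored value, or null if the field was not stored. */
    String storedValue(int docId, String name) {
        return stored.get(docId).get(name);
    }
}
```

In this model a document added with an indexed "title" and a stored-only "content" is found by title terms, and its content is fetched afterwards by doc id for highlighting. Terms that appear only in the stored-only field are not searchable.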

Re: Will storing docs affect Lucene's search performance?

Øyvind Stegard
On Friday 11 August 2006 15:07, Prasenjit Mukherjee wrote:
> I have a requirement (use of the highlighter) to store the doc content
> somewhere, and I am not allowed to use an RDBMS. I am thinking of using
> Lucene's Field with (Field.Store.YES and Field.Index.NO) to store the
> doc content. Will it have any negative effect on my search performance?
>
> I think I have read somewhere that Lucene shouldn't be used (or
> misused) to provide RDBMS-like storage.
We are using a stored binary version of every field we index in our content
repository implementation (mostly just primitive data types, though). I asked
a similar question earlier on this list. I'll just quote the reply I got
here:

> On 3/9/06, Øyvind Stegard <[hidden email]> wrote:
> > - How do many stored fields ultimately affect indexing/query
> > performance, compared to storing no fields (indexing only)?
>
> Additional stored fields should have no effect on querying (the
> internal information about a field is looked up in a hashmap).
>
> Additional stored fields that are used have an impact on indexing, since
> that data must be copied every time segments are merged.
>
> Additional stored fields that are not used in most documents (sparse)
> should have very little performance impact on indexing.  The field
> list is walked a few times linearly (in-memory) during a segment
> merge, which should be very fast, but it's still O(n), so don't go
> crazy and have a million stored field types.
>
> > - Are there any known scalability issues with a large number of distinct
> > fields in an index (not necessarily the same set of fields for every doc)?
>
> If they are indexed fields, yes.
> Each indexed field has a 1-byte norm *per document*, regardless of whether
> the document contains that field.  In the current version of Lucene,
> there is a way to omit these norms on a per-field basis (see
> Field.setOmitNorms()) if you don't need length normalization or
> index-time field boosting.
>
> -Yonik
> http://incubator.apache.org/solr Solr, The Open Source Lucene Search Server

Øyvind
--
< Øyvind Stegard < oyvind stegard at usit uio no >
 < SAUS/USIT, UiO
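
The norm cost described in the quoted reply is easy to estimate: one byte per document for every indexed field that keeps norms, whether or not a given document contains the field. The arithmetic below is a rough sketch only. NormsEstimate and normBytes are hypothetical names, and the document and field counts are made-up figures for illustration.

```java
/** Back-of-the-envelope estimate of norm data, from the 1-byte-per-document rule quoted above. */
final class NormsEstimate {
    /**
     * @param numDocs                documents in the index
     * @param indexedFieldsWithNorms distinct indexed field names that keep norms
     * @return bytes of norm data, at one byte per document per field
     */
    static long normBytes(long numDocs, long indexedFieldsWithNorms) {
        return numDocs * indexedFieldsWithNorms;
    }

    public static void main(String[] args) {
        // Hypothetical figures: 10 million documents, 200 distinct indexed fields.
        long all = normBytes(10_000_000L, 200L);
        // Same index if norms are omitted on 150 of those fields.
        long reduced = normBytes(10_000_000L, 50L);
        System.out.println(all + " bytes with norms on every field");
        System.out.println(reduced + " bytes with norms on 50 fields");
    }
}
```

With these made-up numbers the estimate is 2,000,000,000 bytes, and omitting norms on 150 of the 200 fields brings it down to 500,000,000 bytes. Stored-only fields do not enter this calculation at all, since the rule applies to indexed fields.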

Re: Will storing docs affect Lucene's search performance?

Grant Ingersoll
In reply to this post by Prasenjit Mukherjee-3

Large stored fields can affect performance when you iterate over your
hits for a results display, since all Fields are loaded when getting
the Document, even if you are not interested in the value of the
stored field at that point in time.  The SVN trunk has a version of
lazy loading that allows you to specify which fields are loaded and
which ones are lazy, so you can avoid loading fields that a user will
never view.

-Grant
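
To illustrate the idea of lazy loading in general terms: a lazy field defers reading its value until something asks for it, so a large body that is never displayed is never read. The sketch below is a toy model in plain Java and does not reproduce the API in the Lucene trunk. LazyDoc, LazyField, addEager and addLazy are hypothetical names, and the Supplier stands in for whatever actually reads the bytes from the index.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

/** Toy model of lazy stored-field loading (not the Lucene API). */
final class LazyDoc {
    /** Wraps a loader so the value is read at most once, on first access. */
    static final class LazyField {
        private final Supplier<String> loader;
        private String value;
        private boolean loaded;

        LazyField(Supplier<String> loader) {
            this.loader = loader;
        }

        String stringValue() {
            if (!loaded) {
                value = loader.get(); // the expensive read happens here, once
                loaded = true;
            }
            return value;
        }

        boolean isLoaded() {
            return loaded;
        }
    }

    private final Map<String, LazyField> fields = new HashMap<>();

    /** Eager field: the value is read immediately when the document is built. */
    void addEager(String name, Supplier<String> loader) {
        LazyField f = new LazyField(loader);
        f.stringValue();
        fields.put(name, f);
    }

    /** Lazy field: nothing is read until stringValue() is first called. */
    void addLazy(String name, Supplier<String> loader) {
        fields.put(name, new LazyField(loader));
    }

    LazyField field(String name) {
        return fields.get(name);
    }
}
```

A results page that shows only titles would register the title as eager and the large content field as lazy. The content loader then runs only for the hit the user actually opens.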

--------------------------
Grant Ingersoll
Sr. Software Engineer
Center for Natural Language Processing
Syracuse University
335 Hinds Hall
Syracuse, NY 13244
http://www.cnlp.org

Voice: 315-443-5484
Skype: grant_ingersoll
Fax: 315-443-6886





Re: Will storing docs affect Lucene's search performance?

SnowCrash
IMO you should avoid storing any data in the index that you don't need for
display.  Lucene is an index (and a damn good one), not a database.  If you
find yourself storing large amounts of data in the index, this could be an
indication that you may need to re-think your architecture.

In its simplest case, data storage is for storing data. Lucene is for
indexing the data and searching it.

You will certainly see performance implications from storing data in the
index, particularly if you elect to have the data compressed by Lucene.  The
lazy loading in the current trunk will help enormously with this (great work
by the dev team), but I would still encourage you to design a system in
which Lucene is not the primary source of data.  That is, if you need to
re-index, get the data from its source location (or some interim location)
rather than relying on storing it in Lucene.

I had all sorts of struggles with this very issue, and after several failed
attempts came to the conclusion that whilst Lucene often forms a critical
part of a good solution, it should only be used as an index/search tool,
not a database.
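
A minimal sketch of the separation suggested above, assuming the index returns only a key per hit and the display text (for example, for highlighting) is fetched from wherever the documents really live. Everything here is hypothetical: ExternalContentDemo, ContentStore, MapContentStore and loadPage are illustrative names, and the in-memory map stands in for a file system or other non-RDBMS store.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch: the index returns only keys; display text comes from the system of record. */
final class ExternalContentDemo {
    /** Hypothetical abstraction over wherever the original documents live. */
    interface ContentStore {
        String fetch(String key);
    }

    /** In-memory stand-in for a file system, object store, or similar. */
    static final class MapContentStore implements ContentStore {
        private final Map<String, String> data = new HashMap<>();

        void put(String key, String text) {
            data.put(key, text);
        }

        @Override
        public String fetch(String key) {
            return data.get(key);
        }
    }

    /** Resolves only the keys on the page being displayed into their text. */
    static List<String> loadPage(List<String> hitKeys, int offset, int pageSize, ContentStore store) {
        List<String> page = new ArrayList<>();
        int end = Math.min(hitKeys.size(), offset + pageSize);
        for (int i = offset; i < end; i++) {
            page.add(store.fetch(hitKeys.get(i)));
        }
        return page;
    }
}
```

The point of the design is that content is fetched for one page of hits at a time, and that a re-index can be driven from the same store rather than from the index itself.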



Re: Will storing docs affect Lucene's search performance?

Rupinder Singh Mazara
In reply to this post by Grant Ingersoll
Where can I find information on which version/tag to check out in order to
get the lazy-loading variety of Lucene?



Re: Will storing docs affect Lucene's search performance?

Grant Ingersoll-4
It is on the HEAD version in SVN.

See  http://wiki.apache.org/jakarta-lucene/SourceRepository for info  
on checking out from SVN.



------------------------------------------------------
Grant Ingersoll
http://www.grantingersoll.com/



