[jira] Created: (SOLR-52) Lazy Field loading

[jira] Created: (SOLR-52) Lazy Field loading

ASF GitHub Bot (Jira)
Lazy Field loading
------------------

                 Key: SOLR-52
                 URL: http://issues.apache.org/jira/browse/SOLR-52
             Project: Solr
          Issue Type: Improvement
          Components: search
            Reporter: Mike Klaas
         Assigned To: Mike Klaas
            Priority: Minor
         Attachments: lazyfields_patch.diff

Add lazy field loading to Solr.

Currently Solr reads all stored fields and filters the undesired fields based on the field list.  This is usually not a performance concern, but when using Solr to store large numbers of fields, or just one large field (doc contents, e.g. for highlighting), it is perceptible.

Now, there is a concern with the doc cache of SolrIndexSearcher, which assumes it has the whole document in the cache.  To maintain this invariant, it is still the case that all the fields in a document are loaded in a searcher.doc(i) call.  However, if a field set is given to the method, only the given fields are loaded directly, while the rest are loaded lazily.

Some concerns about lazy field loading:
  1. Lazy fields are only valid while the IndexReader is open.  I believe this is fine since the IndexReader is kept alive by the SolrIndexSearcher, so all docs in the cache have the reader available.
  2. It is slower to read a field lazily and retrieve its value later than to retrieve it directly to begin with (though I don't know how much--depends on I/O factors).  We certainly don't want this to be the common case.  I added an optional call which accumulates all the fields likely to be used in the request (highlighting, response writing), and populates the IndexSearcher cache a priori.  This has the added advantage of concentrating doc retrieval in a single place, which is nice from a performance-testing perspective.
  3. LazyFields are incompatible with the sundry Field declarations scattered about Solr.  I believe I've changed all the necessary locations to Fieldable.

Comments appreciated.
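The load-once behaviour discussed above can be sketched in plain Java. This is purely illustrative (the class and names below are not Lucene's actual LazyField): a field that defers its read until first access, then behaves like an ordinary loaded field.

```java
import java.util.function.Supplier;

// Illustrative sketch only -- not Lucene's LazyField. A field whose
// value is read from the index on first access and cached thereafter.
public class LazyFieldSketch {
    private final Supplier<String> loader; // stands in for a seek+read on the stored-fields stream
    private String value;
    private boolean loaded;

    public LazyFieldSketch(Supplier<String> loader) {
        this.loader = loader;
    }

    public synchronized String stringValue() {
        if (!loaded) {        // only the first call touches the index
            value = loader.get();
            loaded = true;
        }
        return value;         // subsequent calls are plain field reads
    }
}
```

After the first stringValue() call, the field costs no more than an eagerly loaded one; the price is that its validity is tied to the underlying reader, per concern #1 above.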


--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

[jira] Updated: (SOLR-52) Lazy Field loading

ASF GitHub Bot (Jira)
     [ http://issues.apache.org/jira/browse/SOLR-52?page=all ]

Mike Klaas updated SOLR-52:
---------------------------

    Attachment: lazyfields_patch.diff


[jira] Commented: (SOLR-52) Lazy Field loading

ASF GitHub Bot (Jira)
In reply to this post by ASF GitHub Bot (Jira)
    [ http://issues.apache.org/jira/browse/SOLR-52?page=comments#action_12440682 ]
           
Yonik Seeley commented on SOLR-52:
----------------------------------

+1, looks good.

There are some small backward incompatibilities (any place that returns a Fieldable, like getUniqueKeyField), but it can't be helped, and it's fairly expert-level anyway.

My only concern was about a memory increase for lazy-loaded short fields.  I reviewed some of the LazyField code just now, and it looks like this shouldn't be the case:
 - LazyField is an inner class that contains three extra members.  The outer class it retains a reference to is FieldsReader.  The FieldsReader instance is a member of SegmentReader, and has the same lifetime as the SegmentReader.  Hence a LazyField won't extend the lifetime of any other objects.

One thing I did see is that the internal char[] buffer used to read the string in LazyField is a member variable for some reason (hence the data will be stored in the field *twice*).  I think this is probably a bug, and I'll bring it up on the Lucene list.

Ideas for future optimizations:
- if there is no document cache, change lazy loading to no-load
- special cases: if only a single field (like the ID field) is selected out of many documents to be returned, consider bypassing the doc cache and using LOAD_AND_BREAK if we know there is only a single value


[jira] Commented: (SOLR-52) Lazy Field loading

ASF GitHub Bot (Jira)
In reply to this post by ASF GitHub Bot (Jira)
    [ http://issues.apache.org/jira/browse/SOLR-52?page=comments#action_12440683 ]
           
Yonik Seeley commented on SOLR-52:
----------------------------------

The one other memory increase I can see from using lazy fields is due to the thread local... a cloned IndexInput (containing a 1K byte buffer + other object overhead).  That shouldn't be a big deal since it's related to the number of different threads used to access lazy loaded fields, and not directly to the number of lazy fields themselves.

In any case, your optimization of retrieving all the fields needed for the request probably prevents many lazy field invocations.


Re: [jira] Created: (SOLR-52) Lazy Field loading

Chris Hostetter-3
In reply to this post by ASF GitHub Bot (Jira)

Thanks for tackling this Mike ... I've been dreading the whole issue of
Lazy Loading but your patch gives me hope.  I haven't had a chance to try
it out, but reading through it, it seems a lot more straightforward than
I'd feared.

A couple of concerns jump out at me though, starting with the biggie...

: Now, there is a concern with the doc cache of SolrIndexSearcher, which
: assumes it has the whole document in the cache.  To maintain this
: invariant, it is still the case that all the fields in a document are
: loaded in a searcher.doc(i) call.  However, if a field set is given to
: the method, only the given fields are loaded directly, while the rest
: are loaded lazily.

it doesn't look like this would actually be an invariant with your patch.
Consider the case of two request handlers: X which expects/uses
all fields, and Y which only uses a few fields consistently so the
rest are left to lazy loading.   If Y hits doc N first it puts N in
cache with only partial fields, isn't X forced to use lazy loading every
time after that?

Or does lazy loading not work that way? ... would the lazy fields still
only be loaded once (the first time X asked for them) but then stored in
the Document (which is still in the cache)? ... even if that is the case
would the performance hit of X loading each field lazily one at a time be
more expensive than refetching completely?

(can you tell how little i understand the mechanics of Lazy Loading in
general?)

Depending on what would happen in the situation i described, perhaps
the Solr document cache should only be used by the single arg version of
SolrIndexSearcher.doc? (if we do this, two arg version of readDocs needs
changed to use it directly)

An alternate idea: the single arg version could check if an item found in
the cache contains lazy fields and if so re-fetch and recache the full
Document?

FWIW: This isn't an obscure situation.  The main index I deal with has a
lot of metadata documents (one per category) which are *HUGE*, and there
are two custom request handlers which use those documents -- one uses
every field in them, and the other uses only a single field.  That second
handler would be an ideal candidate to start using your new
SolrIndexSearcher.doc(int,Set<String>) method, but it would suck if doing
that meant that the other handler's performance suffered because it started
taking longer to fetch all of the fields.


Other smaller issues...

1) it's not clear to me ... will optimizePreFetchDocs trigger an NPE if
there are highlight fields but no uniqueKey for the schema?
(same potential in HighlightUtils which has the same line)

2) why doesn't optimizePreFetchDocs use SolrIndexSearcher.readDocs (looks
like cut/paste of method body)

3) should we be concerned about letting people specify prefixes/suffixes
of the fields they want to forcibly load for dynamicFields instead of just
a Set<String> of names? .. or should we cross that bridge when we come to
it?  (I ask because we have no cache-aware method that takes in a
FieldSelector, just the one that takes in the Set<String>)


And a few minor nits...

* lists in javadoc comments should use HTML <li> tags so they can be read
  cleanly in generated docs (see SolrPluginUtils.optimizePreFetchDocs)
* there are some double imports of Fieldable in several classes (it
  looks like maybe they were already importing it once for no reason)



-Hoss


Re: Re: [jira] Created: (SOLR-52) Lazy Field loading

Mike Klaas
On 10/8/06, Chris Hostetter <[hidden email]> wrote:

>
> Thanks for tackling this Mike ... I've been dreading the whole issue of
> Lazy Loading but your patch gives me hope.  I haven't had a chance to try
> it out, but reading through it, it seems a lot more straightforward than
> I'd feared.
>
> A couple of concerns jump out at me though, starting with the biggie...
>
> : Now, there is a concern with the doc cache of SolrIndexSearcher, which
> : assumes it has the whole document in the cache.  To maintain this
> : invariant, it is still the case that all the fields in a document are
> : loaded in a searcher.doc(i) call.  However, if a field set is given to
> : the method, only the given fields are loaded directly, while the rest
> : are loaded lazily.
>
> it doesn't look like this would actually be an invariant with your patch.
> Consider the case of two request handlers: X which expects/uses
> all fields, and Y which only uses a few fields consistently so the
> rest are left to lazy loading.

Good point--I hadn't considered this case.

> If Y hits doc N first it puts N in
> cache with only partial fields, isn't X forced to use lazy loading every
> time after that?

Yes, but as you guess below...

> Or does lazy loading not work that way? ... would the lazy fields still
> only be loaded once (the first time X asked for them) but then stored in
> the Document (which is still in the cache) ?

... lazy fields read their value once the first time it is requested,
and operate as a Field thereafter.

> ... even if that is the case
> would the performance hit of X loading each field lazily one at a time be
> more expensive than refetching completely?

I wouldn't expect there to be much of a difference.  Lazy fields hold
on to a stream and an offset, and operate by seek()'ing to the right
position and loading the fields as normal.  Now, if the lazy fields
were loaded in exactly the right order, the seeks would be no-ops.
In practice this won't happen, but we can expect that all seeks
(after the first one) will fall in the same disk block, which will be
buffered, so the seeks amount to mere pointer arithmetic.

But that isn't quite the right comparison, as lazy fields have to
"skip" the field in the first place, which is cheap but not free (it
is cheaper for binary and compressed fields than for string fields).
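A rough sketch of the mechanism described above, with a byte[] standing in for the stored-fields stream (the names are illustrative, not Lucene's): the field records only where its value lives, and "seeks" there on first access.

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: a lazy field remembers just (offset, length) into
// the stored-fields data and reads from there on first access. When
// the underlying block is already buffered, the "seek" is nothing but
// pointer arithmetic, as discussed above.
public class SeekingLazyFieldSketch {
    private final byte[] storedFields; // stands in for the stored-fields stream
    private final int offset;          // recorded when the field was "skipped" at load time
    private final int length;
    private String value;              // null until first access

    public SeekingLazyFieldSketch(byte[] storedFields, int offset, int length) {
        this.storedFields = storedFields;
        this.offset = offset;
        this.length = length;
    }

    public String stringValue() {
        if (value == null) {
            // seek + read, deferred until the value is actually wanted
            value = new String(storedFields, offset, length, StandardCharsets.UTF_8);
        }
        return value;
    }
}
```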

The performance gain/loss will depend heavily on which request handler
gets called more often and in what order.  I don't think the cost
would be too great, but I'm loathe to guess.

> Depending on what would happen in the situation i described, perhaps
> the Solr document cache should only be used by the single arg version of
> SolrIndexSearcher.doc? (if we do this, two arg version of readDocs needs
> changed to use it directly)

Perhaps, but that has other problems (in the situation you describe
above, every hit on the "few fields" handler would reload every
document).

> An alternate idea: the single arg version could check if an item found in
> the cache contains lazy fields and if so re-fetch and recache the full
> Document?

That could work, though I wonder if the O(num fields) cost per document
access is worth it.  Perhaps the document could be stored with a
"lazy" flag in the cache, to make this check O(1).

> FWIW: This isn't an obscure situation, The main index I deal with has a
> lot of metadata documents (one per category) which are *HUGE*, and there
> are two custom request handlers which use those documents -- one uses
> every field in them, and the other uses only a single field.  That second
> handler would be an ideal candidate to start using your new
> SolrIndexSearcher.doc(int,Set<String>) method, but it would suck if doing
> thta ment that the other handler's performance suffered because it started
> taking longer to fetch all of the fields)

Perhaps instead of a "lazy" flag, the number of real fields could be
stored in the cache.  On the next document request, if more than 1-2
fields are requested beyond the "real" count in the cache, the full
document is returned.
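That field-count idea can be sketched as a tiny cache wrapper (purely illustrative; none of these names exist in Solr): each cached entry remembers how many fields were loaded eagerly, so deciding whether the entry is complete enough for a request is O(1) rather than O(num fields).

```java
import java.util.HashMap;
import java.util.Map;

// Hedged sketch of the heuristic floated above, not part of the patch.
public class DocCacheSketch {
    public static final class Entry {
        final Map<String, String> doc;
        final int eagerFieldCount; // fields loaded directly, not lazily
        Entry(Map<String, String> doc, int eagerFieldCount) {
            this.doc = doc;
            this.eagerFieldCount = eagerFieldCount;
        }
    }

    private final Map<Integer, Entry> cache = new HashMap<>();

    public void put(int docId, Map<String, String> doc, int eagerFieldCount) {
        cache.put(docId, new Entry(doc, eagerFieldCount));
    }

    /** Returns the cached doc only if it was loaded with enough eager fields;
     *  otherwise returns null and the caller should refetch and recache. */
    public Map<String, String> get(int docId, int fieldsNeeded) {
        Entry e = cache.get(docId);
        if (e == null || fieldsNeeded > e.eagerFieldCount) {
            return null;
        }
        return e.doc;
    }
}
```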

How much of a problem this is also depends on how often documents are
hit once they're in the cache.  If it is more than a few times, the load-once
behaviour of lazy fields should amortize away the extra cost.

Incidentally, I think one of the major benefits of lazy field loading
isn't necessarily the retrieval cost, but the memory savings.  In your
example, many documents are presumably returned by the 1-field
handler that the many-field handler never touches.  Those documents
will occupy much less memory, and consequently the size of the
document cache can be increased.

> Other smaller issues...
>
> 1) it's not clear to me ... will optimizePreFetchDocs trigger an NPE if
> there are highlight fields but no uniqueKey for the schema?
> (same potential in HighlightUtils which has the same line)

Good point.  I don't think the test suite covers this case--and I've
been bitten by it before.

> 2) why doesn't optimizePreFetchDocs use SolrIndexSearcher.readDocs (looks
> like cut/paste of method body)

Avoids the allocation of the Document[] array, and is three lines (vs.
two lines to allocate array and call readDocs).

> 3) should we be concerned about letting people specify prefixes/suffixes
> of the fields they want to forcably load for dynamicFields instead of just
> a Set<String> of names? .. or should we cross that bridge when we come to
> it?  (I ask because we have no cache aware method that takes in a
> FieldSelector, just the one that takes in the Set<String>)

It would be very easy to add a parallel method which takes a
FieldSelector.  My only concern with that is that it might make it
hard to do cache flushing heuristics like you suggested above.

> And a few minor nits...
>
> * lists in javadoc comments should us HTML <li> tags so they can be read
>   cleanly in generated docs (see SolrPluginUtils.optimizePreFetchDocs)
> * there are some double imports of Fieldable in several classes (it
>   looks like maybe they were already importing it once for no reason)

Thanks for the comments!
-Mike

Re: Re: [jira] Created: (SOLR-52) Lazy Field loading

Yonik Seeley-2
> > 3) should we be concerned about letting people specify prefixes/suffixes
> > of the fields they want to forcibly load for dynamicFields instead of just
> > a Set<String> of names? .. or should we cross that bridge when we come to
> > it?  (I ask because we have no cache-aware method that takes in a
> > FieldSelector, just the one that takes in the Set<String>)
>
> It would be very easy to add a parallel method which takes a
> FieldSelector.  My only concern with that is that it might make it
> hard to do cache flushing heuristics like you suggested above.

Yeah, I had thought about that and decided it was probably best left
out for now... one can always get the IndexReader and use its methods
to provide uncached doc access with a FieldSelector.

-Yonik

[jira] Commented: (SOLR-52) Lazy Field loading

ASF GitHub Bot (Jira)
In reply to this post by ASF GitHub Bot (Jira)
    [ http://issues.apache.org/jira/browse/SOLR-52?page=comments#action_12440945 ]
           
Yonik Seeley commented on SOLR-52:
----------------------------------

> hence the data will be stored in the field *twice* for some reason

FYI, I just checked in a Lucene fix for this.


Re: Re: [jira] Created: (SOLR-52) Lazy Field loading

Chris Hostetter-3
In reply to this post by Mike Klaas

: I wouldn't expect there to be much of a difference.  Lazy fields hold
: on to a stream and an offset, and operate by seek()'ing to the right
        ...

Hmmm... yeah it sounds like it shouldn't matter.  If I get some time I'll
try to do a micro-benchmark to compare loading a doc with one field and
then loading the rest lazily vs loading the doc twice.

: > An alternate idea: the single arg version could check if an item found in
: > the cache contains lazy fields and if so re-fetch and recache the full
: > Document?
:
: That could work though I wonder if the O(num fields) cost per document
: access is worth it.  Perhaps the document could be stored with a
: "lazy" flag in the cache, to make this check O(1).

right ... checking the individual fields would be a very bad idea.

: Perhaps instead of a "lazy" flag, the number of real fields could be
: stored in the cache.  On the next document request, if there are more
: than 1-2 more fields requested than "real" in the cache, the full
: document is returned.

that's the kind of crazy, heuristic/AI-ish solution approach I
love! ... but probably not worth the effort unless we see a demonstrable
problem.

: How much of a problem this is also depends on how often documents are
: hit once they're in the cache.  If it is more than a few times, the load-once
: behaviour of lazy fields should amortize out the extra cost.

right ... and if it's not more than a few times, you might as well skip
the doc cache completely.

: > 2) why doesn't optimizePreFetchDocs use SolrIndexSearcher.readDocs (looks
: > like cut/paste of method body)
:
: Avoids the allocation of the Document[] array, and is three lines (vs.
: two lines to allocate array and call readDocs).

Ah ... that makes sense.



-Hoss


Re: Re: [jira] Created: (SOLR-52) Lazy Field loading

Yonik Seeley-2
On 10/9/06, Chris Hostetter <[hidden email]> wrote:
>
> : I wouldn't expect there to be much of a difference.  Lazy fields hold
> : on to a stream and an offset, and operate by seek()'ing to the right
>         ...
>
> Hmmm... yeah it sounds like it shouldn't matter.  If I get some time I'll
> try to do a micro benchmark to compare loading a doc with one field and
> then loading the rest lazy vs loading the doc twice.

If lazy loading is ever shown to be a performance problem, a simple
solution would be to have a switch in solrconfig.xml to disable it.
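For context, Solr did eventually grow exactly such a switch in solrconfig.xml; a sketch of its form (the element name below is the setting that later shipped, not part of this patch):

```xml
<!-- solrconfig.xml, inside the <query> section: when false, doc(...)
     reads all stored fields eagerly instead of lazily. -->
<enableLazyFieldLoading>false</enableLazyFieldLoading>
```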

-Yonik

[jira] Updated: (SOLR-52) Lazy Field loading

ASF GitHub Bot (Jira)
In reply to this post by ASF GitHub Bot (Jira)
     [ http://issues.apache.org/jira/browse/SOLR-52?page=all ]

Mike Klaas updated SOLR-52:
---------------------------

    Attachment: lazyfields_patch.diff

updated version of patch.  Addresses some of Hoss' (minor) comments.  Also, the .doc() method of SolrIndexSearcher will add the unique key field unconditionally if it is present in the schema.  It is used in several places, and including checks for it in other places decreases readability.


[jira] Commented: (SOLR-52) Lazy Field loading

ASF GitHub Bot (Jira)
In reply to this post by ASF GitHub Bot (Jira)
    [ http://issues.apache.org/jira/browse/SOLR-52?page=comments#action_12440974 ]
           
Mike Klaas commented on SOLR-52:
--------------------------------

Note: the above patch does not address the issue of lazy field use mismatch between two handlers (see solr-dev).



Re: [jira] Updated: (SOLR-52) Lazy Field loading

Chris Hostetter-3
In reply to this post by ASF GitHub Bot (Jira)

: Updated version of patch.  Addresses some of Hoss's (minor) comments.
: Also, the .doc() method of SolrIndexSearcher will add the unique key
: field unconditionally if it is present in the schema.  It is used
: randomly in several places, and including checks for it in other places
: decreases readability.

We probably don't want to add the unique key field directly to the Set passed
by the client -- partially because it's bad form to modify a collection as
a side effect of another method, but also because Set.add is an optional
operation that might throw UnsupportedOperationException.



-Hoss
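A defensive copy along the lines Hoss suggests might look like this in plain Java; the `withUniqueKey` helper and the field name "id" are illustrative, not the patch's actual code:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

public class FieldSetExample {

    // Hypothetical stand-in for the schema's unique key field name.
    static final String UNIQUE_KEY_FIELD = "id";

    /**
     * Returns a field set that always includes the unique key,
     * without mutating the caller's (possibly unmodifiable) set.
     */
    static Set<String> withUniqueKey(Set<String> requested) {
        Set<String> copy = new HashSet<String>(requested);
        copy.add(UNIQUE_KEY_FIELD);
        return copy;
    }

    public static void main(String[] args) {
        // An unmodifiable set, as a client may legally pass in;
        // calling fields.add(...) directly would throw
        // UnsupportedOperationException.
        Set<String> fields = Collections.unmodifiableSet(
            new HashSet<String>(Arrays.asList("title", "body")));

        Set<String> safe = withUniqueKey(fields);
        System.out.println(safe.contains("id"));    // prints true
        System.out.println(fields.contains("id"));  // prints false: caller's set untouched
    }
}
```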


[jira] Updated: (SOLR-52) Lazy Field loading

ASF GitHub Bot (Jira)
In reply to this post by ASF GitHub Bot (Jira)
     [ http://issues.apache.org/jira/browse/SOLR-52?page=all ]

Mike Klaas updated SOLR-52:
---------------------------

    Attachment: lazyfields_patch.diff

Moved id field selection out of SolrIndexSearcher.doc()

Chris: What would you like to see vis-a-vis the many field issues before committing?  Should we put in a global lazy-field-disable option?



Re: Re: [jira] Updated: (SOLR-52) Lazy Field loading

Mike Klaas
In reply to this post by Chris Hostetter-3
On 10/9/06, Chris Hostetter <[hidden email]> wrote:

> We probably don't want to add the unique key field directly to the Set passed
> by the client -- partially because it's bad form to modify a collection as
> a side effect of another method, but also because Set.add is an optional
> operation that might throw UnsupportedOperationException.

Good points. I updated the patch.

-Mike

Re: [jira] Updated: (SOLR-52) Lazy Field loading

Chris Hostetter-3
In reply to this post by ASF GitHub Bot (Jira)

: Chris: What would you like to see vis-a-vis the many field issues before
: committing?  Should we put in a global lazy-field-disable option?

Yeah, a simple solrconfig option that lets you disable it completely is
probably a good idea (especially in light of LUCENE-683), and I don't see
any reason why we need a more complicated solution right now.

This is the microbenchmark I was working on when I discovered LUCENE-683.
I had to put a little hack in to ignore the last few docs when randomly
picking them, but besides that, in all of the different scenarios I tried,
I couldn't find one where re-fetching a document after it had already been
loaded with lazy fields was ever faster than just reusing the existing
doc (who knows if that will change after the bug gets fixed, though)...


package org.apache.lucene;

/**
 * Copyright 2004 The Apache Software Foundation
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

import java.util.Random;
import java.util.Set;
import java.util.HashSet;

import junit.framework.TestCase;

import org.apache.lucene.store.*;
import org.apache.lucene.document.*;
import org.apache.lucene.analysis.*;
import org.apache.lucene.index.*;

public class TestLazyBenchmark extends TestCase {

  public static int BASE_SEED = 13;

  public static int getProp(String n, int def) {
    return Integer.valueOf(System.getProperty(n,""+def)).intValue();
  }
  public static int NUM_DOCS = getProp("bench.docs",2000);
  public static int NUM_FIELDS = getProp("bench.fields",100);
  public static int NUM_ITERS = getProp("bench.iters",2000);
  public static int NUM_HITS = getProp("bench.hits",1);


  /** Workaround for a bug in lazy-loading the last field of the
   * last doc (or maybe more).
   */
  public static int FUDGE = 5;

  private static String[] data = new String[] {
    "asdf qqwert lkj weroia lkjadsf kljsdfowq iero ",
    " 8432 lkj nadsf w3r9 lk 3r4 l,sdf 0werlk anm adsf rewr ",
    "lkjadf ;lkj kjlsa; aoi2winm lksa;93r lka adsfwr90 ",
    ";lkj ;lak -2-fdsaj w309r5 klasdfn ,dvoawo oiewf j;las;ldf w2 ",
    " ;lkjdsaf; kwe ;ladsfn [0924r52n ldsanf jt498ut5a nlkma oi49ut ",
    "lkj asd9u0942t ;lkndv moaiewjut 09sadlkf 43wt [j'sadnm at [ualknef ;a43 "
  };

  private static String MAGIC_FIELD = "f"+Integer.valueOf(NUM_FIELDS / 3);

  private static FieldSelector SELECTOR = new FieldSelector() {
      public FieldSelectorResult accept(String f) {
        if (f.equals(MAGIC_FIELD)) {
          return FieldSelectorResult.LOAD;
        }
        return FieldSelectorResult.LAZY_LOAD;
      }
    };

  private static Directory makeIndex() throws RuntimeException {
    System.out.println("bench.docs   = " + NUM_DOCS);
    System.out.println("bench.fields = " + NUM_FIELDS);
    System.out.println("bench.iters  = " + NUM_ITERS);
    System.out.println("bench.hits   = " + NUM_HITS);

    Directory dir = new RAMDirectory();
    try {
      Random r = new Random(BASE_SEED + 42) ;
      Analyzer analyzer = new SimpleAnalyzer();
      IndexWriter writer = new IndexWriter(dir, analyzer, true);

      writer.setUseCompoundFile(false);

      for (int d = 1; d <= NUM_DOCS; d++) {
        Document doc = new Document();
        for (int f = 1; f <= NUM_FIELDS; f++ ) {
          doc.add(new Field("f"+f,
                            data[f % data.length]
                            + data[r.nextInt(data.length)],
                            Field.Store.YES,
                            Field.Index.TOKENIZED));
        }
        writer.addDocument(doc);
      }
      writer.close();
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
    return dir;
  }

  private static Directory DIR = makeIndex();

  /**
   * Collector for field values, in case the JVM tries to optimize
   * away the field gets.
   *
   * I'm probably being paranoid.
   */
  public static Set VALS = null;

  public void testLazy() throws Exception {
    Random r = new Random(BASE_SEED);

    IndexReader reader = IndexReader.open(DIR);
    for (int i = 0; i < NUM_ITERS; i++) {
      VALS = new HashSet();
      int docId = r.nextInt(NUM_DOCS - FUDGE);

      // zero-th lazy hit
      Document d = reader.document(docId, SELECTOR);
      VALS.add(d.get(MAGIC_FIELD));

      // remaining full hits, reuse doc
      for (int h = 1; h <= NUM_HITS; h++) {
        for (int f = 1; f <= NUM_FIELDS; f++) {
          VALS.add(d.get("f"+f));
        }
      }
      VALS = null;
    }
    reader.close();
  }

  public void testComplete() throws Exception {
    Random r = new Random(BASE_SEED);

    IndexReader reader = IndexReader.open(DIR);
    for (int i = 0; i < NUM_ITERS; i++) {
      VALS = new HashSet();
      int docId = r.nextInt(NUM_DOCS - FUDGE);

      // zero-th lazy hit
      Document d = reader.document(docId, SELECTOR);
      VALS.add(d.get(MAGIC_FIELD));

      // first full hit, fetch complete document
      d = reader.document(docId);
      for (int f = 1; f <= NUM_FIELDS; f++) {
        VALS.add(d.get("f"+f));
      }

      // remaining hits
      for (int h = 2; h <= NUM_HITS; h++) {
        for (int f = 1; f <= NUM_FIELDS; f++) {
          VALS.add(d.get("f"+f));
        }
      }
      VALS = null;
    }
    reader.close();
  }

  public void testLazyA() throws Exception { testLazy(); }
  public void testCompleteA() throws Exception { testComplete(); }

  public void testLazyB() throws Exception { testLazy(); }
  public void testCompleteB() throws Exception { testComplete(); }

  public void testLazyC() throws Exception { testLazy(); }
  public void testCompleteC() throws Exception { testComplete(); }

  public void testLazyD() throws Exception { testLazy(); }
  public void testCompleteD() throws Exception { testComplete(); }

}

[jira] Commented: (SOLR-52) Lazy Field loading

ASF GitHub Bot (Jira)
In reply to this post by ASF GitHub Bot (Jira)
    [ http://issues.apache.org/jira/browse/SOLR-52?page=comments#action_12442432 ]
           
Yonik Seeley commented on SOLR-52:
----------------------------------

FYI, we need a lucene refresh before we use lazy fields because of this:
http://issues.apache.org/jira/browse/LUCENE-683




Re: [jira] Updated: (SOLR-52) Lazy Field loading

Mike Klaas
In reply to this post by Chris Hostetter-3
On 10/14/06, Chris Hostetter <[hidden email]> wrote:
>
> : Chris: What would you like to see vis-a-vis the many field issues before
> : committing?  Should we put in a global lazy-field-disable option?
>
> Yeah, a simple solrconfig option that lets you disable it completely is
> probably a good idea (especially in light of LUCENE-683), and I don't see
> any reason why we need a more complicated solution right now.

Will do.  I think I'm going to wait a bit until the lazy-field
issues in Lucene (of which Yonik seems to be unearthing a plethora)
get ironed out before proceeding further with this issue.

-Mike

[jira] Updated: (SOLR-52) Lazy Field loading

ASF GitHub Bot (Jira)
In reply to this post by ASF GitHub Bot (Jira)
     [ http://issues.apache.org/jira/browse/SOLR-52?page=all ]

Mike Klaas updated SOLR-52:
---------------------------

    Attachment: lazyfields_patch.diff

This version adds a solrconfig parameter that allows lazy fields to be enabled or disabled (disabled by default).

Still needs testing after syncing with the Lucene changes.
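For reference, the on/off switch discussed here is a flag in the <query> section of solrconfig.xml. The element name below reflects how the option later shipped in Solr; treat it as illustrative of the shape of the setting rather than the exact form in this patch:

```xml
<query>
  <!-- Lazy field loading: disabled by default in this patch. -->
  <enableLazyFieldLoading>false</enableLazyFieldLoading>
</query>
```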



Re: Re: [jira] Updated: (SOLR-52) Lazy Field loading

Mike Klaas
In reply to this post by Chris Hostetter-3
On 10/14/06, Chris Hostetter <[hidden email]> wrote:
>
> : Chris: What would you like to see vis-a-vis the many field issues before
> : committing?  Should we put in a global lazy-field-disable option?
>
> Yeah, a simple solrconfig option that lets you disable it completely is
> probably a good idea (especially in light of LUCENE-683), and I don't see
> any reason why we need a more complicated solution right now.

Any objections to syncing Solr with Lucene trunk?  It might be nice
from an impact perspective to do so before lockless commits are
committed.

-Mike