[jira] Created: (LUCENE-767) maxDoc should be explicitly stored in the index, not derived from file length


[jira] Created: (LUCENE-767) maxDoc should be explicitly stored in the index, not derived from file length

JIRA jira@apache.org
maxDoc should be explicitly stored in the index, not derived from file length
-----------------------------------------------------------------------------

                 Key: LUCENE-767
                 URL: https://issues.apache.org/jira/browse/LUCENE-767
             Project: Lucene - Java
          Issue Type: Improvement
    Affects Versions: 2.0.0, 1.9, 2.0.1, 2.1
            Reporter: Michael McCandless
         Assigned To: Michael McCandless
            Priority: Minor


This is a spinoff of LUCENE-140

In general we should rely on "as little as possible" from the file system.  Right now, maxDoc is derived by checking the file length of the FieldsReader index file (.fdx) which makes me nervous.  I think we should explicitly store it instead.

Note that there are no known cases where this is actually causing a problem. There was some speculation in the discussion of LUCENE-140 that it could be one of the possible causes, but in digging / discussion there were no specifically relevant JVM bugs found (yet!).  So this would be a defensive fix at this point.
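
To make the distinction concrete, here is a minimal sketch, not Lucene's actual code. It assumes a hypothetical index file made of a 4-byte count header followed by one fixed-width 8-byte entry per document; the class name, the header layout and the 8-byte width are all assumptions chosen for illustration. It contrasts deriving the count from the reported file length with reading a count that was stored explicitly.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical illustration only; not Lucene's FieldsReader or its file format.
final class DocCountSketch {
    // Assumed width of one per-document entry in the hypothetical index file.
    static final int ENTRY_BYTES = 8;

    // Writes an explicit int count header followed by one long per document.
    static void write(Path file, long[] pointers) throws IOException {
        try (DataOutputStream out = new DataOutputStream(Files.newOutputStream(file))) {
            out.writeInt(pointers.length);
            for (long p : pointers) {
                out.writeLong(p);
            }
        }
    }

    // Derived count: trusts whatever length the file system reports.
    static int countFromLength(Path file) throws IOException {
        long payload = Files.size(file) - Integer.BYTES; // skip the 4-byte header
        return (int) (payload / ENTRY_BYTES);
    }

    // Explicit count: trusts only bytes that were actually read from the file.
    static int countFromHeader(Path file) throws IOException {
        try (DataInputStream in = new DataInputStream(Files.newInputStream(file))) {
            return in.readInt();
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("sketch", ".idx");
        try {
            write(file, new long[] {0L, 17L, 42L});
            System.out.println("from length: " + countFromLength(file));
            System.out.println("from header: " + countFromHeader(file));
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```

On a local file system the two values agree; the concern raised here is only that the length-based value depends on file system metadata being accurate at the moment it is queried.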

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


[jira] Commented: (LUCENE-767) maxDoc should be explicitly stored in the index, not derived from file length

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12463322 ]

Chuck Williams commented on LUCENE-767:
---------------------------------------

Isn't maxDoc always the same as the docCount of the segment, which is stored?  I.e., couldn't SegmentReader.maxDoc() be equivalently defined as:

  public int maxDoc() {
    return si.docCount;
  }

Since maxDoc==numDocs==docCount for a newly merged segment, and deletion with a reader never changes numDocs or maxDoc, it seems to me these values should always be the same.

All Lucene tests pass with this definition.  I have code that relies on this equivalence and so would appreciate knowledge of any case where this equivalence might not hold.
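
The reasoning above can be illustrated with a small in-memory model. This is a sketch under stated assumptions, not Lucene's SegmentReader: the SegmentSketch, SegmentInfo and Reader classes below are hypothetical stand-ins, and the model assumes that a deletion only marks a document in a bit set rather than removing it, so the value returned by maxDoc() stays equal to the stored docCount.

```java
import java.util.BitSet;

// Hypothetical model for illustration; these are not Lucene's classes.
final class SegmentSketch {
    // Stand-in for per-segment metadata that is stored explicitly.
    static final class SegmentInfo {
        final String name;
        final int docCount;

        SegmentInfo(String name, int docCount) {
            this.name = name;
            this.docCount = docCount;
        }
    }

    // Stand-in reader: maxDoc() comes from stored metadata, not from a file length.
    static final class Reader {
        private final SegmentInfo si;
        private final BitSet deleted;

        Reader(SegmentInfo si) {
            this.si = si;
            this.deleted = new BitSet(si.docCount);
        }

        int maxDoc() {
            return si.docCount;
        }

        // Marks a document as deleted; the document slot itself is kept.
        void delete(int doc) {
            if (doc < 0 || doc >= maxDoc()) {
                throw new IllegalArgumentException("no such doc: " + doc);
            }
            deleted.set(doc);
        }

        boolean isDeleted(int doc) {
            return deleted.get(doc);
        }
    }

    public static void main(String[] args) {
        Reader reader = new Reader(new SegmentInfo("_0", 5));
        reader.delete(2);
        System.out.println(reader.maxDoc() + " " + reader.isDeleted(2));
    }
}
```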



[jira] Commented: (LUCENE-767) maxDoc should be explicitly stored in the index, not derived from file length

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12463335 ]

Michael McCandless commented on LUCENE-767:
-------------------------------------------

Ooh that's great!  I think your logic is correct.

But I do see one unit test failing when I make that change locally (testIndexAndMerge in src/test/org/apache/lucene/index/TestDoc.java).  Actually, this unit test only fails with my last commit (yesterday) for LUCENE-140, because I made the checking for "docs out of order" more strict (it now catches a previously missed boundary case), and this test seems to hit that boundary case.

However, that test is buggy because it manually creates SegmentInfos with an incorrect docCount.  So I will fix the test, and commit your solution above.  Thanks!


Re: [jira] Created: (LUCENE-767) maxDoc should be explicitly stored in the index, not derived from file length

Grant Ingersoll-2
In reply to this post by JIRA jira@apache.org
Hi Michael,

Can you explain in more detail on this bug why this makes you nervous?

Thanks,
Grant

On Jan 9, 2007, at 6:41 AM, Michael McCandless (JIRA) wrote:

> maxDoc should be explicitly stored in the index, not derived from file length
> [...]

--------------------------
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org

Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/LuceneFAQ




[jira] Commented: (LUCENE-767) maxDoc should be explicitly stored in the index, not derived from file length

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12463358 ]

Michael McCandless commented on LUCENE-767:
-------------------------------------------


Carrying over from the java-dev list:


Grant Ingersoll wrote:

> Can you explain in more detail on this bug why this makes you nervous?

Well ... the only specific example I have is NFS (always my favorite
example!).

As I understand it, the NFS client typically uses a separate cache to
hold the "attributes" of the file, including file length.  This cache
often has weaker or maybe just "different" guarantees than the "data
cache" that holds the file contents.  So basically you can ask what
the file length is and get a wrong (stale) answer.  E.g., see
http://nfs.sourceforge.net, which describes Linux's NFS client
approach.  The NFS client on Apple's OS X seems to be even worse!

I think it is very likely that Lucene does not trip up on this
specifically, since a reader would only ask for this file's length for
the first time once the file is done being written (i.e., the commit of
segments_N has occurred), and so hopefully it's not in the attribute
cache yet?

I think there may very well be cases of other filesystems where
"checking file length" is risky (that we all just don't know about
(yet!)), which is why I favor using explicit values instead of relying
on file system semantics, whenever possible.

Maybe I'm just too paranoid :)

But for all the places / devices Lucene has gone and will go, relying
on the bare minimum set of IO operations I think will maximize our
overall portability.  Every filesystem has its quirks.



Re: [jira] Commented: (LUCENE-767) maxDoc should be explicitly stored in the index, not derived from file length

Robert Engels
It would appear that NFS Version 2 is not suitable for Lucene. NFS
Version 3 looks like it should work. See
http://nfs.sourceforge.net/#section_a

I will take this opportunity to state again what I've always been
told, and it seems to hold up: using NFS for shared, interactively
updated files is always going to be troublesome. They have patched it
over the years to help, but it just wasn't designed for this from the
beginning.

Unix systems never even had file system locks. It was assumed that  
shared access to shared data would be accomplished via a shared  
server - not by sharing access to the data directly. It is far more  
efficient and robust to do things this way.

Modifying a shared Lucene directory via NFS directly is always going  
to be error prone.

Why not just implement a server/parallel index solution?

On Jan 9, 2007, at 12:25 PM, Michael McCandless (JIRA) wrote:

> [...]

Re: [jira] Commented: (LUCENE-767) maxDoc should be explicitly stored in the index, not derived from file length

Robert Engels
In reply to this post by JIRA jira@apache.org
I think this is the relevant section:

   A8. What is close-to-open cache consistency?

     A. Perfect cache coherency among disparate NFS clients is very  
expensive to achieve, so NFS settles for something weaker that  
satisfies the requirements of most everyday types of file sharing.  
Everyday file sharing is most often completely sequential: first  
client A opens a file, writes something to it, then closes it; then  
client B opens the same file, and reads the changes.

     So, when an application opens a file stored in NFS, the NFS  
client checks that it still exists on the server, and is permitted to  
the opener, by sending a GETATTR or ACCESS operation. When the  
application closes the file, the NFS client writes back any pending  
changes to the file so that the next opener can view the changes.  
This also gives the NFS client an opportunity to report any server  
write errors to the application via the return code from close().  
This behavior is referred to as close-to-open cache consistency.

     Linux implements close-to-open cache consistency by comparing  
the results of a GETATTR operation done just after the file is closed  
to the results of a GETATTR operation done when the file is next  
opened. If the results are the same, the client will assume its data  
cache is still valid; otherwise, the cache is purged.

     Close-to-open cache consistency was introduced to the Linux NFS  
client in 2.4.20. If for some reason you have applications that  
depend on the old behavior, you can disable close-to-open support by  
using the "nocto" mount option.

     There are still opportunities for a client's data cache to  
contain stale data. The NFS version 3 protocol introduced "weak cache  
consistency" (also known as WCC) which provides a way of checking a  
file's attributes before and after an operation to allow a client to  
identify changes that could have been made by other clients.  
Unfortunately when a client is using many concurrent operations that  
update the same file at the same time, it is impossible to tell  
whether it was that client's updates or some other client's updates  
that changed the file.

     For this reason, some versions of the Linux 2.6 NFS client  
abandon WCC checking entirely, and simply trust their own data cache.  
On these versions, the client can maintain a cache full of stale file  
data if a file is opened for write. In this case, using file locking  
is the best way to ensure that all clients see the latest version of  
a file's data.

     A system administrator can try using the "noac" mount option to  
achieve attribute cache coherency among multiple clients. Almost  
every client operation checks file attribute information. Usually the  
client keeps this information cached for a period of time to reduce  
network and server load. When "noac" is in effect, a client's file  
attribute cache is disabled, so each operation that needs to check a  
file's attributes is forced to go back to the server. This permits a  
client to see changes to a file very quickly, at the cost of many  
extra network operations.

     Be careful not to confuse "noac" with "no data caching." The  
"noac" mount option will keep file attributes up-to-date with the  
server, but there are still races that may result in data incoherency  
between client and server. If you need absolute cache coherency among  
clients, applications can use file locking, where a client purges  
file data when a file is locked, and flushes changes back to the  
server before unlocking a file; or applications can open their files  
with the O_DIRECT flag to disable data caching entirely.

     For a better understanding of the compromises faced in the  
design of NFS caching, see Callaghan's "NFS Illustrated."
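
As a rough illustration of the file-locking suggestion in the FAQ text above, here is a minimal Java sketch that queries a file's length while holding a shared lock on it. The class and method names are hypothetical and nothing like this is implied to exist in Lucene. Whether taking the lock actually makes an NFS client revalidate its caches depends on the client and mount options, so treat this as a sketch of the pattern rather than a guarantee.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper for illustration; not part of Lucene.
final class LockedLengthSketch {
    // Reads the file length while holding a shared lock on the whole file.
    static long lengthUnderSharedLock(Path file) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ);
             FileLock lock = channel.lock(0L, Long.MAX_VALUE, true)) {
            // The lock is released, then the channel closed, when this block exits.
            return channel.size();
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("locked", ".dat");
        try {
            Files.write(file, new byte[16]);
            System.out.println(lengthUnderSharedLock(file));
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```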

On Jan 9, 2007, at 12:25 PM, Michael McCandless (JIRA) wrote:

> [...]

Re: [jira] Commented: (LUCENE-767) maxDoc should be explicitly stored in the index, not derived from file length

Michael McCandless-2
In reply to this post by Robert Engels
robert engels wrote:

> It would appear that NFS Version 2 is not suitable for Lucene. NFS
> Version 3 looks like it should work. See
> http://nfs.sourceforge.net/#section_a
>
> I will take this opportunity to state again what I've always been told,
> and it seems to hold up: using NFS for shared, interactively updated
> files is always going to be troublesome. They have patched it over the
> years to help, but it just wasn't designed for this from the beginning.
>
> Unix systems never even had file system locks. It was assumed that
> shared access to shared data would be accomplished via a shared server -
> not by sharing access to the data directly. It is far more efficient and
> robust to do things this way.
>
> Modifying a shared Lucene directory via NFS directly is always going to
> be error prone.
>
> Why not just implement a server/parallel index solution?

Actually I think now (with lockless commits) Lucene works fine over
NFS, except for the [yes, rather big] remaining issue: LUCENE-710.

But that issue, while clearly scary when you first see it, can be
easily worked around (just refresh your searchers once they hit "Stale
NFS handle").
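
A minimal sketch of that workaround, with loudly stated assumptions: the Opener and Query interfaces and the Refreshing wrapper are hypothetical and stand in for whatever searcher type the application uses, and detecting the condition by matching the words "Stale" and "handle" in the exception message is a guess about how the error surfaces, not a documented contract.

```java
import java.io.IOException;

// Hypothetical wrapper for illustration; not a Lucene API.
final class StaleHandleRetrySketch {
    // Opens a fresh searcher-like object of type S.
    interface Opener<S> {
        S open() throws IOException;
    }

    // Runs one query against a searcher-like object.
    interface Query<S, R> {
        R run(S searcher) throws IOException;
    }

    static final class Refreshing<S> {
        private final Opener<S> opener;
        private S current;

        Refreshing(Opener<S> opener) throws IOException {
            this.opener = opener;
            this.current = opener.open();
        }

        // Runs the query; on a stale-handle error, reopens and retries once.
        <R> R search(Query<S, R> query) throws IOException {
            try {
                return query.run(current);
            } catch (IOException e) {
                if (!looksStale(e)) {
                    throw e;
                }
                current = opener.open(); // a real wrapper would also close the old one
                return query.run(current);
            }
        }
    }

    // Assumption: the error text mentions a stale handle.
    static boolean looksStale(IOException e) {
        String msg = e.getMessage();
        return msg != null && msg.contains("Stale") && msg.contains("handle");
    }
}
```

Retrying only once keeps a persistent failure from looping forever; any other IOException is rethrown unchanged.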

Even once we resolve that and Lucene works over NFS, I do think the
performance will typically not be "stellar".  At least in my
experience the performance of NFS is surprisingly poor.  So I do think
for users that require high performance a replicated (like Solr)
and/or distributed index solution is probably the way to go.

Anyway, I didn't mean to turn this back into an NFS discussion.  I
just wanted to use NFS as an example of where relying on file length
for something important (maxDoc() in a segment) is possibly
dangerous.

Mike


[jira] Resolved: (LUCENE-767) maxDoc should be explicitly stored in the index, not derived from file length

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/LUCENE-767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless resolved LUCENE-767.
---------------------------------------

       Resolution: Fixed
    Fix Version/s: 2.1
