[jira] Created: (LUCENE-673) Exceptions when using Lucene over NFS

Exceptions when using Lucene over NFS
-------------------------------------

                 Key: LUCENE-673
                 URL: http://issues.apache.org/jira/browse/LUCENE-673
             Project: Lucene - Java
          Issue Type: Bug
          Components: Index
    Affects Versions: 2.0.0
         Environment: NFS server/client
            Reporter: Michael McCandless


I'm opening this issue to track details on the known problems with
Lucene over NFS.

The summary is: if you have one machine writing to an index stored on
an NFS mount, and other machine(s) reading (and periodically
re-opening) that index, then sometimes on re-opening the index the
reader will hit a FileNotFoundException.

This has hit many users because this is a natural way to "scale up"
your searching (single writer, multiple readers) across machines.  The
best current workaround (I think?) is to take the approach Solr takes
(either by actually using Solr or copying/modifying its approach) to
take snapshots of the index and then have the readers open the
snapshots instead of the "live" index being written to.

I've been working on two patches for Lucene:

  * A locking (LockFactory) implementation using native OS locks

  * Lock-less commits

(I'll open separate issues with the details for those).

I have a simple stress test where one machine is constantly adding
docs to an index over NFS, and another machine is constantly
re-opening the index searcher over NFS.
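
A minimal sketch of what such a stress test can look like (Lucene
2.0-era APIs; the class name and NFS-mounted index path are made up
for illustration -- run "writer" mode on one machine and "reader" mode
on another):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.search.IndexSearcher;

    public class NfsStress {

      static final String INDEX_PATH = "/mnt/nfs/index";  // assumed NFS mount point

      public static void main(String[] args) throws Exception {
        if ("writer".equals(args[0])) {
          runWriter();
        } else {
          runReader();
        }
      }

      // Writer machine: keeps adding docs, closing the writer periodically so
      // each close commits a new segments file that the other machine can see.
      static void runWriter() throws Exception {
        int id = 0;
        while (true) {
          IndexWriter writer = new IndexWriter(INDEX_PATH, new StandardAnalyzer(), id == 0);
          for (int i = 0; i < 100; i++) {
            Document doc = new Document();
            doc.add(new Field("id", Integer.toString(id++), Field.Store.YES, Field.Index.UN_TOKENIZED));
            doc.add(new Field("body", "stress test doc " + id, Field.Store.NO, Field.Index.TOKENIZED));
            writer.addDocument(doc);
          }
          writer.close();
        }
      }

      // Reader machine: constantly re-opens a searcher; over NFS this is where
      // the intermittent FileNotFoundException shows up.
      static void runReader() throws Exception {
        while (true) {
          IndexSearcher searcher = new IndexSearcher(INDEX_PATH);
          System.out.println("maxDoc=" + searcher.maxDoc());
          searcher.close();
        }
      }
    }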

These tests have revealed new details (at least for me!) about the
root cause of our NFS problems:

  * Even when using native locks over NFS, Lucene still hits these
    exceptions!

    I was surprised by this because I had always thought (assumed?)
    the NFS problem was because the "simple" file-based locking was
    not correct over NFS, and that switching to native OS filesystem
    locking would resolve it, but it doesn't.

    I can reproduce the "FileNotFound" exceptions even when using NFS
    V4 (the latest NFS protocol), so this is not just a "your NFS
    server is too old" issue.

  * Then, when running the same stress test with the lock-less
    changes, I don't hit any exceptions.  I've tested on NFS version
    2, 3 and 4 (using the "nfsvers=N" mount option).

I think this means that in fact (as Hoss at one point suggested I
believe), the NFS problems are likely due to the cache coherence of
the NFS file system (I think the "segments" file in particular)
against the existence of the actual segment data files.

In other words, even if you lock correctly, on the reader side it will
sometimes see stale contents of the "segments" file which lead it to
try to open a now deleted segment data file.

So I think this is good news / bad news: the bad news is, native
locking doesn't fix our problems with NFS (as at least I had expected
it to).  But the good news is, it looks like (still need to do more
thorough testing of this) the changes for lock-less commits do enable
Lucene to work fine over NFS.

[One quick side note in case it helps others: to get native locks
working over NFS on Ubuntu/Debian Linux 6.06, I had to "apt-get
install nfs-common" on the NFS client machines.  Before I did this I
would hit "No locks available" IOExceptions on calling the "tryLock"
method.  The default NFS server install on the server machine just
worked because it runs in kernel mode and starts a lockd process.]



[jira] Commented: (LUCENE-673) Exceptions when using Lucene over NFS

    [ http://issues.apache.org/jira/browse/LUCENE-673?page=comments#action_12443975 ]
           
Marvin Humphrey commented on LUCENE-673:
----------------------------------------

It seems that NFS doesn't support delete-on-last-close semantics.  That means that an IndexWriter can delete files out from underneath a cached IndexReader, and they're really gone, no?  Hence the "Stale NFS file handle" exception.


[jira] Commented: (LUCENE-673) Exceptions when using Lucene over NFS

    [ http://issues.apache.org/jira/browse/LUCENE-673?page=comments#action_12444041 ]
           
Michael McCandless commented on LUCENE-673:
-------------------------------------------

Yes, you are absolutely correct.

The current implementation of Lucene's "point in time" searching
capability (ie, once an IndexSearcher is open, it searches the
"snapshot" of the index at that point in time, even as writer(s) are
changing the index) directly relies on specific filesystem semantics
for deletes of still-open files.

But, these semantics differ drastically across filesystems:

  * On WIN32 local filesystems you get "Access Denied" when trying to
    delete open files.  Lucene catches this & retries.

  * On UNIX local filesystems, the delete succeeds but the underlying
    file is still present & usable by open file handles ("delete on
    last close") until they are closed.

  * But, on NFS, there is absolutely no support for this.  The NFS
    server (before version 4) is stateless and so makes no effort to
    let you continue to access deleted files.

This means that, at best, for NFS (with the "lock-less commits" fixes
-- still in progress) we can hope to reliably instantiate a reader
(ie, no more intermittent exceptions on loading the segments), but you
will not be able to use "point in time searching".  Meaning: when
running a search, you must expect to get a "stale NFS handle"
IOException, and re-open your index when that happens.
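
A hedged sketch of that reader-side pattern (the class name, index
path and single-retry policy are illustrative, not part of Lucene):

    import java.io.IOException;

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;

    // Retry a search once after re-opening the searcher when the files backing
    // the open commit were deleted out from under us ("Stale NFS file handle").
    public class RetryingSearch {

      private final String indexPath;
      private IndexSearcher searcher;

      public RetryingSearch(String indexPath) throws IOException {
        this.indexPath = indexPath;
        this.searcher = new IndexSearcher(indexPath);
      }

      public Hits search(Query query) throws IOException {
        try {
          return searcher.search(query);
        } catch (IOException e) {
          // Over NFS the commit we had open may be gone; re-open against the
          // current index and retry once.
          searcher.close();
          searcher = new IndexSearcher(indexPath);
          return searcher.search(query);
        }
      }

      public static void main(String[] args) throws IOException {
        RetryingSearch rs = new RetryingSearch("/mnt/nfs/index");  // hypothetical path
        Hits hits = rs.search(new TermQuery(new Term("body", "lucene")));
        System.out.println("hits=" + hits.length());
      }
    }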

I think, in the future, it would make sense to change how Lucene
implements "point in time searching" so that it doesn't rely on
filesystem semantics at all (which are clearly quite different in this
area) and, instead, explicitly keeps segments_N files (and the
segments they reference) in the filesystem until "it's decided" (via
some policy, eg, "keep the last N generations" or "keep past N days
worth") that they should be pruned.

Note that such an explicit implementation would also resolve a
limitation of the current "point in time searching" which is: you
can't close your searcher and re-open it at that same point in time.
If your searcher crashes, or the JVM crashes, or whatever, you are
forced at that point to switch to the current index.  You don't have the
freedom to re-open the snapshot you had been using.  An explicit
implementation would fix that.

The "lock-less commits" changes would make this quite straightforward
as a future change, but I'm not aiming to do that for starters --
"progress not perfection"!



I want to know what we are trying to do on Lucene

lfcx530
I want to know what we are trying to do on Lucene. I am studying Lucene and I want to improve Lucene's quality.
Can you tell me which areas I could work on to improve the things that do not satisfy us?





lfcx530
2006-10-21



From: Michael McCandless (JIRA)
Sent: 2006-10-22 05:14:59
To: [hidden email]
Cc:
Subject: [jira] Commented: (LUCENE-673) Exceptions when using Lucene over NFS


[jira] Commented: (LUCENE-673) Exceptions when using Lucene over NFS

    [ http://issues.apache.org/jira/browse/LUCENE-673?page=comments#action_12444069 ]
           
Steven Parkes commented on LUCENE-673:
--------------------------------------

This is more of an aside than anything else, but V2-3 clients do have some support for delete after close, right? The whole .nfsXXXX thing? The server doesn't really need any support, though I think some versions include a cron cleanup of old .nfsXXXX files that never got deleted.


[jira] Commented: (LUCENE-673) Exceptions when using Lucene over NFS

    [ http://issues.apache.org/jira/browse/LUCENE-673?page=comments#action_12444086 ]
           
Yonik Seeley commented on LUCENE-673:
-------------------------------------

> but V2-3 clients do have some support for delete after close, right? The whole .nfsXXXX thing?

I don't think that works across boxes though.
If host "a" opens a file, and host "b" deletes that file, host "a" won't end up with the .nfs file but will end up with a "Stale NFS file handle" instead.


[jira] Commented: (LUCENE-673) Exceptions when using Lucene over NFS

    [ http://issues.apache.org/jira/browse/LUCENE-673?page=comments#action_12444088 ]
           
Steven Parkes commented on LUCENE-673:
--------------------------------------

Yeah, I think you're right. I figured I was missing something.


[jira] Assigned: (LUCENE-673) Exceptions when using Lucene over NFS

     [ http://issues.apache.org/jira/browse/LUCENE-673?page=all ]

Michael McCandless reassigned LUCENE-673:
-----------------------------------------

    Assignee: Michael McCandless


[jira] Closed: (LUCENE-673) Exceptions when using Lucene over NFS


     [ https://issues.apache.org/jira/browse/LUCENE-673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless closed LUCENE-673.
-------------------------------------

       Resolution: Fixed
    Fix Version/s: 2.2

This issue is now resolved by both LUCENE-701 and LUCENE-710 being fixed.
As far as I know there are no other outstanding issues preventing Lucene from
working over NFS.  Here's an excerpt from an email I just sent to java-user:

As far as I know, Lucene should now work over NFS, except you will
have to make a custom deletion policy that works for your application.

Lucene had issues with NFS in three areas: locking, stale client-side
file caches and how NFS handles deletion of open files.  The first two
were fixed in Lucene 2.1 with lock-less commits (LUCENE-701) and the
last one is fixed in 2.2 with the addition of "custom deletion
policies" (LUCENE-710).

For a custom deletion policy you need to implement the
org.apache.lucene.index.IndexDeletionPolicy interface in your own
class and pass an instance of that class to your IndexWriter.  This
class tells IndexWriter when it's safe to delete older commits.  By
default Lucene uses an instance of KeepOnlyLastCommitDeletionPolicy.

The basic idea is to implement logic that can tell when your readers
are done using an older commit in the index.  For example if you know
your readers refresh themselves once per hour then your deletion
policy can safely delete any commit older than 1 hour.
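
As an illustration only, here is a minimal sketch of such a policy,
assuming the 2.2-era IndexDeletionPolicy / IndexCommitPoint API; a
"keep the last N commits" rule stands in for whatever "are my readers
done with this commit?" logic your application really needs:

    import java.util.List;

    import org.apache.lucene.index.IndexCommitPoint;
    import org.apache.lucene.index.IndexDeletionPolicy;

    // Keeps the newest numToKeep commits so readers that refresh "often
    // enough" never see their commit point deleted out from under them.
    public class KeepLastNCommitsDeletionPolicy implements IndexDeletionPolicy {

      private final int numToKeep;

      public KeepLastNCommitsDeletionPolicy(int numToKeep) {
        this.numToKeep = numToKeep;
      }

      // Called when the IndexWriter is first opened; commits are passed
      // oldest first, newest last.
      public void onInit(List commits) {
        prune(commits);
      }

      // Called each time the writer commits a new segments_N file.
      public void onCommit(List commits) {
        prune(commits);
      }

      private void prune(List commits) {
        // delete() only marks the commit as deletable; the writer removes
        // the underlying files when it is safe to do so.
        for (int i = 0; i < commits.size() - numToKeep; i++) {
          ((IndexCommitPoint) commits.get(i)).delete();
        }
      }
    }

An instance of this class would then be passed to one of the
IndexWriter constructors that accepts an IndexDeletionPolicy (added by
LUCENE-710); a time-based variant ("older than 1 hour") would follow
the same shape, just with different logic inside prune().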

But please note that while I believe NFS should work fine, this has
not been heavily tested yet.  Also note that performance over NFS is
generally not great.  If you do go down this route please report back
on any success or failure!  Thanks.



[jira] Reopened: (LUCENE-673) Exceptions when using Lucene over NFS


     [ https://issues.apache.org/jira/browse/LUCENE-673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless reopened LUCENE-673:
---------------------------------------


This is not quite resolved yet.  In the case where you have multiple
machines that can act as the writer, and the writer role can quickly
jump back and forth between them, there is at least one issue
(LUCENE-948) that prevents this from working.



[jira] Commented: (LUCENE-673) Exceptions when using Lucene over NFS


    [ https://issues.apache.org/jira/browse/LUCENE-673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12531314 ]

Michael McCandless commented on LUCENE-673:
-------------------------------------------

More updates on the status of Lucene over NFS (see details in
LUCENE-1011):

  * For the multi-writer (ie, writers on different machines) case,
    sharing an index over NFS, Lucene currently can corrupt the index.
    But the pending fix in LUCENE-1011 looks to resolve this.

  * Also in LUCENE-1011 is a set of tools to test whether locking is
    working correctly in your environment.  If you are having problems
    over NFS or some other "interesting" filesystem, it's best to
    first run the LockStressTest tool to see if it's a locking
    problem.

  * SimpleFSLockFactory seems to work in cases where
    NativeFSLockFactory does not.  So, from now on,
    SimpleFSLockFactory should be the first lock factory you try to
    use on NFS!  (A minimal configuration sketch follows below.)
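
A minimal configuration sketch (assuming the FSDirectory.getDirectory
overload that takes a LockFactory, from the 2.x lock-factory
framework; the paths are made up):

    import java.io.File;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.store.SimpleFSLockFactory;

    // Explicitly choose SimpleFSLockFactory for an NFS-resident index instead
    // of relying on the default lock factory.
    public class NfsLockingExample {
      public static void main(String[] args) throws Exception {
        File indexDir = new File("/mnt/nfs/index");      // hypothetical NFS mount
        SimpleFSLockFactory lockFactory = new SimpleFSLockFactory(indexDir);
        FSDirectory dir = FSDirectory.getDirectory(indexDir, lockFactory);

        IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
        // ... add documents ...
        writer.close();
        dir.close();
      }
    }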



[jira] Resolved: (LUCENE-673) Exceptions when using Lucene over NFS


     [ https://issues.apache.org/jira/browse/LUCENE-673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless resolved LUCENE-673.
---------------------------------------

    Resolution: Fixed
