Hi, All,
I was reading some research papers regarding quick inverted index lookups. The classical approach to skipping dictates that a skip should be positioned every sqrt(df) document pointers. I looked at the the current Lucene implementation. The skipInterval is hardcoded as follows in TermInfosWriter.java class: int skipInterval = 16; Therefore, I have two questions: 1) Would it be a good idea and feasible to use sqrt(df) to be the skipInterval, rather than hardcode it? 2) When merging segments, for every term, the skip table is buffered first in the RAMOutputStream and then written to the output stream. If there are lot of documents for a term, this seems to consume a lot of memory, right? If instead, we use sqrt(df) to be the skipInterval, the memory consumed will be a lot less, as it is logarithmic. Hope some one could shed more light on this. Thanks in advance, Jian |
Hi, All,
I should have sent to this email address rather than the old jakarta email address. Sorry if double-posted. Jian ---------- Forwarded message ---------- From: jian chen <[hidden email]> Date: Oct 15, 2005 6:36 PM Subject: skipInterval To: Lucene Developers List <[hidden email]> Hi, All, I was reading some research papers regarding quick inverted index lookups. The classical approach to skipping dictates that a skip should be positioned every sqrt(df) document pointers. I looked at the the current Lucene implementation. The skipInterval is hardcoded as follows in TermInfosWriter.java class: int skipInterval = 16; Therefore, I have two questions: 1) Would it be a good idea and feasible to use sqrt(df) to be the skipInterval, rather than hardcode it? 2) When merging segments, for every term, the skip table is buffered first in the RAMOutputStream and then written to the output stream. If there are lot of documents for a term, this seems to consume a lot of memory, right? If instead, we use sqrt(df) to be the skipInterval, the memory consumed will be a lot less, as it is logarithmic. Hope some one could shed more light on this. Thanks in advance, Jian |
Jian,
On Sunday 16 October 2005 03:42, jian chen wrote: > Hi, All, > > I should have sent to this email address rather than the old jakarta email > address. Sorry if double-posted. > > Jian > > ---------- Forwarded message ---------- > From: jian chen <[hidden email]> > Date: Oct 15, 2005 6:36 PM > Subject: skipInterval > To: Lucene Developers List <[hidden email]> > > Hi, All, > > I was reading some research papers regarding quick inverted index lookups. > The classical approach to skipping dictates that a skip should be positioned > every sqrt(df) document pointers. The typical use of skipping info in Lucene is in ConjunctionScorer, for a query with two required terms. There it helps for the case when one term occurs much less frequently than another. Iirc the sqrt() is optimal for a single lookup in a single level index, reducing the complexity from linear to logarithmic. Does the sqrt() also apply in the case of searching for two required terms and returning all the documents in which they both occur? Regards, Paul Elschot --------------------------------------------------------------------- To unsubscribe, e-mail: [hidden email] For additional commands, e-mail: [hidden email] |
Hi, Paul,
Thanks for your email. I am not sure how the sqrt vs. constant for skipInterval will pan out for two or multiple required terms. That needs some experiments I guess. Cheers, Jian On 10/16/05, Paul Elschot <[hidden email]> wrote: > > Jian, > ---------- Forwarded message ---------- > > From: jian chen <[hidden email]> > > Date: Oct 15, 2005 6:36 PM > > Subject: skipInterval > > To: Lucene Developers List <[hidden email]> > > > > Hi, All, > > > > I was reading some research papers regarding quick inverted index > lookups. > > The classical approach to skipping dictates that a skip should be > positioned > > every sqrt(df) document pointers. > > The typical use of skipping info in Lucene is in ConjunctionScorer, for a > query with two required terms. There it helps for the case when one > term occurs much less frequently than another. > Iirc the sqrt() is optimal for a single lookup in a single level index, > reducing the complexity from linear to logarithmic. > Does the sqrt() also apply in the case of searching for two required terms > and returning all the documents in which they both occur? > > Regards, > Paul Elschot > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [hidden email] > For additional commands, e-mail: [hidden email] > > > |
Free forum by Nabble | Edit this page |