skipInterval

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
4 messages Options
Reply | Threaded
Open this post in threaded view
|

skipInterval

jian chen
Hi, All,

I was reading some research papers regarding quick inverted index lookups.
The classical approach to skipping dictates that a skip should be positioned
every sqrt(df) document pointers.

I looked at the the current Lucene implementation. The skipInterval is
hardcoded as follows in TermInfosWriter.java class:
int skipInterval = 16;

Therefore, I have two questions:

1) Would it be a good idea and feasible to use sqrt(df) to be the
skipInterval, rather than hardcode it?

2) When merging segments, for every term, the skip table is buffered first
in the RAMOutputStream and then written to the output stream. If there are
lot of documents for a term, this seems to consume a lot of memory, right?
If instead, we use sqrt(df) to be the skipInterval, the memory consumed will
be a lot less, as it is logarithmic.

Hope some one could shed more light on this. Thanks in advance,

Jian
Reply | Threaded
Open this post in threaded view
|

Fwd: skipInterval

jian chen
Hi, All,

I should have sent to this email address rather than the old jakarta email
address. Sorry if double-posted.

Jian

---------- Forwarded message ----------
From: jian chen <[hidden email]>
Date: Oct 15, 2005 6:36 PM
Subject: skipInterval
To: Lucene Developers List <[hidden email]>

Hi, All,

I was reading some research papers regarding quick inverted index lookups.
The classical approach to skipping dictates that a skip should be positioned
every sqrt(df) document pointers.

I looked at the the current Lucene implementation. The skipInterval is
hardcoded as follows in TermInfosWriter.java class:
int skipInterval = 16;

Therefore, I have two questions:

1) Would it be a good idea and feasible to use sqrt(df) to be the
skipInterval, rather than hardcode it?

2) When merging segments, for every term, the skip table is buffered first
in the RAMOutputStream and then written to the output stream. If there are
lot of documents for a term, this seems to consume a lot of memory, right?
If instead, we use sqrt(df) to be the skipInterval, the memory consumed will
be a lot less, as it is logarithmic.

Hope some one could shed more light on this. Thanks in advance,

Jian
Reply | Threaded
Open this post in threaded view
|

Re: Fwd: skipInterval

Paul Elschot
Jian,

On Sunday 16 October 2005 03:42, jian chen wrote:

> Hi, All,
>
> I should have sent to this email address rather than the old jakarta email
> address. Sorry if double-posted.
>
> Jian
>
> ---------- Forwarded message ----------
> From: jian chen <[hidden email]>
> Date: Oct 15, 2005 6:36 PM
> Subject: skipInterval
> To: Lucene Developers List <[hidden email]>
>
> Hi, All,
>
> I was reading some research papers regarding quick inverted index lookups.
> The classical approach to skipping dictates that a skip should be positioned
> every sqrt(df) document pointers.

The typical use of skipping info in Lucene is in ConjunctionScorer, for a
query with two required terms. There it helps for the case when one
term occurs much less frequently than another.
Iirc the sqrt() is optimal for a single lookup in a single level index,
reducing the complexity from linear to logarithmic.
Does the sqrt() also apply in the case of searching for two required terms
and returning all the documents in which they both occur?

Regards,
Paul Elschot


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Fwd: skipInterval

jian chen
Hi, Paul,

Thanks for your email. I am not sure how the sqrt vs. constant for
skipInterval will pan out for two or multiple required terms. That needs
some experiments I guess.

Cheers,

Jian

On 10/16/05, Paul Elschot <[hidden email]> wrote:
>
> Jian,


> ---------- Forwarded message ----------
> > From: jian chen <[hidden email]>
> > Date: Oct 15, 2005 6:36 PM
> > Subject: skipInterval
> > To: Lucene Developers List <[hidden email]>
> >
> > Hi, All,
> >
> > I was reading some research papers regarding quick inverted index
> lookups.
> > The classical approach to skipping dictates that a skip should be
> positioned
> > every sqrt(df) document pointers.
>
> The typical use of skipping info in Lucene is in ConjunctionScorer, for a
> query with two required terms. There it helps for the case when one
> term occurs much less frequently than another.
> Iirc the sqrt() is optimal for a single lookup in a single level index,
> reducing the complexity from linear to logarithmic.
> Does the sqrt() also apply in the case of searching for two required terms
> and returning all the documents in which they both occur?
>
> Regards,
> Paul Elschot
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>
>