Beyond Lucene 2.0 Index Design


Beyond Lucene 2.0 Index Design

Dalton, Jeffery
Hi,

I wanted to start some discussion about possible future Lucene file /
index formats.  This is an extension to the discussion on Flexible
Lucene Indexing discussed on the wiki:
http://wiki.apache.org/jakarta-lucene/FlexibleIndexing

Note: Related sources are listed at the end.    

I would like to have the ability to create term frequency [Persin, et
al. 1996] or "impact" sorted [Anh, Moffat 2001, 2006] posting lists (freq
data). A posting list sorted by term frequency rather than document id
is straightforward (posting design below).  An impact-sorted list is
relatively new (and perhaps unfamiliar).  An impact is a single integer
value for a term in a document that is stored in the posting list and is
computed from the combination of the term frequency, document boost,
field boost, length norms, and other arbitrary scoring features (word
position, font, etc.) -- all local information.

The driving motivation for this change is to avoid reading the entire
posting list from disk for very long posting lists (it also leads to
simplified query-time scoring because the tf, norms, and boosts are
built into the impact).  This would address scalability issues with
large collections that have been seen in the past; back in December 2005
there were two threads: "Lucene Performance Bottlenecks" (Lucene User)
and "IndexOptimizer Re: Lucene Performance Bottlenecks" (Nutch Dev)
where Doug and Andrzej addressed some speed concerns by sorting the
Nutch index based on Document Boost (IndexSorter and a TopDocsCollector)
[inspired by Long, Suel]. The goal is that an impact-sorted posting list
would address these and other concerns in a generic manner.

Allowing a frequency or impact sorted posting list format would lead to
a posting list with the following structure:  
(Frequency or impact could be used interchangeably in the structure.
Lettering continued from Wiki)

e. <impact, num_docs, (doc1, ..., docN)>
f. <impact, num_docs, ([doc1, freq, <positions>], ..., [docN, freq, <positions>])>

The positions are delta encoded for compression.  Similarly, the
document numbers for a given frequency/impact are delta encoded.  If you
read Moffat and Persin, the papers show that this achieves compression
comparable to, or even better than, a standard delta encoded docId
sorted index.  The structure lends itself well to early termination,
pruning, etc... where the entire posting list is not read from disk.  
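
As a rough sketch of the encoding idea (plain Java, not Lucene code; the class and helper names below are made up for illustration), structure (e) can be written as impact blocks in descending impact order, with delta-encoded doc ids inside each block, so a reader can decode from the front and stop early:

import java.io.ByteArrayOutputStream;
import java.util.*;

// Illustrative sketch of structure (e): <impact, num_docs, (docs...)> blocks,
// highest impact first, doc ids delta encoded within each block.
public class ImpactPostingSketch {

    static void writeVInt(ByteArrayOutputStream out, int i) {
        while ((i & ~0x7F) != 0) { out.write((i & 0x7F) | 0x80); i >>>= 7; }
        out.write(i);
    }

    // docsByImpact: impact value -> ascending doc ids for that impact
    public static byte[] encode(SortedMap<Integer, int[]> docsByImpact) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        NavigableMap<Integer, int[]> byImpactDesc =
            new TreeMap<Integer, int[]>(docsByImpact).descendingMap();
        for (Map.Entry<Integer, int[]> block : byImpactDesc.entrySet()) {
            int[] docs = block.getValue();
            writeVInt(out, block.getKey());   // impact
            writeVInt(out, docs.length);      // num_docs
            int prev = 0;
            for (int doc : docs) {            // delta-encoded doc ids
                writeVInt(out, doc - prev);
                prev = doc;
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        SortedMap<Integer, int[]> postings = new TreeMap<Integer, int[]>();
        postings.put(7, new int[] {3, 42, 77});
        postings.put(2, new int[] {5, 6, 90});
        System.out.println("encoded bytes: " + encode(postings).length);
    }
}

A reader that only needs the best candidates decodes blocks from the front and stops once the remaining impacts can no longer matter.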

This type of Impact sorted structure (or similar concept) seems to be
becoming a "standard" solution described in a lot of new research / text
books on IR for large scale indexes.  It would be great if Lucene
supported something like this someday ;-).

Thanks,

Jeff Dalton

References:
Anh, Moffat. Vector-Space Ranking with Effective Early Termination.
2001.
Anh, Moffat. Impact Transformation: Effective and Efficient Web
Retrieval. 2006.  
Anh, Moffat. Pruned Query Evaluation Using Pre-Computed Impacts. 2006.
Long, Suel. Optimized Query Execution in Large Search Engines with Global
Page Ordering.
Manning, Raghavan, Schutze. Introduction to Information Retrieval,
Chapters 2,7.
http://www-csli.stanford.edu/%7Eschuetze/information-retrieval-book.html

Persin, et al. Filtered Document Retrieval with Frequency-Sorted
Indexes. 1996.







Re: Beyond Lucene 2.0 Index Design

Marvin Humphrey

On Jan 9, 2007, at 6:25 AM, Dalton, Jeffery wrote:

> e. <impact, num_docs, (doc1,...docN)>
> f. <impact, num_docs, ([doc1, freq ,<positions>],...[docN, freq
> ,<positions>])

Does the impact have any use after it's used to sort the postings?  
Can we leave it out of the index format and recalculate at merge-time?

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/




Re: Beyond Lucene 2.0 Index Design

Doron Cohen
In reply to this post by Dalton, Jeffery
Scoring today goes doc-at-a-time - all scorers and term-posting-readers
advance together; once a new doc is processed, scoring of previous docs is
known and final. This allows maintaining a finite size queue for collecting
best hits. But for huge collections, having to exhaustively scan all
posting lists becomes a scalability issue.

If I understand the proposal here correctly, it is suggested to order
posting lists by possible score contribution (frequency/impact) - in
particular, two different terms A and B both appearing in two docs X and Y
may have the docs in a different order - X Y for A and Y X for B. This
flexibility/granularity, while enabling early-termination and pruning,
would require a term-at-a-time scoring algorithm, which would be quite a
change in Lucene scoring.

The work cited seems less flexible - only a single value is maintained per
document - e.g. PageRank - and then document IDs are shuffled so that in
the resulting index the posting lists are still ordered - as today - by
docids, but satisfy: rank(i) >= rank(i+1).

If the latter (single-value-per-doc) approach is taken, one still needs to
decide when to sort - externally, when exporting the (large) index for
search, or internally, as part of merge. If done as part of merge, each
(result) segment is always ordered (by rank); merging would become more
tricky, since you don't want to exhaust memory while merging, but I think
it is doable.

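As a minimal sketch of the term-at-a-time accumulation Doron describes (the classes below are hypothetical, not Lucene's Scorer API): each term's impact-sorted postings are drained into per-document accumulators, and no score is final until every retained posting of every term has been read, which is why the finite doc-at-a-time queue no longer suffices on its own:

import java.util.*;

public class TermAtATimeSketch {

    static final class Posting {
        final int doc; final float contribution;
        Posting(int doc, float contribution) { this.doc = doc; this.contribution = contribution; }
    }

    // Each list is sorted by descending contribution (e.g. idf x impact);
    // 'prune' stops reading a list once contributions fall below a threshold.
    public static Map<Integer, Float> score(List<List<Posting>> postingsPerTerm, float prune) {
        Map<Integer, Float> acc = new HashMap<Integer, Float>();
        for (List<Posting> postings : postingsPerTerm) {
            for (Posting p : postings) {
                if (p.contribution < prune) break;   // early termination within this list
                Float prev = acc.get(p.doc);
                acc.put(p.doc, (prev == null ? 0f : prev) + p.contribution);
            }
        }
        return acc;   // caller selects the top k, e.g. with a bounded priority queue
    }

    public static void main(String[] args) {
        List<Posting> termA = Arrays.asList(new Posting(7, 3.0f), new Posting(2, 1.5f));
        List<Posting> termB = Arrays.asList(new Posting(2, 2.5f), new Posting(9, 0.4f));
        System.out.println(score(Arrays.asList(termA, termB), 0.5f));
    }
}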
"Dalton, Jeffery" <[hidden email]> wrote on 09/01/2007 06:25:33:

> Hi,
>
> I wanted to start some discussion about possible future Lucene file /
> index formats.  This is an extension to the discussion on Flexible
> Lucene Indexing discussed on the wiki:
> http://wiki.apache.org/jakarta-lucene/FlexibleIndexing
>
> Note: Related sources are listed at the end.
>
> I would like to have the ability to create term frequency [Persin, et
> al. 1996] or "impact" sorted [Anh, Moffat 2001,2006] posting lists (freq
> data) . A posting list sorted by Term frequency rather than document id
> is straightforward (posting design below).  An Impact sorted list is
> relatively new (and perhaps unfamiliar).  An Impact is a single integer
> value for a term in a document that is stored in the posting list and is
> computed from the combination of the term frequency, document boost,
> field boost, length norms, and other arbitrary scoring features (word
> position, font, etc...) -- all local information.
>
> The driving motivation for this change is to avoid reading the entire
> posting list from disk for very long posting lists (it also leads to
> simplified query-time scoring because the tf, norms, and boosts are
> built into the impact).  This would address scalability issues with
> large collections that have been seen in the past; back in December 2005
> there were two threads: "Lucene Performance Bottlenecks" (Lucene User)
> and "IndexOptimizer Re: Lucene Performance Bottlenecks" (Nutch Dev)
> where Doug and Andrzej addressed some speed concerns by sorting the
> Nutch index based on Document Boost (IndexSorter and a TopDocsCollector)
> [inpsired by Long, Suel]. The goal is that an impact sorted posting list
> would address these and other concerns in a generic manner.
>
> Allowing a frequency or impact sorted posting list format would lead to
> a posting list with the following structure:
> (Frequency or impact could be used interchangeably in the structure.
> Lettering continued from Wiki)
>
> e. <impact, num_docs, (doc1,...docN)>
> f. <impact, num_docs, ([doc1, freq ,<positions>],...[docN, freq
> ,<positions>])
>
> The positions are delta encoded for compression.  Similarly, the
> document numbers for a given frequency/impact are delta encoded.  If you
> read Moffat and Persin, the papers show that this achieves compression
> comparable to, or even better than, a standard delta encoded docId
> sorted index.  The structure lends itself well to early termination,
> pruning, etc... where the entire posting list is not read from disk.
>
> This type of Impact sorted structure (or similar concept) seems to be
> becoming a "standard" solution described in a lot of new research / text
> books on IR for large scale indexes.  It would be great if Lucene
> supported something like this someday ;-).
>
> Thanks,
>
> Jeff Dalton
>
> References:
> Anh, Moffat. Vector-Space Ranking with Effective Early Termination.
> 2001.
> Anh, Moffat. Impact Transformation: Effective and Efficient Web
> Retrieval. 2006.
> Anh, Moffat. Pruned Query Evaluation Using Pre-Computed Impacts. 2006.
> Long, Suel. Optimized Query Execution in Large Search Engine with Global
> Page Ordering.
> Manning, Raghavan, Schutze. Introduction to Information Retrieval,
> Chapters 2,7.
> http://www-csli.stanford.edu/%7Eschuetze/information-retrieval-book.html
>
> Persin, et al. Filtered Document Retrieval with Frequency-Sorted
> Indexes. 1996.
>
>
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

RE: Beyond Lucene 2.0 Index Design

Dalton, Jeffery
In reply to this post by Dalton, Jeffery
Doron -- you have the idea.  

And yes, it would be a substantial change to Lucene scoring.  Ideally,
Lucene / doc format would be changed in such a way to support both docId
sorted indexes (and doc-at-a-time processing) and frequency/impact
sorted indexes with term-at-a-time or even score-at-a-time processing.
I think we could do a lot to generalize Lucene to support both modes (at
least at index level).

You do a good job describing doc and term at a time processing.
Score-at-a-time processing is where the posting list with the highest
IDF x current Impact score is read -- the next chunk of index that can
contribute the most to overall document score.  
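
A small sketch of that score-at-a-time loop, assuming one impact-sorted cursor per query term (the types and method names are illustrative, not proposed Lucene classes): a priority queue always reads next from whichever term's next block can contribute the most, and stops once the remaining contributions cannot change the top results:

import java.util.PriorityQueue;

final class TermCursor {
    final float idf;
    final int[] impacts;   // descending, one per block in the impact-sorted list
    int block = 0;
    TermCursor(float idf, int[] impacts) { this.idf = idf; this.impacts = impacts; }
    boolean exhausted() { return block >= impacts.length; }
    float nextContribution() { return idf * impacts[block]; }
    void readBlock() {
        // a full reader would decode <impact, num_docs, docs...> here and add
        // idf * impact to each doc's accumulator; this sketch just logs and advances
        System.out.println("read block, contribution " + nextContribution());
        block++;
    }
}

public class ScoreAtATimeSketch {
    public static void run(TermCursor[] cursors, int maxBlocks) {
        PriorityQueue<TermCursor> pq = new PriorityQueue<TermCursor>(
            Math.max(1, cursors.length),
            (a, b) -> Float.compare(b.nextContribution(), a.nextContribution()));
        for (TermCursor c : cursors) if (!c.exhausted()) pq.add(c);
        for (int i = 0; i < maxBlocks && !pq.isEmpty(); i++) {
            TermCursor best = pq.poll();   // largest remaining contribution
            best.readBlock();
            if (!best.exhausted()) pq.add(best);
        }
    }

    public static void main(String[] args) {
        run(new TermCursor[] {
            new TermCursor(2.0f, new int[] {9, 4, 1}),
            new TermCursor(0.5f, new int[] {7, 7, 2})
        }, 4);
    }
}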

Doron, the less flexible work you mention is, I assume, Long and Suel's
Global Page Ordering.  Indeed, I agree, this is inflexible, requires all
documents to be renumbered, and only sorts by Page Rank.  It's too high
level (only applies to web docs) for Lucene to support.  Anh/Moffat's
Impact sorting is quite flexible and generally has wide application
because it provides one score per term in a document - any local
document information can be factored into the score at index creation
time (such as multiple field or positional boosts).

As you say Doron, I think the best approach is for the sort to be
maintained at segment merge time: an ImpactSortedSegmentMerger (or a
similarly named class) would merge-sort the postings by score as it
writes them to disk. It's definitely doable -- in fact I have created
classes to do it, but doing so required significant changes to other
parts of Lucene to calculate the score and get it down to the
DocumentWriter / SegmentMerger level.    

On the search side there needs to be ScoreSortedTermDocs and
ScoreSortedTermPositions objects to read the new format, along with
Scorers to perform scoring and intersection.  

The end product would be a very scalable and flexible solution.  

- Jeff


RE: Beyond Lucene 2.0 Index Design

Dalton, Jeffery
In reply to this post by Dalton, Jeffery
I'm not sure we fully understand one another, but I'll try to explain
what I am thinking.

Yes, it has use after sorting.  It is used at query time for document
scoring in place of the TF and length norm components  (new scorers
would need to be created).  

Using an impact based index moves most of the scoring from query time to
index time (trades query time flexibility for greatly improved query
search performance).  Because the field boosts, length norm, position
boosts, etc... are incorporated into a single document-term-score, you
can use a single field at search time.  It allows one posting list per
query term instead of the current one posting list per field per query
term (MultiFieldQueryParser wouldn't be necessary in most cases).  In
addition to having fewer posting lists to examine, you often don't need
to read to the end of long posting lists when processing with a
score-at-a-time approach; see Anh/Moffat's Pruned Query Evaluation Using
Pre-Computed Impacts (SIGIR 2006) for details on one potential
algorithm.

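For concreteness, here is one way an index-time impact could be computed and quantized into a small integer; the particular formula and the 5-bit quantization are illustrative assumptions, not the exact transformation from the Anh/Moffat papers:

public class ImpactComputationSketch {

    // Quantize a positive float weight into an n-bit integer in 1..(2^bits - 1).
    static int quantize(float weight, int bits, float maxWeight) {
        int levels = (1 << bits) - 1;
        float clipped = Math.min(weight, maxWeight);
        return Math.max(1, Math.round(clipped / maxWeight * levels));
    }

    // One possible local (document-only) weight: tf damped by document length,
    // scaled by field and document boosts, then quantized to 5 bits.
    static int impact(int tf, int docLength, float fieldBoost, float docBoost) {
        float weight = (float) ((1.0 + Math.log(tf)) / Math.sqrt(docLength))
                       * fieldBoost * docBoost;
        return quantize(weight, 5, 2.0f);
    }

    public static void main(String[] args) {
        System.out.println(impact(3, 250, 1.0f, 1.0f));   // ordinary body occurrence
        System.out.println(impact(1, 12, 3.0f, 1.0f));    // short, boosted title field
    }
}

At query time a scorer would then only multiply the stored integer by the term's idf (or a quantized idf), instead of recomputing tf, norms, and boosts.
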
I'm not quite sure what you mean when you mention leaving them out and
re-calculating them at merge time.

- Jeff


Re: Beyond Lucene 2.0 Index Design

jian chen
Hi, Jeff,

I like the idea of impact-based scoring. However, could you elaborate more
on why we only need to use a single field at search time?

In Lucene, the indexed terms are field specific, and two terms, even if they
are the same, are still different terms if they are of different fields.

So I think the multiple-field scenario is still needed, right? What if the
user wants to search on both subject and content for emails, for example,
and sometimes only on the subject? Without multiple fields, how would this
type of task be handled?

I got lost on this -- could anyone educate me?

Thanks,

Jian


Re: Beyond Lucene 2.0 Index Design

jian chen
In reply to this post by Dalton, Jeffery
Hi, Jeff,

Also, how would phrase-based queries be handled?

For example, here are two posting lists:

TermA: X Y
TermB: Y X

I am not sure how you would return document X or Y for a search of the
phrase "TermA Term B". Which should come first?

Thanks,

Jian


Re: Beyond Lucene 2.0 Index Design

Ming Lei-4
I have a couple of questions about the original post
of the new index design:

(1) Question on the posting list:
> f. <impact, num_docs, ([doc1, freq ,<positions>],...[docN, freq ,<positions>])
What is the "impact" per posting list? I am under the
impression that "impact" or "frequency" is per pair of
doc and term.

And it seems that "impact" or "frequency" needs to be stored for each doc
on the posting list of a term. There are two reasons: to efficiently stop
the traversal at some point at search time by looking at the "impact"
value, and to get the component score without recalculation at search time.


(2) I wonder whether Lucene is really based upon the vector-space model.
I am under the impression that the hits are selected using a boolean model
and only the scoring (on the hit set) uses the vector-space model.

If so, the effect on boolean queries is not very positive.

For a query like "termA AND termB", I suppose the posting lists of both A
and B have to be fully traversed, right? Partial traversal is only possible
for disjunctions or single-term queries. And the join of the two posting
lists will be more costly than on the original docID-sorted posting lists.

(3) As to Jian's question about phrases: a phrase query is a special case
of a conjunctive boolean query.

Michael



Re: Beyond Lucene 2.0 Index Design

Ming Lei-4
In reply to this post by jian chen
Just my two cents,
I think what he meant by "single field" is the
following:

If the concept of "field" was introduced to differentiate the significance
of term occurrences in different regions of a document (e.g., an occurrence
in the title is more important than one in the body), that significance can
alternatively be represented by (or at least encoded in) an "impact", which
is per occurrence of a term.
For example, if the base "impact" value for a term occurrence is 1.0, you
can assign an additional 0.5 to an occurrence in the title; thus you would
have an impact of 1.5 for the title occurrence of that term, and 1.0 for
the body occurrence.

Does this make sense to you?

I feel that people should take a look at the theoretical retrieval model
first. It is not clear to me that Lucene is a fully vector-space model.
It seems that many assumptions have been made about the context of the
discussion.

Michael


Re: Beyond Lucene 2.0 Index Design

Ming Lei-4
A small correction: impact does not have to be per occurrence of a term,
but rather is most likely per aggregation of all occurrences of a term in
a document (per pair of term and doc). Thus you just aggregate the
significance of occurrences in different regions of a doc at index time
and put the aggregated significance into the "impact". Then you can do
away with fields in a vector-space model of retrieval.

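A small sketch of that aggregation, with made-up region weights (none of this is an existing Lucene structure): per-occurrence significance is summed per (term, doc) and rounded to a single integer impact:

import java.util.HashMap;
import java.util.Map;

public class OccurrenceAggregationSketch {

    static final Map<String, Float> REGION_WEIGHT = new HashMap<String, Float>();
    static {
        REGION_WEIGHT.put("title", 1.5f);   // illustrative weights
        REGION_WEIGHT.put("body", 1.0f);
    }

    // occurrencesByRegion: region name -> occurrences of the term in that region
    static int impact(Map<String, Integer> occurrencesByRegion) {
        float sum = 0f;
        for (Map.Entry<String, Integer> e : occurrencesByRegion.entrySet()) {
            sum += REGION_WEIGHT.get(e.getKey()) * e.getValue();
        }
        return Math.max(1, Math.round(sum));   // one integer impact per (term, doc)
    }

    public static void main(String[] args) {
        Map<String, Integer> occ = new HashMap<String, Integer>();
        occ.put("title", 1);
        occ.put("body", 2);
        System.out.println(impact(occ));   // 1*1.5 + 2*1.0 = 3.5 -> 4
    }
}
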
But there are usually some semantics of fields in a boolean model and in
semi-structured information retrieval, which you cannot get rid of.

Michael



Re: Beyond Lucene 2.0 Index Design

Ming Lei-4
In reply to this post by Ming Lei-4
The idea of "impact" and "impact-sorted posting lists" should practically
work with the boolean model, by approximation, in the following way:

(1) Index Structure
Inverted-Index: <term, posting-list>*
posting-list: <impact, docID, occurrence*>+  (sorted by impact)
occurrence: position

(2) Retrieval Algorithm for boolean query "a AND b"

set an impact threshold imp = imp0

get the set of docs (each with impact) from a's posting list with impact
of each doc larger than imp.
get the set of docs from b's posting list the same way.
join the two sets by docID and aggregate the impacts of the same doc from
the two sets, and sort the result set by impact.
(for a union query, union the two sets and aggregate impacts by docID....)

if the result set size is too small,
  set a lower imp and **redo** the above.

Note: If your major concern is relevance rather than recall for your
retrieval (e.g., on the web), you will seldom hit the "redo" for
conjunctive boolean queries.

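A rough Java translation of the algorithm above (the data structures, the halving of the cutoff, and the stopping rule are all illustrative choices, not a proposed Lucene API):

import java.util.*;

public class ThresholdedAndSketch {

    static final class Posting {
        final int doc; final int impact;
        Posting(int doc, int impact) { this.doc = doc; this.impact = impact; }
    }

    // Postings are assumed sorted by descending impact, so we can stop at the cutoff.
    static Map<Integer, Integer> above(List<Posting> postings, int cutoff) {
        Map<Integer, Integer> m = new HashMap<Integer, Integer>();
        for (Posting p : postings) {
            if (p.impact <= cutoff) break;
            m.put(p.doc, p.impact);
        }
        return m;
    }

    // "a AND b": intersect the high-impact prefixes, aggregate impacts,
    // and relax the cutoff and redo if too few results survive.
    static Map<Integer, Integer> and(List<Posting> a, List<Posting> b, int cutoff, int wanted) {
        while (true) {
            Map<Integer, Integer> da = above(a, cutoff);
            Map<Integer, Integer> db = above(b, cutoff);
            Map<Integer, Integer> hits = new HashMap<Integer, Integer>();
            for (Map.Entry<Integer, Integer> e : da.entrySet()) {
                Integer other = db.get(e.getKey());
                if (other != null) hits.put(e.getKey(), e.getValue() + other);
            }
            if (hits.size() >= wanted || cutoff == 0) return hits;
            cutoff /= 2;   // lower the threshold and redo
        }
    }

    public static void main(String[] args) {
        List<Posting> a = Arrays.asList(new Posting(3, 9), new Posting(11, 7), new Posting(5, 2));
        List<Posting> b = Arrays.asList(new Posting(11, 8), new Posting(3, 3), new Posting(8, 1));
        System.out.println(and(a, b, 6, 2));
    }
}
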
But often Lucene is used in an environment of small
corpus and semi-structured data.


Michael


 

Re: Beyond Lucene 2.0 Index Design

Grant Ingersoll-4
In reply to this post by Dalton, Jeffery
Hi Jeff,

Wondering if you (and/or others) would be interested in taking a look
at https://issues.apache.org/jira/browse/LUCENE-662 and vetting the
new interfaces, etc., to see if you could come up w/ a prototype
implementation.  This would help move along 662, as it would sort out
some of the issues that come up with the new interface-based approach.
(I often find it takes at least two implementations to fully flesh
out an abstraction like this.)

-Grant


------------------------------------------------------
Grant Ingersoll
http://www.grantingersoll.com/




Re: Beyond Lucene 2.0 Index Design

Marvin Humphrey
In reply to this post by Dalton, Jeffery

On Jan 9, 2007, at 6:25 AM, Dalton, Jeffery wrote:
> e. <impact, num_docs, (doc1,...docN)>
> f. <impact, num_docs, ([doc1, freq ,<positions>],...[docN, freq
> ,<positions>])

How do you build an efficient PhraseScorer to work with an impact-
sorted posting list?

The way PhraseScorer currently works is: find a doc that contains all  
terms, then see if the terms occur consecutively in phrase order,  
then determine a score.  The TermDocs objects feeding PhraseScorer  
return doc_nums in  ascending order, so finding an intersection is  
easy.  But if the document numbers are returned in what looks to the  
PhraseScorer like random order... ??

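For reference, a sketch of the ordered, leapfrog-style intersection that doc-id-sorted postings make possible (hypothetical cursor interface, not Lucene's TermDocs/TermPositions); the whole scheme depends on doc ids arriving in ascending order, which an impact-sorted list gives up:

interface DocCursor {
    int doc();                    // current doc id
    boolean next();               // advance to the next doc; false if exhausted
    boolean skipTo(int target);   // advance to the first doc >= target; false if exhausted
}

public class OrderedIntersectionSketch {

    static final class ArrayCursor implements DocCursor {
        private final int[] docs; private int i = -1;
        ArrayCursor(int[] docs) { this.docs = docs; }
        public int doc() { return docs[i]; }
        public boolean next() { return ++i < docs.length; }
        public boolean skipTo(int target) {
            while (next()) if (doc() >= target) return true;
            return false;
        }
    }

    // Emits docs containing both terms; a PhraseScorer would then check
    // positions for adjacency before scoring the candidate.
    static void intersect(DocCursor a, DocCursor b) {
        if (!a.next() || !b.next()) return;
        while (true) {
            if (a.doc() == b.doc()) {
                System.out.println("candidate doc " + a.doc());
                if (!a.next()) return;
            } else if (a.doc() < b.doc()) {
                if (!a.skipTo(b.doc())) return;
            } else {
                if (!b.skipTo(a.doc())) return;
            }
        }
    }

    public static void main(String[] args) {
        intersect(new ArrayCursor(new int[] {1, 4, 7, 9}),
                  new ArrayCursor(new int[] {2, 4, 9, 12}));
    }
}
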
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/




Re: Beyond Lucene 2.0 Index Design

jian chen
I also have the same question. It seems very hard to do phrase-based
queries efficiently.

I think most search engines do phrase-based queries, or at least appear to.
So, like in Google, the query result must contain all the words the user
searched on.

It seems to me that the impact-sorted list makes sense if you are trying
to do pure vector-space-based ranking. This is from what I have read in
the research papers. They all talk about how to optimize the vector space
model using this impact-sorted list approach.

Unfortunately, the vector space model has serious drawbacks. It does not
take inter-word relations into account. Thus, it could produce a result
where documents matching only some keywords rank higher than documents
matching all of them.

I have yet to see whether the impact-sorted list approach could handle this
efficiently.

Cheers,

Jian


Re: Beyond Lucene 2.0 Index Design

Marvin Humphrey

On Jan 11, 2007, at 2:30 PM, jian chen wrote:

> It seems to me that the impact-sorted list makes sense if you are trying
> to do pure vector space based ranking. This is from what I have read from
> the research papers. They all talk about how to optimize the vector space
> model using this impact-sorted list approach.

That makes sense.  It would work well for Lucene queries that happen  
to mimic pure vector queries: a simple TermQuery, or a BooleanQuery  
consisting entirely of "should" clauses wrapping TermQueries.

Let's explore how it would work for other varieties of BooleanQuery...

   A -B

Not so good.  We need to iterate over all the doc nums for B.  We
could start by fetching only some of the docs for A, and then  
iterating over all the docs for B and seeing whether we still have  
enough docs after filtering. If we don't, and we have to go back, we  
have to start over from scratch iterating over B.  Either that or we  
need to save set B in a BitVector just in case.  Yuck.

   A +B

Hmm.  You start with the highest ranking docs for B.  You stop when  
the impact of A has dropped low enough that no subsequent document  
can displace a current doc in the HitQueue.  But what if B is a low-
scoring term compared to A, e.g. 'dinosaur +park'?  You'll have to
go deep into the list for 'park'.  And things just get really complicated
and murky.  I can't see how anything less than scoring all docs like  
we do now will reliably produce decent precision.

   +A +B

Ouch.  This is just a slightly less severe version of the phrase  
matching conundrum.  I don't see a good way to handle this at all.  
Basically, you need to load two BitVectors and take the intersection,  
then go back and score.

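A sketch of that fallback, assuming plain arrays of doc ids and impacts (illustrative only): B has to be read in full and materialized before A can be filtered, which is exactly the disk work impact-sorting was meant to avoid:

import java.util.BitSet;

public class RequiredClauseFallbackSketch {

    static BitSet materialize(int[] docsOfB, int maxDoc) {
        BitSet bits = new BitSet(maxDoc);
        for (int doc : docsOfB) bits.set(doc);   // full read of B's posting list
        return bits;
    }

    // Walk A in impact order and keep only docs that also contain B.
    static void filterA(int[] docsOfA, int[] impactsOfA, BitSet bDocs) {
        for (int i = 0; i < docsOfA.length; i++) {
            if (bDocs.get(docsOfA[i])) {
                System.out.println("doc " + docsOfA[i] + " impact " + impactsOfA[i]);
            }
        }
    }

    public static void main(String[] args) {
        int[] docsOfB = {4, 9, 15};
        int[] docsOfA = {9, 2, 15, 30};      // impact order, not doc id order
        int[] impactsOfA = {8, 7, 5, 1};
        filterA(docsOfA, impactsOfA, materialize(docsOfB, 32));
    }
}
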
I think we can stop there.  I don't see any way to make an impact-
sorted posting list work efficiently with boolean logic.  It only  
works with pure vector space.

The trick they used in Nutch wasn't to reorder the postings.  What  
they did was sort the entire index so that documents were ordered by  
descending document boost.  Say you want 10 results.  If you grab the  
first, say, 100 hits, and sort them by score, the odds are pretty  
good that you'll find most of the same docs you would have found  
after crunching through your entire index.  Grab the first 1000 hits,  
and the odds are even better.  (That's assuming that document boost  
plays a dominant role in your scoring algorithm.)  It's a good  
heuristic.  However, it doesn't coexist happily with incremental  
indexing.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/




Re: Beyond Lucene 2.0 Index Design

Ming Lei-4
Marvin,
Several posts back on this thread, I talked about an algorithm using
impact-sorted posting lists for conjunctive boolean queries. Your concerns
about impact-sorting in a boolean retrieval model are valid. But
practically, the approximation (as in my original post) should work well
enough for a large corpus and relevancy-driven retrieval.

The savings on disk access for a large corpus (which implies very long
posting lists) will be huge with impact-sorted posting lists.

I also see that a flexible framework supporting multiple indexing schemes
would make Lucene a general search framework and test bed for innovative
search algorithms, and gain it further ground in the research community.

I would be happy to put my efforts into this.

Michael Lei
Search Labs
Live Search, Microsoft Corp.


Re: Beyond Lucene 2.0 Index Design

Marvin Humphrey

On Jan 11, 2007, at 8:37 PM, Ming Lei wrote:

> But practically, the approximation (as in my original
> post) should work well enough for large corpus and
> relevancy-driven retrieval.

> The saving on disk access for large corpus (implies
> very long posting list) will be huge by impact-sorted
> posting list.

For 'A OR B', or a simple TermQuery, absolutely.  But for anything  
else, I'm skeptical.  How about...

    'townshend OR (the AND who)'

You'll start with a small number of docs which have both 'the' and  
'who'.   And then, you'll vet them to see if they contain  
'townshend'; only a minuscule portion will, so you'll have to go
redo... and redo... and redo...

All that redo-ing is going to be considerably less efficient than  
just plowing through the complete posting lists in the first place,  
right?  That's a lot of disk seeks... not to mention all those  
redundant VInt reads.

And each time you redo, you'll need to build an intersection using  
bitvectors or whatever, which requires RAM and processor above and  
beyond what's needed to drop a doc-score pair into a HitCollector.

And trying to tune those heuristics so that you can tell where the  
proper pruning threshold lies for a given clause in the middle of a  
complex boolean query... the scorers are already the hardest part of  
Lucene to grok.  Won't this add more complexity to the place where we  
already have more than enough?  I shudder to think how difficult it  
would become to uncover bugs in a pruning BooleanScorer.

I really want this to work out, because I can put it into KinoSearch  
0.20 if it does -- backwards compatibility is already out the  
window.  But I just don't see how to make it happen.

Can you show us some code or pseudo-code for a BooleanScorer that  
would use impact-sorted posting lists?

> I also see that a flexible framework to support
> multiple indexing scheme will make Lucene a general
> search framework and test bed for innovative search
> algorithms and gain further ground in research
> community.

Yes, that's the idea.  There are limits to the amount of flexibility  
we can offer, though.  With a unified postings file, especially with  
one file per field, changing the amount of information in each  
posting is reasonably straightforward.  Changing the sort order,  
OTOH, is not, because every component in the scoring apparatus  
assumes that it will be fed document numbers in ascending order.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
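
For the case that does benefit -- a plain TermQuery over an impact-sorted list -- a rough sketch of the early-termination loop, using the same hypothetical ImpactPostings reader as above and treating the impact itself as the score (a simplification, not the Anh/Moffat algorithm), might look like this:

import java.util.ArrayList;
import java.util.List;

final class TermQueryEarlyTerminationSketch {
    // Blocks arrive in descending impact order, and a single term's score
    // here is just its impact, so once k hits are collected and the current
    // impact is strictly lower than the weakest collected score, no later
    // posting can displace anything -- stop reading the list.
    static List<int[]> topK(ImpactPostings postings, int k) {
        List<int[]> hits = new ArrayList<int[]>();   // {doc, impact} pairs
        int weakest = Integer.MAX_VALUE;
        while (postings.next()) {
            if (hits.size() >= k && postings.impact() < weakest) {
                break;                               // early termination
            }
            hits.add(new int[] { postings.doc(), postings.impact() });
            weakest = Math.min(weakest, postings.impact());
        }
        return hits;                                 // may hold more than k on ties
    }
}
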





RE: Beyond Lucene 2.0 Index Design

Dalton, Jeffery
In reply to this post by Dalton, Jeffery
The reason is performance on large collections.  The common case is that
users don't care what field they are searching -- they just want the
most relevant results, fast!  If you need to restrict querying to a
certain field only (subject, URL, etc...) you still need to index that.
However, for many applications you can get by with the one "body" field
with the field boosts integrated into the impact term score.

The short answer is, if there are scenarios where you need to do it:
consider doing both.
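
A rough sketch of what folding tf, boosts, and the length norm into one quantized impact at index time might look like; the formula and the one-byte quantization below are placeholder choices, not something specified in this thread or in the cited papers:

final class ImpactComputationSketch {
    // Collapse per-field statistics into one small integer so that a single
    // posting in the combined "body" field carries the whole index-time
    // score for this (term, document) pair.  Assumes tf >= 1.
    // NOTE: the dampening, norm, and quantization are illustrative
    // assumptions, not a formula proposed in this thread.
    static int impact(int tf, float fieldBoost, float docBoost, int fieldLength) {
        double raw = (1.0 + Math.log(tf))       // dampened term frequency
                   * fieldBoost                 // index-time field boost
                   * docBoost                   // index-time document boost
                   / Math.sqrt(fieldLength);    // crude length norm
        int quantized = (int) Math.round(raw * 32.0);
        return Math.max(1, Math.min(255, quantized));  // fit in one byte
    }
}
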

> -----Original Message-----
> From: jian chen [mailto:[hidden email]]
> Sent: Wednesday, January 10, 2007 5:12 PM
> To: [hidden email]
> Subject: Re: Beyond Lucene 2.0 Index Design
>
> Hi, Jeff,
>
> I like the idea of impact based scoring. However, could you
> elaborate more on why we only need to use a single field at
> search time?
>
> In Lucene, the indexed terms are field specific, and two
> terms, even if they are the same, are still different terms
> if they are of different fields.
>
> So, I think the multiple-field scenario is still needed,
> right? What if the user wants to search on both subject and
> content for emails, for example, and sometimes only wants to
> search on subject? Without multiple fields, how would this
> type of task be handled?
>
> I got lost on this; could anyone educate me?
>
> Thanks,
>
> Jian
>
> On 1/9/07, Dalton, Jeffery <[hidden email]> wrote:
> >
> > I'm not sure we fully understand one another, but I'll try
> to explain
> > what I am thinking.
> >
> > Yes, it has use after sorting.  It is used at query time
> for document
> > scoring in place of the TF and length norm components  (new scorers
> > would need to be created).
> >
> > Using an impact based index moves most of the scoring from
> query time
> > to index time (trades query time flexibility for greatly improved
> > query search performance).  Because the field boosts, length norm,
> > position boosts, etc... are incorporated into a single
> > document-term-score, you can use a single field at search time.  It
> > allows one posting list per query term instead of the current one
> > posting list per field per query term
> (MultiFieldQueryParser wouldn't
> > be necessary in most cases).  In addition to having fewer posting
> > lists to examine, you often don't need to read to the end of long
> > posting lists when processing with a score-at-a-time approach (see
> > Anh/Moffat's Pruned Query Evaluation Using Pre-Computed
> > Impacts, SIGIR
> > 2006, for details on one potential algorithm).
> >
> > I'm not quite sure what you mean when you mention leaving them out and
> > re-calculating them at merge time.
> >
> > - Jeff
> >
> > > -----Original Message-----
> > > From: Marvin Humphrey [mailto:[hidden email]]
> > > Sent: Tuesday, January 09, 2007 2:58 PM
> > > To: [hidden email]
> > > Subject: Re: Beyond Lucene 2.0 Index Design
> > >
> > >
> > > On Jan 9, 2007, at 6:25 AM, Dalton, Jeffery wrote:
> > >
> > > > e. <impact, num_docs, (doc1,...docN)> f. <impact, num_docs,
> > > > ([doc1, freq ,<positions>],...[docN, freq
> > > > ,<positions>])
> > >
> > > Does the impact have any use after it's used to sort the postings?
> > > Can we leave it out of the index format and recalculate
> at merge-time?
> > >
> > > Marvin Humphrey
> > > Rectangular Research
> > > http://www.rectangular.com/
> > >
> > >
> > >
> > >
> > >
> > >
> >
> >
> >
> >
>
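
As a small illustration of the "one posting list per field per query term" point quoted above: with field-specific terms, searching one word across a subject and a body field walks two separate posting lists (the field names are only examples), whereas a combined impact field would need a single list for the same query:

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;

final class PerFieldPostingsSketch {
    // One posting list per field per query term, as Lucene works today.
    static void walkPerFieldPostings(IndexReader reader, String word) throws IOException {
        for (String field : new String[] { "subject", "body" }) {
            TermDocs td = reader.termDocs(new Term(field, word));
            try {
                while (td.next()) {
                    // td.doc() would be scored here with this field's
                    // norms and boosts
                }
            } finally {
                td.close();
            }
        }
    }
}
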



RE: Beyond Lucene 2.0 Index Design

Dalton, Jeffery
In reply to this post by Dalton, Jeffery
Lucene is a combination of the vector space similarity and Boolean
models.  Lucene's queries are ranked Boolean queries.  Documents must meet
certain Boolean criteria, but the resulting list is then ranked by similarity
score.  If you didn't care about returning the "top" hits, then I would
agree that the docId sorted list would be perfectly applicable.
However, in a ranked query, even one where Boolean constraints are
enforced, impacts can be more efficient.  (more to follow)
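
For reference while reading the questions quoted below, here is a rough sketch of walking one block of the proposed format e. -- <impact, num_docs, (doc1,...docN)> with delta-encoded doc numbers.  Writing each value as a VInt is an assumption made only for illustration; the block layout itself is the format under discussion:

import java.io.IOException;
import org.apache.lucene.store.IndexInput;

final class ImpactBlockSketch {
    // One block of the proposed format (e.): <impact, num_docs, (doc1,...docN)>
    // where doc numbers within the block are delta-encoded, as in the
    // original post.  The per-value VInt encoding is an assumption here.
    static void readBlock(IndexInput in, BlockConsumer consumer) throws IOException {
        int impact = in.readVInt();      // shared by every doc in this block
        int numDocs = in.readVInt();
        int doc = 0;
        for (int i = 0; i < numDocs; i++) {
            doc += in.readVInt();        // undo the delta encoding
            consumer.collect(doc, impact);
        }
    }

    interface BlockConsumer {
        void collect(int doc, int impact);
    }
}
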

> -----Original Message-----
> From: Ming Lei [mailto:[hidden email]]
> Sent: Wednesday, January 10, 2007 5:41 PM
> To: [hidden email]
> Subject: Re: Beyond Lucene 2.0 Index Design
>
> I have a couple of questions about the original post of the
> new index design:
>
> (1) Question on the posting list
> > f. <impact, num_docs, ([doc1, freq
> > ,<positions>],...[docN, freq
> > ,<positions>])
> What is the "impact" per posting list? I am under the
> impression that "impact" or "frequency" is per pair of doc and term.
>
> And it seems that "impact" or "frequency" needs to be stored
> for each doc on the posting list of a term, for two reasons:
> to efficiently stop the traversal at some point at search time
> by looking at the "impact" value, and to get the component
> score without re-calculation at search time.
>
>
> (2) I wonder whether Lucene is really based upon vector-space
> model. I am under the impression that the hits are selected
> using boolean model and only the scoring (on the hit set)
> uses vector space model.
>
> If so, the effect on boolean queries is not very positive.
>
> For a query like "termA AND termB", I suppose the posting
> lists of both A and B have to be fully traversed, right? The
> partial traversal is only possible for disjunctions or single
> term queries. And the join of the two posting lists will be
> more costly than on the original docID-sorted posting lists.
>
> (3) As to Jian's question below,
> A phrase query is a special case of a conjunctive boolean query.
>
> Michael
>
>
> --- jian chen <[hidden email]> wrote:
>
> > Hi, Jeff,
> >
> > Also, how to handle the phrase based queries?
> >
> > For example, here are two posting lists:
> >
> > TermA: X Y
> > TermB: Y X
> >
> > I am not sure how you would return document X or Y for a
> search of the
> > phrase "TermA Term B". Which should come first?
> >
> > Thanks,
> >
> > Jian
> >
>
>
>
>  
>
>
>



RE: Beyond Lucene 2.0 Index Design

Dalton, Jeffery
In reply to this post by Dalton, Jeffery
Thanks, Grant. I will take a look at this.

> -----Original Message-----
> From: Grant Ingersoll [mailto:[hidden email]]
> Sent: Thursday, January 11, 2007 8:12 AM
> To: [hidden email]
> Subject: Re: Beyond Lucene 2.0 Index Design
>
> Hi Jeff,
>
> Wondering if you (and/or others) would be interested in
> taking a look at
> https://issues.apache.org/jira/browse/LUCENE-662 and vetting
> the new interfaces, etc. to see if you could come up w/ a
> prototype implementation.  This would help move along 662 as
> it would sort out some of the issues that come up with the
> new interface based approach (I often find it takes at least
> two implementations to fully flesh out an abstraction like this).
>
> -Grant
>
>
> ------------------------------------------------------
> Grant Ingersoll
> http://www.grantingersoll.com/
>
>
>
>
>

