Lucy Knowledge Questions

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
2 messages Options
Reply | Threaded
Open this post in threaded view
|

Lucy Knowledge Questions

kasilak
This post was updated on .
Copying Nick's responses on technical Questions

(1) What is the data structure used to represent Lexicon? (Clownfish
supports hashtable. Does it mean Lucy uses hashtable?)
Lexicon is essentially a sorted on-disk array that is searched with binary search. Clownfish::Hash, on the other hand, is an in-memory data structure. Lucy doesn't build in-memory structures for most index data because this would incur a huge startup penalty. This also makes it possible to work with indices that don't fit in RAM, although performance deteriorates quickly in this case.

(2) What is the data structure used to represent postings? (Clownfish
supports hashtable. Does it mean Lucy uses hashtable?)
Posting lists are stored in an on-disk array. The indices are found in Lexicon.

(3) Which compression method is used? Is it enabled by default?
Lexicon and posting list data is always compressed with delta encoding for numbers and incremental encoding for strings.

(4) Can I know whether searching through lexicon/posting list is in-memory
process or IO process?
Lucy uses memory-mapped files to access most index data so the distinction between in-memory and IO-based operation blurs quite a bit.
==============================
Hi Nick:

I have follow up questions based on the above thread.
(1)  You have mentioned above in your response, that Lexicon/Postings are stored as sorted on-disk array with binary search. Is it simple sorted array or represented as a binary tree (underlying data structure)? Is the binary search is binary tree search (if it is binary tree)? Can you point me to "C" source code where this is done?

(2) While performing searches on a indexed cf.dat, is there a fixed cost to decompress the "delta encoded numbers and strings"?

(3) I am taking measurements using my toy "search benchmark". I have indexed 10,000 documents. While searching, in the IxSearcher_Hits, I am changing the num_wanted for the given query string. My expectation was changing the num_wanted shouldn't affect the performance (measured using QPS "time taken to search and report the query in queries per second (QPS)").
On the contrary what I am observing is when the num_wanted is set to 10 hits, then the QPS is higher in comparison with num_wanted is set to 4000 hits.
If the TF/IDF (term frequency/inverse document frequency) ranking is constructed always and will report the top 10 hits or top 4000 hits,  then my expectation is QPS shouldn't change.
"The only explanation I have is the difference in QPS is because in the case of 10 hits the search stops as soon as the IxSearcher_Hits found 10 hits (disregarding the ranking algorithm) and in the case of 4000 hits, it will continue and stop as soon as 4000 hits or maximum the search application can find. This means IxSearcher_Hits doesn't use any inherent ranking to return the relevant documents. Is my theory right?"


Reply | Threaded
Open this post in threaded view
|

Re: [lucy-user] Lucy Knowledge Questions

Nick Wellnhofer
On 23/02/2017 01:07, kasilak wrote:
> (1)  You have mentioned above in your response, that Lexicon/Postings are
> stored as sorted on-disk array with binary search. Is the sorted array is
> represented as a binary tree (underlying data structure)? Is the binary
> search is binary tree search?

No, it's the classic binary search algorithm that works on a sorted array:

     https://en.wikipedia.org/wiki/Binary_search_algorithm

> Can I point me to "C" source code where this
> is done?

It's implemented in LexIndex_Seek:

 
https://git1-us-west.apache.org/repos/asf?p=lucy.git;a=blob;f=core/Lucy/Index/LexIndex.c;h=0d7a0ea4f4d4f2d720df13be2919691d0977c8ab;hb=HEAD#l141

> (2) While performing searches on a indexed cf.dat, is there a fixed cost to
> decompress the "delta encoded numbers and strings"?

What do you mean by "fixed cost"?

> (3) I am taking measurements using my toy benchmark. I have indexed 10,000
> documents. In the IxSearcher_Hits, I am changing the num_wanted for the
> given query string. My expectation was changing the num_wanted shouldn't
> affect the performance (measured using QPS "time taken to search and report
> the query in queries per second (QPS)").
> On the contrary what I am observing is when the num_wanted is set to 10
> hits, then the QPS is higher in comparison with num_wanted is set to 4000
> hits.
> If the TF/IDF (term frequency/inverse document frequency) ranking is
> constructed always and will report the top 10 hits or top 4000 hits,  then
> my expectation is QPS shouldn't change.
> "The only explanation I have is the difference in QPS is because in the case
> of 10 hits the search stops as soon as the IxSearcher_Hits found 10 hits
> (disregarding the ranking algorithm) and in the case of 4000 hits, it will
> continue and stop as soon as 4000 hits or maximum the application can find.
> This means IxSearcher_Hits doesn't use any inherent ranking to return the
> relevant documents. Is my theory right?"

No, IxSearcher_Hits always returns sorted documents. The search process can be
divided into three steps:

1. Find all N documents matching a query.
2. Find and sort the top M documents according to the SortSpec. Lucy
    uses a priority queue implemented as a heap (HitQueue), so time
    complexity is O(N * log(M)).
3. Retrieve the stored fields for each of the top M documents.

Steps 2 and 3 are obviously faster for smaller values of M.

Nick