On 23/02/2017 01:07, kasilak wrote:

> (1) You have mentioned above in your response, that Lexicon/Postings are

> stored as sorted on-disk array with binary search. Is the sorted array is

> represented as a binary tree (underlying data structure)? Is the binary

> search is binary tree search?

No, it's the classic binary search algorithm that works on a sorted array:

https://en.wikipedia.org/wiki/Binary_search_algorithm> Can I point me to "C" source code where this

> is done?

It's implemented in LexIndex_Seek:

https://git1-us-west.apache.org/repos/asf?p=lucy.git;a=blob;f=core/Lucy/Index/LexIndex.c;h=0d7a0ea4f4d4f2d720df13be2919691d0977c8ab;hb=HEAD#l141> (2) While performing searches on a indexed cf.dat, is there a fixed cost to

> decompress the "delta encoded numbers and strings"?

What do you mean by "fixed cost"?

> (3) I am taking measurements using my toy benchmark. I have indexed 10,000

> documents. In the IxSearcher_Hits, I am changing the num_wanted for the

> given query string. My expectation was changing the num_wanted shouldn't

> affect the performance (measured using QPS "time taken to search and report

> the query in queries per second (QPS)").

> On the contrary what I am observing is when the num_wanted is set to 10

> hits, then the QPS is higher in comparison with num_wanted is set to 4000

> hits.

> If the TF/IDF (term frequency/inverse document frequency) ranking is

> constructed always and will report the top 10 hits or top 4000 hits, then

> my expectation is QPS shouldn't change.

> "The only explanation I have is the difference in QPS is because in the case

> of 10 hits the search stops as soon as the IxSearcher_Hits found 10 hits

> (disregarding the ranking algorithm) and in the case of 4000 hits, it will

> continue and stop as soon as 4000 hits or maximum the application can find.

> This means IxSearcher_Hits doesn't use any inherent ranking to return the

> relevant documents. Is my theory right?"

No, IxSearcher_Hits always returns sorted documents. The search process can be

divided into three steps:

1. Find all N documents matching a query.

2. Find and sort the top M documents according to the SortSpec. Lucy

uses a priority queue implemented as a heap (HitQueue), so time

complexity is O(N * log(M)).

3. Retrieve the stored fields for each of the top M documents.

Steps 2 and 3 are obviously faster for smaller values of M.

Nick