[lucy-user] Lucy Benchmarking

[lucy-user] Lucy Benchmarking

kasilak
I am benchmarking Lucy and need some information on the project. I am a beginner as far as Lucy goes, and I am interested only in the "C language" port of Lucy.

My questions are as follows, and I hope the experts can respond.

(1) Is Lucy multithreaded or single-threaded?

(2) Are the "C" runtime and bindings stable?

(3) Is there preexisting benchmark code written in "C" to measure Lucy's performance?

(4) I see one under devel/benchmarks/indexers/LuceneIndexer.java, but it is written in Java and appears to benchmark Lucene, not Lucy. Am I right in my observation?

(5) I was thinking of modifying the lucy/c/sample applications into a benchmarking application. Is this a good strategy?
Btw, is there a good way to build the sample files? I have to modify the Makefile in the lucy/c/ directory to build them, and I am not sure this is the correct way.

Your assistance and inputs will be greatly appreciated.

Thanks

Re: [lucy-user] Lucy Benchmarking

Nick Wellnhofer
On 01/02/2017 01:44, Kasi Lakshman Karthi Anbumony wrote:
> (1) Is Lucy multithreaded or single-threaded?

Single-threaded.

> (2) Are the "C" runtime and bindings stable?

Yes.

> (3) Is there preexisting benchmark code written in "C" to measure Lucy's performance?

No.

> (4) I see one under devel/benchmarks/indexers/LuceneIndexer.java, but it is written in Java and appears to benchmark Lucene, not Lucy. Am I right in my observation?

The corresponding Perl benchmark script for Lucy is lucy_indexer.plx:

 
https://git1-us-west.apache.org/repos/asf?p=lucy.git;a=tree;f=devel/benchmarks/indexers;h=77626c37285602941376c5e5950a20e50683da40;hb=HEAD

> (5) I was thinking of modifying the lucy/c/sample applications into a benchmarking application. Is this a good strategy?
> Btw, is there a good way to build the sample files? I have to modify the Makefile in the lucy/c/ directory to build them, and I am not sure this is the correct way.

You can find some guidance on how to compile Lucy applications in the comment
at the top of getting_started.c:

 
https://git1-us-west.apache.org/repos/asf?p=lucy.git;a=blob;f=c/sample/getting_started.c;h=6d6193d772f2ceaac86c67cc49169878b4d4d2f6;hb=HEAD

Basically, you have to run the Clownfish compiler "cfc" to generate header
files, then you can compile your code and link against libclownfish and liblucy.
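
For illustration only, the steps look roughly like this (the cfc flags and
install paths here are assumptions on my part, so defer to the comment in
getting_started.c):

    # Generate C headers from the installed Clownfish/Lucy .cfh files.
    cfc --include=/usr/local/share/clownfish/include --dest=autogen

    # Compile against the generated headers and link both runtime libraries.
    cc -o my_benchmark my_benchmark.c -Iautogen/include -lclownfish -llucy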

Benchmark results for the indexer will largely depend on the particular
Analyzer chain and the total size of your index. The default EasyAnalyzer
consists of

- StandardTokenizer
- Unicode Normalizer
- SnowballStemmer

StandardTokenizer is pretty fast, but Normalizer and Stemmer are
CPU-intensive. Last time I checked, they account for about two-thirds of the
processing time for small indices.
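
For example, to isolate the cost of the normalize/stem stages, you could
index the same corpus twice, once with the full chain and once with the bare
tokenizer. A sketch using the short-name C bindings, as in the sample apps:

    // Run 1: the full default chain (tokenize + normalize + stem).
    String *lang = Str_newf("en");
    EasyAnalyzer *full_chain = EasyAnalyzer_new(lang);
    FullTextType *full_type = FullTextType_new((Analyzer*)full_chain);

    // Run 2: the tokenizer alone, to see what the other two stages cost.
    StandardTokenizer *tok_only = StandardTokenizer_new();
    FullTextType *tok_type = FullTextType_new((Analyzer*)tok_only);

    DECREF(lang);

Build one schema per run and compare wall-clock indexing times.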

A better benchmarking framework would be a much-needed contribution.

Nick


Re: [lucy-user] Lucy Benchmarking

kasilak
Thanks, Nick.

Can I know how to build lucy and lucy-clownfish for ARM (AARCH64)?

I do have the ARM cross-compiler toolchain and would like to know which
files to change.

Thanks
-Kasi


Re: [lucy-user] Lucy Benchmarking

Nick Wellnhofer
On 02/02/2017 21:44, Kasi Lakshman Karthi Anbumony wrote:
> Can I know how to build lucy and lucy-clownfish for ARM (AARCH64)?
>
> I do have the ARM cross-compiler toolchain and would like to know which
> files to change.

Cross compiling Lucy isn't supported yet. I haven't tried to build Lucy on ARM
myself, but we have successful test reports from CPAN Testers with Raspberry
Pis. So, if you're feeling adventurous:

1. Build the Clownfish compiler normally for the host platform.
2. Configure the Clownfish runtime using the host compiler.
3. Edit the generated Makefile.
    - Replace CC with the cross compiler.
    - Check CFLAGS etc.
4. Edit the generated charmony.h file to match the target
    platform (see the sketch after this list).
    - CHY_SIZEOF macros
    - Endian macro
    - Possibly other stuff
5. (Maybe) Run `make autogen/hierarchy.json` first and edit the
    generated file autogen/include/cfish_platform.h to match the
    target platform.
6. Run `make`. If you run into errors, adjust charmony.h or the
    Makefile.
7. Make sure to make backups of Makefile, charmony.h, and
    cfish_platform.h. These files might be recreated and you'll
    lose your changes.
8. Repeat steps 2-7 for Lucy.
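
As a sketch of the step 4 edits, here are illustrative values for a 64-bit
little-endian AArch64 target (treat the exact macro names as assumptions and
verify them against your generated charmony.h):

    /* Illustrative charmony.h values for a 64-bit little-endian target. */
    #define CHY_SIZEOF_CHAR  1
    #define CHY_SIZEOF_SHORT 2
    #define CHY_SIZEOF_INT   4
    #define CHY_SIZEOF_LONG  8
    #define CHY_SIZEOF_PTR   8
    #define CHY_LITTLE_END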

Nick


[lucy-user] Cross-compiling

Nick Wellnhofer
On 02/02/2017 23:26, Nick Wellnhofer wrote:
> Cross compiling Lucy isn't supported yet.

Here's a quick status update. I made some changes to Charmonizer to support
cross-compiling. Clownfish can now be cross-compiled out of the box. Just set
TARGET_CC when executing ./configure. For example:

     TARGET_CC=arm-linux-gnueabihf-gcc ./configure

We even have a Travis job that tests cross-compilation. Note that the
library is only compiled; tests are not run. There are ways to run ARM
binaries on Travis with QEMU, but it's slow to set up.

     https://travis-ci.org/nwellnhof/lucy-clownfish/jobs/198365821

There are a few things to do before we can easily cross-compile Lucy. It
works if you know what to change in the Makefile, but a couple of things
need to be fixed:

     https://issues.apache.org/jira/browse/CLOWNFISH-115

Nick


Re: [lucy-user] Cross-compiling

kasilak
Thanks, Nick.

I was able to compile and test on the AARCH64 board I have. This simplifies
the building process.

Thanks


Re: [lucy-user] Lucy Benchmarking

kasilak
Dear Experts:

I am trying to measure indexing and searching performance using a toy
benchmark, and I have a few questions.

(1) I plan to report the metrics below:

   - Index creation: tokens/second
      - Can I know how to obtain the number of tokens in the created Lucy
      index? Do you think a better metric would be the number of terms in
      the posting list per second? If so, how do I obtain the number of
      terms in the posting list?
   - Search: free-text queries/second
      - The search metric is clean and clear, since the toy benchmark
      controls the number of queries.

(2) What are the different query types possible?

   - Vary document weighting
      - Is this possible, or is it fixed for a given generated Lucy index?
   - Vary the number of terms
   - Vary the relationship of terms (e.g., proximity)
      - How do I do this? Is there an operator like NEAR?
   - Vary operations (e.g., AND, OR)
      - I see that support is available for a boolean query parser. Can I
      know whether, for a given search instance, I can have multiple
      boolean queries like the ones below?

    if (category) {
        String *category_name = Str_newf("category");
        String *category_str  = Str_newf("%s", category);
        TermQuery *category_query
            = TermQuery_new(category_name, (Obj*)category_str);

        // AND the category filter onto the first query. Vec_Push takes
        // ownership, so INCREF the filter to reuse it below.
        Vector *children1 = Vec_new(2);
        Vec_Push(children1, (Obj*)query1);
        Vec_Push(children1, (Obj*)INCREF(category_query));
        query1 = (Query*)ANDQuery_new(children1);
        DECREF(children1);

        // AND the same filter onto the second query.
        Vector *children2 = Vec_new(2);
        Vec_Push(children2, (Obj*)query2);
        Vec_Push(children2, (Obj*)category_query);
        query2 = (Query*)ANDQuery_new(children2);
        DECREF(children2);

        DECREF(category_str);
        DECREF(category_name);
    }


Thanks
-Kasi


Re: [lucy-user] Lucy Benchmarking

Nick Wellnhofer
On 09/02/2017 01:46, Kasi Lakshman Karthi Anbumony wrote:
> (1) I plan to report the metrics below:
>
>    - Index creation: tokens/second
>       - Can I know how to obtain the number of tokens in the created Lucy
>       index? Do you think a better metric would be the number of terms in
>       the posting list per second? If so, how do I obtain the number of
>       terms in the posting list?

AFAIK, the total number of terms in all input documents isn't available
because the term frequencies aren't stored separately. I'd simply use the
total size of the input documents in bytes.

> (2) What are the different query types possible?
>
>    - Vary document weighting
>       - Is this possible, or is it fixed for a given generated Lucy index?

You can apply a boost to queries at query time:

     http://lucy.apache.org/docs/c/Lucy/Search/Query.html#func_Set_Boost
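
For example (a sketch in the short-name C bindings; the field name and boost
value are just illustrative):

    // Double the weight of "title" matches at search time. Set_Boost is
    // defined on Query, so every query subclass inherits it.
    String *field = Str_newf("title");
    String *term  = Str_newf("lucy");
    TermQuery *tq = TermQuery_new(field, (Obj*)term);
    TermQuery_Set_Boost(tq, 2.0f);
    DECREF(term);
    DECREF(field);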

And to fields and documents at indexing time:

     http://lucy.apache.org/docs/c/Lucy/Plan/FieldType.html#func_Set_Boost
     http://lucy.apache.org/docs/c/Lucy/Index/Indexer.html#func_Add_Doc

But for benchmarking purposes, it mostly matters whether you sort by score,
document id, or a field value. See

     http://lucy.apache.org/docs/c/Lucy/Search/SortSpec.html

>    - Vary the relationship of terms (e.g., proximity)
>       - How do I do this? Is there an operator like NEAR?

There's ProximityQuery but I'm not sure how it works:

     http://lucy.apache.org/docs/c/LucyX/Search/ProximityQuery.html

>    - Vary operations (e.g., AND, OR)
>       - I see that support is available for a boolean query parser. Can I
>       know whether, for a given search instance, I can have multiple
>       boolean queries like the ones below?

Yes, that's possible.

Nick


Re: [lucy-user] Lucy Benchmarking

Peter Karman
>>    - Vary the relationship of terms (e.g., proximity)
>>       - How do I do this? Is there an operator like NEAR?
>
> There's ProximityQuery but I'm not sure how it works:
>
>     http://lucy.apache.org/docs/c/LucyX/Search/ProximityQuery.html

You can see one example of ProximityQuery usage here (Perl):

https://metacpan.org/source/KARMAN/Search-Query-Dialect-Lucy-0.202/lib/Search/Query/Dialect/Lucy.pm#L701

Of note:

* `within` is like NEAR - it takes an integer argument
* the order of terms is respected; it's like a phrase
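
In the C bindings, the equivalent would look roughly like this (a sketch
only; I'm inferring the constructor arguments from the Perl API, so
double-check against the ProximityQuery docs):

    // Match "apache" and "lucy" within five positions, in that order.
    String *field = Str_newf("content");
    Vector *terms = Vec_new(2);
    Vec_Push(terms, (Obj*)Str_newf("apache"));
    Vec_Push(terms, (Obj*)Str_newf("lucy"));
    ProximityQuery *pq = ProximityQuery_new(field, terms, 5);
    DECREF(terms);
    DECREF(field);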



--
Peter Karman . https://peknet.com/ . https://keybase.io/peterkarman

Re: [lucy-user] Lucy Benchmarking

kasilak
Thanks for the explanation.

As a follow-on question, based on this link:
https://lucy.apache.org/docs/c/Lucy/Docs/FileFormat.html

(1) Why does cf.dat have a document section?

(2) Why is it not compressed?

I see most of the content of the books I have indexed as part of the cf.dat
file and can read the text as-is! Is this how inverted indexing works?

Thanks
-Kasi


Re: [lucy-user] Lucy Benchmarking

Peter Karman
Kasi Lakshman Karthi Anbumony wrote on 2/9/17 5:51 PM:

> Thanks for the explanation.
>
> As a follow-on question, based on this link:
> https://lucy.apache.org/docs/c/Lucy/Docs/FileFormat.html
>
> (1) Why does cf.dat have a document section?
>
> (2) Why is it not compressed?
>
> I see most of the content of the books I have indexed as part of the
> cf.dat file and can read the text as-is! Is this how inverted indexing
> works?

Do you have the "stored" flag or the "highlightable" flag set to true for
your Plan::FullTextType schema definitions?

IIRC, that's why the doc text is stored, which seems to be confirmed by the
URL you reference.
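
If you don't need the raw text back from the index, you can switch storage
off per field. A short-name C sketch (FieldType's setters are what I'd reach
for here, but verify against the FieldType docs):

    // Index the field for search, but don't store its text, so it won't
    // land in the documents section of cf.dat.
    String *lang = Str_newf("en");
    EasyAnalyzer *analyzer = EasyAnalyzer_new(lang);
    FullTextType *type = FullTextType_new((Analyzer*)analyzer);
    FullTextType_Set_Stored(type, false);
    DECREF(analyzer);
    DECREF(lang);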

As far as why it is not compressed, I'm not sure. I expect that decompression
incurs a performance hit.


--
Peter Karman  .  https://peknet.com/  .  https://keybase.io/peterkarman

Re: [lucy-user] Lucy Benchmarking

Marvin Humphrey
On Thu, Feb 9, 2017 at 3:51 PM, Kasi Lakshman Karthi Anbumony
<[hidden email]> wrote:

> As a follow-on question, based on this link:
> https://lucy.apache.org/docs/c/Lucy/Docs/FileFormat.html
>
> (1) Why does cf.dat have a document section?

The search needs to give something back to you to identify which documents
were hits. Lucy's internal document IDs change over time, so they are not
suitable for that purpose. You need to store at least your own identifier,
even if you choose not to store other parts of the document.

> (2) Why is it not compressed?

It's not done by default, but there are extension points that allow that
behavior to be overridden. There's even example code shipping with Lucy that
does exactly what you suggest. It's in Perl, but it could be ported to C.

 $REPO/perl/lib/LucyX/Index/ZlibDocReader.pm
 $REPO/perl/lib/LucyX/Index/ZlibDocWriter.pm

> I see most of the content of the books I have indexed as part of the
> cf.dat file and can read the text as-is! Is this how inverted indexing
> works?

The document storage part of a Lucy datastore is separate from the
inverted index.  The inverted index data structures are definitely
compressed, using algorithms tuned to the task of search. The first
part of the search yields a set of internal Lucy document IDs, which
are then used to look up whatever's in document storage.

From a performance perspective, the cost to perform the inverted index
search is roughly proportional to the size of the corpus, whereas the
cost to retrieve the document content afterwards is proportional to
the number of documents retrieved.  When scaling to larger
collections, compressing the inverted index is more important than
compressing document storage, since the number of documents searched
grows while the number of documents retrieved often stays the same.

Of course it may still be reasonable to compress document storage,
depending on usage pattern. But if for example you're only storing
short identifiers, there's no need.

Marvin Humphrey

Re: [lucy-user] Lucy Benchmarking

kasilak
Hi Marvin:

Thanks for your detailed explanation.

Given the significance of inverted index compression, can I know the
following for a better understanding of the inner workings:

(1) What data structure is used to represent the Lexicon? (Clownfish
supports a hashtable. Does that mean Lucy uses a hashtable?)

(2) What data structure is used to represent postings? (Clownfish supports a
hashtable. Does that mean Lucy uses a hashtable?)

(3) Which compression method is used? Is it enabled by default?

(4) Why is there no API (function call) to get the number of terms in the
lexicon and posting list for a given cf.dat?

(5) Can I know whether searching through the lexicon/posting list is an
in-memory process or an IO process?

Thanks
-Kasi


Re: [lucy-user] Lucy Benchmarking

Nick Wellnhofer
On 14/02/2017 00:57, Kasi Lakshman Karthi Anbumony wrote:
> (1) What data structure is used to represent the Lexicon? (Clownfish
> supports a hashtable. Does that mean Lucy uses a hashtable?)

Lexicon is essentially a sorted on-disk array that is searched with binary
search. Clownfish::Hash, on the other hand, is an in-memory data structure.
Lucy doesn't build in-memory structures for most index data because doing so
would incur a huge startup penalty. This also makes it possible to work with
indices that don't fit in RAM, although performance deteriorates quickly in
that case.
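
Conceptually, a lexicon lookup behaves like this (an in-memory illustration
only; the real structure is an mmap'd on-disk file, not a C array):

    #include <stddef.h>
    #include <string.h>

    typedef struct {
        const char *term;            // terms sorted ascending
        long        postings_offset; // where this term's postings start
    } LexEntry;

    long lexicon_lookup(const LexEntry *lex, size_t n, const char *term) {
        size_t lo = 0, hi = n;
        while (lo < hi) {
            size_t mid = lo + (hi - lo) / 2;
            int cmp = strcmp(lex[mid].term, term);
            if      (cmp < 0) lo = mid + 1;
            else if (cmp > 0) hi = mid;
            else              return lex[mid].postings_offset;
        }
        return -1;  // term absent
    }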

> (2) What data structure is used to represent postings? (Clownfish supports
> a hashtable. Does that mean Lucy uses a hashtable?)

Posting lists are stored in an on-disk array. The indices into it are found
in the Lexicon.

> (3) Which compression method is used? Is it enabled by default?

Lexicon and posting list data is always compressed, using delta encoding for
numbers and incremental encoding for strings.
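
As a rough illustration of the numeric case (not Lucy's actual code): sorted
doc IDs are stored as gaps, which are small integers that encode compactly.

    #include <stddef.h>
    #include <stdint.h>

    // Store gaps between successive doc IDs instead of the IDs themselves.
    void delta_encode(const uint32_t *doc_ids, uint32_t *gaps, size_t n) {
        uint32_t prev = 0;
        for (size_t i = 0; i < n; i++) {
            gaps[i] = doc_ids[i] - prev;  // e.g. 3, 7, 12 -> 3, 4, 5
            prev = doc_ids[i];
        }
    }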

> (4) Why is there no API (function call) to get the number of terms in the
> lexicon and posting list for a given cf.dat?

It's generally hard to tell why a certain feature wasn't implemented. The
only answer I can give is that no one has deemed it important enough so far.
But Lucy is open-source software, so basically anyone can implement any
feature they want.

> (5) Can I know whether searching through the lexicon/posting list is an
> in-memory process or an IO process?

Lucy uses memory-mapped files to access most index data, so the distinction
between in-memory and IO-based operation blurs quite a bit.

Nick