[lucy-user] ProximityQuery in C

kasilak
Hi Experts:

I am modifying https://github.com/apache/lucy/blob/master/c/sample/search.c
to handle proximity queries.

The default QParser_Parse is powerful enough to handle all Boolean
queries.

To support proximity queries, I have modified the code by adding the
code block below.

    {
        Vector *terms = Vec_new(0);
        Vec_Push(terms, (Obj*)query_str);
        String *field_name = Str_newf("content");

        ProximityQuery *pquery = ProximityQuery_new(field_name, terms, 100);
        DECREF(terms);
    }

    //Hits *hits = IxSearcher_Hits(searcher, (Obj*)query, 0, 10, NULL);
    Hits   *hits = IxSearcher_Hits(searcher, (Obj*)pquery, 0, 10, NULL);

Unfortunately, my hits are NULL. The query string I am passing is "animals
partial"; both words occur in the documents I have indexed, within the
100-word distance passed to ProximityQuery_new() above.

I stumbled upon this Perl thread:
http://lucene.472066.n3.nabble.com/lucy-user-Unable-to-retrieve-records-using-Proximity-query-td3990375.html

That thread says that there is no analyzer for ProximityQuery.
What is the equivalent change I need to make in my C code to get
ProximityQuery working?

Thanks
-Kasi

Re: [lucy-user] ProximityQuery in C

Peter Karman
Kasi Lakshman Karthi Anbumony wrote on 2/15/17 4:22 PM:

>
> I stumbled upon this perl thread:
> http://lucene.472066.n3.nabble.com/lucy-user-Unable-to-retrieve-records-using-Proximity-query-td3990375.html
>
> The above perl thread says that there is no analyzer for ProximityQuery.
> Can I know what is the equivalent translation I need to apply for my C code
> so as to enable ProximityQuery and get it working?
>


You need to analyze "terms" before you pass it to ProximityQuery_new.

The relevant section of the QueryParser code is here and might point you in the
right direction:

https://github.com/apache/lucy/blob/master/core/Lucy/Search/QueryParser.c#L862




--
Peter Karman  .  https://peknet.com/  .  https://keybase.io/peterkarman

Re: [lucy-user] ProximityQuery in C

kasilak
Hi Peter:

Thanks for your quick response. I am a beginner with the Lucy framework and
am still getting to know the different classes and objects.

I have copy-pasted my source code below; the block I am working on was
marked in red.

(1) Is this what you meant by analyzing "terms"?

(2) Can I do something similar to what has been done for TermQuery in the
code block that was marked in blue below?

    String        *folder   = Str_newf("%s", path_to_index);
    printf("Index file used: %s\n", path_to_index);
    IndexSearcher *searcher = IxSearcher_new((Obj*)folder);
    Schema        *schema   = IxSearcher_Get_Schema(searcher);
    QueryParser   *qparser  = QParser_new(schema, NULL, NULL, NULL);
    //ProximityQuery *pquery  = ProximityQuery_new(NULL, "animal quarter", 20);

    String *query_str = Str_newf("%s", query_c);
    Query  *query     = QParser_Parse(qparser, query_str);
    //Query  *query       = QParser_Parse(pquery, query_str);

    String *content_str = Str_newf("content");
#ifdef ENABLE_HIGHLIGHTER
    Highlighter *highlighter
        = Highlighter_new((Searcher*)searcher, (Obj*)query, content_str,
200);
#endif

    if (category) {
        String *category_name = Str_newf("category");
        String *category_str  = Str_newf("%s", category);
        TermQuery *category_query
            = TermQuery_new(category_name, (Obj*)category_str);

        Vector *children = Vec_new(2);
        Vec_Push(children, (Obj*)query);
        Vec_Push(children, (Obj*)category_query);
        query = (Query*)ANDQuery_new(children);

        DECREF(children);
        DECREF(category_str);
        DECREF(category_name);
    }

    if( queryType[g_testProximity] )
    {
        Vector *terms = Vec_new(0);
        Vec_Push(terms, (Obj*)query_str);

        String *field_name = Str_newf("content");
        ProximityQuery *pquery = ProximityQuery_new(field_name, terms, 100);

        Analyzer *analyzer = Schema_Fetch_Analyzer(schema, field_name);
        if (!analyzer)
        {
           Vec_Push(query, pquery);
        }

        DECREF(terms);
        DECREF(field_name);
    }


    Hits *hits = IxSearcher_Hits(searcher, (Obj*)query, 0, 10, NULL);


Thanks
-Kasi



Re: [lucy-user] ProximityQuery in C

kasilak
Hi Peter:

Do you have any additional details for me to try out?

Does my previous email with the code snippet make sense to you?

Thanks
-Kasi



Re: [lucy-user] ProximityQuery in C

kasilak
Sharing the complete C code for search.c.

Please look for g_testProximity to follow the proximity query related changes.

The default QParser handles the following query strings passed as command-line arguments without issues, so I did not need to make any code changes for them:
(1) "a AND b"
(2) "a or b"
(3) "a NOT b"
(4) "a b"

What fails is the case "a b"~100, even though my indexed documents contain both terms within a span of 100 words.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>   // clock(), CLOCKS_PER_SEC

#define CFISH_USE_SHORT_NAMES
#define LUCY_USE_SHORT_NAMES
#include "Clownfish/String.h"
#include "Clownfish/Vector.h"
#include "Lucy/Document/HitDoc.h"
#include "Lucy/Highlight/Highlighter.h"
#include "Lucy/Plan/Schema.h"
#include "Lucy/Search/ANDQuery.h"
#include "Lucy/Search/Hits.h"
#include "Lucy/Search/IndexSearcher.h"
#include "Lucy/Search/TermQuery.h"
#include "Lucy/Search/QueryParser.h"
#include "LucyX/Search/ProximityQuery.h"
#include "Lucy/Analysis/Analyzer.h"
#include "Clownfish/TestHarness/TestUtils.h"
#include "QUtils.h"
#include "version.h"

char path_to_index[100] = "./lucy_index/lucy_index";
#define ENABLE_HIGHLIGHTER

// Test Configuration
enum
{
    g_testDefault    = 0, //QParser supports BOOLEAN/TERM queries
    g_testProximity  = 1, //To support proximity queries
    g_testMax        = 2
};

typedef struct TestOpts_ {
   const char*   name;
}TestOpts;


static void
S_usage_and_exit(const char *arg0) {
    printf("Usage: %s [-p <x86_64/aarch64> platform] [-a <enable(1)/disable(0)> angel signals] [-s <Docs count>] [-c <category (OPTIONAL)>] <querystring>\n", arg0);
    exit(1);
}

int
main(int argc, char *argv[]) {
    fprintf( stderr, "Search Version: %d.%d\n", MAJOR_VERSION, MINOR_VERSION);
    bool isEnableAngelSignals = false;
    uint32_t docCount = 0;
    uint32_t numWanted = 10;

    // Initialize the library.
    lucy_bootstrap_parcel();

    const char *category  = NULL;
    const char *platform  = NULL;
    const char *testQuery = NULL;
    TestOpts g_testopts[] =
    {
      { "default"  },
      { "proximity"},
    };
    bool  queryType[g_testMax] = {false};

    int i = 1;
    uint32_t j;

    while (i < argc - 1) {
        if (strcmp(argv[i], "-p") == 0) {
            if (i + 1 >= argc) {
                S_usage_and_exit(argv[0]);
            }
            i += 1;
            platform = argv[i];
        }
        else if (strcmp(argv[i], "-a") == 0) {
            if (i + 1 >= argc) {
                S_usage_and_exit(argv[0]);
            }
            i += 1;
            // Parse the 1/0 argument; assigning the pointer itself is always true.
            isEnableAngelSignals = (atoi(argv[i]) != 0);
        }
        else if (strcmp(argv[i], "-s") == 0) {
            if (i + 1 >= argc) {
                S_usage_and_exit(argv[0]);
            }
            i += 1;
            docCount = atol(argv[i]);
        }
        else if (strcmp(argv[i], "-c") == 0) {
            if (i + 1 >= argc) {
                //S_usage_and_exit(argv[0]);
            }
            i += 1;
            category = argv[i];
            printf("Category given: %s\n\n", category);
        }
        else if (strcmp(argv[i], "-T") == 0) {
            if (i + 1 >= argc) {
                S_usage_and_exit(argv[0]);
            }
            i += 1;
            testQuery = argv[i];
            char *opt = (char *)&testQuery[0];
            bool found = false;

            for(j = 0; j < sizeof(g_testopts)/sizeof(TestOpts); j++)
            {
              if (strcmp(opt, g_testopts[j].name) == 0)
              {
                queryType[j] = true;
                printf( "Testing Query: %s\n", g_testopts[j].name);
                found = true;
                break;
              }
            }

            if (!found)
            {
              printf("Invalid option: -T=%s\n", testQuery);
              printf("Valid tests: -T=%s", g_testopts[0].name);
              for(j = 1; j < sizeof(g_testopts)/sizeof(TestOpts); j++)
              {
                printf( ",%s", g_testopts[j].name);
              }
              printf( "\n");
              exit(0);
            }
        }
        else {
            S_usage_and_exit(argv[0]);
        }

        i += 1;
    }

    if (i + 1 != argc) {
        S_usage_and_exit(argv[0]);
    }

    const char *query_c = argv[i];

    printf("Searching for: %s with # of hits: %d \n\n", query_c, numWanted);

#ifdef PERF_INSTRUMENT
  perf_event_init(  (enable_perf_events) (ENABLE_HW_CYCLES_PER | ENABLE_HW_INSTRS_PER) );
#endif

    char buff1[100];
    sprintf(buff1, "%s%d%s%s", "-", docCount, "-", platform);
    strcat(path_to_index, buff1);
    String        *folder   = Str_newf("%s", path_to_index);
    printf("Index file used: %s\n", path_to_index);

#ifdef PERF_INSTRUMENT
  perf_event_enable ( (enable_perf_events) (ENABLE_HW_CYCLES_PER | ENABLE_HW_INSTRS_PER) );
  uint64_t beginI = perf_per_instr_event_read();
#endif
    double start = (double)clock();

    IndexSearcher *searcher  = IxSearcher_new((Obj*)folder);
    Schema        *schema    = IxSearcher_Get_Schema(searcher);
    String        *query_str = Str_newf("%s", query_c);

    QueryParser *qparser = QParser_new(schema, NULL, NULL, NULL);
    ProximityQuery *pquery = NULL;

    Query *query = NULL;
    query = QParser_Parse(qparser, query_str);

    String *content_str = Str_newf("content");
#ifdef ENABLE_HIGHLIGHTER
    Highlighter *highlighter
        = Highlighter_new((Searcher*)searcher, (Obj*)query, content_str, 200);
#endif

    if (category)
    {
        String *category_name = Str_newf("category");
        String *category_str  = Str_newf("%s", category);
        TermQuery *category_query
            = TermQuery_new(category_name, (Obj*)category_str);

        Vector *children = Vec_new(2);
        Vec_Push(children, (Obj*)query);
        Vec_Push(children, (Obj*)category_query);
        query = (Query*)ANDQuery_new(children);

        DECREF(children);
        DECREF(category_str);
        DECREF(category_name);
    }

    //To handle proximity queries
    if( queryType[g_testProximity] )
    {
        Vector *terms = Vec_new(0);
        Vec_Push(terms, (Obj*)query_str);

        String *field_name = Str_newf("content");
        pquery = (Query*)ProximityQuery_new(field_name, terms, 100);

        Vector *children = Vec_new(2);
        Vec_Push(children, (Obj*) query);
        Vec_Push(children, (Obj*) pquery);
        query = (Query*) (children); //???

        DECREF(children);
        DECREF(field_name);
        DECREF(terms);
    }

    Hits *hits;
    if ( queryType[g_testDefault] )
    {
      hits = IxSearcher_Hits(searcher, (Obj*)query, 0, numWanted, NULL);
    }
    else
    {
      hits = IxSearcher_Hits(searcher, (Obj*)query, 0, numWanted, NULL);
    }

    String *title_str = Str_newf("title");
    String *url_str   = Str_newf("url");
    HitDoc *hit;
    i = 1;

    // Loop over search results.
    while (NULL != (hit = Hits_Next(hits))) {
        String *title = (String*)HitDoc_Extract(hit, title_str);
        char *title_c = Str_To_Utf8(title);

        String *url = (String*)HitDoc_Extract(hit, url_str);
        char *url_c = Str_To_Utf8(url);

#ifdef ENABLE_HIGHLIGHTER
        String *excerpt = Highlighter_Create_Excerpt(highlighter, hit);
        char *excerpt_c = Str_To_Utf8(excerpt);

        printf("Result %d: %s (%s)\n%s\n\n", i, title_c, url_c, excerpt_c);
        free(excerpt_c);
        DECREF(excerpt);
#else
        printf("Result %d: %s (%s)\n\n", i, title_c, url_c);
#endif
        free(url_c);
        free(title_c);
        DECREF(url);
        DECREF(title);
        DECREF(hit);
        i++;
    }

    printf("Search: %8.5f QPS\n", (1 * ((double)CLOCKS_PER_SEC/((double)clock()-start))) );

#ifdef PERF_INSTRUMENT
  printf("================================\n");
  printf("For Searching: %lld instructions\n", (perf_per_instr_event_read() - beginI) );
#endif

    DECREF(url_str);
    DECREF(title_str);
    DECREF(hits);
    DECREF(query);
    DECREF(query_str);
    if( queryType[g_testProximity] )
    {
      DECREF(pquery);
    }
#ifdef ENABLE_HIGHLIGHTER
    DECREF(highlighter);
#endif
    DECREF(content_str);
    DECREF(qparser);
    DECREF(searcher);
    DECREF(folder);
    return 0;
}


Re: [lucy-user] ProximityQuery in C

Peter Karman
kasilak wrote on 2/17/17 6:35 PM:
> Sharing the complete C code for search.c.


>     //To handle proximity queries
>     if( queryType[g_testProximity] )
>     {
>         Vector *terms = Vec_new(0);
>         Vec_Push(terms, (Obj*)query_str);
>
>       String *field_name = Str_newf("content");
>         pquery = (Query*)ProximityQuery_new(field_name, terms, 100);


^^^ that won't work if `terms` is not first analyzed. If your index uses
stemming or case normalization or anything else, then `terms` will not match
anything in the lexicon.
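
To illustrate why the unanalyzed string cannot match, here is a minimal sketch in plain C. It does not use Lucy at all: `toy_analyze` and `lexicon_contains` are hypothetical helpers, and the "analyzer" only lowercases and splits on non-alphanumeric characters, which is far simpler than whatever analysis chain a real index uses.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define MAX_TERMS    8
#define MAX_TERM_LEN 32

/* Toy stand-in for an analyzer (NOT Lucy's): lowercases the text and
 * splits it on non-alphanumeric characters. Returns the term count. */
static size_t
toy_analyze(const char *text, char terms[MAX_TERMS][MAX_TERM_LEN]) {
    assert(text != NULL);
    size_t count = 0;
    size_t len   = 0;
    for (const char *p = text; ; p++) {
        unsigned char c = (unsigned char)*p;
        if (isalnum(c)) {
            /* Extra terms and overlong terms are silently truncated. */
            if (count < MAX_TERMS && len < MAX_TERM_LEN - 1) {
                terms[count][len++] = (char)tolower(c);
            }
        }
        else {
            if (len > 0) {
                terms[count][len] = '\0';
                count++;
                len = 0;
            }
            if (c == '\0') { break; }
        }
    }
    return count;
}

/* Returns 1 if `term` is an exact entry in the toy lexicon, else 0. */
static int
lexicon_contains(const char *const *lexicon, size_t size, const char *term) {
    assert(lexicon != NULL && term != NULL);
    for (size_t i = 0; i < size; i++) {
        if (strcmp(lexicon[i], term) == 0) { return 1; }
    }
    return 0;
}
```

With a lexicon holding "animals" and "partial", looking up the whole string "animals partial" as a single term fails, while each term produced by `toy_analyze` is found. That is the same mismatch as pushing `query_str` into `terms` as one element.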

What I pointed at earlier in this thread is how the C QueryParser handles
phrases, which is very similar to what must happen for proximity. A proximity
query is just a phrase query whose term positions may fall within a range
<= maxdistance.
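
As a rough sketch of that idea for two terms, the check below scans two sorted position lists for a pair that is close enough. It is plain C with a hypothetical helper name (`within_distance`), and it assumes the second term must come after the first; Lucy's own matcher may define the window differently.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if some position in `b` follows some position in `a` by at
 * least 1 and at most `within`. Both arrays must be sorted ascending.
 * Simplified two-term model, not Lucy's actual matcher. */
static int
within_distance(const unsigned *a, size_t a_len,
                const unsigned *b, size_t b_len, unsigned within) {
    assert(a_len == 0 || a != NULL);
    assert(b_len == 0 || b != NULL);
    size_t i = 0;
    size_t j = 0;
    while (i < a_len && j < b_len) {
        if (b[j] <= a[i]) {
            j++;                 /* this b occurrence is not after a[i] */
        }
        else if (b[j] - a[i] <= within) {
            return 1;            /* found a pair inside the window */
        }
        else {
            i++;                 /* gap too large; try the next a */
        }
    }
    return 0;
}
```

A phrase match is the special case `within == 1` in this model: the second term has to sit at the very next position.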

Since the built-in QueryParser does not yet handle proximity syntax, you'd need
to detect that case and parse it out yourself.

Or better yet, patch the Lucy QueryParser to recognize the proximity syntax and
submit that back here as a pull request.

The relevant place to start looking is here:

https://github.com/apache/lucy/blob/master/core/Lucy/Search/QueryParser.c#L862

and look at the logic around `is_phrase`.

HTH,
pek


--
Peter Karman  .  https://peknet.com/  .  https://keybase.io/peterkarman

Re: [lucy-user] ProximityQuery in C

Nick Wellnhofer
In reply to this post by kasilak
On 16/02/2017 01:13, Kasi Lakshman Karthi Anbumony wrote:
>         Vector *terms = Vec_new(0);
>         Vec_Push(terms, (Obj*)query_str);
>
>         String *field_name = Str_newf("content");
>         ProximityQuery *pquery = ProximityQuery_new(field_name, terms, 100);

Try something like this instead:

     Vector *terms = Analyzer_Split(analyzer, query_str);
     String *field_name = Str_newf("content");
     ProximityQuery *pquery = ProximityQuery_new(field_name, terms, 100);

See

     https://lucy.apache.org/docs/c/Lucy/Analysis/Analyzer.html#func_Split

Nick


Re: [lucy-user] ProximityQuery in C

kasilak
Thanks Nick and Peter. ProximityQuery is now working. Sharing the code excerpts in case anyone else would like to use them.

My search string is of the form "animal render"~100, where 100 is the maximum distance (the "within" value).

    String *content_str = Str_newf("content"); // field name

    // To handle proximity queries
    if (queryType[g_testProximity])
    {
        Analyzer *analyzer = Schema_Fetch_Analyzer(schema, content_str);
        Vector   *terms    = Analyzer_Split(analyzer, query_str);

        // The last token of "animal render"~100 is the distance.
        // Vec_Pop returns a new reference, so release it after use.
        String   *token  = (String*)Vec_Pop(terms);
        uint32_t  within = (uint32_t)Str_To_I64(token);
        DECREF(token);

        printf("Search within=%u and Vec_Size=%zu\n", within, Vec_Get_Size(terms));

        pquery = ProximityQuery_new(content_str, terms, within);

        DECREF(terms);
    }

    Hits *hits = IxSearcher_Hits(searcher, (Obj*)pquery, 0, numWanted, NULL);
    Highlighter *highlighter
        = Highlighter_new((Searcher*)searcher, (Obj*)pquery, content_str, 200);
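
The excerpt above takes the distance from the last token that the analyzer produces. An alternative, sketched here in plain C without any Lucy calls, is to split the `"..."~N` syntax yourself before analysis, so that only the quoted phrase is passed to the analyzer. `parse_proximity` is a hypothetical helper, and the accepted syntax is an assumption based on the example string in this thread.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Parses  "phrase words"~N  into the phrase text and the distance N.
 * Returns 1 on success, 0 if the input is not in that form or the
 * phrase does not fit into `phrase` (including the terminator). */
static int
parse_proximity(const char *input, char *phrase, size_t phrase_size,
                unsigned long *within) {
    assert(input != NULL && phrase != NULL && within != NULL);
    if (input[0] != '"') { return 0; }

    const char *close = strchr(input + 1, '"');
    if (close == NULL || close[1] != '~') { return 0; }

    size_t len = (size_t)(close - (input + 1));
    if (len == 0 || len >= phrase_size) { return 0; }

    /* Require a digit right after '~' so signs and spaces are rejected. */
    const char *digits = close + 2;
    if (*digits < '0' || *digits > '9') { return 0; }

    char *end = NULL;
    errno = 0;
    unsigned long value = strtoul(digits, &end, 10);
    if (errno != 0 || *end != '\0') { return 0; }

    memcpy(phrase, input + 1, len);
    phrase[len] = '\0';
    *within = value;
    return 1;
}
```

The extracted phrase could then go through Analyzer_Split and the number to ProximityQuery_new as in the excerpt above. Since the `within` argument there is a uint32_t, a caller would also want to range-check the parsed value before casting.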