[lucy-user] C library:Suggester

serkanmulayim@gmail.com
Hi guys,

I am using the C library and would like to add suggester (autocomplete) functionality to my library. It needs to return {"hello", "hell", "hellx"} when the query is "hell". I think I need to be able to read all the tokens in the whole index and build the results from them. I looked at IndexReader for this, but I could not find any useful information. Do you think this is possible?

Thanks,
Serkan

Re: [lucy-user] C library:Suggester

Marvin Humphrey
On Mon, May 1, 2017 at 3:55 PM, Serkan Mulayim <[hidden email]> wrote:

> I am using the C library. I would like to get the suggester or autocomplete
> functionality in my library. It needs to return {"hello", "hell", "hellx"}
> when your query is "hell". I feel like I need to be able to read all the
> tokens in the whole index, and return the results based on it. I looked at
> the indexReader for this, but I could not find any useful information. Do
> you think this is possible?

Autosuggestion functionality will need tuning, just like search results.  In
fact, autosuggestion is really a specialized form of search application.  It
could be implemented with a separate index or separate fields.

Say that we only wanted to offer suggestions derived from the `title` field.
Split each title into an array of words.  Then for each word, index every
prefix that is at least some minimum length, say three letters.  For the
title `hello world`, you'd get the following tokens:

    hello -> hel hell hello
    world -> wor worl world
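
The prefix expansion above can be sketched in plain C. This is only an
illustration under stated assumptions: `edge_prefixes` is a hypothetical helper
name, not part of the Lucy API, and it treats the word as single-byte (ASCII)
text. In a real setup the resulting prefixes would be fed to whatever indexing
code populates the suggestion field.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not part of Lucy): write every prefix of `word`
 * that is at least `min_len` bytes long into `out`, shortest first.
 * Each prefix is a freshly malloc'd, NUL-terminated string that the
 * caller must free.  Returns the number of prefixes written, which is
 * never more than `max_out`.  Assumes single-byte (ASCII) characters. */
size_t
edge_prefixes(const char *word, size_t min_len, char **out, size_t max_out) {
    assert(word != NULL && out != NULL && min_len > 0);
    size_t word_len = strlen(word);
    size_t count = 0;
    for (size_t len = min_len; len <= word_len && count < max_out; len++) {
        char *prefix = malloc(len + 1);
        if (prefix == NULL) { break; }   /* out of memory: stop early */
        memcpy(prefix, word, len);
        prefix[len] = '\0';
        out[count++] = prefix;
    }
    return count;
}
```

Calling `edge_prefixes("hello", 3, out, 8)` fills `out` with the three tokens
listed for `hello` above; a word shorter than the minimum length produces no
tokens at all, which is why the first two keystrokes return no result.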

Then at search time, perform a search query with every keystroke.

    h -> (no result)
    he -> (no result)
    hel -> "hello world"

Once you've got basic functionality running, experiment with minimum token
length, adding Soundex/Metaphone, performing character normalization, etc.

Marvin Humphrey

Re: [lucy-user] C library:Suggester

serkanmulayim@gmail.com
Thank you very much Marvin,

When I type hell, I would like to get the tokens starting with hell, e.g.
{"hell","hello","hellx"}. I do not want to get the documents that contain
the token hell in the title. So it seems the lookup should work on the
tokens themselves.

What I need is basically to be able to iterate over all tokens in
lexicographic order. I would also need to sort them by frequency when
returning the results. I guess the Lexicon class,
https://lucy.apache.org/docs/c/Lucy/Index/Lexicon.html, is designed for
this. Can you please confirm? I hope the results returned by
lucy_Lex_seek contain the frequencies of the terms as well.
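
To make the idea concrete, here is a minimal sketch in plain C of a prefix
scan over lexicographically sorted terms, followed by a sort on frequency. It
does not use the Lucy API: the `TermEntry` array is a hypothetical stand-in for
the terms and document frequencies a lexicon would supply, and all function
names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for one lexicon entry: a term and the number
 * of documents that contain it. */
typedef struct {
    const char *term;
    unsigned    doc_freq;
} TermEntry;

/* Binary search: index of the first entry whose term is >= `prefix`
 * in strcmp order.  `terms` must already be sorted in that order. */
size_t
lower_bound(const TermEntry *terms, size_t n, const char *prefix) {
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (strcmp(terms[mid].term, prefix) < 0) { lo = mid + 1; }
        else                                     { hi = mid; }
    }
    return lo;
}

/* qsort comparator: highest doc_freq first, ties broken by term. */
int
by_freq_desc(const void *a, const void *b) {
    const TermEntry *x = a;
    const TermEntry *y = b;
    if (x->doc_freq != y->doc_freq) {
        return x->doc_freq < y->doc_freq ? 1 : -1;
    }
    return strcmp(x->term, y->term);
}

/* Seek to `prefix`, copy matching entries into `out` while they still
 * start with `prefix`, then sort the copies by frequency.  At most
 * `max_out` entries are collected, in lexicographic order, before the
 * sort.  Returns the number of suggestions written. */
size_t
suggest(const TermEntry *terms, size_t n, const char *prefix,
        TermEntry *out, size_t max_out) {
    assert(terms != NULL && prefix != NULL && out != NULL);
    size_t prefix_len = strlen(prefix);
    size_t count = 0;
    for (size_t i = lower_bound(terms, n, prefix);
         i < n && count < max_out
             && strncmp(terms[i].term, prefix, prefix_len) == 0;
         i++) {
        out[count++] = terms[i];
    }
    qsort(out, count, sizeof(TermEntry), by_freq_desc);
    return count;
}
```

Because the terms are sorted, every term sharing the prefix sits in one
contiguous run, so the scan can stop at the first term that no longer matches.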

Thanks again,
Serkan

Re: [lucy-user] C library:Suggester

Marvin Humphrey
On Wed, May 3, 2017 at 1:06 PM, Serkan Mulayim <[hidden email]> wrote:

> Thank you very much Marvin,
>
> When I type hell, I would like to get tokens starting with hell, e.g.
> {"hell","hello","hellx"}. I do not want to get documents which contain hell
> token in the title. So it seems like it should be working on the tokens.
>
> What I need is basically to be able to iterate over all tokens which are
> lexicographically ordered. Also I would need to sort them based on their
> frequencies when returning the results. I guess Lexicon class,
> https://lucy.apache.org/docs/c/Lucy/Index/Lexicon.html,  is designed for
> this. Can you please confirm? I hope the returned results in the
> lucy_Lex_seek contains the frequency of the terms as well.

I stand by my recommendation of using a dedicated index because you
will almost certainly want to tune your autosuggestion results. But
feel free to play around with Lexicon and see how it works for you.

Note that depending on what Analyzer you are using for a given field,
the terms in the Lexicon may not be what you expect.

Marvin Humphrey

Re: [lucy-user] C library:Suggester

Peter Karman
In reply to this post by serkanmulayim@gmail.com
You might find this Perl implementation a helpful reference.

https://metacpan.org/pod/LucyX::Suggester




--
Peter Karman . https://peknet.com/ . https://keybase.io/peterkarman

Re: [lucy-user] C library:Suggester

serkanmulayim@gmail.com
Thanks Marvin and Peter for your comments.

I tried to make the library work with Lexicons, but I am getting a
segfault. I believe I am not initializing the LexiconReader correctly. I
could not find any samples anywhere. My code snippet is at the end of this
message. As I mentioned earlier, I would simply like to have access to the
tokens for a specific field. Those fields do not use any stemmers.

I have a few questions about the code snippet. In order to create a
LexiconReader*, I create an IndexReader. Then I initialize the LexReader. I
am really suspicious about what I am doing, because LexReader_init takes a
LexReader (self) as an argument and returns the LexReader. In order to do
this I had to malloc a LexiconReader pointer; otherwise LexReader_init
fails. Since sizeof(lucy_LexiconReader) fails, I malloc 10000 bytes. The
program crashes at this line:

lucy_Lexicon * lexicon = LexReader_Lexicon(lexiconReader, field_str, (cfish_Obj*) term_str);

The lldb output for this crash is below.

Can anyone see what I am doing wrong here?

Thanks in advance,
Serkan

---------------------------------------------------LLDB Output------------------------------------------------------
    frame #0: 0x00000001000036ab testSuggester`LUCY_LexReader_Lexicon(self=0x0000000103000000, field=0x0000000100508910, term=0x00000001005086e0) + 27 at LexiconReader.h:275
   272 extern LUCY_VISIBLE uint32_t LUCY_LexReader_Lexicon_OFFSET;
   273 static CFISH_INLINE lucy_Lexicon*
   274 LUCY_LexReader_Lexicon(lucy_LexiconReader* self, cfish_String* field, cfish_Obj* term) {
-> 275    const LUCY_LexReader_Lexicon_t method = (LUCY_LexReader_Lexicon_t)cfish_obj_method(self, LUCY_LexReader_Lexicon_OFFSET);
   276    return method(self, field, term);
   277 }
   278
(lldb) n
Process 68332 stopped
* thread #1: tid = 0x17272f5, 0x000000010000549f testSuggester`cfish_method(klass=0x0000000000000000, offset=192) + 31 at cfish_parcel.h:108, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0xc0)
    frame #0: 0x000000010000549f testSuggester`cfish_method(klass=0x0000000000000000, offset=192) + 31 at cfish_parcel.h:108
   105 cfish_method(const void *klass, uint32_t offset) {
   106    union { char *cptr; cfish_method_t *fptr; } ptr;
   107    ptr.cptr = (char*)klass + offset;
-> 108    return ptr.fptr[0];
   109 }
   110


-----------------------------------------------Code Snippet--------------------------------------------------------
lucy_FSFolder *folder = lucy_FSFolder_new(folder_str);
lucy_IndexReader *indexReader = lucy_IxReader_open((cfish_Obj *) folder_str, NULL, NULL);
cfish_Vector *segments = IxReader_Get_Segments(indexReader);
lucy_Snapshot *snapshot = IxReader_Get_Snapshot(indexReader);
int32_t seg_tick = IxReader_Get_Seg_Tick(indexReader);

//sizeof does not work for lexiconreader or for datareader. Put 10000 for testing
lucy_LexiconReader * lexiconReader = (lucy_LexiconReader*) malloc(10000);
lucy_LexReader_init(lexiconReader, schema, (lucy_Folder*) folder, snapshot, segments, seg_tick);

cfish_String *field_str = Str_newf(field);
cfish_String *term_str = Str_newf(term);
lucy_Lexicon * lexicon = LexReader_Lexicon(lexiconReader, field_str, (cfish_Obj*) term_str);

char *out;

cfish_Obj *out_str = Lex_Get_Term(lexicon);
out = Str_To_Utf8((cfish_String*) out_str);
DECREF(out_str);
printf("%s\n", out);
free(out);
while(Lex_Next(lexicon)) {
    cfish_Obj *out_str = Lex_Get_Term(lexicon);
    out = Str_To_Utf8((cfish_String*) out_str);
    DECREF(out_str);
    printf("%s\n", out);
    free(out);
}



Re: [lucy-user] C library:Suggester

Nick Wellnhofer
On 15/05/2017 23:13, Serkan Mulayim wrote:
> I tried to make the library work for Lexicons, but I am receiving a
> Segfault. I believe I am not able to initialize the LexiconReader
> correctly.

You can get the LexiconReader for an index with IndexReader's Obtain method:

     http://lucy.apache.org/docs/c/Lucy/Index/IndexReader.html#func_Obtain

Example code:

     LexiconReader *lex_reader = (LexiconReader*)IxReader_Obtain(
         index_reader, Class_Get_Name(LEXICONREADER));
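
As background on the crash itself: the lldb trace earlier in the thread shows
cfish_method being called with klass=0x0 and offset=192, and the faulting
address 0xc0 is 192 in hex, so the method lookup went through a null class
pointer. The toy sketch below reproduces that shape in plain C. Every name in
it is hypothetical and the structures are deliberately simplified; it is not
Clownfish's actual object layout, only an illustration of why a buffer that
came straight from malloc or calloc cannot be dispatched on, while an object
handed out by the library can.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* All names here are hypothetical; this mimics the shape of the
 * dispatch seen in the trace, not Clownfish's real data structures. */
typedef int (*method_t)(void *self);

typedef struct Class {
    method_t methods[4];      /* method table, addressed by byte offset */
} Class;

typedef struct Object {
    const Class *klass;       /* installed by the constructor only */
    int          value;
} Object;

int
get_value(void *self) {
    return ((Object*)self)->value;
}

const Class OBJECT_CLASS = { { get_value, NULL, NULL, NULL } };

/* Proper constructor: allocates AND installs the class pointer. */
Object*
object_new(int value) {
    Object *self = malloc(sizeof(Object));
    if (self != NULL) {
        self->klass = &OBJECT_CLASS;
        self->value = value;
    }
    return self;
}

/* Dispatch: read the function pointer stored `offset` bytes into the
 * object's class.  `offset` must be the offset of a slot in `methods`.
 * Unlike the code in the trace, this returns NULL when the class
 * pointer is missing instead of dereferencing it. */
method_t
find_method(const Object *self, size_t offset) {
    if (self == NULL || self->klass == NULL) { return NULL; }
    const char *base = (const char*)self->klass;
    return *(const method_t*)(base + offset);
}
```

An object from `object_new` dispatches normally, whereas a zero-filled buffer
of the same size has no class pointer, so the lookup has nothing to read from.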

Nick

Re: [lucy-user] C library:Suggester

serkanmulayim@gmail.com
Thank you very much Nick, I tried your suggestion and it worked. This
is a much simpler way of getting the LexiconReader. I suppose the lexicon
it provides runs over all segments in the index, right?


[lucy-user] trailing double quote

arjan
Dear all,

It seems that if a double quote is the last character of a query,
followed by nothing or nothing other than space characters, an error is
thrown:

    StrIter_crop: top is behind tail
    cfish_StrIter_crop at cfcore/Clownfish/String.c line 704

As can be seen in code like this:

    my $query_parser = Lucy::Search::QueryParser->new(
        schema => $env->get_schema,
        fields => [ 'normalized' ],
    );

    my $user_query = 'aap noot mies" ';
    $user_query = $query_parser->parse($user_query);

A single occurrence of a double quote anywhere else in the string is no
problem, nor is a single occurrence of a single quote anywhere in the
string, including at the end.

Is this a bug in Apache::Lucy? (Lucy-v0.6.1 latest version on cpan)

Kind regards,
Arjan.


Re: [lucy-user] trailing double quote

Nick Wellnhofer
On 20/05/2017 11:48, Arjan Widlak - United Knowledge wrote:
> It seems that if a double quote is the last character of a query, followed by
> nothing or nothing other than space characters, an error is thrown:

> Is this a bug in Apache::Lucy? (Lucy-v0.6.1 latest version on cpan)

Yes, this is a bug. It should be fixed in the next release:

     https://issues.apache.org/jira/browse/LUCY-325

Nick