[lucy-user] Indexing Lucy::Plan::Int32Type

[lucy-user] Indexing Lucy::Plan::Int32Type

Thomas den Braber
Hello,

I have been using Lucy for some time now and it works great. I am extending some functionality,
and one addition is using numbers for searching, range queries, and sorting.

But I found out that integer fields can't be indexed.

I would like to use:

my $numsortindexed = Lucy::Plan::Int32Type->new( indexed => 1, sortable => 1, stored => 1  );

Do you have any plans to implement searching/ranges for the Int32Type ?

Regards,

Thomas


Re: [lucy-user] Indexing Lucy::Plan::Int32Type

Marvin Humphrey
On Wed, Nov 27, 2013 at 3:12 AM, Thomas den Braber <[hidden email]> wrote:
> Hallo,
>
> I am using Lucy for some time now and it works great.

:)

> Do you have any plans to implement searching/ranges for the Int32Type ?

Yes, that is the intent.

Int32Type is already `sortable`, and since RangeQuery needs sortable field
types, ranges should work now.

The principal reason that Int32Type is not yet public is that such fields
cannot yet be `indexed`.  To complete this feature, we need to perform some
more refactoring of the inner indexing classes, including the Lexicon classes
and PostingPool.  (`S_write_terms_and_postings` in particular makes
assumptions that all terms have a text type.)  There are a few secondary
issues as well, such as how QueryParser should handle numeric fields (it
probably won't).

A lot of times people have been able to simulate numeric fields by hacking in
leading zeroes.  Perhaps either that helps you, or perhaps the tidbit that
RangeQuery should already work with Int32Type helps?

Marvin Humphrey
Re: [lucy-user] Indexing Lucy::Plan::Int32Type

Thomas den Braber
Marvin,

> Int32Type is already `sortable`, and since RangeQuery needs sortable field
> types, ranges should work now.
>

When I do a range query on a sortable Int32Type I get an error:
 "term is a Lucy::Object::CharBuf, and not comparable to a Lucy::Object::Integer32"

I use the range in the same way as in the example:
http://search.cpan.org/~creamyg/Lucy-0.3.3/lib/Lucy/Search/RangeQuery.pod


> A lot of times people have been able to simulate numeric fields by hacking in
> leading zeroes.  

Do you know if there is a speed difference between sorting on Int32Type fields and text
fields with leading zeroes?


Thomas den Braber





Re: [lucy-user] Indexing Lucy::Plan::Int32Type

Marvin Humphrey
On Thu, Nov 28, 2013 at 2:19 AM, Thomas den Braber <[hidden email]> wrote:
> When I do a range query on a sortable Int32Type I get an error: "term is a
> Lucy::Object::CharBuf, and not comparable to a Lucy::Object::Integer32"
>
> I use the range in the same way as in the example:
> http://search.cpan.org/~creamyg/Lucy-0.3.3/lib/Lucy/Search/RangeQuery.pod

Ah.  You might be able to work around that by supplying values like so:

    my $range_query = Lucy::Search::RangeQuery->new(
        field      => 'product_number',
        lower_term => Lucy::Object::Integer32->new(value => 3),
    );

> Do you know if there is a speed difference between sorting on Int32Type
> fields and text fields with leading zero's ?

Should be negligible.

(Gory details: We pre-sort everything at index-time and write out binary
integer ordinals.  Most comparisons happen between the ordinals and are very
fast.  Some text comparisons happen but these scale with the number of
segments in the index, not the number of documents matched by the query.)
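
The ordinal idea described above can be illustrated with a minimal sketch in plain Perl. This is not Lucy's actual implementation; the document ids, the values, and the variable names are all made up for illustration. It only shows why sorting hits by pre-computed integer ordinals avoids comparing the field values themselves at search time.

```perl
use strict;
use warnings;

# Hypothetical field values for five documents (doc id => field value).
my %value_for = (
    1 => '00300',
    2 => '00014',
    3 => '86400',
    4 => '00014',
    5 => '01111',
);

# "Index time": collect the unique values, sort them once,
# and assign each one an integer ordinal.
my %seen;
my @unique = grep { !$seen{$_}++ } values %value_for;
@unique = sort @unique;
my %ord_of_value;
@ord_of_value{@unique} = ( 0 .. $#unique );

# Store one ordinal per document.
my %ord_for = map { $_ => $ord_of_value{ $value_for{$_} } } keys %value_for;

# "Search time": order the matching docs by comparing integers only.
my @hits   = ( 1, 2, 3, 5 );
my @sorted = sort { $ord_for{$a} <=> $ord_for{$b} } @hits;

print "@sorted\n";    # prints "2 1 5 3"
```

Because the ordinals are assigned from the sorted values, it makes no difference to the search-time comparison whether the underlying field held integers or zero-padded text.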

Marvin Humphrey
Re: [lucy-user] Indexing Lucy::Plan::Int32Type

Thomas den Braber


> Ah.  You might be able to work around that by supplying values like so:
>
>     my $range_query = Lucy::Search::RangeQuery->new(
>         field      => 'product_number',
>         lower_term => Lucy::Object::Integer32->new(value => 3),
>     );

I got an error when doing so:

Invalid parameter: 'value'\n\tcfish_XSBind_allot_params at xs\\XSBind.c line
507\n\tXS_Lucy_Object_Obj_new at lib\\\\Lucy.xs line 343

I am using version 0.3.3



--
Thomas den Braber


Re: [lucy-user] Indexing Lucy::Plan::Int32Type

Marvin Humphrey
On Mon, Dec 2, 2013 at 9:08 AM, Thomas den Braber <[hidden email]> wrote:

>> Ah.  You might be able to work around that by supplying values like so:
>>
>>     my $range_query = Lucy::Search::RangeQuery->new(
>>         field      => 'product_number',
>>         lower_term => Lucy::Object::Integer32->new(value => 3),
>>     );
>
> I got an error when doing so:
>
> Invalid parameter: 'value'\n\tcfish_XSBind_allot_params at xs\\XSBind.c line
> 507\n\tXS_Lucy_Object_Obj_new at lib\\\\Lucy.xs line 343
>
> I am using version 0.3.3

OK, it looks like that workaround is only feasible with the current master
branch, not 0.3.x.  (Using `Clownfish::Integer32` instead of
`Lucy::Object::Integer32`.)

That being the case, does the leading-zeroes technique work for you?  It's
probably better anyway because it doesn't depend on non-public API features.

Marvin Humphrey
Re: [lucy-user] Indexing Lucy::Plan::Int32Type

Thomas den Braber
That is OK, I will use the leading zeroes until the integer support is ready.

Thanks for your help,

Thomas den Braber


Re: [lucy-user] Indexing Lucy::Plan::Int32Type

Nick D.
I am having the same issue and am not sure how to correct it. Can I not use Int32Type when indexing integers that I want to do a range query on later?

Can you give me an example of the leading zeroes? I think I tried that as well, but I may be misunderstanding what you mean. I tried adding zeroes like so:

my $range_query = Lucy::Search::RangeQuery->new(
         field      => 'time_sec',
         lower_term => '00014',
     );

I am storing the seconds in a day in an Int32Type field, so its range will be 0-86400. If storing the value in an Int32Type makes it impossible to use with RangeQuery, then how should I store this value and have it sorted the correct way (e.g. 1111 is not smaller than 21 just because "1111" begins with "1" and "21" begins with "2")?

Thanks in advance,

Nicholas Dwyer
Re: [lucy-user] Indexing Lucy::Plan::Int32Type

Marvin Humphrey
On Thu, Dec 12, 2013 at 10:36 AM, Nick D. <[hidden email]> wrote:
> Can I not use Int32Type when indexing integers that I want to do a range
> query on later?

That's right.  Int32Type isn't public and isn't ready for prime time in Lucy
0.3.x.

> Can you give me an example of the leading zeros because I think I tried that
> also but I may be miss understanding what you mean by leading zeros?

The idea is to define the field as an ordinary text type (probably StringType)
and add leading zeroes at *index-time*.

    # If `$time_sec` is 14, then `$fields{time_sec}` will be `"00014"`.
    $fields{time_sec} = sprintf("%0.5d", $time_sec);
    $indexer->add_doc(\%fields);

Then your query will work at search-time:

> my $range_query = Lucy::Search::RangeQuery->new(
>          field      => 'time_sec',
>          lower_term => '00014',
>      );
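
For anyone who wants to check the padding logic without building an index, here is a minimal sketch in plain Perl. The helper name `pad_seconds` and the sample values are hypothetical, and it uses `%05d`, which gives the same result as `%0.5d` for non-negative integers. It shows that once every value is padded to the same width, ordinary string comparison agrees with numeric order, which is what a range over a text field relies on.

```perl
use strict;
use warnings;

# Hypothetical helper: pad to five digits, enough for 0..86400.
sub pad_seconds {
    my ($seconds) = @_;
    return sprintf( "%05d", $seconds );
}

my @raw    = ( 1111, 21, 14, 86400, 0 );
my @padded = map { pad_seconds($_) } @raw;

# A plain string sort of the padded values...
my @by_string = sort @padded;

# ...matches the numeric sort of the raw values.
my @by_number = map { pad_seconds($_) } sort { $a <=> $b } @raw;

# A range test using string comparison only. The bounds must be
# padded the same way as the indexed values.
my $lower    = pad_seconds(14);
my $upper    = pad_seconds(2000);
my @in_range = grep { $_ ge $lower && $_ le $upper } @by_string;

print "@by_string\n";    # prints "00000 00014 00021 01111 86400"
print "@in_range\n";     # prints "00014 00021 01111"
```

The width has to cover the largest value you will ever index; a six-digit value in a five-digit field would sort incorrectly again.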

Marvin Humphrey
Re: [lucy-user] Indexing Lucy::Plan::Int32Type

Nick D.
Thanks, this worked!
Re: [lucy-user] Indexing Lucy::Plan::Int32Type

Nick D.
Another question regarding this leading-zeroes approach to range queries: does the StringType field have to be stored, or can I mark it unstored and still use a RangeQuery on it?

Is it valid to do a RangeQuery on a field defined like this:

my $unindexed_string_type = Lucy::Plan::StringType->new( indexed => 0, sortable => 1, stored => 0  );

Or do I need this:

my $unindexed_string_type = Lucy::Plan::StringType->new( indexed => 0, sortable => 1, stored => 1  );
Re: [lucy-user] Indexing Lucy::Plan::Int32Type

Marvin Humphrey
On Tue, Jan 28, 2014 at 1:57 PM, Nick D. <[hidden email]> wrote:

> Another question regarding this method of Range. Does the StringType have to
> be stored or can I mark it unstored and still be able to use a RangeQuery on
> it.

It can be unstored.  The data structures for full-document retrieval
and sorting are separate.

Marvin Humphrey