iterating through hits, is there a way to improve performance, or can we run these iterations in parallel


rohit0908
Hi,

I am working with a deeply nested directory structure and large logs in Perl, and I am trying to generate results from the index data:

    my $hits = $searcher->hits(
        query      => ['title'],
        num_wanted => -1,
    );

    while ( my $hit = $hits->next ) {
        # making 28777 calls to Lucy::Search::Hits::next
        # do some work (already profiled and optimized this part)
    }

Since the hit count is large, the work inside this loop is consuming a significant amount of time, and I need to improve its performance.

Can we run this in parallel, or is any other optimization possible when hits number in the thousands?

I tried Parallel::ForkManager here, but it increased the running time significantly instead of reducing it.

I am out of ideas; please help, as I am badly stuck.

Regards
Rohit Singh

Re: [lucy-user] iterating through hits, is there a way to improve performance, or can we run these iterations in parallel

Marvin Humphrey
On Tue, Apr 11, 2017 at 3:08 AM, rohit0908 <[hidden email]> wrote:

> Since the hit count is large, the work inside this loop is consuming a
> significant amount of time, and I need to improve its performance.

Every call to $hits->next requires deserializing an entire document.  It may
be possible, depending on how your application is structured, to reduce or
avoid the cost of deserialization.

If you don't need any fields other than `title`, and you currently have
other fields which are `stored`, then you could try changing the FieldType
for those other fields so that they are no longer `stored`.  That will
reduce the cost of deserializing a document.
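
For instance (a minimal sketch, not your actual schema; the analyzer choice
and the field name are assumptions), a field declared with `stored => 0` can
still be searched but no longer contributes to the stored document:

    use Lucy::Plan::Schema;
    use Lucy::Plan::FullTextType;
    use Lucy::Analysis::EasyAnalyzer;

    my $analyzer = Lucy::Analysis::EasyAnalyzer->new(language => 'en');
    my $schema   = Lucy::Plan::Schema->new;

    # Indexed (searchable) but not stored: it adds nothing to the cost
    # of deserializing a document fetched via $hits->next.
    $schema->spec_field(
        name => 'description',    # hypothetical field
        type => Lucy::Plan::FullTextType->new(
            analyzer => $analyzer,
            stored   => 0,
        ),
    );

Bear in mind that changing a FieldType means rebuilding the index.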

Another possibility might be to spend memory to avoid i/o, and cache all the
titles in a Perl array on Searcher initialization with indices corresponding
to Lucy doc IDs.  Then you could use a BitCollector, avoiding the
deserialization that $hits->next does.  Something like this:

    my $searcher = Lucy::Search::IndexSearcher->open(index => $index);

    # Cache every title up front, indexed by Lucy doc ID.  Doc IDs run
    # from 1 to doc_max inclusive.
    my @titles;
    my $doc_max = $searcher->doc_max;
    for my $doc_id (1 .. $doc_max) {
        my $doc = $searcher->fetch_doc($doc_id);
        $titles[$doc_id] = $doc->{title};
    }

    my $bit_vec = Lucy::Object::BitVector->new(
        capacity => $doc_max + 1,
    );
    my $bit_collector = Lucy::Search::Collector::BitCollector->new(
        bit_vector => $bit_vec,
    );
    $searcher->collect(
        collector => $bit_collector,
        query     => $query,
    );

    # next_hit() returns the next set bit at or after the supplied tick,
    # so advance one past the current hit to avoid looping on it.
    my $last_id = 0;
    while (1) {
        my $doc_id = $bit_vec->next_hit($last_id);
        last if $doc_id == -1;
        $last_id = $doc_id + 1;
        print $titles[$doc_id] . "\n"; # or whatever
    }
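
Here, $query is whatever Lucy::Search::Query you were already passing to
$searcher->hits.  Assuming you start from a query string, a minimal sketch
using QueryParser might be:

    my $query_parser = Lucy::Search::QueryParser->new(
        schema => $searcher->get_schema,
        fields => ['title'],
    );
    my $query = $query_parser->parse($query_string);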

> Can we run this in parallel, or is any other optimization possible when
> hits number in the thousands?

Lucy is single-threaded, and there is currently no practical way to
parallelize $hits->next.  I've hacked together some process-based parallelism
using unsupported private APIs, but the approach wasn't ready for prime time.

Marvin Humphrey

Re: [lucy-user] iterating through hits, is there a way to improve performance, or can we run these iterations in parallel

rohit0908
Thanks, Marvin, for your reply and for taking a quick look at this. I will try your second option of caching titles and using a BitCollector. Meanwhile, could you please help me with the point below?

>> If you don't need any fields other than `title`, and you currently have
>> other fields which are `stored`, then you could try changing the FieldType
>> for those other fields so that they are no longer `stored`.  That will
>> reduce the cost of deserializing a document.

I am running the query on the title only, and I need just four fields to serve my purpose: title, content, url, and urlpath. Is there a way to fetch only these fields and so reduce the deserialization cost, or do you mean that fields which aren't necessary should not be stored at all? Please let me know how to do it, thanks!



Regards
Rohit Singh

Re: [lucy-user] iterating through hits, is there a way to improve performance, or can we run these iterations in parallel

Peter Karman
rohit0908 wrote on 4/14/17 7:33 AM:

> Thanks, Marvin, for your reply and for taking a quick look at this. I will
> try your second option of caching titles and using a BitCollector.
> Meanwhile, could you please help me with the point below?
>
>>> If you don't need any fields other than `title`, and you currently have
>>> other fields which are `stored`, then you could try changing the FieldType
>>> for those other fields so that they are no longer `stored`.  That will
>>> reduce the cost of deserializing a document.
>
> I am running the query on the title only, and I need just four fields to
> serve my purpose: title, content, url, and urlpath. Is there a way to fetch
> only these fields and so reduce the deserialization cost, or do you mean
> that fields which aren't necessary should not be stored at all? Please let
> me know how to do it, thanks!
>

"Storing" a field means you can retrieve the original value from the index
directly. You can index a field value without storing it, so that you can search
on the field but not retrieve the original (un-analyzed) value.

See https://metacpan.org/pod/distribution/Lucy/lib/Lucy/Plan/FieldType.pod for
the flags available when defining a field.
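
As an illustration only (a sketch based on the field names in this thread,
not your real Schema), the flags map onto your situation roughly like this:
fields you search and retrieve keep the defaults; fields you retrieve but
never search can set `indexed => 0`; fields you search but never retrieve
can set `stored => 0`:

    use Lucy::Plan::Schema;
    use Lucy::Plan::FullTextType;
    use Lucy::Plan::StringType;
    use Lucy::Analysis::EasyAnalyzer;

    my $analyzer = Lucy::Analysis::EasyAnalyzer->new(language => 'en');
    my $schema   = Lucy::Plan::Schema->new;

    # Searched and retrieved: indexed and stored (the defaults).
    $schema->spec_field(
        name => 'title',
        type => Lucy::Plan::FullTextType->new(analyzer => $analyzer),
    );

    # Retrieved but never searched: stored, not indexed.
    for my $name (qw( url urlpath )) {
        $schema->spec_field(
            name => $name,
            type => Lucy::Plan::StringType->new(indexed => 0),
        );
    }

    # Searched but never retrieved: indexed, not stored, so it adds
    # nothing to the cost of deserializing each hit.
    $schema->spec_field(
        name => 'content',
        type => Lucy::Plan::FullTextType->new(
            analyzer => $analyzer,
            stored   => 0,
        ),
    );

Whether `content` belongs in that last group depends on whether you actually
retrieve its value; only you can say. Changing any of these flags means
re-indexing.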

To give you more concrete advice, we'd need to see your indexing code,
especially how you define your Schema.


--
Peter Karman  .  https://karpet.github.io  .  https://keybase.io/peterkarman