[lucy-user] Speed up Search with Lucy::Search::IndexSearcher and Lucy::Search::PolySearcher from multiple index folders

[lucy-user] Speed up Search with Lucy::Search::IndexSearcher and Lucy::Search::PolySearcher from multiple index folders

Gupta, Rajiv
I'm creating indexes on multiple subfolders under one parent folder.

Indexes are created in multiple folders because files are being created in parallel and I want to avoid segment-lock contention between multiple indexers.

One of my applications creates the directory structure with lots of log files within different subfolders.

I'm indexing all those files in parallel as and when they are created.

The directory structure looks like this:
TopDir/00_log.log
      /01_log2.log
      /.lucyindexer/1/seg_1
                     /seg_2
      /03_log3.log
      /03_log3/log31.log
              /log32.log
              /.lucyindexer/1/seg_1
                             /seg_2
              /log32/log321.log
                    /log322.log
                    /.lucyindexer/1/seg_1
                                   /seg_2
                                 /2/seg_1



This works fine, and while my application is running all log files get indexed as well.
The search is a separate application which does the following:
1.    Scan through all the directories down to .lucyindexer/1 and build a list of all such folders. I use File::Find<https://metacpan.org/pod/File::Find> to do that.
2.    Create searchers with Lucy::Search::IndexSearcher<https://metacpan.org/pod/Lucy::Search::IndexSearcher> in a loop and add all of them to a Lucy::Search::PolySearcher<https://metacpan.org/pod/Lucy::Search::PolySearcher>.
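Step 1 might look roughly like this (a minimal, self-contained sketch; `collect_index_dirs` is a hypothetical helper name, not from the original code):

```perl
use strict;
use warnings;
use File::Find;

# Walk a directory tree and collect every ".lucyindexer/1" index folder.
sub collect_index_dirs {
    my ($top) = @_;
    my @index_dirs;
    find(
        sub {
            # $File::Find::name holds the full path of the current entry.
            push @index_dirs, $File::Find::name
                if -d $_ && $File::Find::name =~ m{\.lucyindexer/1\z};
        },
        $top,
    );
    return sort @index_dirs;    # sorted for deterministic ordering
}
```

The `@all_dirs` used in the snippets below would then be the list this helper returns.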

My code looks like this:


my $schema;
my @searchers;

for my $index ( @all_dirs ) {
    chomp $index;
    my $searcher = Lucy::Search::IndexSearcher->new( index => $index );
    push @searchers, $searcher;
    $schema = $searcher->get_schema;
}

# PolySearcher is the only way to get all search results combined.
my $poly_searcher = Lucy::Search::PolySearcher->new(
    schema    => $schema,
    searchers => \@searchers,
);

my $query_parser = Lucy::Search::QueryParser->new(
    schema => $poly_searcher->get_schema,
    fields => ['title'],
);

# Build up a Query.
my $q     = "1 2 3 4 5 6 7 11 12 13 14 18";
my $query = $query_parser->parse($q);

# Execute the Query and get a Hits object.
my $hits = $poly_searcher->hits(
    query      => $query,
    num_wanted => -1,       # -1 is equivalent to all results
    # sort_spec => $sort_spec,
);

while ( my $hit = $hits->next ) {
    ## Do some operation
}


This runs and returns the expected results. However, performance is really bad when the directory structure is deeply nested.
I profiled with Devel::NYTProf<https://metacpan.org/pod/Devel::NYTProf> and found the two places where most of the time is spent:
1.    Scanning the directory tree. (I will try to solve this by generating the list of directories while the application generates the indexes.)
2.    Creating the searchers with Lucy::Search::IndexSearcher. This dominates when running in a loop over all indexed directories.
To solve item #2 I tried to create the Lucy::Search::IndexSearcher objects for the different index folders with Parallel::ForkManager<https://metacpan.org/pod/Parallel::ForkManager>, but I got the following error:
The storable module was unable to store the child's data structure to the temp file "/tmp/Parallel-ForkManager-27339-27366.txt": Storable serialization not implemented for Lucy::Search::IndexSearcher at /usr/software/lib/perl5/site_perl/5.14.0/x86_64-linux-thread-multi/Clownfish.pm line 93
using the following code:

my $pm = Parallel::ForkManager->new($max_procs);

$pm->run_on_finish(
    sub {
        my ( $pid, $exit_code, $ident, $exit_signal, $core_dump, $data ) = @_;
        print Dumper $data;
        push @searchers, $data;
    }
);

for my $index ( @all_dirs ) {
    chomp $index;
    $pm->start($index) and next;    # fork; parent moves on to the next dir
    my $searcher = Lucy::Search::IndexSearcher->new( index => $index );
    # This is where it fails: ForkManager hands the data back to the
    # parent via Storable, which cannot serialize an IndexSearcher.
    $pm->finish( 0, \$searcher );
}

$pm->wait_all_children;
The whole process takes 60-120 seconds for a large log directory. At the end I build a nested JSON object from all the search results for display with jQuery.
I'm looking for ideas to improve performance. Is there a way to create multiple searchers with Parallel::ForkManager or some other method? Or another way to speed up the search?
Also, is there any way I can merge all the indexes into one place?
Thanks,
Rajiv Gupta

Re: [lucy-user] Speed up Search with Lucy::Search::IndexSearcher and Lucy::Search::PolySearcher from multiple index folders

Marvin Humphrey
On Wed, Sep 14, 2016 at 12:05 AM, Gupta, Rajiv <[hidden email]> wrote:

> 2. Create searchers using Lucy::Search::IndexSearcher in loop and add all
> the searchers to Lucy::Search::PolySearcher

I suggest trying LucyX::Remote::ClusterSearcher instead of PolySearcher.
ClusterSearcher supports parallel searching through multi-process-based
concurrency.

> The storable module was unable to store the child's data structure to the
> temp file "/tmp/Parallel-ForkManager-27339-27366.txt": Storable
> serialization not implemented for Lucy::Search::IndexSearcher at
> /usr/software/lib/perl5/site_perl/5.14.0/x86_64-linux-thread-multi/Clownfish.pm
> line 93

IndexSearchers cannot be serialized with Storable because they refer to a
potentially huge index on the local file system.  Furthermore, they rely on
keeping file descriptors open to preserve a point-in-time snapshot of the
index, and those file descriptors cannot be serialized either.

> Also, is there any way I can merge all the indexes in one place?

Lucy::Index::Indexer's add_index() method could potentially help.
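A rough sketch of what such a merge might look like (untested; `$merged_path` and `@all_index_dirs` are illustrative names, and the schema must match the one the source indexes were built with):

```perl
use strict;
use warnings;
use Lucy::Index::Indexer;

# Consolidate many per-directory indexes into a single index.
my $indexer = Lucy::Index::Indexer->new(
    index  => $merged_path,    # destination for the consolidated index
    schema => $schema,
    create => 1,
);
$indexer->add_index($_) for @all_index_dirs;    # absorb each source index
$indexer->commit;                               # write the merged segments
```

The search application could then open a single IndexSearcher on `$merged_path` instead of one per subfolder.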

Marvin Humphrey

Re: [lucy-user] Speed up Search with Lucy::Search::IndexSearcher and Lucy::Search::PolySearcher from multiple index folders

Nick Wellnhofer
In reply to this post by Gupta, Rajiv
On 14/09/2016 09:05, Gupta, Rajiv wrote:
> I'm creating indexes on multiple subfolders under one parent folder.
>
> Indexes are created on multiple folders since files are getting created in parallel and I want to avoid segment locking between multiple indexers.

> I did profiling using Devel::NYTProf<https://metacpan.org/pod/Devel::NYTProf> and found two places where the maximum time was taken:
> 1.    While scanning the directory. (This I will try to solve by generating a list of directories while the application is generating the indexes).
> 2.    When creating the searchers using Lucy::Search::IndexSearcher. This takes maximum time when running in loop for all indexed directories.

It sounds like you're working with an excessively large number of indices.
Maybe you should simply rethink your approach and use a single index? If
you're concerned about locking, maybe a separate indexing process with some
kind of notification mechanism would help?

Nick


RE: [lucy-user] Speed up Search with Lucy::Search::IndexSearcher and Lucy::Search::PolySearcher from multiple index folders

Gupta, Rajiv
Hi Nick,

Thanks for your reply! Another constraint with the single-index approach is that our index locations are dynamic, and searches go through APIs which construct the index location from certain input parameters.

However, I modified the code to fork a process for the work. In each child a searcher object is created and destroyed, and I collect all search hits into an array that I process later. I did not use PolySearcher. This improved the speed.
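That approach can be sketched roughly like this (hypothetical field names; the essential point is that only plain Perl data, never the IndexSearcher itself, crosses the process boundary, so Storable can serialize it):

```perl
use strict;
use warnings;
use Parallel::ForkManager;
use Lucy::Search::IndexSearcher;

my @all_hits;
my $pm = Parallel::ForkManager->new($max_procs);

$pm->run_on_finish(
    sub {
        my ( $pid, $exit, $ident, $signal, $core, $hits_ref ) = @_;
        # $hits_ref is a plain arrayref of hashrefs, which Storable
        # can serialize (unlike an IndexSearcher object).
        push @all_hits, @$hits_ref if $hits_ref;
    }
);

for my $index (@all_dirs) {
    $pm->start($index) and next;    # child continues below; parent loops on

    my $searcher = Lucy::Search::IndexSearcher->new( index => $index );
    my $hits     = $searcher->hits( query => $query, num_wanted => -1 );

    my @plain;
    while ( my $hit = $hits->next ) {
        # Copy only the needed fields into plain data structures.
        push @plain, { title => $hit->{title}, score => $hit->get_score };
    }
    $pm->finish( 0, \@plain );      # searcher is destroyed with the child
}

$pm->wait_all_children;
# @all_hits now holds every hit from every index, ready for post-processing.
```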

Thanks,
Rajiv Gupta



RE: [lucy-user] Speed up Search with Lucy::Search::IndexSearcher and Lucy::Search::PolySearcher from multiple index folders

Gupta, Rajiv
In reply to this post by Marvin Humphrey
Thanks Marvin for your reply!

I will try add_index.

-Rajiv
