[lucy-user] Lucy and Coro/AnyEvent

14 messages

[lucy-user] Lucy and Coro/AnyEvent

Gerald Richter
Hi,

As far as I can see, all calls to Lucy are synchronous.

Is there a way to use it together with AnyEvent and/or Coro without
blocking the whole system for the duration of the Lucy calls?

Thanks & Regards

Gerald


Re: [lucy-user] Lucy and Coro/AnyEvent

Marvin Humphrey
On Sat, Oct 17, 2015 at 11:11 AM, Gerald Richter <[hidden email]> wrote:
> Hi,
>
> As far as I can see, all calls to Lucy are synchronous.
>
> Is there a way to use it together with AnyEvent and/or Coro without
> blocking the whole system for the duration of the Lucy calls?

Hi Gerald,

The only way I think it could work would be to launch a concurrent
independent process/thread on which Lucy does work. A call to interact
with the Lucy thread would then fire off work to be done on the
separate thread and register a callback signaling the main thread when
the work is done. That's effectively what we do in
LucyX::Remote::ClusterSearcher, though that's using a select loop.
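
As a rough sketch of that idea (untested, and only an assumption about your
setup): AnyEvent::Util::fork_call runs the blocking Lucy search in a forked
child process rather than a thread, and delivers the results to a callback in
the parent, so the event loop never blocks. The index path, query string, and
field names here are placeholders:

    use AnyEvent;
    use AnyEvent::Util qw(fork_call);
    use Lucy::Search::IndexSearcher;

    my $done = AnyEvent->condvar;

    fork_call {
        # Child process: blocking Lucy calls are harmless here.
        my $searcher = Lucy::Search::IndexSearcher->new(index => $path_to_index);
        my $hits = $searcher->hits(query => 'foo', num_wanted => 10);
        my @fields;
        while (my $hit = $hits->next) {
            push @fields, $hit->get_fields;
        }
        return @fields;    # results are serialized back to the parent
    } sub {
        my @fields = @_;   # parent: invoked from the event loop when done
        $done->send(\@fields);
    };

    my $results = $done->recv;

Each query pays the cost of a fork, so for heavy traffic a persistent worker
process (as with the ClusterSearcher approach) would amortize that cost.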

Marvin Humphrey

Re: [lucy-user] Lucy and Coro/AnyEvent

Gerald Richter
Hi Marvin,

thanks for your feedback.

Using threads, as IO::AIO/AnyEvent::AIO does, would be my preferred approach.

Is the searcher thread-safe?

Is there any documentation about the C interface of Lucy?

Thanks & Regards

Gerald


On 19.10.2015 at 21:54, Marvin Humphrey wrote:

> On Sat, Oct 17, 2015 at 11:11 AM, Gerald Richter <[hidden email]> wrote:
>> Hi,
>>
>> As far as I can see, all calls to Lucy are synchronous.
>>
>> Is there a way to use it together with AnyEvent and/or Coro without
>> blocking the whole system for the duration of the Lucy calls?
> Hi Gerald,
>
> The only way I think it could work would be to launch a concurrent
> independent process/thread on which Lucy does work. A call to interact
> with the Lucy thread would then fire off work to be done on the
> separate thread and register a callback signaling the main thread when
> the work is done. That's effectively what we do in
> LucyX::Remote::ClusterSearcher, though that's using a select loop.
>
> Marvin Humphrey
>

Re: [lucy-user] Lucy and Coro/AnyEvent

Nick Wellnhofer
On 20/10/2015 09:43, Gerald Richter wrote:
> Is the searcher thread-safe?

In a strict sense, no. But the code is reentrant, so it's possible to use
multiple searchers from separate threads as long as each searcher is only used
by a single thread.

A user-supplied locking mechanism should work, too. But the Perl bindings use
the CLONE_SKIP facility, so Lucy objects can't be shared across Perl threads
anyway.
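
To illustrate (an untested sketch, assuming a perl with ithreads compiled in):
construct each searcher inside its own thread, since CLONE_SKIP means a
searcher created before threads->create would not survive into the child
thread anyway. The index path, query list, and the stored 'id' field are
placeholders:

    use threads;
    use Lucy::Search::IndexSearcher;

    my @workers = map {
        my $query = $_;
        threads->create(sub {
            # Each thread builds its own searcher from scratch.
            my $searcher = Lucy::Search::IndexSearcher->new(index => $path_to_index);
            my $hits = $searcher->hits(query => $query);
            my @ids;
            while (my $hit = $hits->next) {
                push @ids, $hit->{id};    # 'id' is a hypothetical stored field
            }
            return @ids;
        });
    } @queries;

    my @results = map { [ $_->join ] } @workers;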

> Is there any documentation about the C interface of Lucy?

The C interface will be fully documented in the upcoming 0.5 release. You can
find a preview here:

 
     https://rawgit.com/nwellnhof/lucy/generated_docs/c/autogen/share/doc/clownfish/lucy.html

Here is some sample code:

     https://github.com/apache/lucy/tree/master/c/sample

Nick


[lucy-user] how to get distinct values of a field

Gerald Richter
Hi,

I'd like to get all distinct values of a field, something that in SQL
would look like this:

select distinct fieldname from table

where fieldname is a StringType.

Is this possible with Lucy?

Thanks & Regards

Gerald





Re: [lucy-user] how to get distinct values of a field

Nick Wellnhofer
On 02/11/2015 08:54, Gerald Richter wrote:
> I'd like to get all distinct values of a field, something that in SQL
> would look like this:
>
> select distinct fieldname from table
>
> where fieldname is a StringType.
>
> Is this possible with Lucy?

The easiest way (using a PolyLexiconReader under the hood):

     my $index = Lucy::Index::IndexReader->open(index => $path_to_index);
     my $lex_reader = $index->obtain('Lucy::Index::LexiconReader');
     my $lexicon = $lex_reader->lexicon(field => $field_name);
     my @terms;

     while ($lexicon->next) {
         push(@terms, $lexicon->get_term);
     }

Depending on the size of your index and the number of segments, it might be
more efficient to merge the terms from multiple segments manually:

     my $index = Lucy::Index::IndexReader->open(index => $path_to_index);
     my $seg_readers = $index->seg_readers;
     my %term_hash;

     for my $seg_reader (@$seg_readers) {
         my $lex_reader = $seg_reader->obtain('Lucy::Index::LexiconReader');
         my $lexicon = $lex_reader->lexicon(field => $field_name);

         while ($lexicon->next) {
             my $term = $lexicon->get_term;
             $term_hash{$term} = undef;
         }
     }

     my @terms = keys(%term_hash);

Note that these examples also work with full text fields.

Nick


Re: [lucy-user] how to get distinct values of a field

Gerald Richter
That works great!

Thanks

Gerald


On 02.11.2015 at 14:14, Nick Wellnhofer wrote:

>     my $index = Lucy::Index::IndexReader->open(index => $path_to_index);
>     my $lex_reader = $index->obtain('Lucy::Index::LexiconReader');
>     my $lexicon = $lex_reader->lexicon(field => $field_name);
>     my @terms;
>
>     while ($lexicon->next) {
>         push(@terms, $lexicon->get_term);
>     }
>
> Depending on the size of your index and the number of segments, it
> might be more efficient to merge the terms from multiple segments
> manually:
>
>     my $index = Lucy::Index::IndexReader->open(index => $path_to_index);
>     my $seg_readers = $index->seg_readers;
>     my %term_hash;
>
>     for my $seg_reader (@$seg_readers) {
>         my $lex_reader = $seg_reader->obtain('Lucy::Index::LexiconReader');
>         my $lexicon = $lex_reader->lexicon(field => $field_name);
>
>         while ($lexicon->next) {
>             my $term = $lexicon->get_term;
>             $term_hash{$term} = undef;
>         }
>     }
>
>     my @terms = keys(%term_hash);


[lucy-user] Strange results when documents get deleted while iterating

Gerald Richter
In reply to this post by Nick Wellnhofer
Hi,

I have a simple query that consists of a TermQuery and a RangeQuery, and I
am iterating over it like this:

         while ($cnt-- >= 0 && ($hit = $hits->next))
             {
             $data = $hit->get_fields();
             ....
             }

While this loop runs, documents are deleted from the index by another
process. Without this other process everything is fine. When this
deletion is happening, it seems that half of the documents returned by
$hits->next are wrong, which means I get a totally different document
that should not be part of the result set.

I thought that a searcher operates on a snapshot, so changes that
happen at the same time do not influence the query. Is this wrong? If
so, how can I make sure my result set is not corrupted?

Thanks & Regards

Gerald





Re: [lucy-user] Strange results when documents get deleted while iterating

Marvin Humphrey
On Wed, Nov 18, 2015 at 10:22 PM, Gerald Richter <[hidden email]> wrote:

> I have a simple query that consists of a TermQuery and a RangeQuery, and I
> am iterating over it like this:
>
>         while ($cnt-- >= 0 && ($hit = $hits->next))
>             {
>             $data = $hit->get_fields();
>             ....
>             }
>
> While this loop runs, documents are deleted from the index by another
> process. Without this other process everything is fine. When this deletion
> is happening, it seems that half of the documents returned by
> $hits->next are wrong, which means I get a totally different document
> that should not be part of the result set.
>
> I thought that a searcher operates on a snapshot, so changes that happen at
> the same time do not influence the query. Is this wrong? If so, how can
> I make sure my result set is not corrupted?

What kind of a Searcher is this?  If it's an IndexSearcher operating
on a local index, I don't see how it could happen.  But if it's a
ClusterSearcher, then it would be possible if the remotes are being
refreshed.

Marvin Humphrey

Re: [lucy-user] Strange results when documents get deleted while iterating

Gerald Richter - ECOS Technology
Hi,

It's a local IndexSearcher.

I have done a lot of tests and it's really happening.

Let me give you a little more details, maybe this helps:

- I call a function that creates a new IndexSearcher and calls $hits = $searcher->hits.
- I iterate over the first few entries and return the entries and the $hits.
- The documents that were found are deleted from a database, which in turn deletes the documents from the Lucy index.
- Now I iterate over the next few entries and delete them, and so on.

I have made a small test where only two entries are fetched per iteration. The result looks like this:

      id  => "8b8bce64e69b52ed244671009c11ee0e",
      id  => "8b8bce64e69b52ed244671009c4857e7",
      id  => "4a3dcd6c2e9e3074d2d52b8e72584b68",
      id  => "8b8bce64e69b52ed244671009c730dc9",
      id  => "4a3dcd6c2e9e3074d2d52b8e72584d19",
      id  => "8b8bce64e69b52ed244671009c7e3974",
      id  => "4a3dcd6c2e9e3074d2d52b8e72585475",
      id  => "8b8bce64e69b52ed244671009c7e4788",
      id  => "4a3dcd6c2e9e3074d2d52b8e72585dc2",
      id  => "8b8bce64e69b52ed244671009c7e2fa6",

id is some value I store in the document. The result should only contain ids starting with 8.

So you see the first two are correct; after the deletion of these two (always in a different process), the next time the first one I get is wrong and the second one is correct...

If I do not delete anything, I only get the right entries (I just commented out one line; the rest is still the same).

Any clue?

Thanks & Regards

Gerald




-----Original Message-----
From: Marvin Humphrey [mailto:[hidden email]]
Sent: Thursday, 19 November 2015 12:19
To: [hidden email]
Subject: Re: [lucy-user] Strange results when documents get deleted while iterating

On Wed, Nov 18, 2015 at 10:22 PM, Gerald Richter <[hidden email]> wrote:

> I have a simple query that consists of a TermQuery and a RangeQuery, and I
> am iterating over it like this:
>
>         while ($cnt-- >= 0 && ($hit = $hits->next))
>             {
>             $data = $hit->get_fields();
>             ....
>             }
>
> While this loop runs, documents are deleted from the index by another
> process. Without this other process everything is fine. When this deletion
> is happening, it seems that half of the documents returned by
> $hits->next are wrong, which means I get a totally different document
> that should not be part of the result set.
>
> I thought that a searcher operates on a snapshot, so changes that happen at
> the same time do not influence the query. Is this wrong? If so, how can
> I make sure my result set is not corrupted?

What kind of a Searcher is this?  If it's an IndexSearcher operating
on a local index, I don't see how it could happen.  But if it's a
ClusterSearcher, then it would be possible if the remotes are being
refreshed.

Marvin Humphrey



Re: [lucy-user] Strange results when documents get deleted while iterating

Marvin Humphrey
In reply to this post by Gerald Richter - ECOS Technology
On Thu, Nov 19, 2015 at 4:39 AM, Gerald Richter - ECOS Technology
<[hidden email]> wrote:

> Hi,
>
> It's a local IndexSearcher.
>
> I have done a lot of tests and it's really happening.
>
> Let me give you a little more details, maybe this helps:
>
> - I call a function that creates a new IndexSearcher and calls $hits = $searcher->hits.
> - I iterate over the first few entries and return the entries and the $hits.
> - The documents that were found are deleted from a database, which in turn deletes the documents from the Lucy index.
> - Now I iterate over the next few entries and delete them, and so on.
>
> I have made a small test where only two entries are fetched per iteration. The result looks like this:
>
>       id  => "8b8bce64e69b52ed244671009c11ee0e",
>       id  => "8b8bce64e69b52ed244671009c4857e7",
>       id  => "4a3dcd6c2e9e3074d2d52b8e72584b68",
>       id  => "8b8bce64e69b52ed244671009c730dc9",
>       id  => "4a3dcd6c2e9e3074d2d52b8e72584d19",
>       id  => "8b8bce64e69b52ed244671009c7e3974",
>       id  => "4a3dcd6c2e9e3074d2d52b8e72585475",
>       id  => "8b8bce64e69b52ed244671009c7e4788",
>       id  => "4a3dcd6c2e9e3074d2d52b8e72585dc2",
>       id  => "8b8bce64e69b52ed244671009c7e2fa6",
>
> id is some value I store in the document. The result should only contain ids starting with 8.
>
> So you see the first two are correct; after the deletion of these two (always in a different process), the next time the first one I get is wrong and the second one is correct...
>
> If I do not delete anything, I only get the right entries (I just commented out one line; the rest is still the same).
>
> Any clue?

When documents in an old segment are marked as deleted, that information is
written to a bitmap deletions file which is written to a new segment.  Old
readers are not supposed to know about new segments.  So for something to go
wrong, either 1) information in an old segment would have to be corrupted, 2)
a reader would have to somehow find out about information in a new segment, or
3) something else unrelated.

Indexers write index data (including new deletions data referencing documents
in old segments) to temp files in a new segment, which are then consolidated
into a single per-segment "compound file" named "cf.dat".  When a reader
opens, it mmaps cf.dat for each segment in the snapshot.  Once the reader
successfully opens all the files it needs, it never goes looking for new
files.
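
That snapshot behavior can be demonstrated with a quick sketch (untested;
the index path, query, stored 'id' field, and $doomed_id are placeholders):

    use Lucy::Search::IndexSearcher;
    use Lucy::Index::Indexer;

    my $searcher = Lucy::Search::IndexSearcher->new(index => $path_to_index);
    my $before = $searcher->hits(query => $query)->total_hits;

    # Delete and commit from a separate Indexer, as your other process does.
    my $indexer = Lucy::Index::Indexer->new(index => $path_to_index);
    $indexer->delete_by_term(field => 'id', term => $doomed_id);
    $indexer->commit;

    # The old searcher still reads its original snapshot...
    my $after = $searcher->hits(query => $query)->total_hits;    # same as $before

    # ...while a freshly opened searcher picks up the new snapshot,
    # minus the deleted document.
    my $fresh = Lucy::Search::IndexSearcher->new(index => $path_to_index);
    my $count = $fresh->hits(query => $query)->total_hits;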

It's hard to imagine a mechanism that would either cause an existing "cf.dat"
file to be modified, or persuade a reader to go look at a new "cf.dat"
file.  So unless my reasoning is wrong, the cause is #3 -- something else
unrelated.  I really have no idea what that could be, though since you've
previously asked some questions about Coro/AnyEvent and other concurrency
stuff the most likely prospect would seem to be something unique to your
setup.

The next step is probably to take the behavior you've been able to reproduce
and isolate it in a test case that others can run and analyze.

Marvin Humphrey

Re: [lucy-user] Strange results when documents get deleted while iterating

Gerald Richter
Thanks for the detailed explanation. Yes, I am using Coro, but in this
special test case only one Coro thread was running.

After restarting all processes, the issue went away. I still don't
really understand what was going on, but since the restart (a few days
ago) everything has worked as expected.

Regards

Gerald


On 19.11.2015 at 16:03, Marvin Humphrey wrote:

> On Thu, Nov 19, 2015 at 4:39 AM, Gerald Richter - ECOS Technology
> <[hidden email]> wrote:
>> Hi,
>>
>> It's a local IndexSearcher.
>>
>> I have done a lot of tests and it's really happening.
>>
>> Let me give you a little more details, maybe this helps:
>>
>> - I call a function that creates a new IndexSearcher and calls $hits = $searcher->hits.
>> - I iterate over the first few entries and return the entries and the $hits.
>> - The documents that were found are deleted from a database, which in turn deletes the documents from the Lucy index.
>> - Now I iterate over the next few entries and delete them, and so on.
>>
>> I have made a small test where only two entries are fetched per iteration. The result looks like this:
>>
>>        id  => "8b8bce64e69b52ed244671009c11ee0e",
>>        id  => "8b8bce64e69b52ed244671009c4857e7",
>>        id  => "4a3dcd6c2e9e3074d2d52b8e72584b68",
>>        id  => "8b8bce64e69b52ed244671009c730dc9",
>>        id  => "4a3dcd6c2e9e3074d2d52b8e72584d19",
>>        id  => "8b8bce64e69b52ed244671009c7e3974",
>>        id  => "4a3dcd6c2e9e3074d2d52b8e72585475",
>>        id  => "8b8bce64e69b52ed244671009c7e4788",
>>        id  => "4a3dcd6c2e9e3074d2d52b8e72585dc2",
>>        id  => "8b8bce64e69b52ed244671009c7e2fa6",
>>
>> id is some value I store in the document. The result should only contain ids starting with 8.
>>
>> So you see the first two are correct; after the deletion of these two (always in a different process), the next time the first one I get is wrong and the second one is correct...
>>
>> If I do not delete anything, I only get the right entries (I just commented out one line; the rest is still the same).
>>
>> Any clue?
> When documents in an old segment are marked as deleted, that information is
> written to a bitmap deletions file which is written to a new segment.  Old
> readers are not supposed to know about new segments.  So for something to go
> wrong, either 1) information in an old segment would have to be corrupted, 2)
> a reader would have to somehow find out about information in a new segment, or
> 3) something else unrelated.
>
> Indexers write index data (including new deletions data referencing documents
> in old segments) to temp files in a new segment, which are then consolidated
> into a single per-segment "compound file" named "cf.dat".  When a reader
> opens, it mmaps cf.dat for each segment in the snapshot.  Once the reader
> successfully opens all the files it needs, it never goes looking for new
> files.
>
> It's hard to imagine a mechanism that would either cause an existing "cf.dat"
> file to be modified, or persuade a reader to go look at a new "cf.dat"
> file.  So unless my reasoning is wrong, the cause is #3 -- something else
> unrelated.  I really have no idea what that could be, though since you've
> previously asked some questions about Coro/AnyEvent and other concurrency
> stuff the most likely prospect would seem to be something unique to your
> setup.
>
> The next step is probably to take the behavior you've been able to reproduce
> and isolate it in a test case that others can run and analyze.
>
> Marvin Humphrey
>


Re: [lucy-user] Strange results when documents get deleted while iterating

Marvin Humphrey
Thanks for closing the loop, and glad that things seem to be working OK!

Marvin Humphrey

On Wed, Nov 25, 2015 at 9:38 PM, Gerald Richter <[hidden email]> wrote:

> Thanks for the detailed explanation. Yes, I am using Coro, but in this
> special test case only one Coro thread was running.
>
> After restarting all processes, the issue went away. I still don't
> really understand what was going on, but since the restart (a few days ago)
> everything has worked as expected.
>
> Regards
>
> Gerald