[lucy-user] Doc id from hits and remove redundant documents

Gupta, Rajiv
Hi,

I have 2 questions.


1.       Which field do I use to get the document ID from hits?
  my $hits = $searcher->hits(
      query      => $query_parsed,
      num_wanted => -1, # -1 is equivalent to all results
  );
  while ( my $hit = $hits->next() ) {
      print "Document id: " . $hit->{???};
  }


2.       While inserting records, how can I avoid inserting duplicates?
Somehow in my process the same file is reopened multiple times, and each time indexing starts from the beginning of the file. So initially a few documents are added and the file is closed. Some time later more content is added to the file and I reopen it, and now the same set of documents is added again along with the additional content. Instead, I want to add documents only for the new additions to the file. I cannot use truncate, because documents from other files in the same folder would also be affected.

Thanks much!

Thanks,
Rajiv Gupta
Re: [lucy-user] Doc id from hits and remove redundant documents

Nick Wellnhofer
On 23/11/2016 15:33, Gupta, Rajiv wrote:
> 1.       Which field do I use to get the document ID from hits?
>   my $hits = $searcher->hits(
>       query      => $query_parsed,
>       num_wanted => -1, # -1 is equivalent to all results
>   );
>   while ( my $hit = $hits->next() ) {
>       print "Document id: " . $hit->{???};
>   }

$hits->next() returns the next hit as a Lucy::Document::HitDoc (or undef once the hits are exhausted):

     http://lucy.apache.org/docs/perl/Lucy/Document/HitDoc.html

HitDoc inherits from Lucy::Document::Doc which has a get_doc_id method:

     http://lucy.apache.org/docs/perl/Lucy/Document/Doc.html#get_doc_id

So you can get the doc ID with:

     my $doc_id = $hit->get_doc_id();
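
For readers who want to try this end to end, here is a minimal sketch that builds a throwaway in-memory index and prints the doc ID of each hit. The one-field schema, the sample titles, and the use of Lucy::Store::RAMFolder are assumptions made for illustration; they are not taken from the original poster's setup.

```perl
use strict;
use warnings;
use Lucy::Plan::Schema;
use Lucy::Plan::StringType;
use Lucy::Store::RAMFolder;
use Lucy::Index::Indexer;
use Lucy::Search::IndexSearcher;
use Lucy::Search::TermQuery;

# Hypothetical one-field schema, just enough to produce some hits.
my $schema = Lucy::Plan::Schema->new;
$schema->spec_field( name => 'title', type => Lucy::Plan::StringType->new );

# Build a throwaway in-memory index (assumption: a real setup would
# point at an index directory on disk instead).
my $folder  = Lucy::Store::RAMFolder->new;
my $indexer = Lucy::Index::Indexer->new(
    index  => $folder,
    schema => $schema,
    create => 1,
);
$indexer->add_doc( { title => $_ } ) for qw( 1 2 2 3 );
$indexer->commit;

# Search for documents whose title is exactly "2".
my $searcher = Lucy::Search::IndexSearcher->new( index => $folder );
my $hits     = $searcher->hits(
    query      => Lucy::Search::TermQuery->new( field => 'title', term => '2' ),
    num_wanted => 10,
);

# Each call to next() yields one HitDoc; stored fields are hash entries.
my @doc_ids;
while ( my $hit = $hits->next ) {
    push @doc_ids, $hit->get_doc_id;
    print "Document id: ", $hit->get_doc_id, " title: ", $hit->{title}, "\n";
}
```

Note that the doc ID comes from the get_doc_id method, while stored field values such as title are read as hash entries of the hit.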

> 2.       While inserting records, how can I avoid inserting duplicates?

You have to delete the old documents, using one of the delete_* methods in
Lucy::Index::Indexer:

     http://lucy.apache.org/docs/perl/Lucy/Index/Indexer.html

Typically, you use one of the fields in your schema as primary key and delete
documents using delete_by_term:

     $indexer->delete_by_term(my_primary_key => $value);

Nick

Re: [lucy-user] Doc id from hits and remove redundant documents

Nick Wellnhofer
On 23/11/2016 16:11, Nick Wellnhofer wrote:
> Typically, you use one of the fields in your schema as primary key and delete
> documents using delete_by_term:
>
>     $indexer->delete_by_term(my_primary_key => $value);

Oops, this should have been:

     $indexer->delete_by_term(
         field => 'my_primary_key',
         term  => $value,
     );

Nick

RE: [lucy-user] Doc id from hits and remove redundant documents

Gupta, Rajiv
Thanks for your reply Nick.

I wanted to delete the old documents, which is why I was trying to get the doc_id and use it to delete them. However, that did not help: it deleted other documents and kept changing the document. I wanted to use delete by term, but my documents don't have a primary key.

I add document like this:

$indexer->add_doc({
                title    => $mytitle,
                content  => substr($mybodytext,0,1024),
                url      => $onlyfilename,
                urlpath  => $filpath,
                position => $fileseektostart,
                linenum  => $filelinenumtostart,
                jobtype  => $self->{_logfile_hash}{$filetoindex}[5] ,
            });

The title is the key that I use to query for any search. Will the term be its value?

If the title values are [1,2,3,4], will this work?

$indexer->delete_by_term(
    field => 'title',  # required
    term  => 4,        # required
);

Thanks,
Rajiv

RE: [lucy-user] Doc id from hits and remove redundant documents

Gupta, Rajiv
Thanks Nick.

It worked with term and its value.

Re: [lucy-user] Doc id from hits and remove redundant documents

Nick Wellnhofer
On 23/11/2016 16:31, Gupta, Rajiv wrote:

> Thanks for your reply Nick.
>
> I wanted to delete the old documents, which is why I was trying to get the doc_id and use it to delete them. However, that did not help: it deleted other documents and kept changing the document. I wanted to use delete by term, but my documents don't have a primary key.
>
> I add document like this:
>
> $indexer->add_doc({
>                 title    => $mytitle,
>                 content  => substr($mybodytext,0,1024),
>                 url      => $onlyfilename,
>                 urlpath  => $filpath,
>                 position => $fileseektostart,
>                 linenum  => $filelinenumtostart,
>                 jobtype  => $self->{_logfile_hash}{$filetoindex}[5] ,
>             });

You can use any field as primary key if the field's value is guaranteed to be
unique for all your documents. But it seems that you index the contents of
files line by line, so "urlpath" isn't unique. Your primary key is probably
the tuple (urlpath, linenum).

If you update all the lines of a file at once, this isn't a problem. You can
simply delete all documents relating to the file with

     $indexer->delete_by_term(
         field => 'urlpath',
         term  => $filepath,
     );

If you only want to update certain lines, you'll have to construct an ANDQuery
for each line and use delete_by_query. For example:

     $indexer->delete_by_query(Lucy::Search::ANDQuery->new(
         children => [
             Lucy::Search::TermQuery->new(
                 field => 'urlpath',
                 term  => $filepath,
             ),
             Lucy::Search::TermQuery->new(
                 field => 'linenum',
                 term  => $linenum,
             ),
         ],
     ));

Or maybe use a RangeQuery to delete a contiguous range of lines.
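
The two deletion styles above (delete_by_query for a single line, delete_by_term for a whole file followed by re-adding it) can be tried out with a small in-memory sketch like the following. The schema, the file names a.log and b.log, and the live_docs helper are hypothetical and exist only for this illustration; each change runs in its own Indexer session because an Indexer cannot be reused after commit.

```perl
use strict;
use warnings;
use Lucy::Plan::Schema;
use Lucy::Plan::StringType;
use Lucy::Store::RAMFolder;
use Lucy::Index::Indexer;
use Lucy::Search::IndexSearcher;
use Lucy::Search::ANDQuery;
use Lucy::Search::TermQuery;
use Lucy::Search::MatchAllQuery;

# Hypothetical schema: two exact-match string fields.
my $schema = Lucy::Plan::Schema->new;
my $type   = Lucy::Plan::StringType->new;
$schema->spec_field( name => $_, type => $type ) for qw( urlpath linenum );

my $folder = Lucy::Store::RAMFolder->new;    # throwaway in-memory index

# Helper (hypothetical): number of documents currently visible to searches.
sub live_docs {
    my $searcher = Lucy::Search::IndexSearcher->new( index => $folder );
    my $hits = $searcher->hits( query => Lucy::Search::MatchAllQuery->new );
    return $hits->total_hits;
}

# Session 1: index three lines of a.log and two lines of b.log.
my $indexer = Lucy::Index::Indexer->new(
    index => $folder, schema => $schema, create => 1 );
$indexer->add_doc( { urlpath => 'a.log', linenum => $_ } ) for 1 .. 3;
$indexer->add_doc( { urlpath => 'b.log', linenum => $_ } ) for 1 .. 2;
$indexer->commit;
my $before = live_docs();

# Session 2: delete only line 2 of a.log via the (urlpath, linenum) tuple.
$indexer = Lucy::Index::Indexer->new( index => $folder, schema => $schema );
$indexer->delete_by_query(
    Lucy::Search::ANDQuery->new(
        children => [
            Lucy::Search::TermQuery->new( field => 'urlpath', term => 'a.log' ),
            Lucy::Search::TermQuery->new( field => 'linenum', term => '2' ),
        ],
    )
);
$indexer->commit;
my $after_line = live_docs();

# Session 3: b.log grew to four lines, so drop its old documents and
# re-add the whole file.
$indexer = Lucy::Index::Indexer->new( index => $folder, schema => $schema );
$indexer->delete_by_term( field => 'urlpath', term => 'b.log' );
$indexer->add_doc( { urlpath => 'b.log', linenum => $_ } ) for 1 .. 4;
$indexer->commit;
my $after_file = live_docs();

print "before: $before, after line delete: $after_line, ",
    "after file refresh: $after_file\n";
```

Deleting by urlpath only works as an exact match here because the field is a StringType; with an analyzed full-text field the path would be split into tokens and a single term would not identify the file.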

Nick

RE: [lucy-user] Doc id from hits and remove redundant documents

Gupta, Rajiv
What I'm doing now: since I have the line number and seek position, I move forward line by line from the last record that I got. I'm also adding an end_point marker, which I search for to decide whether to move forward.

Thanks,
Rajiv Gupta
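
The resume-from-last-position idea described in this message might look roughly like the sketch below. It uses only core Perl and leaves Lucy out entirely; the read_new_lines helper and the temporary file are assumptions for illustration, and in the poster's setup the saved offset would presumably come from the stored position field of the last indexed record.

```perl
use strict;
use warnings;
use File::Temp qw( tempfile );

# Read every complete line that appears after byte offset $offset.
# Returns the new lines plus the offset at which the next call should resume.
sub read_new_lines {
    my ( $path, $offset ) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    seek $fh, $offset, 0 or die "Cannot seek in $path: $!";
    my @lines;
    while ( my $line = <$fh> ) {
        last unless $line =~ /\n\z/;    # ignore a partially written last line
        chomp $line;
        push @lines, $line;
        $offset = tell $fh;             # remember where this line ended
    }
    close $fh;
    return ( \@lines, $offset );
}

# Demo: write two lines, read them, append a third, read only the new one.
my ( $out, $path ) = tempfile( UNLINK => 1 );
print {$out} "line 1\nline 2\n";
close $out;

my ( $first, $resume_at ) = read_new_lines( $path, 0 );

open my $append, '>>', $path or die "Cannot append to $path: $!";
print {$append} "line 3\n";
close $append;

my ( $second, $next ) = read_new_lines( $path, $resume_at );

print "first pass:  @$first\n";
print "second pass: @$second\n";
```

Only the lines returned by each call would be passed to add_doc, so reopening the file does not re-index documents that are already in the index.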
