[lucy-user] Get doc_id during indexing?

Aleksandar Radovanovic
Hi there,

I was wondering whether it is possible to get the doc_id during the
indexing process, or whether I can simply assume that doc_id starts from 0
and increments with each record added.

Basically, I need something like the SQL statement
INSERT INTO tbl (name) VALUES ('John') RETURNING id
so that after each INSERT I can extend the list of document IDs in which
the name John appears.

For example, I want to make a hash which maps some people's names to a
list of internal doc_ids:

my %keyword_to_doc_id;
while (...) {
    my $content = ...;    # get a document
    my $keyword = ...;    # get a person's name

    $indexer->add_doc( { doc_content => $content, ... } );

    # <doc_id> is a placeholder for the id of the document just added
    push @{ $keyword_to_doc_id{$keyword} }, <doc_id>
        if index( $content, $keyword ) >= 0;    # $keyword occurs in $content
}
$indexer->commit;
...

The aim is to make another index of the keywords appearing in the indexed
documents, without a time-consuming search of the previously created index
for millions of predefined keywords.

For text-mining purposes, I can later analyze only the index of predefined
keywords (metadata), and extend the search to the much bigger document
index only when needed.

Alex

Re: [lucy-user] Get doc_id during indexing?

Peter Karman
On 1/14/14 3:03 AM, Aleksandar Radovanovic wrote:
> Hi there,
>
> I was wondering is it possible to get doc_id during the indexing
> process, or can I simply assume that doc_id starts from 0 and increments
> with each record added?
>
>

Even if you could, I would not recommend that approach for solving your
problem. The doc_id is an internal implementation detail.

Instead, why not assign a unique term (like a URI) to each document in
your index, and reference that externally?
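
A minimal sketch of that idea, given here only as an illustration and not taken from the thread: each record carries its own stable identifier, and the keyword map stores that identifier instead of the internal doc_id. The helper name build_keyword_map, the record layout, and the field names uri and doc_content are all assumptions made for the example. The indexer is passed in as an optional callback so that the sketch runs without Lucy installed.

```perl
use strict;
use warnings;

# Illustrative sketch only. Each record carries its own stable identifier
# (a URI here), so the keyword map never depends on the internal doc_id.
# build_keyword_map, the record layout and the field names are hypothetical.
sub build_keyword_map {
    my ( $records, $add_doc ) = @_;
    my %keyword_to_uris;
    for my $rec (@$records) {
        # Hand the document to the indexer, with the URI stored as a field,
        # e.g. $add_doc = sub { $indexer->add_doc( $_[0] ) }.
        $add_doc->( { uri => $rec->{uri}, doc_content => $rec->{content} } )
            if $add_doc;

        # Record the URI when the keyword literally occurs in the content.
        push @{ $keyword_to_uris{ $rec->{keyword} } }, $rec->{uri}
            if index( $rec->{content}, $rec->{keyword} ) >= 0;
    }
    return \%keyword_to_uris;
}

my @records = (
    { uri => 'doc:1', keyword => 'John', content => 'John met Mary.' },
    { uri => 'doc:2', keyword => 'John', content => 'No names here.' },
    { uri => 'doc:3', keyword => 'John', content => 'Call John today.' },
);
my $map = build_keyword_map( \@records );
print join( ',', @{ $map->{John} } ), "\n";    # prints "doc:1,doc:3"
```

With a real Lucy indexer, the uri field would also have to be declared in the schema, presumably as a non-analyzed Lucy::Plan::StringType field, so that a document can later be looked up by that value no matter how its doc_id has changed.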

You could also, after indexing, iterate over the Lexicons in an index and
create a new index based on your keyword identification. Note that
'keyword' might be a misnomer depending on what Analysis classes you
apply to your documents: you might have phrases, etc., not just
single terms.


--
Peter Karman  .  http://peknet.com/  .  [hidden email]

Re: [lucy-user] Get doc_id during indexing?

Aleksandar Radovanovic
Thank you Peter.

Actually, I am using the method you suggested. I was thinking that
having another field for record identification adds overhead, since
the doc_id is the smallest and fastest (if I am not mistaken)
possible way to retrieve records.

Regards,
Alex


Re: [lucy-user] Get doc_id during indexing?

Peter Karman
On 1/14/14 11:44 AM, Aleksandar Radovanovic wrote:
> Thank you Peter.
>
> Actually, I am using the method you suggested. I was thinking that
> having another field for the record identification is an overhead since
> the doc_id  is the minimal and the fastest (if I am not mistaken)
> possible way to retrieve records.


The doc_id is ephemeral. It can change whenever an index changes
(segments getting merged, etc.).



--
Peter Karman  .  http://peknet.com/  .  [hidden email]

Re: [lucy-user] Get doc_id during indexing?

Aleksandar Radovanovic
Good to know.
I wish to thank you again for the amazing work on Lucy and other CPAN
modules.

Alex
