[lucy-user] Regarding document Ids

[lucy-user] Regarding document Ids

serkanmulayim@gmail.com
Hi,

As far as I can see, if we add the same document twice, Lucy creates a new
document. As per http://lucy.apache.org/docs/c/Lucy/Docs/DocIDs.html, "If
you truly need a primary key field, you must define it and populate it
yourself". Can you please elaborate on this? Does it mean choosing a
field to be the primary key, deleting the document with that primary key, and
re-adding it? If so, the document might not have been created until we commit,
so deletion would not be possible, right? Performance would also be an
issue.

Another solution might be to hash the "primary key" and use the hash as the
document ID (but the referenced page also says that doc IDs are ephemeral).
Even if the ephemerality of the doc ID is not a problem, I am concerned about
collisions, considering that I might need to have many documents in the
same index. This boils down to the birthday problem, and we might not be
safe within the range of an integer.
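
To put rough numbers on the birthday-problem concern, here is a minimal Python sketch. It is not part of Lucy and says nothing about whether doc IDs can be assigned by hand. It assumes the hash spreads keys uniformly and uses the standard approximation p ≈ 1 - exp(-n(n-1)/(2N)) for n keys in a space of N values. The function name and the sample figure of 40,000 keys are illustrative assumptions.

```python
import math

def collision_probability(num_keys: int, hash_bits: int) -> float:
    """Approximate chance that at least two of num_keys uniformly hashed
    keys share a value in a space of 2**hash_bits possible hashes."""
    space = 2 ** hash_bits
    pairs = num_keys * (num_keys - 1) / 2
    # Birthday approximation: p ~= 1 - exp(-pairs / space).
    # expm1 keeps precision when the probability is tiny.
    return -math.expm1(-pairs / space)

# Illustrative sizes only: 31 bits models a positive signed 32-bit integer.
for bits in (31, 32, 64):
    p = collision_probability(40_000, bits)
    print(f"{bits}-bit hash, 40,000 keys: p ~= {p:.3g}")
```

Under these assumptions, 40,000 keys in a 31-bit space collide with a probability of roughly 0.31, so the worry about an integer-sized hash is reasonable, while a 64-bit space makes a collision very unlikely at this scale.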

Do you have any suggestions about this one?

Thanks,
Serkan

[lucy-user] Re: Regarding document Ids

serkanmulayim@gmail.com
Hi guys,

I think I need to simplify my question. After reading it one more time, I
realized I touched on many things, and it seems confusing.

It seems that if we index the same document twice, a new document is
created. And as per http://lucy.apache.org/docs/c/Lucy/Docs/DocIDs.html, "If
you truly need a primary key field, you must define it and populate it
yourself". How can we do this? Are there any examples of this? Should I
search for the document with the primary key before indexing and, if it
exists, not index it?

Thanks,
Serkan

On Tue, Nov 15, 2016 at 2:22 PM, Serkan Mulayim <[hidden email]>
wrote:

> Hi,
>
> As far as I see if we add the same document twice, it creates a new
> document. As per http://lucy.apache.org/docs/c/Lucy/Docs/DocIDs.html, " If
> you truly need a primary key field, you must define it and populate it
> yourself". Can you please elaborate on this one? Does it mean choosing a
> field to be primary key and delete the document with the primary key and
> re-add it? If so the document might have not been created until we commit,
> so deletion would not be possible, right? Also performance would be another
> issue.
>
> Another solution might be hashing the "primary key" and put it as the
> documentId (but the referred page also says that docIds are ephemeral). If
> the ephemeralness of the docId is not a problem, my concern is regarding
> the collisions considering that I might need to have many documents in the
> same index. This boils down to the birthday problem and we might not be
> safe in the range of an integer.
>
> Do you have any suggestions about this one?
>
> Thanks,
> Serkan
>

Re: [lucy-user] Re: Regarding document Ids

Peter Karman
Serkan Mulayim wrote on 11/16/16, 1:17 PM:

> Hi guys,
>
> I think I need to simplify my question. After reading it one more time, I
> realized I touched many things, and it seem confusing.
>
> It seems like if we index the same document twice, a new document is
> created. And as per http://lucy.apache.org/docs/c/Lucy/Docs/DocIDs.html, " If
> you truly need a primary key field, you must define it and populate it
> yourself". How can we do this, are there any examples around this? Should I
> search for the document with the primary key before indexing and if it
> exists, should I not index it?

What I do in all my apps is use delete_by_term
https://metacpan.org/pod/distribution/Lucy/lib/Lucy/Index/Indexer.pod#delete_by_term

I have my own primary key system that varies based on the application. Sometimes
it is a URI, sometimes a db PK. I maintain the document integrity myself.

One example from how Dezi solves this more generally:

https://metacpan.org/source/KARMAN/Dezi-App-0.014/lib/Dezi/Lucy/Indexer.pm#L451

Lucy isn't an RDBMS. It just tokenizes the fields you shove into it, and
retrieves them very quickly.
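
The delete-then-add pattern described above can be illustrated with a toy in-memory stand-in written in Python. This is only a sketch of the idea and not Lucy's API: `ToyIndex`, `upsert`, and the `id` key field are hypothetical names, and the toy does not model segments, indexing sessions, or commits. Only the method name `delete_by_term` is borrowed from the Lucy method mentioned above.

```python
class ToyIndex:
    """Hypothetical in-memory stand-in for an index; not Lucy."""

    def __init__(self):
        self.docs = []

    def add_doc(self, doc):
        # Like the behaviour discussed in this thread, adding never
        # deduplicates: every call stores another document.
        self.docs.append(dict(doc))

    def delete_by_term(self, field, term):
        # Drop every stored document whose field equals the term exactly.
        self.docs = [d for d in self.docs if d.get(field) != term]


def upsert(index, doc, key_field="id"):
    """Application-level primary key: delete any old copy, then add."""
    index.delete_by_term(key_field, doc[key_field])
    index.add_doc(doc)


plain = ToyIndex()
plain.add_doc({"id": "doc-1", "title": "first version"})
plain.add_doc({"id": "doc-1", "title": "second version"})

keyed = ToyIndex()
upsert(keyed, {"id": "doc-1", "title": "first version"})
upsert(keyed, {"id": "doc-1", "title": "second version"})

print(len(plain.docs), len(keyed.docs))
```

In the toy, plain adds leave two copies of `doc-1`, while the keyed path leaves one copy holding the latest title. For this to carry over to a real index, the key would presumably need to live in a field that is indexed as a single exact term rather than tokenized.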


--
Peter Karman  .  http://peknet.com/  .  [hidden email]

Re: [lucy-user] Re: Regarding document Ids

serkanmulayim@gmail.com
Thank you Peter for your quick response.

As I understand it, before adding new documents to the index, you delete by
query (using your primary key). How is the performance on your end,
then? Since delete by query will search through all segments in the index
for the deletion, I feel like performance would be affected. Roughly,
how many documents do you have in your index, and what is the document size?

BTW, my documents are very small, and I think I will have around 40K
documents.

Thanks,
Serkan

On Wed, Nov 16, 2016 at 11:25 AM, Peter Karman <[hidden email]> wrote:

> Serkan Mulayim wrote on 11/16/16, 1:17 PM:
>
>> Hi guys,
>>
>> I think I need to simplify my question. After reading it one more time, I
>> realized I touched many things, and it seem confusing.
>>
>> It seems like if we index the same document twice, a new document is
>> created. And as per http://lucy.apache.org/docs/c/Lucy/Docs/DocIDs.html, " If
>> you truly need a primary key field, you must define it and populate it
>> yourself". How can we do this, are there any examples around this? Should I
>> search for the document with the primary key before indexing and if it
>> exists, should I not index it?
>>
>
> What I do in all my apps is use delete_by_term
> https://metacpan.org/pod/distribution/Lucy/lib/Lucy/Index/Indexer.pod#delete_by_term
>
> I have my own primary key system that varies based on the application.
> Sometimes it is a URI, sometimes a db PK. I maintain the document integrity
> myself.
>
> One example from how Dezi solves this more generally:
>
> https://metacpan.org/source/KARMAN/Dezi-App-0.014/lib/Dezi/Lucy/Indexer.pm#L451
>
> Lucy isn't a RDBMS. It just tokenizes the fields you shove into it, and
> retrieves very quickly.
>
>
> --
> Peter Karman  .  http://peknet.com/  .  [hidden email]
>

Re: [lucy-user] Re: Regarding document Ids

Peter Karman
Serkan Mulayim wrote on 11/16/16, 2:21 PM:

> Thank you Peter for your quick response.
>
> As I understand before adding new documents to the index, you delete by
> query (by using your primary key). How is the performance in your end,
> then? Since delete by query will search through all segments in the index
> for the deletion, I feel like the performance would be affected. Roughly,
> how many documents do you have in your index, and what is the document size?
>
> BTW, my document sizes are very small, and I think I will have around 40K
> documents.
>

Performance is fast enough for me. I have 1MM+ docs but not much churn (I'm not
updating docs constantly). IME the bottleneck is not the search. It's a search
engine; it's pretty fast. The bottleneck is updating the index. That's true
whether you delete first or not.


--
Peter Karman  .  http://peknet.com/  .  [hidden email]

Re: [lucy-user] Re: Regarding document Ids

serkanmulayim@gmail.com
Thank you Peter for your comments. Regards...

On Wed, Nov 16, 2016 at 3:05 PM, Peter Karman <[hidden email]> wrote:

> Serkan Mulayim wrote on 11/16/16, 2:21 PM:
>
>> Thank you Peter for your quick response.
>>
>> As I understand before adding new documents to the index, you delete by
>> query (by using your primary key). How is the performance in your end,
>> then? Since delete by query will search through all segments in the index
>> for the deletion, I feel like the performance would be affected. Roughly,
>> how many documents do you have in your index, and what is the document
>> size?
>>
>> BTW, my document sizes are very small, and I think I will have around 40K
>> documents.
>>
>>
> performance is fast enough for me. I have 1MM+ docs but not much churn
> (not updating docs constantly). IME the bottleneck is not the search. It's
> a search engine; it's pretty fast. The bottleneck is updating the index.
> That's true whether you delete first or not.
>
>
>
> --
> Peter Karman  .  http://peknet.com/  .  [hidden email]
>