Inserting a document into an index at a specified position

Jason Calabrese
All,

For performance reasons we keep our index of over a million documents ordered
alphabetically.  This way, for an alpha sort, we can just use the index order.
This works very well, but I'm now looking for a way to insert a single
document into the index at the correct position.

Is there any standard way to do this?

--Jason

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Re: Inserting a document into an index at a specified position

Jason Calabrese
All,

I sent this the other day, but didn't get any responses.  I'm hoping that it
was just missed, so I'm trying again.

There has to be a better way to insert a document into an index than
reindexing everything.

--Jason

On Wednesday 05 July 2006 5:06 pm, Jason Calabrese wrote:

> All,
>
> For performance reasons we keep our index of over a million documents
> ordered alphabetically.  This way, for an alpha sort, we can just use the
> index order. This works very well, but I'm now looking for a way to insert
> a single document into the index at the correct position.
>
> Is there any standard way to do this?
>
> --Jason

Re: Inserting a document into an index at a specified position

Erick Erickson
When you say you keep your documents ordered alphabetically, it's confusing
to me. Are you saying that you pre-sort all your documents and then insert
them one after another, so that the automatically generated internal Lucene
ID maps exactly to the alphabetical ordering? That is, for any document IDs
D1 and D2 with alphabetical representations C1 and C2 (whatever that means
for your documents), if D1 < D2 then C1 < C2?

The short answer is that you can't insert a document into a Lucene index and
have any control whatsoever over the assigned document ID. The assigned
document ID is always greater than the maximum document ID already in your
index.

But it doesn't make sense to try. You have documents A, B, D that you index.
They get IDs 1, 2, 3. Now you want to index document C. What sort of
document ID would you expect? 2.5? Or do I completely misunderstand your
problem?
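
A toy model may make the point concrete. The class below is plain Java, not
Lucene code; AppendOnlyIds and its methods are hypothetical names, and it
numbers documents from 0 rather than 1. It assumes only what is stated above:
each new document receives the next ID after the current maximum.

```java
import java.util.ArrayList;
import java.util.List;

public class AppendOnlyIds {
    // Toy stand-in for an index: a document's ID is simply its slot in this list.
    private final List<String> docs = new ArrayList<>();

    /** Appends a document and returns the ID it was assigned. */
    public int add(String name) {
        docs.add(name);
        return docs.size() - 1;
    }

    /** True if walking the documents in ID order also walks them alphabetically. */
    public boolean idOrderIsAlphabetical() {
        for (int i = 1; i < docs.size(); i++) {
            if (docs.get(i - 1).compareTo(docs.get(i)) > 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        AppendOnlyIds index = new AppendOnlyIds();
        index.add("A");
        index.add("B");
        index.add("D");
        System.out.println("sorted before adding C: " + index.idOrderIsAlphabetical());

        int idOfC = index.add("C"); // lands after D, not between B and D
        System.out.println("C got ID " + idOfC
                + ", sorted after adding C: " + index.idOrderIsAlphabetical());
    }
}
```

In this model the only way to restore the property is to rebuild the list in
sorted order, which is the reindexing you are trying to avoid.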

Would it work to just index a field for each document that contains the
alphabetical representation and use that for retrieval ordering? I *think*
you can use a FilteredTermEnum with a new Term("field", "") to enumerate all
the terms in an index (they are guaranteed to be in lexical order). Then you
let Lucene do your sorting. I'm a little fuzzy on how to go from a term to
its documents, but I suspect there's a way.
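
To illustrate the idea of walking terms in lexical order and going from each
term to its documents, here is a sketch in plain Java. It does not use the
Lucene API: the TreeMap is a hypothetical stand-in for one field's term
dictionary, and SortFieldSketch, add and docIdsInKeyOrder are made-up names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class SortFieldSketch {
    // Stand-in for a term dictionary: sort key -> IDs of the documents that carry it.
    // TreeMap keeps its keys in String natural order, mimicking lexically ordered terms.
    private final TreeMap<String, List<Integer>> terms = new TreeMap<>();
    private int nextDocId = 0;

    /** Appends a document with the given sort key and returns its (append-only) ID. */
    public int add(String sortKey) {
        int id = nextDocId++;
        terms.computeIfAbsent(sortKey, k -> new ArrayList<>()).add(id);
        return id;
    }

    /** Walks the keys in lexical order and collects the document IDs behind each key. */
    public List<Integer> docIdsInKeyOrder() {
        List<Integer> ordered = new ArrayList<>();
        for (Map.Entry<String, List<Integer>> entry : terms.entrySet()) {
            ordered.addAll(entry.getValue());
        }
        return ordered;
    }

    public static void main(String[] args) {
        SortFieldSketch index = new SortFieldSketch();
        for (String name : new String[] {"A", "B", "D", "C"}) {
            index.add(name);
        }
        // C was added last, yet it is visited between B and D.
        System.out.println(index.docIdsInKeyOrder());
    }
}
```

The point of the sketch is that the ordering lives in the sorted keys, so a
late document does not need a particular ID to appear in the right place.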

Hope this helps
Erick

Re: Inserting a document into an index at a specified position

Jason Calabrese
> When you say you keep your documents ordered alphabetically, it's confusing
> to me. Are you saying that you pre-sort all your documents then insert them
> one after another so that automatically-generated internal Lucene ID maps
> exactly to the alphabetical ordering? That is, for any document IDs D1 and
> D2 and any documents C1 and C2 (where C1 and C2 are the alphabetical
> representations of the documents, whatever that means) if D1 < D2 then C1 <
> C2?

Yes, this is a pre-sort. For our application we have some fairly large result
sets, and using the standard sort on a name field was too slow.  By
pre-sorting before we index, we can make sure that all the docs are inserted
in alpha order, and then sort them by index order just as fast as or faster
than the standard relevance sort.

This:
Hits hits = searcher.search(query, Sort.INDEXORDER);

is much faster than:
Hits hits = searcher.search(query, new Sort("fullname"));

> The short answer is that you can't insert a document into a Lucene index
> and have any control whatsoever over the assigned document ID. The
> assigned document ID is always greater than the maximum document ID already
> in your index.

I know that there is no direct way to insert a doc at a specified position
with a single IndexWriter method, but it seems that there should be a better
way than reindexing everything.

Re: Inserting a document into an index at a specified position

Erick Erickson
Did you use a Hits object to assemble your results? And is that what you're
measuring when you say it's slow? In other words, were you measuring the
time it took to execute the statement

Hits hits = searcher.search(query, new Sort("fullname"));

or the time it took to iterate over the Hits object and do something? If the
latter, your problem may really be that the Hits object re-issues the search
every 100 retrievals or so (this has been discussed in the mail archive), and
you might get satisfactory performance by using a lower-level interface such
as HitCollector(?) or TopDocs(?).
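
As a rough picture of what a "give me only the top n" interface buys you, here
is a plain-Java sketch that keeps just the n alphabetically smallest names
while scanning the matches once. It is not the Lucene API and makes no claim
about how Lucene implements this; TopNSketch and topN are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

public class TopNSketch {
    /** Returns the n alphabetically smallest names, in ascending order. */
    public static List<String> topN(Iterable<String> matches, int n) {
        List<String> result = new ArrayList<>();
        if (n <= 0) {
            return result;
        }
        // Max-heap: the head is the largest name kept so far, i.e. the next one to evict.
        PriorityQueue<String> heap =
                new PriorityQueue<String>(n, Collections.<String>reverseOrder());
        for (String name : matches) {
            if (heap.size() < n) {
                heap.add(name);
            } else if (name.compareTo(heap.peek()) < 0) {
                heap.poll();   // drop the current largest
                heap.add(name);
            }
        }
        result.addAll(heap);
        Collections.sort(result); // the heap's iteration order is not sorted
        return result;
    }

    public static void main(String[] args) {
        List<String> matches = Arrays.asList("delta", "alpha", "charlie", "bravo");
        System.out.println(topN(matches, 2));
    }
}
```

Memory stays proportional to n rather than to the number of matches, which
matters when only the first page of a large result set is displayed.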

Otherwise, I haven't a clue, but you probably already realized that...

Best
Erick

Re: Inserting a document into an index at a specified position

Jason Calabrese
We only display 10 hits at a time, so we don't need to iterate through all
the hits.

It feels like there should be a way to pull a document out of one index and
put it into another, bringing all the unstored fields along with it.

On Friday 07 July 2006 12:52, Erick Erickson wrote:

>  Did you use a Hits object to assemble your results? And is that what
> you're measuring when you say it's slow? In other words, were you measuring
> the time it took to execute the statement
>
> Hits hits = searcher.search(query, new Sort("fullname"));
>
> or the time it took to iterate over the Hits object and do something? If
> the latter, your problem may really be the fact that the Hits object
> re-issues the search every 100 retrievals or so (this has been discussed in
> the mail archive...) and you'd get satisfactory performance by using a
> lower-level interface HitCollector(?) TopDocs(?).
>
> Otherwise, I haven't a clue, but you probably already realized that...
>
> Best
> Erick
