numDocs and maxDoc

numDocs and maxDoc

Vinci
Hi,

I am trying to update the index by posting in 2 stages: part of each entry will be posted in stage 1 via 1.xml, and after a while the rest of the entry will be posted via 2.xml. Assume both 1.xml and 2.xml contain 3 documents and id is used as the unique field. What I see in the admin panel confuses me:
numDocs : 3
maxDoc : 6
Which number is the count of documents that exist in the system? Is maxDoc just a statistic, not involved in any calculation?
If maxDoc is the true number of documents in the system, is the optimization tool the only way to compact the index?

Thank you,
Vinci

Re: numDocs and maxDoc

Mike Klaas

On 2-Apr-08, at 11:29 AM, Vinci wrote:

>
> Hi,
>
> I am trying to update the index by posting in 2 stages: part of each
> entry will be posted in stage 1 via 1.xml, and after a while the rest
> of the entry will be posted via 2.xml. Assume both 1.xml and 2.xml
> contain 3 documents and id is used as the unique field. What I see in
> the admin panel confuses me:
> numDocs : 3
> maxDoc : 6
> Which number is the count of documents that exist in the system? Is
> maxDoc just a statistic, not involved in any calculation?
> If maxDoc is the true number of documents in the system, is the
> optimization tool the only way to compact the index?

When you add a document that has the same unique id as a document
currently in the index, the previous document is marked as "deleted"
and the new one is added.  This results in 6 documents physically on
disk (BUT when searching you will never see the deleted docs).

Deleted documents are purged during segment merging, which will occur  
for the whole index during optimization and will happen naturally as  
you add more documents to the system without optimization.  Normally  
it isn't something to worry about.
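
The behaviour described above can be sketched with a toy model. This is not Solr or Lucene code: the `ToyIndex` class and its method names are made up for illustration, and a real index purges deleted documents segment by segment rather than in one pass.

```python
class ToyIndex:
    """Toy model of an index with a unique key (hypothetical, not Solr code)."""

    def __init__(self):
        self.slots = []  # every document written and not yet purged
        self.live = {}   # unique id -> position of the live version in slots

    def add(self, doc):
        """Add a document; an older version with the same id is only marked deleted."""
        old = self.live.get(doc["id"])
        if old is not None:
            self.slots[old]["deleted"] = True  # marked, not physically removed
        self.slots.append({"doc": doc, "deleted": False})
        self.live[doc["id"]] = len(self.slots) - 1

    def num_docs(self):
        """Number of live documents (what a search can see)."""
        return len(self.live)

    def max_doc(self):
        """Number of slots in use, including deleted documents."""
        return len(self.slots)

    def merge(self):
        """Rewrite the slots without deleted entries, like merging the whole index."""
        kept = [slot for slot in self.slots if not slot["deleted"]]
        self.slots = kept
        self.live = {slot["doc"]["id"]: pos for pos, slot in enumerate(kept)}


index = ToyIndex()
for doc_id in ("1", "2", "3"):          # stage 1: 1.xml
    index.add({"id": doc_id, "title": "stage 1"})
for doc_id in ("1", "2", "3"):          # stage 2: 2.xml, same ids
    index.add({"id": doc_id, "body": "stage 2"})

print(index.num_docs(), index.max_doc())  # 3 6
index.merge()
print(index.num_docs(), index.max_doc())  # 3 3
```

Note that after stage 2 the live documents in this model contain only the stage 2 fields, because each new document replaces the old one as a whole.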

-Mike

Re: numDocs and maxDoc

hossman
In reply to this post by Vinci

: I am trying to update the index by posting in 2 stages: part of each entry will
: be posted in stage 1 via 1.xml, and after a while the rest of the entry will
: be posted via 2.xml. Assume both 1.xml and 2.xml contain 3 documents and id is
: used as the unique field. What I see in the admin panel makes

my gut tells me that what you mean by this is that you want to index
fields A and B for documents 1, 2, and 3; and then later you want to
provide values for additional fields C and D for the same documents (1, 2
and 3)

"updating" documents is not currently supported in Solr.  there has
been lots of discussion about it in the past, and some patches exist in
Jira that approach the problem, but it's a lot harder than it seems like
it should be because of the way Lucene works - essentially Solr under the
covers does the exact same thing you currently have to do: keep a record
of all the fields for all the documents, and reindex the *whole* document
once you have them.
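
A minimal sketch of that client-side bookkeeping, assuming each partial document is a plain dict keyed by an "id" field. The `merge_stages` helper and the field names A to D are hypothetical and only mirror the example above; nothing here talks to Solr.

```python
def merge_stages(*stages):
    """Combine partial documents that share an "id" into whole documents.

    Later stages win if the same field appears more than once.
    """
    merged = {}
    for stage in stages:
        for partial in stage:
            merged.setdefault(partial["id"], {}).update(partial)
    return list(merged.values())


# Stage 1 carries fields A and B, stage 2 carries fields C and D.
stage1 = [{"id": "1", "A": "a1", "B": "b1"}, {"id": "2", "A": "a2", "B": "b2"}]
stage2 = [{"id": "1", "C": "c1", "D": "d1"}, {"id": "2", "C": "c2", "D": "d2"}]

whole_docs = merge_stages(stage1, stage2)
for doc in whole_docs:
    print(doc)
```

Each resulting dict holds every field for its id, so it can be posted as one complete document.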

: me confused:
: numDocs : 3
: maxDoc : 6

numDocs is the number of unique "live" Documents in the index.  it's how
many docs you would get back from a query for *:*.  maxDoc is one more than
the largest internal document id currently in use.  the difference between
those numbers tells you how many "deleted" (or replaced) documents are
currently still in the index ... they gradually get cleaned up as segments
get merged or when the index gets optimized.
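
As a worked example of that arithmetic, using the numbers from the admin panel above (the `deleted_docs` helper is hypothetical, not part of any Solr API):

```python
def deleted_docs(num_docs, max_doc):
    """Deleted-but-not-yet-purged documents implied by the two admin stats."""
    if not 0 <= num_docs <= max_doc:
        raise ValueError("expected 0 <= numDocs <= maxDoc")
    return max_doc - num_docs


deleted = deleted_docs(num_docs=3, max_doc=6)
print(deleted)      # 3 replaced documents still in the index
print(deleted / 6)  # 0.5, i.e. half of the slots hold deleted documents
```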



-Hoss


Re: numDocs and maxDoc

Vinci
Hi,

Thanks hossman, this is exactly what I want to do.
Final question: so I need to merge the fields myself first? (Actually, my original plan was to do 2 consecutive postings, so merging is possible.)

Thank you,
Vinci


Re: numDocs and maxDoc

hossman
: Thanks hossman, this is exactly what I want to do.
: Final question: so I need to merge the fields myself first? (Actually, my
: original plan was to do 2 consecutive postings, so merging is possible.)

you need to send Solr whole documents with all the fields in them.  if you
send another "doc" with the same value for the uniqueKey field, it will
replace the previous doc.
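
A rough sketch of building such a whole-document message with Python's standard library, assuming the usual Solr XML update layout of `<add>`, `<doc>` and `<field name="...">` elements. The `build_add_xml` helper and the field names other than id are made up for illustration, and how the result is posted to Solr is left out.

```python
import xml.etree.ElementTree as ET


def build_add_xml(docs):
    """Serialize whole documents (dicts of field name -> value) as one <add> message."""
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            field = ET.SubElement(doc_el, "field", name=name)
            field.text = str(value)
    return ET.tostring(add, encoding="unicode")


# One complete document: every field is present, not just the new ones.
message = build_add_xml([{"id": "1", "A": "a1", "C": "c1"}])
print(message)
```

Posting this message for an id that already exists replaces the earlier document, so any field left out of the dict is lost.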




-Hoss