deletions from index

deletions from index

Michael Coffey
With my new news crawl, I would like to keep web pages in the index even after they have disappeared from the web, so I can continue using them in machine-learning processes. I thought I could achieve this by not running cleaning jobs. However, I still see increasing numbers of deletions in my Solr index.
When and why does Nutch tell the indexer to delete documents, other than during a cleaning job?
For example, Solr recently told me that numDocs is about 189,000 and deletedDocs is about 96,000. Even if I assume that some of the "deleted" docs have just been replaced by newer content, I am not ready to believe that has happened to so many of them.
Should I use a different indexer, or different settings, or something other than an indexer for this purpose?
RE: deletions from index

Markus Jelsma-2
You can check the Hadoop job's counters to see how many documents are being deleted. If any are, then -deleteGone is on in your case. Documents are deleted only with that setting.
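
The counter check can be sketched in Python. This is only an illustration: it assumes the counter values have already been copied by hand from the indexing job's output into a dictionary, and that deletion counters have names beginning with "deleted". The name "deleted (gone)" is an assumption used for the example; "indexed (add/update)" is the counter quoted later in this thread.

```python
def deletion_counters(counters):
    """Return the non-zero counters whose names start with 'deleted'."""
    return {
        name: count
        for name, count in counters.items()
        if name.lower().startswith("deleted") and count > 0
    }


# Counter values copied by hand from the indexing job's report.
# "deleted (gone)" is an assumed counter name, used here for illustration only.
counters = {
    "indexed (add/update)": 13423,
    "deleted (gone)": 0,
}

found = deletion_counters(counters)
if found:
    print("The job sent deletions to the index:", found)
else:
    print("The job sent no deletions to the index.")
```

If the result is empty, the job itself is not asking Solr to delete anything, and the deletedDocs figure has to come from somewhere else.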

 
 
-----Original message-----

> From: Michael Coffey <[hidden email]>
> Sent: Monday 2nd October 2017 21:51
> To: User <[hidden email]>
> Subject: deletions from index
Re: deletions from index

Michael Coffey
So, I had these numbers in my index:
Num Docs: 189550
Max Docs: 285531
Deleted Docs: 95981

Then I did a crawl and index, which told me:
indexed (add/update): 13,423

And now I have these numbers in my index:
Num Docs: 190785
Max Docs: 223339
Deleted Docs: 32554

So, I am completely confused. I don't use "-deleteGone" but I get massive numbers of deletions.

Is it your theory that Solr's report of deleted docs really just means that docs were replaced by newer content?


      From: Markus Jelsma <[hidden email]>
 To: "[hidden email]" <[hidden email]>; User <[hidden email]>
 Sent: Monday, October 2, 2017 1:19 PM
 Subject: RE: deletions from index
   
You can check the Hadoop job's counters to see how many are being deleted. If some are, then -deleteGone is on in your case. Only with that setting documents are going to be deleted.

 
 

RE: deletions from index

Markus Jelsma-2
If you don't delete documents, the numDocs/maxDoc difference is just updated documents, of which the older versions are eligible for deletion.
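
The figures quoted in this thread are consistent with that explanation. A minimal sketch in Python, using only the numbers reported in the thread, checks that deletedDocs equals maxDoc minus numDocs in both snapshots and estimates how many stale entries were purged between them. It assumes that no real deletions were sent to the index and that the drop in deletedDocs comes from Lucene segment merges, which discard documents already marked as deleted. The variable names are illustrative.

```python
def deleted_docs(num_docs, max_doc):
    """Documents marked as deleted that still occupy slots in the index."""
    return max_doc - num_docs


# Index statistics reported before and after the crawl-and-index run.
before = {"numDocs": 189550, "maxDoc": 285531, "deletedDocs": 95981}
after = {"numDocs": 190785, "maxDoc": 223339, "deletedDocs": 32554}
indexed = 13423  # "indexed (add/update)" reported by the indexing job

# In both snapshots, deletedDocs is exactly maxDoc - numDocs.
for snapshot in (before, after):
    assert deleted_docs(snapshot["numDocs"], snapshot["maxDoc"]) == snapshot["deletedDocs"]

# Net growth in live documents: pages that were new to the index.
new_docs = after["numDocs"] - before["numDocs"]

# Assumption: every other indexed page replaced an existing document,
# which marks the older version as deleted.
updates = indexed - new_docs

# Stale entries that must have been purged (by segment merges, under the
# assumption above) for deletedDocs to end at the reported value.
purged = before["deletedDocs"] + updates - after["deletedDocs"]

print("new documents:", new_docs)
print("updated documents:", updates)
print("stale entries purged:", purged)
```

Under these assumptions, roughly 12,000 of the 13,423 indexed pages were newer versions of pages already in the index, and the falling deletedDocs figure reflects merge housekeeping rather than pages being lost.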

 
 
-----Original message-----

> From: Michael Coffey <[hidden email]>
> Sent: Monday 2nd October 2017 23:29
> To: [hidden email]
> Subject: Re: deletions from index