301 perm redirect pages are still in Solr

hany.nasr-2
Hi All,

I'm using Nutch 1.15 and found that permanent redirect pages (301) are still indexed in Solr and not removed.

When I exported the crawldb, the page had Status: 5 (db_redir_perm).
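
For reference, a crawldb text export can be scanned for such records with a short script. The Python sketch below is illustrative only. It assumes the export is a text dump (for example one written by bin/nutch readdb with -dump) in which each record starts with the URL, a tab and a "Version:" field, followed by a "Status: N (name)" line. The function name and the sample data are hypothetical.

```python
import re

# Assumed record layout of the text dump: a line with the URL, a tab and
# "Version: N", followed by a "Status: N (name)" line.
URL_LINE = re.compile(r"^(\S+)\tVersion: \d+")
STATUS_LINE = re.compile(r"^Status: (\d+) \((\w+)\)")


def urls_with_status(dump_lines, wanted="db_redir_perm"):
    """Return the URLs whose record carries the wanted status name."""
    matches = []
    current_url = None
    for line in dump_lines:
        url_match = URL_LINE.match(line)
        if url_match:
            # A new record starts here; remember its URL.
            current_url = url_match.group(1)
            continue
        status_match = STATUS_LINE.match(line)
        if status_match and current_url and status_match.group(2) == wanted:
            matches.append(current_url)
    return matches


# Hypothetical two-record sample, not taken from a real crawl.
SAMPLE = """\
http://example.com/old\tVersion: 7
Status: 5 (db_redir_perm)

http://example.com/new\tVersion: 7
Status: 2 (db_fetched)
""".splitlines()

print(urls_with_status(SAMPLE))
```

The resulting list can then be compared with what Solr still returns for those URLs.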

How can I keep the Solr index up to date and make Nutch clean these pages automatically?

Regards,
Hany

-----------------------------------------
SAVE PAPER - THINK BEFORE YOU PRINT!

This E-mail is confidential.

It may also be legally privileged. If you are not the addressee you may not copy,
forward, disclose or use any part of it. If you have received this message in error,
please delete it and all copies from your system and notify the sender immediately by
return E-mail.

Internet communications cannot be guaranteed to be timely secure, error or virus-free.
The sender does not accept liability for any errors or omissions.

Re: 301 perm redirect pages are still in Solr

Markus Jelsma-2
Hello Hany,

You need to tell the indexer to delete those records. This property will help:

  <!-- delete gone and redirects -->
  <property>
    <name>indexer.delete</name>
    <value>true</value>
  </property>

Regards,
Markus

On Mon, 8 Mar 2021 at 15:31, Hany NASR <[hidden email]> wrote:

> Hi All,
>
> I'm using Nutch 1.15 and found that permanent redirect pages (301)
> are still indexed in Solr and not removed.
>
> When I exported the crawldb, the page had Status: 5 (db_redir_perm).
>
> How can I keep the Solr index up to date and make Nutch clean these pages
> automatically?
>
> Regards,
> Hany

RE: EXTERNAL: Re: 301 perm redirect pages are still in Solr

hany.nasr-2
Hello Markus,

I added the property in nutch-site.xml with no luck.

The documents still exist in Solr; any advice?

Regards,
Hany

From: Markus Jelsma <[hidden email]>
Sent: Monday, March 8, 2021 3:40 PM
To: [hidden email]
Subject: EXTERNAL: Re: 301 perm redirect pages are still in Solr

Hello Hany,

You need to tell the indexer to delete those records. This property will help:

  <!-- delete gone and redirects -->
  <property>
    <name>indexer.delete</name>
    <value>true</value>
  </property>

Regards,
Markus

******************************************************************
This message originated from the Internet.  Its originator may or
may not be who they claim to be and the information contained in
the message and any attachments may or may not be accurate.
******************************************************************


Re: EXTERNAL: Re: 301 perm redirect pages are still in Solr

Markus Jelsma-2
Hello Hany,

Sure, check these commands:

 solrclean         remove HTTP 301 and 404 documents from solr - DEPRECATED use the clean command instead
 clean             remove HTTP 301 and 404 documents and duplicates from indexing backends configured via plugins

Regards,
Markus

On Tue, 9 Mar 2021 at 08:49, Hany NASR <[hidden email]> wrote:

> Hello Markus,
>
> I added the property in nutch-site.xml with no luck.
>
> The documents still exist in Solr; any advice?
>
> Regards,
> Hany

RE: EXTERNAL: Re: Re: 301 perm redirect pages are still in Solr

hany.nasr-2
Hello Markus.

Before running the commands I dumped the crawldb and confirmed again that the document status is 5 (db_redir_perm). I then ran both commands, with the same result: the 301 documents still exist in Solr.


1. sudo bin/nutch clean crawl/crawldb/
2. sudo bin/nutch solrclean crawl/crawldb/


No exchange was configured. The documents will be routed to all index writers.
SolrIndexer: deleting 1000/1000 documents
SolrIndexer: deleting 1000/2000 documents
SolrIndexer: deleting 1000/3000 documents
SolrIndexer: deleting 1000/4000 documents
SolrIndexer: deleting 270/4270 documents
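
If the redirected documents remain after a clean run like the one logged above, one manual fallback is to delete them from Solr by id. The Python sketch below only builds the request and does not send it. It assumes Solr's JSON update endpoint and that the Solr document id is the page URL (worth checking against your own schema). The core URL and the function name are hypothetical.

```python
import json
import urllib.request


def build_delete_request(solr_core_url, doc_ids):
    """Build (but do not send) a Solr JSON delete-by-id request."""
    # Solr's JSON update format accepts a list of ids under "delete".
    payload = json.dumps({"delete": list(doc_ids)}).encode("utf-8")
    return urllib.request.Request(
        url=solr_core_url.rstrip("/") + "/update?commit=true",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_delete_request(
    "http://localhost:8983/solr/nutch",  # hypothetical core URL
    ["http://example.com/old"],          # ids assumed to be page URLs
)
print(request.full_url)
print(request.data.decode("utf-8"))
# To actually send it: urllib.request.urlopen(request)
```

Building the request separately from sending it makes it easy to review the list of ids before anything is removed from the index.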

Did I miss anything here?

Regards,
Hany

From: Markus Jelsma <[hidden email]>
Sent: Tuesday, March 9, 2021 11:19 AM
To: [hidden email]
Subject: EXTERNAL: Re: Re: 301 perm redirect pages are still in Solr

Hello Hany,

Sure, check these commands:

 solrclean         remove HTTP 301 and 404 documents from solr - DEPRECATED use the clean command instead
 clean             remove HTTP 301 and 404 documents and duplicates from indexing backends configured via plugins

Regards,
Markus
