finding broken links with nutch 1.14


finding broken links with nutch 1.14

Robert Scavilla
Hi again, and thank you in advance for your kind help.

I'm using Nutch 1.14

I'm trying to use Nutch to find broken links (404s) on a site. I
followed the instructions:

bin/nutch readdb <crawlFolder>/crawldb/ -dump myDump

but the dump only shows 200 and 301 statuses; there is no sign of any
broken link. When I enter just one broken link in the seed file, the
crawldb is empty.

Please advise how I can inspect broken links with Nutch 1.14.

Thank you!
...bob

Re: finding broken links with nutch 1.14

Robert Scavilla
Nutch 1.14:
I am looking at the FetcherThread code. The 404 URL does get flagged
with ProtocolStatus.NOTFOUND, but the broken link never gets into the
crawldb. It does, however, get into the linkdb. Please tell me how I
can collect these 404 URLs.

Any help would be appreciated,
...bob

            case ProtocolStatus.NOTFOUND:
            case ProtocolStatus.GONE: // gone
            case ProtocolStatus.ACCESS_DENIED:
            case ProtocolStatus.ROBOTS_DENIED:
              output(fit.url, fit.datum, null, status,
                  CrawlDatum.STATUS_FETCH_GONE); // broken link is getting here
              break;

On Fri, Feb 28, 2020 at 12:06 PM Robert Scavilla <[hidden email]>
wrote:


Re: finding broken links with nutch 1.14

Sebastian Nagel-2
Hi Robert,

404s are recorded in the CrawlDb only after the "updatedb" tool has
been called. Could you share the commands you're running? Please also
have a look into the log files (esp. hadoop.log): all fetches are
logged there, including whether they failed. If you cannot find a log
message for the broken links, the URLs might be filtered out. In this
case, please also share the configuration (if it differs from the
default).
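
For reference, a single crawl pass looks roughly like this (a sketch;
the crawl/ and urls/ folder names and the segment-selection step are
placeholders, not taken from this thread):

```shell
# Sketch of one crawl pass; "crawl/" and "urls/" are placeholder paths.
bin/nutch inject crawl/crawldb urls
bin/nutch generate crawl/crawldb crawl/segments
# pick the segment that was just generated (newest directory)
SEGMENT=$(ls -d crawl/segments/* | tail -1)
bin/nutch fetch "$SEGMENT"
bin/nutch parse "$SEGMENT"
# 404s reach the CrawlDb only at this step
bin/nutch updatedb crawl/crawldb "$SEGMENT"
bin/nutch readdb crawl/crawldb -dump myDump
```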

Best,
Sebastian

On 3/2/20 11:11 PM, Robert Scavilla wrote:



Re: finding broken links with nutch 1.14

Robert Scavilla
Sebastian, I'm so sorry to have bothered you. Following your email, I
found a setting that was purging the 404 pages. It was set to true;
once set to false, all worked well!

Thank you,
...bob

        <property>
          <name>db.update.purge.404</name>
          <value>false</value>
          <description>If true, updatedb will add purge records with
          status DB_GONE from the CrawlDB.</description>
        </property>
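
To list the recovered 404s afterwards, the CrawlDb dump can be
filtered for the db_gone status, e.g. (the part-r-00000 file name is
the default Hadoop output name, an assumption on my part):

```shell
# Dump the CrawlDb, then keep only entries marked gone (404 etc.),
# printing the URL line just above each status line.
bin/nutch readdb crawl/crawldb -dump myDump
grep -B1 "db_gone" myDump/part-r-00000
```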

On Tue, Mar 3, 2020 at 3:57 AM Sebastian Nagel
<[hidden email]> wrote:
