Missing pages.

Lyndon Maydwell
Am I right in assuming that broken pages (404) are removed once a page
is re-crawled and found missing?

Re: Missing pages.

Andrzej Białecki-2
Lyndon Maydwell wrote:
> Am I right in assuming that broken pages (404) are removed once a page
> is re-crawled and found missing?


Pages are never removed from the crawldb, unless you change URLFilters
to remove them. Missing pages (404) are marked as GONE. Such pages may
be linked to from several sites, and Nutch needs to know that we
already discovered the page and what its fetch status is. If we simply
removed them from the db, they would be discovered again, only this time
we wouldn't know what their status was and we would have to try fetching
them again.
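
To illustrate the reasoning above, here is a minimal sketch of a toy
crawl database in Python. It is not Nutch code: the status names and the
functions discover, record_fetch and fetch_list are hypothetical, and
real fetch scheduling is more involved. It only shows why keeping a GONE
entry stops a rediscovered link from being fetched again.

```python
# Toy model of a crawl database; all names are illustrative, not Nutch's API.
UNFETCHED, FETCHED, GONE = "unfetched", "fetched", "gone"

def discover(crawldb, url):
    """Record a newly seen link; keep the existing status if already known."""
    crawldb.setdefault(url, UNFETCHED)

def record_fetch(crawldb, url, http_status):
    """Store the outcome of a fetch; a 404 marks the entry GONE, not deleted."""
    crawldb[url] = GONE if http_status == 404 else FETCHED

def fetch_list(crawldb):
    """In this toy model only never-fetched entries are due for fetching."""
    return sorted(u for u, s in crawldb.items() if s == UNFETCHED)

crawldb = {}
discover(crawldb, "http://example.com/old")
record_fetch(crawldb, "http://example.com/old", 404)
discover(crawldb, "http://example.com/old")  # another site links to it again
print(fetch_list(crawldb))  # the GONE entry is not scheduled again
```

Had the entry been deleted after the 404, the second discover call would
have added it back as unfetched and it would be queued once more.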


--
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Missing pages.

Lyndon Maydwell
Ah, yes, of course. I was a bit hasty with my question.

I was really referring to the results returned from the Nutch web-application.

I'm also getting a lot of requests to change some of the configuration
options relating to addresses Nutch considers equivalent. Is it
possible to alter the configuration files in the web-application and
have these changes reflected in the results returned? Or are these
options only used on crawling/indexing etc? If so, can I regenerate
the database somehow to have new configuration options recognized?

Thanks guys.

Re: Missing pages.

Andrzej Białecki-2
Lyndon Maydwell wrote:
> Ah, yes, of course. I was a bit hasty with my question.
>
> I was really referring to the results returned from the Nutch web-application.
>
> I'm also getting a lot of requests to change some of the configuration
> options relating to addresses Nutch considers equivalent. Is it

Such as?

> possible to alter the configuration files in the web-application and
> have these changes reflected in the results returned? Or are these
> options only used on crawling/indexing etc? If so, can I regenerate
> the database somehow to have new configuration options recognized?

There are two subsystems in Nutch that handle this: one is URLFilters
(which basically say yes/no to urls, so that you can remove unwanted
urls) and the other is URLNormalizers, which bring urls to their
"canonical" format, whatever that means in your case. The default url
normalizer in Nutch simply resolves relative paths and removes some
session id junk (see conf/regex-normalize.xml).

Once you tweak these two to match your expectations, you can regenerate
crawldb by updating it once again from already fetched segments.
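
As a rough sketch of how the two subsystems differ, the Python below
normalizes each url first and then asks a filter for a yes/no decision.
The rules and the names normalize, accept and rebuild are assumptions
made up for this example; they are not Nutch's classes or its shipped
configuration.

```python
import re

# Hypothetical normalizer rule: drop a ";jsessionid=..." path parameter.
SESSION_ID = re.compile(r";jsessionid=[^?#]*", re.IGNORECASE)

def normalize(url):
    """Normalizer: rewrite a url into its canonical form."""
    return SESSION_ID.sub("", url)

def accept(url):
    """Filter: only answer yes or no; it never rewrites the url."""
    return url.startswith("http://") and not url.endswith(".gif")

def rebuild(urls):
    """Re-apply both steps to already known urls, dropping duplicates."""
    return sorted({n for n in map(normalize, urls) if accept(n)})

known = [
    "http://example.com/a;jsessionid=ABC123",
    "http://example.com/a",
    "http://example.com/logo.gif",
]
print(rebuild(known))
```

The two variants of /a collapse into one entry because the normalizer
rewrites them to the same string, while the .gif url is simply rejected.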




Re: Missing pages.

Lyndon Maydwell
For example, on the sites that I'm crawling, all addresses starting
with www.x are simply redirects to x.

Re: Missing pages.

Andrzej Białecki-2
Lyndon Maydwell wrote:
> For example, on the sites that I'm crawling, all addresses starting
> with www.x are simply redirects to x.

If that's really the case (you know, it doesn't always work this way for
all sites) then adjust the conf/regex-normalize.xml config file to remove
the www. prefix - see how it's done with the session ids.
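
For illustration only, the kind of pattern/substitution rule meant here
can be tried out in Python before putting it into the config. The
pattern and the function name strip_www are assumptions for this sketch,
and the config file itself uses Java regex syntax, so a group reference
is written differently there.

```python
import re

# Hypothetical rule: remove a leading "www." from the host, keep the scheme.
WWW_PREFIX = re.compile(r"^(https?://)www\.", re.IGNORECASE)

def strip_www(url):
    """Rewrite http(s)://www.host/... to http(s)://host/..."""
    return WWW_PREFIX.sub(r"\1", url)

for url in ("http://www.x.example/page", "http://x.example/www.html"):
    print(url, "->", strip_www(url))
```

Anchoring the pattern at the start of the url keeps it from touching a
"www." that appears later in the path.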




Re: Missing pages.

Lyndon Maydwell
> If that's really the case (you know, it doesn't always work this way for
> all sites)

Certainly, but for the network that I'm crawling it does.

I'd actually already used the regex-urlfilter to do just this, but
wasn't seeing any change in the results being returned. Should
regenerating the db from the segments fix the problem?

Thanks for your help.