Bug in Nutch, possibly due to issues-273 and 322

Meghna Kukreja
Hi,

I checked out the latest Nutch trunk to get the fixes for NUTCH-273
(http://issues.apache.org/jira/browse/NUTCH-273) and NUTCH-322
(http://issues.apache.org/jira/browse/NUTCH-322). It looks like, when
http.redirect.max is 0, the redirected url is inserted with the same
status as the original url, CrawlDatum.STATUS_FETCH_REDIR_PERM
(Fetcher.java line 201), which is then converted to
CrawlDatum.STATUS_DB_REDIR_PERM. This prevents the redirected url from
being selected by the generator in the next round (Generator.java
lines 140-142).
My initial seed url was an RSS feed where all the items were being
redirected. I tried a crawl with depth 3, but at the last depth the
Generator was not able to select any urls and printed the message "0
records selected for fetching, exiting ...". The crawl then ended with a
NullPointerException because the fetcher was trying to fetch an empty
segment.
I was wondering if anyone else has seen this behaviour and has been
able to find a way to fix it.

Thanks,
Meghna

Re: Bug in Nutch, possibly due to issues-273 and 322

Andrzej Białecki-2
Meghna Kukreja wrote:

> Hi,
>
> I checked out the latest Nutch trunk to get the fix for Issues 273
> (http://issues.apache.org/jira/browse/NUTCH-273) and 322
> (http://issues.apache.org/jira/browse/NUTCH-322) and it looks like if
> http.redirect.max is 0, then the redirected url is inserted with the
> same status as the original url of CrawlDatum.STATUS_FETCH_REDIR_PERM
> (Fetcher.java line 201) which is converted to
> CrawlDatum.STATUS_DB_REDIR_PERM. This prevents the redirected url from
> being selected by the generator in the next round (Generator.java
> lines 140-142).

Indeed, this looks wrong - the target of the redirect should be inserted
with another fetch status (we could re-use status code LINKED, which is
then turned into DB_UNFETCHED if a page didn't exist yet).
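
A minimal sketch of the idea above, not the actual Nutch code: the enum,
the toy crawl db and the methods update() and selectable() are all
hypothetical stand-ins, and the generator rule is simplified to "only
DB_UNFETCHED entries are selectable". It only illustrates why inserting
the redirect target as LINKED would let the generator pick it up, while
FETCH_REDIR_PERM would not.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class RedirectStatusSketch {
    // Hypothetical stand-ins for the CrawlDatum status codes named in this thread.
    enum Status { FETCH_REDIR_PERM, LINKED, DB_REDIR_PERM, DB_UNFETCHED }

    // Toy crawl db: url -> db status. Not the real CrawlDb.
    static final Map<String, Status> crawlDb = new LinkedHashMap<>();

    // Assumed update rule: fold a fetch-side status into the db status.
    static void update(String url, Status fetchStatus) {
        switch (fetchStatus) {
            case FETCH_REDIR_PERM:
                // The url that answered with a permanent redirect.
                crawlDb.put(url, Status.DB_REDIR_PERM);
                break;
            case LINKED:
                // Becomes DB_UNFETCHED only if the page is not in the db yet.
                crawlDb.putIfAbsent(url, Status.DB_UNFETCHED);
                break;
            default:
                throw new IllegalArgumentException("not a fetch-side status: " + fetchStatus);
        }
    }

    // Simplified generator rule (assumption): only unfetched entries are selected.
    static boolean selectable(String url) {
        return crawlDb.get(url) == Status.DB_UNFETCHED;
    }

    public static void main(String[] args) {
        update("http://example.com/feed-item", Status.FETCH_REDIR_PERM); // original url
        update("http://example.com/target", Status.LINKED);              // redirect target
        System.out.println(selectable("http://example.com/feed-item")); // prints false
        System.out.println(selectable("http://example.com/target"));    // prints true
    }
}
```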

> My initial seed url was an RSS feed where all the items were being
> redirected. I tried to do a crawl with depth 3 but at the last depth
> the Generator was not able to select any urls, printed the message "0
> records selected for fetching, exiting ..." and the crawl ended with a
> NullPointerException as the fetcher was trying to fetch an empty
> segment.

Yup, another place to fix so that it doesn't throw an NPE.
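
One way such a guard could look, as a sketch under stated assumptions
rather than the real Generator or Fetcher code: generate() and crawl()
are hypothetical, and an empty Optional stands in for "0 records
selected". The crawl loop stops cleanly instead of handing an empty
segment to the fetcher.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Optional;

public class EmptySegmentGuardSketch {
    // Hypothetical generate step: returns a segment name,
    // or an empty Optional when no urls were selected.
    static Optional<String> generate(List<String> selectedUrls, int depth) {
        if (selectedUrls.isEmpty()) {
            return Optional.empty();
        }
        return Optional.of("segment-" + depth);
    }

    // Hypothetical crawl loop: returns the names of the segments it would fetch.
    static List<String> crawl(List<List<String>> urlsPerDepth) {
        List<String> fetched = new ArrayList<>();
        for (int depth = 0; depth < urlsPerDepth.size(); depth++) {
            Optional<String> segment = generate(urlsPerDepth.get(depth), depth);
            if (!segment.isPresent()) {
                // Guard: nothing to fetch, so stop here rather than fetch an empty segment.
                System.out.println("Stopping at depth=" + depth + " - no more URLs to fetch.");
                break;
            }
            fetched.add(segment.get());
        }
        return fetched;
    }

    public static void main(String[] args) {
        List<List<String>> rounds = Arrays.asList(
            Arrays.asList("http://example.com/feed"),
            Arrays.asList("http://example.com/item1", "http://example.com/item2"),
            Collections.<String>emptyList()); // last round: generator selects nothing
        System.out.println(crawl(rounds)); // prints [segment-0, segment-1]
    }
}
```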

> I was wondering if anyone else has seen this behaviour and has been
> able to find a way to fix it.

Thanks for testing this. I'll apply these fixes soon.

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Bug in Nutch, possibly due to issues-273 and 322

Meghna Kukreja
Hi,

I am trying to use nutch-0.9, and it looks like the bug where an NPE is
thrown when the generator creates an empty segment is still present.
Has this bug been fixed in the trunk?

Thanks,
Meghna

On Wed, Jan 3, 2007 at 3:50 PM, Andrzej Bialecki <[hidden email]> wrote:

> Meghna Kukreja wrote:
>>
>> Hi,
>>
>> I checked out the latest Nutch trunk to get the fix for Issues 273
>> (http://issues.apache.org/jira/browse/NUTCH-273) and 322
>> (http://issues.apache.org/jira/browse/NUTCH-322) and it looks like if
>> http.redirect.max is 0, then the redirected url is inserted with the
>> same status as the original url of CrawlDatum.STATUS_FETCH_REDIR_PERM
>> (Fetcher.java line 201) which is converted to
>> CrawlDatum.STATUS_DB_REDIR_PERM. This prevents the redirected url from
>> being selected by the generator in the next round (Generator.java
>> lines 140-142).
>
> Indeed, this looks wrong - the target of the redirect should be inserted
> with another fetch status (we could re-use status code LINKED, which is then
> turned into DB_UNFETCHED if a page didn't exist yet).
>
>> My initial seed url was an RSS feed where all the items were being
>> redirected. I tried to do a crawl with depth 3 but at the last depth
>> the Generator was not able to select any urls, printed the message "0
>> records selected for fetching, exiting ..." and the crawl ended with a
>> NullPointerException as the fetcher was trying to fetch an empty
>> segment.
>
> Yup, another place to fix so that it doesn't throw an NPE.
>
>> I was wondering if anyone else has seen this behaviour and has been
>> able to find a way to fix it.
>
> Thanks for testing this, I'll apply these fixes soon.

Highlight terms in hit Title

searchfresco
Hi Nutchers!

I'm just wondering where to look to get search terms bolded in the hit title.

Where should I start, and are there any patches out there?

Thanks

John