Nutch Fetch - HttpException : Connect Exception : Invalid Argument

Nutch Fetch - HttpException : Connect Exception : Invalid Argument

Jon Shoberg
When following the whole-web crawling strategy outlined in the tutorial,
I get the error below.  Roughly 50% of the fetch output consists of this
error.  Has anyone else seen this?  A few thousand URLs were loaded via
nutch inject.  I could understand a few errors, but when I check the
failing URLs by hand, they respond fine.

I checked the URL file list and there are no extraneous characters.

Error: (example.com is not the real URL)

  050719 221355 fetch of http://example.com/ failed with:
net.nutch.protocol.http.HttpException: java.net.ConnectException:
Invalid argument

The Script:

#!/bin/bash
rm -rf db
rm -rf segments
mkdir db
mkdir segments
bin/nutch admin db -create
bin/nutch inject db -urlfile urls
bin/nutch generate db segments
s=$(ls -d segments/2* | tail -1)   # most recently generated segment
echo "Segment is $s"
bin/nutch fetch "$s"   # <-- the errors occur during this step
bin/nutch updatedb db $s
bin/nutch analyze db 5
bin/nutch index $s






Log Error Stack - Re: Nutch Fetch - HttpException : Connect Exception : Invalid Argument

Jon Shoberg

From: src/java/net/nutch/fetcher/Fetcher.java

   Any suggestions on where to look for the stack trace logged by the
method below?  I have to be missing something small here (perhaps
lack of coffee).  "LOG.info" by default writes to stdout.  Where
does/can "LOG.log" write to?

   private void logError(String url, FetchListEntry fle, Throwable t) {
     LOG.info("fetch of " + url + " failed with: " + t);
     LOG.log(Level.FINE, "stack", t);            // stack trace
     synchronized (Fetcher.this) {               // record failure
       errors++;
     }
   }




Re: [Nutch-dev] Log Error Stack - Re: Nutch Fetch - HttpException : Connect Exception : Invalid Argument

praveen pathiyil
Hi,

Nutch logs at the INFO level by default. If you want more verbose
output, you will have to set 'fetcher.verbose' to true in
nutch-default.xml (or, better, override this value in nutch-site.xml).
If you do this, messages at the 'FINE' level should also be logged
to stdout.

[You might also want to look at 'http.verbose']
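
To illustrate why the stack trace does not show up by default, here is a
minimal standalone sketch using java.util.logging, which the Level.FINE
call in logError suggests is the logging API in use. It is not Nutch code:
the class name FineLevelDemo, the CollectingHandler helper and the logger
name are made up for the demo, and it does not show how 'fetcher.verbose'
is applied inside Nutch. It only shows that a FINE record is dropped while
the logger sits at INFO and gets through once the logger and its handler
are lowered to FINE.

```java
import java.net.ConnectException;
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

public class FineLevelDemo {

    /** Hypothetical helper: keeps every record that passes the level check. */
    static class CollectingHandler extends Handler {
        final List<String> messages = new ArrayList<>();

        @Override
        public void publish(LogRecord record) {
            if (isLoggable(record)) {   // honours this handler's own level
                messages.add(record.getLevel() + ": " + record.getMessage());
            }
        }

        @Override public void flush() {}
        @Override public void close() {}
    }

    /**
     * Emits one INFO and one FINE record, mirroring the two calls in
     * logError, and returns the messages that were actually published.
     */
    public static List<String> logAt(Level level) {
        // Logger name is an assumption for the demo, not a Nutch logger.
        Logger log = Logger.getLogger("demo.fetcher." + level);
        log.setUseParentHandlers(false);  // bypass the root console handler
        CollectingHandler handler = new CollectingHandler();
        handler.setLevel(level);
        log.addHandler(handler);
        log.setLevel(level);

        Throwable t = new ConnectException("Invalid argument");
        log.info("fetch of http://example.com/ failed with: " + t);
        log.log(Level.FINE, "stack", t);  // dropped unless level <= FINE

        log.removeHandler(handler);
        return handler.messages;
    }

    public static void main(String[] args) {
        System.out.println("at INFO: " + logAt(Level.INFO));
        System.out.println("at FINE: " + logAt(Level.FINE));
    }
}
```

With the level at INFO only the "fetch of ... failed with" line is
collected; at FINE the "stack" record is collected as well. In plain
java.util.logging both the logger and the handler have to allow FINE,
which is a common reason a FINE message stays invisible.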

HTH,
Praveen.

On 7/20/05, Jon Shoberg <[hidden email]> wrote:

>
> From: src/java/net/nutch/fetcher/Fetcher.java
>
>   Any suggestions on where to look for logging of this stack, related
> to the message below.  I have to missing something small here (perhaps
> lack of coffee).  "LOG.info" by default displays to stdout.  Where
> does/can "LOG.log" write to?
>
>   private void logError(String url, FetchListEntry fle, Throwable t) {
>       LOG.info("fetch of " + url + " failed with: " + t);
>       LOG.log(Level.FINE, "stack", t);            // stack trace
>       synchronized (Fetcher.this) {               // record failure
>         errors++;
>       }
>     }