Problem crawling/fetching using https

Michael Wechner
Hi

I am trying to fetch data from a website using https. I have added

<value>nutch-extensionpoints|protocol-file|protocol-http|protocol-https

to nutch-site.xml

but still receive the following error

fetch of https://www.foo.bar/ failed with:
org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url=https

Is there anything else one has to do?

I am using Nutch 0.8.x

Thanks

Michi

--
Michael Wechner
Wyona      -   Open Source Content Management   -    Apache Lenya
http://www.wyona.com                      http://lenya.apache.org
[hidden email]                        [hidden email]
+41 44 272 91 61

Re: Problem crawling/fetching using https

chrismattmann
Hi Michi,

 I am pretty sure that in order to support https, you need to enable the
protocol-httpclient plugin, which is based on commons-httpclient. There
isn't a protocol-https plugin as far as I know. Try that and see if that
fixes your issue.
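
For reference, here is a minimal sketch of what that change might look like in nutch-site.xml. The property name plugin.includes is the standard Nutch setting for the plugin list. Swapping protocol-http for protocol-httpclient (rather than listing both) is an assumption, and so is the rest of the value, so keep whatever parser, index, and filter plugins your installation already lists.

```xml
<!-- Sketch only: goes inside the root element of conf/nutch-site.xml. -->
<!-- The plugin list below is illustrative; only protocol-httpclient is
     the part suggested in this thread. There is no protocol-https plugin. -->
<property>
  <name>plugin.includes</name>
  <value>nutch-extensionpoints|protocol-file|protocol-httpclient</value>
</property>
```

After changing the value, re-run the fetch and check whether the ProtocolNotFound error for https URLs is gone.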

Thanks!

Cheers,
 Chris




Re: Problem crawling/fetching using https

Michael Wechner
Chris Mattmann wrote:

>Hi Michi,
>
> I am pretty sure that in order to support https, you need to enable the
>protocol-httpclient plugin, which is based on commons-httpclient. There
>isn't a protocol-https plugin as far as I know. Try that and see if that
>fixes your issue.
>  
>

That seems to work indeed. Thanks very much for the hint.

Btw, wouldn't it make sense to enable protocol-httpclient by default?
I guess I am not the only one trying to fetch pages using https.

Thanks again

Michi




Re: Problem crawling/fetching using https

chrismattmann
Hi Michi,

> Btw, wouldn't it make sense to add protocol-httpclient as default,
> because I guess I am not the only one trying to fetch pages using https?

Indeed. Some time ago, the powers that be decided that it probably made
sense to make protocol-httpclient the default. However, due to some
performance issues with the underlying commons-httpclient Apache library
(I think), it was decided to go with protocol-http, which turned out to be
much faster, more reliable, etc., at the expense of not natively supporting
HTTPS. I wonder what the user community thinks about this now, though. Have
the issues with protocol-httpclient gone away, such that it makes sense to
enable it again?


Cheers,
  Chris



Re: Problem crawling/fetching using https

Michael Wechner
Chris Mattmann wrote:

>Hi Michi,
>
>>Btw, wouldn't it make sense to add protocol-httpclient as default,
>>because I guess I am not the only one trying to fetch pages using https?
>
>Indeed. The issue with this was in fact that some time ago, the powers that
>be decided that it probably made sense to make protocol-httpclient the
>default. However, due to some performance issues with the underlying
>commons-httpclient Apache library (I think), it was decided to go with
>protocol-http, which turned out to be much faster/more reliable, etc, at the
>expense of not natively supporting HTTPS.

OK. So what about adding a comment to nutch-site.xml, e.g.:

<!-- NOTE: In order to use https please add protocol-httpclient, but be
aware of possible performance problems! -->

Cheers

Michi




Re: Problem crawling/fetching using https

Andrzej Białecki-2
Michael Wechner wrote:
> ok. So what about adding a comment to nutch-site.xml, e.g.
>
> <!-- NOTE: In order to use https please add protocol-httpclient, but
> be aware of possible performance problems! -->

They were not performance problems. There were some issues related to
using multiple threads, which would sometimes cause the httpclient
library to fail. There was also a logging message produced in the
internals of httpclient that was difficult to turn off, but now that we
are using log4j this should be straightforward. There was also a bug in
chunked encoding handling that would cause hangs.

There were other intermittent problems with this library as well, so after
much deliberation we decided to leave the simpler plugin as the default.

These issues may have been solved in a newer version of the httpclient library.

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Problem crawling/fetching using https

chrismattmann
Hi Guys,

 Yep, I couldn't remember exactly what the issues were. Thanks for digging
that up, Andrzej. So, yeah, anyways it may make sense to update
nutch-site.xml with the comment below, with "performance problems" replaced
with "intermittent problems with the underlying commons-httpclient library".

 If you guys agree, I'll add the comment to nutch-site...

Cheers,
  Chris



On 1/24/07 3:10 PM, "Andrzej Bialecki" <[hidden email]> wrote:

> Michael Wechner wrote:
>> ok. So what about adding a comment to nutch-site.xml, e.g.
>>
>> <!-- NOTE: In order to use https please add protocol-httpclient, but
>> be aware of possible performance problems! -->
>
> They were not performance problems. There were some issues related to
> using multiple threads, which would sometimes cause the httpclient
> library to fail. There was also a logging message produce in the
> internals of httpclient that was difficult to turn off - but now that we
> are using log4j this should be straightforward. There was a bug in
> chunked encoding handling that would cause hangs.
>
> There were also other intermittent problems with this library, so after
> much deliberation we decided to leave the simpler plugin as the default ...
>
> These issues may have been solved in a newer version of httpclient library.

______________________________________________
Chris A. Mattmann
[hidden email]
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group

_________________________________________________
Jet Propulsion Laboratory            Pasadena, CA
Office: 171-266B                        Mailstop:  171-246
_______________________________________________________

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.


Re: Problem crawling/fetching using https

chrismattmann
Folks,

I went ahead and added the following comment to nutch-default.xml:

"In order to use HTTPS please enable protocol-httpclient, but be aware of
possible intermittent problems with the underlying commons-httpclient
library."

 Hopefully this will help folks in the future with this.

Thanks!

Cheers,
  Chris





Re: Problem crawling/fetching using https

Michael Wechner
Chris Mattmann wrote:

>Folks,
>
> I've went ahead and added the following comment in nutch-default.xml:
>
>" In order to use HTTPS please enable protocol-httpclient, but be aware of
>possible intermittent problems with the underlying commons-httpclient
>library."
>
> Hopefully this will help folks in the future with this.
>  
>

Thanks very much.

Michael


