crawling protected pages


crawling protected pages

Edward Quick
Hi,

I posted to the user list but didn't get a reply. I want to crawl a
protected site, but there doesn't seem to be an option for that in Nutch at
the moment.

However, it doesn't sound like something that would be too hard to add,
assuming the Java HTTP client library can handle it. As I'm not familiar
with the code, could someone point me at the file (or files) in the source
that do the crawling, please? I'm not claiming to be a top Java programmer
(Perl's my speciality), but I'll give it a shot, unless anyone else wants
to?!

Many thanks,

Ed.



Re: crawling protected pages

Andrzej Białecki
Edward Quick wrote:

> Hi,
>
> I posted to the user list but didn't get a reply. I want to crawl a
> protected site, but there doesn't seem to be an option for that in Nutch
> at the moment.
>
> However, it doesn't sound like something that would be too hard to add,
> assuming the Java HTTP client library can handle it. As I'm not
> familiar with the code, could someone point me at the file (or files) in
> the source that do the crawling, please? I'm not claiming to be a top
> Java programmer (Perl's my speciality), but I'll give it a shot, unless
> anyone else wants to?!

The quick hack would be to add the necessary code somewhere in the
protocol-httpclient plugin. Eventually though, I think Nutch should grow
an authentication factory, which would supply the needed credentials to
other plugins.
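
As a rough illustration of the authentication factory idea, here is a
minimal sketch of a registry that hands out credentials per host and
realm. Every name in it (CredentialFactorySketch, Credentials, register,
lookup) is hypothetical and none of it comes from the Nutch source; it
only assumes that credentials would be keyed by host and authentication
realm.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch only: none of these names exist in Nutch.
class CredentialFactorySketch {

    /** A username/password pair for one protected area. */
    static final class Credentials {
        final String user;
        final String password;

        Credentials(String user, String password) {
            this.user = user;
            this.password = password;
        }
    }

    // Credentials keyed by host and authentication realm.
    private final Map<String, Credentials> byScope = new HashMap<>();

    /** Registers credentials for a host/realm pair (assumed to come from configuration). */
    void register(String host, String realm, Credentials credentials) {
        byScope.put(key(host, realm), credentials);
    }

    /** Returns the credentials for a host/realm pair, or empty if none are registered. */
    Optional<Credentials> lookup(String host, String realm) {
        return Optional.ofNullable(byScope.get(key(host, realm)));
    }

    // Host names are case-insensitive; realms are compared exactly.
    private static String key(String host, String realm) {
        return host.toLowerCase(Locale.ROOT) + "\n" + realm;
    }

    public static void main(String[] args) {
        CredentialFactorySketch factory = new CredentialFactorySketch();
        factory.register("intranet.example.com", "staff", new Credentials("ed", "secret"));
        System.out.println(factory.lookup("INTRANET.example.com", "staff").isPresent());
        System.out.println(factory.lookup("www.example.org", "staff").isPresent());
    }
}
```

A protocol plugin could call lookup when a server answers with an
authentication challenge and skip the page if nothing is registered.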

--
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: crawling protected pages

Jack.Tang
Hi Andrzej

There is an HttpAuthenticationFactory class in the protocol-httpclient
plugin, but I doubt whether RFC 2617 basic authentication actually works:
I cannot find any reference to the HttpAuthenticationFactory class. Did I
miss something?
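
For reference, a minimal sketch of what RFC 2617 basic authentication
asks of a client: the Authorization header carries "Basic " followed by
the Base64 encoding of "user:password". The class and method names below
are hypothetical and are not taken from HttpAuthenticationFactory; using
ISO-8859-1 for the byte encoding is also an assumption.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical sketch only: not the Nutch HttpAuthenticationFactory.
class BasicAuthSketch {

    /** Builds the value of an Authorization header for the Basic scheme. */
    static String basicAuthHeader(String user, String password) {
        // RFC 2617 does not allow ':' in the user-id, since ':' is the separator.
        if (user.indexOf(':') >= 0) {
            throw new IllegalArgumentException("user-id must not contain ':'");
        }
        String pair = user + ":" + password;
        // Assumption: encode the pair as ISO-8859-1 bytes before Base64.
        byte[] raw = pair.getBytes(StandardCharsets.ISO_8859_1);
        return "Basic " + Base64.getEncoder().encodeToString(raw);
    }

    public static void main(String[] args) {
        // Uses the example credentials given in RFC 2617.
        System.out.println("Authorization: " + basicAuthHeader("Aladdin", "open sesame"));
        // prints "Authorization: Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ=="
    }
}
```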

Regards
/Jack

On 9/13/05, Andrzej Bialecki wrote:

> The quick hack would be to add necessary code somewhere in
> protocol-httpclient. Eventually though, I think Nutch should grow an
> authentication factory, which would supply needed credentials to other
> plugins.


--
Keep Discovering ... ...
http://www.jroller.com/page/jmars

Re: crawling protected pages

Andrzej Białecki
Jack Tang wrote:
> Hi Andrzej
>
> There is an HttpAuthenticationFactory class in the protocol-httpclient
> plugin, but I doubt whether RFC 2617 basic authentication actually works:
> I cannot find any reference to the HttpAuthenticationFactory class. Did I
> miss something?

Unfortunately, you didn't - when I imported the plugin I left this class
in place as a sort of reminder to complete this part... but as it is now
it is not used.

--
Best regards,
Andrzej Bialecki     <><