A problem for web site needing username & password


郑世强
Hi!
Many web sites require a username and password to log in. If I want to
crawl a site like this, and I know the username and password, how can
I give the crawler those credentials so that it logs in the way a human
would? Which configuration files do I need to change?
Thanks!


Re: A problem for web site needing username & password

Michael Piccuirro
If you're talking about basic HTTP authentication, I had the same problem
using Nutch 0.9.  I saw a few articles explaining how to do it by modifying
config files, but nothing worked.  So as a messy quick fix I just modified
this file:

src\plugin\protocol-httpclient\src\java\org\apache\nutch\protocol\httpclient\Http.java

I just grab a username/password from the config:

   String basicUsername = conf.get("http.auth.basic.username");
   String basicPassword = conf.get("http.auth.basic.password");

Then set the credentials like this:

   // NTLM credentials (the ntlm* variables are assumed to be set earlier in Http.java)
   Credentials ntCreds = new NTCredentials(ntlmUsername, ntlmPassword, ntlmHost, ntlmDomain);
   client.getState().setCredentials(new AuthScope(ntlmHost, AuthScope.ANY_PORT), ntCreds);

   if (LOG.isInfoEnabled()) { LOG.info("**** setting basic auth credentials ****"); }
   // Send the Basic credentials up front instead of waiting for a 401 challenge
   client.getParams().setAuthenticationPreemptive(true);

   // Basic auth credentials, limited to one host (any port, any realm)
   client.getState().setCredentials(
       new AuthScope("www.mydomain.com", AuthScope.ANY_PORT, AuthScope.ANY_REALM),
       new UsernamePasswordCredentials(basicUsername, basicPassword));


It's not the best way to do this, but it'll work.

Change www.mydomain.com to your own domain.
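
For anyone wondering what the preemptive setting actually puts on the wire: with Basic authentication the client sends an Authorization header built from the username and password on every request to the matching host. Here is a minimal standalone sketch of that header using only the JDK. The class and method names (BasicAuthHeader, build) are made up for illustration and are not part of Nutch or HttpClient, and the sample credentials are the well-known ones from the HTTP Basic auth RFC.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper, for illustration only (not a Nutch or HttpClient class).
class BasicAuthHeader {

    /** Returns the Authorization header value: "Basic " + base64("username:password"). */
    static String build(String username, String password) {
        String pair = username + ":" + password;
        String encoded = Base64.getEncoder()
                .encodeToString(pair.getBytes(StandardCharsets.UTF_8));
        return "Basic " + encoded;
    }

    public static void main(String[] args) {
        // Sample credentials from the Basic auth specification, not real ones.
        System.out.println("Authorization: " + build("Aladdin", "open sesame"));
    }
}
```

Keep in mind that base64 is only an encoding, not encryption, so the password is effectively sent in the clear unless the site is served over HTTPS.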


Another way around it is to have Nutch go through a proxy and have the
proxy tack on the auth header. I was using CharlesProxy.  Again, this is
not the best way to do it at all, but it'll get you going.
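
To make the proxy idea a bit more concrete, here is a rough sketch of the rule such a proxy applies to each outgoing request: if the request goes to the protected host and has no Authorization header yet, add one. This is not how CharlesProxy is configured (that is done in its UI); the class AuthInjector and its method are hypothetical names, and the header lookup is simplified to an exact-case match.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of what a header-rewriting proxy does; not a real proxy API.
class AuthInjector {

    /**
     * Returns a copy of the request headers, with the Authorization header added
     * when the request targets protectedHost and does not already carry one.
     * Simplification: header names are compared with exact case here, although
     * HTTP header names are case-insensitive.
     */
    static Map<String, String> inject(String requestHost,
                                      Map<String, String> headers,
                                      String protectedHost,
                                      String authValue) {
        Map<String, String> out = new LinkedHashMap<>(headers);
        if (requestHost.equalsIgnoreCase(protectedHost)
                && !out.containsKey("Authorization")) {
            out.put("Authorization", authValue);
        }
        return out;
    }
}
```

Restricting the rule to one host matters for the same reason the AuthScope above names a single domain: the crawler will follow links to other sites, and those should not receive your credentials.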


On Mon, Jul 28, 2008 at 2:53 AM, zhengsj03 User <[hidden email]> wrote:

> Hi!
> In many web sites username and password are needed to login.If I want to
> crawl a web site like this,and I know the username and password,how can
> I let the crawler know the username and password to login the site like
> a human doing.How can I change the configuration files?
> Thanks!

Re: A problem for web site needing username & password

郑世强
For the past few days I have tried to solve the problem by modifying the
source code, but failed.
I think your method will help me. I will try it. Thanks!
