Crawler submits forms?

Crawler submits forms?

Andy Read-2
Hi,

I'm using Nutch to create a site search facility for a couple of sites.

I upgraded from 0.6 to 0.7.1 a few days ago and have just noticed that blank
users are being registered on my site at the exact times the cron job runs
the crawl tool to re-index the site.  This means that the crawler is now
submitting a post request from the registration form!  Is this a new
'feature' of 0.7 or 0.7.1?  I can't find any mention in changes.txt and I
can't find any config option referring to it.  Surely the crawler should
never submit form input?

Any help appreciated.

Thanks,

Andy Read

www.azurite.co.uk


Re: Crawler submits forms?

Jack.Tang
Hi

You can read the article about Stanford's HiWE search engine on www10.org.
It is also easy to extend Nutch if you are using the http-client protocol.

http://www10.org/cdrom/posters/p1049/

Good luck:)

/Jack


--
Keep Discovering ... ...
http://www.jroller.com/page/jmars

Re: Crawler submits forms?

Rod Taylor-2
In reply to this post by Andy Read-2
On Tue, 2005-12-13 at 16:57 +0000, Andy Read wrote:

> Hi,
>
> I'm using Nutch to create a site search facility for a couple of sites.
>
> I upgraded from 0.6 to 0.7.1 a few days ago and have just noticed that blank
> users are being registered on my site at the exact times the cron job runs
> the crawl tool to re-index the site.  This means that the crawler is now
> submitting a post request from the registration form!  Is this a new
> 'feature' of 0.7 or 0.7.1?  I can't find any mention in changes.txt and I
> can't find any config option referring to it.  Surely the crawler should
> never submit form input?

Nutch follows links. You can argue that it should not extract links from
POST style forms (this change has been made) but in the end it doesn't
make much of a difference since if you link to that script in any way (a
href, etc.) it will be followed and give you the same results.

Your registration form script is broken if it accepts invalid input (or
GET requests at all), and robots.txt should be used to protect dynamic
areas from inadvertent use.
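
As a rough illustration of the robots.txt suggestion, here is a minimal Python sketch that checks a rule set with the standard library's urllib.robotparser. The /register and /contact paths, the example.com host and the "NutchCrawler" agent name are assumptions made up for the example, not details from this thread.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; the disallowed paths are assumptions.
ROBOTS_TXT = """\
User-agent: *
Disallow: /register
Disallow: /contact
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for path in ("/index.html", "/register", "/contact"):
        url = "http://www.example.com" + path
        print(path, is_allowed(ROBOTS_TXT, "NutchCrawler", url))
```

This only helps with crawlers that honour robots.txt, so the script should still validate its input.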

--
Rod Taylor <[hidden email]>


Re: Crawler submits forms?

Jack.Tang
In reply to this post by Jack.Tang
And please note the mail from Doug on Nov 23.

---------------------------------------------------------------------------------------------
Title: [Fwd: Spider Causing Contact Form Submissions]
Body: It looks as though Nutch is inadvertently submitting forms.

At DOMContentUtils.java:58 we specify that the "action" parameter of an
HTML form should be extracted as a link.  Yet we ignore the "method"
parameter of the form.  I think we should only follow these when the
method is "get", not when it is "post".

Do others agree?

Doug
-------------------------------------------------------------------------------------------
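
The rule Doug describes (treat a form's "action" as a link only when the method is "get") can be sketched roughly as follows. This is a Python illustration built on the standard html.parser module, not the Java code in DOMContentUtils.java; the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects outlinks; a form action counts only if the method is GET."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "a" and attr.get("href"):
            self.links.append(attr["href"])
        elif tag == "form" and attr.get("action"):
            # HTML treats a missing method attribute as GET.
            method = (attr.get("method") or "get").strip().lower()
            if method == "get":
                self.links.append(attr["action"])


def extract_links(html: str) -> list:
    """Return the outlinks found in html, skipping POST form actions."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


if __name__ == "__main__":
    page = (
        '<a href="/about">About</a>'
        '<form action="/search" method="GET"></form>'
        '<form action="/register" method="post"></form>'
    )
    print(extract_links(page))
```

Note that a plain href pointing at the same script is still followed, which is the point Rod makes.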

I think the source code in svn ignores POST URLs now.

/Jack



RE: Crawler submits forms?

Andy Read-2
Thanks for these various responses.

I agree that I should be checking input more carefully and will do so.
In my experience most developers find it useful to allow both GET and POST
input so would prefer not to deny GET requests.
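
A minimal sketch of that kind of check, in Python and independent of any web framework: it accepts both GET and POST but refuses blank submissions, which is what a crawler following the form's action URL would send. The field names are assumptions for illustration only.

```python
# Hypothetical required fields for a registration form.
REQUIRED_FIELDS = ("username", "email")


def validate_registration(method: str, fields: dict) -> list:
    """Return a list of error messages; an empty list means the input is OK."""
    errors = []
    if method.upper() not in ("GET", "POST"):
        errors.append(f"unsupported method: {method}")
    for name in REQUIRED_FIELDS:
        # Treat missing, None and whitespace-only values as blank.
        if not (fields.get(name) or "").strip():
            errors.append(f"missing required field: {name}")
    return errors


if __name__ == "__main__":
    # A crawler requesting the bare action URL supplies no fields at all.
    print(validate_registration("GET", {}))
    print(validate_registration("POST", {"username": "andy", "email": "andy@example.com"}))
```

A user would only be registered when the returned list is empty.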

But I do agree with Doug's fix to stop the crawler following POST links as
the recommendation is that POST requests are used where side-effects are
likely (see http://www.w3.org/2001/tag/doc/whenToUseGet.html#checklist).  I
assume this fix will make it into 0.7.2 at some point, in case I don't want
to build from svn.

I'm not quite sure Jack's response about Stanford's HiWE search engine was a
direct answer to my question, but it does raise the question of whether some
applications will always think there are valid reasons to submit form POSTs
in an effort to discover "the hidden web".

This seems very reminiscent of the Google Web Accelerator saga earlier this
year (e.g. see
http://www.sitepoint.com/newsletter/viewissue.php?id=3&issue=113&format=html
), although that caused problems even with hrefs with side-effects (bad
idea!) but usually only when users are logged in.

Andy Read

www.azurite.co.uk