why does url change during fetching?


savannah_beckett
Hi,
  I am crawling
http://jobsearch.monster.com/PowerSearch.aspx?q=PHP%20Scripting%20Language%20(PHP%20Hypertext%20Preprocessor)&tjt=PHP%20Developer&where=san%20jose%2Cca&rad=20&rad_units=miles&tm=60&dv=


One of the URLs to be fetched during the crawl is
http://jobview.monster.com/PHP-MySQL-Developer-Back-End-Developer-Online-Gaming-Job-Menlo-Park-CA-89694469.aspx

 
But when it was fetched, it became this:
http://jobsearch.monster.com/\/\/jobview.monster.com\/PHP-MySQL-Developer-Back-End-Developer-Online-Gaming-Job-Menlo-Park-CA-89694469.aspx



Why?
Thanks.


Re: why does url change during fetching?

Alex McLintock
Hi Savannah,

I think the funny-looking URL is actually in the page you were
fetching. For instance, when I looked at the source I found this:

 href=\"http:\/\/jobview.monster.com\/PHP-Developer-PHP-Linux-Apache-MySQL-JavaScript-CSS-Job-Mountain-View-CA-89901978.aspx\"

That looks wrong to me, so it is no wonder that Nutch got a
little confused.
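
To make the mechanism concrete, here is a minimal sketch in Python (standard library only, not Nutch's own code). It assumes the job URL from the original post appeared in the page in the same backslash-escaped form as the href above. "\/" is a legal escape for "/" in JSON and JavaScript string literals, so decoding the text as a string literal gives back the ordinary URL, while resolving the still-escaped text against the search page treats it as a relative path.

```python
import json
from urllib.parse import urljoin

# Assumption: the page held the job URL in this escaped form,
# like the href found in the page source.
escaped = (
    r"http:\/\/jobview.monster.com\/PHP-MySQL-Developer-Back-End-Developer"
    r"-Online-Gaming-Job-Menlo-Park-CA-89694469.aspx"
)

# Decoding the text as a JSON string literal turns each "\/" back into "/".
decoded = json.loads('"' + escaped + '"')
print(decoded)

# Resolving the escaped text as-is: after "http:" there is no "//",
# so the rest is handled as a path relative to the base page.
base = "http://jobsearch.monster.com/PowerSearch.aspx"
mangled = urljoin(base, escaped)
print(mangled)
```

If the second printed URL has the same shape as the one seen at fetch time, that supports the idea that the link was taken from escaped script text without being unescaped first. Nutch's resolver is Java, so its behaviour may differ in detail.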

That appears to be inside some JavaScript, where \/ is simply an
escaped /. Nutch does some crude JavaScript parsing, but it may be
getting it wrong, or it may not have realised that it is JavaScript
in the first place.

It might be worth finding out whether the funny URLs are coming from
the JavaScript parsing or from the HTML parsing. That may require
adding some more logging.
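
One way to check this outside Nutch is to scan a saved copy of the page and note where the escaped slashes occur. The sketch below is only an illustration using Python's standard html.parser module. The class name EscapedLinkFinder and the sample fragment are hypothetical, and the fragment is not the real monster.com source.

```python
from html.parser import HTMLParser

ESCAPED_SLASH = "\\/"  # the two characters: backslash, then slash


class EscapedLinkFinder(HTMLParser):
    """Record whether escaped slashes sit in HTML attributes or in script text."""

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.findings = []  # list of (where, text) tuples

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True
        for name, value in attrs:
            if name in ("href", "src") and value and ESCAPED_SLASH in value:
                self.findings.append(("html attribute", value))

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        # Script bodies are delivered here as raw text.
        if self.in_script and ESCAPED_SLASH in data:
            self.findings.append(("script body", data.strip()))


# Hypothetical page fragment: one ordinary link, one escaped link in a script.
sample = (
    '<a href="http://jobview.monster.com/plain.aspx">plain</a>'
    '<script>var s = "href=\\"http:\\/\\/jobview.monster.com\\/job.aspx\\"";</script>'
)

finder = EscapedLinkFinder()
finder.feed(sample)
finder.close()
for where, text in finder.findings:
    print(where, "->", text)
```

If every finding is reported as "script body", the escaped URLs live in JavaScript, which would point at the JavaScript link extraction rather than the HTML parser.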

Alex


On 10 August 2010 08:25, Savannah Beckett <[hidden email]> wrote:

> [...]