Nutch Crawler, Page Redirection and Pagination


Nutch Crawler, Page Redirection and Pagination

Jack.Tang
Hi Guys

I know this is a difficult question for a crawler, and I just want to
know whether it is possible with Nutch's crawler.

The page structure of the website I want to crawl is like this:

  search box                          URL redirector        result pages
  http://a.com/search?x=1&y=2&z=3 --> http://a.com/b   --+--> Page 1: http://a.com/next?pg=1
  http://a.com/search?x=2&y=2&z=4                         +--> Page 2: http://a.com/next?pg=2
                                                          +--> Page ...: http://a.com/next?pg=...

I added the URLs from the search box to the Nutch crawler's seed list. Say
the crawler is now processing http://a.com/search?x=1&y=2&z=3; it is
redirected to http://a.com/b, the response is OK, so page 1
(http://a.com/next) is fetched and parsed. Page 1 contains pagination
information like:

1
<a href="http://a.com/next?pg=2">2</a>
<a href="http://a.com/next?pg=3">3</a>
<a href="http://a.com/next?pg=4">4</a>

Unfortunately, pages 2, 3 and 4 are unfetchable for the crawler, because the
URL redirector stores the query options in a cookie/session, and the
pagination pages read that information from there. In other words, the real
URL of page 2 is http://a.com/search?x=1&y=2&z=3&pg=2.
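
To make the relationship concrete, here is a rough sketch (plain Java, purely
for illustration; the class and method names are made up and nothing like this
exists in Nutch) of how the real page-2 URL could be rebuilt by re-attaching
the original search query string to the stripped pagination link:

    import java.net.URI;
    import java.net.URISyntaxException;

    // Illustration only: rebuild the "real" pagination URL by re-attaching the
    // query string of the originating search URL to the stripped pagination link.
    public class PaginationUrlRebuilder {

        static String rebuild(String searchUrl, String paginationUrl)
                throws URISyntaxException {
            URI search = new URI(searchUrl);      // http://a.com/search?x=1&y=2&z=3
            URI page = new URI(paginationUrl);    // http://a.com/next?pg=2
            // Keep the search path and parameters, append the page number.
            String merged = search.getQuery() + "&" + page.getQuery();
            return new URI(search.getScheme(), search.getAuthority(),
                           search.getPath(), merged, null).toString();
        }

        public static void main(String[] args) throws URISyntaxException {
            System.out.println(rebuild("http://a.com/search?x=1&y=2&z=3",
                                       "http://a.com/next?pg=2"));
            // prints: http://a.com/search?x=1&y=2&z=3&pg=2
        }
    }

The crawler itself never sees this relationship, because the x/y/z parameters
only exist in the server-side session.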

Any solution for this scenario?


Regards
/Jack
--
Keep Discovering ... ...
http://www.jroller.com/page/jmars

Re: Nutch Crawler, Page Redirection and Pagination

Transbuerg Tian
I have run into the same situation you describe.

I think crawling dynamic pages is a black hole for a crawler.

We cannot get all the necessary parameters that need to be posted to a form.

And when fetching dynamic pages, we also need to identify duplicate pages.


Re: Nutch Crawler, Page Redirection and Pagination

em-13
To crawl dynamic pages you need to know the dynamic structure of each
website separately.

Or, as I used to do it, just crawl everything in small enough chunks, and
when something goes wrong, look at the website in question, determine why it
happened, modify the urlfilter, and repeat the process. This is a feasible
process if you can dedicate yourself to it full time for up to a million pages
or so. In my experience, anything over 1 million pages, with dynamic
pages in it, is just plain asking for the fetchlists to grow without bound.
I've encountered static pages that are creative with the Apache
extensions and send Nutch into infinity. I've encountered huge websites
with loads of static files that also throw off the desired fetching range.
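
For example (the host and patterns here are purely illustrative), after a bad
run I end up adding exclusions to the regex URL filter by hand, something like:

    # illustrative regex-urlfilter.txt entries; host and paths are made up
    # skip the paginated search that blew up the fetchlist on the last run
    -^http://([a-z0-9]*\.)*example\.com/search\?
    # accept everything else, as before
    +.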

I know that if you are a big user (several dedicated machines in a data
center with a fast connection...) you probably don't care about this: your
crawler will run over any website with 50-500 threads and the default three
retries, and the problem will sort itself out. But can something
be done for the rest of us, please?

A simple <max-host-pages>500</max-host-pages> option would really be appreciated.

Regards,
EM





Re: Nutch Crawler, Page Redirection and Pagination

Jack.Tang
Hi EM

On 9/26/05, EM <[hidden email]> wrote:
> To crawl dynamic pages you need to know the dynamic structure of each
> website separately.
>
> Or, as I used to do it, just crawl everything in small enough chunks, and
> when something goes wrong, look at the website in question, determine why it
> happened, modify the urlfilter, and repeat the process.

Thank you for your advice. I figured out the reason, and I think
"modify the urlfilter and repeat the process" cannot solve the problem,
because Nutch's crawler cannot store the "page context" when a "URL
redirector" is involved.
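
Just to show what I mean by "page context", a rough sketch (plain java.net
code, only for illustration; this is not how Nutch's protocol plugins work) of
carrying the session cookie set by the redirector over to the pagination
requests:

    import java.io.IOException;
    import java.net.HttpURLConnection;
    import java.net.URL;

    // Illustration only: the "page context" is the session cookie that the URL
    // redirector sets; every pagination request has to send it back, otherwise
    // the server no longer knows which search the page numbers belong to.
    public class SessionAwareFetch {

        public static void main(String[] args) throws IOException {
            // 1. Request the search URL, but do not follow the redirect yet;
            //    the redirect response is where the session cookie is set.
            HttpURLConnection search = (HttpURLConnection)
                    new URL("http://a.com/search?x=1&y=2&z=3").openConnection();
            search.setInstanceFollowRedirects(false);
            String cookie = search.getHeaderField("Set-Cookie");   // simplified: one cookie
            String location = search.getHeaderField("Location");   // e.g. http://a.com/b
            search.disconnect();

            // 2. Fetch a pagination page with the same cookie, so the server can
            //    recover the original query options (x, y, z) from the session.
            HttpURLConnection page = (HttpURLConnection)
                    new URL("http://a.com/next?pg=2").openConnection();
            if (cookie != null) {
                page.setRequestProperty("Cookie", cookie);
            }
            System.out.println("redirect target: " + location);
            System.out.println("page 2 status:   " + page.getResponseCode());
            page.disconnect();
        }
    }

Nutch's fetcher treats every URL independently, so this per-search state is
exactly the context it cannot keep.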

> This is a feasible
> process if you can dedicate yourself to it full time for up to a million pages
> or so. In my experience, anything over 1 million pages, with dynamic
> pages in it, is just plain asking for the fetchlists to grow without bound.
> I've encountered static pages that are creative with the Apache
> extensions and send Nutch into infinity. I've encountered huge websites
> with loads of static files that also throw off the desired fetching range.
>
> I know that if you are a big user (several dedicated machines in a data
> center with a fast connection...) you probably don't care about this: your
> crawler will run over any website with 50-500 threads and the default three
> retries, and the problem will sort itself out. But can something
> be done for the rest of us, please?
No, I don't think so. Some web designers put a "URL redirector" in place as
an obstacle to search engines. It is common in China, and you cannot get
the content of these websites at all.



--
Keep Discovering ... ...
http://www.jroller.com/page/jmars

Re: Nutch Crawler, Page Redirection and Pagination

em-13

>> I know that if you are a big user (several dedicated machines in a data
>> center with a fast connection...) you probably don't care about this: your
>> crawler will run over any website with 50-500 threads and the default three
>> retries, and the problem will sort itself out. But can something
>> be done for the rest of us, please?
>
> No, I don't think so. Some web designers put a "URL redirector" in place as
> an obstacle to search engines. It is common in China, and you cannot get
> the content of these websites at all.
Maybe I wasn't totally clear: with a 10-second timeout, the fetcher
will skip over a bunch of pages on the same host. Any obstacles will be
pretty much ignored, because those pages won't be fetched and the pages
linked from them won't be fetched either. On a large scale, search engine
traps or not, the fetcher will play rough and get past them in 3 runs
(actually a bit more, since some pages will be fetched). This is of
course only the case if you don't need 100% of the pages, just as many as you
can fetch.

People who are technically able to set up search engine traps should be
technically able to set up robots.txt. Of course, with both sides not
obeying the rules, it's a bit of a mess lately, and everyone is paying
the price.

I've encountered cases where spam was the issue rather than search engine
traps. There's a website that has mod_rewrite or something like that
set up so that ANY RANDOM link you type on its pages is valid and shows
you a bunch of unrelated random advertisements. These are static pages,
by the way. Now, if I had 100 Mbps my fetcher would run over that website
without blinking; being limited to 2, the effect is noticeable. No matter
how many times I ran the fetcher, the number of unfetched pages wasn't
decreasing ;) I've encountered cases like this, and instead of manually
writing regexes to clean them off (which takes time) I'd strongly prefer an
automated solution if possible.

Regards,
EM



Re: Nutch Crawler, Page Redirection and Pagination

Jack.Tang
Hi EM

On 9/26/05, EM <[hidden email]> wrote:

> Maybe I wasn't totally clear: with a 10-second timeout, the fetcher
> will skip over a bunch of pages on the same host. Any obstacles will be
> pretty much ignored, because those pages won't be fetched and the pages
> linked from them won't be fetched either. On a large scale, search engine
> traps or not, the fetcher will play rough and get past them in 3 runs
> (actually a bit more, since some pages will be fetched). This is of
> course only the case if you don't need 100% of the pages, just as many as you
> can fetch.

Well, let me explain my case in more detail.

Say there is only one entry point that lists all the content of the website:
http://a.com/search?city=YourCity. (We treat it as the search engine of the
website, of course.)
If I set "YourCity" to "NewYork", it will list all the content related to
NewYork, spread over many pages. The pagination URLs we expect are
http://a.com/search?city=NewYork&pg=1 for the first page and
http://a.com/search?city=NewYork&pg=2 for the second page.

The Nutch crawler is not "trapped" in this case.

But if we change the website design, it will be "trapped".
After the request "http://a.com/search?city=NewYork" is sent, the
website first stores the query parameters in a cookie/session, then
redirects to another page, say http://a.com/next.html, and returns that as
the response. The pagination URLs also change: the query parameters are
discarded, since all of them can be retrieved from the cookie/session. The
final pagination URLs become
http://a.com/search?pg=1 for the first page and
http://a.com/search?pg=2 for the second page.

If the Nutch crawler simply extracts these URLs from the HTML document, it
definitely cannot get the desired content when parsing, right?
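
To see why, a small sketch (plain Java, only for illustration; the second
"Boston" search is my own made-up example, just to show the collision) of what
a parser that looks only at the HTML would extract: the pagination markup
returned for two different searches is identical, so the extracted outlinks
carry no trace of which search they belong to.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Illustration only: the HTML served after two different searches exposes
    // byte-for-byte identical pagination links, so a parser that only sees the
    // HTML cannot tell which search (which session) a link belongs to.
    public class StrippedOutlinks {

        static List<String> extractLinks(String html) {
            List<String> links = new ArrayList<String>();
            Matcher m = Pattern.compile("href=\"([^\"]+)\"").matcher(html);
            while (m.find()) {
                links.add(m.group(1));
            }
            return links;
        }

        public static void main(String[] args) {
            // What http://a.com/next.html serves after searching for NewYork ...
            String newYorkPage = "<a href=\"http://a.com/search?pg=2\">2</a>";
            // ... and after searching for Boston: the session differs, the HTML does not.
            String bostonPage  = "<a href=\"http://a.com/search?pg=2\">2</a>";
            System.out.println(extractLinks(newYorkPage));   // [http://a.com/search?pg=2]
            System.out.println(extractLinks(bostonPage));    // [http://a.com/search?pg=2]
        }
    }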

Regards
/Jack







--
Keep Discovering ... ...
http://www.jroller.com/page/jmars

Re: Nutch Crawler, Page Redirection and Pagination

em-13
> Say there is only one entry point that lists all the content of the website:
> http://a.com/search?city=YourCity. (We treat it as the search engine of the
> website, of course.)
> If I set "YourCity" to "NewYork", it will list all the content related to
> NewYork, spread over many pages. The pagination URLs we expect are
> http://a.com/search?city=NewYork&pg=1 for the first page and
> http://a.com/search?city=NewYork&pg=2 for the second page.
>
> The Nutch crawler is not "trapped" in this case.
True.

> But if we change the website design, it will be "trapped".
> After the request "http://a.com/search?city=NewYork" is sent, the
> website first stores the query parameters in a cookie/session, then
> redirects to another page, say http://a.com/next.html, and returns that as
> the response. The pagination URLs also change: the query parameters are
> discarded, since all of them can be retrieved from the cookie/session. The
> final pagination URLs become
> http://a.com/search?pg=1 for the first page and
> http://a.com/search?pg=2 for the second page.
>
> If the Nutch crawler simply extracts these URLs from the HTML document, it
> definitely cannot get the desired content when parsing, right?
Sometimes, what information is available to you is determined by the
decisions of whoever designed the page, right?
If the page tries to be smart and 'determine' what the user wants to see,
well, if you don't own that web page there isn't much you can do.

Best regards,
EM