Reg: Issues while crawling pagination


Reg: Issues while crawling pagination

ShivaKarthik S
Hi,

Can you help me figure out an issue I am facing while crawling a hub page that has
pagination? The problem is what depth to set and how to handle the pagination.
I have a hub page with more than 4.95 lakh (about 495,000) paginated pages.
e.g. https://www.jagran.com/latest-news-page497342.html     <here 497342 is
the number of pages under the hub page latest-news>


--
Thanks and Regards
Shiva

RE: Issues while crawling pagination

Yossi Tamari
Hi Shiva,

My suggestion would be to programmatically generate a seeds file containing these 497342 URLs (since you know them in advance), then use a very low max-depth (probably 1) and a high number of iterations, since only a small number of URLs will be fetched in each iteration unless you set a very low crawl-delay.
(Mathematically, if you fetch 1 URL per second from this domain, fetching 497342 URLs will take about 138 hours.)
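
Something like the following quick Python sketch would do it (the output file name and the assumption that the page numbering starts at 1 are just for illustration; adjust as needed):

# Hypothetical seed generator: writes one hub-page URL per line to seeds.txt.
# The URL pattern is taken from the example above.
NUM_PAGES = 497342

with open("seeds.txt", "w") as f:
    for page in range(1, NUM_PAGES + 1):
        f.write(f"https://www.jagran.com/latest-news-page{page}.html\n")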

        Yossi.



RE: Issues while crawling pagination

Markus Jelsma
Hello,

Yossi's suggestion is excellent if your use case is to crawl everything once and never again. However, if you need to crawl future articles as well, and have to deal with mutations (pages that change after publication), then let the crawler run continuously without regard for depth.

The latter is the usual case; after all, if you had been given this task a few months ago, you wouldn't need to go to a depth of 497342, right?

Regards,
Markus


 
 

RE: Issues while crawling pagination

Yossi Tamari
Hi Shiva,

Having looked at the specific site, I have to amend my recommended max-depth from 1 to 2, since I assume you want to fetch the stories themselves, not just the hub pages.

If you want to crawl continuously, as Markus suggested, I still think you should keep the depth at 2, but give the first hub page(s) a very high priority and a very short recrawl delay. This is because new stories are always added on the first page and then pushed back to later pages. I suspect that if you don't limit depth, and especially if you don't limit yourself to the domain, you will find yourself crawling the whole internet eventually. If you do limit yourself to the domain, that won't be a problem, but unless you give special treatment to the first page(s), you will be continuously recrawling hundreds of thousands of static pages.
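
To illustrate the kind of special treatment I mean, here is a rough Python sketch. It is not tied to any particular crawler's API, and the interval values are made up; the URL pattern is taken from the example above. The idea is that the first few hub pages get a short recrawl interval, while deep hub pages and story pages are treated as effectively static:

import re

# Hub pages look like latest-news-page<N>.html; everything else is a story page.
HUB_RE = re.compile(r"https://www\.jagran\.com/latest-news-page(\d+)\.html")

def recrawl_minutes(url, hot_pages=5):
    """Return a suggested recrawl interval in minutes for a given URL."""
    m = HUB_RE.match(url)
    if m and int(m.group(1)) <= hot_pages:
        return 15           # first few hub pages: new stories land here, recrawl often
    if m:
        return 7 * 24 * 60  # deep hub pages: effectively static, recrawl rarely
    return 30 * 24 * 60     # story pages: rarely change once published

print(recrawl_minutes("https://www.jagran.com/latest-news-page2.html"))       # 15
print(recrawl_minutes("https://www.jagran.com/latest-news-page497342.html"))  # 10080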

        Yossi.
