recrawling

recrawling

Neeti Gupta
We have built a crawler that visits various sites, and I want it to recrawl each site as soon as it is updated. Can anyone tell me how to find out when a site has been updated and it is time to crawl it again?

Re: recrawling

Otis Gospodnetic-2-2

Neeti,

I don't think there is a way to know when a regular web site has been updated.  You can issue GET or HEAD requests and look at the Last-Modified date, but this is not 100% reliable.  You can fetch and compare content, but that's not 100% reliable either.  If you are indexing blogs, then you can get "pings" when they update, or can rely on detecting changes in their feeds.

 Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
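The Last-Modified check mentioned above could be sketched roughly as follows. This is only an illustration, not something from the thread: the function names `fetch_headers` and `header_says_modified` are made up for this example, and it assumes the crawler stores the time of its last visit as an HTTP date string. A result of `None` means the header gave no usable answer, which is one reason the approach is not fully reliable.

```python
import urllib.request
from email.utils import parsedate_to_datetime
from typing import Mapping, Optional


def fetch_headers(url: str, timeout: float = 10.0) -> Mapping[str, str]:
    """Issue a HEAD request and return the response headers (hypothetical helper)."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.headers


def header_says_modified(headers: Mapping[str, str],
                         last_crawl_http_date: str) -> Optional[bool]:
    """Compare Last-Modified with the time of the previous crawl.

    Returns True or False when the header is usable, and None when it is
    missing or cannot be parsed, so the caller can fall back to another check.
    Note: a plain dict lookup is case-sensitive; the headers object returned
    by urllib is not.
    """
    raw = headers.get("Last-Modified")
    if raw is None:
        return None
    try:
        modified = parsedate_to_datetime(raw)
        crawled = parsedate_to_datetime(last_crawl_http_date)
        return modified > crawled
    except (TypeError, ValueError):
        # Unparseable date, or a naive/aware datetime mismatch.
        return None
```

A crawler might call `header_says_modified(fetch_headers(url), stored_date)` and refetch the page when the result is `True` or `None`.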



----- Original Message ----

> From: Neeti Gupta <[hidden email]>
> To: [hidden email]
> Sent: Wednesday, June 24, 2009 7:52:47 AM
> Subject: recrawling
>
>
> we had made a crawler that visit various sites, and i want the crawler to
> crawl sites as soon as they are updated, if anyone can help me to know how i
> can know when the site is updated and its the time to crawl again
> --
> View this message in context:
> http://www.nabble.com/recrawling-tp24183356p24183356.html
> Sent from the Nutch - User mailing list archive at Nabble.com.


Re: recrawling

Neeti Gupta
But are there any rules by which we can decide when to crawl a website, so that we get its updated content as soon as possible?



Re: recrawling

Sjaiful Bahri
In reply to this post by Neeti Gupta

You have to detect changes in the web content itself.

http://zipclue.com
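One simple way to detect content changes, sketched here under assumptions that are not from the thread, is to store a hash of each page at crawl time and compare it on the next visit. The names `content_fingerprint` and `has_changed` are hypothetical, and the whitespace normalization is only one possible choice; pages with rotating ads or embedded timestamps will still look changed on every fetch.

```python
import hashlib
import re
from typing import Optional


def content_fingerprint(html: str) -> str:
    """Return a SHA-256 hex digest of the page with whitespace runs collapsed."""
    # Collapse whitespace so that pure reformatting does not count as a change.
    normalized = re.sub(r"\s+", " ", html).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def has_changed(previous_fingerprint: Optional[str], html: str) -> bool:
    """True when the page differs from the stored fingerprint, or was never seen."""
    return previous_fingerprint != content_fingerprint(html)
```

The crawler would keep the fingerprint alongside each URL and update it whenever `has_changed` returns `True`.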


--- On Tue, 7/14/09, Neeti Gupta <[hidden email]> wrote:

> From: Neeti Gupta <[hidden email]>
> Subject: Re: recrawling
> To: [hidden email]
> Date: Tuesday, July 14, 2009, 6:50 AM
>
> But are there any rules by which we can define when to
> crawl a website to get
> its updated contents
> as soon as possible.
>
> --
> View this message in context: http://www.nabble.com/recrawling-tp24183356p24474563.html
> Sent from the Nutch - User mailing list archive at
> Nabble.com.
>
>