Site being crawled even when the URL is removed from seed.txt

Rajinimaski
Hi Team,

   I crawled this site for the first time:
http://viterbi.usc.edu/admission/
   Nutch command: Downloads/apache-nutch-1.5.1$ bin/nutch crawl urls
-dir nutchcrawldb -solr http://localhost:8080/solrnutch -depth 3 -topN 5


   Now I wanted to do a fresh crawl, so after the crawl above completed
I followed these steps:

   - Changed the URL in seed.txt to service.sony.com.in,
   - Deleted the above nutchcrawldb and
   - in regex-urlfilter.txt I just gave "+." [I know that "+."
   means accept anything, but does "anything" include URLs that are
   not in seed.txt too?]
   - initiated crawling.
   - Nutch command: Downloads/apache-nutch-1.5.1$ bin/nutch crawl urls
   -dir new_crawl_db -solr http://localhost:8080/solrnutch_new -depth 3
   -topN 5


What I observe is that the site http://viterbi.usc.edu/admission/
is still being crawled even though the URL is no longer in seed.txt and the
old crawldb (nutchcrawldb) no longer exists. Why is this happening? Does Nutch
store the seed list somewhere else?

Thanks
Rajani

Re: Site being crawled even when the URL is removed from seed.txt

lewis john mcgibbney
Hi Rajani,



On Wed, Dec 19, 2012 at 10:33 AM, Rajani Maski <[hidden email]> wrote:

>
>    Now I wanted to do a fresh crawl, so after the crawl above completed
> I followed these steps:
>
>    - Changed the URL in seed.txt to service.sony.com.in,
>

Did you inject the above URL into the new crawl database?


>    - in regex-urlfilter.txt I just gave "+." [I know that "+."
>    means accept anything, but does "anything" include URLs that are
>    not in seed.txt too?]
>

Nutch will follow out/in links for any given URL (depending on your
configuration). The crawler cannot magically jump to undiscovered URLs;
there needs to be a graph of links connecting the nodes.
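
To make the "+." question concrete, here is a minimal sketch that approximates the rule with grep. It assumes that the text after "+" is a regular expression searched for anywhere in the URL; grep -E is only a stand-in for Nutch's own matcher, and the URL list and the variable name accepted are illustrative.

```shell
#!/bin/sh
# Approximate the rule "+." from regex-urlfilter.txt with grep.
# Assumption: "+" means accept, and "." is a regex that matches any single
# character, so every non-empty URL passes the filter.
accepted=0
for url in \
    "http://viterbi.usc.edu/admission/" \
    "http://service.sony.com.in/" \
    "http://localhost:8080/nutch-test-site/chi.html"
do
    if printf '%s\n' "$url" | grep -Eq '.'; then
        echo "accepted: $url"
        accepted=$((accepted + 1))
    fi
done
echo "accepted $accepted of 3 URLs"
```

So with "+." the filter itself places no restriction on hosts; which URLs are actually reached still depends on the seeds and on the links discovered from them.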


>
> What I observe is crawling for the site:
> http://viterbi.usc.edu/admission/
> is still taking place even though the URL does not exist in seed.txt and the
> old crawldb (nutchcrawldb) no longer exists.
>

If you have totally deleted the old crawl database, this should be
impossible. The crawl database tracks URLs along with lots of metadata;
once it is deleted, this information is lost and you will need to create
your crawl database from scratch.

Lewis

Re: Site being crawled even when the URL is removed from seed.txt

Rajinimaski
Hi Lewis,

   I think something is wrong in the configuration on my end.

But I have yet to find out why the crawl still covers old links that are
neither in the seed text file nor in the old crawl db, which was created in
/home/ubuntu/nutch_new_setup/testcrawl/crawldb and no longer exists.
Did you mean that same crawldb, or does Nutch create a tmp folder somewhere
else that needs to be cleared?

Please find the screen shots at this
link <http://rajinimaski.blogspot.in/2012/12/nutch-learning.html>, taken
during setup and while the crawl was running. They show the detailed
configuration steps followed.


Thanks & Regards,
Rajani Maski



On Wed, Dec 19, 2012 at 6:35 PM, Lewis John Mcgibbney <
[hidden email]> wrote:

> If you have totally deleted the old crawl database this should be
> impossible. The crawl database tracks URLs along with lots of metadata,
> once it is deleted this information is lost and you will need to create
> your crawl database from scratch.
>
> Lewis
>

Re: Site being crawled even when the URL is removed from seed.txt

Tejas Patil
Hi Rajani,

As per screen shot #1, the seed URL (
http://localhost:8080/nutch-test-site/chi.html) was saved in the file named
"seeds.txt". But while running the crawl (screen shot #3), this file is not
passed as an argument to the crawl command. Instead, something else named
"urls" is passed as the argument. I suspect that it might contain the
links from sony.com and usc.edu.
Please pass the correct seed file in the crawl command and run a fresh
crawl again.

Thanks,
Tejas Patil


On Wed, Dec 26, 2012 at 4:04 AM, Rajani Maski <[hidden email]> wrote:

> Please find the screens shots in this
> link<http://rajinimaski.blogspot.in/2012/12/nutch-learning.html> taken
> during set up and while crawl is executed. It shows the detailed
> configuration steps followed.

Re: Site being crawled even when the URL is removed from seed.txt

Rajinimaski
Hi Tejas,

    "urls" is a directory, /home/ubuntu/nutch_new_setup/urls, and it holds
seed.txt. There is only one file in it, named seed.txt, and that file has
only one URL: http://localhost:8080/nutch-test-site/chi.html. You can see the
folder structure in screen shot 1. I am sure that there is no other
urls/seed.txt folder structure on disk. This is the command:
ubuntu@ubuntu-OptiPlex-390:~/nutch_new_setup$ bin/nutch crawl urls -dir
tomcatcrawl -solr http://localhost:8080/nutch_poc -depth 5

Thanks & Regards
Rajani


On Wed, Dec 26, 2012 at 11:36 PM, Tejas Patil <[hidden email]> wrote:

> Hi Rajani,
>
> As per screen shot #1, the seed url (
> http://localhost:8080/nutch-test-site/chi.html) was saved in the file
> named
> "seeds.txt". But while running the crawl (screen shot #3), this file is not
> passed as an argument to the crawl command. Instead some other file named
> "urls" is passed as an argument. I suspect that it might be having the
> links from sony.com and usc.edu.
> Please pass the correct seed file in the crawl command and run a fresh
> crawl again.

Re: Site being crawled even when the URL is removed from seed.txt

Tejas Patil
This might be the reason: you are using gedit to edit the seeds file. It
creates a backup of the old version of the file when changes are made to
it. The backup file is hidden.

Check the contents of the urls directory using this command: ls -a urls
(to be executed from NUTCH_HOME; in your setup it is ~/nutch_new_setup)

This might give you:
.  ..  seed.txt  seed.txt~

seed.txt, the updated version, will have
http://localhost:8080/nutch-test-site/chi.html while the backup version,
seed.txt~, will have the sony.com and usc.edu URLs. The second file is a
hidden file.

Nutch scans the "urls" directory and reads *all* the files inside it, so both
files are picked up by Nutch and hence you see the old URLs too.
Delete the hidden file urls/seed.txt~ and try a fresh crawl.
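
The check and the cleanup can be tried safely with a sketch like the one below. It works in a throwaway directory created by mktemp rather than in the real NUTCH_HOME, and the backup file is simulated by hand; the variable names workdir and remaining are illustrative.

```shell
#!/bin/sh
# Reproduce the situation in a temporary directory (not the real NUTCH_HOME).
workdir=$(mktemp -d)
mkdir "$workdir/urls"

# Current seed file with the single intended URL.
printf '%s\n' "http://localhost:8080/nutch-test-site/chi.html" \
    > "$workdir/urls/seed.txt"

# Simulated editor backup that still holds the old seeds.
printf '%s\n' "http://viterbi.usc.edu/admission/" "http://service.sony.com.in/" \
    > "$workdir/urls/seed.txt~"

# List everything in the seed directory, including the backup file.
ls -a "$workdir/urls"

# Remove only files whose names end in "~", then record what is left.
rm -f "$workdir"/urls/*~
remaining=$(ls "$workdir/urls")
echo "remaining: $remaining"

# Clean up the temporary directory.
rm -rf "$workdir"
```

In the real setup, the equivalent cleanup would be something like rm urls/*~ run from NUTCH_HOME, after confirming with ls -a urls that only backup files match.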

Thanks,
 Tejas Patil

On Wed, Dec 26, 2012 at 8:54 PM, Rajani Maski <[hidden email]> wrote:

>  http://localhost:8080/nutch-test-site/chi.html
>

Re: Site being crawled even when the URL is removed from seed.txt

Rajinimaski
Hi Tejas, right, this was because of the backup files. Thank you very much
for the support.


On Thu, Dec 27, 2012 at 3:27 PM, Tejas Patil <[hidden email]> wrote:

> This might be the reason: You are using GEdit to edit the seeds file. It
> creates a backup of the old version of the file when changes are made to
> it. The backup file is hidden.