Basic Usage Questions

Basic Usage Questions

Paul Stewart-5
Hi folks...

I've read through various tutorials but wanted to clarify my usage of
Nutch and confirm I'm doing things the right way..;)

I have a directory called "urls", and inside that directory I currently
have one file called "sites".  My goal is to put the list of websites I
want to index into the "sites" file, adding and changing entries as
time goes on.  Is there a benefit to using more than one file, or must
I use more files in the future?
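
For reference, the "sites" seed file is just plain text with one URL
per line; a minimal sketch (using the same made-up domains as the
filters below) would be:

http://www.domain.com/
http://www.anotherdomain.com/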

Then, in the conf/crawl-urlfilter.txt file I expanded upon each entry
from the "sites" file to permit subdomains etc:

+^http://([a-z0-9]*\.)*domain.com/
+^http://([a-z0-9]*\.)*anotherdomain.com/
-.

I believe I understand this so far...;)

Then according to
"http://peterpuwang.googlepages.com/NutchGuideForDummies.htm" I run:

bin/nutch crawl urls -dir crawl -depth 3 -topN 50
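
As I understand those options (worth double-checking against the usage
output of bin/nutch crawl, so treat the notes below as my reading of
them):

# urls     - directory holding the seed list file(s)
# -dir     - where the crawldb, linkdb, segments and index get written
# -depth 3 - number of generate/fetch/update rounds to run
# -topN 50 - maximum number of URLs fetched in each round
bin/nutch crawl urls -dir crawl -depth 3 -topN 50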

So far so good... I've run the crawl a few times, and after restarting
Tomcat I can search my results ... very good!

Now, some questions:

When the crawl above finishes, it looks like it automatically does the
link inversion, deduplication and the other steps I'm starting to
understand.  Since I'm not done crawling yet, I decided to run the
command again and got this:

[root@mail nutch]# bin/nutch crawl urls -dir crawl -depth 3 -topN 50
Exception in thread "main" java.lang.RuntimeException: crawl already
exists.
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:85)

Obviously this tells me I should read the manual a little more - but
what I want to do is continue crawling where I left off.  How?

A couple more questions:

Am I right that, by default, links expire after 30 days and get
re-fetched during the crawl process?  Just confirming.
Also, can I simply add/remove website addresses in my "sites" file and
the rest will be taken care of?

Thanks folks... I come from the Mnogosearch and Aspseek worlds, so I'm
still getting used to Nutch...;)

Paul





Re: Basic Usage Questions

Susam Pal
Please find my response inline.

On Jan 31, 2008 9:33 PM, Paul Stewart <[hidden email]> wrote:

> Hi folks...
>
> I've read through various tutorials but wanted to clarify my usage of
> Nutch and confirm I'm doing things the right way..;)
>
> I have a directory called "urls", and inside that directory I currently
> have one file called "sites".  My goal is to put the list of websites I
> want to index into the "sites" file, adding and changing entries as
> time goes on.  Is there a benefit to using more than one file, or must
> I use more files in the future?

I don't think it makes any significant difference. I put all the URLs
in one file.

>
> Then, in the conf/crawl-urlfilter.txt file I expanded upon each entry
> from the "sites" file to permit subdomains etc:
>
> +^http://([a-z0-9]*\.)*domain.com/
> +^http://([a-z0-9]*\.)*anotherdomain.com/
> -.
>
> I believe I understand this so far...;)

If you do not want to crawl any domain other than the ones you allow
in this file, this configuration is fine.
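
One small refinement, if you want to be strict about it: the unescaped
dots in those patterns match any character, so you could escape them.
Something like this (same idea, just a slightly tighter match):

+^http://([a-z0-9]*\.)*domain\.com/
+^http://([a-z0-9]*\.)*anotherdomain\.com/
-.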

>
> Then according to
> "http://peterpuwang.googlepages.com/NutchGuideForDummies.htm" I run:
>
> bin/nutch crawl urls -dir crawl -depth 3 -topN 50
>
> So far so good... I've run the crawl a few times, and after restarting
> Tomcat I can search my results ... very good!
>
> Now, some questions:
>
> When the crawl above finishes, it looks like it automatically does the
> link inversion, deduplication and the other steps I'm starting to
> understand.  Since I'm not done crawling yet, I decided to run the
> command again and got this:
>
> [root@mail nutch]# bin/nutch crawl urls -dir crawl -depth 3 -topN 50
> Exception in thread "main" java.lang.RuntimeException: crawl already
> exists.
>         at org.apache.nutch.crawl.Crawl.main(Crawl.java:85)
>
> Obviously this tells me I should read the manual a little more - but
> what I want to do is continue crawling where I left off.  How?

Currently, this cannot be done with the bin/nutch crawl command.
Crawl.java checks whether the crawl directory exists before beginning
the crawl.  I don't know why this is necessary, but that's how it is
currently.  For repeated crawls, you can try this script:

http://wiki.apache.org/nutch/Crawl
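
If you prefer to run the steps by hand instead of using the script, one
extra round against an existing crawl directory looks roughly like this
(a simplified sketch of what the script does; segment handling and
paths may need adjusting for your setup):

# inject any new seeds, then run one more generate/fetch/update round
bin/nutch inject crawl/crawldb urls
bin/nutch generate crawl/crawldb crawl/segments -topN 50
segment=`ls -d crawl/segments/* | tail -1`   # segment just generated
bin/nutch fetch $segment
bin/nutch updatedb crawl/crawldb $segment    # fold results back into the crawldb
bin/nutch invertlinks crawl/linkdb -dir crawl/segments
# the script linked above also rebuilds the search index afterwards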

>
> A couple more questions:
>
> Am I right that, by default, links expire after 30 days and get
> re-fetched during the crawl process?  Just confirming.
> Also, can I simply add/remove website addresses in my "sites" file and
> the rest will be taken care of?

If you are doing a recrawl on the same crawl directory, you would
probably be using the script given above.  In that case, removing
website addresses from the "sites" file wouldn't help, because those
URLs would already be present in the crawldb.  However, adding new
website addresses would inject the new addresses into the crawldb, and
they would be crawled in the next fetch cycle.
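
If you want to see what is actually sitting in the crawldb at any
point, the readdb tool is useful; for example (paths assume the layout
created by your crawl command, and the exact options may vary a little
between Nutch versions):

bin/nutch readdb crawl/crawldb -stats                      # URL counts by status
bin/nutch readdb crawl/crawldb -url http://www.domain.com/ # details for one URL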

Regards,
Susam Pal
RE: Basic Usage Questions

Paul Stewart-5
Thanks very much for the reply...

Regarding sites already in the crawldb, is there a manual way to remove
them, or is it common practice to remove the entire crawldb directory
and start over?  I could be into millions of addresses by the time I'm
done, so I'm just looking for best practices ;)

Take care,

Paul


>
> [root@mail nutch]# bin/nutch crawl urls -dir crawl -depth 3 -topN 50
> Exception in thread "main" java.lang.RuntimeException: crawl already
> exists.
>         at org.apache.nutch.crawl.Crawl.main(Crawl.java:85)
>
> Obviously this tells me I should read the manual a little more - but
> what I want to do is continue crawling where I left off.  How?

Currently, this cannot be done with the bin/nutch crawl command.
Crawl.java checks whether the crawl directory exists before beginning
the crawl.  I don't know why this is necessary, but that's how it is
currently.  For repeated crawls, you can try this script:

http://wiki.apache.org/nutch/Crawl



