Help to understand the crawl filter

Mario Méndez Villegas
Hello,

I've been doing some crawling to understand how the crawl filter works, but I
cannot figure it out. I followed the tutorial on the wiki, added the URLs I
want to crawl to a file called urls, and configured crawl-urlfilter.txt. When
I run the crawl, Nutch fetches sites that are not listed in either of these
files. Can anybody tell me why this happens?

Regards,
Mario

Re: Help to understand the crawl filter

Susam Pal
My response inline.

On 2/20/08, Mario Méndez Villegas <[hidden email]> wrote:
> Hello,
>
> I've been doing some crawling to understand how the crawl filter works but I
> cannot figure out, I followed the tutorial inside the wiki, and even I have
> added the urls I want to crawl to a file called urls and configured the

Did you place the urls file inside a directory? Here is an example of
a setup that works.

0. Let us assume your current directory is the Nutch project directory
(i.e. the directory which contains the bin directory).
1. In the current directory, create a directory called urls. (This is
the directory you pass as an argument to the bin/nutch crawl command.)
2. In the urls directory, create a file called url. (Any file name
will do.)
3. In the urls/url file, write the URLs with which you want to start
the crawl, one URL per line.
4. Start the crawl as: bin/nutch crawl urls -dir crawl -depth 3 -topN 50
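
Steps 0 to 4 can be sketched as a short shell session. This is only an illustration: http://example.com/ is a placeholder seed URL, and the check for bin/nutch is added here so the snippet fails gracefully when run outside the Nutch project directory.

```shell
# Run from the Nutch project directory (the one that contains bin/).
mkdir -p urls

# One seed URL per line; http://example.com/ is a placeholder.
printf '%s\n' 'http://example.com/' > urls/url

# Pass the directory (urls), not the file (urls/url), to the crawl command.
if [ -x bin/nutch ]; then
  bin/nutch crawl urls -dir crawl -depth 3 -topN 50
else
  echo "bin/nutch not found; run this from the Nutch project directory" >&2
fi
```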

> crawl-urlfilter.txt, when I run the crawl nutch fetchs sites that are not
> listed in any of these files, can anybody tell me why this happens?

Read the last paragraph of the comments at the top of
crawl-urlfilter.txt. It explains clearly how crawl-urlfilter.txt
works.
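
As a rough illustration of what those comments describe, and assuming the stock file's behaviour as I remember it (each rule is a regular expression prefixed with + or -, the first rule that matches decides, and a URL that matches no rule is ignored), the logic can be emulated in shell. The file name sample-urlfilter.txt, the domain example.com, and the helper url_allowed are all made up for this sketch; none of it is Nutch code, and grep -E is not an exact stand-in for Java regular expressions.

```shell
# Sample rules in the style of crawl-urlfilter.txt (example.com is a placeholder).
cat > sample-urlfilter.txt <<'EOF'
# skip file:, ftp:, and mailto: URLs
-^(file|ftp|mailto):
# accept hosts in example.com
+^http://([a-z0-9]*\.)*example\.com/
# skip everything else
-.
EOF

# Hypothetical helper: emulates "first matching rule wins" with grep -E.
# Prints "accept" or "reject" for the URL given as $1.
url_allowed() {
  while IFS= read -r rule; do
    case "$rule" in
      ''|'#'*) continue ;;                # ignore blank lines and comments
    esac
    sign=${rule%"${rule#?}"}              # first character: + or -
    regex=${rule#?}                       # rest of the line is the pattern
    if printf '%s\n' "$1" | grep -Eq -- "$regex"; then
      [ "$sign" = "+" ] && echo accept || echo reject
      return 0
    fi
  done < sample-urlfilter.txt
  echo reject                             # no rule matched
}

url_allowed 'http://www.example.com/index.html'   # prints: accept
url_allowed 'http://allbrightideas.com/'          # prints: reject
```

If the accept rule is left too broad, or the final "-." line is missing or replaced by "+.", hosts outside your seed list will pass the filter once the crawler follows outlinks to them.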

Regards,
Susam Pal

Indexer return null

Duan, Nick
After running the Nutch indexer, I got the following output:

Indexer: starting
Indexer: linkdb: crawl/linkdb
Indexer: adding segment: crawl/segments/20080220171109
 Indexing [http://37869.rapidforum.com/] with analyzer org.apache.nutch.analysis.NutchDocumentAnalyzer@a17083 (null)
 Indexing [http://allbrightideas.com/] with analyzer org.apache.nutch.analysis.NutchDocumentAnalyzer@a17083 (null)
...


The search didn't return any results. I guess this is because the indexer
didn't work (judging by the "(null)" in its output). Any suggestions on what
I did wrong?

Thanks a lot!

ND