Crawling SLASHDOT.ORG


Crawling SLASHDOT.ORG

kranthi reddy
Hi,

     I am new to Nutch and have been trying to crawl "slashdot.org", but for
some unknown reason I am unable to crawl the site.
     I can crawl other sites (bbc, ndtv, cricbuzz, etc.) without any problem, but
when I try to crawl "slashdot.org" I get the following error:

        "Generator: jobtracker is 'local', generating exactly one partition.
         Generator: 0 records selected for fetching, exiting ...
          Stopping at depth=1 - no more URLs to fetch."

    Can someone please help me out?


Thank you in advance

 Kranthi Reddy. B

RE: Crawling SLASHDOT.ORG

Howie Wang

What does your crawl-urlfilter.txt or regex-urlfilter.txt look like?

Howie


> Date: Wed, 25 Jun 2008 23:00:12 +0530
> From: [hidden email]
> To: [hidden email]
> Subject: Crawling SLASHDOT.ORG
>
> Hi,
>
>      I am new to nutch . I have been trying to crawl "slashdot.org" . But
> due to some unknown problems i am unable to crawl the site.
>      I am able to crawl any other site site (bbc,ndtv,cricbuzz etc)... but
> when i try to crawl "slashdot.org" i get the following error ...
>
>         "Generator: jobtracker is 'local', generating exactly one partition.
>          Generator: 0 records selected for fetching, exiting ...
>           Stopping at depth=1 - no more URLs to fetch."
>
>     Can some one please help me out.
>
>
> Thank you in advance
>
>  Kranthi Reddy. B

_________________________________________________________________
Need to know now? Get instant answers with Windows Live Messenger.
http://www.windowslive.com/messenger/connect_your_way.html?ocid=TXT_TAGLM_WL_Refresh_messenger_062008
Reply | Threaded
Open this post in threaded view
|

Re: Crawling SLASHDOT.ORG

kranthi reddy
Hi Howie,

My crawl-urlfilter.txt looks like this:

  # The url filter file used by the crawl command.

# Better for intranet crawling.
# Be sure to change MY.DOMAIN.NAME to your domain name.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'.  The first matching pattern in the file
# determines whether a URL is included or ignored.  If no pattern
# matches, the URL is ignored.

# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
+[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/.+?)/.*?\1/.*?\1/

# accept hosts in MY.DOMAIN.NAME
#+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

# skip everything else
+.*



and my regex-urlfilter.txt looks like this:

# The default url filter.
# Better for whole-internet crawling.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'.  The first matching pattern in the file
# determines whether a URL is included or ignored.  If no pattern
# matches, the URL is ignored.

# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/.+?)/.*?\1/.*?\1/

# accept anything else
+.



RE: Crawling SLASHDOT.ORG

Howie Wang

Looks like your URL filters are OK. I was able to crawl to depth 2
on slashdot.

Have you turned on logging and looked for more clues in the logs?
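
In a local run the crawl output normally ends up in logs/hadoop.log (assuming
the stock log4j setup). You can also check whether the seed URL even made it
into the crawldb; a minimal check, assuming the crawl directory is named
"crawl", would be:

    bin/nutch readdb crawl/crawldb -stats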

Howie



Re: Crawling SLASHDOT.ORG

kranthi reddy
The end of the log file looks like this:

      Dedup: adding indexes in: crawl/indexes
      Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
        at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:135)


RE: Crawling SLASHDOT.ORG

Howie Wang

That looks like a bug later on, when trying to delete duplicate URLs -
probably because there are no URLs to de-duplicate. I'm guessing it's not
the root problem. There's probably an error earlier: the first URL isn't
being fetched or parsed, so no further URLs can be discovered after that.

Maybe turn on debug logging in conf/log4j.properties, and look for problems
with the first fetched URL.

http://www.mail-archive.com/nutch-general@.../msg07188.html
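
A rough sketch of that change, assuming the stock conf/log4j.properties that
ships with Nutch (where output goes to logs/hadoop.log); the exact logger and
appender names may differ in your build:

    # raise Nutch's own classes (Injector, Generator, Fetcher, parsers) to DEBUG
    log4j.logger.org.apache.nutch=DEBUG
    # keep Hadoop internals a little quieter
    log4j.logger.org.apache.hadoop=WARN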

Howie



Re: Crawling SLASHDOT.ORG

kranthi reddy
Thank you, Howie.

I am able to crawl slashdot.org now. After looking into the log files, I
found that "fetcher.max.crawl.delay" defaults to 30, while the Crawl-Delay
in slashdot.org/robots.txt is 100.

Since the default value was lower than the delay requested by the site,
Nutch was unable to crawl it. After changing "fetcher.max.crawl.delay" to
150, I am able to crawl successfully.
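
For reference, a minimal sketch of that override, assuming it is placed in
conf/nutch-site.xml (which takes precedence over nutch-default.xml):

    <configuration>
      <property>
        <name>fetcher.max.crawl.delay</name>
        <!-- skip a page if its robots.txt requests a Crawl-Delay longer than this many seconds -->
        <value>150</value>
      </property>
    </configuration>

With 150 greater than the 100 seconds requested by slashdot.org's robots.txt,
the fetcher honors the delay instead of skipping the page.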

Thank you once again.
Kranthi Reddy.B
