url file and crawl filter file - basic question (maybe)

4 messages

url file and crawl filter file - basic question (maybe)

Developer Developer
Hello Friends,

I want Nutch to crawl two hosts, www.oracle.com and www.ibm.com. I think my URL crawl filter is not set up correctly, because I see the message "No URLs to fetch - check your seed list and URL filters."

Here is how my URL seed file is set up:

http://www.oracle.com
http://www.ibm.com


and my crawl filter file is set up as follows:

# ....
+^http://www.oracle.com/*
+^http://www.ibm.com/*
# skip everything else
-.


Do you see anything wrong in these files?
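For reference, the rules as posted do accept both seed URLs: in a regex, `/*` means "zero or more slashes", so the missing trailing slash is not the problem. A quick sanity check of the first-match-wins evaluation (a sketch in Python; Nutch's filter is Java, but these patterns behave the same in both regex engines):

```python
import re

# The rules from the crawl filter file above, in order: (accept?, pattern).
rules = [
    (True,  r"^http://www.oracle.com/*"),
    (True,  r"^http://www.ibm.com/*"),
    (False, r"."),  # "skip everything else"
]

def filter_url(url):
    # Nutch's regex URL filter applies rules top to bottom; the first
    # pattern found in the URL decides: '+' accepts, '-' rejects.
    for accept, pattern in rules:
        if re.search(pattern, url):
            return accept
    return False  # no rule matched: reject by default

for seed in ("http://www.oracle.com", "http://www.ibm.com"):
    print(seed, "->", "+" if filter_url(seed) else "-")
# both seeds print "+": the rules accept them
```

Since the patterns accept both seeds, the cause is likely elsewhere, for example editing a different filter file than the one the command actually reads (if I remember correctly, in Nutch of this era the one-shot crawl command read conf/crawl-urlfilter.txt, while the step-by-step tools read conf/regex-urlfilter.txt).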

Re: url file and crawl filter file - basic question (maybe)

Developer Developer
no comments? :)

On Fri, Mar 28, 2008 at 12:42 PM, Developer Developer <
[hidden email]> wrote:

> [original question quoted above; trimmed]

Re: url file and crawl filter file - basic question (maybe)

Dingding Ye
I haven't seen any errors in what you described.

However, you can debug it yourself; I think it should be easy.


On Sat, Mar 29, 2008 at 3:12 AM, Developer Developer <[hidden email]>
wrote:

> [earlier messages quoted above; trimmed]

Re: url file and crawl filter file - basic question (maybe)

Otis Gospodnetic
In reply to this post by Developer Developer
I hate to do this, but here goes:

Please give volunteers at least 2-3 days to answer your question before reminding - it doesn't look nice.
Either my mail reader is lying or you sent your reminder email only 30 minutes after your original email.

Words like please and thank you also help. :)

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
From: Developer Developer <[hidden email]>
To: [hidden email]
Sent: Friday, March 28, 2008 3:12:15 PM
Subject: Re: url file and crawl filter file - basic question (maybe)

no comments? :)

[earlier messages quoted above; trimmed]