Customize Crawling..

Customize Crawling..

Volkan Ebil
Hi,

I am a new Nutch user, and my problem is customizing the crawl process. My
aim is to detect and crawl web sites written in my language. I want to crawl
only the sites that contain special characters like "ğ" or "ç". I also want
to restrict the crawl to URLs that end with particular extensions like
"com.uk" and skip the others. How can I apply these restrictions? Where
should I make changes in the inject, generate, fetch, and parse steps?

Thanks.


RE: Customize Crawling..

kishore.krishna2
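Hi,
I don't know about the special character part, but you can limit the URLs using the URL filter file under conf/ (conf/crawl-urlfilter.txt for the one-step crawl command, conf/regex-urlfilter.txt for the individual tools).
Thanks,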
kishore
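
A rough sketch of what such a URL rule could look like, written as plain Java so the regular expression can be tried outside Nutch. The class name UrlSuffixFilter, the method accepts, and the pattern itself are assumptions made for illustration; only the "com.uk" suffix comes from the question. The comment shows how the same expression might be placed in the filter file, where "+" lines accept, "-" lines reject, and the first matching line decides.

```java
import java.util.regex.Pattern;

public class UrlSuffixFilter {
    // Hypothetical rule: accept http(s) URLs whose host name ends in ".com.uk".
    // In the Nutch regex URL filter file the same idea would read roughly:
    //   +^https?://([a-z0-9-]+\.)+com\.uk(:[0-9]+)?(/|$)
    //   -.
    private static final Pattern ACCEPT = Pattern.compile(
            "^https?://([a-z0-9-]+\\.)+com\\.uk(:[0-9]+)?(/|$)",
            Pattern.CASE_INSENSITIVE);

    // Returns true when the URL's host ends with the wanted suffix.
    public static boolean accepts(String url) {
        return ACCEPT.matcher(url).find();
    }

    public static void main(String[] args) {
        System.out.println(accepts("http://www.example.com.uk/index.html"));
        System.out.println(accepts("http://www.example.com/index.html"));
    }
}
```

The trailing (/|$) anchors the suffix to the end of the host name, so a host that merely contains "com.uk" somewhere in the middle is not accepted.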


RE: Customize Crawling..

Volkan Ebil
The URL filter will solve the URL limitation problem, thanks. Does anyone
know how I can add an "if" check to the crawl process that allows only the
sites that contain special characters like "ç", "ü", or "ğ"? Should I study
the parse algorithm?
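
The "if" check itself could be sketched in plain Java as below. This is only the character test, not a Nutch plugin: where it would be called from (for example, on the text extracted at the parse step) is left open, and the names TurkishCharCheck and containsMarker are hypothetical. The marker set is limited to the three characters named in the post, written as Unicode escapes so the source file's encoding does not matter. Since "ç" and "ü" also occur in other languages, this is at best a crude signal.

```java
public class TurkishCharCheck {
    // Marker characters taken from the post: c-cedilla, u-umlaut, soft g
    // (lowercase only; an assumption made to keep the sketch short).
    private static final String MARKERS = "\u00e7\u00fc\u011f";

    // Returns true if the text contains at least one marker character.
    public static boolean containsMarker(String text) {
        for (int i = 0; i < text.length(); i++) {
            if (MARKERS.indexOf(text.charAt(i)) >= 0) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(containsMarker("da\u011f"));
        System.out.println(containsMarker("plain ascii text"));
    }
}
```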


Re: Customize Crawling..

Manoj Bist
I came across a languageidentifier plugin at PluginCentral while trying to
figure out something else. *Maybe* this could be a starting point for you.

http://wiki.apache.org/nutch/PluginCentral
