Nutch - Filtering (REGEX)


simon_ece
Hi all,
I am new to Nutch. I would like to crawl a particular site and keep only the URLs that match the following pattern. I don't want any other URLs from the crawled site to be listed.

Site to be crawled, e.g.: www.example.com
^http://([a-z0-9]*\.)example.com/([a-zA-Z]*)-\([a-z0-9]*\)-.*-\([0-9]*-[A-Za-z0-9]*\)\.html$
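To make it concrete what this pattern accepts, here is a minimal sketch that runs it through java.util.regex, the engine Nutch's regex URL filter is built on as far as I know. The class name UrlPatternCheck, the accepts helper, and the sample URLs are all made up for illustration. Note that in Java regex syntax \( and \) match literal parentheses, so the pattern as written only accepts URLs that actually contain "(" and ")".

```java
import java.util.regex.Pattern;

// Hypothetical helper class, not part of Nutch.
class UrlPatternCheck {
    // The pattern from the post, unchanged; backslashes are doubled
    // only because this is a Java string literal.
    static final Pattern WANTED = Pattern.compile(
        "^http://([a-z0-9]*\\.)example.com/([a-zA-Z]*)-\\([a-z0-9]*\\)-.*-\\([0-9]*-[A-Za-z0-9]*\\)\\.html$");

    // True if the URL matches the pattern. find() is used here; because the
    // pattern is anchored with ^ and $, it behaves like a full match.
    static boolean accepts(String url) {
        return WANTED.matcher(url).find();
    }

    public static void main(String[] args) {
        // Made-up sample URLs, assumed for illustration only.
        String[] samples = {
            "http://www.example.com/news-(abc1)-some-title-(123-xyz).html",
            "http://www.example.com/about.html",
            "http://example.com/news-(abc1)-some-title-(123-xyz).html"
        };
        for (String url : samples) {
            // Print "+" for accepted and "-" for rejected URLs.
            System.out.println((accepts(url) ? "+ " : "- ") + url);
        }
    }
}
```

Only the first sample is accepted. The third is rejected because ([a-z0-9]*\.) requires a dot before "example", so the bare domain without a subdomain never matches. In Nutch itself, a rule in conf/regex-urlfilter.txt is a line starting with + (accept) or - (reject) followed by the regex, and the first rule that matches a URL decides.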

I can crawl the site and I am getting all the matching URLs,
but I don't know how to filter out the other URLs and keep only these particular ones.
Kindly post your suggestions.
Thanks & Regards
Simon