How can Nutch crawl by specific words rather than by specific URLs, then extract structured data and store it in HBase?


Muhammad UMER
Hi All,

I am new to Apache Nutch. I want to use it to crawl some sites, then filter and extract content based on words rather than on URLs. For example:


  1.  I need to crawl only those sites whose content (text) contains words like 'shop' or 'product'. If these words do not appear on a page, Nutch should not crawl the links on that page and should skip the page instead of parsing it further.
  2.  Apache Nutch interacts directly with HBase and dumps the whole HTML source of each web page, but I want structured data (JSON format, e.g. text, URL, metadata) instead of unstructured data (the whole page source).
  3.  Apache Nutch then sends this data to Solr, where it is indexed and structured, but I want to show the data on my own web page instead of the Solr web page. How can I get this data in a structured format, categorized by the words I provide to Nutch?
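
To make points 1 and 2 concrete, here is a minimal sketch of the behavior I have in mind. It is plain Python and does not use any Nutch, HBase, or Solr API; the names `KEYWORDS`, `matched_keywords`, and `process_page` are hypothetical and only illustrate the logic. It assumes a simple case-insensitive substring match, so 'workshop' would also count as a match for 'shop'.

```python
import json

# Hypothetical keyword list; the words come from the example in point 1.
KEYWORDS = ("shop", "product")


def matched_keywords(text, keywords=KEYWORDS):
    """Return the keywords found in text (case-insensitive substring match)."""
    lowered = text.lower()
    return [word for word in keywords if word.lower() in lowered]


def process_page(url, text, outlinks, metadata=None):
    """Apply the keyword gate to one fetched page.

    Returns (record, links_to_follow). When no keyword occurs in the page
    text, record is None and no outlinks are followed.
    """
    hits = matched_keywords(text)
    if not hits:
        return None, []
    # Structured record instead of raw page source (point 2).
    record = {
        "url": url,
        "text": text,
        "metadata": metadata or {},
        "keywords": hits,  # lets the data be categorized by keyword (point 3)
    }
    return record, list(outlinks)


if __name__ == "__main__":
    record, links = process_page(
        "http://example.com/",
        "Visit our shop for every product.",
        ["http://example.com/catalog"],
    )
    print(json.dumps(record, indent=2))
    print(links)
```

In Nutch itself, I assume this kind of check would have to live in a custom plugin, but I do not know which extension point is the right one.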

That is what I want to achieve; any help would be appreciated.

Regards
Muhammad umer