crawl xml url using nutch-0.9

crawl xml url using nutch-0.9

Chetan Patel
Hi All,

I tried to crawl an XML URL (http://sports.yahoo.com/nfl/rss.xml) with depth 2.

But it crawls only the root URL.

Please help me crawl the root URL as well as all the sub-URLs of the root URL.

Thanks in advance.

Regards,
Chetan Patel

RE: crawl xml url using nutch-0.9

Edward Quick

Chetan,

Try adding parse-rss in nutch-site.xml. Here's mine:

<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|parse-(text|html|msexcel|msword|mspowerpoint|pdf|zip|swf|rss)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description></description>
</property>


Ed.


> Date: Sat, 27 Sep 2008 01:30:43 -0700
> From: [hidden email]
> To: [hidden email]
> Subject: crawl xml url using nutch-0.9
>
>
> Hi All,
>
> I have tried to crawl xml url (http://sports.yahoo.com/nfl/rss.xml) using
> depth 2.
>
> But it will crawl only root url.
>
> Please help me how to crawl root url as well as all sub url of root url.
>
> Thanks in advance.
>
> Regads,
> Chetan Patel
> --
> View this message in context: http://www.nabble.com/crawl-xml-url-using-nutch-0.9-tp19700770p19700770.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>


RE: crawl xml url using nutch-0.9

Chetan Patel
Hi,

Thanks for the help.

I have already added this to plugin.includes, and I am still getting only the root URL.

Regards,
Chetan Patel


RE: crawl xml url using nutch-0.9

Chetan Patel
Hi,

I got the following message in the log file while crawling the XML URL:

2008-09-27 16:06:20,920 WARN  parse.ParserFactory - ParserFactory:Plugin: org.apache.nutch.parse.rss.RSSParser mapped to contentType text/xml via parse-plugins.xml, but its plugin.xml file does not claim to support contentType: text/xml

Please help me if you have any idea.

-Chetan



RE: crawl xml url using nutch-0.9

Edward Quick


>
>
> Hi,
>
> I have got following message from log file while crawling xml url.
>
> 2008-09-27 16:06:20,920 WARN  parse.ParserFactory - ParserFactory:Plugin:
> org.apache.nutch.parse.rss.RSSParser mapped to contentType text/xml via
> parse-plugins.xml, but its plugin.xml file does not claim to support
> contentType: text/xml
>
> Please help me if you have any idea.

Possibly a problem with the content type. For RSS files, I think the content type is supposed to be application/rss+xml.
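
The content type a parser claims is declared in that plugin's own plugin.xml, which is the file the warning refers to. Below is a rough sketch of the relevant part of the parse-rss descriptor. It assumes the `<parameter name="contentType" .../>` layout, and the `id` and `name` values on `<extension>` are placeholders, so treat the plugin.xml shipped in your own Nutch tree as authoritative rather than this fragment.

```xml
<!-- Sketch only: element layout and attribute values are assumptions. -->
<extension id="org.apache.nutch.parse.rss"
           name="RssParse"
           point="org.apache.nutch.parse.Parser">
  <implementation id="org.apache.nutch.parse.rss.RSSParser"
                  class="org.apache.nutch.parse.rss.RSSParser">
    <!-- The only content type this parser claims; text/xml is not listed,
         which would be consistent with the ParserFactory warning -->
    <parameter name="contentType" value="application/rss+xml"/>
  </implementation>
</extension>
```

If the declaration looks like this, a feed served as text/xml reaches the RSS parser only through the parse-plugins.xml mapping, which is when ParserFactory logs the mismatch.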



RE: crawl xml url using nutch-0.9

Webmaster-271
Hi there,

I'm new to the list and am in the process of setting up a Nutch server
cluster using Hadoop.

Are there any stable versions after 0.9?

I've tried the trunk and a number of the nightly builds, and they seem to be
kind of buggy.

0.9 will do the job, but something a bit more cutting-edge might be better.

Thanks..

Axel


Re: crawl xml url using nutch-0.9

David Grandinetti
In reply to this post by Chetan Patel
I've been using Nutch to crawl a lot of news feeds, and I had to modify
my parse-plugins.xml file to handle a bunch of MIME types. Not many sites
follow the spec on which MIME type to use.

My parse-plugins.xml file has mime-type mappings for all of these:

text/html
text/plain
text/rss
text/xml
application/xml
application/rss+xml
application/atom+xml
application/xhtml+xml
application/octet-stream
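
For readers unsure what such mappings look like, here is a minimal sketch of conf/parse-plugins.xml entries for three of the types listed above. It assumes the stock layout, in which each `<mimeType>` element lists plugin ids in order of preference and the `<aliases>` section ties each plugin id to an extension id. The choice and order of plugins is illustrative only and is not the poster's actual file.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: mappings and ordering are assumptions, not the poster's real file. -->
<parse-plugins>

  <!-- Feeds served with the conventional RSS type -->
  <mimeType name="application/rss+xml">
    <plugin id="parse-rss" />
  </mimeType>

  <!-- Many sites serve feeds as generic XML, so try the RSS parser first -->
  <mimeType name="text/xml">
    <plugin id="parse-rss" />
    <plugin id="parse-text" />
  </mimeType>

  <mimeType name="application/xml">
    <plugin id="parse-rss" />
    <plugin id="parse-text" />
  </mimeType>

  <!-- Ties each plugin id used above to its extension id;
       the RSS one is the class named in the log warning -->
  <aliases>
    <alias name="parse-rss"
           extension-id="org.apache.nutch.parse.rss.RSSParser" />
    <alias name="parse-text"
           extension-id="org.apache.nutch.parse.text.TextParser" />
  </aliases>

</parse-plugins>
```

In practice these entries would be merged into the existing parse-plugins.xml rather than replacing it, and every plugin referenced here also has to be enabled through plugin.includes.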

-dave

On Sep 27, 2008, at 6:44 AM, Chetan Patel wrote:

>
> Hi,
>
> I have got following message from log file while crawling xml url.
>
> 2008-09-27 16:06:20,920 WARN  parse.ParserFactory -  
> ParserFactory:Plugin:
> org.apache.nutch.parse.rss.RSSParser mapped to contentType text/xml  
> via
> parse-plugins.xml, but its plugin.xml file does not claim to support
> contentType: text/xml
>
> Please help me if you have any idea.
>
> -Chetan

Stable versions

Webmaster-271
In reply to this post by Webmaster-271
Hi there,

Sorry if this is repetitive, but I forgot to change the subject when I posted
it.

I'm new to the list and am in the process of setting up a Nutch server
cluster using Hadoop.

Are there any stable versions after 0.9?

I've tried the trunk and a number of the nightly builds, and they seem to be
kind of buggy.

0.9 will do the job, but something a bit more cutting-edge might be better.

Thanks..

Axel



Re: crawl xml url using nutch-0.9

Chetan Patel
In reply to this post by David Grandinetti
Hi David,

Thanks for the solution.

Now I am able to crawl XML URLs using nutch-0.9.

But I still have a problem with this URL: http://sports.yahoo.com/nfl/rss.xml

When I crawl the above URL with depth 2, it returns only the root URL. It does not return the sub-URLs of the root URL.

Please help me if you have any idea.

-Chetan Patel