RSSParser

RSSParser

Carli Collins
Hello,

 

I would like to use Nutch to search RSS documents from my Struts web app. I
see that there is a class called RSSParser; however, the JARs in the nightly
builds do not seem to include this class. Should I be using something else? Is
Nutch not the right tool for me?

 

Thanks

 


Re: RSSParser

chrismattmann
Hi Carli,

  The RSSParser class has been part of the trunk since 0.7, but didn't get
released with those official releases. Right now you can get the RSSParser
capability by downloading 0.8-dev, which is available from the nutch trunk.
Point your favorite web browser to:

http://lucene.apache.org/nutch/version_control.html

Then download the latest trunk and you should be all set. To use the
RSSParser, make sure to enable the "parse-rss" plugin, which you do by
setting the property "plugin.includes" in nutch-default.xml or
nutch-site.xml. Make sure it looks something like:

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(rss|text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins.
  </description>
</property>

If you notice that the RSSParser isn't getting called for your particular
flavor of RSS feeds, you can edit the parse-plugins.xml file in
$NUTCH_HOME/conf/ and map the mimeType of the RSS files returned by the
servers you're crawling to the parse-rss plugin. Because web servers are
inconsistent about the content types they return for RSS, you might have to
experiment to make sure that the RSS parser gets called for the RSS content
types that you want to parse.
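
For illustration, a mapping of that kind might look roughly like the
fragment below. This is a sketch, not a copy of the shipped file: the two
content types are assumptions about what a server might send for a feed, so
substitute whatever your servers actually return, and check the
parse-plugins.xml in your own checkout for the exact layout it expects.

```xml
<!-- Sketch of entries for $NUTCH_HOME/conf/parse-plugins.xml. -->
<!-- The content types below are assumed examples, not a complete list. -->
<mimeType name="application/rss+xml">
  <plugin id="parse-rss" />
</mimeType>

<!-- Assumption: some servers label feeds as generic XML. -->
<mimeType name="text/xml">
  <plugin id="parse-rss" />
</mimeType>
```

The plugin id used here has to be one that "plugin.includes" enables,
otherwise the mapping points at a plugin that is never loaded.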

Hope that helps!

Cheers,
  Chris



On 6/15/06 10:50 AM, "Carli Collins" <[hidden email]> wrote:

>  
>
> Hello,
>
>  
>
> I would like to use Nutch to search RSS documents from my Struts web app. I
> see that there is a class called RSSParser, however the JARS in the nightly
> builds do not seem to have this class. Should I be using something else?  Is
> Nutch not the right tool for me?
>
>  
>
> Thanks
>
>  
>

______________________________________________
Chris A. Mattmann
[hidden email]
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group

_________________________________________________
Jet Propulsion Laboratory            Pasadena, CA
Office: 171-266B                        Mailstop:  171-246
_______________________________________________________

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.



Re: RSSParser

Sami Siren-2

>Point your favorite web browser to:
>
>http://lucene.apache.org/nutch/version_control.html
>
>And then d/l the latest trunk and you should be all set. To use the
>  
>
There are also nightly builds available, see
http://lucene.apache.org/nutch/nightly.html

--
 Sami Siren

RE: RSSParser

Carli Collins
I tried to use the nightly builds but I was unable to compile the code.

-----Original Message-----
From: Sami Siren [mailto:[hidden email]]
Sent: Thursday, June 15, 2006 2:25 PM
To: [hidden email]
Subject: Re: RSSParser


>Point your favorite web browser to:
>
>http://lucene.apache.org/nutch/version_control.html
>
>And then d/l the latest trunk and you should be all set. To use the
>  
>
There are also nightly builds available, see
http://lucene.apache.org/nutch/nightly.html

--
 Sami Siren


Re: RSSParser

Sami Siren-2
Carli Collins wrote:

>I tried to use the nightly builds but I was unable to compile the code.
>
>  
>
There is no need to compile the code unless you want to extend it (in
which case I recommend that you check out a version from the svn
repository). Just use it ;)

--
 Sami Siren