RSS search by nutch


RSS search by nutch

Dima Gritsenko
Hi,

Does nutch have a class for searching incoming RSS feeds in real time?
Thank you.
Dima.

Re: RSS search by nutch

chrismattmann
Hi there Dima,

  I'm not exactly sure what you mean by "real time", but there is an RSS
Parsing plugin in Nutch that can parse RSS feeds that Nutch encounters
during its crawl. You can enable parse-rss by opening up
$NUTCH_HOME/conf/nutch-site.xml, and searching for the property
"plugin.includes". For the value of "plugin.includes", ensure that there is
an entry for "parse-rss" somewhere in that property value.
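The value of "plugin.includes" is a regular expression over plugin ids, so one quick way to see whether a given value would pick up parse-rss is to test it against that id. Below is a minimal sketch, assuming the id has to match the whole expression; the class name PluginIncludesCheck and the sample values are made up for illustration and are not part of Nutch.

```java
import java.util.regex.Pattern;

class PluginIncludesCheck {
    // Returns true if the plugin id fully matches the plugin.includes regex.
    // Full-match semantics are an assumption about how Nutch applies the value.
    static boolean isIncluded(String pluginIncludes, String pluginId) {
        return Pattern.compile(pluginIncludes).matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        // Hypothetical property values, shortened for readability.
        String withoutRss = "protocol-http|urlfilter-regex|parse-(text|html)|index-basic";
        String withRss = "protocol-http|urlfilter-regex|parse-(text|html|rss)|index-basic";

        System.out.println("without: " + isIncluded(withoutRss, "parse-rss"));
        System.out.println("with:    " + isIncluded(withRss, "parse-rss"));
    }
}
```

Adding "rss" to an existing parse-(...) group, as in the second value, is one way to add the entry; a separate "|parse-rss" alternative would work just as well under the same assumption.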

HTH,
  Chris


On 8/28/06 10:44 AM, "Dima Gritsenko" <[hidden email]> wrote:

> Hi,
>
> Does nutch have a class for searching incoming RSS feeds in real time?
> Thank you.
> Dima.

______________________________________________
Chris A. Mattmann
[hidden email]
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group

_________________________________________________
Jet Propulsion Laboratory            Pasadena, CA
Office: 171-266B                        Mailstop:  171-246
_______________________________________________________

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.



RE: RSS search by nutch

HUYLEBROECK Jeremy RD-ILAB-SSF-2
In reply to this post by Dima Gritsenko

The Nutch Feed/RSS plugin (parse-rss) only lets you search the
entire channel/feed text, not individual items.
You'll have to develop your own plugin if that is what you are trying to do.
I also found that the feedparser library used by parse-rss doesn't read
all formats properly, so I have moved to the ROME library for now.

By "real time" I guess you mean aggregating, indexing, and adding the
content to the main index as fast as possible.
Nutch is batch oriented, so it doesn't allow this without heavy
modification, IMHO.

Maybe the way to go is to use buffer/intermediate indices and merge them
into the main index once in a while.
Once an intermediate index is created, it is added to a list of dynamic
searchers; once it has been merged into the main index, its searcher is
turned off. I am not sure this scales well, because new content keeps
arriving and merging takes a long time. Are there any architecture ideas
that someone can share?
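
To make the buffer idea concrete, here is a toy in-memory model of it. This is only a sketch: it uses plain maps in place of real Lucene indices, it is not thread-safe, and the names BufferedIndexSketch, addBatch, mergeBuffers, and search are hypothetical, not Nutch or Lucene APIs.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class BufferedIndexSketch {
    // term -> ids of documents containing it; stands in for the large main index
    private final Map<String, Set<String>> mainIndex = new HashMap<>();
    // small indices for recent batches; each one plays the role of a dynamic searcher
    private final List<Map<String, Set<String>>> buffers = new ArrayList<>();

    // Index a batch of new documents (doc id -> text) into a fresh small buffer.
    void addBatch(Map<String, String> docs) {
        Map<String, Set<String>> buffer = new HashMap<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            for (String term : doc.getValue().toLowerCase().split("\\W+")) {
                if (!term.isEmpty()) {
                    buffer.computeIfAbsent(term, t -> new LinkedHashSet<>()).add(doc.getKey());
                }
            }
        }
        buffers.add(buffer);
    }

    // Fold every buffer into the main index, then retire the buffers.
    void mergeBuffers() {
        for (Map<String, Set<String>> buffer : buffers) {
            for (Map.Entry<String, Set<String>> e : buffer.entrySet()) {
                mainIndex.computeIfAbsent(e.getKey(), t -> new LinkedHashSet<>())
                         .addAll(e.getValue());
            }
        }
        buffers.clear();
    }

    // Search the main index plus every buffer that has not been merged yet.
    Set<String> search(String term) {
        String key = term.toLowerCase();
        Set<String> hits = new LinkedHashSet<>(
                mainIndex.getOrDefault(key, Collections.emptySet()));
        for (Map<String, Set<String>> buffer : buffers) {
            hits.addAll(buffer.getOrDefault(key, Collections.emptySet()));
        }
        return hits;
    }

    // Number of buffers still being searched separately.
    int liveBuffers() {
        return buffers.size();
    }
}
```

New documents become searchable as soon as addBatch returns, and mergeBuffers can run on whatever schedule the merge cost allows. The sketch says nothing about the scaling concern itself: every live buffer adds work to each query, which is the trade-off against merging more often.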

For example, I wonder how Yahoo indexes emails in real time.
As soon as I send an email, it can be searched by keyword.


Cheers,
Jeremy.



Re: RSS search by nutch

chrismattmann
Hi Jeremy,

On 8/28/06 10:18 AM, "HUYLEBROECK Jeremy RD-ILAB-SSF"
<[hidden email]> wrote:

>
> The Nutch Feed/RSS plugin (parse-rss) only allows you to search the
> entire channel/feed text, not items individually.

This isn't entirely the case. parse-rss also indexes the item text (see
line 148 in RSSParser.java). In addition, parse-rss adds the individual
item links to the Outlinks (see lines 161 and 163 in RSSParser.java), so
they get crawled as well, along with the channel text (see line 123 in
RSSParser.java) and the channel outlink (see lines 130 and 132 in
RSSParser.java).

> You'll have to develop your own if it's what you are trying to do.
> I also found that the feedparse library used by parse-rss doesn't read
> properly all formats and I myself moved to the ROME library for now.

I haven't noticed any formats that commons-feedparser fails to handle.
Which formats have you seen it fail on?



Cheers,
  Chris



RE: RSS search by nutch

HUYLEBROECK Jeremy RD-ILAB-SSF-2
In reply to this post by Dima Gritsenko
 

> Actually, this isn't entirely the case. parse-rss actually indexes the
> item text (see line 148 in RSSParser.java) as well. Additionally,
> parse-rss adds the individual item links to the Outlinks (see lines 161
> and 163 in RSSParser.java), and they get crawled as well, in addition to
> the channel text (see line 123 in RSSParser.java) and channel outlink
> (see lines 130 and 132 in RSSParser.java).

Yep, maybe I wasn't clear enough. Sorry Chris ;)
RSSParser does read the items, and it lets you index their concatenated
text.
But the items are not returned individually, so they can't be indexed
individually right away.
However, parse-rss does return all the item links, so if you decide to
fetch and parse each item "link", you can then extract the item text or
do other parsing for each individual item page.
Sorry if I confused anyone.

I am personally focusing on RSS only, and I am trying to index as much as
I can directly from the RSS feed to avoid having to extract the item
text from the full HTML page. Of course, I then limit myself to whatever
is in the feed.
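
As a rough illustration of what "one record per item, straight from the feed" could look like, here is a sketch that uses only the JDK's DOM parser. It is not how parse-rss, commons-feedparser, or ROME work internally; the class RssItemSketch and its Item type are hypothetical, and it only handles plain RSS 2.0 <item> elements with <title>, <link>, and <description> children.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class RssItemSketch {
    // One indexable record per feed item (hypothetical type, not a Nutch class).
    static final class Item {
        final String title;
        final String link;
        final String text;

        Item(String title, String link, String text) {
            this.title = title;
            this.link = link;
            this.text = text;
        }
    }

    // Parse an RSS 2.0 document and return its items individually.
    static List<Item> parseItems(String rssXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Refuse DOCTYPE declarations so untrusted feeds cannot pull in external entities.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(rssXml)));

            NodeList nodes = doc.getElementsByTagName("item");
            List<Item> items = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element item = (Element) nodes.item(i);
                items.add(new Item(childText(item, "title"),
                                   childText(item, "link"),
                                   childText(item, "description")));
            }
            return items;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse feed", e);
        }
    }

    // Text of the first descendant with the given tag name, or "" if there is none.
    private static String childText(Element parent, String tag) {
        NodeList list = parent.getElementsByTagName(tag);
        return list.getLength() == 0 ? "" : list.item(0).getTextContent().trim();
    }
}
```

Each returned Item could then be turned into its own index document, with the item link as its URL, instead of concatenating all item text into the channel's document.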


> I haven't really noticed any formats not really handled by
> commons-feedparser. What formats have you noticed that it doesn't
> handle?

I think I had problems with Atom <content> in feeds like this one:
http://meetvinz.blogspot.com/atom.xml
and with RSS <content:encoded>, for instance in
http://feeds.feedburner.com/TechCrunch

Was it my mistake?
If it was, I'd love to go back to feedparser, as it is apparently faster
than ROME. ;)




Re: RSS search by nutch

Dima Gritsenko
In reply to this post by chrismattmann
Thank you, everybody, for all your replies on this.
We are trying RSS parsing with parse-rss enabled.

Dima.