RSS fetcher and indexing individual items: how can I realize this function?


RSS fetcher and indexing individual items: how can I realize this function?

吴志敏
Hi folks,

What I want to do is to separate an RSS file into several pages.

Just as has been discussed before, I want to fetch an RSS page and index it as several different documents in the index, so that the searcher can find each item's info as an individual hit.

My idea is to create a protocol for fetching the RSS page and storing it as several pages, each containing just one ITEM tag. But the unique key for a document is its URL, so how can I store them with the ITEM's link tag as the unique key for a document?

So my question is: how do I realize this function in Nutch 0.8.x?

I've checked the code of the protocol-http plugin, but I can't find the code where a page is stored as a document. I want to separate the RSS page into several pages before it is stored, so that it becomes several documents instead of one.

Can anyone give me some hints?

Any reply will be appreciated!

The ITEM's structure:

<item>
    <title>Late snowstorm strikes Europe, causing flight delays and traffic chaos (photos)</title>
    <description>A snowstorm swept across Europe, causing repeated flight delays. On January 24, several airliners waited at Stuttgart airport in Germany to have ice and snow removed from their fuselages, and workers cleared snow from the runways at Munich airport in southern Germany. According to reports, the late-arriving snowstorm swept for two consecutive days across...</description>
    <link>http://news.sohu.com/20070125/n247833568.shtml</link>
    <category>Sohu Focus Photo News</category>
    <author>[hidden email]</author>
    <pubDate>Thu, 25 Jan 2007 11:29:11 +0800</pubDate>
    <comments>http://comment.news.sohu.com/comment/topic.jsp?id=247833847</comments>
</item>


Re: RSS fetcher and indexing individual items: how can I realize this function?

chrismattmann
Hi there,

 I could most likely be of assistance if you gave me some more information.
For instance, I'm wondering if the use case you describe below is already
supported by the current RSS parse plugin.

 The current RSS parser, parse-rss, does in fact index individual items that
are pointed to by an RSS document. The items are added as Nutch Outlinks
and added to the overall queue of URLs to fetch. Doesn't this satisfy what
you mention below? Or am I missing something?
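A minimal sketch of the item-link extraction step described above, using only the JDK's DOM parser. This is a hypothetical simplification for illustration, not the actual parse-rss code (which is built on commons-feedparser):

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical simplification of the outlink step: collect each item's
// <link> so the crawler can queue those URLs for a later fetch round.
public class RssOutlinkExtractor {
    public static List<String> extractItemLinks(String rssXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(rssXml.getBytes(StandardCharsets.UTF_8)));
            NodeList links = doc.getElementsByTagName("link");
            List<String> outlinks = new ArrayList<>();
            for (int i = 0; i < links.getLength(); i++) {
                // Only <link> elements nested inside an <item> count as item links;
                // the channel-level <link> is skipped.
                if (links.item(i).getParentNode().getNodeName().equals("item")) {
                    outlinks.add(links.item(i).getTextContent().trim());
                }
            }
            return outlinks;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```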

Cheers,
  Chris





Re: RSS fetcher and indexing individual items: how can I realize this function?

peter decrem
Chris,

I saw your name associated with the RSS parser in Nutch. My understanding is that Nutch is using feedparser. I had two questions:

1. Have you looked at VTD as an RSS parser?
2. Any view on asynchronous communication as the underlying protocol? I do not believe that feedparser uses that at this point.

Thanks
 




Re: RSS fetcher and indexing individual items: how can I realize this function?

吴志敏
In reply to this post by chrismattmann
Thanks for your reply, but maybe I didn't explain clearly.
I want to index each item as an individual page. Then, when I search for something, for example "nutch-open source", Nutch should return a hit which contains:

   title : nutch-open source
   description : nutch nutch nutch ... nutch nutch
   url : http://lucene.apache.org/nutch
   category : news
   author : kauu

So, can the parse-rss plugin satisfy what I need?

<item>
    <title>nutch--open source</title>
    <description>nutch nutch nutch ... nutch nutch</description>
    <link>http://lucene.apache.org/nutch</link>
    <category>news</category>
    <author>kauu</author>
</item>




--
www.babatu.com

Re: RSS fetcher and indexing individual items: how can I realize this function?

chrismattmann
In reply to this post by peter decrem
Hi there,

On 1/30/07 7:00 PM, "[hidden email]" <[hidden email]> wrote:

> Chris,
>
> I saw your name associated with the rss parser in nutch.  My understanding is
> that nutch is using feedparser.  I had two questions:
>
> 1.  Have you looked at vtd as an rss parser?

I haven't in fact; what are its benefits over those of commons-feedparser?

> 2.  Any view on asynchronous communication as the underlying protocol?  I do
> not believe that feedparser uses that at this point.

I'm not sure exactly what asynchronous communication affords you when parsing
RSS feeds: what type of communication are you talking about? Nutch
handles the communications layer for fetching content using a pluggable,
Protocol-based model. The only feature that Nutch's RSS parser uses from the
underlying feedparser library is its object model and callback framework for
parsing RSS/Atom feed XML documents. When you mention asynchronous above,
are you talking about the protocol for fetching the different RSS documents?

Thanks!

Cheers,
  Chris





Re: RSS fetcher and indexing individual items: how can I realize this function?

peter decrem

1. It claims to be faster.
2. Asynchronous fetching should take care of sitting and waiting for one fetch to return before you start the next.

P.S. I'm not sure if you have checked out tailrank.com for that branch of feedparser (I think it's at code.tailrank.com/feedparser).

Thanks


 




Re: RSS fetcher and indexing individual items: how can I realize this function?

chrismattmann
In reply to this post by 吴志敏
Hi there,

  With the explanation that you give below, it seems that parse-rss as it
exists would address what you are trying to do. parse-rss parses an RSS
channel as a set of items and indexes overall metadata about the RSS file,
including parse text and index data, but it also adds each item's URL (in
the channel) as an Outlink, so that Nutch will process those pieces of
content as well. The only thing that you suggest below that parse-rss
currently doesn't do is allow you to associate the metadata fields
category: and author: with the item Outlink...

Cheers,
  Chris







Re: RSS fetcher and indexing individual items: how can I realize this function?

吴志敏
Hi,

Thanks anyway, but I don't think I explained clearly enough.

What I want is for Nutch to fetch the RSS seeds to a depth of 1 only, so Nutch should fetch just those XML pages. I don't want to fetch the items' outlink pages, because there is too much spam in those pages. So I just need to parse the RSS file.

Then, when I search for some words that appear in the description tag of one XML file's item, the returned hit should look like this:

title == one item's title
summary == one item's description
link == one item's outlink

So, I don't know whether the parse-rss plugin provides this function?
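Keeping the crawl away from the item outlinks is normally the job of a URL filter; below is a toy version of the accept/reject regex idea that Nutch's regex URL filter uses. The patterns are hypothetical illustrations, not a real Nutch filter configuration:

```java
import java.util.List;
import java.util.regex.Pattern;

// Toy version of a regex URL filter: '+' rules accept, '-' rules reject,
// first matching rule wins, and anything unmatched is rejected. Accepting
// only feed URLs keeps the crawl from following item outlinks.
public class FeedOnlyFilter {
    private static final List<String> RULES = List.of(
            "+\\.(rss|xml)$",   // keep feed documents
            "-."                // reject everything else
    );

    public static boolean accept(String url) {
        for (String rule : RULES) {
            if (Pattern.compile(rule.substring(1)).matcher(url).find()) {
                return rule.charAt(0) == '+';
            }
        }
        return false; // default: reject
    }
}
```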



--
www.babatu.com

RE: RSS fetcher and indexing individual items: how can I realize this function?

Gal Nitzan
In reply to this post by chrismattmann
Hi,

Many sites provide RSS feeds for several reasons: usually to save bandwidth, to give users concentrated data, and so forth.

Some of the RSS files supplied by sites are created especially for search engines, where each RSS "item" represents a web page on the site.

IMHO the only thing "missing" in the parse-rss plugin is storing the data in the CrawlDatum and "parsing" it in the next fetch phase. Maybe add a new flag to CrawlDatum that would mark the URL as "parsable", not "fetchable"?
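The flag idea could be sketched roughly like this. The types here are hypothetical stand-ins for illustration only; the real CrawlDatum API differs:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a per-URL flag telling the fetch phase to parse
// already-stored content instead of fetching it over the network again.
public class DatumFlagDemo {
    enum Action { FETCHABLE, PARSABLE }

    static class Datum {
        final String url;
        final Action action;
        Datum(String url, Action action) { this.url = url; this.action = action; }
    }

    // The fetch phase would dispatch on the flag for each queued URL.
    public static List<String> runPhase(List<Datum> queue) {
        List<String> log = new ArrayList<>();
        for (Datum d : queue) {
            log.add((d.action == Action.PARSABLE ? "parse " : "fetch ") + d.url);
        }
        return log;
    }
}
```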

Just my two cents...

Gal.







Re: RSS fetcher and indexing individual items: how can I realize this function?

chrismattmann
Hi Gal, et al.,

  I'd like to be explicit when we talk about what the issue with the RSS
parsing plugin is here; I think we have had conversations similar to this
before and it seems that we keep talking around each other. I'd like to get
to the heart of this matter so that the issue (if there is an actual one)
gets addressed ;)

  Okay, so you mention below that the thing that you see missing from the
current RSS parsing plugin is the ability to store data in the CrawlDatum,
and parse "it" in the next fetch phase. Well, there are 2 options here for
what you refer to as "it":

 1. If you're talking about the RSS file, then in fact, it is parsed, and
its data is stored in the CrawlDatum, akin to any other form of content that
is fetched, parsed and indexed.

 2. If you're talking about the item links within the RSS file, in fact,
they are parsed (eventually), and their data stored in the CrawlDatum, akin
to any other form of content that is fetched, parsed, and indexed. This is
accomplished by adding the RSS items as Outlinks when the RSS file is
parsed: in this fashion, we go after all of the links in the RSS file, and
make sure that we index their content as well.

Thus, if you had an RSS file R that contained links in it to a PDF file A,
and another HTML page P, then not only would R get fetched, parsed, and
indexed, but so would A and P, because they are item links within R. Then
queries that would match R (the physical RSS file), would additionally match
things such as P and A, and all 3 would be capable of being returned in a
Nutch query. Does this make sense? Is this the issue that you're talking
about? Am I nuts? ;)
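As an illustration of case 2 above, here is a minimal, self-contained sketch (plain JDK XML parsing, not Nutch's actual parse-rss code; the class and method names are invented for this example) that extracts one link per <item> from a feed, i.e. the set of URLs that parse-rss would hand back as Outlinks for a later fetch cycle:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class RssItemLinks {

    // Parse an RSS channel and return one link per <item>. In parse-rss each
    // of these links becomes an Outlink, so the pages they point to are
    // fetched, parsed, and indexed as separate documents later on.
    static List<String> extractItemLinks(String rssXml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(rssXml.getBytes(StandardCharsets.UTF_8)));
        List<String> links = new ArrayList<>();
        NodeList items = doc.getElementsByTagName("item");
        for (int i = 0; i < items.getLength(); i++) {
            Element item = (Element) items.item(i);
            NodeList link = item.getElementsByTagName("link");
            if (link.getLength() > 0) {
                links.add(link.item(0).getTextContent().trim());
            }
        }
        return links;
    }

    public static void main(String[] args) throws Exception {
        // An RSS file R linking to a PDF A and an HTML page P, as in the text.
        String feed = "<rss><channel>"
                + "<item><title>A</title><link>http://example.com/a.pdf</link></item>"
                + "<item><title>P</title><link>http://example.com/p.html</link></item>"
                + "</channel></rss>";
        System.out.println(extractItemLinks(feed));
        // prints [http://example.com/a.pdf, http://example.com/p.html]
    }
}
```

For the RSS file R in the example, this returns the URLs of A and P, which a later fetch cycle then fetches, parses, and indexes alongside R itself.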

Cheers,
  Chris




On 1/31/07 10:40 PM, "Gal Nitzan" <[hidden email]> wrote:

> Hi,
>
> Many sites provide RSS feeds for several reasons, usually to save bandwidth,
> to give the users concentrated data and so forth.
>
> Some of the RSS files supplied by sites are created specially for search
> engines where each RSS "item" represent a web page in the site.
>
> IMHO the only thing "missing" in the parse-rss plugin is storing the data in
> the CrawlDatum and "parsing" it in the next fetch phase. Maybe adding a new
> flag to CrawlDatum, that would flag the URL as "parsable" not "fetchable"?
>
> Just my two cents...
>
> Gal.
>
> -----Original Message-----
> From: Chris Mattmann [mailto:[hidden email]]
> Sent: Wednesday, January 31, 2007 8:44 AM
> To: [hidden email]
> Subject: Re: RSS-fecter and index individul-how can i realize this function
>
> Hi there,
>
>   With the explanation that you give below, it seems like parse-rss as it
> exists would address what you are trying to do. parse-rss parses an RSS
> channel as a set of items, and indexes overall metadata about the RSS file,
> including parse text, and index data, but it also adds each item (in the
> channel)'s URL as an Outlink, so that Nutch will process those pieces of
> content as well. The only thing that you suggest below that parse-rss
> currently doesn't do, is to allow you to associate the metadata fields
> category:, and author: with the item Outlink...
>
> Cheers,
>   Chris
>
>
>
> On 1/30/07 7:30 PM, "kauu" <[hidden email]> wrote:
>
>> thx for ur reply. maybe i didn't tell clearly.
>> I want to index the item as an individual page. Then when i search for
>> something, for example "nutch-open source", nutch returns a hit which contains:
>>
>>    title : nutch-open source
>>    description : nutch nutch nutch ....nutch  nutch
>>    url : http://lucene.apache.org/nutch
>>    category : news
>>    author : kauu
>>
>> so, can the plugin parse-rss satisfy what i need?
>>
>> <item>
>>     <title>nutch--open source</title>
>>     <description>
>>        nutch nutch nutch ....nutch nutch
>>     </description>
>>     <link>http://lucene.apache.org/nutch</link>
>>     <category>news</category>
>>     <author>kauu</author>
>
>
>

______________________________________________
Chris A. Mattmann
[hidden email]
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group

_________________________________________________
Jet Propulsion Laboratory            Pasadena, CA
Office: 171-266B                        Mailstop:  171-246
_______________________________________________________

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.



RE: RSS-fecter and index individul-how can i realize this function

Gal Nitzan

Hi Chris,

I'm sorry I wasn't clear enough. What I mean is that in the current implementation:

1. The RSS (channels, items) page ends up as one Lucene document in the index.
2. Indeed the links are extracted and each <item> link will be fetched in the next fetch as a separate page and will end up as one Lucene document.

IMHO the data that is needed, i.e. the data that would be fetched in the next fetch cycle, is already available in the <item> element. Each <item> element represents one web resource, so there is no reason to go back to the server and re-fetch that resource.

Another issue with RSS feeds is that once a feed page has been fetched, it cannot be re-fetched until its "time to fetch" has expired, and feed TTLs are usually very short. Since, for now, all pages in Nutch are created equal :), that is one more thing to think about.
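Gal's point can be sketched as follows: everything needed for a per-item document is already in the fetched feed, so the items can be split out locally, keyed by their <link>, without a second round-trip to the server. This is an illustration in plain JDK code, not the parse-rss implementation; all names here are invented:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class RssItemDocuments {

    // Turn each <item> of an already-fetched feed into its own "document",
    // keyed by the item's <link> (the would-be unique key), without ever
    // contacting the server again.
    static Map<String, Map<String, String>> itemsByLink(String rssXml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(rssXml.getBytes(StandardCharsets.UTF_8)));
        Map<String, Map<String, String>> docs = new LinkedHashMap<>();
        NodeList items = doc.getElementsByTagName("item");
        for (int i = 0; i < items.getLength(); i++) {
            Element item = (Element) items.item(i);
            // Collect the fields that would be indexed per item.
            Map<String, String> fields = new LinkedHashMap<>();
            for (String tag : new String[] {"title", "description", "category", "author"}) {
                NodeList n = item.getElementsByTagName(tag);
                if (n.getLength() > 0) fields.put(tag, n.item(0).getTextContent().trim());
            }
            NodeList link = item.getElementsByTagName("link");
            if (link.getLength() > 0) docs.put(link.item(0).getTextContent().trim(), fields);
        }
        return docs;
    }
}
```

Each entry of the returned map corresponds to one searchable hit (title, description, category, author), keyed by the item link, exactly the shape kauu asked for.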

HTH,

Gal.






Re: RSS-fecter and index individul-how can i realize this function

吴志敏
hi all,
  what Gal said is exactly what I mean about the rss-parse need.
  i just want to fetch the rss seeds once.





--
www.babatu.com

Generator.java bug?

Gal Nitzan
Hi,

After many failures of generate ("Generator: 0 records selected for fetching,
exiting ...") I made a post about it a few days back.

I narrowed it down to the following function:

    public Path generate(Path dbDir, Path segments, int numLists, long topN,
        long curTime, boolean filter, boolean force)

specifically to the following if:

    if (readers == null || readers.length == 0 ||
        !readers[0].next(new FloatWritable()))

It turns out that "!readers[0].next(new FloatWritable())" is the culprit.

Gal


RE: Generator.java bug?

Gal Nitzan

PS.

In the following code:

    if (readers == null || readers.length == 0 || !readers[0].next(new
FloatWritable())) {
      LOG.warn("Generator: 0 records selected for fetching, exiting ...");
      LockUtil.removeLockFile(fs, lock);
      fs.delete(tempDir);
      return null;
    }

>>>> There is no need for the null check here, since readers was already checked above:
    if (readers!=null)
      for (int i = 0; i < readers.length; i++) readers[i].close();





Re: Generator.java bug?

Andrzej Białecki-2
In reply to this post by Gal Nitzan
Gal Nitzan wrote:

> in the following if:  if (readers == null || readers.length == 0 ||
> !readers[0].next(new FloatWritable()))
>
> It turns out that the: "!readers[0].next(new FloatWritable())" is the
> culprit.

Well, this condition simply checks if the result is not empty. When we
open Reader[] on a SequenceFile, each reader corresponds to a
part-xxxxx. There must be at least one part, so we use the one at index
0. If we cannot retrieve at least one entry from it, then it logically
follows that the file is empty, and we bail out.
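The guard can be illustrated with a minimal stand-in for the part-file reader (the Reader class below is invented for this sketch; the real Hadoop SequenceFile.Reader API differs):

```java
import java.util.Iterator;
import java.util.List;

public class EmptyCheck {

    // Minimal stand-in for a part-file reader: next(key) advances the reader,
    // fills in the key, and returns false when there are no more records.
    static class Reader {
        private final Iterator<Float> it;
        Reader(List<Float> records) { this.it = records.iterator(); }
        boolean next(float[] key) {
            if (!it.hasNext()) return false;
            key[0] = it.next();
            return true;
        }
    }

    // The Generator guard: bail out when there are no part files at all, or
    // when the first part file yields no record. Short-circuit evaluation
    // means readers[0] is only touched when it actually exists.
    static boolean isEmpty(Reader[] readers) {
        return readers == null || readers.length == 0 || !readers[0].next(new float[1]);
    }
}
```

Note that the third clause consumes one record from readers[0] as a side effect of probing for emptiness, which is worth keeping in mind when comparing behavior with and without that clause.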

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



RE: Generator.java bug?

Gal Nitzan
Hi Andrzej,

Well, on my system the list does contain URLs and the fetcher fetches them
correctly; however, if I keep that test in the "if", it reports that the
list is empty.

I am not sure, but maybe the first value is not a FloatWritable, or maybe
it is something else?

Thanks,

Gal








Re: Generator.java bug?

Andrzej Białecki-2
Gal Nitzan wrote:
> Hi Andrzej,
>
> Well on my system the list does contains urls and the fetcher does fetch it
> correctly, however if I keep that test in the "if" it will report the list
> is empty.
>  

Hmm. Which version are you referring to? My version of Generator.java
(rev. 499917) doesn't contain the second "if" that you mentioned ...
Make sure you have the latest version (svn update) and that there are no
local diffs in your source tree (svn diff should return no result).




Re: RSS-fecter and index individul-how can i realize this function

Renaud Richardet-3-2
In reply to this post by Gal Nitzan
Gal, Chris, Kauu,

So, if I understand correctly, you need a way to pass information along
with the fetches, so that when Nutch fetches a feed entry, the <item> value
fetched previously is available.

This is how I tackled the issue:
- extend Outlink.java to allow creating outlinks with more metadata, and
use this in your feed parser when creating outlinks
- pass the metadata on through ParseOutputFormat.java and Fetcher.java
- retrieve the metadata in HtmlParser.java and use it

This is very tedious, blows up the size of your outlinks db, requires
changes to Nutch's core code, etc... but it is the only way I came up with.
If someone sees a better way, please let me know :-)

Sample code, for Nutch 0.8.x :

Outlink.java
+  public Outlink(String toUrl, String anchor, String entryContents,
Configuration conf) throws MalformedURLException {
+      this.toUrl = new
UrlNormalizerFactory(conf).getNormalizer().normalize(toUrl);
+      this.anchor = anchor;
+
+      this.entryContents= entryContents;
+  }
and update the other methods

ParseOutputFormat.java, around lines 140
+            // set outlink info in metadata ME
+            String entryContents= links[i].getEntryContents();
+
+            if (entryContents.length() > 0) { // it's a feed entry
+                MapWritable meta = new MapWritable();
+                meta.put(new UTF8("entryContents"), new
UTF8(entryContents));//key/value
+                target = new CrawlDatum(CrawlDatum.STATUS_LINKED,
interval);
+                target.setMetaData(meta);
+            } else {
+                target = new CrawlDatum(CrawlDatum.STATUS_LINKED,
interval); // no meta
+            }

Fetcher.java, around l. 266
+      // add feed info to metadata
+      try {
+          String entryContents = datum.getMetaData().get(new
UTF8("entryContents")).toString();
+          metadata.set("entryContents", entryContents);
+      } catch (Exception e) { } //not found

HtmlParser.java
// get entry metadata
    String entryContents = content.getMetadata().get("entryContents");

HTH,
Renaud



> and another HTML page P, then not only would R get fetched, parsed, and
> indexed, but so would A and P, because they are item links within R. Then
> queries that would match R (the physical RSS file), would additionally match
> things such as P and A, and all 3 would be capable of being returned in a
> Nutch query. Does this make sense? Is this the issue that you're talking
> about? Am I nuts? ;)
>
> Cheers,
>   Chris
>
>
>
>
> On 1/31/07 10:40 PM, "Gal Nitzan" <[hidden email]> wrote:
>
>  
>> Hi,
>>
>> Many sites provide RSS feeds for several reasons, usually to save bandwidth,
>> to give the users concentrated data and so forth.
>>
>> Some of the RSS files supplied by sites are created specially for search
>> engines where each RSS "item" represent a web page in the site.
>>
>> IMHO the only thing "missing" in the parse-rss plugin is storing the data in
>> the CrawlDatum and "parsing" it in the next fetch phase. Maybe adding a new
>> flag to CrawlDatum, that would flag the URL as "parsable" not "fetchable"?
>>
>> Just my two cents...
>>
>> Gal.
>>
>> -----Original Message-----
>> From: Chris Mattmann [mailto:[hidden email]]
>> Sent: Wednesday, January 31, 2007 8:44 AM
>> To: [hidden email]
>> Subject: Re: RSS-fecter and index individul-how can i realize this function
>>
>> Hi there,
>>
>>   With the explanation that you give below, it seems like parse-rss as it
>> exists would address what you are trying to do. parse-rss parses an RSS
>> channel as a set of items, and indexes overall metadata about the RSS file,
>> including parse text, and index data, but it also adds each item (in the
>> channel)'s URL as an Outlink, so that Nutch will process those pieces of
>> content as well. The only thing that you suggest below that parse-rss
>> currently doesn't do, is to allow you to associate the metadata fields
>> category:, and author: with the item Outlink...
>>
>> Cheers,
>>   Chris
>>
>>
>>
>> On 1/30/07 7:30 PM, "kauu" <[hidden email]> wrote:
>>
>>    
>>> thx for ur reply .
>>>      
>> mybe i didn't tell clearly .
>>  I want to index the item as a
>>    
>>> individual page .then when i search the some
>>>      
>> thing for example "nutch-open
>>    
>>> source", the nutch return a hit which contain
>>>      
>>    title : nutch-open source
>>
>>    
>>> description : nutch nutch nutch ....nutch  nutch
>>>      
>>    url :
>>    
>>> http://lucene.apache.org/nutch
>>>      
>>    category : news
>>   author  : kauu
>>
>> so , is
>>    
>>> the plugin parse-rss can satisfy what i need?
>>>      
>> <item>
>>     <title>nutch--open
>>    
>>> source</title>
>>>      
>>    <description>
>>    
>>>        nutch nutch nutch ....nutch
>>> nutch
>>>      
>>>>     </description>
>>>>
>>>>
>>>>
>>>>        
>>> <link>http://lucene.apache.org/nutch</link>
>>>      
>>>>     <category>news
>>>>        
>>> </category>
>>>      
>>>>     <author>kauu</author>
>>>>        
>>
>> On 1/31/07, Chris
>>    
>>> Mattmann <[hidden email]> wrote:
>>>
>>> Hi there,
>>>
>>> I could most
>>> likely be of assistance, if you gave me some more
>>> information.
>>> For
>>> instance: I'm wondering if the use case you describe below is already
>>>
>>> supported by the current RSS parse plugin?
>>>
>>> The current RSS parser,
>>> parse-rss, does in fact index individual items
>>> that
>>> are pointed to by an
>>> RSS document. The items are added as Nutch Outlinks,
>>> and added to the
>>> overall queue of URLs to fetch. Doesn't this satisfy what
>>> you mention below?
>>> Or am I missing something?
>>>
>>> Cheers,
>>>   Chris
>>>
>>>
>>>
>>> On 1/30/07 6:01 PM,
>>> "kauu" <[hidden email]> wrote:
>>>
>>>      
>>>> Hi folks :
>>>>
>>>>    What's I want to
>>>>        
>>> do is to separate a rss file into several pages .
>>>      
>>>>   Just as what has
>>>>        
>>> been discussed before. I want fetch a rss page and
>>> index
>>>      
>>>> it as different
>>>>        
>>> documents in the index. So the searcher can search the
>>>      
>>>> Item's info as a
>>>>        
>>> individual hit.
>>>      
>>>>  What's my opinion create a protocol for fetch the rss
>>>>        
>>> page and store it
>>> as
>>>      
>>>> several one which just contain one ITEM tag .but
>>>>        
>>> the unique key is the
>>> url ,
>>>      
>>>> so how can I store them with the ITEM's link
>>>>        
>>> tag as the unique key for a
>>>      
>>>> document.
>>>>
>>>>   So my question is how to
>>>>        
>>> realize this function in nutch-.0.8.x.
>>>      
>>>>   I've check the code of the
>>>>        
>>> plug-in protocol-http's code ,but I can't
>>>      
>>>> find the code where to store a
>>>>        
>>> page to a document. I want to separate
>>> the
>>>      
>>>> rss page to several ones
>>>>        
>>> before storing it as a document but several
>>> ones.
>>>      
>>>>   So any one can
>>>>        
>>> give me some hints?
>>>      
>>>> Any reply will be appreciated !
>>>>
>>>>
>>>>
>>>>
>>>>        
>>>>   ITEM's structure
>>>>
>>>>  <item>
>>>>
>>>>
>>>>     <title>欧洲暴风雪后发制人 致航班
>>>>        
>>> 延误交通混乱(组图)</title>
>>>      
>>>>     <description>暴风雪横扫欧洲,导致多次航班延误 1
>>>>        
>>> 月24日,几架民航客机在德
>>>      
>>>> 国斯图加特机场内等待去除机身上冰雪。1月24日,工作人员在德国南部
>>>>        
>>> 的慕尼黑机场
>>>      
>>>> 清扫飞机跑道上的积雪。 据报道,迟来的暴风雪连续两天横扫中...
>>>>
>>>>        
>>>>     </description>
>>>>
>>>>
>>>>
>>>>        
>>> <link>http://news.sohu.com/20070125
>>>      
>>> <http://news.sohu.com/20070125/n247833568.shtml> /n247833568.shtml</
>>>      
>>> link>
>>>      
>>>>     <category>搜狐焦点图新闻</category>
>>>>
>>>>
>>>>
>>>>        
>>> <author>[hidden email]
>>>      
>>>> </author>
>>>>
>>>>
>>>>     <pubDate>Thu, 25 Jan 2007
>>>>        
>>> 11:29:11 +0800</pubDate>
>>>      
>>>>     <comments
>>>>        
>>> http://comment.news.sohu.com
>>>      
>>> <http://comment.news.sohu.com/comment/topic.jsp?id=247833847>
>>>      
>>> /comment/topic.jsp?id=247833847</comments>
>>>      
>>>> </item
>>>>
>>>>
>>>>        
>>>
>>>      
>
> ______________________________________________
> Chris A. Mattmann
> [hidden email]
> Staff Member
> Modeling and Data Management Systems Section (387)
> Data Management Systems and Technologies Group
>
> _________________________________________________
> Jet Propulsion Laboratory            Pasadena, CA
> Office: 171-266B                        Mailstop:  171-246
> _______________________________________________________
>
> Disclaimer:  The opinions presented within are my own and do not reflect
> those of either NASA, JPL, or the California Institute of Technology.
>
>
>
>
>
>  


--
renaud richardet                           +1 617 230 9112
renaud <at> oslutions.com         http://www.oslutions.com


Re: RSS-fecter and index individul-how can i realize this function

Doug Cutting
In reply to this post by Gal Nitzan
Gal Nitzan wrote:
> IMHO the data that is needed i.e. the data that will be fetched in the next fetch process is already available in the <item> element. Each <item> element represents one web resource. And there is no reason to go to the server and re-fetch that resource.

Perhaps ProtocolOutput should change.  The method:

   Content getContent();

could be deprecated and replaced with:

   Content[] getContents();

This would require changes to the indexing pipeline.  I can't think of
any severe complications, but I haven't looked closely.
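A minimal sketch of how such a deprecation could look (hypothetical class shape; the real ProtocolOutput also carries a ProtocolStatus, and Content has more fields than shown here):

```java
// Sketch of the proposal: a protocol implementation (e.g. an RSS-aware one)
// could return one Content per <item>, while old callers keep working
// through the deprecated single-content accessor.
public class ProtocolOutputSketch {

    // minimal stand-in for org.apache.nutch.protocol.Content
    public static class Content {
        final String url;
        final byte[] data;
        Content(String url, byte[] data) { this.url = url; this.data = data; }
    }

    private final Content[] contents;

    public ProtocolOutputSketch(Content[] contents) {
        this.contents = contents;
    }

    /** New accessor: one fetch may yield several documents (one per item). */
    public Content[] getContents() {
        return contents;
    }

    /** @deprecated use {@link #getContents()}; kept for old callers. */
    @Deprecated
    public Content getContent() {
        return contents.length > 0 ? contents[0] : null;
    }

    public static void main(String[] args) {
        Content[] items = {
            new Content("http://news.sohu.com/20070125/n247833568.shtml", new byte[0]),
            new Content("http://lucene.apache.org/nutch", new byte[0]),
        };
        ProtocolOutputSketch out = new ProtocolOutputSketch(items);
        System.out.println(out.getContents().length); // 2: each item is its own document
        System.out.println(out.getContent().url);     // old callers still get the first
    }
}
```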

Could something like that work?

Doug

Re: RSS-fecter and index individul-how can i realize this function

吴志敏
In reply to this post by Renaud Richardet-3-2
I've changed the code like you said, but I get an exception like this.
Why? Is this an exception from the MD5Signature class?


2007-02-05 11:28:38,453 WARN  feedparser.FeedFilter (FeedFilter.java:doDecodeEntities(223)) - Filter encountered unknown entities
2007-02-05 11:28:39,390 INFO  crawl.SignatureFactory (SignatureFactory.java:getSignature(45)) - Using Signature impl: org.apache.nutch.crawl.MD5Signature
2007-02-05 11:28:40,078 WARN  mapred.LocalJobRunner (LocalJobRunner.java:run(120)) - job_f6j55m
java.lang.NullPointerException
    at org.apache.nutch.parse.ParseOutputFormat$1.write(ParseOutputFormat.java:121)
    at org.apache.nutch.fetcher.FetcherOutputFormat$1.write(FetcherOutputFormat.java:87)
    at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:235)
    at org.apache.hadoop.mapred.lib.IdentityReducer.reduce(IdentityReducer.java:39)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:247)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:112)
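For what it's worth, the trace points at ParseOutputFormat.write, not MD5Signature (the SignatureFactory line is just the last INFO message before the job failed). One plausible cause, under the patch quoted below, is that entryContents is null for outlinks created through the old constructor, so the unguarded entryContents.length() call throws. A null-safe version of that check (a sketch with a hypothetical helper name, not the actual fix):

```java
// The quoted patch calls entryContents.length() unguarded; outlinks created
// through the old two-argument constructor leave the field null, which
// produces an NPE like the one in the trace. Guarding the check avoids it.
public class EntryContentsGuard {

    // null-safe replacement for: entryContents.length() > 0
    static boolean isFeedEntry(String entryContents) {
        return entryContents != null && entryContents.length() > 0;
    }

    public static void main(String[] args) {
        System.out.println(isFeedEntry("<item>...</item>")); // true
        System.out.println(isFeedEntry(null));               // false instead of NPE
        System.out.println(isFeedEntry(""));                 // false
    }
}
```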


On 2/3/07, Renaud Richardet <[hidden email]> wrote:

>
> Gal, Chris, Kauu,
>
> So, if I understand correctly, you need a way to pass information along
> the fetches, so that when Nutch fetches a feed entry, its <item> value
> previously fetched is available.
>
> This is how I tackled the issue:
> - extend Outlinks.java and allow to create outlinks with more meta data.
> So, in your feed parser, use this way to create outlinks
> - pass on the metadata through ParseOutputFormat.java and Fetcher.java
> - retrieve the metadata in HtmlParser.java and use it
>
> This is very tedious, will blow the size of your outlinks db, makes
> changes in the core code of Nutch, etc... But this is the only way I
> came up with...
> If someone sees a better way, please let me know :-)
>
> Sample code, for Nutch 0.8.x :
>
> Outlink.java
> +  public Outlink(String toUrl, String anchor, String entryContents,
> Configuration conf) throws MalformedURLException {
> +      this.toUrl = new
> UrlNormalizerFactory(conf).getNormalizer().normalize(toUrl);
> +      this.anchor = anchor;
> +
> +      this.entryContents= entryContents;
> +  }
> and update the other methods
>
> ParseOutputFormat.java, around lines 140
> +            // set outlink info in metadata ME
> +            String entryContents= links[i].getEntryContents();
> +
> +            if (entryContents.length() > 0) { // it's a feed entry
> +                MapWritable meta = new MapWritable();
> +                meta.put(new UTF8("entryContents"), new
> UTF8(entryContents));//key/value
> +                target = new CrawlDatum(CrawlDatum.STATUS_LINKED,
> interval);
> +                target.setMetaData(meta);
> +            } else {
> +                target = new CrawlDatum(CrawlDatum.STATUS_LINKED,
> interval); // no meta
> +            }
>
> Fetcher.java, around l. 266
> +      // add feed info to metadata
> +      try {
> +          String entryContents = datum.getMetaData().get(new
> UTF8("entryContents")).toString();
> +          metadata.set("entryContents", entryContents);
> +      } catch (Exception e) { } //not found
>
> HtmlParser.java
> // get entry metadata
>     String entryContents = content.getMetadata().get("entryContents");
>
> HTH,
> Renaud


--
www.babatu.com