[jira] Created: (NUTCH-444) Possibly use a different library to parse RSS feed for improved performance and compatibility


[jira] Created: (NUTCH-444) Possibly use a different library to parse RSS feed for improved performance and compatibility

ASF GitHub Bot (Jira)
Possibly use a different library to parse RSS feed for improved performance and compatibility
---------------------------------------------------------------------------------------------

                 Key: NUTCH-444
                 URL: https://issues.apache.org/jira/browse/NUTCH-444
             Project: Nutch
          Issue Type: Improvement
          Components: fetcher
    Affects Versions: 0.9.0
            Reporter: Renaud Richardet
            Priority: Minor
             Fix For: 0.9.0


As discussed by Nutch Newbie, Gal, and Chris on NUTCH-443, the current library (feedparser) has the following issues:
- OutOfMemory when parsing > 100k feeds, since it has to convert the feed to jdom first
- no support for Atom 1.0
- there has been no development in the last year

Alternatives are:
- Rome
- Informa
- custom implementation based on Stax
- ??
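
To make the "custom implementation based on Stax" option concrete, here is a minimal sketch of pull-parsing a feed without first building a JDOM tree. It is only an illustration under stated assumptions: the class and method names (StaxFeedTitles, entryTitles) are hypothetical, it is not code from feedparser or Nutch, and it only handles plain-text titles.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical sketch: names are illustrative, not part of Nutch or feedparser.
class StaxFeedTitles {

    /** Streams through the XML and collects the title of each RSS item / Atom entry. */
    static List<String> entryTitles(String xml) {
        List<String> titles = new ArrayList<>();
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            // Do not resolve DTDs or external entities from untrusted feeds.
            factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
            factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            boolean inEntry = false;
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    String name = reader.getLocalName();
                    if (name.equals("item") || name.equals("entry")) {
                        inEntry = true;
                    } else if (inEntry && name.equals("title")) {
                        // Reads the text up to the matching end tag (plain-text titles only).
                        titles.add(reader.getElementText().trim());
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    String name = reader.getLocalName();
                    if (name.equals("item") || name.equals("entry")) {
                        inEntry = false;
                    }
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("malformed feed", e);
        }
        return titles;
    }

    public static void main(String[] args) {
        String rss = "<rss version=\"2.0\"><channel><title>Feed</title>"
                + "<item><title>First</title></item>"
                + "<item><title>Second</title></item></channel></rss>";
        System.out.println(entryTitles(rss));
    }
}
```

Only the current event is held in memory at any time, which is why a streaming parser is attractive for large feeds.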

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (NUTCH-444) Possibly use a different library to parse RSS feed for improved performance and compatibility

ASF GitHub Bot (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12471880 ]

Renaud Richardet commented on NUTCH-444:
----------------------------------------

Gal,
Would you be able to share your Stax-based code? What license does Stax use?

Nutch Newbie,
> In my case sometime the RSS URL doesn't end with .xml or .rss so some of the feeds got indexed like the way current nutch trunk do i.e as html.  
I thought the parser was chosen based on the MIME type, right?

Chris,
What were the issues you had with Rome? It seems to be pretty straightforward now:

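// Imports needed by the snippet (Rome lives in the com.sun.syndication packages):
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.util.Date;
import java.util.Iterator;
import java.util.List;
import com.sun.syndication.feed.synd.SyndEntryImpl;
import com.sun.syndication.feed.synd.SyndFeed;
import com.sun.syndication.io.SyndFeedInput;

// raw is the fetched feed content as a byte[]
SyndFeedInput input = new SyndFeedInput();
SyndFeed feed = input.build(new InputStreamReader(new ByteArrayInputStream(raw)));
String feedTitle = feed.getTitle();
String feedAuthor = feed.getAuthor();
String feedUrl = feed.getLink();
String feedLanguage = feed.getLanguage();

List entries = feed.getEntries();
Iterator it = entries.iterator();
while (it.hasNext()) {
  SyndEntryImpl entry = (SyndEntryImpl) it.next();
  String entryLink = entry.getLink();
  String entryTitle = entry.getTitle();
  // getFeedText is a helper defined elsewhere (not shown here)
  String entryContents = getFeedText(entry);
  // the published date is null when the feed does not provide one
  Date published = entry.getPublishedDate();
  long entryDate = (published != null) ? published.getTime() : -1L;
}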


[jira] Commented: (NUTCH-444) Possibly use a different library to parse RSS feed for improved performance and compatibility

ASF GitHub Bot (Jira)
In reply to this post by ASF GitHub Bot (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12471952 ]

nutch.newbie commented on NUTCH-444:
------------------------------------

Renaud:

Thanks for moving the discussion here. First, to answer your question: yes, it comes down to a MIME type detection problem. The goal of my trial was to see whether one could build a feed-only search site, but I didn't succeed. I will give it another go over the weekend.

Dogacan:

Yes, one could just replace feedparser with Rome or Stax and either submit the result here or use it internally. My point in raising the discussion was to find out how others see it, and whether others have run into problems and what their experience has been. Gal pointed out Rome (at least it is still being developed) and Stax, and you mentioned that you are doing something with Rome, so I just wanted to hear what others think. Yes, you are correct that I posted it in the wrong place, NUTCH-443. But NUTCH-443 started off with someone having trouble with RSS, and in my view it is important to discuss this, because we are using feedparser, which will not solve the original issue for anyone trying to build an RSS-only search engine. Otherwise NUTCH-443 would not have surfaced in the first place.

I am looking forward to the day when I can use Nutch purely as an RSS feed search engine, so Dogacan, I am very interested in your Rome implementation. Maybe you can post the code here so that I can participate.


[jira] Commented: (NUTCH-444) Possibly use a different library to parse RSS feed for improved performance and compatibility

ASF GitHub Bot (Jira)
In reply to this post by ASF GitHub Bot (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12471955 ]

Chris A. Mattmann commented on NUTCH-444:
-----------------------------------------

Hi Renaud,

 In fact, Rome does appear to be quite easy to use, given the above coding example. If I recall, the main issue that I had with it before was the large number of external libraries that it required in order to run (which may not be the case anymore). Additionally, I recall there being an issue with the fact that Rome loaded the entire RSS structure into memory; on the other hand, commons-feedparser uses a SAX-based approach, which I really liked.
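
For readers unfamiliar with the distinction drawn above, the following is a minimal sketch of what a SAX-based (event-driven) pass over a feed looks like, using only the JDK's SAX API. It is an illustration, not commons-feedparser code; the class name SaxItemCounter and the method countItems are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: counts feed entries from SAX callbacks, without building a tree.
class SaxItemCounter extends DefaultHandler {
    private int items;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // With a namespace-aware parser, localName is the element name without prefix.
        if (localName.equals("item") || localName.equals("entry")) {
            items++;
        }
    }

    /** Parses the raw feed bytes and returns the number of RSS items / Atom entries. */
    static int countItems(byte[] raw) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            SaxItemCounter handler = new SaxItemCounter();
            factory.newSAXParser().parse(new ByteArrayInputStream(raw), handler);
            return handler.items;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse feed", e);
        }
    }

    public static void main(String[] args) {
        String rss = "<rss version=\"2.0\"><channel>"
                + "<item><title>a</title></item><item><title>b</title></item>"
                + "</channel></rss>";
        System.out.println(countItems(rss.getBytes(StandardCharsets.UTF_8)));
    }
}
```

The handler keeps only a counter, so memory use does not grow with the size of the feed; a library that materializes the whole feed as objects trades that property for a more convenient API.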

 So, those were some of the deterrents when I originally evaluated the technologies circa May 2005. I'm not against adapting the current parse-rss plugin, or alternatively writing a parse-rss++ that utilizes a different underlying feedparser technology. I just need to be convinced that it makes sense. Non-active development is not a valid excuse for switching libraries -- I've seen a number of really nice implementations and projects that produced an awesome piece of software only to have developers abandon active development on it (I won't name names, but they're out there if you look). This doesn't take away from the fact that the software works, is proven, and suits the needs of the developers that use it.

  In any case, I'll take the lead on shepherding anything produced out of this into the sources. Look forward to working with you all.

Cheers,
  Chris




[jira] Updated: (NUTCH-444) Possibly use a different library to parse RSS feed for improved performance and compatibility

ASF GitHub Bot (Jira)
In reply to this post by ASF GitHub Bot (Jira)

     [ https://issues.apache.org/jira/browse/NUTCH-444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dogacan Güney updated NUTCH-444:
--------------------------------

    Attachment: parse-feed.tar.bz2

OK, here is my feedparsing plugin using rome. Note that this plugin is NOT ready for any serious use. I have only written this so I can test NUTCH-443 better.


[jira] Commented: (NUTCH-444) Possibly use a different library to parse RSS feed for improved performance and compatibility

ASF GitHub Bot (Jira)
In reply to this post by ASF GitHub Bot (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12471967 ]

nutch.newbie commented on NUTCH-444:
------------------------------------

Well, let's try this again in terms of feedparser.

I completely disagree that a dormant project, which doesn't support the newer protocols and hasn't shown any activity for the last 12 months, is not a reason for change. Let us just focus on the publicly available stats from syndic8.com (they don't have all the feeds, but they have enough data to give the big picture):

http://www.syndic8.com/stats.php?Section=feeds#tabtable

Total Feeds: 495,614
Atom Feeds: 84,746
RSS Feeds: 397,565

Roughly 17% of the feeds are Atom feeds (84,746 of 495,614). So a default Nutch installation misses about 17% of the "feed web". Imagine having a search engine site that can only handle HTML 3.0 and nothing more, because the project that developed the great HTML 3.0 library is no longer active. Now you may say: well, that's HTML, it's a different issue.

Well, blogs and feeds are growing on trees and we can't afford to miss 17% of the blogs/feeds.
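
As a quick check of the share, the syndic8 figures quoted above can be worked through directly. This is a small arithmetic sketch (the class name AtomShare is hypothetical); the three counts are the ones listed in this message.

```java
import java.util.Locale;

// Hypothetical sketch: works out the Atom and RSS shares from the quoted counts.
class AtomShare {

    /** Returns part as a percentage of total. */
    static double percent(long part, long total) {
        return 100.0 * part / total;
    }

    public static void main(String[] args) {
        long total = 495_614L; // Total Feeds
        long atom = 84_746L;   // Atom Feeds
        long rss = 397_565L;   // RSS Feeds
        System.out.println(String.format(Locale.ROOT, "Atom share: %.1f%%", percent(atom, total)));
        System.out.println(String.format(Locale.ROOT, "RSS share:  %.1f%%", percent(rss, total)));
    }
}
```

With these counts the Atom share comes out at about 17.1% and the RSS share at about 80.2%; the remainder are feeds counted in neither category.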

So is that a good reason to still stick with commons feedparser?

Cheers


[jira] Commented: (NUTCH-444) Possibly use a different library to parse RSS feed for improved performance and compatibility

ASF GitHub Bot (Jira)
In reply to this post by ASF GitHub Bot (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12472005 ]

Chris A. Mattmann commented on NUTCH-444:
-----------------------------------------

Nutch Newbie:

From the commons-feedparser site: http://jakarta.apache.org/commons/sandbox/feedparser/

" Jakarta FeedParser is a Java RSS/Atom parser designed to elegantly support all versions of RSS (0.9, 0.91, 0.92, 1.0, and 2.0), Atom 0.5 (and future versions) as well as easy ad hoc extension and RSS 1.0 modules capability."

According to this site, commons-feedparser does in fact support Atom. The statistics that you present above make no mention of the versions of the Atom feeds within the 84,746. For instance, how many of those are Atom 0.5 feeds? How many are >0.5?
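
One way to answer the version question for a given set of feeds is to look at the root element's namespace. The sketch below assumes the commonly used namespaces: http://www.w3.org/2005/Atom for Atom 1.0, and URIs starting with http://purl.org/atom/ns# for the pre-1.0 drafts, which usually also carry a version attribute. The class name AtomVersionSniffer and the "not-atom" / "pre-1.0" return values are hypothetical.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical sketch: classifies a feed by the namespace of its root element.
class AtomVersionSniffer {
    static final String ATOM_10_NS = "http://www.w3.org/2005/Atom";
    static final String ATOM_PRE10_NS_PREFIX = "http://purl.org/atom/ns#";

    /** Returns "1.0", a pre-1.0 version string such as "0.3", "pre-1.0", or "not-atom". */
    static String atomVersion(String xml) {
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            String result = "not-atom";
            while (reader.hasNext()) {
                if (reader.next() != XMLStreamConstants.START_ELEMENT) {
                    continue; // skip comments, processing instructions, whitespace
                }
                // The first start element is the document root.
                String ns = reader.getNamespaceURI();
                if (ns == null) {
                    ns = "";
                }
                if ("feed".equals(reader.getLocalName())) {
                    if (ns.equals(ATOM_10_NS)) {
                        result = "1.0";
                    } else if (ns.startsWith(ATOM_PRE10_NS_PREFIX)) {
                        String version = reader.getAttributeValue(null, "version");
                        result = (version != null) ? version : "pre-1.0";
                    }
                }
                break; // only the root element matters
            }
            reader.close();
            return result;
        } catch (XMLStreamException e) {
            throw new IllegalStateException("malformed feed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(atomVersion("<feed xmlns=\"http://www.w3.org/2005/Atom\"/>"));
        System.out.println(atomVersion("<feed version=\"0.3\" xmlns=\"http://purl.org/atom/ns#\"/>"));
        System.out.println(atomVersion("<rss version=\"2.0\"/>"));
    }
}
```

Running a classifier like this over a sample of the Atom URLs would give the per-version breakdown that the syndic8 totals do not provide.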

Additionally, as I mentioned above, commons-feedparser did not require the large number of external libraries that Rome required to run when I looked at them both. Is this still the case?




[jira] Commented: (NUTCH-444) Possibly use a different library to parse RSS feed for improved performance and compatibility

ASF GitHub Bot (Jira)
In reply to this post by ASF GitHub Bot (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12472078 ]

Otis Gospodnetic commented on NUTCH-444:
----------------------------------------

The ASF FeedParser you are talking about has, I believe, continued its life under Kevin Burton in TailRank:  http://tailrank.com/code.php
Atom 1.0 and everything else supported.



[jira] Commented: (NUTCH-444) Possibly use a different library to parse RSS feed for improved performance and compatibility

ASF GitHub Bot (Jira)
In reply to this post by ASF GitHub Bot (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12472099 ]

nutch.newbie commented on NUTCH-444:
------------------------------------

Otis:

Thanks for the info. As for me, I am going with parse-feed. I would also like to give a Stax-based solution a try.

Dogacan:

It's working rather well with parse-feed. However, I would be glad if you could do a quick check of my parse-plugins.xml modifications, because this also throws an error during dedup (when magic is false in nutch-site.xml). I want to know whether it is something I am doing wrong or some other bug.

I am thinking of doing a test run later tonight with 10,000 feeds, so I would be glad if you could clarify the following case. (It only happens when there is just 1 URL.)

- urls.txt contains 1 URL, which is http://blog.foofactory.fi/atom.xml
- bin/nutch crawl with depth 1 gives me the following error during dedup:

2007-02-11 13:32:26,846 WARN  mapred.LocalJobRunner - job_k9e9c2
java.lang.ArrayIndexOutOfBoundsException: -1
        at org.apache.lucene.index.MultiReader.isDeleted(MultiReader.java:109)
        at org.apache.nutch.indexer.DeleteDuplicates$InputFormat$DDRecordReader.next(DeleteDuplicates.java:176)
        at org.apache.hadoop.mapred.MapTask$2.next(MapTask.java:166)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:46)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:183)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:109)

and the parse phase for the above blog gives me the following:

2007-02-11 13:32:09,673 DEBUG http.Http - fetched 208 bytes from http://blog.foofactory.fi/robots.txt
2007-02-11 13:32:09,674 DEBUG http.Http - fetching http://blog.foofactory.fi/atom.xml
2007-02-11 13:32:10,560 INFO  mapred.JobClient -  map 100% reduce 0%
2007-02-11 13:32:10,769 DEBUG http.Http - fetched 38151 bytes from http://blog.foofactory.fi/atom.xml
2007-02-11 13:32:10,965 DEBUG parse.ParseUtil - Parsing [http://blog.foofactory.fi/atom.xml] with [org.apache.nutch.parse.feed.FeedParser@360771]
2007-02-11 13:32:11,292 INFO  mapred.LocalJobRunner - 0 pages, 0 errors, 0.0 pages/s, 0 kb/s,
2007-02-11 13:32:11,627 INFO  crawl.SignatureFactory - Using Signature impl: org.apache.nutch.crawl.MD5Signature
2007-02-11 13:32:11,654 WARN  fetcher.Fetcher - Error parsing: http://blog.foofactory.fi/atom.xml: failed(2,200): java.lang.NullPointerException
2007-02-11 13:32:12,293 INFO  mapred.LocalJobRunner - 1 pages, 0 errors, 0.3 pages/s, 99 kb/s,
2007-02-11 13:32:12,306 DEBUG mapred.MapTask - opened spill0.out
2007-02-11 13:32:12,381 INFO  mapred.LocalJobRunner - 1 pages, 0 errors, 0.3 pages/s, 99 kb/s,

Below are my parse-plugins.xml changes:

        <mimeType name="application/rss+xml">
                <plugin id="parse-feed" />
        </mimeType>

        <mimeType name="text/xml">
                <plugin id="parse-feed" />
        </mimeType>

        <alias name="parse-feed"
               extension-id="org.apache.nutch.parse.feed.FeedParser" />

I have also mapped text/xml in parse-feed/plugin.xml, because most of the time I get text/xml rather than application/rss+xml as the content type. Also, you mentioned you are using this for testing: what is your test configuration, and can you reproduce my problem?
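
The mapping above is a two-step lookup: a content type selects a plugin id, and the alias turns that plugin id into an extension id. The sketch below models just that lookup with plain maps. It is an assumption-laden illustration, not how Nutch's ParserFactory is implemented; the class name ParsePluginsSketch and the method resolveExtension are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical model of the parse-plugins.xml entries shown above.
class ParsePluginsSketch {
    // <mimeType name="..."> -> ordered list of <plugin id="...">
    static final Map<String, List<String>> MIME_TO_PLUGINS = new LinkedHashMap<>();
    // <alias name="..." extension-id="...">
    static final Map<String, String> ALIASES = new LinkedHashMap<>();

    static {
        MIME_TO_PLUGINS.put("application/rss+xml", Arrays.asList("parse-feed"));
        MIME_TO_PLUGINS.put("text/xml", Arrays.asList("parse-feed"));
        ALIASES.put("parse-feed", "org.apache.nutch.parse.feed.FeedParser");
    }

    /** Returns the extension id for a Content-Type header value, or null if unmapped. */
    static String resolveExtension(String contentType) {
        // Drop parameters such as "; charset=UTF-8" before the lookup.
        String mime = contentType.split(";")[0].trim().toLowerCase(Locale.ROOT);
        List<String> plugins = MIME_TO_PLUGINS.get(mime);
        if (plugins == null || plugins.isEmpty()) {
            return null;
        }
        return ALIASES.get(plugins.get(0)); // first listed plugin is tried first
    }

    public static void main(String[] args) {
        System.out.println(resolveExtension("text/xml; charset=UTF-8"));
        System.out.println(resolveExtension("text/html"));
    }
}
```

In this model a feed served as text/xml reaches the feed parser only because text/xml is listed; a feed served as text/html would not, whatever its URL ends with.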

Thanks again for the plugin and many thanks for your help. I look forward to contributing in terms of index-feed and query-feed.


[jira] Updated: (NUTCH-444) Possibly use a different library to parse RSS feed for improved performance and compatibility

ASF GitHub Bot (Jira)
In reply to this post by ASF GitHub Bot (Jira)

     [ https://issues.apache.org/jira/browse/NUTCH-444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dogacan Güney updated NUTCH-444:
--------------------------------

    Attachment: parse-feed-v2.tar.bz2

Updated parse-feed plugin. Still not ready for any serious use, but I think I fixed the problems with indexing and dedup. Use it with NUTCH-443's v5 patch.

nutch.newbie: I changed parse-plugins.xml the same way you did. For this plugin to work, you also have to change the default signature to TextProfileSignature (because MD5Signature takes the hash of the content, which is the same for every element in a parseMap). This is done by adding:
<property>
  <name>db.signature.class</name>
  <value>org.apache.nutch.crawl.TextProfileSignature</value>
</property>

to your nutch-site.xml.
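
The reason given above can be illustrated with a small sketch: if the signature is a hash of the raw fetched content, every entry parsed out of one feed gets the same signature and looks like a duplicate, whereas a signature derived from each entry's own text differs per entry. This is a simplification under stated assumptions: the class SignatureSketch and its methods are hypothetical, and hashing the entry text directly only stands in for TextProfileSignature, which derives its hash from a profile of the text rather than from the text itself.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical sketch: content-based vs. text-based signatures for feed entries.
class SignatureSketch {

    static String md5Hex(String s) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(s.getBytes(StandardCharsets.UTF_8))) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Content-based: ignores the entry text, so all entries of a feed collide. */
    static String contentSignature(String rawFeed, String entryText) {
        return md5Hex(rawFeed);
    }

    /** Text-based stand-in: depends on the entry's own text. */
    static String textSignature(String rawFeed, String entryText) {
        return md5Hex(entryText);
    }

    public static void main(String[] args) {
        String rawFeed = "<rss><channel><item>one</item><item>two</item></channel></rss>";
        System.out.println(contentSignature(rawFeed, "one").equals(contentSignature(rawFeed, "two")));
        System.out.println(textSignature(rawFeed, "one").equals(textSignature(rawFeed, "two")));
    }
}
```

With identical signatures, a dedup step that keeps one document per signature would collapse all entries of a feed into one.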



[jira] Commented: (NUTCH-444) Possibly use a different library to parse RSS feed for improved performance and compatibility

ASF GitHub Bot (Jira)
In reply to this post by ASF GitHub Bot (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12472121 ]

nutch.newbie commented on NUTCH-444:
------------------------------------

A big thank you! It works with the latest patch. All the previously reported bugs are gone now :-) About my test tonight: I just want to run it on a decent set of URLs to collect more bugs, nothing more :-)

Cheers



[jira] Commented: (NUTCH-444) Possibly use a different library to parse RSS feed for improved performance and compatibility

ASF GitHub Bot (Jira)
In reply to this post by ASF GitHub Bot (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12472163 ]

nutch.newbie commented on NUTCH-444:
------------------------------------

Hi:

I have now done my initial test run with 10,000+ feeds in 3 batches.

Batch 1
======
A total of 8000 feeds, RSS only, with URLs ending in ".rss". Works out of the box.

Batch 2
======
A total of 3000 Atom feeds with URLs ending in ".xml". Most of the time these throw an error during the dedup process; sometimes they get parsed by parse-html.

Batch 3
======
A total of 2000 feeds with all kinds of extensions, for example .aspx, .php, .jsp, .ece and so on. These also throw errors, just like batch 2.

Batch 2 and Batch 3 produce the same bug as before. Note that I have run only 1 round of fetch. One thing I am a bit confused about is the following. Say you have a feed with 5 items, i.e. 5 titles and 5 descriptions: shouldn't a search such as url:feed.com return 6 results, 1 for the main feed page (the actual feed URL) and 5 for the 5 items? Currently I get only 1 search result, which is the feed URL.
Do I need to do 2 rounds of fetch? Things are getting parsed correctly, so maybe it is because I don't have the indexing plugin, i.e. index-feed? I know we will work on it after NUTCH-443 is done, but I wanted to get a clarification, that's all :-) Cheers!
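
The expectation in the question (1 result for the feed plus 1 per item) can be written down as a small sketch. It only models the expected outcome; it says nothing about what the parse-feed plugin or the indexer currently does, and the class FeedDocsSketch and method toDocuments are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: one searchable document for the feed, plus one per entry.
class FeedDocsSketch {

    /** Maps URL -> text: the feed URL itself, then each entry link with its text. */
    static Map<String, String> toDocuments(String feedUrl, String feedTitle,
                                           Map<String, String> entryLinkToText) {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put(feedUrl, feedTitle);
        docs.putAll(entryLinkToText);
        return docs;
    }

    public static void main(String[] args) {
        Map<String, String> entries = new LinkedHashMap<>();
        for (int i = 1; i <= 5; i++) {
            entries.put("http://feed.com/item" + i, "title " + i + " desc " + i);
        }
        Map<String, String> docs = toDocuments("http://feed.com/feed.rss", "Feed", entries);
        System.out.println(docs.size() + " documents");
    }
}
```

For a feed with 5 items this yields 6 documents, which is the count the question expects to see in the search results.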


Some log trace from Batch 1
===================
2007-02-12 00:55:23,607 DEBUG parse.ParseUtil - Parsing [http://rss.cnn.com/rss/cnn_marquee.rss] with [org.apache.nutch.parse.feed.FeedParser@f47af3]
2007-02-12 00:55:23,648 INFO  mapred.JobClient -  map 100% reduce 0%
2007-02-12 00:55:24,690 INFO  mapred.LocalJobRunner - 0 pages, 0 errors, 0.0 pages/s, 0 kb/s,
2007-02-12 00:55:25,020 WARN  parse.ParserFactory - ParserFactory:Plugin: org.apache.nutch.parse.html.HtmlParser mapped to contentType application/xhtml+xml via parse-plugins.xml, but its plugin.xml file does not claim to support contentType: application/xhtml+xml
2007-02-12 00:55:25,225 DEBUG parse.html - http://www.cnn.com/HEALTH/blogs/paging.dr.gupta/2007/02/handling-friends-diagnosis.html: falling back to windows-1252
2007-02-12 00:55:25,225 DEBUG parse.html - Parsing...
2007-02-12 00:55:25,255 DEBUG parse.html - http://rss.cnn.com/~r/rss/cnn_warpcnn/~3/88497144/american-voices-savings-lowest-since.html: falling back to windows-1252
2007-02-12 00:55:25,255 DEBUG parse.html - Parsing...
2007-02-12 00:55:25,277 DEBUG parse.html - http://rss.cnn.com/~r/rss/cnn_ac360blog/~3/88245057/new-orleans-parents-fear-losing-kids.html: falling back to windows-1252
2007-02-12 00:55:25,277 DEBUG parse.html - Parsing...
2007-02-12 00:55:25,277 DEBUG parse.html - http://rss.cnn.com/~r/rss/cnn_marquee/~3/88516140/anna-nicole-why.html: falling back to windows-1252
2007-02-12 00:55:25,278 DEBUG parse.html - Parsing...
2007-02-12 00:55:25,691 INFO  mapred.LocalJobRunner - 0 pages, 0 errors, 0.0 pages/s, 0 kb/s,
2007-02-12 00:55:26,309 DEBUG parse.html - Meta tags for http://www.cnn.com/HEALTH/blogs/paging.dr.gupta/2007/02/handling-friends-diagnosis.html: base=null, noCache=false, noFollow=false, noIndex=false, refresh=false, refreshHref=null
 * general tags:
 * http-equiv tags:

2007-02-12 00:55:26,310 DEBUG parse.html - Getting text...
2007-02-12 00:55:26,315 DEBUG parse.html - Getting title...
2007-02-12 00:55:26,316 DEBUG parse.html - Getting links...
2007-02-12 00:55:26,318 WARN  regex.RegexURLNormalizer - can't find rules for scope 'outlink', using default
2007-02-12 00:55:26,319 DEBUG parse.html - found 1 outlinks in http://www.cnn.com/HEALTH/blogs/paging.dr.gupta/2007/02/handling-friends-diagnosis.html
2007-02-12 00:55:26,321 DEBUG parse.html - Meta tags for http://rss.cnn.com/~r/rss/cnn_ac360blog/~3/88245057/new-orleans-parents-fear-losing-kids.html: base=null, noCache=false, noFollow=false, noIndex=false, refresh=false, refreshHref=null
 * general tags:
 * http-equiv tags:

2007-02-12 00:55:26,321 DEBUG parse.html - Getting text...
2007-02-12 00:55:26,330 DEBUG parse.html - Getting title...
2007-02-12 00:55:26,331 DEBUG parse.html - Getting links...




[jira] Commented: (NUTCH-444) Possibly use a different library to parse RSS feed for improved performance and compatibility

ASF GitHub Bot (Jira)
In reply to this post by ASF GitHub Bot (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12472581 ]

Doğacan Güney commented on NUTCH-444:
-------------------------------------

Hi nutch.newbie,

Can you mail me a list of the failing Atom URLs (or, if you can isolate the problem to a couple of Atom feeds, post them here)? If the crawl fails in dedup or index (and not in parse), this means that there is a bug in NUTCH-443.

> One thing that I am a bit confused about is the following. Let's say you have a feed with 5 items, i.e. 5 titles and 5 descriptions: shouldn't a search such as url:feed.com return 6 results? 1 for the main feed page, which is the actual feed URL, and the other 5 for the 5 items. Currently I get only 1 search result, which is the feed URL.
> Do I need to do 2 rounds of fetch? Things are getting parsed correctly, so maybe it's because I don't have the indexing plugin, i.e. index-feed? I know we will work on it after NUTCH-443 is done, but I want to get a clarification, that's all :-) Cheers

I wouldn't worry about any shortcomings yet. Pretty much anything other blog search engines do can also be done in Nutch. (And, yes, what you mention can be done too.)


[jira] Commented: (NUTCH-444) Possibly use a different library to parse RSS feed for improved performance and compatibility

ASF GitHub Bot (Jira)
In reply to this post by ASF GitHub Bot (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12472596 ]

nutch.newbie commented on NUTCH-444:
------------------------------------

Hi Dogacan:

I did some digging around ROME yesterday, and it seems to me that ROME treats RSS fields, e.g. authors.getName, differently than the Atom equivalents; the same goes for description, content and category. Some values are returned null and some throw an exception. Could this be a cause of the problem? I ask because all the CNN RSS feeds passed with flying colors: http://www.cnn.com/services/rss/

Brainstorming here: maybe it's a good idea to split the parser into two parsers, i.e. parser-feed (link, title, content: the needed basics) and parser-feedextra (everything else and more). Good idea? Bad idea? I don't know, just wondering. My use case was that those who are not building a blog search could use just parser-feed to index the basics, thus saving parsing and indexing time. Those who are going for blog search would have both parsers enabled in nutch-site.xml. Just some thoughts.

I will try to send you some problem URLs directly via mail.

Regards
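
The symptom described above (some accessors returning null, others throwing) can be guarded against on the calling side, whatever the library turns out to be doing. The following is only a sketch under assumptions: `SafeField` and `orDefault` are hypothetical names, not part of ROME or Nutch, and it assumes the failure surfaces as an unchecked exception.

```java
import java.util.function.Supplier;

/** Hypothetical helper for reading optional feed fields; not a ROME or Nutch API. */
final class SafeField {
    private SafeField() {}

    /**
     * Calls the getter and returns its value, or the fallback when the
     * getter returns null or throws an unchecked exception.
     */
    static <T> T orDefault(Supplier<T> getter, T fallback) {
        try {
            T value = getter.get();
            return value != null ? value : fallback;
        } catch (RuntimeException e) {
            // Assumption: a missing field shows up as an unchecked exception.
            return fallback;
        }
    }
}
```

A caller would wrap each optional accessor, for example `SafeField.orDefault(() -> entry.getAuthor(), "")`, where `entry` and `getAuthor` are placeholders for whatever object and accessor the chosen library provides. Swallowing exceptions like this can hide real bugs, so it is better suited to optional metadata than to required fields such as the link.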


[jira] Commented: (NUTCH-444) Possibly use a different library to parse RSS feed for improved performance and compatibility

ASF GitHub Bot (Jira)
In reply to this post by ASF GitHub Bot (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12472663 ]

nutch.newbie commented on NUTCH-444:
------------------------------------

Hi all:

I didn't realize that there was a version 6 patch for NUTCH-443. After applying the patch, all seems to be working. I would also like to thank Dogacan for helping me along the way. Fetching, crawling and dedup/index work just fine.

I would like to use parse-feed and help with writing/testing index-feed and query-feed, so it would be nice if committers would be kind enough to test the NUTCH-443 patch and apply it to trunk, so that work on index-feed and query-feed can begin.

If there are any more tests or anything else that you want me to perform to test NUTCH-443 or NUTCH-444, please tell me.

Regards



[jira] Assigned: (NUTCH-444) Possibly use a different library to parse RSS feed for improved performance and compatibility

ASF GitHub Bot (Jira)
In reply to this post by ASF GitHub Bot (Jira)

     [ https://issues.apache.org/jira/browse/NUTCH-444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann reassigned NUTCH-444:
---------------------------------------

    Assignee: Chris A. Mattmann


[jira] Commented: (NUTCH-444) Possibly use a different library to parse RSS feed for improved performance and compatibility

ASF GitHub Bot (Jira)
In reply to this post by ASF GitHub Bot (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12472907 ]

Nick Lothian commented on NUTCH-444:
------------------------------------

I'm a developer on the ROME project and I have done some patches to FeedParser. I've also been a long-time lurker on the Nutch lists.

To clear up a couple of misconceptions:

The current version of FeedParser is Kevin Burton's one available from http://tailrank.com/code.php. It does have Atom 1.0 support.

ROME only has a single dependency: JDom.  

Both FeedParser & ROME load the feed into a DOM before working on it. FeedParser exposes a SAX-like API, while ROME exposes objects. My tests (a while ago now, but probably still reasonable) showed little performance difference between the two libraries (See http://www.mackmo.com/syndbench/feedparserresults.html and http://www.mackmo.com/syndbench/romeresults.html).

I don't understand nutch.newbie's comments about different Atom & RSS mappings. I'm not aware of any issues with the mapping of Author. There are some docs on mappings at http://rollerweblogger.org/roller/entry/rome_0_9_beta_is, http://wiki.java.net/bin/view/Javawsxml/Rome05DateMapping and http://wiki.java.net/bin/view/Javawsxml/Rome05URIMapping.

I'd HIGHLY recommend not writing your own custom feed parser. It's a much bigger job than you'd expect. In particular, the difficulties of dealing with the bizarre things seen in real-world feeds should not be underestimated.

Apache Abdera  (http://incubator.apache.org/abdera/) is another option if anyone is just interested in Atom parsing.
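
To make the distinction between loading a feed into a DOM and reading it as a stream concrete, here is a minimal sketch that uses only the JDK's StAX API (`javax.xml.stream`), not FeedParser or ROME. The class and method names (`StreamingTitles`, `itemTitles`) are hypothetical, and the code only understands the `item` and `title` elements of a simple RSS 2.0 document. It deliberately ignores the real-world irregularities mentioned above, so it is an illustration of the streaming style rather than a usable parser.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical helper: collects item titles without building a document tree. */
final class StreamingTitles {
    private StreamingTitles() {}

    /** Returns the text of every title element found inside an item element. */
    static List<String> itemTitles(String rssXml) {
        List<String> titles = new ArrayList<>();
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(rssXml));
            boolean inItem = false;
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    String name = reader.getLocalName();
                    if ("item".equals(name)) {
                        inItem = true;
                    } else if (inItem && "title".equals(name)) {
                        // Reads the element's text and leaves the cursor on its end tag.
                        titles.add(reader.getElementText().trim());
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && "item".equals(reader.getLocalName())) {
                    inItem = false;
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("feed is not well-formed XML", e);
        }
        return titles;
    }
}
```

Only the current event is held in memory at any time, which is the property that matters for large feeds; the price is that the caller has to track context (here the `inItem` flag) by hand.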


RE: Commented: (NUTCH-444) Possibly use a different library to parse RSS feed for improved performance and compatibility

Jeremy Huylebroeck
I am using ROME in a modified version of the feedparse plugin.
It is pretty straightforward and easy.
We had issues with ROME 0.8 on Atom feeds and on some dates. ROME 0.9 resolved
those.


-----Original Message-----
From: Nick Lothian (JIRA) [mailto:[hidden email]]
Sent: Tuesday, February 13, 2007 2:35 PM
To: [hidden email]
Subject: [jira] Commented: (NUTCH-444) Possibly use a different library
to parse RSS feed for improved performance and compatibility




Commented: (NUTCH-444) Possibly use a different library to parse RSS feed for improved performance and compatibility

ASF GitHub Bot (Jira)
In reply to this post by ASF GitHub Bot (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12475794 ]

Chris A. Mattmann commented on NUTCH-444:
-----------------------------------------

Hi Nick,

Thanks for your insightful comments on this issue. I think I can summarize the discussion on this issue as follows:

1. Folks are seeing limitations in the version of commons-feedparser (0.6) used by parse-rss in the Nutch trunk
2. There are alternatives to feedparser in the form of ROME, Informa, Abdera, etc.
3. There is a newer, maintained version of Kevin Burton's feed parser that alleviates some of the limitations of feedparser (0.6) used in the Nutch trunk
4. We shouldn't be developing our own feed-parsing solution

Did I miss anything? If not, then I'm thinking the following. Perhaps we should write a transparency layer into the parse-rss plugin to select between different RSS parsing backends, such as ROME or feedparser. It probably wouldn't be too hard to write a simple transparency interface, at least to begin with. The interface would provide methods to retrieve channels and items, and would support arbitrary metadata retrieval from the underlying structures. Would this meet everyone's needs?

If not, then I have an alternate suggestion. Perhaps, at the very least, we should upgrade the version of commons-feedparser in parse-rss to the latest version from Kevin Burton? I'd also be willing to hear other suggestions...

Cheers,
  Chris
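
One possible shape for such a transparency interface is sketched below. Every name in it (`FeedBackend`, `FeedChannel`, `FeedItem`, `FeedBackends`, `StubBackend`) is hypothetical and chosen only for illustration; nothing here is taken from the parse-rss plugin, ROME or feedparser. The stub backend returns fixed data where a real backend would delegate to one of those libraries.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** One feed entry; metadata carries arbitrary extra fields such as author or date. */
final class FeedItem {
    final String title;
    final String link;
    final Map<String, String> metadata;

    FeedItem(String title, String link, Map<String, String> metadata) {
        this.title = title;
        this.link = link;
        this.metadata = Collections.unmodifiableMap(new HashMap<>(metadata));
    }
}

/** One channel (feed) together with its items. */
final class FeedChannel {
    final String title;
    final List<FeedItem> items;

    FeedChannel(String title, List<FeedItem> items) {
        this.title = title;
        this.items = Collections.unmodifiableList(new ArrayList<>(items));
    }
}

/** The abstraction a plugin would code against instead of a concrete library. */
interface FeedBackend {
    String name();
    List<FeedChannel> parse(byte[] content, String baseUrl);
}

/** Registry that selects a backend by a configured name. */
final class FeedBackends {
    private final Map<String, FeedBackend> byName = new HashMap<>();

    void register(FeedBackend backend) {
        byName.put(backend.name(), backend);
    }

    FeedBackend select(String configuredName) {
        FeedBackend backend = byName.get(configuredName);
        if (backend == null) {
            throw new IllegalArgumentException("unknown feed backend: " + configuredName);
        }
        return backend;
    }
}

/** Placeholder backend with fixed output; a real one would wrap a feed library. */
final class StubBackend implements FeedBackend {
    @Override
    public String name() {
        return "stub";
    }

    @Override
    public List<FeedChannel> parse(byte[] content, String baseUrl) {
        Map<String, String> meta = new HashMap<>();
        meta.put("author", "example");
        FeedItem item = new FeedItem("Example item", baseUrl + "/item-1", meta);
        return Collections.singletonList(
                new FeedChannel("Example channel", Collections.singletonList(item)));
    }
}
```

With this shape, the plugin would call something like `registry.select(name).parse(bytes, url)`, with the backend name coming from configuration, so switching libraries would not touch the plugin's own code. Keeping metadata as a plain string map is the simplest way to support arbitrary fields, at the cost of losing type information such as parsed dates.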



Commented: (NUTCH-444) Possibly use a different library to parse RSS feed for improved performance and compatibility

ASF GitHub Bot (Jira)
In reply to this post by ASF GitHub Bot (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12475795 ]

Renaud Richardet commented on NUTCH-444:
----------------------------------------

+1 for the transparency interface

thanks,
Renaud
