[jira] Created: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

[jira] Created: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

Sebastian Nagel (Jira)
allow parsers to return multiple Parse object, this will speed up the rss parser
--------------------------------------------------------------------------------

                 Key: NUTCH-443
                 URL: https://issues.apache.org/jira/browse/NUTCH-443
             Project: Nutch
          Issue Type: New Feature
          Components: fetcher
    Affects Versions: 0.9.0
            Reporter: Renaud Richardet
            Priority: Minor
             Fix For: 0.9.0


Allow Parser#parse to return a Map<String,Parse>. This way, the RSS parser can return multiple Parse objects, each of which will be indexed separately. Advantage: no need to fetch every feed item separately.
See the discussion at http://www.nabble.com/RSS-fecter-and-index-individul-how-can-i-realize-this-function-tf3146271.html
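
The proposal above could be sketched roughly as follows. This is only an illustration: Parse, SimpleParse, FeedItem and MultiParseSketch are hypothetical stand-ins defined here, not the real Nutch types or the signature used in any attached patch.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a parse result; NOT the real Nutch Parse interface.
interface Parse {
    String getText();
}

// Hypothetical trivial implementation that only carries extracted text.
final class SimpleParse implements Parse {
    private final String text;

    SimpleParse(String text) {
        this.text = text;
    }

    @Override
    public String getText() {
        return text;
    }
}

// Hypothetical feed item: the item's own URL plus its description.
final class FeedItem {
    final String link;
    final String description;

    FeedItem(String link, String description) {
        this.link = link;
        this.description = description;
    }
}

final class MultiParseSketch {
    // Returns one Parse per URL: one for the feed itself and one per item,
    // so each item can be indexed without being fetched separately.
    static Map<String, Parse> parseFeed(String feedUrl, List<FeedItem> items) {
        Map<String, Parse> parses = new LinkedHashMap<>();
        parses.put(feedUrl, new SimpleParse("feed with " + items.size() + " items"));
        for (FeedItem item : items) {
            parses.put(item.link, new SimpleParse(item.description));
        }
        return parses;
    }
}
```

A caller would then iterate over the map entries and index each (URL, Parse) pair on its own.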


--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

Sebastian Nagel (Jira)

     [ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dogacan Güney updated NUTCH-443:
--------------------------------

    Attachment: parse-map-core-untested.patch

[jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12471231 ]

Dogacan Güney commented on NUTCH-443:
-------------------------------------

Here is a very early patch. It is entirely untested and only changes code under src/java (so the code won't even compile :).

I am posting this because, while what we change here is trivial, it is also very intrusive (this patch is almost
700 lines long, and it doesn't even touch the plugins). So I hope that this patch can get some early review,
suggestions and corrections.

I will post the necessary changes to the plugins too, as soon as I can.

[jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12471260 ]

Dogacan Güney commented on NUTCH-443:
-------------------------------------

OK, this is the second attempt (sorry that I am sending patches in a frenzy, I will slow down now).

In the first patch, I just put a Map<String, Parse> into FetcherOutput, but that doesn't work,
since the keys are not necessarily written in sorted order.

For example, assume we have two <key, Map<String, Parse>> pairs:
<"a.com", <"z.com", some_parse>> and <"b.com", <"b.com", some_other_parse>>
With the first patch, we would first get a.com (and thus write z.com), then get b.com (and try to write b.com),
but this would fail since "b.com" < "z.com".

I completely removed the FetcherOutput class. What it does can be done by wrapping the objects
in ObjectWritable. I know this is heavier, but I couldn't think of another way around the key-ordering
issue.

I tested this a bit with a small set of URLs. Both parsing separately and parsing during fetching seem to work.
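
The ordering problem in the a.com / b.com example can be reproduced in a few lines. This is a sketch under assumptions, not code from the patch: SortedWriter is a hypothetical stand-in for any writer that requires keys in ascending order, and flattenedWrite shows one possible way around it (emit every inner key as its own record and sort before writing), which may or may not match what the patch does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical writer that rejects keys arriving out of ascending order.
final class SortedWriter {
    private String lastKey = null;
    final List<String> written = new ArrayList<>();

    void append(String key) {
        if (lastKey != null && key.compareTo(lastKey) < 0) {
            throw new IllegalStateException("key out of order: " + key + " after " + lastKey);
        }
        lastKey = key;
        written.add(key);
    }
}

final class KeyOrderSketch {
    // The two records from the example: outer key -> keys of the inner parse map.
    static Map<String, List<String>> example() {
        Map<String, List<String>> records = new LinkedHashMap<>();
        records.put("a.com", Collections.singletonList("z.com"));
        records.put("b.com", Collections.singletonList("b.com"));
        return records;
    }

    // Naive approach: visit records in outer-key order and write the inner keys
    // as they are encountered. Returns false if the writer rejects a key.
    static boolean nestedWriteSucceeds(Map<String, List<String>> records) {
        SortedWriter writer = new SortedWriter();
        try {
            for (List<String> innerKeys : records.values()) {
                for (String key : innerKeys) {
                    writer.append(key);
                }
            }
            return true;
        } catch (IllegalStateException e) {
            return false;
        }
    }

    // Alternative: flatten all inner keys into individual records, sort them,
    // and only then write them. Returns the keys in the order written.
    static List<String> flattenedWrite(Map<String, List<String>> records) {
        List<String> keys = new ArrayList<>();
        for (List<String> innerKeys : records.values()) {
            keys.addAll(innerKeys);
        }
        Collections.sort(keys);
        SortedWriter writer = new SortedWriter();
        for (String key : keys) {
            writer.append(key);
        }
        return writer.written;
    }
}
```

With the example records, the nested write is rejected when "b.com" arrives after "z.com", while the flattened write succeeds.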

[jira] Updated: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)

     [ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dogacan Güney updated NUTCH-443:
--------------------------------

    Attachment: parse-map-core-draft-v1.patch

[jira] Updated: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)

     [ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Renaud Richardet updated NUTCH-443:
-----------------------------------

    Attachment: parsers.diff

Great, here's my work in progress (not finished, not tested) for updating all parsers and extending the RSSparser.

[jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12471620 ]

Dogacan Güney commented on NUTCH-443:
-------------------------------------

This is pretty much the merge of our work (except parse-rss: it kept failing on something like RSSContentUtils, so it returns a single parse for now).

I also had a bug in MapWritable; this patch fixes it.

Since the code now compiles :), I ran the JUnit tests over it. TestFetcher fails for some reason; I will look into it.

Also, there is a bug in updatedb. If getParse returns keys different from content.getUrl, and these keys do not have entries in crawl_fetch, CrawlDbReducer will ignore them (assuming [correctly] that they were not fetched and there is no point in processing them). I will look into this too.

[jira] Updated: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)

     [ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dogacan Güney updated NUTCH-443:
--------------------------------

    Attachment: NUTCH-443-draft-v1.patch

[jira] Updated: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)

     [ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dogacan Güney updated NUTCH-443:
--------------------------------

    Attachment: NUTCH-443-draft-v2.patch

Small update to the patch. Now all core JUnit tests pass.

Now, a question: when posting patches to JIRA, should I attach a new
patch each time I find and fix a bug (as I do now), or should I wait until
the changes between successive patches include a couple of fixes?

[jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12471703 ]

nutch.newbie commented on NUTCH-443:
------------------------------------

I tried the patch with about 100 RSS feeds. Some problems:

1. The atom+xml content type gives trouble. I am not sure whether commons feedparser supports Atom 1.0.
2. In my case the RSS URL sometimes doesn't end with .xml or .rss, so some of the feeds got indexed the way current Nutch trunk does it, i.e. as HTML.
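
One way to avoid depending on the URL suffix, as in point 2, would be to look at the Content-Type header first and fall back to the extension only when no feed type is reported. The sketch below is an assumption for illustration (FeedDetectionSketch and looksLikeFeed are hypothetical names); it is not how Nutch selects parsers.

```java
import java.util.Locale;

final class FeedDetectionSketch {
    // Returns true if the response looks like a feed, judged first by the
    // Content-Type header and then, as a fallback, by the URL suffix.
    static boolean looksLikeFeed(String contentType, String url) {
        if (contentType != null) {
            // Drop parameters such as "; charset=utf-8" before comparing.
            String mime = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
            if (mime.equals("application/rss+xml") || mime.equals("application/atom+xml")) {
                return true;
            }
        }
        // Fallback: the suffix check described in the report, ignoring any query string.
        String path = url.toLowerCase(Locale.ROOT);
        int query = path.indexOf('?');
        if (query >= 0) {
            path = path.substring(0, query);
        }
        return path.endsWith(".rss") || path.endsWith(".xml");
    }
}
```

Servers that label feeds as text/xml or text/html would still slip through, so content sniffing might be needed on top of this.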

Just some early feedback. I will do some more testing this weekend. One question I do have: this still doesn't solve the problem of indexing just the RSS feeds. Even if I take away all my other parsers, I still need the HTML parser and index-basic. Maybe it's time for index-rss?

Cheers

[jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12471743 ]

nutch.newbie commented on NUTCH-443:
------------------------------------

After some quick research, it seems that feedparser doesn't support Atom 1.0. The comment below is not related to the API changes but rather to feedparser, which seems to be a dead end. Maybe it's time to seriously consider "Rome" (https://rome.dev.java.net/); it is being developed and has an Apache-style license. What do others think about the change?

Regards

[jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12471747 ]

Gal Nitzan commented on NUTCH-443:
----------------------------------

Actually, I tested Rome after feedparser failed with OutOfMemory. Rome has the same problem as feedparser: both convert the feed to JDOM first :(. I had to write my own implementation of the RSS parser with StAX.

Neither Rome nor feedparser could handle a feed with 100K items, which is probably not the common use case, but it is not that far-fetched either.

HTH

Gal.
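
For readers unfamiliar with the streaming approach mentioned above, here is a minimal sketch using the StAX API that ships with the JDK (javax.xml.stream). It is not Gal's implementation, which is not shown in this thread; StaxFeedSketch and forEachItemLink are hypothetical names, and only the <link> of each RSS <item> is extracted.

```java
import java.io.Reader;
import java.util.function.Consumer;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class StaxFeedSketch {
    // Streams through an RSS document and passes the <link> of every <item>
    // to the callback, without building an in-memory tree of the whole feed.
    static void forEachItemLink(Reader xml, Consumer<String> onLink) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Feeds are untrusted input, so DTD processing is switched off.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        try {
            XMLStreamReader reader = factory.createXMLStreamReader(xml);
            try {
                boolean inItem = false;
                while (reader.hasNext()) {
                    int event = reader.next();
                    if (event == XMLStreamConstants.START_ELEMENT) {
                        String name = reader.getLocalName();
                        if ("item".equals(name)) {
                            inItem = true;
                        } else if (inItem && "link".equals(name)) {
                            // getElementText() consumes the text up to the closing tag.
                            onLink.accept(reader.getElementText().trim());
                        }
                    } else if (event == XMLStreamConstants.END_ELEMENT
                            && "item".equals(reader.getLocalName())) {
                        inItem = false;
                    }
                }
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("malformed feed", e);
        }
    }
}
```

Because each link is handed to the callback as soon as it is read, memory use depends on what the callback keeps rather than on the size of the feed.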

[jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12471754 ]

nutch.newbie commented on NUTCH-443:
------------------------------------

Gal:

Thanks for the feedback and the testing you have done. If Nutch is going to be the open source version of Google, then maybe we should consider StAX. Could you please provide some info regarding your implementation, probably on the mailing list? My use case is going to be a lot more than a 100K-item feed, so I am interested to know more. I would also like to hear others' views on feedparser, besides the Apache politics :-) The big question is: can anyone use Nutch to be a Technorati or Bloglines using feedparser? It seems not.

[jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12471780 ]

Chris A. Mattmann commented on NUTCH-443:
-----------------------------------------

Nutch Newbie,

   What exactly do you mean when you mention Apache politics? Feedparser wasn't selected because it was an Apache sub-project. In fact, that's as far from the truth as possible. I selected feedparser at the time (in May 2005 or so), because it was the only one of the three RSS reading APIs (informa, feedparser and rome) that I could figure out. The time that it took me to just understand rome, and informa was far greater than the time that it took me to write the entire RSS parser using feedparser.

   That said, things may have changed in the past year and a half. Perhaps Rome provides an easier API than feedparser now. Perhaps informa is faster. I'm not exactly sure what the answers to these and other questions on this subject are. However, before anything is said about feedparser, it's only fair that the folks who wrote it get to chime in. For that matter, it would probably be a good idea to contact Kevin Burton, the lead developer of commons-feedparser, and ask him about its relationship to Rome and other APIs such as StAX or even informa...

Cheers,
  Chris


[jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12471806 ]

nutch.newbie commented on NUTCH-443:
------------------------------------

Chris:

Frankly, my comments are about feedparser, and I must say I am grateful for the RSS plugin and the hard work you put in. You decided to go with feedparser because you thought it was the correct solution. So please don't take this personally.

According to SVN

http://svn.apache.org/viewvc/jakarta/commons/dormant/feedparser/trunk/ the last update to feedparser was 12 months ago, plus there is no Atom 1.0 support. This is how I would put it, and frankly it doesn't matter:

1. The goal of Nutch is to be an open source alternative to Google.
2. You can't have a dead-end feedparser as your fundamental feed parsing solution when the project has not moved for the last 12 months! Well, go figure why people think it's Apache politics.

Sorry for the outburst. On the one hand Nutch would like to preach that it is the alternative to Google, and on the other hand it uses technology that is no longer active. That's all.



[jira] Updated: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)

     [ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dogacan Güney updated NUTCH-443:
--------------------------------

    Attachment: NUTCH-443-draft-v3.patch

New patch, containing a possible fix for the CrawlDbReducer problem.

This version finally works! (Well, not really, but I can definitely say that it almost kind of works... sometimes :)

I have two main issues with this patch:

1) If the fetcher is in parsing mode and the parse returns a SUCCESS_REDIRECT,
the fetcher handles this redirect. After this change, the fetcher checks whether the first element of parseMap.values() (whatever that may be) has a SUCCESS_REDIRECT. It is possible that a multi-entry parseMap has a parse element with a SUCCESS_REDIRECT that is not the first element. (Perhaps we can first check whether parseMap.get(originalUrl) returns a parse and, if not, use the first element of parseMap.values()?)

2) To be able to pass the fetch time to not-actually-fetched-but-generated-in-parse URLs, I first put the original fetch time into content and then pass the value in content to all elements in parseMap.values(). I guess this approach is not optimal, since it passes the fetch time around a lot.
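
The fallback suggested in point 1 could look roughly like this. It is a sketch only: the Status enum is a simplified hypothetical stand-in for the real parse status, the map holds statuses directly instead of Parse objects, and RedirectCheckSketch is not part of any attached patch.

```java
import java.util.Iterator;
import java.util.Map;

final class RedirectCheckSketch {
    // Hypothetical, simplified status; the real parse status carries more detail.
    enum Status { SUCCESS, SUCCESS_REDIRECT, FAILED }

    // Picks the status that should drive redirect handling: the entry keyed by
    // the originally fetched URL if present, otherwise the first entry in the
    // map's iteration order. Returns null for an empty map.
    static Status statusToCheck(Map<String, Status> parseMap, String originalUrl) {
        Status own = parseMap.get(originalUrl);
        if (own != null) {
            return own;
        }
        Iterator<Status> it = parseMap.values().iterator();
        return it.hasNext() ? it.next() : null;
    }

    // True only if the selected entry reports a redirect.
    static boolean shouldFollowRedirect(Map<String, Status> parseMap, String originalUrl) {
        return statusToCheck(parseMap, originalUrl) == Status.SUCCESS_REDIRECT;
    }
}
```

Note that "first entry" is only well defined for maps with a stable iteration order (for example LinkedHashMap), which is part of the concern raised above.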


[jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12471857 ]

Dogacan Güney commented on NUTCH-443:
-------------------------------------

nutch.newbie:

I fail to see what the problem is. If feedparser doesn't work for you, Nutch has a very powerful plugin API: just write another plugin that uses Rome or whatever library you prefer. If you are willing to share it, post it to JIRA with an explanation of why your plugin is better than the current one. Unless there is a license-related problem, I am sure the Nutch developers will put it in.

PS: I actually have a half-baked plugin that uses Rome, and I will work on RSS index and RSS query plugins once this issue is resolved.



[jira] Updated: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)

     [ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Renaud Richardet updated NUTCH-443:
-----------------------------------

    Attachment: NUTCH-443-draft-v4.patch

Hi Dogacan,

Thanks for merging the patches, good teamwork!

I worked on the RSS parser; it should now basically work.
All core and plugin tests now pass except TestRSSparser, which I will work on next. Once that is in place, I will have a look at the other issues with fetch time, etc.

I merged my changes with version 3 of your patch.




[jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12471878 ]

Renaud Richardet commented on NUTCH-443:
----------------------------------------

Nutch Newbie, Gal, Chris

It's great that you are discussing alternative RSS parsing libraries, but the resolution of this issue does not depend on which underlying RSS library is used in RSSParser. Would you mind moving the conversation to the new issue I created for it (NUTCH-444)? Thanks a bunch.





[jira] Assigned: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)

     [ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann reassigned NUTCH-443:
---------------------------------------

    Assignee: Chris A. Mattmann

