[jira] Created: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser


Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

Hudson (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12473184 ]

Doğacan Güney commented on NUTCH-443:
-------------------------------------

Andrzej:

Why does fetcher need to synchronize? Why does the order in which the fetcher outputs <key, value> pairs matter?

Sami:

> I opened an issue for this NUTCH-434 and I am now recommending that the patch in this issue
> doesn't try to take the world in one piece :)

Right. I just realized how much this patch changes, and most of those changes are not necessary for the proposed API change. So I am going to post a version that uses ObjectWritable in Fetcher, doesn't remove FetcherOutputFormat, and only changes parse-rss so that it works with the new API (sorry about that, Renaud, but parse-rss can be updated after this patch).

> allow parsers to return multiple Parse object, this will speed up the rss parser
> --------------------------------------------------------------------------------
>
>                 Key: NUTCH-443
>                 URL: https://issues.apache.org/jira/browse/NUTCH-443
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher
>    Affects Versions: 0.9.0
>            Reporter: Renaud Richardet
>         Assigned To: Chris A. Mattmann
>            Priority: Minor
>             Fix For: 0.9.0
>
>         Attachments: NUTCH-443-draft-v1.patch, NUTCH-443-draft-v2.patch, NUTCH-443-draft-v3.patch, NUTCH-443-draft-v4.patch, NUTCH-443-draft-v5.patch, NUTCH-443-draft-v6.patch, parse-map-core-draft-v1.patch, parse-map-core-untested.patch, parsers.diff
>
>
> allow Parser#parse to return a Map<String,Parse>. This way, the RSS parser can return multiple parse objects, that will all be indexed separately. Advantage: no need to fetch all feed-items separately.
> see the discussion at http://www.nabble.com/RSS-fecter-and-index-individul-how-can-i-realize-this-function-tf3146271.html

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Updated: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

Hudson (Jira)
In reply to this post by Hudson (Jira)

     [ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney updated NUTCH-443:
--------------------------------

    Attachment: NUTCH-443-draft-v7.patch


Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

Hudson (Jira)
In reply to this post by Hudson (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12473380 ]

Andrzej Bialecki  commented on NUTCH-443:
-----------------------------------------

> Why does fetcher need to synchronize? Why does the order in which the fetcher outputs <key, value> pairs matter?

You are right, I've been spending too much time with 0.7 branch lately ... I can't see any need for that either.

Regarding the ObjectWritable: since in this case all data is composed of Writables I think it's still better to use GenericWritable, because it saves some bytes on intermediate data.
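
The size difference mentioned above can be illustrated with a minimal sketch. It assumes that a name-tagged wrapper writes the class name as a string with every record, while an index-tagged wrapper writes a single byte that points into a fixed table of registered classes. The class WrapperOverheadSketch and its methods are hypothetical and are not Hadoop code; the real wrappers may write additional data.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical illustration only; this is not Hadoop's ObjectWritable or GenericWritable.
class WrapperOverheadSketch {

  // Name-tagged wrapper: every record carries the class name as a string.
  static int sizeTaggedByName(String className, byte[] payload) {
    try {
      ByteArrayOutputStream buf = new ByteArrayOutputStream();
      DataOutputStream out = new DataOutputStream(buf);
      out.writeUTF(className); // 2-byte length prefix plus the encoded name
      out.write(payload);
      out.flush();
      return buf.size();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // Index-tagged wrapper: every record carries one byte that selects a registered class.
  static int sizeTaggedByIndex(int typeIndex, byte[] payload) {
    try {
      ByteArrayOutputStream buf = new ByteArrayOutputStream();
      DataOutputStream out = new DataOutputStream(buf);
      out.writeByte(typeIndex); // single byte of type information
      out.write(payload);
      out.flush();
      return buf.size();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    byte[] payload = new byte[16]; // stand-in for an already serialized value
    System.out.println("tagged by name:  "
        + sizeTaggedByName("org.apache.nutch.crawl.CrawlDatum", payload) + " bytes");
    System.out.println("tagged by index: "
        + sizeTaggedByIndex(0, payload) + " bytes");
  }
}
```

The per-record saving is small, but intermediate map output contains one such tag per value, so it adds up over a large segment.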


Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

Hudson (Jira)
In reply to this post by Hudson (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12473383 ]

Doğacan Güney commented on NUTCH-443:
-------------------------------------

> Regarding the ObjectWritable: since in this case all data is composed of Writables I think it's still better to use GenericWritable,
> because it saves some bytes on intermediate data.

Don't get me wrong, I agree with you that GenericWritable is better. The problem is that the fetcher may output a Parse object (thus a ParseData object), so it needs a wrapper that can inject configuration. Once Nutch has such a mechanism I'll be happy to provide a patch that removes ObjectWritable usage here and in Indexer.


Work started: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

Hudson (Jira)
In reply to this post by Hudson (Jira)

     [ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Work on NUTCH-443 started by Chris A. Mattmann.


Updated: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

Hudson (Jira)
In reply to this post by Hudson (Jira)

     [ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated NUTCH-443:
------------------------------------

    Attachment: NUTCH-443.022507.patch.txt

Hi Folks,

  Attached is a candidate patch for committal, prepared by Doğacan and myself. If there are no objections within the next 2 days, I will commit this patch to the trunk. Please test it out for yourselves, and thanks!

Cheers,
  Chris



Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

Hudson (Jira)
In reply to this post by Hudson (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12475836 ]

nutch.newbie commented on NUTCH-443:
------------------------------------

Thank you very much Chris! I was looking forward to this patch.

I have made a first trial with a limited set of urls and found no problems. I will be doing some larger tests today and will report back if anything strange happens.

Thanks again



Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

Hudson (Jira)
In reply to this post by Hudson (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12476297 ]

Andrzej Bialecki  commented on NUTCH-443:
-----------------------------------------

Overall the idea of this improvement looks very useful, but I'm -1 on this patch as it looks right now (only now I had a chance to review it in detail). I'd like to see the following issues addressed in this patch before it's committed:

* in my opinion it's easier to add missing CrawlDatum's (with correctly set fetch time) for the new urls to the output rather than work-around this by passing around the fetch time in metadata, and then again compensating in Indexer and CrawlDbReducer for the lack of these fetchDatum-s ..

* in Fetcher / Fetcher2 you don't pass the signature in case when there is no valid Parse output, but in the current versions of Fetchers the signature is still calculated and passed in datum.setSignature() (which ends up in crawl_fetch).

* using a generic Map<String, Parse> is IMHO inappropriate, as I indicated earlier, especially since this Map requires special post-processing in ParseUtil.processParseMap - and what would happen if I didn't use ParseUtil? I think this calls for a special-purpose class (ParseResult?), which would encapsulate this behavior without exposing it to its users (or even worse - allowing users to bypass it). This class would also help us to avoid somewhat ugly "convenience" methods in ParseStatus and ParseImpl - these details would be hidden in one of the constructors of ParseResult.

* I'm also not sure why we use Map<String, Parse> and not Map<Text, Parse>, since in all further processing we need to create Text objects ...

* the new section in HtmlParseFilters breaks the loop on encountering the first error, and leaves the parse results incompletely filtered. It should simply continue - the result is an aggregation of more or less independent documents that are parsed on their own.

* the comment about redirects in Parser.java is misplaced - I think this contract should be both defined and enforced in the Fetcher.
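
A minimal sketch of what the special-purpose class suggested in the list above might look like, assuming it is keyed by url and remembers which url was originally fetched. SimpleParse and ParseResultSketch are hypothetical names; String keys are used here only to keep the example self-contained, whereas the discussion above argues for Text keys.

```java
import java.util.Collections;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for Nutch's Parse; it carries only what this sketch needs.
class SimpleParse {
  final String text;
  final boolean success;

  SimpleParse(String text, boolean success) {
    this.text = text;
    this.success = success;
  }
}

// Sketch of a special-purpose result container (names and methods are assumptions).
class ParseResultSketch implements Iterable<Map.Entry<String, SimpleParse>> {
  private final String originalUrl;
  private final Map<String, SimpleParse> parses = new LinkedHashMap<>();

  ParseResultSketch(String originalUrl) {
    this.originalUrl = originalUrl;
  }

  // Convenience factory for the common single-document case, so callers
  // do not need helper methods on the parse classes themselves.
  static ParseResultSketch single(String url, SimpleParse parse) {
    ParseResultSketch result = new ParseResultSketch(url);
    result.put(url, parse);
    return result;
  }

  void put(String url, SimpleParse parse) {
    parses.put(url, parse);
  }

  // True when the entry belongs to a sub-document rather than the fetched url.
  boolean isSubUrl(String url) {
    return !originalUrl.equals(url);
  }

  int size() {
    return parses.size();
  }

  // Read-only iteration: entries can only be added through put().
  @Override
  public Iterator<Map.Entry<String, SimpleParse>> iterator() {
    return Collections.unmodifiableMap(parses).entrySet().iterator();
  }
}
```

Because the map is private, any post-processing the framework requires can live inside the class instead of in a separate utility that callers might forget to invoke.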


And finally, I think this is a significant change in how content parsers work with the rest of the framework, so we should hold this patch until after the 0.9 release - and we should push 0.9 out of the door really soon ...


Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

Hudson (Jira)
In reply to this post by Hudson (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12476357 ]

Doğacan Güney commented on NUTCH-443:
-------------------------------------

Hi Andrzej,

> * in my opinion it's easier to add missing CrawlDatum's (with correctly set fetch time) for the new urls to the
> output rather than work-around this by passing around the fetch time in metadata, and then again
> compensating in Indexer and CrawlDbReducer for the lack of these fetchDatum-s.

I guess what you mean is pushing STATUS_FETCH_SUCCESS datums to crawl_parse, right? I can probably do this by changing ParseImpl and adding a new boolean to it that indicates whether it is fetched or freshly generated in parse.

> * in Fetcher / Fetcher2 you don't pass the signature in case when there is no valid Parse output, but in the
> current versions of Fetchers the signature is still calculated and passed in datum.setSignature() (which ends
> up in crawl_fetch).

OK, I will fix it.

> * using a generic Map<String, Parse> is IMHO inappropriate, as I indicated earlier, especially since this Map
> requires special post-processing in ParseUtil.processParseMap - and what would happen if I didn't use
> ParseUtil? I think this calls for a special-purpose class (ParseResult?), which would encapsulate this
> behavior without exposing it to its users (or even worse - allowing users to bypass it). This class would also
> help us to avoid somewhat ugly "convenience" methods in ParseStatus and ParseImpl - these details would
> be hidden in one of the constructors of ParseResult.

> * I'm also not sure why we use Map<String, Parse> and not Map<Text, Parse>, since in all further
> processing we need to create Text objects ...

If we are going with a special-purpose class, there is one more thing I would like to change. Consider the case of a zip archive with url http://foo.bar/baz.zip that contains two files spam.txt, egg.txt. After parsing this you will return something like <key1, parse of spam.txt>, <key2, parse of egg.txt> and perhaps <original_url, who knows what>.

Now, whatever key1 and key2 are, they are not really urls to be fetched. So I want to add another fetch and db status (let's call them STATUS_FETCH_FAKE and STATUS_DB_FAKE). During parse, key1 and key2 will be written with FETCH_FAKE, and updatedb will write them as DB_FAKE to crawldb. Nutch will still index things with FAKE status, but generate will never generate them to be fetched. And updatedb will never change their status to DB_UNFETCHED (since, as I said before, they can't be fetched).

So, ParseResult will contain a group of <'real' url, parse> and a group of <'phony' url, parse> pairs.

What do you think?
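
The proposed rules can be sketched as follows. This is only an illustration under stated assumptions: the crawldb is modeled as a plain map, generate ignores fetch intervals and scoring, and the DbStatus enum, the method names, and the "!/" style keys in the usage note are all hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of the proposed FAKE status; not Nutch code.
class FakeStatusSketch {
  enum DbStatus { DB_UNFETCHED, DB_FETCHED, DB_FAKE }

  // generate: build a fetch list, never selecting DB_FAKE entries.
  static List<String> generate(Map<String, DbStatus> crawlDb) {
    List<String> fetchList = new ArrayList<>();
    for (Map.Entry<String, DbStatus> e : crawlDb.entrySet()) {
      if (e.getValue() != DbStatus.DB_FAKE) {
        fetchList.add(e.getKey());
      }
    }
    return fetchList;
  }

  // updatedb: apply an incoming status, but never overwrite DB_FAKE.
  static void update(Map<String, DbStatus> crawlDb, String url, DbStatus incoming) {
    if (crawlDb.get(url) == DbStatus.DB_FAKE) {
      return; // phony urls cannot be fetched, so their status stays put
    }
    crawlDb.put(url, incoming);
  }

  // Indexing still accepts FAKE entries alongside fetched ones.
  static boolean isIndexable(DbStatus status) {
    return status == DbStatus.DB_FETCHED || status == DbStatus.DB_FAKE;
  }

  public static void main(String[] args) {
    Map<String, DbStatus> db = new LinkedHashMap<>();
    db.put("http://foo.bar/baz.zip", DbStatus.DB_FETCHED);
    db.put("http://foo.bar/baz.zip!/spam.txt", DbStatus.DB_FAKE); // made-up key
    db.put("http://foo.bar/baz.zip!/egg.txt", DbStatus.DB_FAKE);  // made-up key
    System.out.println(generate(db));
  }
}
```

With the zip example, only http://foo.bar/baz.zip is ever eligible for fetching, while all three entries remain indexable.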

> * the new section in HtmlParseFilters breaks the loop on encountering the first error, and leaves the parse
> results incompletely filtered. It should simply continue - the result is an aggregation of more or less
> independent documents that are parsed on their own.

This is the same as the old behavior. Why change it? (There was a bug there, but I fixed it in one of the newer patches.)

> * the comment about redirects in Parser.java is misplaced - I think this contract should be both defined and
> enforced in the Fetcher.

OK.

> And finally, I think this is a significant change in how content parsers work with the rest of the
> framework, so we should hold this patch until after the 0.9 release - and we should push 0.9 out of the door
> really soon ...

Anything to get 0.9 out of the door:)

I will send an updated patch that fixes 1,2 and 4 (and 5 if I am missing something there) tomorrow unless someone beats me to it. I want to hear what others think on 3 before doing anything.

Thanks for your review and comments.



Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

Hudson (Jira)
In reply to this post by Hudson (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12476361 ]

nutch.newbie commented on NUTCH-443:
------------------------------------

Hi:

We were really counting on this patch making it to trunk, as our site launch depends on it. This patch lets us complete Nutch-444. However, I don't have enough knowledge about the inner workings of the patch to comment. I can only say that I tried it on a large set of seeds and it works without error.

Regarding the 0.9 release ... it's been months since it was discussed on the list, and it is not possible to predict when the 0.9 release will take place. What I worry about is that, like many other patches, this patch will also die out, which is sad. I tend not to use code that is not in the trunk, so it's a big loss for me because my site needs to be launched. Anyway, that's my headache :-(

Regards




Updated: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

Hudson (Jira)
In reply to this post by Hudson (Jira)

     [ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney updated NUTCH-443:
--------------------------------

    Attachment: NUTCH-443.02282007.patch

Hi everyone,

Here is the updated patch.

Andrzej, I believe this patch covers all your points (except javadocs, I will update them once API/code issues are resolved). Looking forward to your (and others') review.


Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

Hudson (Jira)
In reply to this post by Hudson (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12476600 ]

Andrzej Bialecki  commented on NUTCH-443:
-----------------------------------------

Almost there ... ParseResult seemed to tidy up this patch quite a bit. Remaining issues:

* you create the "fake" CrawlDatum-s in ParseOutputFormat, and then set fetchTime to the current time. This is incorrect - parsing may have been performed long after the content was fetched. The correct place to create and store these "fake" CrawlDatum-s is in the FetcherThread.output(), where you loop through Entry<Text, Parse>, i.e.:

          long curTime = System.currentTimeMillis();
          for (Entry<Text, Parse> entry : parseResult) {
            Text k = entry.getKey();
            output.collect(k,
                new ObjectWritable(new ParseImpl(entry.getValue())));
            if (!k.equals(key)) {
              CrawlDatum fake = datum.clone();
              fake.set
              fake.setFetchTime(curTime);
              output.collect(k, new ObjectWritable(fake));
            } else {
              // save the real datum
              output.collect(k, new ObjectWritable(datum));
            }
          }

* I'm pretty sure that ParseResult.filter() must NOT be called under normal circumstances ... We need to store the information that parsing was unsuccessful - if we remove this information from the ParseResult we will never know that parsing failed for this content (or a part thereof).

* we have a backward-compatibility issue with ParseImpl.isFetched - i.e. data created with earlier versions of Nutch won't be compatible with the new format, and there is no versioning information in the already existing data. We need to do one of the following:
  - bite the bullet, and don't care about backward compatibility - not so nice ... all existing segments will have to be re-parsed. Ouch.
  - add look-ahead code to test the data coming from DataInput if it contains this boolean flag or a likely Text length - somewhat unreliable...
  - store this flag in ParseData.contentMeta - ugly hack.

Out of these three the last option seems the safest for now. From the long-term point of view we should later on add versioning information and handling of different versions in Parse.
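
A minimal sketch of the third option, assuming the content metadata can be treated as a string-to-string map. The key name and the helper methods are hypothetical. The point it illustrates is that data written before the change simply lacks the key, and a sensible default keeps it readable without any change to the serialized format.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of keeping the flag in metadata; not Nutch code.
class ContentMetaFlagSketch {
  // Made-up key name, used only for this sketch.
  static final String IS_FETCHED_KEY = "_sketch_.isFetched";

  // Writer side: record whether this parse belongs to the originally fetched url.
  static void markFetched(Map<String, String> contentMeta, boolean fetched) {
    contentMeta.put(IS_FETCHED_KEY, Boolean.toString(fetched));
  }

  // Reader side: segments written by older versions have no such key.
  // Every parse in those segments came from a fetched url, so default to true.
  static boolean isFetched(Map<String, String> contentMeta) {
    String value = contentMeta.get(IS_FETCHED_KEY);
    return value == null || Boolean.parseBoolean(value);
  }

  public static void main(String[] args) {
    Map<String, String> oldSegmentMeta = new HashMap<>();
    Map<String, String> subDocMeta = new HashMap<>();
    markFetched(subDocMeta, false);
    System.out.println("old segment entry: " + isFetched(oldSegmentMeta));
    System.out.println("new sub-document:  " + isFetched(subDocMeta));
  }
}
```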

* the name of this method Parse.isFetched is somewhat misleading - it's not about fetching or not, it's whether this Parse corresponds to the original url or to a sub-url. Perhaps isCanonical, isRoot, or some other name ...?

* in ParseSegment - what's the reason for creating a new copy of ParseImpl in this line below? I think we should store the one we already have in "parse":

      output.collect(url, new ParseImpl(new ParseText(parse.getText()),
                                        parse.getData(), parse.isFetched()));


Thank you for your perseverance!

> allow parsers to return multiple Parse object, this will speed up the rss parser
> --------------------------------------------------------------------------------
>
>                 Key: NUTCH-443
>                 URL: https://issues.apache.org/jira/browse/NUTCH-443
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher
>    Affects Versions: 0.9.0
>            Reporter: Renaud Richardet
>         Assigned To: Chris A. Mattmann
>            Priority: Minor
>             Fix For: 0.9.0
>
>         Attachments: NUTCH-443-draft-v1.patch, NUTCH-443-draft-v2.patch, NUTCH-443-draft-v3.patch, NUTCH-443-draft-v4.patch, NUTCH-443-draft-v5.patch, NUTCH-443-draft-v6.patch, NUTCH-443-draft-v7.patch, NUTCH-443.022507.patch.txt, NUTCH-443.02282007.patch, parse-map-core-draft-v1.patch, parse-map-core-untested.patch, parsers.diff
>
>
> allow Parser#parse to return a Map<String,Parse>. This way, the RSS parser can return multiple parse objects, that will all be indexed separately. Advantage: no need to fetch all feed-items separately.
> see the discussion at http://www.nabble.com/RSS-fecter-and-index-individul-how-can-i-realize-this-function-tf3146271.html

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

Hudson (Jira)
In reply to this post by Hudson (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12476611 ]

Doğacan Güney commented on NUTCH-443:
-------------------------------------

> * you create the "fake" CrawlDatum-s in ParseOutputFormat, and then set fetchTime to the current time. This is incorrect -
> parsing may have been performed long after the content was fetched. The correct place to create and store these "fake"
> CrawlDatum-s is in the FetcherThread.output(), where you loop through Entry<Text, Parse>, i.e.:

What if I run my fetcher in non-parsing mode (which, as it happens, is what I always do)? I can add the code to the fetcher, but the fetch time will still be wrong when parsing runs as a separate step. I guess I will have to put FETCH_TIME_KEY back in. What do you think? Is there a better way to handle this?

> * I'm pretty sure that ParseResult.filter() must NOT be called under normal circumstances ... We need to store the information
> that parsing was unsuccessful - if we remove this information from the ParseResult we will never know that parsing failed for
> this  content (or a part thereof).

The current code does not store unsuccessful parses. Take ParseSegment: it only emits output if the parse status is success. So Nutch already discards this information; I just changed the place where it is discarded. My approach is cleaner (IMO), but I don't feel strongly about it, so I can change it.

> * we have a backward-compatibility issue with ParseImpl.isFetched - i.e. data created with earlier versions of Nutch won't be
> compatible with the new format, and there is no versioning information in the already existing data. We need to do one of the
> following:
>  - bite the bullet, and don't care about backward compatibility - not so nice ... all existing segments will have to be re-parsed.
> Ouch.
>  - add look-ahead code to test the data coming from DataInput if it contains this boolean flag or a likely Text length -
> somewhat unreliable...
>  - store this flag in ParseData.contentMeta - ugly hack.

> Out of these three the last option seems the safest for now. From the long-term point of view we should later on add
> versioning information and handling of different versions in Parse.

Parse (actually ParseImpl) is used as a temporary data structure to pass data from ParseSegment.map to ParseSegment.reduce (or the equivalent in Fetcher, but you get the point). So, unless someone has stored the temporary outputs of ParseSegment.map and wants to reduce them with this patch applied, I don't see what can go wrong. ParseOutputFormat writes the parse text and parse data and doesn't care about what else is in there.

> * the name of this method Parse.isFetched is somewhat misleading - it's not about fetching or not, it's whether this Parse
> corresponds to the original url or to a sub-url. Perhaps isCanonical, isRoot, or some other name ...?

Giving names to things is hard. Usually harder than creating them :). Will think of something here.

> * in ParseSegment - what's the reason for creating a new copy of ParseImpl in this line below? I think we should store the one
> we already have in "parse":

That's because the return type of Parser.getParse is Parse, not ParseImpl, and Parse is not Writable. So I take the non-Writable Parse and create a Writable ParseImpl from it.

This is almost certainly not necessary, though. I will check this and update the patch.
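
One way to avoid the extra copy could look like the sketch below: reuse the object when it is already the writable implementation, and wrap it only otherwise. SketchParse, SketchWritable, SketchParseImpl, and ParseCopySketch are hypothetical stand-ins made up for this example; the real Nutch Parse and ParseImpl have more members, and the patch may end up handling this differently.

```java
// Hypothetical stand-ins; the real Nutch Parse and ParseImpl have more members.
interface SketchParse {
    String getText();
}

// Marker standing in for Hadoop's Writable in this sketch.
interface SketchWritable {
}

final class SketchParseImpl implements SketchParse, SketchWritable {
    private final String text;

    SketchParseImpl(String text) {
        this.text = text;
    }

    @Override
    public String getText() {
        return text;
    }
}

final class ParseCopySketch {
    // Reuse the instance when it is already writable; copy only when it is not.
    static SketchParseImpl asWritable(SketchParse parse) {
        if (parse instanceof SketchParseImpl) {
            return (SketchParseImpl) parse;
        }
        return new SketchParseImpl(parse.getText());
    }
}
```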

> Thank you for your perseverance!

Sure, I just want to get this patch out of the way, so I can bug you all with my other patches :).

I will not send another patch yet: I need some guidance on 1, I don't think that 2 and 3 are issues (but feel free to prove me wrong), and 4 and 5 are easy to solve.

Thanks again for your review.


Re: log guards

Dennis Kubes
In reply to this post by Jérôme Charron
I can also work on this. Chris, do you want me to do it, or do you want to
coordinate our efforts?

Dennis Kubes

Jérôme Charron wrote:

> Hi Chris,
>
> The JIRA issue is the 309 : https://issues.apache.org/jira/browse/NUTCH-309
> Thanks for your help.
>
> Jérôme
>
> On 2/13/07, Chris Mattmann <[hidden email]> wrote:
>>
>> Hi Doug, and Jerome,
>>
>>   Ah, yes, the log guard conversation. I remember this from a while back.
>> Hmmm, do you guys know what issue that this recorded as in JIRA? I have some
>> free time recently, so I will be able to add this to my list of Nutch stuff
>> to work on, and would be happy to take the lead on removing the guards where
>> needed, and reviewing whether or not the debug ones make sense where they are.
>>
>> Cheers,
>>   Chris
>>
>>
>>
>> On 2/13/07 11:17 AM, "Jérôme Charron" <[hidden email]> wrote:
>>
>> >> These guards were all introduced by a patch some time ago.  I complained
>> >> at the time and it was promised that this would be repaired, but it has
>> >> not yet been.
>> >
>> > Yes, Sorry Doug that's my own fault....
>> > I really don't have time to fix this   :-(
>> >
>> > Best regards
>> >
>> > Jérôme
>>
>> ______________________________________________
>> Chris A. Mattmann
>> [hidden email]
>> Staff Member
>> Modeling and Data Management Systems Section (387)
>> Data Management Systems and Technologies Group
>>
>> _________________________________________________
>> Jet Propulsion Laboratory            Pasadena, CA
>> Office: 171-266B                        Mailstop:  171-246
>> _______________________________________________________
>>
>> Disclaimer:  The opinions presented within are my own and do not reflect
>> those of either NASA, JPL, or the California Institute of Technology.
>>
>>
>>
>

Re: log guards

chrismattmann
Hi Dennis,

  I'd be happy to: please contact me off list ([hidden email]),
and let's chat :-)

Cheers,
  Chris



On 2/28/07 7:38 AM, "Dennis Kubes" <[hidden email]> wrote:

> I can also work on this, Chris do you want me to do it or do you want to
> coordinate our efforts?
>
> Dennis Kubes

______________________________________________
Chris A. Mattmann
[hidden email]
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group

_________________________________________________
Jet Propulsion Laboratory            Pasadena, CA
Office: 171-266B                        Mailstop:  171-246
_______________________________________________________

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.



Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

Hudson (Jira)
In reply to this post by Hudson (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12476635 ]

Andrzej Bialecki  commented on NUTCH-443:
-----------------------------------------

Re: the "fake" CrawlDatum-s: this looks ugly no matter which way we look at it ... :| It appears you were right from the start, FETCH_TIME_KEY seems to be the lesser evil at the moment.
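
The idea behind keeping FETCH_TIME_KEY can be sketched as follows: the fetch step records the fetch time in the content metadata, and the parse step reads it back instead of using the current time. This is only an illustration under assumptions: the metadata is modelled as a plain Map, the class FetchTimeSketch is hypothetical, and the key's string value is made up because the thread does not state it.

```java
import java.util.Map;

final class FetchTimeSketch {
    // Hypothetical key value; the thread names FETCH_TIME_KEY but not its value.
    static final String FETCH_TIME_KEY = "sketch.fetch.time";

    // Fetch stage: record when the content was actually fetched.
    static void recordFetchTime(Map<String, String> contentMeta, long fetchTimeMillis) {
        contentMeta.put(FETCH_TIME_KEY, Long.toString(fetchTimeMillis));
    }

    // Parse stage (possibly much later): recover the recorded fetch time and
    // use the fallback only when the key is missing.
    static long fetchTimeFor(Map<String, String> contentMeta, long fallbackMillis) {
        String stored = contentMeta.get(FETCH_TIME_KEY);
        return stored == null ? fallbackMillis : Long.parseLong(stored);
    }
}
```

With this arrangement the result is the same whether parsing happens inside the fetcher or in a separate step long after the fetch.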

Re: ParseResult.filter(): indeed - in fact, there is an inconsistency between what Fetcher does and what ParseSegment does. Fetcher actually stores the information about failed parsing - I had an impression that ParseSegment does this too. IMHO it's a good opportunity to fix this so that it works the same way in both places. Currently this information is used only in SegmentReader to provide the info about the total numbers of generated, fetched and parsed urls. However, other tools may use it to determine the failure rate of a specific parser ... so I would hate to discard it.
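
To show why keeping the failed-parse records matters, here is a small sketch of the kind of statistic a tool could compute from them. The class ParseStatusSketch and the representation of each parse outcome as a Boolean are assumptions made for this example; Nutch's own status objects carry more detail.

```java
import java.util.List;

final class ParseStatusSketch {
    // Fraction of parse attempts that failed; 0.0 when there were no attempts.
    // Each list element is true for a successful parse and false for a failed one.
    static double failureRate(List<Boolean> successFlags) {
        if (successFlags.isEmpty()) {
            return 0.0;
        }
        long failed = successFlags.stream().filter(ok -> !ok).count();
        return (double) failed / successFlags.size();
    }
}
```

If failed parses were filtered out before being stored, the list would contain only successes and this rate would always come out as zero.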

Re: ParseImpl.isFetched compat issue - I was wrong here. That's a relief - I hate such complications ...

Thanks!



Updated: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

Hudson (Jira)
In reply to this post by Hudson (Jira)

     [ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney updated NUTCH-443:
--------------------------------

    Attachment: NUTCH-443.02282007-v2.patch

Yet another patch.

ParseResult.filter is out and Nutch no longer discards unsuccessful parses.

FETCH_TIME_KEY is back in.

isFetched is now isCanonical.

About avoiding constructing a new ParseImpl in output.collect: this doesn't work, because getEmptyParse (which is used to empty out a parse if its status is not success) returns a Parse and not a ParseImpl. It can be fixed later by changing getEmptyParse and EmptyParseImpl.

Javadocs are still not updated. I will do this once everything else is done (which I hope is soon).



[jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

Hudson (Jira)
In reply to this post by Hudson (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12493977 ]

Antonio Eggberg commented on NUTCH-443:
---------------------------------------

Hello:

I could really benefit from this patch, so I am trying to find out whether the provided patch will work with Nutch trunk. Is there any update or info?

Thanks

> allow parsers to return multiple Parse object, this will speed up the rss parser
> --------------------------------------------------------------------------------
>
>                 Key: NUTCH-443
>                 URL: https://issues.apache.org/jira/browse/NUTCH-443
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher
>    Affects Versions: 0.9.0
>            Reporter: Renaud Richardet
>         Assigned To: Chris A. Mattmann
>            Priority: Minor
>             Fix For: 1.0.0
>
>         Attachments: NUTCH-443-draft-v1.patch, NUTCH-443-draft-v2.patch, NUTCH-443-draft-v3.patch, NUTCH-443-draft-v4.patch, NUTCH-443-draft-v5.patch, NUTCH-443-draft-v6.patch, NUTCH-443-draft-v7.patch, NUTCH-443.022507.patch.txt, NUTCH-443.02282007-v2.patch, NUTCH-443.02282007.patch, parse-map-core-draft-v1.patch, parse-map-core-untested.patch, parsers.diff
>
>
> allow Parser#parse to return a Map<String,Parse>. This way, the RSS parser can return multiple parse objects, that will all be indexed separately. Advantage: no need to fetch all feed-items separately.
> see the discussion at http://www.nabble.com/RSS-fecter-and-index-individul-how-can-i-realize-this-function-tf3146271.html


[jira] Updated: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

Hudson (Jira)
In reply to this post by Hudson (Jira)

     [ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney updated NUTCH-443:
--------------------------------

    Attachment: NUTCH-443.08052007.patch

Patch updated to latest trunk.



[jira] Assigned: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

Hudson (Jira)
In reply to this post by Hudson (Jira)

     [ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrzej Bialecki  reassigned NUTCH-443:
---------------------------------------

    Assignee: Andrzej Bialecki   (was: Chris A. Mattmann)

