[jira] Created: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser


[jira] Resolved: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

Hudson (Jira)

     [ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrzej Bialecki  resolved NUTCH-443.
-------------------------------------

    Resolution: Fixed

Committed in rev. 536606. Big thanks to all who contributed to this patch!

> allow parsers to return multiple Parse object, this will speed up the rss parser
> --------------------------------------------------------------------------------
>
>                 Key: NUTCH-443
>                 URL: https://issues.apache.org/jira/browse/NUTCH-443
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher
>    Affects Versions: 0.9.0
>            Reporter: Renaud Richardet
>         Assigned To: Andrzej Bialecki
>            Priority: Minor
>             Fix For: 1.0.0
>
>         Attachments: NUTCH-443-draft-v1.patch, NUTCH-443-draft-v2.patch, NUTCH-443-draft-v3.patch, NUTCH-443-draft-v4.patch, NUTCH-443-draft-v5.patch, NUTCH-443-draft-v6.patch, NUTCH-443-draft-v7.patch, NUTCH-443.022507.patch.txt, NUTCH-443.02282007-v2.patch, NUTCH-443.02282007.patch, NUTCH-443.08052007.patch, parse-map-core-draft-v1.patch, parse-map-core-untested.patch, parsers.diff
>
>
> allow Parser#parse to return a Map<String,Parse>. This way, the RSS parser can return multiple parse objects, that will all be indexed separately. Advantage: no need to fetch all feed-items separately.
> see the discussion at http://www.nabble.com/RSS-fecter-and-index-individul-how-can-i-realize-this-function-tf3146271.html
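
The proposal quoted above could be sketched roughly as follows. This is a minimal, self-contained illustration and not the Nutch API: SimpleParse, FeedItem and parseFeed are hypothetical stand-ins for Nutch's Parse objects and the RSS parser plugin. Only the idea of returning one parse per feed item, keyed by URL in a Map<String,Parse>, comes from the issue description.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a Nutch Parse: just a title and the extracted text.
final class SimpleParse {
    final String title;
    final String text;

    SimpleParse(String title, String text) {
        this.title = title;
        this.text = text;
    }
}

// Hypothetical feed entry, as it might come out of an RSS library.
final class FeedItem {
    final String link;
    final String title;
    final String description;

    FeedItem(String link, String title, String description) {
        this.link = link;
        this.title = title;
        this.description = description;
    }
}

final class MultiParseSketch {
    // One fetched feed yields one parse per item, keyed by the item's URL,
    // so each item can be indexed separately without being fetched on its own.
    static Map<String, SimpleParse> parseFeed(List<FeedItem> items) {
        Map<String, SimpleParse> parses = new LinkedHashMap<>();
        for (FeedItem item : items) {
            parses.put(item.link, new SimpleParse(item.title, item.description));
        }
        return parses;
    }

    public static void main(String[] args) {
        List<FeedItem> items = Arrays.asList(
            new FeedItem("http://example.com/a", "First item", "text of a"),
            new FeedItem("http://example.com/b", "Second item", "text of b"));
        for (Map.Entry<String, SimpleParse> e : parseFeed(items).entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue().title);
        }
    }
}
```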

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

Hudson (Jira)
In reply to this post by Hudson (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12495357 ]

Doğacan Güney commented on NUTCH-443:
-------------------------------------

Well... That's embarrassing. It seems I forgot to include the necessary changes to Indexer. Indexer has to read crawl_parse too, so that it can pick up sub-urls' fetch datums.

So, that seemed easy (just a couple of lines) but then I realized that there is another bug. (Which, in my defense, was present in Nutch before 443. So the bug was there, I only made it worse:)

It is a bit difficult to describe, so please bear with me. The problem goes like this:

In fetcher, if max.redirect is 0, Nutch pushes an empty Content to content and a LINKED datum to crawl_fetch (let's call this url foo). ParseSegment parses the empty Content and creates a parse data and an empty parse text. After updatedb and one more generate-fetch-parse-updatedb cycle, we now have a proper content, parse text and parse data for foo in the new segment.

Now, assume I index both of these segments together. Url foo will have two sets of (fetch datum, parse), one coming from the first segment, the other coming from the second segment. Since the first fetch datum is LINKED, this code in Indexer.reduce will cause foo to be discarded:

    if (redir != null) {
      // XXX page was redirected - what should we do?
      // XXX discard it for now
      return;
    }

And it doesn't work if we just remove this code. Remember that foo has two sets of (fetch datum, parse) and one of the parses contains an empty parse text. Since Indexer, in reduce, will randomly choose one of the parses, it is likely that we will get an empty parse text for url foo.

This is the part that I made worse: since Indexer has to read crawl_parse, it will get a lot of STATUS_LINKED datums (which are written to crawl_parse as outlinks) and discard a lot of useful pages in any multi-segment index job.
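
The two failure modes described above can be reproduced in a small, self-contained sketch. It does not use the real Indexer or CrawlDatum classes: ReduceSketch, Value and Status are hypothetical simplifications, and "the last parse text seen wins" is an assumption standing in for the unspecified order in which reduce receives its values.

```java
import java.util.Arrays;
import java.util.List;

final class ReduceSketch {
    enum Status { LINKED, FETCH_SUCCESS }

    // Hypothetical simplified value: either a fetch datum or a parse text.
    static final class Value {
        final Status status;     // non-null for a fetch datum
        final String parseText;  // non-null for a parse text

        private Value(Status status, String parseText) {
            this.status = status;
            this.parseText = parseText;
        }

        static Value datum(Status status) { return new Value(status, null); }
        static Value text(String text) { return new Value(null, text); }
    }

    // Returns the text that would be indexed for one URL, or null if the URL is discarded.
    static String reduce(List<Value> values, boolean discardOnRedirect) {
        Value redir = null;
        String parseText = null;
        for (Value v : values) {
            if (v.status == Status.LINKED) {
                redir = v;                // looks like a redirect
            } else if (v.parseText != null) {
                parseText = v.parseText;  // assumption: the last parse text seen wins
            }
        }
        if (discardOnRedirect && redir != null) {
            return null;                  // mirrors the "discard it for now" branch
        }
        return parseText;
    }

    public static void main(String[] args) {
        // All values for url foo, collected from two segments indexed together.
        List<Value> foo = Arrays.asList(
            Value.datum(Status.LINKED), Value.text(""),                        // first segment
            Value.datum(Status.FETCH_SUCCESS), Value.text("real page text"));  // second segment
        List<Value> fooOtherOrder = Arrays.asList(
            foo.get(2), foo.get(3), foo.get(0), foo.get(1));

        System.out.println("with redirect check: " + reduce(foo, true));
        System.out.println("check removed: [" + reduce(foo, false) + "]");
        System.out.println("check removed, other order: [" + reduce(fooOtherOrder, false) + "]");
    }
}
```

With the check in place foo is dropped entirely; with the check removed, the result depends on which segment's parse text happens to arrive last.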

Sorry if the description is unnecessarily complex.




[jira] Updated: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

Hudson (Jira)
In reply to this post by Hudson (Jira)

     [ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney updated NUTCH-443:
--------------------------------

    Attachment: redirect_and_index.patch

Patch for the problem.

Now, if Fetcher gets a null content, instead of pushing an empty content, it filters null content.

It may change the semantics very slightly, but I don't think that it will be a problem. Before this patch, Fetcher creates an empty content, then passes the score from datum to content. Parse then passes it from content to parse data so that it can distribute the score to outlinks. But empty pages don't have outlinks anyway and they should not be indexed (so an adjust datum has no purpose).
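
As a rough illustration of the behaviour change (not the actual patch), the sketch below always records the fetch datum but only stores content when it is non-null. FetcherOutputSketch, its two lists and the status strings are hypothetical in-memory stand-ins for the Fetcher output path and the segment's crawl_fetch and content directories.

```java
import java.util.ArrayList;
import java.util.List;

final class FetcherOutputSketch {
    // Hypothetical in-memory stand-ins for the segment directories.
    final List<String> crawlFetch = new ArrayList<>();
    final List<String> contentDir = new ArrayList<>();

    void output(String url, String status, byte[] content) {
        // The fetch datum is always recorded.
        crawlFetch.add(url + " " + status);
        // Null content is filtered instead of being replaced by an empty Content,
        // so no empty parse text can be produced for this url later on.
        if (content != null) {
            contentDir.add(url);
            // Parsing done by the fetcher would also belong in this branch.
        }
    }

    public static void main(String[] args) {
        FetcherOutputSketch fetcher = new FetcherOutputSketch();
        fetcher.output("http://example.com/foo", "LINKED", null);
        fetcher.output("http://example.com/bar", "FETCH_SUCCESS", new byte[] {1, 2, 3});
        System.out.println("crawl_fetch entries: " + fetcher.crawlFetch.size());
        System.out.println("content entries: " + fetcher.contentDir.size());
    }
}
```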

Sorry about missing this bug in the first place, but, man, this is a subtle one.



[jira] Reopened: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

Hudson (Jira)
In reply to this post by Hudson (Jira)

     [ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann reopened NUTCH-443:
-------------------------------------

      Assignee: Chris A. Mattmann  (was: Andrzej Bialecki )

Per Doğacan's comment, we need to reopen this and test out his new patch for it. Andrzej, I'd be happy if you reassigned this to yourself; however, I will have some time on Tuesday to look at this if you don't get to it before then.


[jira] Updated: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

Hudson (Jira)
In reply to this post by Hudson (Jira)

     [ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrzej Bialecki  updated NUTCH-443:
------------------------------------

    Attachment: patch.txt

I'm not too happy with the direction you took in the latest patch. IMHO adding crawl_parse to Indexer only complicates things, since you need to handle additional conditions, not to mention the performance hit to process a lot more data.

Instead, why don't we simply omit storing empty content in Fetcher / Fetcher2 when we get a redirect that we don't immediately follow? This would solve the original issue.

Please see the patch to Fetcher2 that illustrates this issue.


[jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

Hudson (Jira)
In reply to this post by Hudson (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12495696 ]

Doğacan Güney commented on NUTCH-443:
-------------------------------------

I am not sure I follow you, Andrzej. My patch already does a very similar thing in the Fetchers. Actually, the only difference between our patches, w.r.t. Fetcher code, is that in your patch the parsing condition also includes a (content != null) check. Beyond that our code is pretty much the same. (I will send an updated patch that does that, btw.) Besides the code change in the Fetchers, we also need to remove the redir != null condition for Indexer to work correctly. See my comment above for a hopefully more understandable description.

Indexer has to read crawl_parse, because that is where ParseSegment pushes sub-urls' fetch datums. So, it is not related to the redirection issue. It is related to the "Oh man, I forgot to include that line in my patch" issue:).
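
To make the dependency on crawl_parse concrete, here is a minimal sketch under simplifying assumptions: a url is treated as indexable only if it has both a parse and a fetch datum, and each segment directory is modelled as a plain set of urls. IndexInputsSketch and indexable are hypothetical names, not Nutch code.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

final class IndexInputsSketch {
    // A url is indexable here only if it has a parse and a fetch datum
    // in at least one of the directories the indexer actually reads.
    static Set<String> indexable(Set<String> parsedUrls, List<Set<String>> datumSources) {
        Set<String> withDatum = new HashSet<>();
        for (Set<String> source : datumSources) {
            withDatum.addAll(source);
        }
        Set<String> result = new TreeSet<>(parsedUrls);
        result.retainAll(withDatum);
        return result;
    }

    public static void main(String[] args) {
        String feed = "http://example.com/feed";
        String item1 = "http://example.com/item1";
        String item2 = "http://example.com/item2";

        // The fetcher only saw the feed itself.
        Set<String> crawlFetch = new HashSet<>(Arrays.asList(feed));
        // The sub-urls' fetch datums were pushed to crawl_parse during parsing.
        Set<String> crawlParse = new HashSet<>(Arrays.asList(item1, item2));
        // The parser produced a parse for the feed and for every item.
        Set<String> parsed = new HashSet<>(Arrays.asList(feed, item1, item2));

        System.out.println("reading crawl_fetch only: "
            + indexable(parsed, Arrays.asList(crawlFetch)));
        System.out.println("reading crawl_fetch and crawl_parse: "
            + indexable(parsed, Arrays.asList(crawlFetch, crawlParse)));
    }
}
```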

If reading crawl_parse turns out to be a big burden to Indexer, perhaps we can make ParseSegment push these datums to another file. (crawl_late_fetch? Yeah, I know that name sucks:) It would be awesome if hadoop allowed us to reopen SequenceFiles to append data (so we could have just pushed them to crawl_fetch). AFAIK, hadoop doesn't have that yet.





[jira] Updated: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

Hudson (Jira)
In reply to this post by Hudson (Jira)

     [ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney updated NUTCH-443:
--------------------------------

    Attachment: redirect_and_index_v2.patch

New version. Moves parsing code into (content != null) branch.


[jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

Hudson (Jira)
In reply to this post by Hudson (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12495797 ]

Andrzej Bialecki  commented on NUTCH-443:
-----------------------------------------

Indeed... I forgot that we need crawl_parse to collect new sub-urls. Disregard my comment - your last patch seems to address the issue.


[jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

Hudson (Jira)
In reply to this post by Hudson (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12505448 ]

Doğacan Güney commented on NUTCH-443:
-------------------------------------

Chris, did you get a chance to look at this? If you are busy, I can assign this to myself, give it one final test and commit it.


[jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

Hudson (Jira)
In reply to this post by Hudson (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12505501 ]

Chris A. Mattmann commented on NUTCH-443:
-----------------------------------------

Doğacan,

   Whoops :) This one kind of fell off the radar screen. Doğacan, AFAIK, this patch is ready for committing, no? If so, I will go ahead and commit it, and close out the issue. Let me know, thanks!

Cheers,
  Chris



[jira] Updated: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

Hudson (Jira)
In reply to this post by Hudson (Jira)

     [ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney updated NUTCH-443:
--------------------------------

    Attachment: NUTCH_443_reopened_v3.patch

New version against latest trunk.

Tested locally, seems to work.


[jira] Closed: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

Hudson (Jira)
In reply to this post by Hudson (Jira)

     [ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann closed NUTCH-443.
-----------------------------------


Patch applied to trunk:

http://svn.apache.org/viewvc?rev=548076&view=rev


[jira] Resolved: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

Hudson (Jira)
In reply to this post by Hudson (Jira)

     [ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann resolved NUTCH-443.
-------------------------------------

    Resolution: Fixed

Patch tested and contributed by Doğacan. This update is a fix and semantics change from the original patch for NUTCH-443. The original patch did not tell the Indexer to read crawl_parse too, so that it can pick up sub-urls' fetch datums. This patch addresses that issue. Now, if Fetcher gets a null content, instead of pushing an empty content, it filters the null content.
