[jira] Created: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

[jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

Hudson (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12471956 ]

Chris A. Mattmann commented on NUTCH-443:
-----------------------------------------

I'll take the lead on evaluating these patches, and getting them into the sources. I'll take a look at what you've done so far, and contact you over the weekend to discuss next steps.

Cheers,
  Chris


> allow parsers to return multiple Parse object, this will speed up the rss parser
> --------------------------------------------------------------------------------
>
>                 Key: NUTCH-443
>                 URL: https://issues.apache.org/jira/browse/NUTCH-443
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher
>    Affects Versions: 0.9.0
>            Reporter: Renaud Richardet
>         Assigned To: Chris A. Mattmann
>            Priority: Minor
>             Fix For: 0.9.0
>
>         Attachments: NUTCH-443-draft-v1.patch, NUTCH-443-draft-v2.patch, NUTCH-443-draft-v3.patch, NUTCH-443-draft-v4.patch, parse-map-core-draft-v1.patch, parse-map-core-untested.patch, parsers.diff
>
>
> allow Parser#parse to return a Map<String,Parse>. This way, the RSS parser can return multiple parse objects, that will all be indexed separately. Advantage: no need to fetch all feed-items separately.
> see the discussion at http://www.nabble.com/RSS-fecter-and-index-individul-how-can-i-realize-this-function-tf3146271.html
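
To make the proposal concrete, here is a minimal Java sketch of a parser that returns one parse per feed item, keyed by the item's URL. The types FeedItemSketch, ParseSketch and FeedParserSketch are hypothetical stand-ins invented for this illustration; they are not the real Nutch Parse or Parser classes, and the attached patches may be structured quite differently.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for one <item> of an RSS feed.
class FeedItemSketch {
    final String url;
    final String title;
    final String text;

    FeedItemSketch(String url, String title, String text) {
        this.url = url;
        this.title = title;
        this.text = text;
    }
}

// Hypothetical stand-in for a parse result; NOT the real Nutch Parse type.
class ParseSketch {
    final String title;
    final String text;

    ParseSketch(String title, String text) {
        this.title = title;
        this.text = text;
    }
}

class FeedParserSketch {
    /** Returns one parse per feed item, keyed by the item's own URL. */
    static Map<String, ParseSketch> parse(List<FeedItemSketch> items) {
        Map<String, ParseSketch> parses = new LinkedHashMap<>();
        for (FeedItemSketch item : items) {
            parses.put(item.url, new ParseSketch(item.title, item.text));
        }
        return parses;
    }

    public static void main(String[] args) {
        List<FeedItemSketch> items = Arrays.asList(
            new FeedItemSketch("http://example.com/a", "A", "text a"),
            new FeedItemSketch("http://example.com/b", "B", "text b"));
        // Each entry could then be indexed as its own document,
        // without fetching the item URLs separately.
        for (Map.Entry<String, ParseSketch> e : parse(items).entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue().title);
        }
    }
}
```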

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

Hudson (Jira)
In reply to this post by Hudson (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12471991 ]

nutch.newbie commented on NUTCH-443:
------------------------------------

Dogacan:

It works rather OK, but when I changed parse-plugins.xml a bit, just to make sure I was explicitly using parse-rss:

        <mimeType name="text/xml">
          <!-- <plugin id="parse-html" /> -->
          <plugin id="parse-rss" />
        </mimeType>

I got stuck in the dedup phase. I tried a couple of times with content-type magic on and off, but got the same error. Stack trace:

2007-02-10 16:48:35,873 DEBUG mapred.MapTask - Started thread: Sort progress reporter for task map_e7j0jb
2007-02-10 16:48:35,877 WARN  mapred.LocalJobRunner - job_kyv1oj
java.lang.ArrayIndexOutOfBoundsException: -1
        at org.apache.lucene.index.MultiReader.isDeleted(MultiReader.java:109)
        at org.apache.nutch.indexer.DeleteDuplicates$InputFormat$DDRecordReader.next(DeleteDuplicates.java:176)
        at org.apache.hadoop.mapred.MapTask$2.next(MapTask.java:166)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:46)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:183)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:109)

I will be doing some more testing tonight with various config changes, but in general it works without any changes. I haven't tested searching yet, however.

Please advise me if I am doing something wrong or if you want me to test something specifically.
Cheers


[jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

Hudson (Jira)
In reply to this post by Hudson (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12471998 ]

nutch.newbie commented on NUTCH-443:
------------------------------------

Hi..

After swapping the entries in parse-plugins.xml the following way (and turning off magic detection):

        <mimeType name="text/xml">
          <plugin id="parse-rss" />
          <plugin id="parse-html" />
        </mimeType>

I was hoping that parse-rss would pick up the documents first and not throw an NPE. Out of 25 RSS URLs, with one round of fetching, I got past dedup, but only 4 documents were indexed; the other 21 threw an NPE:

Error parsing: http://rss.cnn.com/rss/cnn_warpcnn.rss: failed(2,200): java.lang.NullPointerException
Error parsing: http://rss.cnn.com/rss/cnn_ac360blog.rss: failed(2,200): java.lang.NullPointerException
Error parsing: http://rss.cnn.com/rss/cnn_marquee.rss: failed(2,200): java.lang.NullPointerException
Error parsing: http://rss.cnn.com/rss/cnn_gupta.rss: failed(2,200): java.lang.NullPointerException

I must be doing something silly. There must be a way to tell Nutch to index using plugin X. I thought you do that by turning magic off and using parse-plugins.xml, no? Am I missing something? Please let me know.

I am going to try parse-feed now to see what happens. I will post any issues regarding that in NUTCH-444.

Cheers


[jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

Hudson (Jira)
In reply to this post by Hudson (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12472079 ]

Dogacan Güney commented on NUTCH-443:
-------------------------------------

nutch.newbie,

I will take a look at these issues, but parse-rss and almost all other plugins are updated by Renaud Richardet, so he may give you a better answer.

[jira] Updated: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

Hudson (Jira)
In reply to this post by Hudson (Jira)

     [ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dogacan Güney updated NUTCH-443:
--------------------------------

    Attachment: NUTCH-443-draft-v5.patch

New version. Indexing now also works, but there is a catch. Many ScoringFilter functions take both a dbDatum and a fetchDatum. After this change, fetchDatum may be null, since the URL may have been generated during parsing rather than fetched. This does not affect scoring-opic, though.
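
For scoring code that does read fetchDatum, the catch described above means a null check is needed. The following is only a sketch under stated assumptions: DatumSketch is a hypothetical stand-in (not the real Nutch CrawlDatum), indexerScore is not the actual ScoringFilter signature, and both the formula and the neutral default of 1.0 are invented for illustration.

```java
// Hypothetical stand-in for a crawl datum; NOT the real Nutch CrawlDatum.
class DatumSketch {
    final float score;

    DatumSketch(float score) {
        this.score = score;
    }
}

class NullSafeScoringSketch {
    /**
     * Combines an initial score with the scores of the db and fetch entries.
     * fetchDatum may be null when the URL was produced by a parser
     * (for example, a feed item) instead of being fetched directly.
     */
    static float indexerScore(DatumSketch dbDatum, DatumSketch fetchDatum, float initScore) {
        // Assumed neutral default when there is no fetch entry.
        float fetchScore = (fetchDatum != null) ? fetchDatum.score : 1.0f;
        return initScore * dbDatum.score * fetchScore;
    }

    public static void main(String[] args) {
        // URL generated during parsing: no fetch entry exists.
        System.out.println(indexerScore(new DatumSketch(2.0f), null, 1.5f));
        // URL that was actually fetched.
        System.out.println(indexerScore(new DatumSketch(2.0f), new DatumSketch(0.5f), 1.5f));
    }
}
```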

[jira] Updated: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

Hudson (Jira)
In reply to this post by Hudson (Jira)

     [ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dogacan Güney updated NUTCH-443:
--------------------------------

    Attachment: NUTCH-443-draft-v6.patch

Oops... I forgot to merge Renaud Richardet's work.

This is same as v5 except it includes Renaud Richardet's changes from v4.

I am really really sorry about this. Will be more careful next time.

[jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

Hudson (Jira)
In reply to this post by Hudson (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12472669 ]

nutch.newbie commented on NUTCH-443:
------------------------------------

Chris:

I have been testing NUTCH-444 and NUTCH-443 lately. Renaud and Dogacan have done great work. So far, all the bugs I have found have been squashed. If there are other tests that need to be done, just let me know.

Besides our differences in view regarding the underlying parser technology :-) I would be very glad if you had the time to test the patch. I really need NUTCH-443 so that I can start on NUTCH-444 in terms of index-feed and query-feed. I would appreciate your attention.

Regards


[jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

Hudson (Jira)
In reply to this post by Hudson (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12472692 ]

Chris A. Mattmann commented on NUTCH-443:
-----------------------------------------

Hi Nutch Newbie:

I've already contacted Doğacan off-list and am currently testing his patch. In open source projects, the developers typically have their own day jobs, along with other things that keep them busy, and I am no different. Additionally, a patch such as this one requires *a lot* of testing, since it fundamentally changes the core Nutch API, so I need to test it thoroughly before committing anything. The patch also has its idiosyncrasies, as all patches do: for instance, it removes the log guards in some places (I'm not sure why yet), it has whitespace issues, as many patches do, and it removes code in some places and then adds it back in others. These kinds of things must be addressed before anything is committed to Nutch. Since Doğacan has taken the lead on making this patch happen (which is great by the way, thanks Doğacan!), I will continue to work with him off-list on these required updates.

So, while I'm not there yet, I am working on it. In the meantime, you are welcome to patch your Nutch system with the existing NUTCH-443 patch that I am working on, and start your development from there.

Cheers,
 Chris


[jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

Hudson (Jira)
In reply to this post by Hudson (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12472698 ]

nutch.newbie commented on NUTCH-443:
------------------------------------

Thanks a bunch Chris! That's all I needed to hear :-) Super :-)

Cheers

[jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

Hudson (Jira)
In reply to this post by Hudson (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12472733 ]

Renaud Richardet commented on NUTCH-443:
----------------------------------------

hi All,

Glad to see that this patch is moving forward :-)
I have been carried away by a project, but will have some time this week. Please let me know if there is anything I can help on, especially on the parsers.

Cheers,
Renaud

[jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

Hudson (Jira)
In reply to this post by Hudson (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12472821 ]

Doug Cutting commented on NUTCH-443:
------------------------------------

> this patch in some places removes the log guards

Most of the log guards are misguided.  Log guards should only be used on DEBUG level messages in performance-critical inner loops.  Since INFO is the expected log level, a guard on INFO & WARN level messages does not improve performance, since these will be shown.  And most DEBUG-level messages are not in performance critical code and hence do not need guards.  The guards only make the code bigger and thus harder to read and maintain.

log guards

Doug Cutting
Doug Cutting (JIRA) wrote:
>> this patch in some places removes the log guards
>
> Most of the log guards are misguided.  Log guards should only be used on DEBUG level messages in performance-critical inner loops.  Since INFO is the expected log level, a guard on INFO & WARN level messages does not improve performance, since these will be shown.  And most DEBUG-level messages are not in performance critical code and hence do not need guards.  The guards only make the code bigger and thus harder to read and maintain.

In particular, in all places where we check isWarnEnabled(),
isFatalEnabled() and isInfoEnabled(), the 'if' should be removed.  All
calls to isDebugEnabled() should be reviewed, and most should be removed.

These guards were all introduced by a patch some time ago.  I complained
at the time and it was promised that this would be repaired, but it has
not yet been.

Doug
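
As an illustration of the distinction Doug draws, here is a minimal sketch of where a guard does and does not pay off. To stay self-contained it uses java.util.logging rather than the logging API Nutch uses, so isLoggable(Level.FINE) stands in for isDebugEnabled(); the class and method names are made up for the example.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

class LogGuardSketch {
    private static final Logger LOG = Logger.getLogger(LogGuardSketch.class.getName());

    static long sumWithLogging(int[] values) {
        // INFO is the expected level, so the message is built and shown anyway.
        // Wrapping this call in an "is info enabled" check would save nothing.
        LOG.info("summing " + values.length + " values");

        long sum = 0;
        for (int i = 0; i < values.length; i++) {
            sum += values[i];
            // Debug-level message inside a hot loop: the guard skips the
            // string concatenation on every iteration when debug output is off.
            if (LOG.isLoggable(Level.FINE)) {
                LOG.fine("i=" + i + " running sum=" + sum);
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(sumWithLogging(new int[] {1, 2, 3}));
    }
}
```

Outside such loops, an unguarded debug call costs one message construction per call, which is usually negligible next to the extra code a guard adds.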

Re: log guards

Jérôme Charron
> These guards were all introduced by a patch some time ago.  I complained
> at the time and it was promised that this would be repaired, but it has
> not yet been.

Yes, Sorry Doug that's my own fault....
I really don't have time to fix this   :-(

Best regards

Jérôme

Re: log guards

chrismattmann
Hi Doug, and Jerome,

  Ah, yes, the log guard conversation. I remember this from a while back.
Hmmm, do you guys know which issue this was recorded as in JIRA? I have some
free time at the moment, so I will be able to add this to my list of Nutch stuff
to work on, and would be happy to take the lead on removing the guards where
needed, and reviewing whether or not the debug ones make sense where they
are.

Cheers,
  Chris



On 2/13/07 11:17 AM, "Jérôme Charron" <[hidden email]> wrote:

>> These guards were all introduced by a patch some time ago.  I complained
>> at the time and it was promised that this would be repaired, but it has
>> not yet been.
>
> Yes, Sorry Doug that's my own fault....
> I really don't have time to fix this   :-(
>
> Best regards
>
> Jérôme

______________________________________________
Chris A. Mattmann
[hidden email]
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group

_________________________________________________
Jet Propulsion Laboratory            Pasadena, CA
Office: 171-266B                        Mailstop:  171-246
_______________________________________________________

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.



Re: log guards

Jérôme Charron
Hi Chris,

The JIRA issue is NUTCH-309: https://issues.apache.org/jira/browse/NUTCH-309
Thanks for your help.

Jérôme

Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

Hudson (Jira)
In reply to this post by Hudson (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12473114 ]

Andrzej Bialecki  commented on NUTCH-443:
-----------------------------------------

The contract for ParseUtil.getFirstParseEntry() seems unclear: since in most cases this is a HashMap, there is no predictable way to get the first entry added to the map. I also propose that we use a specialized class instead of a general-purpose Map; then we can record in that class which entry was the first. Also, the naming of some methods seems a bit awkward: why should we insist on createSingleEntryMap when we create an ordinary Map and don't use this special-case knowledge later? I suggest simply naming it createParseMap.

In recent versions of Hadoop there is a GenericWritable class - it replaces ObjectWritable when classes are known in advance, and provides a more compact representation.

Changes to MapWritable must preserve old code values, at most adding some new ones - otherwise the new code will get confused when working with older data.

CrawlDbReducer, TODO item: this should be the time stored under Nutch.FETCH_TIME_KEY, no?

If I'm not mistaken, ParseUtil doesn't need the import of HashMap, only Map.

The new model for returning results from parse plugins allows a much better approach to parsing archives (eg. zip files) containing multiple documents in supported formats - although this should be a separate patch.
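
The specialized class proposed above for getFirstParseEntry() could be sketched roughly as follows. ParseResultSketch is a hypothetical name, and the type parameter P stands in for the real Parse type; this is one possible shape, not what any of the attached patches implement.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical container that remembers which entry was added first.
class ParseResultSketch<P> {
    private final Map<String, P> parses = new HashMap<>();
    // Remembered explicitly, because HashMap iteration order is unspecified.
    private String firstKey;

    void put(String url, P parse) {
        if (firstKey == null) {
            firstKey = url;
        }
        parses.put(url, parse);
    }

    /** Returns the first parse that was added, or null if the result is empty. */
    P first() {
        return (firstKey == null) ? null : parses.get(firstKey);
    }

    P get(String url) {
        return parses.get(url);
    }

    int size() {
        return parses.size();
    }

    public static void main(String[] args) {
        ParseResultSketch<String> result = new ParseResultSketch<>();
        result.put("http://example.com/z", "Z");
        result.put("http://example.com/a", "A");
        System.out.println(result.first());
    }
}
```

A LinkedHashMap would also preserve insertion order, but a dedicated class makes "which entry is the first" part of the contract instead of a property of whichever Map implementation a plugin happens to pick.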

Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

Hudson (Jira)
In reply to this post by Hudson (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12473129 ]

Doğacan Güney commented on NUTCH-443:
-------------------------------------

Andrzej:

Thanks for taking the time to review this.

> The contract for ParseUtil.getFirstParseEntry() seems unclear - since in most cases this is a HashMap, there is no predictable
> way to get the first entry added to the map ... I propose also that we should use a specialized class instead of
> general-purpose Map; and then we can record in that class which entry was the first.

ParseUtil.getFirstParseEntry is only a convenience method. Plugins use it to get the first (and only) entry in a map when they know they will create a one-entry parse map (with the original url as the key). It is mostly used in a plugin's main method to get the parse and print it, and it is not used in any core part of Nutch.

Anyway, it is very incorrectly named. What we meant was ParseUtil.getOnlyParseEntry. Hmm, that doesn't make any sense either :D

Instead of creating a specialized class, how about removing the method and just using parseMap.get(key)? Most plugins will use it like parseMap.get(content.getUrl()).
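
A minimal sketch of the lookup proposed above, assuming the parse map is a plain Map keyed by URL. ParseStub is a hypothetical stand-in for Nutch's Parse, and the URLs and helper names are made up for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for Nutch's Parse; only the text matters here.
final class ParseStub {
    final String text;

    ParseStub(String text) {
        this.text = text;
    }
}

final class ParseMapLookup {

    // Returns the parse stored under the URL of the fetched content,
    // or null when the plugin produced no entry for that URL.
    static ParseStub parseFor(Map<String, ParseStub> parseMap, String contentUrl) {
        return parseMap.get(contentUrl);
    }

    // Simulates an RSS plugin result: one entry for the feed plus one per item.
    static Map<String, ParseStub> sampleFeedParse() {
        Map<String, ParseStub> parseMap = new HashMap<>();
        parseMap.put("http://example.com/feed.rss", new ParseStub("feed"));
        parseMap.put("http://example.com/item1", new ParseStub("first item"));
        parseMap.put("http://example.com/item2", new ParseStub("second item"));
        return parseMap;
    }

    public static void main(String[] args) {
        Map<String, ParseStub> parseMap = sampleFeedParse();
        // Lookup by key does not depend on HashMap iteration order.
        System.out.println(parseFor(parseMap, "http://example.com/feed.rss").text);
    }
}
```

Because the lookup goes by key, it gives the same answer for a one-entry map and for a multi-entry map, which is what makes a "first entry" helper unnecessary.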

> Also, the naming of some methods
> seems a bit awkward - why should we insist that we createSingleEntryMap while we create an ordinary Map, and we don't use
> this special-case knowledge later? I suggest to simply name it createParseMap.

You are right, I will change this in the next patch.

> In recent versions of Hadoop there is a GenericWritable class - it replaces ObjectWritable when classes are known in advance,
> and provides a more compact representation.

Didn't know this, will change this too. (Why is Nutch not using this class in Indexer?)

> Changes to MapWritable must preserve old code values, at most adding some new ones - otherwise the new code will get
> confused when working with older data.

I see your point but I am not sure how to fix this. Since this patch removes the FetcherOutput class, what to put there instead of it? I guess we can just keep FetcherOutput as it is, and update its javadoc to reflect the fact that it is not used anymore.

> CrawlDbReducer, TODO item: this should be the time stored under Nutch.FETCH_TIME_KEY, no?
> If I'm not mistaken, ParseUtil doesn't need the import of HashMap, only Map.

I will remove the TODO item and fix the imports in the next patch.




    [ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12473141 ]

Andrzej Bialecki  commented on NUTCH-443:
-----------------------------------------

> Didn't know this, will change this too. (Why is Nutch not using this class in Indexer?)

Inertia, and lack of committer time ... ;)

> Since this patch removes the FetcherOutput class, what to put there instead of it?

Hmm, actually this is an important question. I don't think FetcherOutput is persisted anywhere, it's just an aggregate class to keep things together before they hit the disk. I propose to leave a comment in MapWritable like this "// code -123 was reserved for FetcherOutput - no longer in use". As for the class itself - again, since it's not persisted we don't have to keep it around, just remove it.

Sections in Fetcher.FetcherThread.output() and similar in Fetcher2 that output the data need to be synchronized now - output.collect() is no longer a single atomic operation. Perhaps it's better to leave FetcherOutput after all?
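
A rough sketch of the synchronization concern, under the assumption that several records are now emitted per URL where there used to be one. GroupedOutput and its collect() method are hypothetical stand-ins, not the Hadoop OutputCollector or the Fetcher code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: several collect() calls per URL, emitted as one unit.
final class GroupedOutput {
    private final List<String> records = new ArrayList<>();

    // Stand-in for output.collect(key, value); callers must hold the lock.
    private void collect(String key, String value) {
        records.add(key + "\t" + value);
    }

    // Holds the lock across all collect() calls for one URL, so records
    // from other threads cannot be interleaved with this group.
    void output(String url, String... values) {
        synchronized (records) {
            for (String value : values) {
                collect(url, value);
            }
        }
    }

    List<String> snapshot() {
        synchronized (records) {
            return new ArrayList<>(records);
        }
    }

    // True when every run of groupSize records shares the same key.
    static boolean groupsAreContiguous(List<String> records, int groupSize) {
        for (int i = 0; i < records.size(); i += groupSize) {
            String keyPrefix = records.get(i).split("\t")[0] + "\t";
            for (int j = 1; j < groupSize; j++) {
                if (i + j >= records.size() || !records.get(i + j).startsWith(keyPrefix)) {
                    return false;
                }
            }
        }
        return true;
    }

    // Runs several writer threads and returns everything they emitted.
    static List<String> runDemo(int threads, int urlsPerThread) {
        GroupedOutput out = new GroupedOutput();
        List<Thread> workers = new ArrayList<>();
        for (int t = 0; t < threads; t++) {
            final int id = t;
            Thread worker = new Thread(() -> {
                for (int u = 0; u < urlsPerThread; u++) {
                    out.output("http://example.com/" + id + "-" + u,
                               "crawl-datum", "content", "parse");
                }
            });
            workers.add(worker);
            worker.start();
        }
        try {
            for (Thread worker : workers) {
                worker.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return out.snapshot();
    }

    public static void main(String[] args) {
        List<String> records = runDemo(4, 50);
        System.out.println(records.size() + " records, contiguous groups: "
            + groupsAreContiguous(records, 3));
    }
}
```

Without the synchronized block in output(), records for different URLs could interleave, which is the non-atomicity described above.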


    [ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12473147 ]

Doğacan Güney commented on NUTCH-443:
-------------------------------------

> Hmm, actually this is an important question. I don't think FetcherOutput is persisted anywhere, it's just an aggregate class to
> keep things together before they hit the disk. I propose to leave a comment in MapWritable like this "// code -123 was
> reserved for FetcherOutput - no longer in use". As for the class itself - again, since it's not persisted we don't have to keep it
> around, just remove it.

I implemented this approach in one of the earlier patches. The problem is that the code in MapWritable does this:

addIdEntry((byte) (-128 + CLASS_ID_MAP.size() + ++fIdCount), // ...

Now, I don't claim to understand the code perfectly, but because of the "-128 + CLASS_ID_MAP.size()" part I think CLASS_ID_MAP must always hold consecutive values, so not having -123 breaks it. IIRC, TestMapWritable fails once that line is removed.
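
To make the suspected failure concrete, here is a small sketch of an id table that computes dynamic ids from the table size, as in the expression quoted above. ClassIdTable, its placeholder class names and the seven-entry layout are assumptions for illustration, not the real MapWritable table; only the -123 code for FetcherOutput comes from this thread.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of size-based id assignment; not the real MapWritable.
final class ClassIdTable {
    private final Map<Byte, String> staticIds = new LinkedHashMap<>();
    private int dynamicCount = 0;

    void addStatic(int code, String className) {
        staticIds.put((byte) code, className);
    }

    // Mirrors the quoted expression: -128 + CLASS_ID_MAP.size() + ++fIdCount
    byte nextDynamicId() {
        return (byte) (-128 + staticIds.size() + ++dynamicCount);
    }

    boolean collidesWithStatic(byte id) {
        return staticIds.containsKey(id);
    }

    // Static codes run consecutively from -127; placeholder names throughout,
    // with FetcherOutput placed at -123 as mentioned in the thread.
    static ClassIdTable build(boolean keepReservedCode) {
        String[] names = {"ClassA", "ClassB", "ClassC", "ClassD",
                          "FetcherOutput", "ClassF", "ClassG"};
        ClassIdTable table = new ClassIdTable();
        for (int i = 0; i < names.length; i++) {
            int code = -127 + i;
            if (code == -123 && !keepReservedCode) {
                continue; // simulate deleting the FetcherOutput line
            }
            table.addStatic(code, names[i]);
        }
        return table;
    }

    public static void main(String[] args) {
        ClassIdTable kept = build(true);
        byte idKept = kept.nextDynamicId();
        System.out.println("kept:    id " + idKept + ", collides: "
            + kept.collidesWithStatic(idKept));

        ClassIdTable dropped = build(false);
        byte idDropped = dropped.nextDynamicId();
        System.out.println("dropped: id " + idDropped + ", collides: "
            + dropped.collidesWithStatic(idDropped));
    }
}
```

In this sketch, dropping one static entry shrinks the size by one, so the first dynamic id lands on the highest static code. That would be consistent with the test failure, assuming the real table behaves the same way.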

> Sections in Fetcher.FetcherThread.output() and similar in Fetcher2 that output the data need to be synchronized now -
> output.collect() is no longer a single atomic operation. Perhaps it's better to leave FetcherOutput after all?

This causes key ordering problems. See my admittedly-could-have-been-clearer 2nd comment.

Anyway, I am assuming that you are OK with removing ParseUtil.getFirstParseEntry and just using Map.get?


    [ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12473148 ]

Sami Siren commented on NUTCH-443:
----------------------------------

>> Didn't know this, will change this too. (Why is Nutch not using this class in Indexer?)

>Inertia, and lack of committer time ... ;)

IIRC you actually cannot use GenericWritable because it requires wrapped objects to be Writables, and Lucene objects obviously aren't. But you can imitate it with a similar object capable of storing plain Objects (those writables are not persisted in the indexer).

I opened an issue for this, NUTCH-434, and I am now recommending that the patch in this issue doesn't try to take the world in one piece :)


