[jira] Created: (NUTCH-770) Timebomb for Fetcher

[jira] Created: (NUTCH-770) Timebomb for Fetcher

ASF GitHub Bot (Jira)
Timebomb for Fetcher
--------------------

                 Key: NUTCH-770
                 URL: https://issues.apache.org/jira/browse/NUTCH-770
             Project: Nutch
          Issue Type: Improvement
            Reporter: Julien Nioche


This patch provides the Fetcher with a timebomb mechanism. By default the timebomb is not activated; it can be set using the parameter fetcher.timebomb.mins. The number of minutes is relative to the start of the Fetch job. When the number of minutes is reached, the QueueFeeder skips all remaining entries, then all active queues are purged. This keeps the Fetch step under control and works well in combination with NUTCH-769.
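
To illustrate the behaviour described above, here is a minimal Java sketch. It is not the code from the patch: the class TimeLimitSketch, its method names, and the convention that a non-positive number of minutes means "not activated" are all assumptions made for this example.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.TimeUnit;

/** Hypothetical, simplified model of the mechanism; names are illustrative only. */
class TimeLimitSketch {
    /** Assumed marker for "no limit", mirroring the "not activated by default" behaviour. */
    static final long DISABLED = -1L;

    /** Turns a limit in minutes, relative to the job start, into an absolute deadline in ms. */
    static long deadlineMillis(long jobStartMillis, long limitMins) {
        if (limitMins <= 0) {
            return DISABLED; // assumption: non-positive values switch the limit off
        }
        return jobStartMillis + TimeUnit.MINUTES.toMillis(limitMins);
    }

    /** True once the deadline has been reached; never true when the limit is disabled. */
    static boolean expired(long deadlineMillis, long nowMillis) {
        return deadlineMillis != DISABLED && nowMillis >= deadlineMillis;
    }

    /** Empties the queue if the deadline has passed and returns how many entries were dropped. */
    static int purgeIfExpired(Deque<String> queue, long deadlineMillis, long nowMillis) {
        if (!expired(deadlineMillis, nowMillis)) {
            return 0;
        }
        int dropped = queue.size();
        queue.clear();
        return dropped;
    }

    public static void main(String[] args) {
        Deque<String> queue = new ArrayDeque<>();
        queue.add("http://example.org/a");
        queue.add("http://example.org/b");
        long deadline = deadlineMillis(System.currentTimeMillis(), 1);
        // Pretend two minutes have elapsed since the job started.
        long later = System.currentTimeMillis() + TimeUnit.MINUTES.toMillis(2);
        System.out.println("dropped=" + purgeIfExpired(queue, deadline, later));
    }
}
```

Passing the current time in as an argument, rather than reading the clock inside the methods, keeps the check easy to test.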


--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (NUTCH-770) Timebomb for Fetcher

ASF GitHub Bot (Jira)

     [ https://issues.apache.org/jira/browse/NUTCH-770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated NUTCH-770:
--------------------------------

    Attachment: NUTCH-770.patch


[jira] Updated: (NUTCH-770) Timebomb for Fetcher

ASF GitHub Bot (Jira)
In reply to this post by ASF GitHub Bot (Jira)

     [ https://issues.apache.org/jira/browse/NUTCH-770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

MilleBii updated NUTCH-770:
---------------------------

    Attachment: log-770

Please find attached the log from applying the patch. I did try it, but the code would not compile afterwards.


[jira] Commented: (NUTCH-770) Timebomb for Fetcher

ASF GitHub Bot (Jira)
In reply to this post by ASF GitHub Bot (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783248#action_12783248 ]

Julien Nioche commented on NUTCH-770:
-------------------------------------

The log simply shows that the patch has not been applied properly.
See http://markmail.org/message/wbd3r3t5bfxzkbpn for a discussion on how to apply patches

It should work fine from the root directory of Nutch with:
patch -p0 < ~/Desktop/NUTCH-770.patch


[jira] Commented: (NUTCH-770) Timebomb for Fetcher

ASF GitHub Bot (Jira)
In reply to this post by ASF GitHub Bot (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783252#action_12783252 ]

MilleBii commented on NUTCH-770:
--------------------------------

That's what I did, and I just retried it, so I'm a bit surprised too.
Other patches have worked fine so far.

???


[jira] Commented: (NUTCH-770) Timebomb for Fetcher

ASF GitHub Bot (Jira)
In reply to this post by ASF GitHub Bot (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783283#action_12783283 ]

Andrzej Bialecki  commented on NUTCH-770:
-----------------------------------------

I propose to change the name of this functionality - "timebomb" is not self-explanatory, and it suggests that if you misbehave then your cluster may explode ;) Instead I would use "time limit", rename all vars and methods to follow this naming, and document it properly in nutch-default.xml.

A few comments to the patch:

* it has some overlap with NUTCH-769 (the emptyQueue() method), but that's easy to resolve, see also the next point.

* why change the code in FetchQueues at all? Time limit is a global condition, we could just break the main loop in run() and ignore the QueueFeeder (or don't start it if the time limit already passed when starting run() ).

* the patch does not follow the code style (notably whitespace in for/while loops and assignments).
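
The second point, treating the time limit as a global condition checked in the main loop, could be sketched roughly as follows. This is a hypothetical stand-alone illustration, not Fetcher code: GlobalTimeLimitLoop, its run() signature, and the injected clock are assumptions made for this example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.LongSupplier;

/** Hypothetical illustration of breaking a main loop on a global time limit. */
class GlobalTimeLimitLoop {
    /**
     * Processes entries until the list is exhausted or the deadline is reached.
     * If the deadline has already passed on entry, nothing is processed at all.
     * Returns the number of entries processed.
     */
    static int run(List<String> urls, long deadlineMillis, LongSupplier clock) {
        int processed = 0;
        for (String url : urls) {
            if (clock.getAsLong() >= deadlineMillis) {
                break; // global condition: stop the whole loop, ignore what is left
            }
            // Stand-in for the real work (fetching the URL).
            if (url != null) {
                processed++;
            }
        }
        return processed;
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList("a", "b", "c");
        long deadline = System.currentTimeMillis() + 60_000L;
        System.out.println("processed=" + run(urls, deadline, System::currentTimeMillis));
    }
}
```

In this form the loop only knows that it stopped early; it does not record which entries were left unprocessed.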


[jira] Issue Comment Edited: (NUTCH-770) Timebomb for Fetcher

ASF GitHub Bot (Jira)
In reply to this post by ASF GitHub Bot (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783252#action_12783252 ]

MilleBii edited comment on NUTCH-770 at 11/29/09 8:47 PM:
----------------------------------------------------------

That's what I did, and I just retried it, so I'm a bit surprised too.
Other patches have worked fine so far.

I changed my method and applied the patch with Eclipse instead, and I now get the following compile error:
992: cannot find symbol
    [javac] symbol  : method checkTimeBomb()
    [javac] location: class org.apache.nutch.fetcher.Fetcher.FetchItemQueues
    [javac]         int timeBombed  =fetchQueues.checkTimeBomb();
    [javac]                                     ^
    [javac] 1 error




[jira] Commented: (NUTCH-770) Timebomb for Fetcher

ASF GitHub Bot (Jira)
In reply to this post by ASF GitHub Bot (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783622#action_12783622 ]

Julien Nioche commented on NUTCH-770:
-------------------------------------

"time limit" is definitely better than timebomb (but not as amusing).
FetchQueues: having it there has the advantage that we can count how many URLs have been skipped due to the time limit. That's in the same spirit as https://issues.apache.org/jira/browse/NUTCH-658, which I have been using for a while. It's very useful to know what happens to the input URLs, and it reveals quite a lot about the behaviour of the fetch.
Code style: I suppose the following Eclipse code style is the one to use? (http://wiki.apache.org/lucene-java/HowToContribute?action=AttachFile&do=view&target=Eclipse-Lucene-Codestyle.xml)
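
The point about counting skipped URLs can be illustrated with a small Java sketch. It is hypothetical and much simpler than the real QueueFeeder: the class SkipCountingFeeder, its fields, and the injected clock are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.LongSupplier;

/** Hypothetical feeder that keeps reading its input after the deadline in order to count skips. */
class SkipCountingFeeder {
    final List<String> queue = new ArrayList<>();
    int fed;
    int skippedByTimeLimit;

    /** Queues entries until the deadline, then drains the rest of the input and counts it. */
    void feed(Iterator<String> input, long deadlineMillis, LongSupplier clock) {
        while (input.hasNext()) {
            String url = input.next();
            if (clock.getAsLong() >= deadlineMillis) {
                skippedByTimeLimit++; // entry is consumed but never queued
                continue;
            }
            queue.add(url);
            fed++;
        }
    }

    public static void main(String[] args) {
        SkipCountingFeeder feeder = new SkipCountingFeeder();
        Iterator<String> input = Arrays.asList("a", "b", "c").iterator();
        // A deadline of 0 is always in the past, so every entry is counted as skipped.
        feeder.feed(input, 0L, System::currentTimeMillis);
        System.out.println("fed=" + feeder.fed + " skipped=" + feeder.skippedByTimeLimit);
    }
}
```

The trade-off is that the input is still read to the end after the deadline, which is the price of an exact count of skipped entries.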


[jira] Commented: (NUTCH-770) Timebomb for Fetcher

ASF GitHub Bot (Jira)
In reply to this post by ASF GitHub Bot (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783638#action_12783638 ]

Andrzej Bialecki  commented on NUTCH-770:
-----------------------------------------

> "time limit" is definitely better than timebomb (but not as amusing).

:) let's go for "informative" and "less confusing" now ... Could you please also add the nutch-default.xml property and its documentation?

Re: FetchQueues - ok, you have a point here.

Re: code style - yes.


[jira] Updated: (NUTCH-770) Timebomb for Fetcher

ASF GitHub Bot (Jira)
In reply to this post by ASF GitHub Bot (Jira)

     [ https://issues.apache.org/jira/browse/NUTCH-770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated NUTCH-770:
--------------------------------

    Attachment: NUTCH-770-v2.patch

* renamed timebomb to timelimit
* added parameter and its description in nutch-default.xml
* applied Lucene codestyle from http://wiki.apache.org/lucene-java/HowToContribute?action=AttachFile&do=view&target=Eclipse-Lucene-Codestyle.xml


[jira] Updated: (NUTCH-770) Timebomb for Fetcher

ASF GitHub Bot (Jira)
In reply to this post by ASF GitHub Bot (Jira)

     [ https://issues.apache.org/jira/browse/NUTCH-770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated NUTCH-770:
--------------------------------

    Attachment: NUTCH-770-v3.patch

v2 applied the Lucene code formatting to the whole Java file, which caused far too many changes. v3 makes the same changes as v2 (adds the parameter and its description to nutch-default.xml and renames timebomb to timelimit) but applies the code formatting only to the relevant portions of the code.


[jira] Commented: (NUTCH-770) Timebomb for Fetcher

ASF GitHub Bot (Jira)
In reply to this post by ASF GitHub Bot (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12784250#action_12784250 ]

Andrzej Bialecki  commented on NUTCH-770:
-----------------------------------------

Fixed in rev. 885776. Thank you!


[jira] Closed: (NUTCH-770) Timebomb for Fetcher

ASF GitHub Bot (Jira)
In reply to this post by ASF GitHub Bot (Jira)

     [ https://issues.apache.org/jira/browse/NUTCH-770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrzej Bialecki  closed NUTCH-770.
-----------------------------------

       Resolution: Fixed
    Fix Version/s: 1.1
         Assignee: Andrzej Bialecki


[jira] Commented: (NUTCH-770) Timebomb for Fetcher

ASF GitHub Bot (Jira)
In reply to this post by ASF GitHub Bot (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786443#action_12786443 ]

MilleBii commented on NUTCH-770:
--------------------------------

Tried it successfully on a Windows platform.

It does not work on an Ubuntu, pseudo-distributed Hadoop configuration with mappers running in parallel ????




[jira] Issue Comment Edited: (NUTCH-770) Timebomb for Fetcher

ASF GitHub Bot (Jira)
In reply to this post by ASF GitHub Bot (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786443#action_12786443 ]

MilleBii edited comment on NUTCH-770 at 12/5/09 4:50 PM:
---------------------------------------------------------

Tried it successfully on a Windows platform.

It does not work on an Ubuntu, pseudo-distributed Hadoop configuration with two mappers running in parallel ????



