[jira] Created: (NUTCH-627) Minimize host address lookup

[jira] Created: (NUTCH-627) Minimize host address lookup

Tim Allison (Jira)
Minimize host address lookup
----------------------------

                 Key: NUTCH-627
                 URL: https://issues.apache.org/jira/browse/NUTCH-627
             Project: Nutch
          Issue Type: Improvement
          Components: generator
            Reporter: Otis Gospodnetic
         Attachments: NUTCH-627.patch

The simple patch that I'm about to attach keeps track of hosts whose "max URLs per host" limit we already reached, as well as hosts whose hostname->IP lookup already failed.  For such hosts, further DNS lookups are skipped:
- there is no point in looking up a hostname yet again if we already have the max number of URLs for that host
- there is little point in attempting to look up a hostname yet again if the previous lookup already failed

In a simple test, this saved a few hundred thousand lookups for the first case and a few hundred lookups for the second case.
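
The idea described above can be sketched roughly as follows. This is only an illustration under assumptions, not the code from NUTCH-627.patch: the class name HostLookupFilter, the Resolver hook, and the accept() method are hypothetical, and the real Generator may count URLs and key hosts differently.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: skip DNS lookups for hosts that are full or already failed. */
class HostLookupFilter {

    /** Hypothetical hook so the DNS call can be replaced by a stub. */
    interface Resolver {
        InetAddress resolve(String host) throws UnknownHostException;
    }

    private final int maxPerHost;
    private final Resolver resolver;
    private final Map<String, Integer> urlCounts = new HashMap<>();
    private final Set<String> maxedHosts = new HashSet<>();   // hit "max URLs per host"
    private final Set<String> failedHosts = new HashSet<>();  // hostname->IP lookup failed
    int lookups = 0;                                           // lookups actually performed

    HostLookupFilter(int maxPerHost, Resolver resolver) {
        this.maxPerHost = maxPerHost;
        this.resolver = resolver;
    }

    /** Uses the JDK resolver by default. */
    HostLookupFilter(int maxPerHost) {
        this(maxPerHost, InetAddress::getByName);
    }

    /** Returns true if a URL on this host should be kept. */
    boolean accept(String host) {
        // Both cases from the issue description: no lookup is attempted at all.
        if (maxedHosts.contains(host) || failedHosts.contains(host)) {
            return false;
        }
        lookups++;
        try {
            resolver.resolve(host);
        } catch (UnknownHostException e) {
            failedHosts.add(host);  // remember the failure, do not retry
            return false;
        }
        int count = urlCounts.merge(host, 1, Integer::sum);
        if (count >= maxPerHost) {
            maxedHosts.add(host);   // limit reached, later URLs skip the lookup
        }
        return true;
    }
}
```

With a limit of 2, the third URL for a given host is rejected without touching DNS, and a host whose lookup failed once is rejected from then on without a retry.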

If nobody complains, I'll commit by the end of the week.


--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (NUTCH-627) Minimize host address lookup

Tim Allison (Jira)

     [ https://issues.apache.org/jira/browse/NUTCH-627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Otis Gospodnetic updated NUTCH-627:
-----------------------------------

    Attachment: NUTCH-627.patch


Re: [jira] Updated: (NUTCH-627) Minimize host address lookup

Andrzej Białecki-2
Otis Gospodnetic (JIRA) wrote:

>> If nobody complains, I'll commit by the end of the week.

Hi Otis,

Thanks for helping with Nutch - we are indeed very shorthanded at the
moment, and any help is appreciated, and doubly so that of a person who
can commit things ...

However, on the formal side I think the Nutch team needs to vote you in
as a Nutch committer (even though svn allows you to commit directly) -
witness the recent situation with Grant. If you wish I can start a vote,
and I'm sure it will be positive, and we will have a clean situation
from the formal POV. Ok?


--
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

Re: [jira] Updated: (NUTCH-627) Minimize host address lookup

Dennis Kubes-2


Andrzej Bialecki wrote:

> Otis Gospodnetic (JIRA) wrote:
>
>>> If nobody complains, I'll commit by the end of the week.
>
> Hi Otis,
>
> Thanks for helping with Nutch - we are indeed very shorthanded at the
> moment, and any help is appreciated, and doubly so that of a person who
> can commit things ...
>
> However, on the formal side I think the Nutch team needs to vote you in
> as a Nutch committer (even though svn allows you to commit directly) -
> witness the recent situation with Grant. If you wish I can start a vote,
> and I'm sure it will be positive, and we will have a clean situation
> from the formal POV. Ok?
>

+1

Re: [jira] Updated: (NUTCH-627) Minimize host address lookup

chrismattmann
On 4/10/08 8:25 AM, "Dennis Kubes" <[hidden email]> wrote:

>
>
> Andrzej Bialecki wrote:
>> Otis Gospodnetic (JIRA) wrote:
>>
>>>> If nobody complains, I'll commit by the end of the week.
>>
>> Hi Otis,
>>
>> Thanks for helping with Nutch - we are indeed very shorthanded at the
>> moment, and any help is appreciated, and doubly so that of a person who
>> can commit things ...
>>
>> However, on the formal side I think the Nutch team needs to vote you in
>> as a Nutch committer (even though svn allows you to commit directly) -
>> witness the recent situation with Grant. If you wish I can start a vote,
>> and I'm sure it will be positive, and we will have a clean situation
>> from the formal POV. Ok?
>>
> +1
>>

+1, as well.

Cheers,
 Chris


______________________________________________
Chris Mattmann, Ph.D.
[hidden email]
Cognizant Development Engineer
Early Detection Research Network Project
_________________________________________________
Jet Propulsion Laboratory            Pasadena, CA
Office: 171-266B                     Mailstop:  171-246
_______________________________________________________

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.


Re: [jira] Updated: (NUTCH-627) Minimize host address lookup

Otis Gospodnetic-2-2
Hi Andrzej,

Sure, that sounds good - thanks!

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
From: Andrzej Bialecki <[hidden email]>
To: [hidden email]
Sent: Thursday, April 10, 2008 4:45:08 AM
Subject: Re: [jira] Updated: (NUTCH-627) Minimize host address lookup

Otis Gospodnetic (JIRA) wrote:

>> If nobody complains, I'll commit by the end of the week.

Hi Otis,

Thanks for helping with Nutch - we are indeed very shorthanded at the
moment, and any help is appreciated, and doubly so that of a person who
can commit things ...

However, on the formal side I think the Nutch team needs to vote you in
as a Nutch committer (even though svn allows you to commit directly) -
witness the recent situation with Grant. If you wish I can start a vote,
and I'm sure it will be positive, and we will have a clean situation
from the formal POV. Ok?


--
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




[jira] Assigned: (NUTCH-627) Minimize host address lookup

Tim Allison (Jira)

     [ https://issues.apache.org/jira/browse/NUTCH-627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Otis Gospodnetic reassigned NUTCH-627:
--------------------------------------

    Assignee: Otis Gospodnetic


[jira] Commented: (NUTCH-627) Minimize host address lookup

Tim Allison (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662474#action_12662474 ]

Andrzej Bialecki  commented on NUTCH-627:
-----------------------------------------

Otis, is the patch already applied? If not, +1 from me.


[jira] Resolved: (NUTCH-627) Minimize host address lookup

Tim Allison (Jira)

     [ https://issues.apache.org/jira/browse/NUTCH-627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Otis Gospodnetic resolved NUTCH-627.
------------------------------------

    Resolution: Fixed

Thanks Otis.
Sending        CHANGES.txt
Sending        src/java/org/apache/nutch/crawl/Generator.java
Transmitting file data ..
Committed revision 734257.



[jira] Commented: (NUTCH-627) Minimize host address lookup

Tim Allison (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12663619#action_12663619 ]

Hudson commented on NUTCH-627:
------------------------------

Integrated in Nutch-trunk #692 (See [http://hudson.zones.apache.org/hudson/job/Nutch-trunk/692/])
     - Minimize host address lookup while running generate



[jira] Closed: (NUTCH-627) Minimize host address lookup

Tim Allison (Jira)

     [ https://issues.apache.org/jira/browse/NUTCH-627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney closed NUTCH-627.
-------------------------------


