Faster UpdateDB


Faster UpdateDB

Jon Shoberg
Calling UpdateDB for my segments (500K) is pretty slow as a relative
observation.

Aside from bigger hardware, is there anything that can be done to speed
up the update process?  Can multiple segments update the DB at the same
time?

Any optimizations or suggested usages?

thanks
-j


Re: Faster UpdateDB

Stefan Groschupf-2
You can try experimenting with the settings in nutch-config.xml.
More open file streams, more cache for sorting, things like that may help,
but they may also crash the system because of too many open files (under
Unix this limit can be configured).
HTH
Stefan
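
Before raising the number of open streams, it may help to see how many file descriptors the JVM can still open. A minimal sketch, assuming a Unix JVM that exposes com.sun.management.UnixOperatingSystemMXBean; the class name OpenFileHeadroom is made up for illustration and is not part of Nutch. The per-process limit itself is raised in the shell (for example with `ulimit -n`), not from Java.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

// Hypothetical helper, not part of Nutch.
final class OpenFileHeadroom {

    /**
     * Returns how many more file descriptors this JVM may open,
     * or -1 when the platform does not expose the counters.
     */
    static long remainingDescriptors() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof com.sun.management.UnixOperatingSystemMXBean) {
            com.sun.management.UnixOperatingSystemMXBean unix =
                (com.sun.management.UnixOperatingSystemMXBean) os;
            return unix.getMaxFileDescriptorCount() - unix.getOpenFileDescriptorCount();
        }
        return -1; // counters not available (for example on a non-Unix JVM)
    }

    public static void main(String[] args) {
        long remaining = remainingDescriptors();
        if (remaining < 0) {
            System.out.println("Descriptor counts are not available on this JVM.");
        } else {
            System.out.println("File descriptors still available: " + remaining);
        }
    }
}
```

If the reported headroom is small, raising the stream count in the config is more likely to end in a "too many open files" error than in a faster update.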

On 30.09.2005 at 18:31, Jon Shoberg wrote:

> Calling UpdateDB for my segments (500K) is pretty slow as a
> relative observation.
>
> Aside from bigger hardware, is there anything that can be done to
> speed up the update process?  Can multiple segments update the DB
> at the same time?
>
> Any optimizations or suggested usages?
>
> thanks
> -j

---------------------------------------------------------------
company:        http://www.media-style.com
forum:        http://www.text-mining.org
blog:            http://www.find23.net



Re: Faster UpdateDB

Andy Liu-3
If you have a lot of regex expressions in your crawl-urlfilter.txt file,
that's probably what's making updatedb so slow. If you're just filtering
against a list of domains, I believe there's a new domain URL filter that
was just added to JIRA which caches domain names and speeds things up
considerably.

Andy
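
To illustrate why a domain lookup can be cheaper than a list of regular expressions, here is a minimal sketch of a host-based filter backed by a hash set. It is not the filter from JIRA mentioned above; the class name DomainSetFilterSketch, the accepts method, and the example domains are all assumptions made for illustration.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch, not the Nutch domain URL filter.
final class DomainSetFilterSketch {
    private final Set<String> allowed = new HashSet<>();

    DomainSetFilterSketch(String... domains) {
        for (String d : domains) {
            allowed.add(d.toLowerCase(Locale.ROOT));
        }
    }

    /** Returns true if the URL's host is an allowed domain or a subdomain of one. */
    boolean accepts(String url) {
        String host;
        try {
            host = new URI(url).getHost();
        } catch (URISyntaxException e) {
            return false; // unparsable URLs are rejected
        }
        if (host == null) {
            return false;
        }
        host = host.toLowerCase(Locale.ROOT);
        // Try the full host, then each parent domain:
        // www.example.org, example.org, org.
        while (true) {
            if (allowed.contains(host)) {
                return true;
            }
            int dot = host.indexOf('.');
            if (dot < 0) {
                return false;
            }
            host = host.substring(dot + 1);
        }
    }

    public static void main(String[] args) {
        DomainSetFilterSketch filter = new DomainSetFilterSketch("example.org");
        System.out.println(filter.accepts("http://www.example.org/page"));
        System.out.println(filter.accepts("http://notexample.org/page"));
    }
}
```

Each URL costs one parse plus a handful of hash lookups, no matter how many domains are on the list, whereas a regex list grows more expensive with every rule added.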

On 9/30/05, Stefan Groschupf wrote:

> You can try experimenting with the settings in nutch-config.xml.
> More open file streams, more cache for sorting, things like that may help,
> but they may also crash the system because of too many open files (under
> Unix this limit can be configured).
> HTH
> Stefan



Re: Faster UpdateDB

Jon Shoberg
I'm using the whole-web crawl strategy with Nutch 0.7.

I have 26 statements in regex-urlfilter.txt.

There are 6 regexes in regex-normalize.xml.

-j


Andy Liu wrote:

> If you have a lot of regex expressions in your crawl-urlfilter.txt file,
> that's probably what's making updatedb so slow. If you're just filtering
> against a list of domains, I believe there's a new domain URL filter that
> was just added to JIRA which caches domain names and speeds things up
> considerably.
>
> Andy


Re: Faster UpdateDB

Jon Shoberg
In reply to this post by Stefan Groschupf-2

I presume you mean the I/O properties?

Any suggested values?  This is a 4 GB RAM dual-Opteron box with 1 TB of
15K RPM SCSI RAID.

-j

Stefan Groschupf wrote:

> You can try experimenting with the settings in nutch-config.xml.
> More open file streams, more cache for sorting, things like that may help,
> but they may also crash the system because of too many open files (under
> Unix this limit can be configured).
> HTH
> Stefan


Re: Faster UpdateDB

Gal Nitzan
In reply to this post by Jon Shoberg
Hi Jon,

If I understand it correctly, that is 26 regex matcher calls for each URL!

Gal
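
The per-URL cost described above can be made concrete with a small sketch. It assumes first-match-wins semantics (a leading + accepts, a leading - rejects, and a URL that matches no rule is rejected); that is an assumption about how the filter file is evaluated and is not confirmed in this thread. The class FirstMatchFilterSketch and its three rules are invented for illustration and are not Nutch code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch of a first-match-wins regex filter that counts matcher calls.
final class FirstMatchFilterSketch {

    private static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(boolean accept, String regex) {
            this.accept = accept;
            this.pattern = Pattern.compile(regex); // compiled once, reused for every URL
        }
    }

    private final List<Rule> rules = new ArrayList<>();
    private long matcherCalls = 0;

    /** Adds a rule line: '+' or '-' followed by a regular expression. */
    void addRule(String line) {
        rules.add(new Rule(line.charAt(0) == '+', line.substring(1)));
    }

    /** Applies the rules in order; the first rule that matches decides. */
    boolean accept(String url) {
        for (Rule rule : rules) {
            matcherCalls++;
            if (rule.pattern.matcher(url).find()) {
                return rule.accept;
            }
        }
        return false; // no rule matched
    }

    long matcherCalls() {
        return matcherCalls;
    }

    public static void main(String[] args) {
        FirstMatchFilterSketch filter = new FirstMatchFilterSketch();
        filter.addRule("-\\.(gif|jpg|png|css|js)$");
        filter.addRule("-[?*!@=]");
        filter.addRule("+^https?://");

        String[] urls = {
            "http://example.org/logo.gif",
            "http://example.org/page",
            "http://example.org/search?q=1"
        };
        for (String url : urls) {
            System.out.println(url + " -> " + filter.accept(url));
        }
        System.out.println("matcher calls: " + filter.matcherCalls());
    }
}
```

Under that assumption, 26 calls is the worst case for a URL rather than the fixed cost, so putting cheap rules that match often near the top of the file should lower the average.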

Jon Shoberg wrote:

> I'm using the whole-web crawl strategy with Nutch 0.7.
>
> I have 26 statements in regex-urlfilter.txt.
>
> There are 6 regexes in regex-normalize.xml.
>
> -j



Re: Faster UpdateDB

Jon Shoberg
Correct,

    But these rules are core to my crawling process: they exclude certain
URLs and keep the index (as intended for searching) from going to crap.

-j

Gal Nitzan wrote:

> Hi Jon,
>
> If I understand it correctly, that is 26 regex matcher calls for each URL!
>
> Gal




Little help with MapRed code

Gal Nitzan
In reply to this post by Gal Nitzan
Hi,

I have developed a small plugin based on URLFilter. It works fine for
0.7 and 0.8-dev.

However, it does not work in MapRed.

The basic idea is this: URLFilters.java has only one static method,
filter(url).

I added a static method called cleanup to URLFilters, which calls finalize
on all the plugins it contains.

Where could I call the cleanup method of URLFilters from? (It should be
between the map and reduce.)

Thanks,

Gal
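
The design described above can be sketched as follows. Everything here is hypothetical: FilterRegistrySketch stands in for URLFilters, and CleanableFilter is an invented interface with an explicit cleanup() method, used in place of finalize so the hook can be called directly. The sketch shows only the static filter/cleanup pair; it does not say where MapRed should invoke cleanup, which the post leaves open.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for URLFilters; not the Nutch class.
final class FilterRegistrySketch {

    /** Invented plugin contract: filter a URL, release resources on cleanup. */
    interface CleanableFilter {
        String filter(String url); // returns the URL to keep it, null to drop it
        void cleanup();
    }

    private static final List<CleanableFilter> FILTERS = new ArrayList<>();

    static synchronized void register(CleanableFilter filter) {
        FILTERS.add(filter);
    }

    /** Runs the URL through every filter and stops at the first rejection. */
    static synchronized String filter(String url) {
        for (CleanableFilter f : FILTERS) {
            if (url == null) {
                break;
            }
            url = f.filter(url);
        }
        return url;
    }

    /** Calls cleanup() on every registered filter, then forgets them. */
    static synchronized void cleanup() {
        for (CleanableFilter f : FILTERS) {
            f.cleanup();
        }
        FILTERS.clear();
    }

    public static void main(String[] args) {
        final int[] cleaned = {0};
        register(new CleanableFilter() {
            public String filter(String url) {
                return url.startsWith("http://") ? url : null;
            }
            public void cleanup() {
                cleaned[0]++;
            }
        });
        System.out.println(filter("http://example.org/"));
        System.out.println(filter("ftp://example.org/"));
        cleanup();
        System.out.println("cleanup calls: " + cleaned[0]);
    }
}
```

An explicit method makes the cleanup deterministic: finalize runs only when the garbage collector decides to, so a plugin that needs to flush state at a known point is easier to reason about with a hook it controls.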