[ANNOUNCE] New Nutch committer and PMC - Furkan Kamaci

Sebastian Nagel
Dear all,

It is my pleasure to announce that Furkan Kamacı has joined
the Nutch team as a committer and PMC member. Furkan, please
feel free to introduce yourself and to tell the Nutch community
about your interests and your relation to Nutch.

Congratulations and welcome on board!

Regards,
Sebastian (on behalf of the Nutch PMC)

Re: crawlDb speed around deduplication

Michael Coffey
Thanks, I will do some testing with $commonOptions applied to dedup. I suspect that the dedup-update is not compressing its output. Any easy way to check for just that?



Hi Michael,

both "crawldb" jobs are similar - they merge status information into the CrawlDb: fetch status
and newly found links, resp. detected duplicates. There are two situations in which I could see
the second job taking longer:
- if there are many duplicates, significantly more than status updates and additions in the
  preceding updatedb job
- if the CrawlDb has grown significantly (the preceding updatedb added many new URLs)

But you're right, I can see no reason why $commonOptions is not used for the dedup job.
Please open an issue on https://issues.apache.org/jira/browse/NUTCH; it should also be checked
for the other jobs which are not run with $commonOptions.
If possible, please test whether running the dedup job with the common options fixes your
problem. That's easily done: just edit src/bin/crawl and run "ant runtime".

Thanks,
Sebastian

On 04/28/2017 02:54 AM, Michael Coffey wrote:
> In the standard crawl script, there is a _bin_nutch updatedb command and, soon after that,
> a _bin_nutch dedup command. Both of them launch hadoop jobs with "crawldb /path/to/crawl/db"
> in their names (in addition to the actual deduplication job).
> In my situation, the "crawldb" job launched by dedup takes twice as long as the one launched
> by updatedb. Why should that be? Is it doing something different?
> I notice that the script passes $commonOptions to updatedb but not to dedup.
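For reference, the kind of change Sebastian suggests might look roughly like this in
src/bin/crawl (a sketch only; the exact lines, the _bin_nutch wrapper name, and the variable
names differ between Nutch versions):

    # src/bin/crawl (sketch; exact lines and names differ by Nutch version)
    #
    # updatedb already receives the common Hadoop options:
    #   __bin_nutch updatedb $commonOptions "$CRAWL_PATH"/crawldb "$CRAWL_PATH"/segments/$SEGMENT
    #
    # dedup currently does not; adding $commonOptions lets it pick up the same
    # -D settings (map output compression, number of reducers, ...):
    __bin_nutch dedup $commonOptions "$CRAWL_PATH"/crawldb

    # then, from the Nutch source root, rebuild so the edited script is deployed:
    ant runtime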

Re: crawlDb speed around deduplication

Sebastian Nagel
Hi Michael,

the easiest way is probably to check the actual job configuration as shown by the Hadoop resource
manager webapp (see screenshot). The webapp also indicates from where each configuration property was set.

Best,
Sebastian
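
If the web UI is not at hand, the JobHistory server's REST API exposes the same merged job
configuration; a minimal sketch (host, port, and job id below are placeholders):

    # Fetch the merged configuration of a finished job from the MapReduce
    # JobHistory server and look at the output compression properties.
    # Host, port (19888 is the usual default) and the job id are placeholders.
    curl -s -H 'Accept: application/json' \
      "http://historyserver.example.com:19888/ws/v1/history/mapreduce/jobs/job_1493000000000_0042/conf" \
      | python -m json.tool \
      | grep -B1 -A3 'fileoutputformat.compress'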


Re: crawlDb speed around deduplication

Michael Coffey
Good tip!

The compression topic is interesting because we spend a lot of time reading and writing files.

For the dedup-crawldb job, I have:

mapreduce.map.output.compress  = true (from command line)
mapreduce.map.output.compress.codec = org.apache.hadoop.io.compress.DefaultCodec
mapreduce.output.fileoutputformat.compress = false (from default)
mapreduce.output.fileoutputformat.compress.codec = org.apache.hadoop.io.compress.DefaultCodec
mapreduce.output.fileoutputformat.compress.type = RECORD


For the linkdb-merge job, it is the same except:
mapreduce.output.fileoutputformat.compress = true


But the jobhistory config page does not say where mapreduce.output.fileoutputformat.compress = true was set.
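
One low-tech way to narrow that down (a sketch - the paths are assumptions for a typical
Hadoop/Nutch install, and a property set programmatically by the job itself will not show up
here) is to grep the client-side config files that feed the job:

    # Which config file defines mapreduce.output.fileoutputformat.compress?
    # HADOOP_CONF_DIR and NUTCH_HOME are placeholders; adjust to your install.
    for f in "${HADOOP_CONF_DIR:-/etc/hadoop/conf}"/*.xml \
             "$NUTCH_HOME"/runtime/local/conf/*.xml; do
      grep -l 'mapreduce.output.fileoutputformat.compress' "$f" 2>/dev/null
    done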


Anyway, I'm thinking of using bzip2 for the final output compression. Anybody know a reason I shouldn't try that?


Re: crawlDb speed around deduplication

Sebastian Nagel
Hi Michael,

> Anyway, I'm thinking of using bzip2 for the final output compression. Anybody know a reason I
> shouldn't try that?

That's not a bad choice.

But more importantly: the records (CrawlDatum objects) in the CrawlDb are small,
so you should set

  mapreduce.output.fileoutputformat.compress.type = BLOCK

and use it in combination with a splittable codec (BZip2 is splittable).

I haven't played with the compression options for LinkDb, but BLOCK may also be worth a try.
If in doubt, you need to find a good balance between CPU and I/O usage for your hardware.
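
Put together, the suggestion translates into Hadoop options along these lines, passed as -D
options to the nutch commands or folded into $commonOptions in the crawl script (a sketch;
$NUTCH_HOME, $CRAWL_PATH and $SEGMENT are placeholders):

    # Block-compressed output with a splittable codec (sketch).
    compressOptions="-D mapreduce.output.fileoutputformat.compress=true \
      -D mapreduce.output.fileoutputformat.compress.type=BLOCK \
      -D mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.BZip2Codec"

    "$NUTCH_HOME"/bin/nutch updatedb $compressOptions "$CRAWL_PATH"/crawldb "$CRAWL_PATH"/segments/$SEGMENT
    "$NUTCH_HOME"/bin/nutch dedup $compressOptions "$CRAWL_PATH"/crawldb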

Best,
Sebastian


RE: crawlDb speed around deduplication

Markus Jelsma-2
Hello - if you have slow disks but plenty of CPU power, bzip2 would be a good choice. Otherwise gzip is probably a more suitable candidate.
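
A rough way to gauge that trade-off on your own hardware is to time both command-line codecs on
a sample of existing output (only an approximation of what the Hadoop codecs do; the paths are
placeholders, and if the sample is already record-compressed the numbers are only indicative):

    # Pull a sample part file from the CrawlDb and compare CPU time and size.
    hdfs dfs -get /path/to/crawl/db/current/part-r-00000/data sample.seq

    time gzip  -c sample.seq > sample.seq.gz
    time bzip2 -c sample.seq > sample.seq.bz2
    ls -lh sample.seq sample.seq.gz sample.seq.bz2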

Markus

 
 