plugins in job file.

Stefan Groschupf-2
Hi,

I'm wondering why the plugins are in the job file, since it looks  
like the plugins are never loaded from the job file but from the  
outside (plugin folder).
Should they?

Thanks for any thoughts.
Stefan

Re: plugins in job file.

Stack-6
Stefan Groschupf wrote:
> Hi,
>
> I'm wondering why the plugins are in the job file, since it looks like
> the plugins are never loaded from the job file but from the outside
> (plugin folder).
> Should they?

If running your job jar on a pure hadoop platform, there are no plugins
on local disk.  The job jar needs to carry all it needs to run.

If you have nutch everywhere on your cluster, there will be plugins on
disk and plugins in your job jar.  Which gets favored should just be a
matter of the CLASSPATH when the child runs: the first plugin found wins
(it looks like those on disk will be found first, going by the TaskRunner
classpath).

In the past, I've had some trouble trying to load extra plugins and
overrides of plugins already present in the nutch default 'plugins'
directory.  At the time, naming the plugins directory in my job jar
something other than 'plugins' -- e.g. 'xtra-plugins' -- and then adding
it to the plugins.include property in the configuration loaded into my
job jar, AHEAD of the default 'plugins' directory, got me further.
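
For illustration, here is a minimal nutch-site.xml-style override along the
lines described above.  This is a hedged sketch, not the poster's actual
configuration: 'xtra-plugins' is just the example directory name used above,
the plugin.folders/plugin.includes property names are the ones in Nutch's
default configuration, and the plugin.includes value (including
'my-extra-plugin') is only an example -- start from your own version's
default.

    <!-- Sketch: extra plugin directory listed ahead of the default one,
         packaged into the configuration carried by the job jar. -->
    <property>
      <name>plugin.folders</name>
      <value>xtra-plugins,plugins</value>
    </property>

    <property>
      <name>plugin.includes</name>
      <!-- Example value only: extend your version's default regex so the
           plugin ids shipped in xtra-plugins are activated. -->
      <value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)|my-extra-plugin</value>
    </property>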

Nowadays, I build a job jar that picks and chooses, from multiple
plugin sources, the plugins I need, aggregating them under a plugin dir
in the job jar.  The resultant job jar is run on a pure hadoop rather
than nutch platform.

St.Ack


RE: plugins in job file.

Monu Ogbe-2
Many thanks, Michael,

It makes sense that in a pure hadoop environment, where nutch has not
been distributed to the tasktracker machines, there needs to be a
method to pass configurations and plugins to them.  Thus, I can begin to
understand why the hadoop code would need to prioritise the sources of
these configs and plugins.  My brain still aches as to why this should
apply in our case (where nutch and the configs HAVE been distributed to
the tasktrackers), but I'm willing to accept that it is so, and have
compiled and distributed the job file!

As with yesterday, I'm out at meetings all day today, but will be able
to report "our" progress :).  I will also clear an extra 10 hours with
the bosses, although even if they decline I will be good for it.

Sadly my 2.5m fetch failed during the reduce phase, which raises the
priority of having tools for combining dbs and segments (if you have
time on your hands today :)).

I'll be back in the evening with any news.

Thanks for all your support, and talk to you later,

Monu

-----Original Message-----
From: Michael Stack [mailto:[hidden email]]
Sent: 04 May 2006 20:30
To: [hidden email]
Subject: Re: plugins in job file.




to count the number of pages from each domain

Anton Potekhin
We tried to develop a solution to count the number of pages from each
domain.

We thought to do it like this:

- map had the following input: k - UTF8 (url of the page), v - CrawlDatum;
and the following output: k - UTF8 (domain of the page), v - UrlAndPage,
a structure implementing Writable that contains the url of the page and
its CrawlDatum.

- reduce had the following input: k - UTF8 (domain of the page), v - an
iterator over a list of UrlAndPage; its output was k - UTF8 (url of the
page), v - CrawlDatum.

- in the map function we parsed the domain from the url, created the
UrlAndPage structure and put the pair into the OutputCollector.

- in reduce we counted how many elements were behind the iterator, put
that count into each CrawlDatum, then formed new (url, CrawlDatum) pairs
and put them into the OutputCollector.
 

The following problem arose: as far as we can see, the input and output
types of map and reduce should be the same, but in our case they were
different, and it caused an error like this:

060505 183200 task_0104_m_000000_3 java.lang.RuntimeException: java.lang.InstantiationException: org.apache.nutch.crawl.PostUpdateFilter$UrlAndPage
060505 183200 task_0104_m_000000_3      at org.apache.hadoop.mapred.JobConf.newInstance(JobConf.java:366)
060505 183200 task_0104_m_000000_3      at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:45)
060505 183200 task_0104_m_000000_3      at org.apache.hadoop.mapred.MapTask.run(MapTask.java:129)
060505 183200 task_0104_m_000000_3      at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:755)
060505 183200 task_0104_m_000000_3 Caused by: java.lang.InstantiationException: org.apache.nutch.crawl.PostUpdateFilter$UrlAndPage
060505 183200 task_0104_m_000000_3      at java.lang.Class.newInstance0(Class.java:335)
060505 183200 task_0104_m_000000_3      at java.lang.Class.newInstance(Class.java:303)
060505 183200 task_0104_m_000000_3      at org.apache.hadoop.mapred.JobConf.newInstance(JobConf.java:364)
 

We decided that in hadoop it is impossible to have different input/output
types for map and reduce, so we moved to another scheme.  This scheme
runs two jobs: the first job has the map function, the second job has the
reduce task, and the two jobs have different classes for their input and
output parameters.  The new map and reduce will do the same as described
above.

 

 

We'd like to ask your advice on which way is best for tasks like these. Is
the second way good? Are there any other ways to do this better?





Re: to count the number of pages from each domain

Andrzej Białecki-2
[hidden email] wrote:
> We decided that it is impossible in hadoop to have different input/output
> types for map and reduce. Then we decided to use another scheme. This scheme
> assumes to run two jobs. First job has map function, second job has reduce
> task. These jobs have different classes for input and output parameters. New
> map and reduce will do the same as described above.  
>  

You can use ObjectWritable to pass any type of Writable inside it. This
way you can mix/match different input/output types easily. The overhead
of this wrapping is probably still smaller than submitting another job
just to change the types...

Please take a look at Indexer.java, where this trick is used.
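
For illustration, here is a hedged sketch of that wrapping applied to the
per-domain counting job described earlier.  It is written against the classic
org.apache.hadoop.mapred API (a later form than the 2006-era interfaces in
the thread); DomainCounter is an invented name, UrlAndPage is the structure
sketched after the stack trace above, and stashing the count in CrawlDatum's
score field is purely for illustration.

    import java.io.IOException;
    import java.net.URL;
    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.List;

    import org.apache.hadoop.io.ObjectWritable;
    import org.apache.hadoop.io.UTF8;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;
    import org.apache.nutch.crawl.CrawlDatum;

    public class DomainCounter {

      // Map: key pages by domain; wrap the heterogeneous value in an
      // ObjectWritable so the map output type can differ from the job's
      // final output type within a single job.
      public static class Map extends MapReduceBase
          implements Mapper<UTF8, CrawlDatum, UTF8, ObjectWritable> {
        public void map(UTF8 key, CrawlDatum value,
                        OutputCollector<UTF8, ObjectWritable> output,
                        Reporter reporter) throws IOException {
          String domain = new URL(key.toString()).getHost();
          // UrlAndPage: the Writable sketched earlier in the thread.
          output.collect(new UTF8(domain),
                         new ObjectWritable(new UrlAndPage(key, value)));
        }
      }

      // Reduce: buffer the values to get the per-domain count, then emit
      // (url, CrawlDatum) pairs again.
      public static class Reduce extends MapReduceBase
          implements Reducer<UTF8, ObjectWritable, UTF8, CrawlDatum> {
        public void reduce(UTF8 key, Iterator<ObjectWritable> values,
                           OutputCollector<UTF8, CrawlDatum> output,
                           Reporter reporter) throws IOException {
          List<UrlAndPage> pages = new ArrayList<UrlAndPage>();
          while (values.hasNext()) {
            pages.add((UrlAndPage) values.next().get());
          }
          for (UrlAndPage page : pages) {
            CrawlDatum datum = page.getDatum();
            datum.setScore(pages.size());   // illustration: store the count
            output.collect(page.getUrl(), datum);
          }
        }
      }
    }

The job configuration would then declare ObjectWritable as the map output
value class and CrawlDatum as the final output value class (in the classic
API, JobConf.setMapOutputValueClass and JobConf.setOutputValueClass).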

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Merging segments

Chris Fellows-3
Hello,

So the last discussion on merging segments was back in
Jan. Has there been any progress in this direction?
What would be the benefit of being able to merge
segments? Would being able to merge segments open up
new functionality options, or is merging just a
convenience? Also, what's the estimate for how involved
merge functionality development is?

Regards,

- Chris


Re: Merging segments

Andrzej Białecki-2
Chris Fellows wrote:

> Hello,
>
> So the last discussion on merging segments was back in
> Jan. Has there been any progress in this direction?
> What would be the benefit of being able to merge
> segments? Would being able to merge segments open up
> new functionality options, or is merging just a
> convenience? Also, what's the estimate for how involved
> merge functionality development is?
>  

Relief is on the way. Fine folks at houxou.com have sponsored the
development of a brand-new SegmentMerger + slicer, and decided to donate
it to the project - big thanks!

I'm running some final tests, and will commit it today/tomorrow.

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Merging segments

Chris Fellows-3
That's great.

Well, my follow-up to that, then, is:

Will the new tool allow any form of "diff'ing"
segments? In practice this would allow you to run a
crawl on a series of sites one week, run another
crawl on the same sites a week or so later, then diff
the segments and allow users to search on changes
within the search domain.



Re: Merging segments

Andrzej Białecki-2
Chris Fellows wrote:
> That's great.
>
> Well, my follow up to that then is:
>
> Will the new tool allow any form of "diff'ing"
> segments? In practice this would allow you to run a
>  

No, it does only two things - merging and slicing. That's already one
too many... ;)

> crawl on a series of sites one week. Then run another
> crawl on the same sites a week or so later. Diff the
> segments and allow users to search on changes within
> the search domain.
>  

Interesting concept, but I think it would be better implemented as a
variant of de-duplication, rather than segment content manipulation.

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com