How can I get one plugin's root dir


How can I get one plugin's root dir

scott green
Hi,

I need to load some resources from my plugin's subdirectory. Is there
any available method to get a specified plugin's root directory?
Thanks

- scott

Re: How can I get one plugin's root dir

scott green
Can someone give an answer? I don't think it is a good idea to put all
configuration/resources under the "conf" dir.


Re: How can I get one plugin's root dir

Andrzej Białecki-2
Scott Green wrote:
> Can someone give an answer? I don't think it is a good idea to put all
> configuration/resources under the "conf" dir.
>
> On 1/15/07, Scott Green <[hidden email]> wrote:
>> Hi,
>>
>> I need to load some resources from my plugin's subdirectory. Is there
>> any available method to get a specified plugin's root directory?
>> Thanks

You need to make sure that this resource is packaged into the plugin jar
(just see how it's done in other plugins). Then you should be able to
access it through the ClassLoader that loaded the plugin, e.g.

package a.b.c;

import java.io.InputStream;

public class MyPlugin {
    // The resource must sit in the plugin jar next to MyPlugin.class.
    InputStream is = MyPlugin.class.getResourceAsStream("myResource.txt");
}
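
Since Class.getResourceAsStream is the key call here, a note on its name
resolution may help. The demo below is not Nutch-specific: it uses
java.lang.String only because its .class file is guaranteed to be loadable
from any JVM. A relative name is resolved against the class's own package,
while a leading "/" resolves from the classpath root.

```java
import java.io.IOException;
import java.io.InputStream;

public class ResourceLookupDemo {

    // True if the class's loader can locate the named resource.
    static boolean canLoad(Class<?> c, String name) {
        try (InputStream is = c.getResourceAsStream(name)) {
            return is != null;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Relative name: resolved inside java.lang, String's package.
        System.out.println(canLoad(String.class, "String.class"));
        // Absolute name: resolved from the classpath root.
        System.out.println(canLoad(String.class, "/java/lang/String.class"));
        // Missing resources simply yield null, not an exception.
        System.out.println(canLoad(String.class, "no-such-resource.txt"));
    }
}
```

So in the snippet above, "myResource.txt" must sit in the jar under the same
package directory as MyPlugin.class (a/b/c/), or be referenced with an
absolute "/..." name.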

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: How can I get one plugin's root dir

Dennis Kubes
In reply to this post by scott green
You can get the PluginRepository, from it the plugin descriptor, and from
the descriptor the plugin's path; resources then live inside that plugin
folder.  Replace parse-html with your own plugin id.

     Configuration conf = NutchConfiguration.create();
     PluginRepository rep = PluginRepository.get(conf);
     PluginDescriptor desc = rep.getPluginDescriptor("parse-html");
     String path = desc.getPluginPath();
     System.out.println(path);


Dennis Kubes


Re: How can I get one plugin's root dir

scott green
In reply to this post by Andrzej Białecki-2
Hi,

I want to propose a slightly cleaner plugin directory structure:

xxx-plugin
           `------ lib
           `------ conf
           `------ src
           `------ web (only for web plugin)
           `------ plugin.xml
           `------ build.xml

Take the urlfilter-regex plugin as an example: its configuration file
"regex-urlfilter.txt" would be put in the conf/ dir. Does this make
sense?


Re: How can I get one plugin's root dir

scott green
In reply to this post by Dennis Kubes
Thanks Dennis! Your method should work.

And I really hope there is a direct method, say getPluginRootDir(),
in the plugin implementation.



Re: How can I get one plugin's root dir

Sami Siren-2
Scott Green wrote:
> Thanks Dennis! Your method should work.
>
> And I really hope there is a direct method, say getPluginRootDir(),
> in the plugin implementation.

I'd recommend taking the path shown by Andrzej, because IMO it's bad design
for a plugin to depend on the plugin system.

--
 Sami Siren



Re: How can I get one plugin's root dir

scott green
Hi Sami

On 1/16/07, Sami Siren <[hidden email]> wrote:
> Scott Green wrote:
> > Thanks Dennis! Your method should work.
> >
> > And I really hope there is a direct method, say getPluginRootDir(),
> > in the plugin implementation.
>
> I'd recommend taking the path shown by Andrzej, because IMO it's bad design
> for a plugin to depend on the plugin system.

I am not quite clear on your reasoning.

The getPluginRootDir() method mentioned above would expose the absolute
path of xxx-plugin in the example below.

plugins
  `-xxx-plugin
          `------ lib
          `------ conf
          `------ src
          `------ web (only for web plugin)
          `------ plugin.xml
          `------ build.xml

Andrzej's idea is limited(?) since I cannot get resources from the conf dir.



Re: How can I get one plugin's root dir

Andrzej Białecki-2
Scott Green wrote:

> Hi Sami
>
> On 1/16/07, Sami Siren <[hidden email]> wrote:
>> Scott Green wrote:
>> > Thanks Dennis! Your method should work.
>> >
>> > And I really hope there is a direct method, say getPluginRootDir(),
>> > in the plugin implementation.
>>
>> I'd recommend taking the path shown by Andrzej, because IMO it's bad design
>> for a plugin to depend on the plugin system.
>
> I am not quite clear on your reasoning.
>
> The getPluginRootDir() method mentioned above would expose the absolute
> path of xxx-plugin in the example below.
>
> plugins
>  `-xxx-plugin
>          `------ lib
>          `------ conf
>          `------ src
>          `------ web (only for web plugin)
>          `------ plugin.xml
>          `------ build.xml

Ok. Now imagine that all plugins are packed together in a Jar file (as
is the case with Nutch). Is your method still going to work? Nope.
getPluginRootDir() may still return some non-null value (not sure about
that), but the resources are not available as files because they are
packed into a Jar.
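
The "not available as files" point can be seen directly in plain Java: a
resource that lives inside an archive has a non-file: URL, so there is no
path you could hand to java.io.File. The sketch below uses java.lang.String
as a stand-in for a class packed into a jar (on modern JDKs its protocol is
"jrt", on JDK 8 it is "jar" - either way, not "file").

```java
import java.net.URL;

public class JarResourceUrlDemo {
    public static void main(String[] args) {
        // String.class is packed inside the runtime image / rt.jar,
        // so its resource URL is not a plain file: URL.
        URL url = String.class.getResource("String.class");
        System.out.println(url.getProtocol()); // e.g. "jrt" (or "jar" on JDK 8)
    }
}
```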

Now, you may have tested your method and found that it does indeed work
- but the reason is a bit obscure: the bin/nutch and bin/hadoop scripts
add your build/ directory to the classpath, so that you can locally test
the latest versions of the code without creating the *.job file.
However, when you run your code on a Hadoop cluster your local build/
directory is no longer accessible, and your method will mysteriously
fail - or even worse, you may get a different version of a resource from
an older version of the build/ directory found on Hadoop tasktracker
nodes ...

>
> Andrzej's idea is limited(?) since i cannot get resources from conf dir.

Absolutely not - that's how the whole Configuration system works.
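
As a sketch of that route - assuming Hadoop's
Configuration.getConfResourceAsReader, which resolves a named file through
the classloader, covering both the local conf/ dir and resources packed
into the job jar; the file name here is just an example:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.util.NutchConfiguration;

public class ConfResourceDemo {
    public static void main(String[] args) throws IOException {
        Configuration conf = NutchConfiguration.create();
        // Looks the file up on the classpath (conf/ locally, the job
        // jar on a cluster) - no file-system path involved.
        Reader raw = conf.getConfResourceAsReader("regex-urlfilter.txt");
        try (BufferedReader reader = new BufferedReader(raw)) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line); // one filter rule per line
            }
        }
    }
}
```

This is the same mechanism the stock URL-filter plugins use to find their
rule files in both local and distributed mode.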





Re: How can I get one plugin's root dir

scott green
On 1/16/07, Andrzej Bialecki <[hidden email]> wrote:

> Ok. Now imagine that all plugins are packed together in a Jar file (as
> is the case with Nutch). Is your method still going to work? Nope.
> getPluginRootDir() may still return some non-null value (not sure about
> that), but the resources are not available as files because they are
> packed into a Jar.

Well, why should all resources need to be packed?
The build result might look like:

xxx-plugin
  `--- conf
  `--- web
  `--- xxx-plugin.jar
  `--- deps.jar
  `-- plugin.xml

> Now, you may have tested your method and found that it does indeed work
> - but the reason is a bit obscure: the bin/nutch and bin/hadoop scripts
> add your build/ directory to the classpath, so that you can locally test
> the latest versions of the code without creating the *.job file.
> However, when you run your code on a Hadoop cluster your local build/
> directory is no longer accessible, and your method will mysteriously
> fail - or even worse, you may get a different version of a resource from
> an older version of the build/ directory found on Hadoop tasktracker
> nodes ...

If you pack everything into jar(s), isn't it possible that the jar on a
Hadoop tasktracker node is an old version?

> >
> > Andrzej's idea is limited(?) since I cannot get resources from the conf dir.
>
> Absolutely not - that's how the whole Configuration system works.

Re: How can I get one plugin's root dir

Andrzej Białecki-2
Scott Green wrote:
> Well, why should all resources need to be packed?

Because when you run Nutch on a Hadoop cluster, Hadoop requires that all
job resources be packed into a job JAR, which is then submitted to each
tasktracker as a part of the job. So, if you want to run in non-local
mode you have to build the nutch-xxx.job JAR ("ant job" target).

Apparently you are running in so-called "local" mode, where these issues
are quite muddy - but as soon as you try to execute it on a cluster your
method will stop working.


> The build result might look like:
>
> xxx-plugin
>  `--- conf
>  `--- web
>  `--- xxx-plugin.jar
>  `--- deps.jar
>  `-- plugin.xml

Again: in the "local" mode this may work, but these unpacked plugins are
not available for jobs executing on a Hadoop cluster.

>
>> Now, you may have tested your method and found that it does indeed work
>> - but the reason is a bit obscure: the bin/nutch and bin/hadoop scripts
>> add your build/ directory to the classpath, so that you can locally test
>> the latest versions of the code without creating the *.job file.
>> However, when you run your code on a Hadoop cluster your local build/
>> directory is no longer accessible, and your method will mysteriously
>> fail - or even worse, you may get a different version of a resource from
>> an older version of the build/ directory found on Hadoop tasktracker
>> nodes ...
>
> If you pack everything into jar(s), isn't it possible that the jar on a
> Hadoop tasktracker node is an old version?

No. The job jar is always up to date, because it is sent with every job.

But if you don't get the resources from this jar, and instead rely on
java.io.File-s, you may pick up some old cruft from the local build/
directory that you may have accidentally deployed to your tasktrackers ...




Re: How can I get one plugin's root dir

scott green
Thank you for the detailed explanation, Andrzej.

My plugin contains one language model (a configuration file) whose size
is 40 MB. Could you please suggest where the model file should be put?
 a) put it into the nutch/conf dir, like the "regex-urlfilter.txt" file
 b) put it into the plugin's jar package


Re: How can I get one plugin's root dir

Andrzej Białecki-2
Scott Green wrote:
> Thank you for the detailed explanation, Andrzej.
>
> My plugin contains one language model (a configuration file) whose size
> is 40 MB. Could you please suggest where the model file should be put?
> a) put it into the nutch/conf dir, like the "regex-urlfilter.txt" file
> b) put it into the plugin's jar package

From a purely theoretical point of view, either way should work fine -
the content of the conf/ dir is packed into the job jar too.

One comment though, and I hope I'm not confusing you too much ;) If the
file is that large, AND you execute your jobs using a
jobtracker/tasktrackers, AND you run on Hadoop DFS, you may want to do
exactly the opposite of what I advocated ;) I.e. keep this file in a
well-known external location on DFS, where it's accessible to all tasks.
You should also set its replication factor equal to the number of
datanodes, and then load the file directly from DFS. Still, you
wouldn't use java.io.File, but FileSystem.open(Path).
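
A sketch of that DFS route follows. The model path and the replication
factor of 10 are made-up placeholders; the calls are Hadoop's FileSystem
API (FileSystem.get, setReplication, open):

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DfsModelLoader {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical well-known DFS location for the 40 MB model.
        Path model = new Path("/models/language-model.bin");

        // Replication factor equal to the number of datanodes (say 10),
        // so every node can read the file from a local replica.
        fs.setReplication(model, (short) 10);

        // Read straight from DFS - FileSystem.open, not java.io.File.
        try (FSDataInputStream in = fs.open(model)) {
            byte[] header = new byte[16];
            in.readFully(header);
            // ... parse the model stream ...
        }
    }
}
```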

The reason is that if you pack this file into your job JAR, the job jar
becomes very large (presumably this 40 MB is already compressed?). The
job jar needs to be copied to each tasktracker for each task, so you
will take a performance hit just from the size of the job jar ...
whereas if this file sits on DFS and is highly replicated, its content
will always be available locally.




Re: How can I get one plugin's root dir

Doug Cutting
Andrzej Bialecki wrote:
> The reason is that if you pack this file into your job JAR, the job jar
> becomes very large (presumably this 40 MB is already compressed?). The
> job jar needs to be copied to each tasktracker for each task, so you
> will take a performance hit just from the size of the job jar ...
> whereas if this file sits on DFS and is highly replicated, its content
> will always be available locally.

Note that the job jar is copied into HDFS with a highish replication
(10?), and that it is only copied to each tasktracker node once per
*job*, not per task.  So it's only faster to manage this yourself if you
have a sequence of jobs that share this data, and if the time to
re-replicate it per job is significant.

Doug

Re: How can I get one plugin's root dir

scott green
Thanks Andrzej and Doug!

I will try both approaches in my later work and evaluate them.


Re: How can I get one plugin's root dir

Dennis Kubes
Scott,

I should have read your original post in more detail.  I was assuming
you were just trying to get the root directory of the plugin, not
loading resources during a MR job.  I would have to agree with Andrzej's
approach if this is to be used during a MR job.  Sorry for the confusion.

Dennis Kubes
