Plugins initialized all the time!

16 messages

Plugins initialized all the time!

Nicolás Lichtmaier
I'm having big trouble with Nutch 0.9 that I didn't have with 0.8. It seems
that the plugin repository initializes itself all the time, until I get
an out-of-memory exception. I've been reading the code... the plugin
repository maintains a map from Configuration to plugin repositories, but
the Configuration object has no equals or hashCode method...
wouldn't it be nice to add such methods (comparing property values)?
Wouldn't that help prevent initializing many plugin repositories? What
could be the cause of my problem? (Aaah... so many questions... =) )

Bye!
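
A tiny sketch of the behaviour being described, with a hypothetical `Conf` class standing in for Hadoop's Configuration (this is not the Nutch code): when a key class doesn't override equals()/hashCode(), a HashMap compares keys by identity, so every freshly created instance is a cache miss and a fresh "repository" gets created.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch (not the Nutch code): a key class that does not
// override equals()/hashCode() is compared by identity, so a map keyed on
// it treats every new instance as unseen -- the cache never hits.
public class CacheMissDemo {

    // Stands in for Hadoop's Configuration: same contents, no equals()/hashCode().
    static class Conf {
        final String props = "identical contents";
    }

    static int distinctEntriesAfter(int lookups) {
        Map<Conf, Object> cache = new HashMap<>();
        for (int i = 0; i < lookups; i++) {
            // Each job hands us a brand-new Conf, so this is always a miss.
            cache.putIfAbsent(new Conf(), new Object());
        }
        return cache.size();
    }

    public static void main(String[] args) {
        System.out.println(distinctEntriesAfter(5)); // 5 entries, not 1
    }
}
```
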

Re: Plugins initialized all the time!

Nicolás Lichtmaier

More info...

I see "map" progressing from 0% to 100%. It seems to reload plugins when
reaching 100%. Besides, I've realized that each NutchJob is a
Configuration, so (as there's no "equals") a plugin repo would be
created for each NutchJob...


Re: Plugins initialized all the time!

Doğacan Güney-3
In reply to this post by Nicolás Lichtmaier
Hi,

On 5/28/07, Nicolás Lichtmaier <[hidden email]> wrote:
> I'm having big trouble with Nutch 0.9 that I didn't have with 0.8. It seems
> that the plugin repository initializes itself all the time, until I get
> an out-of-memory exception. I've been reading the code... the plugin
> repository maintains a map from Configuration to plugin repositories, but
> the Configuration object has no equals or hashCode method...
> wouldn't it be nice to add such methods (comparing property values)?
> Wouldn't that help prevent initializing many plugin repositories? What
> could be the cause of my problem? (Aaah... so many questions... =) )

Which job causes the problem? Perhaps, we can find out what keeps
creating a conf object over and over.

Also, I have tried what you have suggested (better caching for plugin
repository) and it really seems to make a difference. Can you try with
this patch(*) to see if it solves your problem?

(*) http://www.ceng.metu.edu.tr/~e1345172/plugin_repository_cache.patch

>
> Bye!
>


--
Doğacan Güney

Re: Plugins initialized all the time!

Briggs
I have also noticed this. The code explicitly loads an instance of the
plugins for every fetch (well, or parse, etc., depending on what you
are doing). This causes OutOfMemoryErrors. So, if you dump the heap,
you can see the filter classes get loaded and then never get unloaded
(they are loaded within their own classloader). So, you'll see the
same class loaded thousands of times, which is bad.

So, in my case, I had to change the way the plugins are loaded.
Basically, I changed all the main plugin loaders (like
URLFilters.java, IndexingFilters.java) to be singletons with a single
'getInstance()' method on each. I don't need special configs for
filters, so I can deal with singletons.

You'll find the heart of the problem somewhere in the extension point
class(es). It calls newInstance() an awful lot. But the classloader
(one per plugin) never gets destroyed, or something like that... this can be
nasty.

I'm still dealing with my OutOfMemory errors on parsing, yuck.
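
A minimal sketch of the singleton workaround described above. The name and shape are illustrative, not the actual Nutch classes; the real loaders (URLFilters, IndexingFilters) take a Configuration, so this only works if, as noted, no special per-job configs are needed.

```java
// Hypothetical sketch of the singleton workaround described in this message;
// the real Nutch plugin loaders take a Configuration argument instead.
public final class UrlFiltersSingleton {

    private static UrlFiltersSingleton instance;

    private UrlFiltersSingleton() {
        // load the filter plugins once, in one classloader, at first use
    }

    // One shared instance for the whole JVM, however many fetches run.
    public static synchronized UrlFiltersSingleton getInstance() {
        if (instance == null) {
            instance = new UrlFiltersSingleton();
        }
        return instance;
    }
}
```
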







--
"Conscious decisions by conscious minds are what make reality real"

Re: Plugins initialized all the time!

Doğacan Güney-3
On 5/29/07, Briggs <[hidden email]> wrote:

> I have also noticed this. The code explicitly loads an instance of the
> plugins for every fetch (well, or parse, etc., depending on what you
> are doing). This causes OutOfMemoryErrors. So, if you dump the heap,
> you can see the filter classes get loaded and then never get unloaded
> (they are loaded within their own classloader). So, you'll see the
> same class loaded thousands of times, which is bad.
>
> So, in my case, I had to change the way the plugins are loaded.
> Basically, I changed all the main plugin loaders (like
> URLFilters.java, IndexingFilters.java) to be singletons with a single
> 'getInstance()' method on each. I don't need special configs for
> filters, so I can deal with singletons.
>
> You'll find the heart of the problem somewhere in the extension point
> class(es). It calls newInstance() an awful lot. But the classloader
> (one per plugin) never gets destroyed, or something like that... this can be
> nasty.
>
> I'm still dealing with my OutOfMemory errors on parsing, yuck.

Well, then, can you test the patch too? Nicolás's idea seems to be the
right one. After this patch, I think plugin loaders will see the same
PluginRepository instance.



--
Doğacan Güney

Re: Plugins initialized all the time!

Briggs
I'll have to get around to trying this in the future. I have already
'forked' the code, but would like to get back on track too. So, I
guess I will post something someday. The plugin part is now the
least of my worries; again, the parsing is what is killing me now. I
don't use Nutch in the 'out-of-the-box' fashion. My app is running in
a container that crawls when messages to crawl are received.



--
"Conscious decisions by conscious minds are what make reality real"

Re: Plugins initialized all the time!

Nicolás Lichtmaier
In reply to this post by Doğacan Güney-3

> Which job causes the problem? Perhaps, we can find out what keeps
> creating a conf object over and over.
>
> Also, I have tried what you have suggested (better caching for plugin
> repository) and it really seems to make a difference. Can you try with
> this patch(*) to see if it solves your problem?
>
> (*) http://www.ceng.metu.edu.tr/~e1345172/plugin_repository_cache.patch

Some comments about your patch. The approach seems nice: you only check
the parameters that affect plugin loading. But bear in mind that the
plugins themselves will configure themselves with many other parameters,
so to keep things safe there should be a PluginRepository for each set
of parameters (including all of them). Besides, remember that CACHE is a
WeakHashMap and you are creating ad-hoc PluginProperty objects as keys;
something doesn't look right... the lifespan of those objects will be
much shorter than you require. Perhaps you should be using
SoftReferences instead, or a simple LRU cache (LinkedHashMap provides
that easily).
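
A small sketch of the WeakHashMap pitfall being pointed out, with hypothetical names (this is not the patch's code): an entry whose key is referenced nowhere else may be collected at any GC, so a cache keyed on ad-hoc objects can empty itself behind your back.

```java
import java.util.Map;
import java.util.WeakHashMap;

// Hypothetical sketch (not the patch's code) of the WeakHashMap pitfall:
// an entry whose key is referenced nowhere else may be collected at any
// GC, so a cache keyed on throwaway objects can silently lose entries.
public class WeakKeyDemo {

    // Put one entry into a WeakHashMap, hint a GC, and report the size.
    static int sizeAfterGc(Object heldKey) {
        Map<Object, String> cache = new WeakHashMap<>();
        cache.put(heldKey != null ? heldKey : new Object(), "plugin repository");
        System.gc();                 // a hint; collection is not guaranteed
        int size = cache.size();
        if (heldKey != null) {
            heldKey.hashCode();      // keeps the caller's key strongly reachable
        }
        return size;
    }

    public static void main(String[] args) {
        Object held = new Object();
        System.out.println(sizeAfterGc(held)); // strongly held key: entry survives
        System.out.println(sizeAfterGc(null)); // ad-hoc key: entry may already be gone
    }
}
```
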

Anyway, I'll try to build my own Nutch to test your patch.

Thanks!


Re: Plugins initialized all the time!

Nicolás Lichtmaier
In reply to this post by Doğacan Güney-3

>> I'm having big trouble with Nutch 0.9 that I didn't have with 0.8. It seems
>> that the plugin repository initializes itself all the time, until I get
>> an out-of-memory exception. I've been reading the code... the plugin
>> repository maintains a map from Configuration to plugin repositories, but
>> the Configuration object has no equals or hashCode method...
>> wouldn't it be nice to add such methods (comparing property values)?
>> Wouldn't that help prevent initializing many plugin repositories? What
>> could be the cause of my problem? (Aaah... so many questions... =) )
>
> Which job causes the problem? Perhaps, we can find out what keeps
> creating a conf object over and over.
>
> Also, I have tried what you have suggested (better caching for plugin
> repository) and it really seems to make a difference. Can you try with
> this patch(*) to see if it solves your problem?
>
> (*) http://www.ceng.metu.edu.tr/~e1345172/plugin_repository_cache.patch

I'm running it. So far it's working ok, and I haven't seen all those
plugin loadings...

I've modified your patch though to define CACHE like this:

  private static final Map<PluginProperty, PluginRepository> CACHE =
      new LinkedHashMap<PluginProperty, PluginRepository>() {
        @Override
        protected boolean removeEldestEntry(
            Map.Entry<PluginProperty, PluginRepository> eldest) {
          return size() > 10;
        }
      };

...which means an LRU cache with a fixed size of 10.


Re: Plugins initialized all the time!

Doğacan Güney-3
In reply to this post by Nicolás Lichtmaier
Hi,

On 5/29/07, Nicolás Lichtmaier <[hidden email]> wrote:

>
> > Which job causes the problem? Perhaps, we can find out what keeps
> > creating a conf object over and over.
> >
> > Also, I have tried what you have suggested (better caching for plugin
> > repository) and it really seems to make a difference. Can you try with
> > this patch(*) to see if it solves your problem?
> >
> > (*) http://www.ceng.metu.edu.tr/~e1345172/plugin_repository_cache.patch
>
> Some comments about your patch. The approach seems nice: you only check
> the parameters that affect plugin loading. But bear in mind that the
> plugins themselves will configure themselves with many other parameters,
> so to keep things safe there should be a PluginRepository for each set
> of parameters (including all of them). Besides, remember that CACHE is a
> WeakHashMap and you are creating ad-hoc PluginProperty objects as keys;
> something doesn't look right... the lifespan of those objects will be
> much shorter than you require. Perhaps you should be using
> SoftReferences instead, or a simple LRU cache (LinkedHashMap provides
> that easily).

My patch is just a draft to see if we can create a better caching
mechanism. There are definitely some rough edges there :)

I don't really worry about the WeakHashMap -> LinkedHashMap change. Your
approach is simple and should be faster, so I guess it's OK.

You are right about per-plugin parameters, but I think it will be very
difficult to keep the PluginProperty class in sync with plugin parameters.
I mean, if a plugin defines a new parameter, we have to remember to
update PluginProperty. Perhaps we can force plugins to declare the
configuration options they will use in, say, their plugin.xml file, but
that will be very error-prone too. I don't want to compare entire
configuration objects, because changing irrelevant options, like
fetcher.store.content, shouldn't force loading plugins again, though it
seems it may be inevitable...
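
A sketch of the kind of cache key being discussed, assuming the usual Nutch property names (`plugin.includes`, `plugin.excludes`, `plugin.folders`); the actual PluginProperty class in the patch may track a different set. Only plugin-related options participate in equals()/hashCode(), so configurations differing in unrelated settings map to the same repository.

```java
import java.util.Objects;
import java.util.Properties;

// Hypothetical sketch of the patch's cache key as described (the real
// PluginProperty class may differ): only plugin-related properties take
// part in equals()/hashCode(), so changing an unrelated option such as
// fetcher.store.content does not produce a new key.
public class PluginKey {
    private final String includes;
    private final String excludes;
    private final String folders;

    public PluginKey(Properties conf) {
        this.includes = conf.getProperty("plugin.includes", "");
        this.excludes = conf.getProperty("plugin.excludes", "");
        this.folders = conf.getProperty("plugin.folders", "");
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof PluginKey)) {
            return false;
        }
        PluginKey p = (PluginKey) o;
        return includes.equals(p.includes)
            && excludes.equals(p.excludes)
            && folders.equals(p.folders);
    }

    @Override
    public int hashCode() {
        return Objects.hash(includes, excludes, folders);
    }
}
```
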



--
Doğacan Güney

Re: Plugins initialized all the time!

Andrzej Białecki-2
Doğacan Güney wrote:

> My patch is just a draft to see if we can create a better caching
> mechanism. There are definitely some rough edges there:)

One important piece of information: in future versions of Hadoop the method
Configuration.setObject() is deprecated and will then be removed, so we
have to grow our own caching mechanism anyway - either use a singleton
cache, or change nearly all APIs to pass around a user/job/task context.

So, we will face this problem pretty soon, with the next upgrade of Hadoop.



> You are right about per-plugin parameters but I think it will be very
> difficult to keep PluginProperty class in sync with plugin parameters.
> I mean, if a plugin defines a new parameter, we have to remember to
> update PluginProperty. Perhaps, we can force plugins to define
> configuration options it will use in, say, its plugin.xml file, but
> that will be very error-prone too. I don't want to compare entire
> configuration objects, because changing irrevelant options, like
> fetcher.store.content shouldn't force loading plugins again, though it
> seems it may be inevitable....

Let me see if I understand this ... In my opinion this is a non-issue.

Child tasks are started in separate JVMs, so the only "context"
information that they have is what they can read from job.xml (which is
a superset of all properties from config files + job-specific data +
task-specific data). This context is currently instantiated as a
Configuration object, and we (ab)use it also as a local per-JVM cache
for plugin instances and other objects.

Once we instantiate the plugins, they exist unchanged throughout the
lifecycle of JVM (== lifecycle of a single task), so we don't have to
worry about having different sets of plugins with different parameters
for different jobs (or even tasks).

In other words, it seems to me that there is no such situation in which
we have to reload plugins within the same JVM, but with different
parameters.

--
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Plugins initialized all the time!

Doğacan Güney-3
On 5/30/07, Andrzej Bialecki <[hidden email]> wrote:

> Doğacan Güney wrote:
>
> > My patch is just a draft to see if we can create a better caching
> > mechanism. There are definitely some rough edges there:)
>
> One important piece of information: in future versions of Hadoop the method
> Configuration.setObject() is deprecated and will then be removed, so we
> have to grow our own caching mechanism anyway - either use a singleton
> cache, or change nearly all APIs to pass around a user/job/task context.
>
> So, we will face this problem pretty soon, with the next upgrade of Hadoop.

Hmm, well, that sucks, but this is not really a problem for
PluginRepository: PluginRepository already has its own cache
mechanism.

>
>
>
> > You are right about per-plugin parameters, but I think it will be very
> > difficult to keep the PluginProperty class in sync with plugin parameters.
> > I mean, if a plugin defines a new parameter, we have to remember to
> > update PluginProperty. Perhaps we can force plugins to declare the
> > configuration options they will use in, say, their plugin.xml file, but
> > that will be very error-prone too. I don't want to compare entire
> > configuration objects, because changing irrelevant options, like
> > fetcher.store.content, shouldn't force loading plugins again, though it
> > seems it may be inevitable...
>
> Let me see if I understand this ... In my opinion this is a non-issue.
>
> Child tasks are started in separate JVMs, so the only "context"
> information that they have is what they can read from job.xml (which is
> a superset of all properties from config files + job-specific data +
> task-specific data). This context is currently instantiated as a
> Configuration object, and we (ab)use it also as a local per-JVM cache
> for plugin instances and other objects.
>
> Once we instantiate the plugins, they exist unchanged throughout the
> lifecycle of JVM (== lifecycle of a single task), so we don't have to
> worry about having different sets of plugins with different parameters
> for different jobs (or even tasks).
>
> In other words, it seems to me that there is no such situation in which
> we have to reload plugins within the same JVM, but with different
> parameters.

The problem is that someone might get a little too smart. For example, one
may write a new job that has two IndexingFilters but creates each from
completely different configuration objects, then filters some
documents with the first filter and others with the second. I agree
that this is a bit of a reach, but it is possible.




--
Doğacan Güney

Re: Plugins initialized all the time!

Doğacan Güney-3
On 5/30/07, Doğacan Güney <[hidden email]> wrote:

> On 5/30/07, Andrzej Bialecki <[hidden email]> wrote:
> > Doğacan Güney wrote:
> >
> > > My patch is just a draft to see if we can create a better caching
> > > mechanism. There are definitely some rough edges there:)
> >
> > One important piece of information: in future versions of Hadoop the method
> > Configuration.setObject() is deprecated and will then be removed, so we
> > have to grow our own caching mechanism anyway - either use a singleton
> > cache, or change nearly all APIs to pass around a user/job/task context.
> >
> > So, we will face this problem pretty soon, with the next upgrade of Hadoop.
>
> Hmm, well, that sucks, but this is not really a problem for
> PluginRepository: PluginRepository already has its own cache
> mechanism.
>
> >
> >
> >
> > > You are right about per-plugin parameters, but I think it will be very
> > > difficult to keep the PluginProperty class in sync with plugin parameters.
> > > I mean, if a plugin defines a new parameter, we have to remember to
> > > update PluginProperty. Perhaps we can force plugins to declare the
> > > configuration options they will use in, say, their plugin.xml file, but
> > > that will be very error-prone too. I don't want to compare entire
> > > configuration objects, because changing irrelevant options, like
> > > fetcher.store.content, shouldn't force loading plugins again, though it
> > > seems it may be inevitable...
> >
> > Let me see if I understand this ... In my opinion this is a non-issue.
> >
> > Child tasks are started in separate JVMs, so the only "context"
> > information that they have is what they can read from job.xml (which is
> > a superset of all properties from config files + job-specific data +
> > task-specific data). This context is currently instantiated as a
> > Configuration object, and we (ab)use it also as a local per-JVM cache
> > for plugin instances and other objects.
> >
> > Once we instantiate the plugins, they exist unchanged throughout the
> > lifecycle of JVM (== lifecycle of a single task), so we don't have to
> > worry about having different sets of plugins with different parameters
> > for different jobs (or even tasks).
> >
> > In other words, it seems to me that there is no such situation in which
> > we have to reload plugins within the same JVM, but with different
> > parameters.
>
> The problem is that someone might get a little too smart. For example, one
> may write a new job that has two IndexingFilters but creates each from
> completely different configuration objects, then filters some
> documents with the first filter and others with the second. I agree
> that this is a bit of a reach, but it is possible.

Actually, thinking a bit further into this, I kind of agree with you. I
initially thought that the best approach would be to change
PluginRepository.get(Configuration) to PluginRepository.get(), where
get() just creates a configuration internally and initializes itself
with it. But then we wouldn't be passing JobConf to PluginRepository;
PluginRepository would do something like a
NutchConfiguration.create(), which is probably wrong.

So, all in all, I've come to believe that my (and Nicolás') patch is a
not-so-bad way of fixing this. It allows us to pass JobConf to
PluginRepository and stops creating new PluginRepository instances again
and again...

What do you think?



--
Doğacan Güney

Re: Plugins initialized all the time!

Nicolás Lichtmaier

> Actually thinking a bit further into this, I kind of agree with you. I
> initially thought that the best approach would be to change
> PluginRepository.get(Configuration) to PluginRepository.get() where
> get() just creates a configuration internally and initializes itself
> with it. But then we wouldn't be passing JobConf to PluginRepository
> but PluginRepository would do something like a
> NutchConfiguration.create(), which is probably wrong.
>
> So, all in all, I've come to believe that my (and Nicolas') patch is a
> not-so-bad way of fixing this. It allows us to pass JobConf to
> PluginRepository and stops creating new PluginRepository-s again and
> again...
>
> What do you think?

IMO a better way would be to add a proper equals() method to Hadoop's
Configuration object (and hashCode) that would call
getProps().equals(o.getProps()), so that you could use them as keys...
Every class which is a map from keys to values has equals and hashCode
(Properties, HashMap, etc.).

Another nice thing would be to be able to "freeze" a configuration
object, preventing anyone from modifying it.
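
A sketch of the equals/hashCode idea above, with a hypothetical stand-in class (this is not Hadoop's Configuration): both methods delegate to the underlying Properties, so two configurations with identical contents act as the same map key.

```java
import java.util.Properties;

// Hypothetical stand-in for Hadoop's Configuration (not the real class):
// equals()/hashCode() delegate to the backing Properties, so instances
// with the same contents compare equal and can be used as map keys.
public class ConfWithEquals {
    private final Properties props = new Properties();

    public void set(String key, String value) {
        props.setProperty(key, value);
    }

    Properties getProps() {
        return props;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof ConfWithEquals)) {
            return false;
        }
        return getProps().equals(((ConfWithEquals) o).getProps());
    }

    @Override
    public int hashCode() {
        return getProps().hashCode();
    }
}
```
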


Re: Plugins initialized all the time!

Doğacan Güney-3
On 5/31/07, Nicolás Lichtmaier <[hidden email]> wrote:

>
> > Actually thinking a bit further into this, I kind of agree with you. I
> > initially thought that the best approach would be to change
> > PluginRepository.get(Configuration) to PluginRepository.get() where
> > get() just creates a configuration internally and initializes itself
> > with it. But then we wouldn't be passing JobConf to PluginRepository
> > but PluginRepository would do something like a
> > NutchConfiguration.create(), which is probably wrong.
> >
> > So, all in all, I've come to believe that my (and Nicolas') patch is a
> > not-so-bad way of fixing this. It allows us to pass JobConf to
> > PluginRepository and stops creating new PluginRepository-s again and
> > again...
> >
> > What do you think?
>
> IMO a better way would be to add a proper equals() method to Hadoop's
> Configuration object (and hashCode) that would call
> getProps().equals(o.getProps()), so that you could use them as keys...
> Every class which is a map from keys to values has equals and hashCode
> (Properties, HashMap, etc.).
>
> Another nice thing would be to be able to "freeze" a configuration
> object, preventing anyone from modifying it.
>
>

I found that there is already an issue for this problem - NUTCH-356. I
will update it with the most recent discussions.

--
Doğacan Güney

Re: Plugins initialized all the time!

Briggs
Well, you could always 'freeze' it: just create a decorator for it. So,
create a new Configuration (call it ImmutableConfiguration), store the
original configuration object in it, and delegate the methods
appropriately. Wouldn't that work?
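
A minimal sketch of that decorator idea, with illustrative names (this is not Hadoop's API): wrap an existing configuration, delegate the reads, and refuse the writes.

```java
// Hypothetical sketch of the decorator idea above (names are illustrative,
// not Hadoop's API): wrap an existing configuration, delegate the reads,
// and refuse the writes, so the wrapped object is effectively frozen.

// Minimal stand-in for the configuration interface being decorated.
interface Config {
    String get(String key);
    void set(String key, String value);
}

public class ImmutableConfig implements Config {
    private final Config delegate;

    public ImmutableConfig(Config delegate) {
        this.delegate = delegate;
    }

    @Override
    public String get(String key) {
        return delegate.get(key);  // reads pass through to the wrapped config
    }

    @Override
    public void set(String key, String value) {
        // writes are refused: the configuration is frozen
        throw new UnsupportedOperationException("configuration is frozen");
    }
}
```
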







--
"Conscious decisions by conscious minds are what make reality real"

Re: Plugins initialized all the time!

Briggs
I should have used the word "encapsulate" instead of "store".  :-)



--
"Conscious decisions by conscious minds are what make reality real"