Doğacan Güney updated NUTCH-356:
Patch against latest trunk. I believe that this patch should fix the problem.
* WeakHashMap's javadoc warns that if a value holds a strong reference to its own key, the key can never be discarded, so the garbage collector cannot reclaim the entry. Since the configuration object used as the WeakHashMap key is also used to instantiate PluginRepository (and is then stored inside it), PluginRepository never got finalized. This patch changes PluginRepository to store a copy of the configuration instead of the original. This way, PluginRepository's finalizer works.
* I added a static <String, <PluginClassLoader, Class>> map to PluginRepository. It caches loaded classes so that we do not load the same classes over and over again.
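To illustrate the first point, here is a minimal sketch of the WeakHashMap behaviour. Conf and Repo are hypothetical stand-ins, not the real Configuration and PluginRepository classes, and the actual patch may differ in detail. The sketch only shows why storing the original key inside the cached value pins the entry, and why storing a copy does not.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.WeakHashMap;

/** Hypothetical stand-ins for illustration; these are not Nutch or Hadoop classes. */
final class WeakKeySketch {

    /** Stand-in for a configuration object (identity-based equals/hashCode). */
    static final class Conf {
        final Map<String, String> props = new HashMap<>();

        Conf() {}

        /** Copy constructor: same properties, different object identity. */
        Conf(Conf other) {
            props.putAll(other.props);
        }
    }

    /** Stand-in for a repository that remembers the configuration it was built from. */
    static final class Repo {
        final Conf conf;

        Repo(Conf conf) {
            this.conf = conf;
        }
    }

    /**
     * Caches one Repo under a weak Conf key, drops the caller's reference to the
     * key, and reports whether the entry disappeared after a few GC requests.
     */
    static boolean entryCleared(boolean storeCopy) {
        Map<Conf, Repo> cache = new WeakHashMap<>();
        Conf key = new Conf();
        // storeCopy == false: the value strongly references its own key.
        // storeCopy == true:  the value references a copy, so the key stays weak.
        cache.put(key, new Repo(storeCopy ? new Conf(key) : key));
        key = null; // from here on, only the cache can reach the key

        for (int i = 0; i < 10 && !cache.isEmpty(); i++) {
            System.gc(); // only a request; the JVM is free to ignore it
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        return cache.isEmpty();
    }

    public static void main(String[] args) {
        System.out.println("value references key:  cleared=" + entryCleared(false));
        System.out.println("value references copy: cleared=" + entryCleared(true));
    }
}
```

The first call returns false on any JVM, because the entry's value keeps the key strongly reachable. The second call typically returns true on HotSpot, but System.gc() is only a hint, so that result is not guaranteed.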
I think these changes will plug all holes in PluginRepository. I ran a small test and it seems we don't leak anymore (or the leak is much smaller compared to what it used to be). Reviews, comments, suggestions are welcome. I also could use some help with testing it :).
PS: Enzo, thanks for the class unloading link. It certainly helped.
PPS: We can still add equals and hashCode methods to Configuration (assuming the Hadoop guys are OK with it). But, as I said before, I am not fond of that approach. One of the main reasons is that it doesn't fix the problem of leaking PluginRepository objects; it only works around it (by not creating many PluginRepository objects). After this patch, the suggested change to Configuration becomes an optimization instead of a fix. And if this optimization is worthwhile, we can always add it later on.
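For comparison, a rough sketch of what the equals/hashCode idea would buy us. IdentityConf and ValueConf are hypothetical classes made up for this example (Hadoop's Configuration is not used here), and the property value is arbitrary. With identity-based keys, every new configuration object causes a new repository to be built; with value-based keys, equal configurations share one.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.WeakHashMap;

/** Hypothetical classes for illustration; not the Hadoop Configuration API. */
final class ConfEqualitySketch {

    /** Identity-based key: inherits equals/hashCode from Object. */
    static class IdentityConf {
        final Map<String, String> props = new HashMap<>();

        IdentityConf set(String name, String value) {
            props.put(name, value);
            return this;
        }
    }

    /**
     * Value-based key: two instances with the same properties are equal.
     * Caveat: mutating a key after it is in the map would break lookups.
     */
    static final class ValueConf extends IdentityConf {
        @Override
        ValueConf set(String name, String value) {
            super.set(name, value);
            return this;
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof ValueConf && props.equals(((ValueConf) o).props);
        }

        @Override
        public int hashCode() {
            return props.hashCode();
        }
    }

    /** Looks up both configurations in a fresh cache and counts repositories built. */
    static int repositoriesBuilt(IdentityConf first, IdentityConf second) {
        Map<IdentityConf, Object> cache = new WeakHashMap<>();
        int[] built = {0};
        for (IdentityConf conf : new IdentityConf[] {first, second}) {
            cache.computeIfAbsent(conf, c -> {
                built[0]++;
                return new Object(); // placeholder for an expensive repository
            });
        }
        return built[0];
    }

    public static void main(String[] args) {
        // Two separate objects with identical content.
        System.out.println(repositoriesBuilt(
                new IdentityConf().set("plugin.includes", "a"),
                new IdentityConf().set("plugin.includes", "a"))); // prints 2
        System.out.println(repositoriesBuilt(
                new ValueConf().set("plugin.includes", "a"),
                new ValueConf().set("plugin.includes", "a"))); // prints 1
    }
}
```

Note that this only reduces how many repositories are created. Each cached value can still pin its key if it holds a strong reference to it, which is why it is an optimization rather than a fix.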
> Plugin repository cache can lead to memory leak
> Key: NUTCH-356
> URL: https://issues.apache.org/jira/browse/NUTCH-356
> Project: Nutch
> Issue Type: Bug
> Affects Versions: 0.8
> Reporter: Enrico Triolo
> Attachments: cache_classes.patch, NutchTest.java, patch.txt
> While I was trying to solve a problem I reported a while ago (see Nutch-314), I found out that actually the problem was related to the plugin cache used in class PluginRepository.java.
> As I said in Nutch-314, I think I somehow 'force' the way nutch is meant to work, since I need to frequently submit new urls and append their contents to the index; I don't (and I can't) have a urls.txt file with all the urls I'm going to fetch, but I recreate it each time a new url is submitted.
> Thus, I think in most cases you won't have problems using nutch as-is, since the problem I found occurs only if nutch is used in a way similar to the one I use.
> To simplify your test I'm attaching a class that performs something similar to what I need. It fetches and indexes some sample urls; to avoid webmasters' complaints I left the sample urls list empty, so you should modify the source code and add some urls.
> Then you only have to run it and watch your memory consumption with top. In my experience I get an OutOfMemoryError after a couple of minutes, but it clearly depends on your heap settings and on the plugins you are using (I'm using 'protocol-file|protocol-http|parse-(rss|html|msword|pdf|text)|language-identifier|index-(basic|more)|query-(basic|more|site|url)|urlfilter-regex|summary-basic|scoring-opic').
> The problem is bound to the PluginRepository 'singleton' instance, since it never gets released. It seems that some class maintains a reference to it and this class is never released since it is cached somewhere in the configuration.
> So I modified the PluginRepository's 'get' method so that it never uses the cache and always returns a new instance (you can find the patch in attachment). This way the memory consumption is always stable and I get no OOM anymore.
> Clearly this is not the solution, since I guess there are many performance issues involved, but for the moment it works.