Possible memory leak?

Enrico Triolo-2
Hi all, in my application I often need to perform the inject ->
generate -> .. -> index loop multiple times, since users can 'suggest'
new web pages to be crawled and indexed.
I also need to enable the language identifier plugin.

Everything seems to work correctly, but after some time I get an
OutOfMemoryError. The elapsed time doesn't actually matter: I
noticed that the problem arises when users submit many URLs
(~100). As I said, a new loop is performed for each submitted URL
(similar to the one in the Crawl.main method).

Using a profiler (specifically, netbeans profiler) I found out that
for each submitted url a new LanguageIdentifier instance is created,
and never released. With the memory inspector tool I can see as many
instances of LanguageIdentifier and NGramProfile$NGramEntry as the
number of fetched pages, each of them occupying about 180 KB. Forcing
garbage collection doesn't release much memory.

LanguageIdentifier has a static class variable 'identifier' that is
never used; reading through the code it seems that the original idea
was to implement a singleton pattern.
So, to limit memory usage, I implemented a static getInstance method
and modified the LanguageIndexingFilter class to make it use the
singleton.
Since I was still getting some strange results from the profiler, I
added a println call to the getInstance method to check when the
singleton is actually created. It turns out that the singleton is
re-instantiated each time!
I can't really understand why this is happening; maybe it is something
related to Hadoop internals?
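
A minimal sketch of the getInstance approach described above, with a creation counter and a println to make instantiation visible. IdentifierSingleton and SingletonDemo are hypothetical stand-ins, not the actual Nutch classes, and the real LanguageIdentifier constructor does far more work.

```java
// Hypothetical stand-in for the expensive identifier; NOT the Nutch class.
final class IdentifierSingleton {
    // Counts constructor calls so repeated instantiation is easy to spot.
    static int created = 0;
    private static IdentifierSingleton instance;

    private IdentifierSingleton() {
        created++;
        System.out.println("IdentifierSingleton created, count=" + created);
    }

    // Lazily creates the single instance; synchronized so that
    // concurrent callers all receive the same object.
    static synchronized IdentifierSingleton getInstance() {
        if (instance == null) {
            instance = new IdentifierSingleton();
        }
        return instance;
    }
}

class SingletonDemo {
    public static void main(String[] args) {
        IdentifierSingleton a = IdentifierSingleton.getInstance();
        IdentifierSingleton b = IdentifierSingleton.getInstance();
        System.out.println(a == b); // prints true: both calls share one instance
    }
}
```

Within one class loader the constructor message appears only once. A static field is unique per class loader, though, so a class that is loaded again by a fresh class loader gets a fresh "singleton". Whether that is what happens here is an assumption, not something verified in this thread.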

Cheers,
Enrico

Re: Possible memory leak?

Andrzej Białecki-2
Enrico Triolo wrote:
> Using a profiler (specifically, netbeans profiler) I found out that
> for each submitted url a new LanguageIdentifier instance is created,
> and never released. With the memory inspector tool I can see as many
> instances of LanguageIdentifier and NGramProfile$NGramEntry as the
> number of fetched pages, each of them occupying about 180kb. Forcing
> garbage collection doesn't release much memory.

Yes, this looks like a bug. A single instance of LanguageIdentifier per
task should be cached in the job "context" (i.e. Configuration
instance), to avoid too many instantiations.
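
A rough sketch of that caching idea, assuming a configuration object that can hold arbitrary objects under string keys. ObjectConf and CachedIdentifier are hypothetical stand-ins written for illustration; they are not the Nutch or Hadoop classes, and the key name is an assumption as well.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, minimal stand-in for a configuration that stores objects.
class ObjectConf {
    private final Map<String, Object> objects = new HashMap<>();

    synchronized Object getObject(String key) {
        return objects.get(key);
    }

    synchronized void setObject(String key, Object value) {
        objects.put(key, value);
    }
}

// Hypothetical stand-in for the expensive identifier.
class CachedIdentifier {
    static int created = 0; // counts real constructions
    private static final String KEY = CachedIdentifier.class.getName();

    private CachedIdentifier() {
        created++;
    }

    // Returns the instance cached in this configuration, creating and
    // storing it on first use. Locking on the configuration keeps two
    // threads from both creating an instance for the same conf.
    static CachedIdentifier get(ObjectConf conf) {
        synchronized (conf) {
            CachedIdentifier id = (CachedIdentifier) conf.getObject(KEY);
            if (id == null) {
                id = new CachedIdentifier();
                conf.setObject(KEY, id);
            }
            return id;
        }
    }
}

class ConfCacheDemo {
    public static void main(String[] args) {
        ObjectConf conf = new ObjectConf();
        CachedIdentifier first = CachedIdentifier.get(conf);
        CachedIdentifier second = CachedIdentifier.get(conf);
        System.out.println(first == second); // prints true: one instance per conf
    }
}
```

With this shape the lifetime of the instance follows the configuration: a new configuration gets its own instance, and dropping the configuration makes the cached object collectable.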


> Since I was still having some strange results with the profiler, I
> added a println message in the getInstance method, to monitor
> effectively singleton creation. It turns out that the singleton is
> re-instantiated each time!
> I can't really understand why this is happening, maybe it is something
> related to hadoop internals?

I remember running into a similar situation, where instance variables
were not initialized after the object was created with
Class.newInstance(). VM bug? Not sure... I didn't track it down at the
time; I simply moved the variable initialization to setConf(), which
solved my problem.
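
The workaround can be sketched roughly as follows. It does not reproduce the problem; it only shows state being set up in setConf() instead of in field initializers. SimpleConfigurable, ProfileFilter and the "ngram.min.length" key are hypothetical names chosen for illustration, and a plain Map stands in for the real configuration class.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a "configurable" contract.
interface SimpleConfigurable {
    void setConf(Map<String, String> conf);
    Map<String, String> getConf();
}

class ProfileFilter implements SimpleConfigurable {
    private Map<String, String> conf;
    // Deliberately no field initializers: everything is set in setConf().
    private Map<String, Integer> counts;
    private int minLength;

    @Override
    public void setConf(Map<String, String> conf) {
        this.conf = conf;
        this.counts = new HashMap<>();
        // Hypothetical key; falls back to 1 when the key is absent.
        this.minLength = Integer.parseInt(conf.getOrDefault("ngram.min.length", "1"));
    }

    @Override
    public Map<String, String> getConf() {
        return conf;
    }

    int getMinLength() {
        return minLength;
    }

    boolean isReady() {
        return counts != null;
    }
}

class SetConfDemo {
    public static void main(String[] args) throws Exception {
        // Reflective creation, as a plugin framework might do it.
        // Class.newInstance() is deprecated in current Java, so the
        // constructor is looked up explicitly here.
        ProfileFilter filter = ProfileFilter.class.getDeclaredConstructor().newInstance();
        Map<String, String> conf = new HashMap<>();
        conf.put("ngram.min.length", "3");
        filter.setConf(conf);
        System.out.println(filter.getMinLength()); // prints 3
    }
}
```

The caller has to invoke setConf() before using the object, which matches how configurable components are normally created and then configured.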

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Possible memory leak?

Jérôme Charron
It seems to be a side effect of NUTCH-169 (remove static NutchConf).
Prior to this, the language identifier was a singleton.
I think we should cache its instance in the conf, as we do for many other
objects in Nutch.
Enrico, could you please create a JIRA issue?

Thanks

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/

Re: Possible memory leak?

Enrico Triolo-2
Sure!


Re: Possible memory leak?

Enrico Triolo-2
I'm trying to fix this bug, so I looked at some source code to see how
other objects are cached in the configuration.
I see, for example, in CommonGrams.java that a Hashtable is put into
the configuration using the setObject() method. Could I use the same
method? Can I put arbitrary objects in the configuration, or must they
implement/extend some interface/class (maybe Serializable)?

Enrico


Re: Possible memory leak?

Sami Siren-2
You do not need to implement any special interface; any object will do.
--
 Sami Siren
