[jira] [Created] (NUTCH-2407) Memory leak causing Nutch Server to run out of memory

JIRA jira@apache.org
Vyacheslav Pascarel created NUTCH-2407:

             Summary: Memory leak causing Nutch Server to run out of memory
                 Key: NUTCH-2407
                 URL: https://issues.apache.org/jira/browse/NUTCH-2407
             Project: Nutch
          Issue Type: Bug
          Components: nutch server
    Affects Versions: 2.3.1
         Environment: Ubuntu 16.04 64-bit
                      Oracle Java 8 64-bit
                      Nutch 2.3.1 (standalone deployment)
                      MongoDB 3.4
            Reporter: Vyacheslav Pascarel

My application is trying to perform continuous crawling using Nutch REST services. The application injects a seed URL and then repeats the GENERATE/FETCH/PARSE/UPDATEDB sequence a requested number of times. Each step in the sequence is executed upon successful completion of the previous step, and then the whole sequence is repeated. Here is a brief description of the job:
* Number of GENERATE/FETCH/PARSE/UPDATEDB cycles per run: 50
* 'topN' parameter value of GENERATE step in each cycle: 10
* Seed URL: http://www.cnn.com
* Regex URL filters for all jobs:
** *"-^.\{1000,\}$"* - exclude very long URLs
** *"+."* - include the rest
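
For reference, the control flow of one run as described above can be sketched roughly as follows. This is only an illustration of the cycle logic, not the actual client code: the StepRunner interface is a hypothetical hook standing in for "submit one job to the Nutch REST service and wait for it to finish", and the sketch assumes a run stops at the first failed step.

```java
import java.util.ArrayList;
import java.util.List;

public class CrawlCycleSketch {
    // The four steps of one crawl cycle, in execution order.
    enum Step { GENERATE, FETCH, PARSE, UPDATEDB }

    /**
     * Hypothetical hook: submits one job for the given step and blocks
     * until it finishes. Returns true if the job completed successfully.
     */
    interface StepRunner {
        boolean run(Step step, int cycle);
    }

    /**
     * Runs up to 'cycles' GENERATE/FETCH/PARSE/UPDATEDB cycles. Each step
     * starts only after the previous one succeeded. Returns the number of
     * fully completed cycles (assumption: stop at the first failed step).
     */
    static int runCycles(int cycles, StepRunner runner) {
        for (int cycle = 0; cycle < cycles; cycle++) {
            for (Step step : Step.values()) {
                if (!runner.run(step, cycle)) {
                    return cycle;
                }
            }
        }
        return cycles;
    }

    public static void main(String[] args) {
        List<String> submitted = new ArrayList<>();
        // Stub runner that records each job and always reports success.
        int done = runCycles(50, (step, cycle) -> {
            submitted.add(cycle + ":" + step);
            return true;
        });
        System.out.println(done + " cycles, " + submitted.size() + " jobs");
    }
}
```

With 50 cycles per run this amounts to 200 jobs submitted to the server per run, in addition to the initial inject.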

To monitor the Nutch server I use Java VisualVM, which ships with the JDK. After each run (50 cycles of GENERATE/FETCH/PARSE/UPDATEDB) I trigger a garbage collection from VisualVM and check memory usage. My observation is that the Nutch server leaks ~25MB per run.
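
The same measurement could also be taken programmatically instead of through the VisualVM UI. The following is a minimal sketch using the standard java.lang.management API. It measures the JVM it runs in, so it assumes the code is executed inside the Nutch server process (or adapted to a remote JMX connection). The class and method names are made up for illustration, and a requested GC is only a hint to the JVM, so the readings are approximate.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;

public class HeapAfterGcSketch {
    /** Requests a garbage collection, then returns the used heap in bytes. */
    static long usedHeapAfterGc() {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        memory.gc(); // effectively the same as System.gc()
        return memory.getHeapMemoryUsage().getUsed();
    }

    /**
     * Average heap growth per run in MB, given one post-GC sample taken
     * before the first run and one after each subsequent run.
     */
    static double growthPerRunMb(long[] samplesBytes) {
        if (samplesBytes.length < 2) {
            throw new IllegalArgumentException("need at least two samples");
        }
        long growth = samplesBytes[samplesBytes.length - 1] - samplesBytes[0];
        int runs = samplesBytes.length - 1;
        return growth / (double) runs / (1024.0 * 1024.0);
    }

    public static void main(String[] args) {
        long before = usedHeapAfterGc();
        // ... one run of 50 crawl cycles would happen here ...
        long after = usedHeapAfterGc();
        System.out.println("Growth per run (MB): "
                + growthPerRunMb(new long[] { before, after }));
    }
}
```

Sampling over several runs rather than one makes it easier to tell a steady leak from one-time warm-up allocations.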

NOTES: I added custom HTTP DELETE services to clean the job history in NutchServerPoolExecutor and to remove all custom configurations from RAMConfManager after each run, so the observed ~25MB leak remains even after the job history and configurations are cleaned up.

This message was sent by Atlassian JIRA