Issue with > 200% CPU after bulk usage

hans.meijer

Hi,

I have run into an issue with Tika running locally on a single box: after a bulk load of more than 3 million documents over a couple of days, the Java process climbs to over 200% CPU. Memory consumption does not seem to be the problem.

I had 3 processes running against the server, processing various documents. The server stalled, with the Java process at over 200% CPU, and it recovered only after I restarted the Tika server.

Are there any known issues where Tika stalls at over 200% CPU and does not resume processing? If so, are there startup settings (Java heap, etc.) that could be adjusted?

I could not find specific logs to attach, but if any would be interesting to see, let me know.

Details:

Tika version is 1.4. I enclose the XML configuration file.

It is running on a Debian system (stretch), single node:

Linux 4.9.0-9-amd64 #1 SMP Debian 4.9.168-1+deb9u5 (2019-08-11) x86_64 GNU/Linux
Distributor ID: Debian
Description:    Debian GNU/Linux 9.9 (stretch)
Release:        9.9
Codename:       stretch

MemTotal:        4032120 kB

Kind regards
Hans


Attachment: tika-deb-config.xml (7K)

Re: Issue with > 200% CPU after bulk usage

Tim Allison
Hi Hans,
You inspired me to document my thoughts on this:
https://cwiki.apache.org/confluence/display/TIKA/The+Robustness+of+Apache+Tika
Please let us know if you have any questions.

Best,
Tim


Re: Issue with > 200% CPU after bulk usage

Tim Allison
In short, are you running tika-server in --spawnChild mode? In that mode you
can set the maximum number of files to process before the child process is
restarted, which prevents slow-building memory leaks, and the server will
also restart the child if one of its threads hits an infinite loop.
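
As a rough illustration of the restart policy described here (not Tika's actual implementation), the decision a parent process makes about its child can be sketched in Python. The names `ChildPolicy`, `max_files`, and `task_timeout_s` are hypothetical and exist only in this sketch; the tika-server documentation is the place to look for the real options.

```python
from dataclasses import dataclass


@dataclass
class ChildPolicy:
    """Hypothetical restart policy for a supervised child process."""
    max_files: int          # restart after this many files (guards against slow leaks)
    task_timeout_s: float   # restart if a single task runs longer than this

    def should_restart(self, files_processed: int, current_task_elapsed_s: float) -> bool:
        # Periodic restart: bounds the effect of a slowly building memory leak.
        if files_processed >= self.max_files:
            return True
        # Stuck task: a thread in an infinite loop never finishes on its own,
        # so the whole child is replaced instead.
        return current_task_elapsed_s > self.task_timeout_s


policy = ChildPolicy(max_files=100_000, task_timeout_s=120.0)
print(policy.should_restart(files_processed=10, current_task_elapsed_s=3.0))
print(policy.should_restart(files_processed=100_000, current_task_elapsed_s=3.0))
print(policy.should_restart(files_processed=10, current_task_elapsed_s=600.0))
```

The two conditions are independent on purpose: the file count handles gradual degradation, while the per-task timeout handles a single document that never returns.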


Re: Issue with > 200% CPU after bulk usage

Nick Burch-2
On Wed, 15 Apr 2020, [hidden email] wrote:
> I have encountered an issue with Tika running locally on a box that the
> Java runtime goes up to over 200% CPU, after running a bulk load of
> documents over a couple of days, it is more than 3 million documents.

Can you do a thread dump to show what the JVM is doing?
https://access.redhat.com/solutions/18178
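
One way to narrow a thread dump down to the busy thread, sketched in Python under some assumptions: on Linux, `top -H -p <pid>` lists per-thread CPU with decimal thread IDs, and a HotSpot thread dump labels each thread with the same ID in hexadecimal as `nid=0x...`. The helper name `find_thread` and the sample dump text are made up for illustration.

```python
import re
from typing import List, Optional

# Matches the header line of one thread entry in a HotSpot thread dump, e.g.
# "worker-1" #25 prio=5 os_prio=0 tid=0x... nid=0x3a2 runnable [0x...]
HEADER = re.compile(r'^"(?P<name>[^"]+)".*\bnid=(?P<nid>0x[0-9a-fA-F]+)')


def find_thread(dump_text: str, linux_tid: int) -> Optional[str]:
    """Return the dump entry (header plus stack) for a Linux thread ID."""
    wanted = hex(linux_tid)  # top -H prints decimal, the dump uses hex
    entry: List[str] = []
    capturing = False
    for line in dump_text.splitlines():
        match = HEADER.match(line)
        if match:
            if capturing:
                break  # reached the next thread's header
            capturing = match.group("nid").lower() == wanted
        if capturing:
            if not line.strip():
                break  # a blank line ends the entry
            entry.append(line)
    return "\n".join(entry) if entry else None


# Invented sample in the general shape of a Java 8 thread dump.
SAMPLE_DUMP = '''\
"main" #1 prio=5 os_prio=0 tid=0x00007f0000000001 nid=0x38f waiting on condition [0x0]
   java.lang.Thread.State: WAITING

"worker-1" #25 prio=5 os_prio=0 tid=0x00007f0000000002 nid=0x3a2 runnable [0x0]
   java.lang.Thread.State: RUNNABLE
        at example.Parser.loop(Parser.java:42)

"worker-2" #26 prio=5 os_prio=0 tid=0x00007f0000000003 nid=0x3a3 waiting on condition [0x0]
   java.lang.Thread.State: TIMED_WAITING
'''

# Thread ID 930 as it would appear in top -H; 930 is 0x3a2 in hex.
print(find_thread(SAMPLE_DUMP, 930))
```

Taking a few dumps some seconds apart and checking whether the same thread sits in the same stack frames each time helps distinguish a genuine infinite loop from a thread that is merely busy.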

Nick

Re: Issue with > 200% CPU after bulk usage

Eric Pugh-4
Does anyone have a good example of combining Tika with some sort of pool of Docker containers? I think a lot of folks treat their Tika server like a pet, not like a cow: https://cloudscaling.com/blog/cloud-computing/the-history-of-pets-vs-cattle/

I wonder if we could ship some “recipes” that describe how to deploy a pool of Tika servers. If a Tika instance has been running at over 200% CPU for an hour, kill it and start the next one.
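
The kill rule suggested above can be sketched as a small Python class that is fed periodic CPU samples. This is only an outline of the decision logic: how the samples are collected and how an instance is replaced (Docker, systemd, a cloud API) is left out, and `CpuWatchdog` with its parameters is a hypothetical name, not part of Tika.

```python
from typing import Optional


class CpuWatchdog:
    """Flags a process whose CPU stays above a threshold for too long."""

    def __init__(self, threshold_pct: float = 200.0, max_hot_seconds: float = 3600.0):
        self.threshold_pct = threshold_pct
        self.max_hot_seconds = max_hot_seconds
        self._hot_since: Optional[float] = None  # when the current hot streak began

    def observe(self, cpu_pct: float, now: float) -> bool:
        """Record one sample; return True when the process should be replaced."""
        if cpu_pct <= self.threshold_pct:
            self._hot_since = None  # streak broken, start over
            return False
        if self._hot_since is None:
            self._hot_since = now
        return now - self._hot_since >= self.max_hot_seconds


watchdog = CpuWatchdog()
# Simulated samples: (seconds since start, %CPU of the Java process).
samples = [(0, 35.0), (600, 204.3), (1200, 210.0), (1800, 40.0),
           (2400, 205.0), (3000, 206.0), (4200, 204.0), (6000, 203.0)]
for t, cpu in samples:
    if watchdog.observe(cpu, now=t):
        print(f"t={t}s: sustained high CPU, kill this instance and start the next")
```

Requiring an unbroken streak, rather than reacting to a single sample, keeps a legitimately heavy document from triggering a restart.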



_______________________
Eric Pugh | Founder & CEO | OpenSource Connections, LLC | 434.466.1467 | http://www.opensourceconnections.com | My Free/Busy <http://tinyurl.com/eric-cal>
Co-Author: Apache Solr Enterprise Search Server, 3rd Ed <https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-enterprise-search-server-third-edition-raw>
This e-mail and all contents, including attachments, is considered to be Company Confidential unless explicitly stated otherwise, regardless of whether attachments are marked as such.


Re: [EXTERNAL] Re: Issue with > 200% CPU after bulk usage

Chris Mattmann
Yes, some of us have been developing an Elastic scaling stack for Tika server that does just that with AWS. We don’t have it ready to push upstream yet.

Cheers,
Chris


Re: [EXTERNAL] Re: Issue with > 200% CPU after bulk usage

Tim Allison
I very much like Eric's idea of recipes, and possibly code, because the
capabilities available differ across the various cloud providers.


Re: Issue with > 200% CPU after bulk usage

hans.meijer
Thanks Nick,
I will do that. Unfortunately, since Tika occupied all the CPU and stopped
all other processing, I had to restart it. That solved the issue, but I
restarted everything, and I imagine it will happen again.

When it does, I will take a thread dump.
I will also investigate whether a specific document caused it. I was
running three processes against the server, and occasionally other
processes (up to 4 more) could also start loading the tika-server.
I was running it locally, by downloading tika-server version 1.4 and
starting it as a process.

Kind regards
Hans


Re: Issue with > 200% CPU after bulk usage

hans.meijer
Hi,
I ran into the issue again, with Tika/Java taking more and more CPU, up to 200+%.

The scenario is that I have 3-4 long-running processes calling the Tika server
(version 1.24), and occasionally 3-4 additional shorter processes (2-3 hours
each) start up and call the Tika server as well.
The scenario runs for a couple of days, extracting text from various types of
documents.

The Tika server is running locally.

top shows this:

--------------------------------------------------------------------------------------------------
top - 16:21:17 up 5 days,  8:12,  6 users,  load average: 2,64, 2,63, 2,61
Tasks: 145 total,   1 running, 144 sleeping,   0 stopped,   0 zombie
%Cpu(s): 50,8 us,  0,3 sy,  0,0 ni, 48,8 id,  0,0 wa,  0,0 hi,  0,0 si,  0,0 st
KiB Mem :  4032128 total,   129052 free,  2702236 used,  1200840 buff/cache
KiB Swap:  4192252 total,  2968864 free,  1223388 used.  1040340 avail Mem

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
   911 root      20   0 4578604 1,229g   8024 S 204,3 32,0 859:11.22 java
   743 root      20   0  196596   5772    920 S   0,7  0,1  35:28.02 wizit_rest
 34637 elastic+  20   0 21,346g 883808  30616 S   0,3 21,9   1250:04 java
     1 root      20   0  204620   3440   2376 S   0,0  0,1   0:14.99 systemd
     2 root      20   0       0      0      0 S   0,0  0,0   0:00.15 kthreadd
     3 root      20   0       0      0      0 S   0,0  0,0   1:46.20 ksoftirqd+
     5 root       0 -20       0      0      0 S   0,0  0,0   0:00.00 kworker/0+
     7 root      20   0       0      0      0 S   0,0  0,0   4:59.14 rcu_sched
     8 root      20   0       0      0      0 S   0,0  0,0   0:00.00 rcu_bh
     9 root      rt   0       0      0      0 S   0,0  0,0   0:03.83 migration+
--------------------------------------------------------------------------------------------------


At first I ran jstackseries.sh:
--------------------------------------------------------------------------------------------------
more jstack.911.202904.163848252
Attaching to process ID 911, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 25.242-b08
Deadlock Detection:

Can't print deadlocks:Unable to deduce type of thread from address 0x00007f30bc02d800 (expected type JavaThread, CompilerThread, ServiceThread, JvmtiAgentThread, or SurrogateLockerThread)
--------------------------------------------------------------------------------------------------

It also froze the system: "systemd[1]: Freezing execution."


But I finally got a thread dump via jstack, and I attach that file. I also
attach the tika-config file in case it is useful.
I hope this helps in analyzing the issue.


Kind regards
Hans


Attachment: tika-config.xml (8K)