Search

Friday, August 12, 2016

appagent.bin + unable to fork new process

The problem:
We had intermittent disconnections from node a (we have a 3-node cluster). I was told the error being raised was “unable to fork new process”.
I did have a look on Metalink, but I could not find any entry related to appagent.bin.


The solution:

 I do not recall Oracle providing us any solution to this problem.
We fixed it by killing the process (appagent.bin) after agreeing downtime with business (in case Database or cluster crashed).
This was the workaround we used, as described on the post. If you decide to follow up, please, arrange downtime and proceed with caution, I would also recommend to check you have an up-to-date backup.

We did have the same problem some time ago and we fixing by stopping the server. This type of fix is not our prefer option as we have 3 databases running on the same clusterware.

To identify the process with more threads we used the following command (Thanks to Coskan (Josh) Gundogar):

[oracle@ssssa trace]$ ps -eLf | grep -i appagent | wc -l
30793

we compared this number with node 2:
[oracle@ssssb trace]$  ps -eLf | grep appagent | wc -l
24

and it proved something was not right with the process.
Something interesting about this process on the first node was the date:
[oracle@ssssa trace]$ ps -eLf | grep appagent
oracle   14768     1  9159  0 30792 Oct15 ?       00:00:00 /u01/crs/oracle/product/11.2.0/grid/bin/appagent.bin
oracle   14768     1  9223  0 30792 Oct15 ?       00:00:00 /u01/crs/oracle/product/11.2.0/grid/bin/appagent.bin
oracle   14768     1  9317  0 30792 Oct15 ?       00:00:00 /u01/crs/oracle/product/11.2.0/grid/bin/appagent.bin

so all of them (30793) were 3 days old.

I found some errors on the alert.log file regarding this process:
[oracle@ssssssa]$ cd /u01/crs/oracle/product/11.2.0/grid/log/sssssssa
[oracle@sssssssa]$ tail -100f alertssssssssa.log

[client(17908)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /u01/crs/oracle/product/11.2.0/grid/log/ssssssa/client/css.log.
2010-10-18 14:16:25.223
[/u01/crs/oracle/product/11.2.0/grid/bin/oraagent.bin(9413)]CRS-5013:Agent "/u01/crs/oracle/product/11.2.0/grid/bin/oraagent.bin" failed to start process "/u01/crs/oracle/product/11.2.0/grid/bin/lsnrctl" for action "check": details at "(:CLSN00008:)" in "/u01/crs/oracle/product/11.2.0/grid/log/ssssssa/agent/crsd/oraagent_oracle/oraagent_oracle.log"
2010-10-18 14:16:37.512
[client(10936)]CRS-2313:Fatal error: failed to initialize GIPC communication layer. GIPC error: gipcretInitialization (2).
2010-10-18 14:16:37.512
[client(10936)]CRS-2316:Fatal error: cannot initialize GPnP, CLSGPNP_ERR (Generic GPnP error).
2010-10-18 14:16:37.539
[client(10936)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /u01/crs/oracle/product/11.2.0/grid/log/ssssssa/client/css.log.
2010-10-18 14:16:38.920
[/u01/crs/oracle/product/11.2.0/grid/bin/oraagent.bin(9413)]CRS-5013:Agent "/u01/crs/oracle/product/11.2.0/grid/bin/oraagent.bin" failed to start process "/u01/crs/oracle/product/11.2.0/grid/opmn/bin/onsctli" for action "check": details at "(:CLSN00008:)" in "/u01/crs/oracle/product/11.2.0/grid/log/ssssssa/agent/crsd/oraagent_oracle/oraagent_oracle.log"
2010-10-18 14:16:53.100
[/u01/crs/oracle/product/11.2.0/grid/bin/oraagent.bin(9413)]CRS-5013:Agent "/u01/crs/oracle/product/11.2.0/grid/bin/oraagent.bin" failed to start process "/u01/crs/oracle/product/11.2.0/grid/bin/lsnrctl" for action "check": details at "(:CLSN00008:)" in "/u01/crs/oracle/product/11.2.0/grid/log/ssssssa/agent/crsd/oraagent_oracle/oraagent_oracle.log"
2010-10-18 14:17:25.227
[/u01/crs/oracle/product/11.2.0/grid/bin/oraagent.bin(9413)]CRS-5013:Agent "/u01/crs/oracle/product/11.2.0/grid/bin/oraagent.bin" failed to start process "/u01/crs/oracle/product/11.2.0/grid/bin/lsnrctl" for action "check": details at "(:CLSN00008:)" in "/u01/crs/oracle/product/11.2.0/grid/log/sssssa/agent/crsd/oraagent_oracle/oraagent_oracle.log"
2010-10-18 14:17:38.921
[/u01/crs/oracle/product/11.2.0/grid/bin/oraagent.bin(9413)]CRS-5013:Agent "/u01/crs/oracle/product/11.2.0/grid/bin/oraagent.bin" failed to start process "/u01/crs/oracle/product/11.2.0/grid/opmn/bin/onsctli" for action "check": details at "(:CLSN00008:)" in "/u01/crs/oracle/product/11.2.0/grid/log/sssssa/agent/crsd/oraagent_oracle/oraagent_oracle.log"


and I was told one of the filesystem filled on Friday October 15th.

As we had an agreement with all 3 businesses to have a downtime I decided to kill the process:
[oracle@ssssssa trace]$ kill -9 14768

and it fixed the problem:
[oracle@ssssa trace]$ ps -eLf | grep appagent | wc -l
14

[oracle@ssssssa trace]$ ps -eLf | grep appagent
oracle   18655     1 18655  0   13 15:17 ?        00:00:00 /u01/crs/oracle/product/11.2.0/grid/bin/appagent.bin
oracle   18655     1 18657  0   13 15:17 ?        00:00:00 /u01/crs/oracle/product/11.2.0/grid/bin/appagent.bin
oracle   18655     1 18658  0   13 15:17 ?        00:00:00 /u01/crs/oracle/product/11.2.0/grid/bin/appagent.bin
oracle   18655     1 18659  0   13 15:17 ?        00:00:00 /u01/crs/oracle/product/11.2.0/grid/bin/appagent.bin
oracle   18655     1 18660  0   13 15:17 ?        00:00:00 /u01/crs/oracle/product/11.2.0/grid/bin/appagent.bin
oracle   18655     1 18661  0   13 15:17 ?        00:00:00 /u01/crs/oracle/product/11.2.0/grid/bin/appagent.bin
oracle   18655     1 18662  0   13 15:17 ?        00:00:00 /u01/crs/oracle/product/11.2.0/grid/bin/appagent.bin
oracle   18655     1 18663  0   13 15:17 ?        00:00:00 /u01/crs/oracle/product/11.2.0/grid/bin/appagent.bin
oracle   18655     1 18664  0   13 15:17 ?        00:00:00 /u01/crs/oracle/product/11.2.0/grid/bin/appagent.bin
oracle   18655     1 18665  0   13 15:17 ?        00:00:00 /u01/crs/oracle/product/11.2.0/grid/bin/appagent.bin
oracle   18655     1 18670  0   13 15:17 ?        00:00:00 /u01/crs/oracle/product/11.2.0/grid/bin/appagent.bin
oracle   18655     1 18673  0   13 15:17 ?        00:00:00 /u01/crs/oracle/product/11.2.0/grid/bin/appagent.bin
oracle   18655     1 18694  0   13 15:17 ?        00:00:00 /u01/crs/oracle/product/11.2.0/grid/bin/appagent.bin
oracle   19230 22249 19230  0    1 15:18 pts/4    00:00:00 grep appagent


Conclusions:
I am not sure if the problem with filesystems getting full is related to this issue or “The OCR location in an ASM disk group is inaccessible” logged on the alert.log.
For the above, we have an SR with Oracle.