Database Connections - A Brief Case Study
Of late, we have been experiencing frequent database connection timeouts in our application. We were wondering how this could happen when nothing had changed. With continuous monitoring, we figured out that all the dropped connections fail between the 15th and 16th minute. Something fishy!
We could also see the error below in the database alert log whenever the issue occurred in the application.
Database alert log error:
Fatal NI connect error 12170.
VERSION INFORMATION:
TNS for Linux: Version 11.2.0.4.0 - Production
Oracle Bequeath NT Protocol Adapter for Linux: Version 11.2.0.4.0 - Production
TCP/IP NT Protocol Adapter for Linux: Version 11.2.0.4.0 - Production
Time: 24-APR-2017 14:34:00
Tracing not turned on.
Tns error struct:
ns main err code: 12535
TNS-12535: TNS:operation timed out
ns secondary err code: 12560
nt main err code: 505
TNS-00505: Operation timed out
nt secondary err code: 110
nt OS err code: 0
Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=<<Application_Server>>)(PORT=18428))
Mon Apr 24 14:34:41 2017
Analysis:
To get to the root of the issue, we tried to reproduce it.
Our application spawns database connections via JDBC. At any given time, five threads (connections) are established to the database. We found that whenever we increased the number of threads, we hit this connection timeout issue intermittently.
Our initial analysis suggested that the timeout was related to a firewall, since, as already mentioned, the connections get dropped around the 15th or 16th minute.
So we checked the firewall timeouts, but the idle timeout was set to 3 hours, and our firewall team confirmed it. Oops! Our first hope was dashed.
Next, we checked the database-level timeouts.
The parameter in sqlnet.ora responsible for keepalive (Dead Connection Detection) at the database level is SQLNET.EXPIRE_TIME. It should be set to a non-zero value; the value is specified in minutes.
Navigate to the TNS_ADMIN directory of the database server, open the sqlnet.ora file, and search for the SQLNET.EXPIRE_TIME parameter.
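For example, assuming the TNS_ADMIN environment variable is set (it typically points to $ORACLE_HOME/network/admin), a quick check looks like the sketch below; the path and the exact formatting of the entry may differ per install.
# grep -i expire_time $TNS_ADMIN/sqlnet.ora
SQLNET.EXPIRE_TIME = 10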
And yes, this is set to 10 (minutes) in our database, so Dead Connection Detection is active. The next hope was dashed too.
Now the final option was to check the TCP keepalive settings at the server level.
Here comes the hint.
How to check the keepalive settings on a Linux server?
# cat /proc/sys/net/ipv4/tcp_keepalive_time
1000
# cat /proc/sys/net/ipv4/tcp_keepalive_intvl
75
# cat /proc/sys/net/ipv4/tcp_keepalive_probes
9
Tadaaa!!
The TCP keepalive time set on the application server is 1000 seconds (about 16 minutes and 40 seconds), which is why the sessions were unable to send packets to the database after roughly 15 minutes of idleness; since the handshake is broken by then, Oracle treats the session as a dead connection, and PMON clears the processes and reports it in the alert log.
Now we were left with two options: either tune the TCP keepalive settings at the server level or keep the DB connection from the application server alive.
Since the former is considered a security breach at the server level, we went for the latter.
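For reference, the former option would have meant tuning the kernel keepalive parameters on the application server, roughly as sketched below; the values are illustrative assumptions, not settings we actually applied.
# sysctl -w net.ipv4.tcp_keepalive_time=300
# sysctl -w net.ipv4.tcp_keepalive_intvl=60
# sysctl -w net.ipv4.tcp_keepalive_probes=5
Making such a change permanent would also require the corresponding net.ipv4.* entries in /etc/sysctl.conf (reloaded with sysctl -p).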
How to keep the DB connection active?
The keepalive feature on the supported TCP transports can be enabled for a net service client by embedding (ENABLE=BROKEN) under the DESCRIPTION parameter in the connect string. Keepalive allows the caller to detect a dead remote server. Operating system TCP configurables, which vary by platform, define the actual keepalive timing details.
Sample TNS:
net_service_name=
  (DESCRIPTION=
    (ENABLE=BROKEN)
    (ADDRESS=(PROTOCOL=tcp)(HOST=sales1-svr)(PORT=1521))
    (ADDRESS=(PROTOCOL=tcp)(HOST=sales2-svr)(PORT=1521))
    (CONNECT_DATA=(SERVICE_NAME=sales.us.example.com)))
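Since our application connects via JDBC rather than through a tnsnames.ora alias, the same descriptor can also be embedded directly in the JDBC thin connect string; the host and service names below are the sample values from above, not our real ones.
jdbc:oracle:thin:@(DESCRIPTION=(ENABLE=BROKEN)(ADDRESS=(PROTOCOL=tcp)(HOST=sales1-svr)(PORT=1521))(CONNECT_DATA=(SERVICE_NAME=sales.us.example.com)))
This keeps the keepalive setting in the connect string itself, so no client-side tnsnames.ora change is needed.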