Database Connections - A Brief Case Study
Of late, we have been experiencing frequent database connection timeouts in our application. We were wondering how this could happen when nothing had changed. With continuous monitoring, we figured out that all the dropped connections fail between the 15th and 16th minute. Something fishy!
We could also see the error below in the database alert log whenever the issue occurred in the application.
Database alert log error:
Fatal NI connect error 12170.
VERSION INFORMATION:
TNS for Linux: Version 11.2.0.4.0 - Production
Oracle Bequeath NT Protocol Adapter for Linux: Version 11.2.0.4.0 - Production
TCP/IP NT Protocol Adapter for Linux: Version 11.2.0.4.0 - Production
Time: 24-APR-2017 14:34:00
Tracing not turned on.
Tns error struct:
ns main err code: 12535
TNS-12535: TNS:operation timed out
ns secondary err code: 12560
nt main err code: 505
TNS-00505: Operation timed out
nt secondary err code: 110
nt OS err code: 0
Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=<<Application_Server>>)(PORT=18428))
Mon Apr 24 14:34:41 2017
Analysis:
To get to the root of the issue, we tried to reproduce it.
Our application spawns database connections via JDBC. At any given time, five threads (connections) are established to the database. We found that whenever we increased the number of threads, we hit this connection timeout issue intermittently.
Our initial analysis suggested that the timeout was related to a firewall, since, as already mentioned, the connections get dropped around the 15th or 16th minute.
So we checked the firewall timeouts, but the idle timeout was set to 3 hours, and our firewall team confirmed it. Oops! Our first hope was dashed.
Next, we checked the database-level timeouts.
The parameter in sqlnet.ora responsible for keepalive (Dead Connection Detection) at the database level is SQLNET.EXPIRE_TIME. It should be set to a non-zero value; the value is specified in minutes.
Navigate to the TNS_ADMIN directory of the database server, open the sqlnet.ora file, and search for the SQLNET.EXPIRE_TIME parameter.
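For example, assuming the TNS_ADMIN environment variable is set (it typically points to $ORACLE_HOME/network/admin), a quick check looks like the sketch below; the path and the exact formatting of the entry may differ per install.
# grep -i expire_time $TNS_ADMIN/sqlnet.ora
SQLNET.EXPIRE_TIME = 10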
And yes, this is set to 10 (minutes) in our database, so Dead Connection Detection is active. The next hope was dashed too.
Now the final option was to check the TCP keepalive settings at the server level.
Here comes the hint.
How to check the keepalive settings on a Linux server?
# cat /proc/sys/net/ipv4/tcp_keepalive_time
1000
# cat /proc/sys/net/ipv4/tcp_keepalive_intvl
75
# cat /proc/sys/net/ipv4/tcp_keepalive_probes
9
Tadaaa!!
The TCP keepalive time set on the application server is 1000 seconds (about 16 minutes and 40 seconds), which is why the sessions were unable to send packets to the database after roughly 15 minutes of idleness; since the handshake is broken by then, Oracle treats the session as a dead connection, and PMON clears the processes and reports it in the alert log.
Now we were left with two options: either tune the TCP keepalive settings at the server level or keep the DB connection from the application server alive.
Since the former is considered a security breach at the server level, we went for the latter.
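For reference, the former option would have meant tuning the kernel keepalive parameters on the application server, roughly as sketched below; the values are illustrative assumptions, not settings we actually applied.
# sysctl -w net.ipv4.tcp_keepalive_time=300
# sysctl -w net.ipv4.tcp_keepalive_intvl=60
# sysctl -w net.ipv4.tcp_keepalive_probes=5
Making such a change permanent would also require the corresponding net.ipv4.* entries in /etc/sysctl.conf (reloaded with sysctl -p).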
How to keep the DB connection active?
The keepalive feature on the supported TCP transports can be enabled for a net service client by embedding (ENABLE=BROKEN) under the DESCRIPTION parameter in the connect string. Keepalive allows the caller to detect a dead remote server. Operating system TCP configurables, which vary by platform, define the actual keepalive timing details.
Sample TNS:
net_service_name=
  (DESCRIPTION=
    (ENABLE=BROKEN)
    (ADDRESS=(PROTOCOL=tcp)(HOST=sales1-svr)(PORT=1521))
    (ADDRESS=(PROTOCOL=tcp)(HOST=sales2-svr)(PORT=1521))
    (CONNECT_DATA=(SERVICE_NAME=sales.us.example.com)))
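Since our application connects via JDBC rather than through a tnsnames.ora alias, the same descriptor can also be embedded directly in the JDBC thin connect string; the host and service names below are the sample values from above, not our real ones.
jdbc:oracle:thin:@(DESCRIPTION=(ENABLE=BROKEN)(ADDRESS=(PROTOCOL=tcp)(HOST=sales1-svr)(PORT=1521))(CONNECT_DATA=(SERVICE_NAME=sales.us.example.com)))
This keeps the keepalive setting in the connect string itself, so no client-side tnsnames.ora change is needed.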