Database Connections Maxed Out
This is one of the most common issues I have seen in production systems; most of the major incidents I have dealt with started with DB connections maxing out. Maxing out (using up all the available connections in the connection pool) usually comes down to two reasons.
1. An increase in traffic that the configured pool size can no longer handle (see the pool-sizing sketch after this list).
2. Slowness - the slowness could be in the app server, the DB server, the network, or the query itself, so each connection is held longer and the pool drains.
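The post does not name a particular connection pool, so purely as an illustration, here is a minimal sketch assuming HikariCP; the JDBC URL, credentials, and numbers are made-up examples, not recommendations. The two settings that matter most when a pool maxes out are the maximum pool size and how long a caller waits for a connection before giving up.

import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

public class PoolSetup {
    public static HikariDataSource newPool() {
        HikariConfig config = new HikariConfig();
        config.setJdbcUrl("jdbc:postgresql://db-host:5432/appdb"); // hypothetical URL
        config.setUsername("app_user");                            // hypothetical credentials
        config.setPassword("secret");
        // If normal traffic needs more concurrent connections than this,
        // callers start queueing and the pool looks "maxed out".
        config.setMaximumPoolSize(20);
        // How long a caller waits for a free connection before failing;
        // these timeouts are often the first visible symptom of an exhausted pool.
        config.setConnectionTimeout(3000); // milliseconds
        return new HikariDataSource(config);
    }
}

When a pool that used to cope suddenly maxes out, the useful question is whether traffic grew or whether each connection is now held longer because something downstream became slower.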
Troubleshooting :-
1. Verify response times for any slowness; compare with a past date, a few weeks or months back.
2. Check the app servers and DB servers for abnormal CPU and memory usage.
3. Check if any queries are slowing down in the DB (the query check after this list shows one way to spot them).
4. Check for an increase in traffic tied to a recent rollout of a new application or feature.
5. Compare historic data, going back at least a few weeks or months, to quantify the increase in traffic.
6. Check the network, firewall issues, etc.
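For step 3, most databases can show the statements they are executing right now. As a sketch only (assuming PostgreSQL and its pg_stat_activity view; other databases have their own equivalents), a quick JDBC check for long-running active queries could look like this:

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;

public class ActiveQueryCheck {
    // Lists statements that are currently executing, longest-running first,
    // so slow queries holding on to pooled connections stand out.
    public static void printActiveQueries(Connection conn) throws Exception {
        String sql = "SELECT pid, now() - query_start AS runtime, query "
                   + "FROM pg_stat_activity "
                   + "WHERE state = 'active' "
                   + "ORDER BY query_start";
        try (Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery(sql)) {
            while (rs.next()) {
                System.out.println(rs.getString("pid") + "  "
                        + rs.getString("runtime") + "  "
                        + rs.getString("query"));
            }
        }
    }
}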
Slowness in Response Time
Possible reasons
1. Database issue
2. Query issue - as data grows, the database can abandon the execution plan it used until yesterday and switch to a different plan, which can make the same query slow today (see the plan-check sketch after this list).
3. Network issue
4. App server issue.
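For reason 2, it helps to capture the plan the database is choosing right now and compare it with the plan recorded back when the query was fast. A minimal sketch, assuming a database where prefixing a statement with EXPLAIN returns the plan as rows of text (PostgreSQL and MySQL behave this way; Oracle uses EXPLAIN PLAN FOR instead):

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;

public class PlanCheck {
    // Prints the execution plan the optimizer currently picks for a suspect query.
    // Diffing this against a previously saved plan shows whether the plan changed.
    public static void printPlan(Connection conn, String suspectQuery) throws Exception {
        try (Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery("EXPLAIN " + suspectQuery)) {
            while (rs.next()) {
                System.out.println(rs.getString(1)); // each row is one line of the plan
            }
        }
    }
}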
Troubleshooting :-
1. Try to isolate the layer that is slowing down (the timing sketch after this list shows one way to separate database time from app-server time).
2. Check whether a particular database call is slow or all database calls are slow.
3. Check whether all the app servers are slow or only one or two nodes - one or two bad nodes in the server farm can be enough to cause visible slowness.
4. Check the CPU usage on all the app servers.
5. Verify the memory usage.
6. For Java applications, verify the garbage collection (GC logs); pause times from Full GCs can make the whole system respond slowly.
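For steps 1 and 2, even crude timing around just the database call helps separate database time from app-server time. This is a minimal sketch; the orders table and customer_id column are invented for illustration, and in practice an APM tool or interceptor usually records these timings rather than hand-written code.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class LayerTiming {
    // Times only the database call so it can be compared against the
    // total request time recorded at the web/app layer.
    public static int countOrders(Connection conn, long customerId) throws Exception {
        long start = System.nanoTime();
        try (PreparedStatement ps = conn.prepareStatement(
                "SELECT COUNT(*) FROM orders WHERE customer_id = ?")) {
            ps.setLong(1, customerId);
            try (ResultSet rs = ps.executeQuery()) {
                rs.next();
                return rs.getInt(1);
            }
        } finally {
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            // If this stays flat while overall response time climbs,
            // the database is probably not the slow layer.
            System.out.println("db call took " + elapsedMs + " ms");
        }
    }
}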
CPU Spikes
Possible reasons :-
1. Application Threads
2. Garbage Collection
3. The CPU was already running hot and the load has simply increased, or one of the nodes is bad.
Troubleshooting :-
1. Take application thread dumps and analyse them for deadlocks or blocked threads (see the sketch after this list).
2. Analyse the GC logs, look for frequent Full GCs, and check the pause times.
3. Check which garbage collection algorithm is being used.
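Thread dumps are normally taken with jstack (or kill -3) and read by hand; the same deadlock and blocked-thread check can also be done in-process with the standard java.lang.management API, as in this sketch:

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class ThreadCheck {
    // A programmatic version of a thread dump check: reports deadlocked threads
    // and any threads currently stuck in the BLOCKED state.
    public static void report() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();

        long[] deadlocked = threads.findDeadlockedThreads();
        if (deadlocked != null) {
            for (ThreadInfo info : threads.getThreadInfo(deadlocked, 5)) {
                System.out.println("DEADLOCKED: " + info.getThreadName());
            }
        }

        for (ThreadInfo info : threads.dumpAllThreads(false, false)) {
            if (info.getThreadState() == Thread.State.BLOCKED) {
                System.out.println("BLOCKED: " + info.getThreadName()
                        + " waiting on " + info.getLockName());
            }
        }
    }
}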
Out of Memory
Possible reasons :
1. Memory Leak
2. Not enough memory for the JVM. E.g. (hypothetical numbers): on a 4GB machine you are running a JVM with a 500MB minimum and 1GB maximum heap size alongside a memcached instance configured with a 2.5GB cache, which leaves very little headroom for the OS and everything else.
Troubleshooting :-
1. Analyse the heap dump - always set the JVM parameter that captures a heap dump on an out-of-memory error (the flags sketch after this list shows typical settings).
2. Check the JVM memory parameters.
3. Check the memory allocation.
4. Verify the GC logs and analyse how much memory each collection frees, how frequent the Full GCs are, etc.
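For steps 1, 2, and 4, the relevant switches live on the java command line. This is only a sketch of a typical startup line; the paths and heap sizes are examples, not recommendations, and the GC logging flag differs by JDK version.

# -Xms/-Xmx                        fix the heap size so sizing problems are reproducible
# -XX:+HeapDumpOnOutOfMemoryError  write a heap dump automatically when the JVM throws OutOfMemoryError
# -XX:HeapDumpPath                 point at a disk with enough free space for a full-heap dump
# -Xlog:gc*                        unified GC logging on JDK 9+; JDK 8 uses -Xloggc:<file> -XX:+PrintGCDetails
java -Xms1g -Xmx1g \
     -XX:+HeapDumpOnOutOfMemoryError \
     -XX:HeapDumpPath=/var/log/app/heapdumps \
     -Xlog:gc*:file=/var/log/app/gc.log \
     -jar app.jar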