.NET Garbage Collection in a nutshell - debugging code down to the metal - Part 1

Recently I was asked to troubleshoot an "out of memory" production issue in a services layer. It was an opportunity to dig into the DNA of .NET Garbage Collection.

Here is the problem:

The application is distributed, with components and services deployed on 3 geographically separate servers and MS Azure. One of the components, deployed as a service, was continuously throwing Out of Memory exceptions. Before throwing this exception, the service's response time degraded to more than 5 seconds, which in turn caused Operation Timeout exceptions on the client side.

Here goes the debugging and root cause analysis.

  1. WinDbg and SOS came to the rescue in troubleshooting the issue. After investigating the memory dump, I found that the Large Object Heap (LOH) was highly fragmented. Several large (up to 374 MB) unused memory holes had been created on the managed heap.
  2. In the initial stage, after the W3WP process was reset, memory and CPU utilization were very high while the cache was being built. After that, memory and CPU utilization kept increasing steadily.
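
For readers who want to watch a similar trend from inside a process rather than from a dump, here is a minimal sketch. It is not the tooling used in this investigation (that was WinDbg and SOS); the class name MemorySnapshot and its Capture method are hypothetical, and it assumes it is acceptable to sample the counters in-process.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper, not part of the original service.
public static class MemorySnapshot
{
    // Returns one line with the managed heap size, the process's private
    // bytes, and the number of collections per generation so far.
    public static string Capture()
    {
        using (var process = Process.GetCurrentProcess())
        {
            return string.Format(
                "managed={0:N0} B, private={1:N0} B, gen0={2}, gen1={3}, gen2={4}",
                GC.GetTotalMemory(false),      // false: do not force a collection
                process.PrivateMemorySize64,
                GC.CollectionCount(0),
                GC.CollectionCount(1),
                GC.CollectionCount(2));
        }
    }

    public static void Main()
    {
        Console.WriteLine(Capture());
    }
}
```

Logging such a line at intervals gives a rough picture of whether memory and collection counts keep climbing after start-up. It says nothing about fragmentation, which still needs a dump.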

After taking 5 memory dumps in succession (during the application start and an hour later), I found that the same object was consuming a huge amount of memory on the LOH. And I caught the culprit: it was List<custom_object>, eating all the memory on the managed heap and creating those unused memory holes.

Investigating the dumps and the code further, I got to the code causing this mess on the heap: list.Add(custom_object); After fetching data from the database, the logic created entities and added them to a System.Collections.Generic.List. This list was then returned to the Business layer for the rest of the manipulation, which is not relevant to discuss here.

More than 200K entities were being added to this generic list, and every entity took 1877 bytes on the managed heap. Adding items to the List this way fragments not only the LOH but also the Gen 0, Gen 1 and Gen 2 heaps. Consequently, it compels the GC to run frequently, which brings performance down further because the GC requires more processor time.
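
As a rough illustration of why repeated list.Add calls can touch the LOH at all, the sketch below counts how often a List<T> replaces its backing array while it grows, and how many of those arrays are at or above the 85,000-byte size at which the CLR allocates objects on the LOH. It is an illustration under assumptions, not the service's code: CustomObject is a hypothetical stand-in for the real entity, the array size is approximated as capacity times reference size (ignoring the array header), and the growth pattern observed is an implementation detail of List<T> rather than a documented guarantee.

```csharp
using System;
using System.Collections.Generic;

public static class ListGrowthDemo
{
    // Objects of 85,000 bytes or more are allocated on the Large Object Heap.
    public const int LohThresholdBytes = 85000;

    // Hypothetical stand-in for the post's custom_object.
    public sealed class CustomObject
    {
        public int Id;
    }

    // Adds itemCount objects one at a time and reports how many times the
    // backing array was replaced, and how many of the new arrays were
    // (approximately) large enough to land on the LOH.
    public static (int reallocations, int lohArrays) Measure(int itemCount)
    {
        var list = new List<CustomObject>();
        int reallocations = 0;
        int lohArrays = 0;
        int lastCapacity = list.Capacity;

        for (int i = 0; i < itemCount; i++)
        {
            list.Add(new CustomObject { Id = i });

            if (list.Capacity != lastCapacity)
            {
                // A capacity change means a new, larger array was allocated
                // and the old one became garbage.
                lastCapacity = list.Capacity;
                reallocations++;

                // Approximate size: one reference per slot, header ignored.
                long approxBytes = (long)lastCapacity * IntPtr.Size;
                if (approxBytes >= LohThresholdBytes)
                {
                    lohArrays++;
                }
            }
        }

        return (reallocations, lohArrays);
    }

    public static void Main()
    {
        var (reallocations, lohArrays) = Measure(200000);
        Console.WriteLine(
            $"reallocations: {reallocations}, arrays at or above LOH threshold: {lohArrays}");
    }
}
```

Each replaced array is abandoned as soon as the next one is allocated, so the large ones leave free blocks behind on the LOH. Whether this is the whole story for the service in question is a separate matter.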

The root cause and a fix are explained in Part 2 of this post.