
Recursive DFS: 2 BILLION Stack Frames Deep, BEATING Iterative!

I started stripping away List<int>. Generics. Foreach loops. Then suddenly a hatch released and the code started free falling—the depth grew vastly deeper, and it got faster too.

By Robert Long | January 10, 2026

In Part 1, we hit 30 million recursive calls using HPC techniques—struct-based nodes, integer indices, and bool[] for visited tracking. I thought that was the limit. Then I started stripping away more: the List<int>, the generic type parameter, the foreach loop. Eventually, a single change suddenly took the depth to 2 billion and made the traversal 24x faster.


The Problem with Our Previous Implementation

Here's the HPC implementation from Part 1 that achieved 30 million depth:

// HPC DFS Recursive v1: 30 million depth (from Part 1)
public struct Node<T>
{
    public T Value;
    public List<int> Neighbors;  // ← Hidden allocation
}

public static void Dfs(
    this Node<int>[] nodes,
    int currentIndex,
    bool[] visited)
{
    if (currentIndex < 0 || currentIndex >= nodes.Length)
        return;
    if (visited[currentIndex])
        return;

    visited[currentIndex] = true;

    foreach (var neighborIndex in nodes[currentIndex].Neighbors)  // ← Enumerator overhead
        nodes.Dfs(neighborIndex, visited);
}

I optimized it to this:

// 8 bytes per node (two 4-byte ints)
public struct NodeLinear
{
    public int Value;
    public int NextIndex;  // -1 = no neighbor
}

public static int DfsRecursiveLinear(
    this NodeLinear[] nodes,
    int index,
    bool[] visited)
{
    // Unsigned comparison trick: handles negative AND overflow in one check
    if ((uint)index >= (uint)nodes.Length || visited[index])
        return 0;

    visited[index] = true;

    int next = nodes[index].NextIndex;
    if (next >= 0)
        return nodes.DfsRecursiveLinear(next, visited) + 1;  // ← count visited nodes
    return 1;
}

What We Stripped: Before vs After

Component      Part 1 (30M)                 Part 2 (2B)               Impact
Node type      Node<T> (generic)            NodeLinear (concrete)     No JIT specialization overhead
Neighbors      List<int> (heap reference)   int NextIndex (inline)    No heap pointer, no indirection
Node size      ~24+ bytes                   8 bytes                   3x smaller, better cache
Iteration      foreach (enumerator)         if (next >= 0)            No MoveNext/Current calls
Bounds check   i < 0 || i >= len            (uint)i >= (uint)len      Single comparison vs two

The Unsigned Comparison Trick (AI-suggested)

(uint)index >= (uint)nodes.Length is a common performance optimization. If index is negative, casting to uint makes it a very large positive number, which will always be greater than the array length. This eliminates a separate index < 0 check.
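You can watch the trick work in isolation with a tiny standalone check (an illustrative snippet, not benchmark code):

int[] arr = new int[10];
int index = -1;

bool twoChecks = index < 0 || index >= arr.Length;  // two comparisons
bool oneCheck  = (uint)index >= (uint)arr.Length;   // one comparison

// (uint)(-1) == 4,294,967,295, far larger than any array length,
// so a negative index always fails the single unsigned check too.
Console.WriteLine($"{twoChecks} {oneCheck}");       // True True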


Triggering the Hatch

Depth        Time      Call Stack Used   Result
30,000,000   825ms     1.4GB             SUCCESS
40,000,000   1,100ms   1.8GB             SUCCESS
44,720,000   1,238ms   2.0GB             MAX (stack limit)
45,000,000   —         >2GB              STACK OVERFLOW

The crash at 44.72 million confirmed our frame size was ~48 bytes. We had fully saturated the 2GB stack. This should have been the end of the experiment.

But I tried one more thing. I removed the return value to see if it saved a few bytes:

public static void DfsRecursiveLinear(
    this NodeLinear[] nodes,
    int index,
    bool[] visited)
{
    if ((uint)index >= (uint)nodes.Length || visited[index])
        return;

    visited[index] = true;

    int next = nodes[index].NextIndex;
    if (next >= 0)
        nodes.DfsRecursiveLinear(next, visited);  // ← just recurse, nothing after
}

Suddenly, the hatch released!! It started free falling way faster than it did before.

I ran it again at 60 million? Deeper...
Again at 100 million? Deeper...
Again at 500 million??? DEEPER...
Again at 1 BILLION?!?! DEEEEEEPER!!!!!

What the hell is happening?!

The Free Fall

Depth           New Recursion   Old Recursion    Details
10,000          0ms             0ms
100,000         0ms             2ms
1,000,000       1ms             27ms             New 27x faster
10,000,000      11ms            276ms            New 25x faster
20,000,000      24ms            555ms            New 23x faster
30,000,000      34ms            828ms            New 24x faster
40,000,000      46ms            1,098ms          New 24x faster
44,720,000      51ms            Stack Overflow
50,000,000      57ms            —
100,000,000     115ms           —
500,000,000     576ms           —
1,000,000,000   1,183ms         —
2,000,000,000   2,378ms         —
2,147,483,591   2,560ms         —                New max depth

2 BILLION

recursive calls in 2.5s

That's unbelievable! How the hell is that possible? Not only did it blow past the old 44.72 million ceiling, it was 24x faster! I ran it over and over... and over. It was definitely legit. I was completely baffled!

After some thought, it had to be the CLR itself doing something. It found a pattern in my code. Could this be the .tail IL instruction I mentioned in Part 1? Was it somehow activated???

It can't be—this is C#, not a functional language!

Could this be tail call optimization happening in C#?! An imperative language?! The C# compiler doesn't emit .tail—that's an F# thing.

This is very strange...


Disassembling the Binary

If tail call optimization was happening, there would be evidence in the Native AOT compiled binary. Since we're using dotnet publish with AOT, the C# code gets compiled directly to native machine code—no JIT at runtime. The smoking gun would be a jmp instruction instead of a call instruction at the recursive call site.

The Tool: dumpbin

I used dumpbin.exe from Visual Studio to disassemble the Native AOT binary. This tool comes with Visual Studio and can be found at:

C:\Program Files\Microsoft Visual Studio\...\VC\Tools\MSVC\...\bin\Hostx64\x64\dumpbin.exe

The command I ran:

dumpbin /disasm LargeStackBenchmark.exe | grep -A 50 "DfsRecursiveLinear"

The New Recursion (TCO Version) - Actual Disassembly

; HpcDfsLinearExtensions__DfsRecursiveLinear - Native AOT x64
0140002220: sub    rsp,28h
0140002224: mov    eax,[rcx+8]              ; eax = nodes.Length
0140002227: cmp    eax,edx                  ; bounds check
0140002229: jbe    0140002250               ; if out of bounds, exit
014000222B: mov    r10d,[r8+8]              ; r10d = visited.Length
014000222F: cmp    edx,r10d                 ; ← JUMP TARGET (start)
0140002232: jae    0140002255               ; bounds check on visited[]
0140002236: cmp    byte ptr [r8+rdx+10h],0  ; visited[index] == false?
014000223C: jne    0140002250               ; if visited, exit
014000223E: mov    byte ptr [r8+rdx+10h],1  ; visited[index] = true
0140002244: mov    edx,[rcx+rdx*8+14h]      ; edx = nodes[index].NextIndex
0140002248: test   edx,edx                  ; check if next >= 0
014000224A: jl     0140002250               ; if negative, exit
014000224C: cmp    eax,edx
014000224E: ja     014000222F               ; ← JUMP BACK to start!
0140002250: add    rsp,28h
0140002254: ret

BINGO!!! There's no call opcode at all—just a ja (jump if above)!

Look at address 014000224E: ja 014000222F. That's a conditional jump back to address 014000222F—which is earlier in the same function. The compiler transformed the tail recursion into a simple goto!

The Native AOT compiler recognized the tail-recursive pattern and converted it into a jump instruction. No function call overhead. No stack frame allocation. Just a jump.
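In C# terms, the machine code above is morally equivalent to this loop. This is a hand-written sketch of the transformation, not compiler output:

// What the ja back-edge amounts to: update the parameter and jump back
// to the top. Same frame, no call.
public static void DfsAsLoop(NodeLinear[] nodes, int index, bool[] visited)
{
    while (true)
    {
        if ((uint)index >= (uint)nodes.Length || visited[index])
            return;

        visited[index] = true;

        int next = nodes[index].NextIndex;
        if (next < 0)
            return;

        index = next;  // the "recursive call", reduced to an assignment
    }
}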

The Old Recursion (No TCO) - Actual Disassembly

For comparison, here's the version that crashes at 44.72 million—the one with return result + 1 after the recursive call:

; HpcDfsLinearNoTcoExtensions__DfsRecursiveLinearNoTco - Native AOT x64
0140002260: sub    rsp,28h
; ... bounds check and visited check ...
014000227D: mov    byte ptr [r8+rax+10h],1  ; visited[index] = true
0140002283: mov    edx,[rcx+rax*8+14h]      ; edx = nodes[index].NextIndex
0140002287: test   edx,edx                  ; check if next >= 0
0140002289: jge    0140002295               ; if non-negative, recurse
014000228B: mov    eax,1                    ; base case: return 1
0140002290: add    rsp,28h
0140002294: ret
0140002295: call   ...DfsRecursiveLinearNoTco  ; ← RECURSIVE CALL!
014000229A: inc    eax                      ; ← result + 1 (BREAKS TCO!)
014000229C: add    rsp,28h
01400022A0: ret

See the difference?

  • call ...DfsRecursiveLinearNoTco at address 0140002295—this creates a new stack frame every time
  • inc eax at address 014000229A—that's the return result + 1 from the C# code

That single inc eax after the call prevents tail call optimization: the function still has work to do after the recursive call returns, so the compiler must keep its stack frame alive.

The Verdict

Method                    Key Instruction             What It Means
DfsRecursiveLinear        ja 014000222F (jump back)   Goto/jump—no stack frames!
DfsRecursiveLinearNoTco   call + inc eax              Recursive call—48 bytes per frame

What Triggers the Hatch?

Native AOT recognized the tail call pattern and compiled the recursion into a jump. No stack frames accumulate. The recursive call becomes a simple goto at the machine code level.

What triggered Tail Call Recursion Optimization:

public static void DfsRecursiveLinear(
    this NodeLinear[] nodes,
    int index,
    bool[] visited)
{
    if ((uint)index >= (uint)nodes.Length || visited[index])
        return;

    visited[index] = true;

    int next = nodes[index].NextIndex;
    if (next >= 0)
        nodes.DfsRecursiveLinear(next, visited);  // ← TAIL POSITION!
}

The key: the recursive call is the last thing the function does. There's no work after it—no return result + 1, no cleanup, nothing. This is called tail position, and it allows the compiler to replace the call with a jmp.
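And if you still want the node count from the earlier version, you can keep the call in tail position by threading an accumulator instead of adding after the call returns. A sketch; whether the compiler applies TCO to this exact shape is an assumption I haven't verified:

// Hypothetical variant: count nodes while keeping the recursive call in
// tail position. The + 1 happens BEFORE the call, not after it.
public static int DfsRecursiveLinearCount(
    this NodeLinear[] nodes,
    int index,
    bool[] visited,
    int count)
{
    if ((uint)index >= (uint)nodes.Length || visited[index])
        return count;

    visited[index] = true;

    int next = nodes[index].NextIndex;
    if (next >= 0)
        return nodes.DfsRecursiveLinearCount(next, visited, count + 1);  // still a tail call
    return count + 1;
}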

What this means:

  • Zero stack frame overhead — each "recursion" reuses the same frame
  • Zero heap allocation — unlike Stack<T> which allocates on the heap
  • No GC pressure — nothing to clean up
  • Identical performance to hand-written iteration — both become goto at machine level

JIT Mode Also Triggers TCO

I tested JIT mode (using dotnet run instead of Native AOT) and JIT also triggers TCO—reaching 2 billion depth successfully. Both compilation modes recognize the tail call pattern.
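If you want to inspect the JIT's codegen without a disassembler, .NET exposes the DOTNET_JitDisasm environment variable (documented for .NET 7+; I'm assuming it behaves the same on .NET 10):

:: Windows cmd: print the JIT's assembly for the method as it compiles
set DOTNET_JitDisasm=DfsRecursiveLinear
dotnet run -c Release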

Speed: TCO Recursive vs Stack<T> Iterative

Depth           AOT TCO Recursive   JIT TCO Recursive   AOT Iterative   JIT Iterative
30,000,000      34ms                272ms               79ms            43ms
100,000,000     115ms               376ms               258ms           160ms
1,000,000,000   1,183ms             1,423ms             2,650ms         1,410ms
2,000,000,000   2,378ms             2,617ms             5,387ms         2,791ms
Traversal Overhead: TCO Recursive vs Stack<T> Iterative (Linear Graph)

Approach             Traversal Memory   Where
TCO Recursive        0 bytes            Call stack (reused frame)
Stack<T> Iterative   4 bytes            Heap (1 element max for linear graph)

Note: Both approaches require the same ~19GB of data at 2B depth (about 17GB for NodeLinear[] at 8 bytes per node, plus about 2GB for bool[]). The difference is the traversal overhead—TCO runs entirely on the call stack with 0 heap allocation.

Max Depth Comparison

Approach            Max Depth       Time at Max
AOT TCO Recursive   2,147,483,591   2,560ms
JIT TCO Recursive   2,147,483,591   2,777ms
AOT Iterative       2,147,483,591   5,708ms
JIT Iterative       2,147,483,591   2,992ms

All approaches hit the same max depth—limited by RAM for data structures, not stack space.

TCO recursive is the fastest in both JIT and AOT modes, with AOT being the champion. The traversal runs entirely on the call stack, reusing the same frame 2 billion times with 0 heap allocation.


Why Is Iterative Slower?

The Stack<T> approach feels like the "right" way to do things, but it has hidden costs that accumulate with every step:

  • Method call overhead — Every step requires Push(), Pop(), and Count property access
  • Bounds checking — Stack<T> validates every operation to ensure you aren't accessing invalid memory
  • Indirection — You're following pointers to the heap rather than using direct register access

The TCO recursive version bypasses all of this. It compiles down to essentially a goto. No method calls. No bounds checks. Just CPU registers and a jump instruction.
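For reference, the Stack<T> baseline I have in mind looks something like this. The post doesn't show the benchmarked iterative code, so treat this as an assumed sketch:

using System.Collections.Generic;

// Assumed shape of the iterative baseline: every step pays Push/Pop/Count.
public static int DfsIterative(NodeLinear[] nodes, int start, bool[] visited)
{
    var stack = new Stack<int>();   // heap-allocated traversal state
    stack.Push(start);
    int count = 0;

    while (stack.Count > 0)         // property access every iteration
    {
        int index = stack.Pop();    // method call every iteration
        if ((uint)index >= (uint)nodes.Length || visited[index])
            continue;

        visited[index] = true;
        count++;

        int next = nodes[index].NextIndex;
        if (next >= 0)
            stack.Push(next);       // method call, plus a possible resize
    }
    return count;
}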


Wait... How Is This Possible?

Let's do the napkin math. With a ~2GB stack and the smallest possible stack frame, how deep should we be able to go?

Even if we assume a highly optimized stack frame, the physical limits of memory should stop us long before we hit 2 billion.

Frame Size   What fits?                Theoretical Max Depth (2GB Stack)
48 bytes     Standard C# Frame         ~44,739,242
32 bytes     Optimized Frame           ~67,108,864
16 bytes     Minimal (Ret + Pointer)   ~134,217,728
0 bytes      TCO (Tail Call)           Infinite (CPU limited)

We hit 2 billion. That's 16x to 48x deeper than the theoretical maximums above for a standard recursive function.

Something else is happening here. The compiler didn't just optimize the stack usage; it fundamentally changed the physics of the execution.


🏆 Winner: Recursion

1. The Champion 👑

AOT (Ahead-of-Time compiled) TCO (Tail Call Optimized) Recursive DFS didn't just go toe-to-toe with the iterative approach—it consistently outranked every other approach and stayed on top. I am pleasantly surprised! Who would've thought that TCO would come out of nowhere to save the day?!

⚡ Speed: Fastest of all approaches (faster than Stack<T>)

🧠 Memory: 0 bytes traversal allocation

🕳️ Depth: 2,147,483,591 recursive calls

We didn't stop because we ran out of stack space; we stopped because we hit the maximum size of a C# Array.
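That ceiling is easy to confirm. Since .NET 6, Array.MaxLength reports the runtime's maximum array length, and it matches our final depth exactly:

// Prints 2147483591 (0x7FFFFFC7): the largest allowed array length,
// and exactly the depth where our traversal topped out.
Console.WriteLine(Array.MaxLength);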

2. The Fourth Metric: Expressiveness & Abstraction ✨

The truth is, both for loops and recursion are language constructs that compile down to GOTO opcodes. Neither is inherently "better" than the other. They are simply tools that allow us to write code more naturally. We use them because they are easier to read and safer to write.

Recursion is the native language of Functional Programming.
It describes the problem definition itself: "Visit this node, then visit its neighbors."

The Iterative approach is just a manual simulation.
Most high-level languages force you into this corner. When we write an iterative DFS, we are essentially saying: "I don't trust the compiler to manage the Call Stack, so I will manually manage a Stack<T> object on the Heap." It is mechanical, verbose, and disconnected from the algorithm's intent.

C# allows you to be an assassin with recursion.
While other languages trap you in safety nets, C# gives you the keys to the castle. It allows you to strip away the overhead and activate tail call recursion optimization in hardcore mode.

And the results speak for themselves. In our benchmark, this wasn't just "comparable" to iteration—it was the fastest of all 4 approaches.

The Definition of Control:
Real control isn't about manually managing memory or writing complex loops. Real control comes from understanding the language and hardware limits. Once you understand the physics, you can optimize the system to do exactly what you want—faster, cleaner, and more efficiently than the "safe" default.

3. The Fifth Metric: Flexibility 🔧

This is the most critical takeaway. There is a myth that "High Performance" means "Low Level" or "Complex Code."

  • The Reality: Deep knowledge of the hardware (Bare Metal Physics) doesn't restrict you; it gives you flexibility.
  • The Freedom: Because we understood exactly how the Stack works and exactly how the CPU handles Tail Calls, we gained the freedom to use a "dangerous" tool (Recursion) safely.
  • The Result: We didn't have to choose between "Clean Code" and "Fast Code." We got both because we respected the physics of the machine.

Most developers are unaware of these limits, so they blindly follow the rule: "Always use Iteration." When you know the physics, you can break the rules.

4. The Real-World "Hidden Killer" 💀

In our clean linear benchmark, the Stack<T> stayed small, so the Garbage Collector remained mostly quiet. But in a messy, real-world production system with branching graphs and complex state, the Iterative approach is a ticking time bomb.

  • 🧩 Heap Fragmentation: As Stack<T> resizes, it demands larger and larger contiguous blocks of memory. On a long-running server with a fragmented heap, this leads to OutOfMemory exceptions even when you theoretically have free RAM.
  • 🗑️ LOH Issues: Once your stack array exceeds 85KB, it moves to the Large Object Heap (LOH). The LOH is rarely compacted, leading to permanent fragmentation.
  • 📈 Latency Spikes: Every expansion triggers an allocation. Every drop triggers cleanup. In latency-sensitive systems, this jitter kills your p99 performance.

The TCO Advantage: Because TCO uses the Thread Stack (LIFO) and CPU registers:

  • Pre-allocated, contiguous memory—no fragmentation possible
  • No heap interaction during traversal
  • No GC involvement—zero pauses
  • LIFO by design—the natural home for recursion

5. Undocumented and Defies Conventional Wisdom 📚

I looked for this as a language feature in the official C# Language Specification (ECMA-334). I checked the major reference books.

There is nothing.

This behavior is completely undocumented in official Microsoft sources. If you rely on the standard literature, this capability effectively does not exist.

Additionally, if you ask anyone about this scenario, they will insist that the iterative approach is safer, won't overflow, and runs faster. Even if you ask any AI about this problem today, it will immediately generate the iterative code and tell you it is the superior solution.

But they are all dead wrong.

Recursion is the superior solution. If you know what you're doing, you really don't have to choose—you can get speed, memory efficiency, and expressiveness all at once.

Do NOT trust conventional wisdom when you work with recursion—at least not while you're using HPC C#! 😉

6. Final Verdict 🎯

You don't have to pick iterative all the time anymore. C# Native AOT gives you much more flexibility here. When you understand the machine and its restrictions, you can ironically gain a higher degree of freedom.


Test Environment

OS: Windows 11
Runtime: .NET 10 Native AOT
CPU: AMD Ryzen 9 9950X3D 16-Core Processor
RAM: 192 GB DDR5
Stack Size: ~2GB (int.MaxValue thread stack allocation)
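For completeness, that ~2GB stack doesn't come from any project setting; it's requested per thread, along the lines of this sketch (my assumption of the harness, since the post doesn't show it):

using System;
using System.Threading;

// Hypothetical harness: run the traversal on a thread with an explicit
// maxStackSize. int.MaxValue matches the Stack Size entry above; the OS
// rounds the request to its own granularity.
var worker = new Thread(() =>
{
    // build nodes[]/visited[] and call DfsRecursiveLinear here
}, maxStackSize: int.MaxValue);

worker.Start();
worker.Join();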

Closing Thoughts

Tail Call Optimization isn't new. It has been a fundamental concept since the dawn of Lisp in the 1960s. It didn't "suddenly appear"—it was always there, sitting quietly in the compiler.

But because of HPC and the new memory features in C#, this introduces new unexplored territory. We often forget that despite being a high-level managed language, C# has deep roots in C. It possesses a "bare metal" capability that other managed languages simply cannot touch. By respecting that heritage and writing code that the hardware understands, I was able to unlock a capability that most C# developers don't even know exists.

This experiment didn't just teach me something new; it validated something I have always suspected. I've always questioned the industry dogma that you must always convert recursion to iteration (and I'm pretty sure others have as well). It never sat right with me. With High-Performance C#, I was finally able to put that skepticism to the test.

It's amazing how this experiment is now changing my mental model of what is possible and what isn't—and debunks the myth that you should always convert recursive calls to iterative ones. Not only is this more verbose and less expressive, but it could also cost you performance! 🤯

It brings me back to over a decade ago at UC Irvine, sitting in Professor Cristina Videira Lopes' course. In her book Exercises in Programming Style, there is a specific chapter on this—Chapter 9: "The Kick Forward" (Continuation-Passing Style).

I remember obsessing over this problem, drawing out the function stacks and diagrams on paper, trying to visualize where the memory was going. I spent hours and days trying to get it to work in C#, staring at what looked like such a simple, elegant function call. But it would always crash. I eventually gave up and switched to F# just to get the recursion to compile without overflowing.

I thought C# simply couldn't do it. I thought it was a language limitation.

Little did I know, 10 years later, I would trigger that exact optimization in C# to push beyond the boundaries of what I thought possible. To go down into Alice's bottomless rabbit hole.... into hyper-dimensional space.

Stay tuned. :)

Keywords
DFS, depth-first search, recursion, stack overflow, tail call optimization, TCO, Native AOT, structs, value types, high-performance algorithms, graph traversal

Related Articles

HPC C# Part 1
Recursive DFS: 30 Million Stack Frames

The foundation: structs, integer indices, and bool[] tracking. Where this journey started.

HPC C#
HPC C# (Hardcore Mode)

The 5 features that make C# competitive with C++ and Rust. The theory behind the practice.

About the Author

Robert Long

Senior Software Engineer

10+ years building cloud-native, high-performance software using first-principles thinking.