In Part 1, we hit 30 million recursive calls
using HPC techniques—struct-based nodes, integer indices, and bool[] for visited tracking.
I thought that was the limit. Then I started stripping away more: the List<int>,
the generic type parameter, the foreach loop. Eventually, a single change sent us
plunging to a depth of 2 billion—and made the traversal 24x faster.
The Problem with Our Previous Implementation
Here's the HPC implementation from Part 1 that achieved 30 million depth:
// HPC DFS Recursive v1: 30 million depth (from Part 1)
public struct Node<T>
{
    public T Value;
    public List<int> Neighbors; // ← Hidden allocation
}

public static void Dfs(
    this Node<int>[] nodes,
    int currentIndex,
    bool[] visited)
{
    if (currentIndex < 0 || currentIndex >= nodes.Length)
        return;
    if (visited[currentIndex])
        return;
    visited[currentIndex] = true;
    foreach (var neighborIndex in nodes[currentIndex].Neighbors) // ← Enumerator overhead
        nodes.Dfs(neighborIndex, visited);
}
I optimized it to this:
// 8 bytes per node (two 4-byte ints)
public struct NodeLinear
{
    public int Value;
    public int NextIndex; // -1 = no neighbor
}

public static int DfsRecursiveLinear(
    this NodeLinear[] nodes,
    int index,
    bool[] visited)
{
    // Unsigned comparison trick: handles negative AND overflow in one check
    if ((uint)index >= (uint)nodes.Length || visited[index])
        return 0;
    visited[index] = true;
    int next = nodes[index].NextIndex;
    if (next >= 0)
        return nodes.DfsRecursiveLinear(next, visited) + 1; // ← count visited nodes
    return 1;
}
What We Stripped: Before vs After
| Component | Part 1 (30M) | Part 2 (2B) | Impact |
|---|---|---|---|
| Node type | Node<T> (generic) | NodeLinear (concrete) | No JIT specialization overhead |
| Neighbors | List<int> (heap reference) | int NextIndex (inline) | No heap pointer, no indirection |
| Node size | ~24+ bytes | 8 bytes | 3x smaller, better cache |
| Iteration | foreach (enumerator) | if (next >= 0) | No MoveNext/Current calls |
| Bounds check | i < 0 \|\| i >= len | (uint)i >= (uint)len | Single comparison vs two |
The Unsigned Comparison Trick (AI-suggested)
(uint)index >= (uint)nodes.Length is a common performance optimization.
If index is negative, casting to uint makes it a very large positive number,
which will always be greater than the array length. This eliminates a separate index < 0 check.
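Here's a quick sketch of the equivalence, using hypothetical values just to illustrate:

```csharp
// Casting -1 to uint yields 4,294,967,295, which always exceeds any array
// length, so one unsigned comparison covers both failure cases.
int index = -1;
int length = 10;

bool twoChecks = index < 0 || index >= length; // two comparisons, two branches
bool oneCheck = (uint)index >= (uint)length;   // one comparison, one branch

Console.WriteLine(twoChecks == oneCheck); // True for every int value of index
```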
Triggering the Hatch
| Depth | Time | Call Stack Used | Result |
|---|---|---|---|
| 30,000,000 | 825ms | 1.4GB | SUCCESS |
| 40,000,000 | 1,100ms | 1.8GB | SUCCESS |
| 44,720,000 | 1,238ms | 2.0GB | MAX (stack limit) |
| 45,000,000 | — | >2GB | STACK OVERFLOW |
The crash at 44.72 million confirmed our frame size was ~48 bytes (2,147,483,648 bytes ÷ 44,720,000 frames ≈ 48 bytes per frame). We had fully saturated the 2GB stack. This should have been the end of the experiment.
But I tried one more thing. I removed the return value to see if it saved a few bytes:
public static void DfsRecursiveLinear(
    this NodeLinear[] nodes,
    int index,
    bool[] visited)
{
    if ((uint)index >= (uint)nodes.Length || visited[index])
        return;
    visited[index] = true;
    int next = nodes[index].NextIndex;
    if (next >= 0)
        nodes.DfsRecursiveLinear(next, visited); // ← just recurse, nothing after
}
Suddenly, the hatch released!! It started free-falling, way faster than it did before.
I ran it again at 60 million? Deeper...
Again at 100 million? Deeper...
Again at 500 million??? DEEPER...
Again at 1 BILLION?!?! DEEEEEEPER!!!!!
What the hell is happening?!
The Free Fall
| Depth | New Recursion | Old Recursion | Details |
|---|---|---|---|
| 10,000 | 0ms | 0ms | — |
| 100,000 | 0ms | 2ms | — |
| 1,000,000 | 1ms | 27ms | New 27x faster |
| 10,000,000 | 11ms | 276ms | New 25x faster |
| 20,000,000 | 24ms | 555ms | New 23x faster |
| 30,000,000 | 34ms | 828ms | New 24x faster |
| 40,000,000 | 46ms | 1,098ms | New 24x faster |
| 44,720,000 | 51ms | Stack Overflow | — |
| 50,000,000 | 57ms | — | — |
| 100,000,000 | 115ms | — | — |
| 500,000,000 | 576ms | — | — |
| 1,000,000,000 | 1,183ms | — | — |
| 2,000,000,000 | 2,378ms | — | — |
| 2,147,483,591 | 2,560ms | — | New max depth |
2 BILLION
recursive calls in 2.5s
That's unbelievable! How the hell is that possible? Not only did it blow past the old 44.72 million ceiling, it was 24x faster! I ran it over and over... and over. It was definitely legit. I was completely baffled!
After some thought, it had to be the CLR itself doing something. It found a pattern in my code.
Could this be the .tail IL instruction I mentioned in Part 1? Was it somehow activated???
It can't be—this is C#, not a functional language!
Could this be tail call optimization happening in C#?! An imperative language?!
The C# compiler doesn't emit .tail—that's an F# thing.
This is very strange...
Disassembling the Binary
If tail call optimization was happening, there would be evidence in the Native AOT compiled binary.
Since we're using dotnet publish with AOT, the C# code gets compiled directly to native
machine code—no JIT at runtime. The smoking gun would be a jmp instruction instead of
a call instruction at the recursive call site.
The Tool: dumpbin
I used dumpbin.exe from Visual Studio to disassemble the Native AOT binary.
This tool comes with Visual Studio and can be found at:
C:\Program Files\Microsoft Visual Studio\...\VC\Tools\MSVC\...\bin\Hostx64\x64\dumpbin.exe
The command I ran:
dumpbin /disasm LargeStackBenchmark.exe | grep -A 50 "DfsRecursiveLinear"
The New Recursion (TCO Version) - Actual Disassembly
; HpcDfsLinearExtensions__DfsRecursiveLinear - Native AOT x64
0140002220: sub rsp,28h
0140002224: mov eax,[rcx+8] ; eax = nodes.Length
0140002227: cmp eax,edx ; bounds check
0140002229: jbe 0140002250 ; if out of bounds, exit
014000222B: mov r10d,[r8+8] ; r10d = visited.Length
014000222F: cmp edx,r10d ; ← JUMP TARGET (start)
0140002232: jae 0140002255 ; bounds check on visited[]
0140002236: cmp byte ptr [r8+rdx+10h],0 ; visited[index] == false?
014000223C: jne 0140002250 ; if visited, exit
014000223E: mov byte ptr [r8+rdx+10h],1 ; visited[index] = true
0140002244: mov edx,[rcx+rdx*8+14h] ; edx = nodes[index].NextIndex
0140002248: test edx,edx ; check if next >= 0
014000224A: jl 0140002250 ; if negative, exit
014000224C: cmp eax,edx
014000224E: ja 014000222F ; ← JUMP BACK to start!
0140002250: add rsp,28h
0140002254: ret
BINGO!!! There's no call opcode at all—just ja (jump if above)!
Look at address 014000224E: ja 014000222F. That's a conditional jump
back to address 014000222F—which is earlier in the same function.
The compiler transformed the tail recursion into a simple goto!
The Native AOT compiler recognized the tail-recursive pattern and converted it to a jump instruction.
No function call overhead. No stack frame allocation. Just a jmp.
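In C# terms, the disassembly above is equivalent to this hand-written loop—a sketch of what the compiler effectively produced, not code from the benchmark:

```csharp
// The TCO'd method behaves like a plain while loop over the index.
public static void DfsAsLoop(NodeLinear[] nodes, int index, bool[] visited)
{
    while ((uint)index < (uint)nodes.Length && !visited[index])
    {
        visited[index] = true;
        index = nodes[index].NextIndex; // a negative NextIndex fails the
                                        // unsigned bounds check and exits
    }
}
```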
The Old Recursion (No TCO) - Actual Disassembly
For comparison, here's the version that crashes at 44.72 million—the one with
return result + 1 after the recursive call:
; HpcDfsLinearNoTcoExtensions__DfsRecursiveLinearNoTco - Native AOT x64
0140002260: sub rsp,28h
; ... bounds check and visited check ...
014000227D: mov byte ptr [r8+rax+10h],1 ; visited[index] = true
0140002283: mov edx,[rcx+rax*8+14h] ; edx = nodes[index].NextIndex
0140002287: test edx,edx ; check if next >= 0
0140002289: jge 0140002295 ; if positive, recurse
014000228B: mov eax,1 ; base case: return 1
0140002290: add rsp,28h
0140002294: ret
0140002295: call ...DfsRecursiveLinearNoTco ; ← RECURSIVE CALL!
014000229A: inc eax ; ← result + 1 (BREAKS TCO!)
014000229C: add rsp,28h
01400022A0: ret
See the difference?
- call ...DfsRecursiveLinearNoTco at address 0140002295 — this creates a new stack frame every time
- inc eax at address 014000229A — that's the return result + 1 from the C# code
That single inc eax after the call prevents tail call optimization because the
function still has work to do after the recursive call returns. The compiler must
keep every stack frame alive so execution can come back and perform the increment.
The Verdict
| Method | Key Instruction | What It Means |
|---|---|---|
| DfsRecursiveLinear | ja 014000222F (jump back) | Goto/jump—no stack frames! |
| DfsRecursiveLinearNoTco | call + inc eax | Recursive call—48 bytes per frame |
What Triggers the Hatch?
Native AOT recognized the tail call pattern and compiled it as a jmp instruction.
No stack frames accumulate. The recursive call becomes a simple goto at the machine code level.
What triggered Tail Call Recursion Optimization:
public static void DfsRecursiveLinear(
    this NodeLinear[] nodes,
    int index,
    bool[] visited)
{
    if ((uint)index >= (uint)nodes.Length || visited[index])
        return;
    visited[index] = true;
    int next = nodes[index].NextIndex;
    if (next >= 0)
        nodes.DfsRecursiveLinear(next, visited); // ← TAIL POSITION!
}
The key: the recursive call is the last thing the function does. There's no work
after it—no return result + 1, no cleanup, nothing. This is called tail position,
and it allows the compiler to replace the call with a jmp.
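To make "tail position" concrete, here are a few hypothetical examples (not from the benchmark) of what keeps it and what breaks it. Whether the compiler actually emits a jump is an implementation detail, so verify with a disassembler as shown above:

```csharp
// Tail position: the recursive call is the very last action.
static void Walk(int i)
{
    if (i > 0)
        Walk(i - 1); // nothing runs after this → eligible for a jump
}

// NOT tail position: the + 1 runs after the call returns.
static int Count(int i) => i > 0 ? Count(i - 1) + 1 : 0;

// NOT tail position: the finally block must run after the call returns,
// so the stack frame has to stay alive.
static void Guarded(int i)
{
    try
    {
        if (i > 0)
            Guarded(i - 1);
    }
    finally { /* cleanup pins the frame */ }
}
```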
What this means:
- Zero stack frame overhead — each "recursion" reuses the same frame
- Zero heap allocation — unlike Stack<T>, which allocates on the heap
- No GC pressure — nothing to clean up
- Identical performance to hand-written iteration — both become a goto at the machine level
JIT Mode Also Triggers TCO
I tested JIT mode (using dotnet run instead of Native AOT) and JIT also triggers TCO—reaching
2 billion depth successfully. Both compilation modes recognize the tail call pattern.
Speed: TCO Recursive vs Stack<T> Iterative
| Depth | AOT TCO Recursive | JIT TCO Recursive | AOT Iterative | JIT Iterative |
|---|---|---|---|---|
| 30,000,000 | 34ms | 272ms | 79ms | 43ms |
| 100,000,000 | 115ms | 376ms | 258ms | 160ms |
| 1,000,000,000 | 1,183ms | 1,423ms | 2,650ms | 1,410ms |
| 2,000,000,000 | 2,378ms | 2,617ms | 5,387ms | 2,791ms |
Traversal Overhead: TCO Recursive vs Stack<T> Iterative (Linear Graph)
| Approach | Traversal Memory | Where |
|---|---|---|
| TCO Recursive | 0 bytes | Call stack (reused frame) |
| Stack<T> Iterative | 4 bytes | Heap (1 element max for linear graph) |
Note: Both approaches require the same ~17GB for data structures (NodeLinear[] + bool[]) at 2B depth. The difference is the traversal overhead—TCO runs entirely on the call stack with 0 heap allocation.
Max Depth Comparison
| Approach | Max Depth | Time at Max |
|---|---|---|
| AOT TCO Recursive | 2,147,483,591 | 2,560ms |
| JIT TCO Recursive | 2,147,483,591 | 2,777ms |
| AOT Iterative | 2,147,483,591 | 5,708ms |
| JIT Iterative | 2,147,483,591 | 2,992ms |
All approaches hit the same max depth—limited by RAM for data structures, not stack space.
At the largest depths, TCO recursive is the fastest in both JIT and AOT modes, with AOT being the champion. The traversal runs entirely on the call stack, reusing the same frame 2 billion times with 0 heap allocation.
Why Is Iterative Slower?
The Stack<T> approach feels like the "right" way to do things, but it has hidden costs
that accumulate with every step:
- Method call overhead — every step requires Push(), Pop(), and Count property access
- Bounds checking — Stack<T> validates every operation to ensure you aren't accessing invalid memory
- Indirection — you're following pointers to the heap rather than using direct register access
The TCO recursive version bypasses all of this. It compiles down to essentially a goto.
No method calls. No bounds checks. Just CPU registers and a jump instruction.
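For reference, here's a minimal sketch of the Stack<T>-based traversal being described. The benchmark's exact iterative code isn't reproduced in this article, so the details are assumptions:

```csharp
using System.Collections.Generic;

public static int DfsIterative(
    this NodeLinear[] nodes,
    int start,
    bool[] visited)
{
    var stack = new Stack<int>(); // heap allocation before traversal even starts
    stack.Push(start);
    int count = 0;

    while (stack.Count > 0)       // property access on every step
    {
        int index = stack.Pop();  // method call on every step
        if ((uint)index >= (uint)nodes.Length || visited[index])
            continue;
        visited[index] = true;
        count++;

        int next = nodes[index].NextIndex;
        if (next >= 0)
            stack.Push(next);     // method call, plus a potential resize
    }

    return count;
}
```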
Wait... How Is This Possible?
Let's do the napkin math. With a ~2GB stack and the smallest possible stack frame, how deep should we be able to go?
Even if we assume a highly optimized stack frame, the physical limits of memory should stop us long before we hit 2 billion.
| Frame Size | What fits? | Theoretical Max Depth (2GB Stack) |
|---|---|---|
| 48 bytes | Standard C# Frame | ~44,739,242 |
| 32 bytes | Optimized Frame | ~67,108,864 |
| 16 bytes | Minimal (Ret + Pointer) | ~134,217,728 |
| 0 bytes | TCO (Tail Call) | Infinite (CPU limited) |
We hit 2 billion. That's 15x to 45x deeper than the mathematical maximum for a standard recursive function.
Something else is happening here. The compiler didn't just optimize the stack usage; it fundamentally changed the physics of the execution.
🏆 Winner: Recursion
1. The Champion 👑
AOT (Ahead-of-Time compiled) TCO (Tail Call Optimized) Recursive DFS didn't just stand toe-to-toe against the iterative approach—it consistently outranked the other approaches and stayed on top. I am pleasantly surprised! Who would've thought that TCO suddenly comes out of nowhere to save the day?!
- ⚡ Speed: Fastest overall — faster than Stack<T> iteration
- 🧠 Memory: 0 bytes of traversal allocation
- 🕳️ Depth: 2,147,483,591 recursive calls
We didn't stop because we ran out of stack space; we stopped because we hit the maximum size of a C# Array.
2. The Fourth Metric: Expressiveness & Abstraction ✨
The truth is, both for loops and recursion are language constructs that compile down to GOTO opcodes.
Neither is inherently "better" than the other. They are simply tools that allow us to write code more naturally.
We use them because they are easier to read and safer to write.
Recursion is the native language of Functional Programming.
It describes the problem definition itself: "Visit this node, then visit its neighbors."
The Iterative approach is just a manual simulation.
Most high-level languages force you into this corner. When we write an iterative DFS, we are essentially saying:
"I don't trust the compiler to manage the Call Stack, so I will manually manage a Stack<T> object on the Heap."
It is mechanical, verbose, and disconnected from the algorithm's intent.
C# allows you to be an assassin with recursion.
While other languages trap you in safety nets, C# gives you the keys to the castle.
It allows you to strip away the overhead and activate tail call recursion optimization in hardcore mode.
And the results speak for themselves. In our benchmark, this wasn't just "comparable" to iteration—it was the fastest of all 4 approaches.
The Definition of Control:
Real control isn't about manually managing memory or writing complex loops. Real control comes from understanding the language and hardware limits.
Once you understand the physics, you can optimize the system to do exactly what you want—faster, cleaner, and more efficiently than the "safe" default.
3. The Fifth Metric: Flexibility 🔧
This is the most critical takeaway. There is a myth that "High Performance" means "Low Level" or "Complex Code."
- The Reality: Deep knowledge of the hardware (Bare Metal Physics) doesn't restrict you; it gives you flexibility.
- The Freedom: Because we understood exactly how the Stack works and exactly how the CPU handles Tail Calls, we gained the freedom to use a "dangerous" tool (Recursion) safely.
- The Result: We didn't have to choose between "Clean Code" and "Fast Code." We got both because we respected the physics of the machine.
Most developers are unaware of these limits, so they blindly follow the rule: "Always use Iteration." When you know the physics, you can break the rules.
4. The Real-World "Hidden Killer" 💀
In our clean linear benchmark, the Stack<T> stayed small, so the Garbage Collector remained mostly quiet.
But in a messy, real-world production system with branching graphs and complex state, the Iterative approach is a ticking time bomb.
- 🧩 Heap Fragmentation: As Stack<T> resizes, it demands larger and larger contiguous blocks of memory. On a long-running server with a fragmented heap, this leads to OutOfMemory exceptions even when you theoretically have free RAM.
- 🗑️ LOH Issues: Once your stack's backing array exceeds 85KB, it moves to the Large Object Heap (LOH). The LOH is rarely compacted, leading to permanent fragmentation (see the sketch after this list).
- 📈 Latency Spikes: Every expansion triggers an allocation. Every drop triggers cleanup. In latency-sensitive systems, this jitter kills your p99 performance.
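To put the LOH point in perspective, a quick back-of-the-envelope sketch (assuming the documented ~85,000-byte threshold):

```csharp
// Stack<int>'s backing int[] crosses the LOH threshold once it holds
// more than 85,000 / sizeof(int) = 21,250 pending elements.
const int LohThresholdBytes = 85_000;
int elementsBeforeLoh = LohThresholdBytes / sizeof(int); // 21,250
Console.WriteLine($"Stack<int> lands on the LOH past ~{elementsBeforeLoh:N0} elements");
```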
✅ The TCO Advantage: Because TCO uses the Thread Stack (LIFO) and CPU registers:
- Pre-allocated, contiguous memory—no fragmentation possible
- No heap interaction during traversal
- No GC involvement—zero pauses
- LIFO by design—the natural home for recursion
5. Undocumented and Defies Conventional Wisdom 📚
I looked for this as a language feature in the official C# Language Specification (ECMA-334). I checked the major reference books.
There is nothing.
This behavior is completely undocumented in official Microsoft sources. If you rely on the standard literature, this capability effectively does not exist.
Additionally, if you ask anyone about this scenario, they will insist that the iterative approach is safer, won't overflow, and runs faster. Even if you ask any AI about this problem today, it will immediately generate the iterative code and tell you it is the superior solution.
But they are all dead wrong.
Recursion is the superior solution. If you know what you're doing, you really don't have to choose—you can get speed, memory efficiency, and expressiveness all at once.
Do NOT trust conventional wisdom when you work with recursion—at least not while you're using HPC C#! 😉
6. Final Verdict 🎯
You don't have to pick iterative all the time anymore. C# Native AOT gives you much more flexibility here. When you understand the machine and its restrictions, you can ironically gain a higher degree of freedom.
Test Environment
| Component | Spec |
|---|---|
| OS | Windows 11 |
| Runtime | .NET 10 Native AOT |
| CPU | AMD Ryzen 9 9950X3D 16-Core Processor |
| RAM | 192 GB DDR5 |
| Stack Size | ~2GB (int.MaxValue thread stack allocation) |
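For reference, a minimal sketch of how a stack that large can be requested. The article's exact harness isn't shown; RunBenchmark is a hypothetical placeholder, and maxStackSize is the standard Thread constructor parameter:

```csharp
using System.Threading;

// Run the traversal on a dedicated thread with the largest stack the OS
// and runtime will grant. RunBenchmark stands in for the test body.
var thread = new Thread(RunBenchmark, maxStackSize: int.MaxValue);
thread.Start();
thread.Join();
```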
Closing Thoughts
Tail Call Optimization isn't new. It has been a fundamental concept since the dawn of Lisp in the 1960s. It didn't "suddenly appear"—it was always there, sitting quietly in the compiler.
But HPC techniques and the new memory features in C# open up unexplored territory here. We often forget that despite being a high-level managed language, C# has deep roots in C. It possesses a "bare metal" capability that other managed languages simply cannot touch. By respecting that heritage and writing code that the hardware understands, I was able to unlock a capability that most C# developers don't even know exists.
This experiment didn't just teach me something new; it validated something I have always suspected. I've always questioned the industry dogma that you must always convert recursion to iteration (and I'm pretty sure others have as well). It never sat right with me. With High-Performance C#, I was finally able to put that skepticism to the test.
It's amazing how this experiment is now changing my mental model of what is possible and what isn't—and debunks the myth that you should always convert recursive calls to iterative ones. Not only is this more verbose and less expressive, but it could also cost you performance! 🤯
It brings me back to over a decade ago at UC Irvine, sitting in Professor Cristina Videira Lopes' course. In her book Exercises in Programming Style, there is a specific chapter on this—Chapter 9: "The Kick Forward" (Continuation-Passing Style).
I remember obsessing over this problem, drawing out the function stacks and diagrams on paper, trying to visualize where the memory was going. I spent hours and days trying to get it to work in C#, staring at what looked like such a simple, elegant function call. But it would always crash. I eventually gave up and switched to F# just to get the recursion to compile without overflowing.
I thought C# simply couldn't do it. I thought it was a language limitation.
Little did I know, 10 years later, I would trigger that exact optimization in C# to push beyond the boundaries of what I thought possible. To go down Alice's bottomless rabbit hole... into hyper-dimensional space.
Stay tuned. :)