In Part 1, we hit 30 million recursive calls
using HPC techniques—struct-based nodes, integer indices, and bool[] for visited tracking.
I thought that was the limit. Then I started stripping away more: the List<int>,
the generic type parameter, the foreach loop. Eventually, a single change sent us
plunging to a depth of 2 billion—and made the traversal 24x faster.
The Problem with Our Previous Implementation
Here's the HPC implementation from Part 1 that achieved 30 million depth:
// HPC DFS Recursive v1: 30 million depth (from Part 1)
public struct Node<T>
{
    public T Value;
    public List<int> Neighbors; // ← Hidden allocation
}

public static void Dfs(
    this Node<int>[] nodes,
    int currentIndex,
    bool[] visited)
{
    if (currentIndex < 0 || currentIndex >= nodes.Length)
        return;
    if (visited[currentIndex])
        return;
    visited[currentIndex] = true;
    foreach (var neighborIndex in nodes[currentIndex].Neighbors) // ← Enumerator overhead
        nodes.Dfs(neighborIndex, visited);
}
I optimized it to this:
// 8 bytes per node (two 4-byte ints)
public struct NodeLinear
{
    public int Value;
    public int NextIndex; // -1 = no neighbor
}

public static int DfsRecursiveLinear(
    this NodeLinear[] nodes,
    int index,
    bool[] visited)
{
    // Unsigned comparison trick: handles negative AND overflow in one check
    if ((uint)index >= (uint)nodes.Length || visited[index])
        return 0;
    visited[index] = true;
    int next = nodes[index].NextIndex;
    if (next >= 0)
        return nodes.DfsRecursiveLinear(next, visited) + 1; // ← count visited nodes
    return 1;
}
What We Stripped: Before vs After
| Component | Part 1 (30M) | Part 2 (2B) | Impact |
|---|---|---|---|
| Node type | Node<T> (generic) | NodeLinear (concrete) | No JIT specialization overhead |
| Neighbors | List<int> (heap reference) | int NextIndex (inline) | No heap pointer, no indirection |
| Node size | ~24+ bytes | 8 bytes | 3x smaller, better cache |
| Iteration | foreach (enumerator) | if (next >= 0) | No MoveNext/Current calls |
| Bounds check | i < 0 \|\| i >= len | (uint)i >= (uint)len | Single comparison vs two |
The Unsigned Comparison Trick (AI-suggested)
(uint)index >= (uint)nodes.Length is a common performance optimization.
If index is negative, casting to uint makes it a very large positive number,
which will always be greater than the array length. This eliminates a separate index < 0 check.
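Here's a quick sketch of the equivalence, using hypothetical values just to illustrate:

```csharp
// Casting -1 to uint yields 4,294,967,295, which always exceeds any array
// length, so one unsigned comparison covers both failure cases.
int index = -1;
int length = 10;

bool twoChecks = index < 0 || index >= length; // two comparisons, two branches
bool oneCheck = (uint)index >= (uint)length;   // one comparison, one branch

Console.WriteLine(twoChecks == oneCheck); // True for every int value of index
```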
Triggering the Hatch
| Depth | Time | Call Stack Used | Result |
|---|---|---|---|
| 30,000,000 | 825ms | 1.4GB | SUCCESS |
| 40,000,000 | 1,100ms | 1.8GB | SUCCESS |
| 44,720,000 | 1,238ms | 2.0GB | MAX (stack limit) |
| 45,000,000 | — | >2GB | STACK OVERFLOW |
The crash at 44.72 million confirmed our frame size was ~48 bytes (2,147,483,648 bytes ÷ 44,720,000 frames ≈ 48 bytes per frame). We had fully saturated the 2GB stack. This should have been the end of the experiment.
But I tried one more thing. I removed the return value to see if it saved a few bytes:
public static void DfsRecursiveLinear(
    this NodeLinear[] nodes,
    int index,
    bool[] visited)
{
    if ((uint)index >= (uint)nodes.Length || visited[index])
        return;
    visited[index] = true;
    int next = nodes[index].NextIndex;
    if (next >= 0)
        nodes.DfsRecursiveLinear(next, visited); // ← just recurse, nothing after
}
Suddenly, the hatch released!! It started free-falling, way faster than it did before.
I ran it again at 60 million? Deeper...
Again at 100 million? Deeper...
Again at 500 million??? DEEPER...
Again at 1 BILLION?!?! DEEEEEEPER!!!!!
What the hell is happening?!
The Free Fall
| Depth | New Recursion | Old Recursion | Details |
|---|---|---|---|
| 10,000 | 0ms | 0ms | — |
| 100,000 | 0ms | 2ms | — |
| 1,000,000 | 1ms | 27ms | New 27x faster |
| 10,000,000 | 11ms | 276ms | New 25x faster |
| 20,000,000 | 24ms | 555ms | New 23x faster |
| 30,000,000 | 34ms | 828ms | New 24x faster |
| 40,000,000 | 46ms | 1,098ms | New 24x faster |
| 44,720,000 | 51ms | Stack Overflow | — |
| 50,000,000 | 57ms | — | — |
| 100,000,000 | 115ms | — | — |
| 500,000,000 | 576ms | — | — |
| 1,000,000,000 | 1,183ms | — | — |
| 2,000,000,000 | 2,378ms | — | — |
| 2,147,483,591 | 2,560ms | — | New max depth |
2 BILLION
recursive calls in 2.5s
That's unbelievable! How the hell is that possible? Not only did it blow past the old 44.72 million ceiling, it was 24x faster! I ran it over and over... and over. It was definitely legit. I was completely baffled!
After some thought, it had to be the CLR itself doing something. It found a pattern in my code.
Could this be the .tail IL instruction I mentioned in Part 1? Was it somehow activated???
It can't be—this is C#, not a functional language!
Could this be tail call optimization happening in C#?! An imperative language?!
The C# compiler doesn't emit .tail—that's an F# thing.
This is very strange...
Disassembling the Binary
If tail call optimization was happening, there would be evidence in the Native AOT compiled binary.
Since we're using dotnet publish with AOT, the C# code gets compiled directly to native
machine code—no JIT at runtime. The smoking gun would be a jmp instruction instead of
a call instruction at the recursive call site.
The Tool: dumpbin
I used dumpbin.exe from Visual Studio to disassemble the Native AOT binary.
This tool comes with Visual Studio and can be found at:
C:\Program Files\Microsoft Visual Studio\...\VC\Tools\MSVC\...\bin\Hostx64\x64\dumpbin.exe
The command I ran:
dumpbin /disasm LargeStackBenchmark.exe | grep -A 50 "DfsRecursiveLinear"
The New Recursion (TCO Version) - Actual Disassembly
; HpcDfsLinearExtensions__DfsRecursiveLinear - Native AOT x64
0140002220: sub rsp,28h
0140002224: mov eax,[rcx+8] ; eax = nodes.Length
0140002227: cmp eax,edx ; bounds check
0140002229: jbe 0140002250 ; if out of bounds, exit
014000222B: mov r10d,[r8+8] ; r10d = visited.Length
014000222F: cmp edx,r10d ; ← JUMP TARGET (start)
0140002232: jae 0140002255 ; bounds check on visited[]
0140002236: cmp byte ptr [r8+rdx+10h],0 ; visited[index] == false?
014000223C: jne 0140002250 ; if visited, exit
014000223E: mov byte ptr [r8+rdx+10h],1 ; visited[index] = true
0140002244: mov edx,[rcx+rdx*8+14h] ; edx = nodes[index].NextIndex
0140002248: test edx,edx ; check if next >= 0
014000224A: jl 0140002250 ; if negative, exit
014000224C: cmp eax,edx
014000224E: ja 014000222F ; ← JUMP BACK to start!
0140002250: add rsp,28h
0140002254: ret
BINGO!!! There's no call opcode at all—just ja (jump if above)!
Look at address 014000224E: ja 014000222F. That's a conditional jump
back to address 014000222F—which is earlier in the same function.
The compiler transformed the tail recursion into a simple goto!
The Native AOT compiler recognized the tail-recursive pattern and converted it to a jump instruction.
No function call overhead. No stack frame allocation. Just a jmp.
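In C# terms, the disassembly above is equivalent to this hand-written loop—a sketch of what the compiler effectively produced, not code from the benchmark:

```csharp
// The TCO'd method behaves like a plain while loop over the index.
public static void DfsAsLoop(NodeLinear[] nodes, int index, bool[] visited)
{
    while ((uint)index < (uint)nodes.Length && !visited[index])
    {
        visited[index] = true;
        index = nodes[index].NextIndex; // a negative NextIndex fails the
                                        // unsigned bounds check and exits
    }
}
```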
The Old Recursion (No TCO) - Actual Disassembly
For comparison, here's the version that crashes at 44.72 million—the one with
return result + 1 after the recursive call:
; HpcDfsLinearNoTcoExtensions__DfsRecursiveLinearNoTco - Native AOT x64
0140002260: sub rsp,28h
; ... bounds check and visited check ...
014000227D: mov byte ptr [r8+rax+10h],1 ; visited[index] = true
0140002283: mov edx,[rcx+rax*8+14h] ; edx = nodes[index].NextIndex
0140002287: test edx,edx ; check if next >= 0
0140002289: jge 0140002295 ; if positive, recurse
014000228B: mov eax,1 ; base case: return 1
0140002290: add rsp,28h
0140002294: ret
0140002295: call ...DfsRecursiveLinearNoTco ; ← RECURSIVE CALL!
014000229A: inc eax ; ← result + 1 (BREAKS TCO!)
014000229C: add rsp,28h
01400022A0: ret
See the difference?
- call ...DfsRecursiveLinearNoTco at address 0140002295 — this creates a new stack frame every time
- inc eax at address 014000229A — that's the return result + 1 from the C# code
That single inc eax after the call prevents tail call optimization because the
function still has work to do after the recursive call returns. The compiler must
keep every stack frame alive so execution can come back and perform the increment.
The Verdict
| Method | Key Instruction | What It Means |
|---|---|---|
| DfsRecursiveLinear | ja 014000222F (jump back) | Goto/jump—no stack frames! |
| DfsRecursiveLinearNoTco | call + inc eax | Recursive call—48 bytes per frame |
What Triggers the Hatch?
Native AOT recognized the tail call pattern and compiled it as a jmp instruction.
No stack frames accumulate. The recursive call becomes a simple goto at the machine code level.
What triggered Tail Call Recursion Optimization:
public static void DfsRecursiveLinear(
    this NodeLinear[] nodes,
    int index,
    bool[] visited)
{
    if ((uint)index >= (uint)nodes.Length || visited[index])
        return;
    visited[index] = true;
    int next = nodes[index].NextIndex;
    if (next >= 0)
        nodes.DfsRecursiveLinear(next, visited); // ← TAIL POSITION!
}
The key: the recursive call is the last thing the function does. There's no work
after it—no return result + 1, no cleanup, nothing. This is called tail position,
and it allows the compiler to replace the call with a jmp.
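To make "tail position" concrete, here are a few hypothetical examples (not from the benchmark) of what keeps it and what breaks it. Whether the compiler actually emits a jump is an implementation detail, so verify with a disassembler as shown above:

```csharp
// Tail position: the recursive call is the very last action.
static void Walk(int i)
{
    if (i > 0)
        Walk(i - 1); // nothing runs after this → eligible for a jump
}

// NOT tail position: the + 1 runs after the call returns.
static int Count(int i) => i > 0 ? Count(i - 1) + 1 : 0;

// NOT tail position: the finally block must run after the call returns,
// so the stack frame has to stay alive.
static void Guarded(int i)
{
    try
    {
        if (i > 0)
            Guarded(i - 1);
    }
    finally { /* cleanup pins the frame */ }
}
```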
What this means:
- Zero stack frame overhead — each "recursion" reuses the same frame
- Zero heap allocation — unlike Stack<T>, which allocates on the heap
- No GC pressure — nothing to clean up
- Identical performance to hand-written iteration — both become a goto at the machine level
JIT Mode Also Triggers TCO
I tested JIT mode (using dotnet run instead of Native AOT) and JIT also triggers TCO—reaching
2 billion depth successfully. Both compilation modes recognize the tail call pattern.
Speed: TCO Recursive vs Stack<T> Iterative
| Depth | AOT TCO Recursive | JIT TCO Recursive | AOT Iterative | JIT Iterative |
|---|---|---|---|---|
| 30,000,000 | 34ms | 272ms | 79ms | 43ms |
| 100,000,000 | 115ms | 376ms | 258ms | 160ms |
| 1,000,000,000 | 1,183ms | 1,423ms | 2,650ms | 1,410ms |
| 2,000,000,000 | 2,378ms | 2,617ms | 5,387ms | 2,791ms |
Traversal Overhead: TCO Recursive vs Stack<T> Iterative (Linear Graph)
| Approach | Traversal Memory | Where |
|---|---|---|
| TCO Recursive | 0 bytes | Call stack (reused frame) |
| Stack<T> Iterative | 4 bytes | Heap (1 element max for linear graph) |
Note: Both approaches require the same ~17GB for data structures (NodeLinear[] + bool[]) at 2B depth. The difference is the traversal overhead—TCO runs entirely on the call stack with 0 heap allocation.
Max Depth Comparison
| Approach | Max Depth | Time at Max |
|---|---|---|
| AOT TCO Recursive | 2,147,483,591 | 2,560ms |
| JIT TCO Recursive | 2,147,483,591 | 2,777ms |
| AOT Iterative | 2,147,483,591 | 5,708ms |
| JIT Iterative | 2,147,483,591 | 2,992ms |
All approaches hit the same max depth—limited by RAM for data structures, not stack space.
At the largest depths, TCO recursive is the fastest in both JIT and AOT modes, with AOT being the champion. The traversal runs entirely on the call stack, reusing the same frame 2 billion times with 0 heap allocation.
Why Is Iterative Slower?
The Stack<T> approach feels like the "right" way to do things, but it has hidden costs
that accumulate with every step:
- Method call overhead — every step requires Push(), Pop(), and Count property access
- Bounds checking — Stack<T> validates every operation to ensure you aren't accessing invalid memory
- Indirection — you're following pointers to the heap rather than using direct register access
The TCO recursive version bypasses all of this. It compiles down to essentially a goto.
No method calls. No bounds checks. Just CPU registers and a jump instruction.
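For reference, here's a minimal sketch of the Stack<T>-based traversal being described. The benchmark's exact iterative code isn't reproduced in this article, so the details are assumptions:

```csharp
using System.Collections.Generic;

public static int DfsIterative(
    this NodeLinear[] nodes,
    int start,
    bool[] visited)
{
    var stack = new Stack<int>(); // heap allocation before traversal even starts
    stack.Push(start);
    int count = 0;

    while (stack.Count > 0)       // property access on every step
    {
        int index = stack.Pop();  // method call on every step
        if ((uint)index >= (uint)nodes.Length || visited[index])
            continue;
        visited[index] = true;
        count++;

        int next = nodes[index].NextIndex;
        if (next >= 0)
            stack.Push(next);     // method call, plus a potential resize
    }

    return count;
}
```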
Wait... How Is This Possible?
Let's do the napkin math. With a ~2GB stack and the smallest possible stack frame, how deep should we be able to go?
Even if we assume a highly optimized stack frame, the physical limits of memory should stop us long before we hit 2 billion.
| Frame Size | What fits? | Theoretical Max Depth (2GB Stack) |
|---|---|---|
| 48 bytes | Standard C# Frame | ~44,739,242 |
| 32 bytes | Optimized Frame | ~67,108,864 |
| 16 bytes | Minimal (Ret + Pointer) | ~134,217,728 |
| 0 bytes | TCO (Tail Call) | Infinite (CPU limited) |
We hit 2 billion. That's 15x to 45x deeper than the mathematical maximum for a standard recursive function.
Something else is happening here. The compiler didn't just optimize the stack usage; it fundamentally changed the physics of the execution.
🏆 Winner: Recursion
1. The Champion 👑
AOT (Ahead-of-Time compiled) TCO (Tail Call Optimized) Recursive DFS didn't just stand toe-to-toe against the iterative approach—it consistently outranked the other approaches and stayed on top. I am pleasantly surprised! Who would've thought that TCO suddenly comes out of nowhere to save the day?!
- ⚡ Speed: Fastest overall — faster than Stack<T> iteration
- 🧠 Memory: 0 bytes of traversal allocation
- 🕳️ Depth: 2,147,483,591 recursive calls
We didn't stop because we ran out of stack space; we stopped because we hit the maximum size of a C# Array.
2. The Fourth Metric: Expressiveness & Abstraction ✨
The truth is, both for loops and recursion are language constructs that compile down to GOTO opcodes.
Neither is inherently "better" than the other. They are simply tools that allow us to write code more naturally.
We use them because they are easier to read and safer to write.
Recursion is the native language of Functional Programming.
It describes the problem definition itself: "Visit this node, then visit its neighbors."
The Iterative approach is just a manual simulation.
Most high-level languages force you into this corner. When we write an iterative DFS, we are essentially saying:
"I don't trust the compiler to manage the Call Stack, so I will manually manage a Stack<T> object on the Heap."
It is mechanical, verbose, and disconnected from the algorithm's intent.
C# allows you to be an assassin with recursion.
While other languages trap you in safety nets, C# gives you the keys to the castle.
It allows you to strip away the overhead and activate tail call recursion optimization in hardcore mode.
And the results speak for themselves. In our benchmark, this wasn't just "comparable" to iteration—it was the fastest of all 4 approaches.
The Definition of Control:
Real control isn't about manually managing memory or writing complex loops. Real control comes from understanding the language and hardware limits.
Once you understand the physics, you can optimize the system to do exactly what you want—faster, cleaner, and more efficiently than the "safe" default.
3. The Fifth Metric: Flexibility 🔧
This is the most critical takeaway. There is a myth that "High Performance" means "Low Level" or "Complex Code."
- The Reality: Deep knowledge of the hardware (Bare Metal Physics) doesn't restrict you; it gives you flexibility.
- The Freedom: Because we understood exactly how the Stack works and exactly how the CPU handles Tail Calls, we gained the freedom to use a "dangerous" tool (Recursion) safely.
- The Result: We didn't have to choose between "Clean Code" and "Fast Code." We got both because we respected the physics of the machine.
Most developers are unaware of these limits, so they blindly follow the rule: "Always use Iteration." When you know the physics, you can break the rules.
4. The Real-World "Hidden Killer" 💀
In our clean linear benchmark, the Stack<T> stayed small, so the Garbage Collector remained mostly quiet.
But in a messy, real-world production system with branching graphs and complex state, the Iterative approach is a ticking time bomb.
- 🧩 Heap Fragmentation: As Stack<T> resizes, it demands larger and larger contiguous blocks of memory. On a long-running server with a fragmented heap, this leads to OutOfMemory exceptions even when you theoretically have free RAM.
- 🗑️ LOH Issues: Once your stack's backing array exceeds 85KB, it moves to the Large Object Heap (LOH). The LOH is rarely compacted, leading to permanent fragmentation (see the sketch after this list).
- 📈 Latency Spikes: Every expansion triggers an allocation. Every drop triggers cleanup. In latency-sensitive systems, this jitter kills your p99 performance.
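To put the LOH point in perspective, a quick back-of-the-envelope sketch (assuming the documented ~85,000-byte threshold):

```csharp
// Stack<int>'s backing int[] crosses the LOH threshold once it holds
// more than 85,000 / sizeof(int) = 21,250 pending elements.
const int LohThresholdBytes = 85_000;
int elementsBeforeLoh = LohThresholdBytes / sizeof(int); // 21,250
Console.WriteLine($"Stack<int> lands on the LOH past ~{elementsBeforeLoh:N0} elements");
```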
✅ The TCO Advantage: Because TCO uses the Thread Stack (LIFO) and CPU registers:
- Pre-allocated, contiguous memory—no fragmentation possible
- No heap interaction during traversal
- No GC involvement—zero pauses
- LIFO by design—the natural home for recursion
5. Undocumented and Defies Conventional Wisdom 📚
I looked for this as a language feature in the official C# Language Specification (ECMA-334). I checked the major reference books.
There is nothing.
This behavior is completely undocumented in official Microsoft sources. If you rely on the standard literature, this capability effectively does not exist.
Additionally, if you ask anyone about this scenario, they will insist that the iterative approach is safer, won't overflow, and runs faster. Even if you ask any AI about this problem today, it will immediately generate the iterative code and tell you it is the superior solution.
But they are all dead wrong.
Recursion is the superior solution. If you know what you're doing, you really don't have to choose—you can get speed, memory efficiency, and expressiveness all at once.
Do NOT trust conventional wisdom when you work with recursion—at least not while you're using HPC C#! 😉
6. Final Verdict 🎯
You don't have to pick iterative all the time anymore. C# Native AOT gives you much more flexibility here. When you understand the machine and its restrictions, you can ironically gain a higher degree of freedom.
Test Environment
| Component | Spec |
|---|---|
| OS | Windows 11 |
| Runtime | .NET 10 Native AOT |
| CPU | AMD Ryzen 9 9950X3D 16-Core Processor |
| RAM | 192 GB DDR5 |
| Stack Size | ~2GB (int.MaxValue thread stack allocation) |
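For reference, a minimal sketch of how a stack that large can be requested. The article's exact harness isn't shown; RunBenchmark is a hypothetical placeholder, and maxStackSize is the standard Thread constructor parameter:

```csharp
using System.Threading;

// Run the traversal on a dedicated thread with the largest stack the OS
// and runtime will grant. RunBenchmark stands in for the test body.
var thread = new Thread(RunBenchmark, maxStackSize: int.MaxValue);
thread.Start();
thread.Join();
```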
Closing Thoughts
Tail Call Optimization isn't new. It has been a fundamental concept since the dawn of Lisp in the 1960s. It didn't "suddenly appear"—it was always there, sitting quietly in the compiler.
But HPC techniques and the new memory features in C# open up unexplored territory here. We often forget that despite being a high-level managed language, C# has deep roots in C. It possesses a "bare metal" capability that other managed languages simply cannot touch. By respecting that heritage and writing code that the hardware understands, I was able to unlock a capability that most C# developers don't even know exists.
This experiment didn't just teach me something new; it validated something I have always suspected. I've always questioned the industry dogma that you must always convert recursion to iteration (and I'm pretty sure others have as well). It never sat right with me. With High-Performance C#, I was finally able to put that skepticism to the test.
It's amazing how this experiment is now changing my mental model of what is possible and what isn't—and debunks the myth that you should always convert recursive calls to iterative ones. Not only is this more verbose and less expressive, but it could also cost you performance! 🤯
It brings me back to over a decade ago at UC Irvine, sitting in Professor Cristina Videira Lopes' course. In her book Exercises in Programming Style, there is a specific chapter on this—Chapter 9: "The Kick Forward" (Continuation-Passing Style).
I remember obsessing over this problem, drawing out the function stacks and diagrams on paper, trying to visualize where the memory was going. I spent hours and days trying to get it to work in C#, staring at what looked like such a simple, elegant function call. But it would always crash. I eventually gave up and switched to F# just to get the recursion to compile without overflowing.
I thought C# simply couldn't do it. I thought it was a language limitation.
Little did I know, 10 years later, I would trigger that exact optimization in C# to push beyond the boundaries of what I thought possible. To go down Alice's bottomless rabbit hole... into hyper-dimensional space.
Stay tuned. :)