
Recursive DFS III: Hyper-Dimensional Depth

From 2.1 billion to 1 TRILLION recursive calls. We stripped away everything—node arrays, int limits, bool arrays—and hit bare metal. Welcome to Hyper-Dimensional Depth.

By Robert Long | January 19, 2026

In Part 2, we hit 2.1 billion recursive calls using Tail Call Optimization. We thought that was the limit. It wasn't. The rabbit hole goes much deeper. Welcome to Hyper-Dimensional Depth.


Where We Left Off

At the end of Part 2, we had this algorithm hitting 2.1 billion recursive calls:

// Part 2: 2.1 Billion (limited by array size)
public static void DfsRecursiveLinear(
    this NodeLinear[] nodes,
    int index,
    bool[] visited)
{
    // One unsigned compare handles both index < 0 and index >= Length
    if ((uint)index >= (uint)nodes.Length || visited[index])
        return;

    visited[index] = true;

    int next = nodes[index].NextIndex;
    if (next >= 0)
        nodes.DfsRecursiveLinear(next, visited); // tail position - TCO applies
}

We stopped at 2.1 billion because of three bottlenecks:

  • NodeLinear[] - C# arrays cap at ~2.1B elements
  • bool[] - Another 2.1B element array
  • int indices - Max value of ~2.1B

But here's the thing: TCO itself has no limit. It's just a jmp instruction. The only thing stopping us was our data structures. So we stripped them away.
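
To see what that means concretely: once the JIT turns the tail call into a jump, the Part 2 method above behaves exactly like this loop. This is a conceptual C# equivalent for illustration, not actual JIT output:

// Conceptual equivalent of the TCO'd Part 2 method: the "recursive call"
// becomes a jump back to the top, so no stack frame is ever added.
public static void DfsRecursiveLinearAsLoop(
    NodeLinear[] nodes, int index, bool[] visited)
{
    while ((uint)index < (uint)nodes.Length && !visited[index])
    {
        visited[index] = true;
        int next = nodes[index].NextIndex;
        if (next < 0)
            return;
        index = next; // this assignment is the jmp
    }
}

Same semantics, same machine-code shape: constant stack usage, no matter the depth.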


The Stripping: Piece by Piece 🔪

1. Kill the Node Array

For a linear graph, we don't need a node array at all. The next node is always index + 1. That's 8 bytes per node × 2.1 billion = ~17 GB of RAM eliminated.

// Before: nodes[index].NextIndex
// After:  index + 1
// Node array? GONE.

2. Upgrade to long Indices

int caps at 2.1 billion. long caps at 9.2 quintillion. That's 9,223,372,036,854,775,807. We're not hitting that ceiling.
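
The two ceilings, side by side:

// The old limit vs. the new one:
Console.WriteLine(int.MaxValue);  // 2147483647           (~2.1 billion)
Console.WriteLine(long.MaxValue); // 9223372036854775807  (~9.2 quintillion)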

3. ChunkedBitArray Instead of bool[]

bool[] uses 1 byte per element. That's wasteful—we only need 1 bit. But C# arrays cap at 2.1B elements. Solution: chunk it.

// ChunkedBitArray: 1 bit per node, chunked to bypass the ~2.1B array limit
public class ChunkedBitArray
{
    private const long BitsPerChunk = 1L << 30; // ~1.07 billion bits per chunk
    private readonly ulong[][] _chunks;

    public ChunkedBitArray(long bitCount)
    {
        _chunks = new ulong[(bitCount + BitsPerChunk - 1) / BitsPerChunk][];
        for (long i = 0; i < _chunks.Length; i++)
            _chunks[i] = new ulong[BitsPerChunk / 64]; // 128 MiB per chunk
    }

    // Get/Set use bit manipulation - O(1)
    public bool Get(long index) =>
        (_chunks[index / BitsPerChunk][(index % BitsPerChunk) / 64] & (1UL << (int)(index & 63))) != 0;

    public void Set(long index) =>
        _chunks[index / BitsPerChunk][(index % BitsPerChunk) / 64] |= 1UL << (int)(index & 63);
}
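
The design choice: each chunk covers 2^30 bits (a 16.8M-element ulong[], 128 MiB), so a trillion-bit array becomes ~932 modest allocations instead of one impossible one. No single array ever approaches the element cap.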

Memory savings:

  • bool[] at 1 trillion: ~1 TB, and impossible anyway (a single array caps at ~2.1B elements)
  • ChunkedBitArray at 1 trillion: ~119 GB ✅
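
The arithmetic behind those two numbers, as a quick sanity check (decimal vs. binary units plus chunk rounding account for the small gap to the measured ~119 GB):

long nodes = 1_000_000_000_000;  // 1 trillion
long boolBytes = nodes;          // bool[]: 1 byte per element
long bitBytes  = nodes / 8;      // bit array: 1 bit per element
Console.WriteLine($"bool[]: {boolBytes / 1e12:F1} TB");       // 1.0 TB
Console.WriteLine($"bits:   {bitBytes / (1L << 30):N0} GiB"); // 116 GiB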

The Final Algorithm: Bare Metal 🔩

// Hyper-Dimensional DFS: 1 TRILLION+ recursive calls
// (assumes: using System.Runtime.CompilerServices;)
[MethodImpl(MethodImplOptions.AggressiveOptimization)]
public static void DfsLinearLong(
    long index,
    long maxDepth,
    ChunkedBitArray visited)
{
    if (index >= maxDepth || visited.Get(index))
        return;

    visited.Set(index);

    // Linear graph: next = index + 1
    // This is in tail position - TCO kicks in
    DfsLinearLong(index + 1, maxDepth, visited);
}

That's it. No node array. No array bounds checks. No allocations in the hot path. Just pure recursion hitting the metal.
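
For completeness, here's how you'd wire it up; a minimal driver sketch, assuming DfsLinearLong and ChunkedBitArray from above are in scope. The oversized thread stack mirrors the test environment below and is pure insurance in case the JIT declines a tail call:

using System.Threading;

const long maxDepth = 3_000_000_000;         // the 3 Billion run
var visited = new ChunkedBitArray(maxDepth); // 3e9 bits ≈ 357 MB

// Dedicated thread with an int.MaxValue stack - with TCO working,
// the stack never actually grows.
var worker = new Thread(() => DfsLinearLong(0, maxDepth, visited),
                        maxStackSize: int.MaxValue);
worker.Start();
worker.Join();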


🚀 The Results: Hyper-Dimensional Depth

Depth            Time       Ops/sec   Memory
🎯 3 Billion      2.8 s      1.06 B    357 MB
🔥 5 Billion      4.6 s      1.08 B    596 MB
🚀 10 Billion     8.7 s      1.15 B    1.2 GB
🤯 16 Billion     13.6 s     1.18 B    1.9 GB
💥 50 Billion     45.9 s     1.09 B    5.9 GB
🌟 100 Billion    89.8 s     1.11 B    11.9 GB
🌌 500 Billion    7.6 min    1.09 B    59.6 GB
🏆 1 TRILLION     15.2 min   1.10 B    119 GB

The Key Insights

1. TCO Has No Depth Limit

Tail Call Optimization transforms recursion into a jmp instruction. It doesn't accumulate stack frames. The only limits are:

  • RAM for your data structures
  • Time (patience)

2. Data Structures Are the Bottleneck

We didn't hit 2.1 billion because of recursion limits—we hit it because of int and array limits. Strip those away, and recursion goes to infinity.

3. BitArrays Are 8x More Efficient

bool[] wastes 7 bits per element. A proper bit array uses 1 bit. At scale, that's the difference between "possible" and "impossible."

4. Linear Performance

Every run maintained ~1.1 billion operations per second. The performance is linear. Double the depth, double the time. No degradation, no surprises.
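
That throughput also lets you sanity-check the table's big finish:

// At ~1.1 billion calls/sec, 1 trillion calls should take about
// 1e12 / 1.1e9 ≈ 909 s ≈ 15.2 minutes - exactly what the table shows.
long depth = 1_000_000_000_000;
double opsPerSec = 1.1e9;
Console.WriteLine(TimeSpan.FromSeconds(depth / opsPerSec)); // ~00:15:09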


But Wait—Isn't Using the Heap "Cheating"? 🤔

You might be thinking: "Hold on—you're using a ChunkedBitArray on the heap. Doesn't that violate the rules of recursion vs iteration?"

No. In fact, this exposes the biggest lie in the "Recursion vs. Iteration" debate.

The "Iterative Stack" is ALSO on the Heap!

This is the most common misconception. When people say "Use Stack<T> to avoid Stack Overflow," they are effectively saying: "Move the data from the Thread Stack (limited to ~1MB) to the Heap (limited by RAM)."

Approach              Control Flow                   State Storage
Standard Recursive    Thread Stack (crashes ~10K)    Heap (visited array)
Iterative Stack<T>    Heap (Stack<T>)                Heap (visited array)
TCO Recursive         CPU registers (jmp)            Heap (BitArray)

The comparison is fair: Heap vs. Heap.

  • The Iterative approach uses Heap memory to store the Stack<T>.
  • My approach uses Heap memory to store the BitArray.

Why TCO Still Wins

The iterative approach is double-dipping on the Heap:

  • Stack<T> for the traversal stack (pointers to nodes)
  • Plus memory for the visited tracking

The Stack<T> class internally uses an array T[]. In .NET, arrays cannot have more than ~2.1 Billion elements—even if you have 2TB of RAM. That's a software limit, not a hardware one.
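
To make the double-dipping concrete, here's a sketch of the iterative version (a hypothetical illustration; on this linear graph the stack stays tiny, but on a branching graph the Stack<T> is exactly what slams into the ~2.1B element wall):

using System.Collections.Generic;

public static void DfsIterative(long maxDepth, ChunkedBitArray visited)
{
    var stack = new Stack<long>();   // heap structure #1: backed by a T[]
    stack.Push(0);

    while (stack.Count > 0)
    {
        long index = stack.Pop();
        if (index >= maxDepth || visited.Get(index))
            continue;

        visited.Set(index);          // heap structure #2: the visited bits
        stack.Push(index + 1);       // a branching graph would push every
                                     // neighbor here - hence the 2.1B cap
    }
}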

My TCO approach:

  • Control Flow: Zero memory (just a JMP instruction)
  • State: The ChunkedBitArray (Heap)

I bypassed the T[] array limit entirely by chunking. I removed the "middle man" (the Stack object). The result? 1 Trillion+ while iterative is stuck at 2.1 Billion.


Test Environment

OS:          Windows 11
Runtime:     .NET 10 Native AOT
CPU:         AMD Ryzen 9 9950X3D 16-Core Processor
RAM:         192 GB DDR5
Stack Size:  ~2 GB (int.MaxValue thread stack allocation)

Closing Thoughts

The funny part? This result felt entirely unexpected, yet the irony is that tail call optimization has been around since the earliest days of programming languages. It is not a new idea at all.

But it's amazing how it is reshaping my mental model of what is and isn't possible, and how it debunks the myth that you should always convert recursive calls to iterative ones.

Stay tuned. :)

Keywords
DFS, depth-first search, recursion, TCO, tail call optimization, Native AOT, trillion, BitArray, ChunkedBitArray, hyper-dimensional, bare metal

Related Articles

Part 1
HPC DFS Recursive

Where it all began—30 million recursive calls using HPC techniques.

Part 2
HPC DFS Recursive II

2.1 billion recursive calls with Tail Call Optimization.