#### Benchmarking Fib with Go

##### August 22, 2018
go fibonacci benchmark

Go has a really nice testing suite built into its toolchain. It's as simple as running go test in your package, and all files ending in *_test.go will be executed. We can also run benchmarks during testing by adding the -bench flag to the go test command. I had never really benchmarked anything before, so I decided to explore Go's benchmarking utilities and flex some of my algorithm skills in the process. Naturally, I decided to implement the Fibonacci sequence many, many times over.

The repository lives at https://github.com/TerrenceHo/fib, where you can use git to clone it. You can also use go get github.com/TerrenceHo/fib if you have a Go installation, which you'll need to run the tests anyway. The file fib.go holds the various Fibonacci implementations and some documentation. The associated tests and benchmarks live in fib_test.go. To run the benchmarks, use go test -bench=. and to run just the tests, use go test. My benchmarks are at the bottom of the article, and we refer to them multiple times in the discussion. I encourage you to try these benchmarks for yourselves.
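For a sense of what driving such a benchmark looks like, here is a minimal sketch of my own (not code from the repository); fib is a stand-in implementation, and the standard library's testing.Benchmark lets us run the same closure that would normally live in a BenchmarkXxx function inside fib_test.go:

```go
package main

import (
	"fmt"
	"testing"
)

// fib is a stand-in iterative implementation; the real functions live in fib.go.
func fib(n int) int {
	first, second := 0, 1
	for i := 0; i < n; i++ {
		first, second = second, first+second
	}
	return first
}

func main() {
	// In fib_test.go this closure would be the body of a
	// BenchmarkFibXxx(b *testing.B) function, discovered by go test -bench=.
	result := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			fib(32)
		}
	})
	fmt.Println(result) // prints iteration count and ns/op (machine-dependent)
}
```

The framework picks b.N itself, growing it until the timing is stable, which is why every benchmark body is a loop up to b.N.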

### FibRecursive: Exponential Recursive

func FibRecursive(n int) int {
	if n < 2 {
		return n
	}
	return FibRecursive(n-1) + FibRecursive(n-2)
}


The most basic Fib sequence algorithm known to man, FibRecursive boasts a mighty exponential $$O(2^n)$$ runtime, and is so slow I can't even run FibRecursive(64) on my laptop without waiting an eternity.
The reason: for an input $$n$$, computing $$n+1$$ takes almost twice as long, since each level of the recursion tree makes two recursive calls and roughly doubles the computation needed for the next level.

In our testing suite, we cap the input at $$n=32$$. Memory-wise, FibRecursive is also inefficient, since each function call adds a frame to the call stack, so it runs the danger of actually exhausting the stack at high inputs.
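To make the blowup concrete, here is a small sketch of my own (not part of the repository) that mirrors FibRecursive but tallies every call it makes, so the geometric growth of the call tree is visible directly:

```go
package main

import "fmt"

// fibCount mirrors FibRecursive but counts every call made,
// exposing how fast the call tree grows with n.
func fibCount(n int, calls *int) int {
	*calls++
	if n < 2 {
		return n
	}
	return fibCount(n-1, calls) + fibCount(n-2, calls)
}

func main() {
	// The call count grows geometrically: each step multiplies
	// the work by roughly the golden ratio (~1.6).
	for n := 10; n <= 14; n++ {
		calls := 0
		fibCount(n, &calls)
		fmt.Printf("n=%d calls=%d\n", n, calls)
	}
}
```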

FibRecursive is clearly suboptimal, so the next few Fibonacci examples show progressively more optimized versions.

### FibRecursiveCache: Exponential Recursive Cache

func FibRecursiveCache(n int) int {
	cache := make([]int, n+1)
	fibRecursiveCache(n, &cache)
	return cache[n]
}

func fibRecursiveCache(n int, cache *[]int) {
	if n < 2 {
		(*cache)[n] = n // base cases: Fib(0) = 0, Fib(1) = 1
		return
	}
	fibRecursiveCache(n-1, cache)
	(*cache)[n] = (*cache)[n-1] + (*cache)[n-2]
}


We notice from expanding the FibRecursive call tree that many values are computed repeatedly (e.g. in FibRecursive(5), FibRecursive(2) is computed thrice). What if we compute these values once and then store them? We can optimize our previous algorithm by caching previously computed values in an array. In doing so, we lower our runtime from exponential to linear $$O(n)$$, since we cut off most of the computation tree.

FibRecursiveCache recurses from $$n$$ down to the base cases, then fills in the cache on the way back up. The right-most value in our array holds the final answer. FibRecursiveCache ends up with both a linear runtime and linear memory usage, due to the array cache.

So where can we improve from here? Let's set aside runtime optimizations for now and tackle the memory issue first. If we could get rid of the entire cache, we would save a lot of memory, since memory is neither infinite nor free, even on modern computers. The Fibonacci recurrence only requires the addition of the two previous terms, so we don't actually need to save every previously calculated Fibonacci value. We can make do with just the previous two, leading us to...

### FibIterative: Linear Iterative Implementation

func FibIterative(n int) int {
	if n == 0 {
		return 0
	}
	var temp int
	first := 0
	second := 1
	for i := 0; i < n-1; i++ {
		temp = second
		second = first + second
		first = temp
	}
	return second
}


FibIterative builds on the observation that we do not need to keep every previously computed value, only the previous two numbers. Thus we get rid of the cache entirely and instead add the two previous numbers together in a loop. This eliminates the need to keep the entire cache in memory, keeping the runtime linear while making the memory required constant.

FibIterative and FibRecursiveCache both have a linear runtime, but FibIterative performs better on benchmarks, largely due to practical computer limitations. Performing function calls is slower than iterating through a loop, since the program must deal with managing the function call stack compared to simply incrementing a variable. Thus, on my computer anyway, the iterative implementation is faster.

### FibTailRecursive: Tail Recursive Implementation

func FibTailRecursive(n int) int {
	return fibTailRecursive(n, 0, 1)
}

func fibTailRecursive(n, first, second int) int {
	if n == 0 {
		return first
	}
	return fibTailRecursive(n-1, second, first+second)
}


Tail recursion is a trick of placing the recursive call last, with no further computation after it returns. This allows a compiler to reuse the current function call frame rather than create another (useless) stack frame, effectively turning the recursion into a for loop under the hood. (Note that Go's compiler does not currently guarantee this optimization, so in practice each call may still get its own frame.) On benchmarks, FibTailRecursive performs better than FibRecursiveCache, but still worse than FibIterative. I speculate that memory access and recursive calls contribute to FibRecursiveCache's worse performance, while per-call overhead explains why FibTailRecursive remains slightly slower than FibIterative.

Memory-wise, FibTailRecursive can be considered to use constant memory when the compiler performs the tail-call optimization, since no new frames are allocated per call and, unlike FibRecursiveCache, it allocates no extra storage; without that optimization, the call stack still grows linearly with $$n$$.

Let's see if we can still find a faster implementation for the Fibonacci sequence.

### FibPowerMatrix: Linear Matrix Implementation

func FibPowerMatrix(n int) int {
	F := [2][2]int{
		{1, 1},
		{1, 0},
	}
	if n == 0 {
		return 0
	}
	fibPower(&F, n-1)
	return F[0][0]
}

func fibPower(F *[2][2]int, n int) {
	M := [2][2]int{
		{1, 1},
		{1, 0},
	}

	for i := 2; i <= n; i++ {
		fibMultiply(F, &M)
	}
}

func fibMultiply(F *[2][2]int, M *[2][2]int) {
	f := *F
	m := *M
	x := f[0][0]*m[0][0] + f[0][1]*m[1][0]
	y := f[0][0]*m[0][1] + f[0][1]*m[1][1]
	z := f[1][0]*m[0][0] + f[1][1]*m[1][0]
	w := f[1][0]*m[0][1] + f[1][1]*m[1][1]

	(*F)[0][0] = x
	(*F)[0][1] = y
	(*F)[1][0] = z
	(*F)[1][1] = w
}


For the uninitiated, this algorithm looks rather complicated. It utilizes $$2 \times 2$$ matrix multiplication, and it is not immediately obvious how matrix multiplication can help compute the Fibonacci sequence. We shall prove the following theorem using induction, where $$F_{n}$$ represents the Fibonacci sequence at step $$n$$.

$\left( \begin{array}{ccc} 1 & 1 \\ 1 & 0 \\ \end{array} \right)^n = \left( \begin{array}{ccc} F_{n+1} & F_{n} \\ F_{n} & F_{n-1} \\ \end{array} \right)$

For $$n=1$$,

$\left( \begin{array}{ccc} 1 & 1 \\ 1 & 0 \\ \end{array} \right)^1 = \left( \begin{array}{ccc} F_{2} & F_{1} \\ F_{1} & F_{0} \\ \end{array} \right)$

This holds, since the first three values of the Fibonacci sequence are $$F_{2} = 1$$, $$F_{1} = 1$$, and $$F_{0} = 0$$.

Assume that this step holds for $$n=k$$.

$\left( \begin{array}{ccc} 1 & 1 \\ 1 & 0 \\ \end{array} \right)^k = \left( \begin{array}{ccc} F_{k+1} & F_{k} \\ F_{k} & F_{k-1} \\ \end{array} \right)$

Now we attempt to prove that the statement still holds for $$n=k+1$$, which completes the induction. We multiply both sides by $$\left( \begin{array}{ccc} 1 & 1 \\ 1 & 0 \\ \end{array} \right)$$

$\left( \begin{array}{ccc} 1 & 1 \\ 1 & 0 \\ \end{array} \right)^k \left( \begin{array}{ccc} 1 & 1 \\ 1 & 0 \\ \end{array} \right) = \left( \begin{array}{ccc} F_{k+1} & F_{k} \\ F_{k} & F_{k-1} \\ \end{array} \right) \left( \begin{array}{ccc} 1 & 1 \\ 1 & 0 \\ \end{array} \right)$

$\left( \begin{array}{ccc} 1 & 1 \\ 1 & 0 \\ \end{array} \right)^{k+1} = \left( \begin{array}{ccc} F_{k+1} + F_{k} & F_{k+1} \\ F_{k} + F_{k-1} & F_{k} \\ \end{array} \right)$

This implies that

$\left( \begin{array}{ccc} 1 & 1 \\ 1 & 0 \\ \end{array} \right)^{k+1} = \left( \begin{array}{ccc} F_{k+2} & F_{k+1} \\ F_{k+1} & F_{k} \\ \end{array} \right)$

Thus, by induction, the statement holds for all $$n \geq 1$$. This gives us a simple way to advance the Fibonacci sequence by one step per matrix multiplication. We calculate the matrices in a loop, similar to our tail recursive implementation, and after $$n$$ iterations the answer sits in the top-left entry of the matrix.
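As a quick sanity check of the identity, we can raise the matrix to successive powers and compare an entry against the sequence computed directly. This is a throwaway sketch of my own (the helper mul is mine, not the package's fibMultiply):

```go
package main

import "fmt"

// mul is a throwaway 2x2 matrix multiply used only for this check.
func mul(a, b [2][2]int) [2][2]int {
	var c [2][2]int
	for i := 0; i < 2; i++ {
		for j := 0; j < 2; j++ {
			c[i][j] = a[i][0]*b[0][j] + a[i][1]*b[1][j]
		}
	}
	return c
}

func main() {
	base := [2][2]int{{1, 1}, {1, 0}}
	m := base             // m holds base^n inside the loop
	first, second := 1, 1 // F(1), F(2)
	for n := 1; n <= 10; n++ {
		// base^n should equal [[F(n+1), F(n)], [F(n), F(n-1)]],
		// so its top-right entry should match F(n) computed directly.
		fmt.Printf("n=%d matrix=%d direct=%d\n", n, m[0][1], first)
		m = mul(m, base)
		first, second = second, first+second
	}
}
```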

However, this is still a linear-runtime implementation. Worse, since matrix multiplication involves more operations per step, it is slower than most of our other linear implementations. Matrix multiplication does, however, open up a trick that yields a massive speed boost. Memory-wise, because we reuse the same matrix, this implementation uses constant space.

### FibPowerMatrixRecursive: $$O(\log n)$$ Matrix Implementation

func FibPowerMatrixRecursive(n int) int {
	F := [2][2]int{
		{1, 1},
		{1, 0},
	}

	if n == 0 {
		return 0
	}
	fibPowerRecursive(&F, n-1)
	return F[0][0]
}

func fibPowerRecursive(F *[2][2]int, n int) {
	if n == 0 || n == 1 {
		return
	}

	M := [2][2]int{
		{1, 1},
		{1, 0},
	}
	fibPowerRecursive(F, n/2)
	fibMultiply(F, F)
	if n%2 != 0 {
		fibMultiply(F, &M)
	}
}


FibPowerMatrixRecursive is pretty similar to the previous implementation. However, it uses fast exponentiation to calculate matrix powers, rather than simply multiplying $$\left( \begin{array}{ccc} 1 & 1 \\ 1 & 0 \\ \end{array} \right)$$ into the previous result $$\left( \begin{array}{ccc} 1 & 1 \\ 1 & 0 \\ \end{array} \right)^{k}$$

Ignoring matrices for now, consider calculating $$2^8$$. We could compute it as $$2 * 2 * 2 * 2 * 2 * 2 * 2 * 2$$ and multiply straight across, which would require 7 multiplication operations. However, we can achieve the same result with $$4 * 4 * 4 * 4$$, since $$2 * 2 = 4$$: multiply $$2 * 2$$ once to get $$4$$, then use that result to finish the calculation. Repeating the squaring step, $$4 * 4 = 16$$ and $$16 * 16 = 256 = 2^8$$. So rather than 7 multiplication steps, we achieved the same result in 3 steps.

$$2*2 = 4$$

$$4*4 = 16$$

$$16 * 16 = 256$$
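The steps above can be sketched for ordinary integers; this is my own illustration of exponentiation by squaring, not code from the repository:

```go
package main

import "fmt"

// fastPow computes base^exp by repeated squaring: square base for
// every bit of the exponent, folding a factor into the result
// whenever the current bit is set (i.e. the exponent is odd).
func fastPow(base, exp int) int {
	result := 1
	for exp > 0 {
		if exp%2 == 1 {
			result *= base
		}
		base *= base // square
		exp /= 2     // halve the remaining exponent
	}
	return result
}

func main() {
	fmt.Println(fastPow(2, 8)) // prints 256
}
```

The loop runs once per bit of the exponent, which is where the logarithmic step count comes from.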

This technique is known as fast exponentiation (or exponentiation by squaring). It brings exponentiation from a linear to a logarithmic runtime by halving the remaining work at every step. We can apply the same logic to matrices, squaring the matrix each iteration rather than multiplying straight across. In math notation, this looks something like:

$\left( \begin{array}{ccc} 1 & 1 \\ 1 & 0 \\ \end{array} \right)^n \left( \begin{array}{ccc} 1 & 1 \\ 1 & 0 \\ \end{array} \right)^n = \left( \begin{array}{ccc} F_{n+1} & F_{n} \\ F_{n} & F_{n-1} \\ \end{array} \right) \left( \begin{array}{ccc} F_{n+1} & F_{n} \\ F_{n} & F_{n-1} \\ \end{array} \right)$

$\left( \begin{array}{ccc} 1 & 1 \\ 1 & 0 \\ \end{array} \right)^{2n} = \left( \begin{array}{ccc} F^2_{n+1} + F^2_{n} & F_{n+1}F_{n} + F_{n}F_{n-1} \\ F_{n+1}F_{n} + F_{n}F_{n-1} & F^2_{n} + F^2_{n-1} \\ \end{array} \right)$

If $$n$$ is odd, we multiply once more by $$\left( \begin{array}{ccc} 1 & 1 \\ 1 & 0 \\ \end{array} \right)$$

Because we halve the remaining work with each matrix squaring, this qualifies as an $$O(\log n)$$ implementation. However, that does not mean it is automatically faster than the linear implementations. Due to the overhead of matrix multiplication, it is actually slower for low values of $$n$$, which on my computer meant $$n < 256$$.

In terms of memory usage, because we reuse the same matrices and do not store all past values, the memory usage can be considered constant, with the exception of the $$O(\log n)$$ recursive call stack.

### Benchmarks

We benchmark $$Fib(n)$$ for $$n = {1,2,4,8,16,32,64,128,256,512,1024}$$ (except for FibRecursive, where we stop at $$n=32$$ because any higher value of $$n$$ simply takes too long to run).

Here are the benchmarks for fib on my computer.

Benchmark Test                        | Iterations | Mean time of operation
---------------------------------------------------------------------------
BenchmarkFibRecursive1-4                1000000000       2.93 ns/op
BenchmarkFibRecursive2-4                200000000        7.77 ns/op
BenchmarkFibRecursive4-4                100000000        22.5 ns/op
BenchmarkFibRecursive8-4                10000000          162 ns/op
BenchmarkFibRecursive16-4               200000           7727 ns/op
BenchmarkFibRecursive32-4               100          17192531 ns/op
BenchmarkFibIterative1-4                500000000        3.50 ns/op
BenchmarkFibIterative2-4                300000000        4.49 ns/op
BenchmarkFibIterative4-4                200000000        6.38 ns/op
BenchmarkFibIterative8-4                100000000        10.3 ns/op
BenchmarkFibIterative16-4               100000000        17.7 ns/op
BenchmarkFibIterative32-4               50000000         39.6 ns/op
BenchmarkFibIterative64-4               30000000         59.9 ns/op
BenchmarkFibIterative128-4              20000000          104 ns/op
BenchmarkFibIterative256-4              10000000          182 ns/op
BenchmarkFibIterative512-4              5000000           351 ns/op
BenchmarkFibIterative1024-4             2000000           673 ns/op
BenchmarkFibRecursiveCache1-4           50000000         31.0 ns/op
BenchmarkFibRecursiveCache2-4           50000000         36.0 ns/op
BenchmarkFibRecursiveCache4-4           30000000         47.4 ns/op
BenchmarkFibRecursiveCache8-4           20000000         68.4 ns/op
BenchmarkFibRecursiveCache16-4          10000000          123 ns/op
BenchmarkFibRecursiveCache32-4          10000000          201 ns/op
BenchmarkFibRecursiveCache64-4          5000000           405 ns/op
BenchmarkFibRecursiveCache128-4         2000000           763 ns/op
BenchmarkFibRecursiveCache256-4         1000000          1514 ns/op
BenchmarkFibRecursiveCache512-4         500000           3091 ns/op
BenchmarkFibRecursiveCache1024-4        200000           6041 ns/op
BenchmarkFibTailRecursive1-4            300000000        5.86 ns/op
BenchmarkFibTailRecursive2-4            200000000        8.63 ns/op
BenchmarkFibTailRecursive4-4            100000000        14.6 ns/op
BenchmarkFibTailRecursive8-4            50000000         27.7 ns/op
BenchmarkFibTailRecursive16-4           20000000         57.7 ns/op
BenchmarkFibTailRecursive32-4           20000000          120 ns/op
BenchmarkFibTailRecursive64-4           10000000          224 ns/op
BenchmarkFibTailRecursive128-4          3000000           438 ns/op
BenchmarkFibTailRecursive256-4          2000000           868 ns/op
BenchmarkFibTailRecursive512-4          1000000          1696 ns/op
BenchmarkFibTailRecursive1024-4         300000           4205 ns/op
BenchmarkFibPowerMatrix1-4              100000000        15.9 ns/op
BenchmarkFibPowerMatrix2-4              200000000        8.05 ns/op
BenchmarkFibPowerMatrix4-4              50000000         32.5 ns/op
BenchmarkFibPowerMatrix8-4              20000000         70.9 ns/op
BenchmarkFibPowerMatrix16-4             10000000          141 ns/op
BenchmarkFibPowerMatrix32-4             5000000           290 ns/op
BenchmarkFibPowerMatrix64-4             2000000           619 ns/op
BenchmarkFibPowerMatrix128-4            1000000          1221 ns/op
BenchmarkFibPowerMatrix256-4            500000           2467 ns/op
BenchmarkFibPowerMatrix512-4            300000           4963 ns/op
BenchmarkFibPowerMatrix1024-4           200000           9599 ns/op
BenchmarkFibPowerMatrixRecursive1-4     200000000        6.24 ns/op
BenchmarkFibPowerMatrixRecursive2-4     200000000        7.44 ns/op
BenchmarkFibPowerMatrixRecursive4-4     100000000        24.8 ns/op
BenchmarkFibPowerMatrixRecursive8-4     30000000         47.9 ns/op
BenchmarkFibPowerMatrixRecursive16-4    20000000         72.6 ns/op
BenchmarkFibPowerMatrixRecursive32-4    20000000         95.5 ns/op
BenchmarkFibPowerMatrixRecursive64-4    20000000          120 ns/op
BenchmarkFibPowerMatrixRecursive128-4   10000000          142 ns/op
BenchmarkFibPowerMatrixRecursive256-4   10000000          166 ns/op
BenchmarkFibPowerMatrixRecursive512-4   10000000          191 ns/op
BenchmarkFibPowerMatrixRecursive1024-4  10000000          213 ns/op


The two noteworthy results are BenchmarkFibPowerMatrixRecursive1024-4 and BenchmarkFibIterative1024-4, which clocked in at 213 ns/op and 673 ns/op respectively, showing that as $${n\to\infty}$$, the $$O(\log n)$$ implementation does indeed scale better. At lower values of $$n$$, however, the iterative implementation should still be used.

### Conclusion

Well, we've learned to benchmark programs in Go and tested out the real-world implications of complexity analysis. We can achieve real performance gains with the fast exponentiation algorithm, but the overhead of matrix multiplication drags down the runtime for lower values of $$n$$. Thus, in real-world systems it can be advisable to choose your algorithm depending on the value of $$n$$. Good algorithms really do matter after all.