B-Tree operation experiment¶

NTIN066, homework 5, b-tree operation experiment. Student: Mikoláš Fromm.

In this notebook, we will experiment with B-trees, specifically we will compare (2,4)-trees with (2,3)-trees.

In [1]:

import matplotlib.pyplot as plt
import numpy as np
import math

from ab_experiment import perform_tests, load_results

In [2]:

## perform all tests
if (True):
    perform_tests(42, False, True)

Running test min, a=2, b=3
Running test min, a=2, b=4
Running test insert, a=2, b=3
Running test insert, a=2, b=4
Running test insert_incremental, a=2, b=3
Running test insert_incremental, a=2, b=4
Running test random, a=2, b=3
Running test random, a=2, b=4

In [3]:

tests = load_results()
test_types = list(tests.keys())
tree_types = list(tests[test_types[0]].keys())

rename_dict = {
    ## empty for now
}

Minimal test¶

Insert n elements sequentially and then n times repeat: remove the minimal element in the tree and then insert it back.

In [4]:

## first test
test_type = test_types[0]

## plot both tests for each tree type
for tree_type in tree_types:
    x = tests[test_type][tree_type][:,0]
    y = tests[test_type][tree_type][:,1]
    plt.plot(x, y, label=f"{rename_dict.get(tree_type, tree_type)} tree", marker='o')
## also add log of x
log_x = np.log(x)
plt.plot(x, log_x, label=f"log(x)", marker='x')


plt.title(f"{test_type.capitalize()} Test")
plt.xlabel("Number of Inserted Elements")
plt.ylabel("Average num. of structural changes per operation")
plt.legend(title="Tree Type")
plt.grid()

plt.xscale('log')

fig = plt.gcf()
fig.set_size_inches(10,8)
plt.show()

No description has been provided for this image

Interpretation¶

From the results above, we can see that the (2,3)-tree is very unstable, compared to the (2,4)-tree. We can also see that the (2,3)-tree is having a specific pattern in the y-growth every 4th size step. On the other hand, the (2,4)-tree is having stable performance.

Generally speaking, the operations insert and remove can take O( log(n) * ( b / log(a) ) ) of time. (lecture 3)

The reason why the average number of structural changes per insertion & removal in the (2,3)-tree increases with respect to the number of inserted elements is related to the fact that in the (2,3)-tree, each node can only have from 2 to 3 children and is therefore more sensitive to any change in its number of children. Therefore when removing the element, only when there are other 2 children in the node, a structural change is not required. Otherwise the tree must be rebalanced and from the same reason as stated above, all the other nodes (on the path from the root) are the same sensitive to the change of the number of children, and therefore same likely to change their structure. We therefore end up in rebalancing the whole path from the root to the minimal item.

Compared to the (2,4)-tree, where there might be from 2 to 4 children, the removal of only one (and still the same) item is invoking only local rebalancing (at the lowest level), which therefore is not dependant on the size of the whole tree.

Lastly the reason for seeing the specific pattern in the y-growth in the (2,3)-tree might be relevant to the number of all items in the tree, where some sizes of the tree might result in having different number of keys in the nodes close to top. We have already stated that we need to rebalance only when having less then 2 children and this might not be the case for the top nodes in some of the sizes.

Insert test¶

Insert n elements in random order.

In [5]:

## second test
test_type = test_types[1]
test_type_incremental = test_types[2]

## plot both tests for each tree type
for tree_type in tree_types:
    x = tests[test_type][tree_type][:,0]
    y = tests[test_type][tree_type][:,1]
    plt.plot(x, y, label=f"{rename_dict.get(tree_type, tree_type)} tree", marker='o')

    x_incremental = tests[test_type][tree_type][:,0]
    y_incremental = tests[test_type_incremental][tree_type][:,1]
    plt.plot(x_incremental, y_incremental, label=f"{rename_dict.get(tree_type, tree_type)} tree (incremental)", marker='x')

plt.title(f"{test_type.capitalize()} Test")
plt.xlabel("Number of Inserted Elements")
plt.ylabel("Average num. of structural changes per operation")

## show legend outside of plot
plt.legend(title="Tree Type", bbox_to_anchor=(1.05, 1), loc='upper left')
plt.grid()

plt.xscale('log')

fig = plt.gcf()
fig.set_size_inches(12,8)
plt.show()

Interpretation¶

From the results above, we can see that both the (2,3) and (2,4) oscillate around a certain value.

Generally speaking, the insert operation has a complexity O( log(n) * ( b / log(a) ) ) (lecture 3).

When adding an item to the tree, we might already have the node, to which we insert the new item, already full of children, therefore we would need to rebalance the node after the insertion. For the same reasons as in the min tests, we can find out that for the (2,3)-tree we perform more structural changes compared to the (2,4)-tree, which is caused by the sensitivity to change in the number of children in each node.

But when splitting a node, it might also invoke a split in its parent. And this might go recursively up to the root. How often can this happen? The worst case might be sequential insert, where we are trying to always insert into the rightmost lowest position in the tree. We demonstrate it in the plot with the *(incremental) examples.

The node starts with a children and needs to split when having more than b children. Therefore at the lowest level, we are doing the change every (b-a)-th step. The parent of this node is being reconstructed every (b-a)**2-th time, its parent every (b-a)**3-th time etc. The average number of changes per operation is then sum of the averages on each level, therefore the sum 1 / (b-a)**i for each layer i.

Then for the infinite sums holds: 1 / p**i <= 1 / q**i iff. p >= q for p, q <= 1.

Since (3-2) <= (4-2), the (2,4)-tree must have better performance, which can be also observed on the plot.

Random test¶

Insert n elements sequentially and then n times repeat: remove random element from the tree and then insert random element into the tree. Removed element is always present in the tree and inserted element is always not present in the tree.

In [6]:

## third test
test_type = test_types[3]

## plot both tests for each tree type
for tree_type in tree_types:
    x = tests[test_type][tree_type][:,0]
    y = tests[test_type][tree_type][:,1]
    plt.plot(x, y, label=f"{rename_dict.get(tree_type, tree_type)} tree", marker='o')

plt.title(f"{test_type.capitalize()} Test")
plt.xlabel("Number of Inserted Elements")
plt.ylabel("Average num. of structural changes per operation")
plt.legend(title="Tree Type")
plt.grid()

plt.xscale('log')

fig = plt.gcf()
fig.set_size_inches(10,8)
plt.show()

Interpretation¶

From the results above, we can see two main things:

(2,4)-tree is again having less avg. structural changes per operation compared to the (2,3)-tree,
(2,3)-tree oscillates more then (2,4)-tree.

The reasons for this behaviour were discussed already above. What is worth mentioning instead, is the fact that part of this test is the random insertion which was performed in the previous test. We have seen in these tests that the (2,4)-tree was perfoming better and that both were upper bounded by 1. What differs is therefore the random delete operation. When we end up having too few children in a node, we merge the node with another child of the same parent and, consequently, decrease the number of children of that parent. That might, again, result in recursive rebalancing. And the principle of how many times this operation might happen remains the same as with the insert.

Together we perform two random operations that require, by the same principle, every (b-a)-th a structural change of the node.