from k1lib.imports import *
Tests how much load load balancers like Nginx and HAProxy can really handle within a single node, and how to push things to the limit. These tests are mostly done on DigitalOcean's servers, at different sizes (RAM and core count).
Turns out, nginx is quite terrible (3500 requests/s), and haproxy is 5 times faster at around 17500 requests/s, but still not great. Vertically scaling the server up does not help that much. So how do big companies deal with the firehose of requests? They use multiple HAProxy instances, one per IP, then in their DNS settings they add all of those IPs under a single URL (round-robin DNS; a quick sketch of what that looks like from the client's side is below). In fact, DigitalOcean's load balancer advertises that it can handle 10k requests/s/node, which is totally within the realm of possibility here.
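To make that concrete, here's a tiny sketch (not from the lbbench repo) of what round-robin DNS looks like from the client's side: the hostname resolves to several A records, one per HAProxy instance, and each client ends up talking to just one of them. The domain here is made up purely for illustration.
import socket, random
# hypothetical domain whose A records point at every HAProxy instance's ip
infos = socket.getaddrinfo("lb.example.com", 80, proto=socket.IPPROTO_TCP)
ips = sorted({sockaddr[0] for *_, sockaddr in infos})  # all the A records, e.g. ['203.0.113.10', '203.0.113.11', ...]
ip = random.choice(ips)  # stand-in for whichever address this particular client ends up picking
print(ip)  # this client's requests all go to one HAProxy node, so traffic spreads across all of them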
The max throughput reached on a single node is 27k requests/s, which is quite beefy, but that takes multiple nodejs processes and no load balancer at all to achieve, which gives me quite mixed feelings. Does that mean I shouldn't use HAProxy?
Exact configurations are at https://github.com/157239n/lbbench. The instructions in the readme are kinda okay, but it's highly recommended that you read everything if you're curious.
In this config, I have multiple node processes on a single host, then use an haproxy container on the same host to load balance a varying number of node processes (2, 4, 8 or 16). Then I put load into the system using 1 ApacheBench process.
This is quite basic and straightforward, and should serve as our baseline for all future comparisons.
Settings prefixed with "shared" are shared VPS instances on DigitalOcean, and those prefixed with "compute" are the cpu-optimized version.
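To make the setup easier to picture, here's a rough Python sketch of one benchmark run. It's not the repo's actual scripts (the real haproxy configs and commands live in lbbench, so read those); server.js, the ports and the ab parameters are all stand-ins.
import subprocess, time
N = 4  # number of upstream node processes, i.e. the "haproxy-4" setting
# start N node processes, each on its own port (a server.js that takes a port argument is assumed)
nodes = [subprocess.Popen(["node", "server.js", str(3000 + i)]) for i in range(N)]
time.sleep(1)  # give them a moment to bind
# the haproxy container is assumed to already be listening on 8080, with one "server" line per node process;
# a single ApacheBench process then generates the load: 100k requests, 100 concurrent connections
subprocess.run(["ab", "-n", "100000", "-c", "100", "http://127.0.0.1:8080/"])
for p in nodes: p.terminate()
The "Requests per second" line in ab's output is what the parsing code below pulls out.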
# from a results text file's contents (str) to a single requests/s number (float)
metric = grep("Requests per second") | item() | op().split(":")[1].strip(" ").split(" ")[0] | aS(float)
# from setup path (str) to [['haproxy-2', [2, 3, ...]], ...]
processSetup = ls() | apply(lambda x: [x.split("/")[-1], x]) | apply(cat() | grep("root@", sep=True).till("root@") | metric.all(), 1)
base = "data/1_ab-haproxy_container-same_host"
data = [f'{base}/shared-1-cpu-1-ram',
f'{base}/shared-1-cpu-2-ram',
f'{base}/shared-2-cpu-2-ram',
f'{base}/shared-4-cpu-8-ram',
f'{base}/shared-8-cpu-16-ram',
f'{base}/compute-8-cpu-16-ram',
f'{base}/compute-16-cpu-32-ram',
f'{base}/compute-32-cpu-64-ram',] | apply(lambda x: [x.split("/")[-1], x]) | apply(processSetup, 1) | deref(); data
[['shared-1-cpu-1-ram', [['haproxy-2', [836.55, 780.21]], ['haproxy-4', [721.27, 726.65]], ['node', [2670.14, 2455.9]]]], ['shared-1-cpu-2-ram', [['haproxy-2', [839.71, 810.24]], ['haproxy-4', [765.37, 735.79]], ['node', [2792.4, 2642.95]]]], ['shared-2-cpu-2-ram', [['haproxy-2', [1319.73, 1241.91]], ['haproxy-4', [1260.05, 1246.86]], ['node', [2614.27, 2241.17]]]], ['shared-4-cpu-8-ram', [['haproxy-2', [2720.7, 2599.82]], ['haproxy-4', [2808.79, 3639.15]], ['node', [3838.07, 3580.05]]]], ['shared-8-cpu-16-ram', [['haproxy-8', [7057.47, 8785.8]], ['haproxy-2', [4468.68, 4672.22]], ['haproxy-4', [7863.34, 8518.66]], ['node', [3549.1, 3494.32]]]], ['compute-8-cpu-16-ram', [['haproxy-8', [7511.57, 9016.65]], ['haproxy-2', [5187.86, 5002.66]], ['haproxy-4', [7969.17, 8740.43]], ['node', [5258.35, 5085.68]]]], ['compute-16-cpu-32-ram', [['haproxy-8', [16551.94, 16040.13]], ['haproxy-2', [7561.1, 7349.99]], ['haproxy-4', [12558.71, 13592.13]], ['node', [7070.54, 6698.09]]]], ['compute-32-cpu-64-ram', [['haproxy-8', [19237.28, 17926.14]], ['haproxy-2', [8645.5, 9014.25]], ['haproxy-4', [14641.46, 15704.03]], ['haproxy-16', [18446.09, 17429.7]], ['node', [7132.02, 6792.72]]]]]
# thumbnail
modes = ["node", "haproxy-2", "haproxy-4", "haproxy-8", "haproxy-16"]
for mode in modes:
    # for each server size, grab this mode's requests/s runs, pair them with the setup name, and scatter-plot them
    plt.plot(*data | transpose.wrap(op() + (grep(mode, col=0).all() | toList().all())) | filt(lambda x: len(x) > 0, 1) | apply(item() | rows(1) | item(), 1) | apply(repeat(2), 0) | transpose(1, 2) | joinStreams() | transpose() | deref(), "o", alpha=0.7)
plt.legend(modes); plt.xticks(rotation=75); plt.ylabel("Requests/s"); plt.grid(True);
Here, "node" means we're benchmarking a single node process directly, without going through any load balancer, and "haproxy-4" means we're hitting the load balancer and it's balancing between 4 node processes upstream. As you can see, the load balancing is actually the heavy operation that takes the most cpu cycles. Even with 32 cpu cores, the performance is only 8 times that of a single-cpu node process (2500 req/s on "shared-1-cpu-1-ram"). Also notice that the scaling honestly isn't even that great as you scale the server vertically. The next config drops the load balancer entirely: multiple node processes on the same host, with multiple ApacheBench processes generating the load directly.
a = ls("data/multi_ab-multiprocess_node") | (ls() | grep("same_host-2_ab") | item()).all() | sortF(op().split("/")[2].split("-")[1].ab_int()) | deref()
labels = a | op().split("/")[2].all() | deref()
ys = a | (cat() | grep("Requests per second") | op().split(":")[1].strip().split(" ")[0].ab_float().all() | toSum()).all() | deref()
f = k1.polyfit(range(3), ys, 1)
plt.plot(labels, ys, "o"); plt.plot(labels, f(np.arange(3)), "--", alpha=0.7); plt.ylabel("Requests/s"); plt.grid(True)
I'm quite surprised. It scaled pretty linearly, and the absolute max is much greater than in the baseline config.
ys = a | (cat() | grep("Transfer rate") | op().split(":")[1].strip().split(" ")[0].ab_float().all() | toSum()).all() | deref()
f = k1.polyfit(range(3), ys, 1)
plt.plot(labels, ys, "o"); plt.plot(labels, f(np.arange(3)), "--", alpha=0.7); plt.ylabel("Transfer rate (KB/s)"); plt.grid(True)