The Problem

Most modern load balancers are either too complex, too slow, or both. When you're building systems that need to handle millions of requests per second, every microsecond counts. I wanted something that could:

  • Handle 1M+ requests per second
  • Maintain sub-millisecond latency
  • Use minimal resources
  • Be simple to deploy and configure

So I built UltraBalancer.

Architecture

UltraBalancer is written in C/C++ and built around three core principles:

1. Lock-Free Data Structures

Traditional load balancers guard shared state with mutexes, and lock contention kills performance at scale. UltraBalancer uses lock-free data structures for:

  • Request queuing
  • Backend server health tracking
  • Connection pooling

This eliminates lock contention and allows us to scale linearly with CPU cores.
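To make the idea concrete, here's a minimal sketch of a lock-free request queue: a single-producer/single-consumer ring buffer built on std::atomic. The SpscQueue name and its interface are hypothetical and only for illustration; this is not UltraBalancer's actual queue, and it assumes exactly one pushing thread and one popping thread.

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <optional>

// Hypothetical single-producer/single-consumer ring buffer (illustration only).
// head_ and tail_ are monotonically increasing counters; the slot index is the
// counter masked by (Capacity - 1), so Capacity must be a power of two.
template <typename T, std::size_t Capacity>
class SpscQueue {
    static_assert(Capacity >= 2 && (Capacity & (Capacity - 1)) == 0,
                  "Capacity must be a power of two");

public:
    // Producer thread only. Returns false if the queue is full.
    bool try_push(const T& value) {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        const std::size_t head = head_.load(std::memory_order_acquire);
        if (tail - head == Capacity) return false;  // full
        slots_[tail & (Capacity - 1)] = value;
        // Release: the slot write above becomes visible before the new tail.
        tail_.store(tail + 1, std::memory_order_release);
        return true;
    }

    // Consumer thread only. Returns std::nullopt if the queue is empty.
    std::optional<T> try_pop() {
        const std::size_t head = head_.load(std::memory_order_relaxed);
        const std::size_t tail = tail_.load(std::memory_order_acquire);
        if (head == tail) return std::nullopt;  // empty
        T value = slots_[head & (Capacity - 1)];
        // Release: the slot read above completes before the slot is reused.
        head_.store(head + 1, std::memory_order_release);
        return value;
    }

private:
    std::array<T, Capacity> slots_{};
    // Separate cache lines so producer and consumer don't false-share.
    alignas(64) std::atomic<std::size_t> head_{0};
    alignas(64) std::atomic<std::size_t> tail_{0};
};
```

Neither side ever blocks: a full or empty queue is reported to the caller, which can retry or shed load. Multi-producer queues need a more involved design than this.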

2. Zero-Copy Networking

Every byte copied is wasted CPU time. UltraBalancer uses:

  • io_uring for async I/O on Linux
  • sendfile() for efficient data transfer
  • TCP splicing where possible

This reduces CPU usage by ~40% compared to traditional approaches that copy every payload through user-space buffers.

3. NUMA-Aware Architecture

On multi-socket systems, memory access patterns matter. UltraBalancer:

  • Pins worker threads to specific NUMA nodes
  • Allocates memory locally to each worker
  • Uses per-core caching to avoid cross-NUMA traffic
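The pinning step can be sketched with the Linux affinity API. This is a simplified illustration, not UltraBalancer's code: it pins the calling thread to a single CPU rather than to a whole NUMA node, and the CPU-to-node mapping (which a library such as libnuma could provide) is left out. Both function names are hypothetical.

```cpp
#include <sched.h>

// Hypothetical helper (illustration only): pins the calling thread to one CPU.
// Returns true on success.
bool pin_current_thread_to_cpu(int cpu) {
    if (cpu < 0 || cpu >= CPU_SETSIZE) return false;
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    // A pid of 0 means "the calling thread".
    return sched_setaffinity(0, sizeof(set), &set) == 0;
}

// Returns the lowest-numbered CPU the calling thread may run on, or -1 on error.
int first_allowed_cpu() {
    cpu_set_t set;
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) != 0) return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
        if (CPU_ISSET(cpu, &set)) return cpu;
    }
    return -1;
}
```

Once a worker is pinned, memory it allocates and first touches will normally come from its local node under Linux's default first-touch policy, which is what makes the "allocate locally" step work.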

Load Balancing Algorithms

UltraBalancer supports multiple algorithms:

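enum class Algorithm {
    ROUND_ROBIN,            // cycle through backends in order
    LEAST_CONNECTIONS,      // pick the backend with the fewest active connections
    WEIGHTED_ROUND_ROBIN,   // round robin, in proportion to each backend's weight
    LEAST_RESPONSE_TIME,    // pick the backend with the lowest measured latency
    IP_HASH                 // hash the client IP so a client sticks to one backend
};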

The LEAST_RESPONSE_TIME algorithm tracks backend latency in real time and routes each request to the fastest available backend. This alone improved P99 latency by 60% in our tests.
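One common way to track "fastest backend" is an exponentially weighted moving average (EWMA) of observed response times. The sketch below assumes that approach; the post doesn't say how UltraBalancer smooths latency, so the LeastResponseTimePicker class, the alpha parameter, and its default of 0.2 are all illustrative assumptions.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch of least-response-time selection (illustration only).
class LeastResponseTimePicker {
public:
    // alpha is the weight given to the newest sample (assumed value).
    explicit LeastResponseTimePicker(std::size_t backend_count, double alpha = 0.2)
        : stats_(backend_count), alpha_(alpha) {}

    // Fold one observed response time (in milliseconds) into the average.
    void record(std::size_t backend, double latency_ms) {
        Stats& s = stats_[backend];
        s.ewma_ms = s.has_sample
                        ? alpha_ * latency_ms + (1.0 - alpha_) * s.ewma_ms
                        : latency_ms;  // first sample seeds the average
        s.has_sample = true;
    }

    void set_healthy(std::size_t backend, bool healthy) {
        stats_[backend].healthy = healthy;
    }

    double ewma_ms(std::size_t backend) const { return stats_[backend].ewma_ms; }

    // Returns the healthy backend with the lowest average, or -1 if none is
    // healthy. Backends with no samples yet have an average of 0, so they are
    // tried first; ties go to the lowest index.
    int pick() const {
        int best = -1;
        for (std::size_t i = 0; i < stats_.size(); ++i) {
            if (!stats_[i].healthy) continue;
            if (best < 0 || stats_[i].ewma_ms < stats_[best].ewma_ms) {
                best = static_cast<int>(i);
            }
        }
        return best;
    }

private:
    struct Stats {
        double ewma_ms = 0.0;
        bool has_sample = false;
        bool healthy = true;
    };
    std::vector<Stats> stats_;
    double alpha_;
};
```

A higher alpha reacts faster to a backend slowing down but is noisier; a lower alpha is steadier but slower to route around trouble.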

Performance Numbers

Here's what UltraBalancer can do on a single server (AMD EPYC 7763, 64 cores):

  • Throughput: 1.2M RPS sustained
  • Latency:
    - P50: 0.3ms
    - P99: 0.8ms
    - P99.9: 1.2ms

  • CPU Usage: 45% at 1M RPS
  • Memory: 2GB RAM for 100k concurrent connections

SSL/TLS Termination

UltraBalancer handles TLS termination efficiently using:

  • OpenSSL with hardware acceleration (AES-NI)
  • Session resumption via tickets
  • OCSP stapling
  • Support for TLS 1.3

We achieve ~800k TLS handshakes per second on the same hardware.

Health Checks & Failover

The health check system runs independently:

  • Configurable check intervals (default: 5s)
  • Multiple check types: TCP, HTTP, custom scripts
  • Automatic failover in less than 100ms
  • Circuit breaker pattern to avoid thundering herd

Health checks are configured per backend:

backends:
  - host: 10.0.1.10
    port: 8080
    weight: 100
    health_check:
      type: http
      path: /health
      interval: 5s
      timeout: 1s
      unhealthy_threshold: 3
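
To show how a setting like unhealthy_threshold might be applied, here's a minimal sketch of the bookkeeping for a single backend. The HealthTracker class is hypothetical, and the recovery rule (one successful check marks the backend healthy again) is an assumption; the config above doesn't show a separate recovery threshold.

```cpp
// Hypothetical per-backend health state (illustration only).
class HealthTracker {
public:
    explicit HealthTracker(int unhealthy_threshold)
        : threshold_(unhealthy_threshold) {}

    // Feed in the result of one health check.
    // Returns whether the backend is considered healthy afterwards.
    bool on_check(bool success) {
        if (success) {
            consecutive_failures_ = 0;
            healthy_ = true;   // assumption: a single success restores the backend
        } else if (++consecutive_failures_ >= threshold_) {
            healthy_ = false;  // too many failures in a row: stop routing to it
        }
        return healthy_;
    }

    bool healthy() const { return healthy_; }

private:
    int threshold_;
    int consecutive_failures_ = 0;
    bool healthy_ = true;
};
```

With unhealthy_threshold: 3, two failed checks in a row leave the backend in rotation and the third takes it out, which keeps one dropped packet from triggering a failover.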

Configuration

UltraBalancer uses a simple YAML config:

listeners:
  - address: 0.0.0.0
    port: 80
    protocol: http
    algorithm: least_response_time

backends:
  - host: 10.0.1.10
    port: 8080
  - host: 10.0.1.11
    port: 8080
  - host: 10.0.1.12
    port: 8080
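
The algorithm value in the config has to map onto the Algorithm enum somewhere. A minimal sketch of that lookup follows; only least_response_time appears in the config above, so the other four strings are assumed lowercase forms of the enum names, and parse_algorithm is a hypothetical function.

```cpp
#include <optional>
#include <string_view>

enum class Algorithm {
    ROUND_ROBIN,
    LEAST_CONNECTIONS,
    WEIGHTED_ROUND_ROBIN,
    LEAST_RESPONSE_TIME,
    IP_HASH
};

// Hypothetical helper (illustration only): maps a config string to the enum.
// Returns std::nullopt for unknown names so the caller can reject the config.
std::optional<Algorithm> parse_algorithm(std::string_view name) {
    if (name == "round_robin") return Algorithm::ROUND_ROBIN;
    if (name == "least_connections") return Algorithm::LEAST_CONNECTIONS;
    if (name == "weighted_round_robin") return Algorithm::WEIGHTED_ROUND_ROBIN;
    if (name == "least_response_time") return Algorithm::LEAST_RESPONSE_TIME;
    if (name == "ip_hash") return Algorithm::IP_HASH;
    return std::nullopt;
}
```

Rejecting unknown names at startup is usually safer than silently falling back to a default algorithm.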

Observability

Built-in Prometheus metrics for everything:

  • Request rate and latency histograms
  • Backend health status
  • Connection pool stats
  • CPU and memory usage per worker
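
For a sense of what a scrape returns, here's a small sketch that renders one counter in the Prometheus text exposition format. The render_counter function and the metric and label names in the usage note are made up for illustration; the post doesn't list UltraBalancer's real metric names. Label values are not escaped here, which a real exporter would need to do.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper (illustration only): renders a single-label counter as
// Prometheus text format: a HELP line, a TYPE line, then one sample line.
std::string render_counter(const std::string& name, const std::string& help,
                           const std::string& label_key,
                           const std::string& label_value,
                           std::uint64_t value) {
    std::string out;
    out += "# HELP " + name + " " + help + "\n";
    out += "# TYPE " + name + " counter\n";
    out += name + "{" + label_key + "=\"" + label_value + "\"} " +
           std::to_string(value) + "\n";
    return out;
}
```

Calling render_counter("ultrabalancer_requests_total", "Total requests.", "backend", "10.0.1.10:8080", 42) produces a three-line block ending in the sample line ultrabalancer_requests_total{backend="10.0.1.10:8080"} 42.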

What's Next

Currently working on:

  • HTTP/3 support with QUIC
  • gRPC load balancing with connection pooling
  • eBPF integration for kernel-level packet processing
  • WebAssembly plugins for custom routing logic

Try It

UltraBalancer is open source and production-ready.

If you're building systems that need to scale, give it a shot. Would love to hear your feedback.

---

*Building fast systems is hard. But it's also fun as hell.*