This is an example of how to go from a totally unoptimized C++23 web server to a highly optimized version.
I mainly developed this as a learning project on how to optimize C++ code interacting with sockets.
I'll present each version chronologically, starting from the intentionally naive implementation and ending with the most optimized version. For each step, I'll explain what changed, why it matters, and how it affected performance.
For compilation and (quick) testing of each version I used:
# Terminal 1
rm -rf build/
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j
./build/server
# Terminal 2
wrk -t2 -c400 -d10s http://localhost:8888/ # Mimics normal server requests
wrk -t2 -c400 -d10s -s scripts/wrk-get.lua http://localhost:8888/ # sends 64KB header
wrk -t2 -c400 -d10s -s scripts/wrk-post.lua http://localhost:8888/ # sends 64KB header and 10MB body
The initial optimizations are significant enough that we don't need professional tooling to measure them.
The final version of the code can be found on master; all of the other versions are referred to by their appropriate git tag.
I tried to write a version with as many beginner mistakes as possible. It can be found at commit 072df00e03af5c9978e642f355cda08153a987a0.
TL;DR:
- It reads the HTTP Request Line byte-by-byte (one read syscall per char).
- It uses sscanf to parse the request line (forces unnecessary memory copies).
- It pauses reading halfway to parse the request line, then starts a new read loop for the headers (ruins OS network buffering).
- It builds the outbound response using += to concatenate everything. This thrashes the heap and doubles memory usage (serving a 10MB file takes 20MB of RAM).
- It double-copies the request body (reads into a temporary malloc buffer, then copies it into a std::string).
- It sends the response byte-by-byte (one send syscall per char, completely tanking throughput).
- It parses headers using unsafe, raw C pointer math (strstr, strchr).
- It allocates a brand new std::string just to pass the Content-Length view to atoi().
[cpp-web-server] (master) > wrk -t2 -c400 -d10s http://localhost:8888/
Running 10s test @ http://localhost:8888/
2 threads and 400 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 81.65ms 4.49ms 94.26ms 94.66%
Req/Sec 786.51 43.13 0.91k 68.50%
15656 requests in 10.04s, 1.58MB read
Socket errors: connect 151, read 91, write 0, timeout 0
Requests/sec: 1559.13
Transfer/sec: 161.40KB
[cpp-web-server] (master) > wrk -t2 -c400 -d10s -s scripts/wrk-get.lua http://localhost:8888/
Running 10s test @ http://localhost:8888/
2 threads and 400 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 1.01s 582.17ms 1.99s 57.89%
Req/Sec 29.98 5.37 40.00 72.36%
602 requests in 10.10s, 62.32KB read
Socket errors: connect 151, read 120, write 0, timeout 488
Requests/sec: 59.61
Transfer/sec: 6.17KB
[cpp-web-server] (master) > wrk -t2 -c400 -d10s -s scripts/wrk-post.lua http://localhost:8888/
Running 10s test @ http://localhost:8888/
2 threads and 400 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 1.06s 547.55ms 1.98s 57.69%
Req/Sec 20.73 6.85 50.00 77.01%
398 requests in 10.10s, 41.20KB read
Socket errors: connect 151, read 120, write 0, timeout 320
Requests/sec: 39.42
Transfer/sec: 4.08KB
The main goal of these optimizations was to reduce syscalls as much as possible, plus a few allocation optimizations here and there. It can be found at commit fb042f04c656a0c0ddf77b9a04b2aa1df24593ef.
- Read headers and request line in a single ::read call loop using a stack buffer.
size_t headers_end = std::string::npos;
{
size_t search_start = 0;
ssize_t bytes_read = 0;
char buffer[HEADERS_USUAL_SIZE];
while (true) {
bytes_read = ::read(this->_client_fd, buffer, sizeof(buffer));
if (bytes_read <= 0) [[unlikely]] return RequestParseError_SocketError;
size_t bytes_read_t = static_cast<std::size_t>(bytes_read);
if ((this->_request_raw.size() + bytes_read_t) >= HEADERS_MAX_SIZE) [[unlikely]] return RequestParseError_PayloadTooLarge;
this->_request_raw.append(buffer, bytes_read_t);
headers_end = this->_request_raw.find("\r\n\r\n", (search_start >= 3) ? search_start - 3 : 0);
if (headers_end != std::string::npos) [[likely]] break;
search_start = this->_request_raw.size();
}
}
This ensures that we're not issuing one read syscall per byte of the request. By reading data in 4KB chunks, we drastically reduce context switches between user space and the kernel. The loop also resumes the \r\n\r\n search near where the previous one stopped (search_start, backed up by 3 bytes in case the terminator straddles two reads), so each byte is scanned roughly once instead of rescanning the entire string on every loop iteration.
- Replace sscanf and raw C-pointer math with std::string_view math.
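The search resumption can be isolated from the socket code. The following is a minimal sketch, assuming a hypothetical HeaderScanner class (not part of the repository) that is fed chunks the way the read loop above feeds _request_raw:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper (not from the repository): accumulates chunks and
// reports the offset of "\r\n\r\n" once it has been seen.
class HeaderScanner {
public:
    // Appends one chunk and returns the terminator offset, or npos if
    // the headers are still incomplete.
    std::size_t feed(std::string_view chunk) {
        _raw.append(chunk);
        // Back up 3 bytes: the terminator may straddle two chunks.
        const std::size_t from = (_search_start >= 3) ? _search_start - 3 : 0;
        const std::size_t pos = _raw.find("\r\n\r\n", from);
        // Nothing found yet: the next search starts at the current end.
        if (pos == std::string::npos) _search_start = _raw.size();
        return pos;
    }

private:
    std::string _raw;
    std::size_t _search_start = 0;
};
```

Feeding "...Host: x\r\n" and then "\r\nbody" still finds the terminator, because the second search starts 3 bytes before the end of the first chunk.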
// Request line parsing
size_t first_space = req_line.find(' ');
size_t second_space = req_line.find(' ', first_space + 1);
std::string_view method_str = req_line.substr(0, first_space);
// Header parsing
size_t colon = line.find(":");
std::string_view name = line.substr(0, colon);
size_t val_start = line.find_first_not_of(" \t", colon + 1);
find and substr on a string_view never allocate or copy: each call only produces a new pointer-and-length pair into the original buffer. That is cheaper than sscanf (which copies into separate buffers) and less error-prone than manual pointer arithmetic.
- Zero-allocation string-to-int conversion for the Content-Length.
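As a self-contained illustration of the request-line half, here is a minimal sketch. RequestLine and parse_request_line are hypothetical names chosen for this example, and the npos checks are an addition the excerpt above does not show:

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string_view>

// Hypothetical result type: three views into the caller's buffer.
struct RequestLine {
    std::string_view method;
    std::string_view path;
    std::string_view protocol;
};

// Splits "METHOD PATH PROTOCOL" on the first two spaces without copying.
// Returns nullopt when either space is missing.
std::optional<RequestLine> parse_request_line(std::string_view line) {
    const std::size_t first_space = line.find(' ');
    if (first_space == std::string_view::npos) return std::nullopt;
    const std::size_t second_space = line.find(' ', first_space + 1);
    if (second_space == std::string_view::npos) return std::nullopt;
    return RequestLine{
        line.substr(0, first_space),
        line.substr(first_space + 1, second_space - first_space - 1),
        line.substr(second_space + 1),
    };
}
```

The returned views are only valid while the underlying buffer is alive, which is why the server keeps the raw request string as a member of the request object.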
size_t content_length = 0;
auto [_, err] = std::from_chars(it->second.data(), it->second.data() + it->second.size(), content_length);
Instead of converting the string_view into a std::string just to use atoi(), std::from_chars parses the integer directly from the pointer boundaries.
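A minimal sketch of how the result of std::from_chars might be validated. parse_content_length is a hypothetical helper, and rejecting trailing characters is an assumption about the desired strictness rather than something the excerpt shows:

```cpp
#include <cassert>
#include <charconv>
#include <cstddef>
#include <optional>
#include <string_view>
#include <system_error>

// Parses a decimal Content-Length without allocating.
// Returns nullopt on empty input, non-digits, trailing garbage, or overflow.
std::optional<std::size_t> parse_content_length(std::string_view value) {
    std::size_t result = 0;
    const char* first = value.data();
    const char* last = value.data() + value.size();
    const auto [ptr, ec] = std::from_chars(first, last, result);
    // ec reports invalid input or overflow; ptr != last means leftover characters.
    if (ec != std::errc{} || ptr != last) return std::nullopt;
    return result;
}
```

Unlike atoi(), which returns 0 for unparseable input, this makes a malformed header distinguishable from a genuine Content-Length of 0.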
- Zero-copy Body Parsing.
// since we're reading HEADERS_USUAL_SIZE while reading headers, it's possible we've already read all of the body bytes
// if not, calculate how many are left to read
size_t body_start = headers_end + 4; // Skip past the \r\n\r\n
size_t body_already_read = this->_request_raw.size() - body_start;
if (body_already_read < content_length) {
size_t bytes_remaining = content_length - body_already_read;
size_t current_size = this->_request_raw.size();
size_t new_size = current_size + bytes_remaining;
this->_request_raw.resize_and_overwrite(new_size, [new_size](char*, size_t) { return new_size; }); // resize without zero-filling
char* write_ptr = this->_request_raw.data() + current_size;
while (bytes_remaining > 0) {
ssize_t bytes_read = ::read(this->_client_fd, write_ptr, bytes_remaining);
if (bytes_read <= 0) [[unlikely]] return RequestParseError_SocketError;
write_ptr += bytes_read;
bytes_remaining -= static_cast<std::size_t>(bytes_read);
}
}
this->body = std::string_view(this->_request_raw.data() + body_start, content_length);
Instead of malloc-ing a temporary buffer and copying it into the C++ string, we calculate exactly how many bytes remain and use C++23's resize_and_overwrite to grow the string to its final size without zero-filling the new bytes. We then pass a pointer into that storage to read(), so the kernel copies the data straight into the string's heap buffer with no intermediate user-space copy, and simply bind a std::string_view to it.
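The same idea can be tried outside the server with any file descriptor. This is a minimal sketch, assuming a hypothetical append_exact helper. Unlike the excerpt above, it performs the read loop inside the resize_and_overwrite callback, and it falls back to a plain (zero-filling) resize when the standard library does not provide resize_and_overwrite:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unistd.h>

// Hypothetical helper: appends exactly `count` bytes read from `fd` to `out`.
// Returns false on EOF or error; `out` then keeps only the bytes received.
bool append_exact(int fd, std::string& out, std::size_t count) {
    const std::size_t old_size = out.size();
    std::size_t received = 0;
    // Fills the tail of the buffer and returns the final string length.
    auto fill = [&](char* buf, std::size_t) {
        while (received < count) {
            const ssize_t n = ::read(fd, buf + old_size + received, count - received);
            if (n <= 0) break;
            received += static_cast<std::size_t>(n);
        }
        return old_size + received;
    };
#ifdef __cpp_lib_string_resize_and_overwrite
    out.resize_and_overwrite(old_size + count, fill);  // no zero-fill
#else
    out.resize(old_size + count);                      // zero-fills first
    out.resize(fill(out.data(), out.size()));
#endif
    return received == count;
}
```

Doing the read inside the callback means the string never exposes bytes that were not actually written, even when the peer disconnects halfway through the body.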
- Eliminate the "God String" response builder.
char header_buf[256];
int header_len = std::snprintf(
header_buf, sizeof(header_buf),
"HTTP/1.1 %.*s\r\n"
"Content-Type: %.*s\r\n"
"Content-Length: %zu\r\n"
"Connection: close\r\n\r\n",
// ... variables
);
this->_client_fd_send(std::string_view(header_buf, static_cast<size_t>(header_len)), 0);
if (!resp_body.empty()) this->_client_fd_send(resp_body, 0);
Instead of using += to concatenate the headers and the body into one massive std::string (which forced the server to double its memory footprint just to serve a file), we write the headers into a lightweight stack buffer using snprintf and send the headers and body sequentially.
- Send responses in chunks, not byte-by-byte. The _client_fd_send method now uses a while loop that sends as much of the buffer as the socket will accept in a single system call, instead of artificially locking it to 1 byte per call.
void Request::_client_fd_send(std::string_view message, int flags) {
ssize_t sent = 0;
size_t total_sent = 0;
auto message_len = message.length();
flags |= MSG_NOSIGNAL;
while (total_sent < message_len) {
sent = ::send(_client_fd, message.data() + total_sent, message_len - total_sent, flags);
if (sent <= 0) [[unlikely]] return;
total_sent += static_cast<size_t>(sent);
}
}
- Disable Nagle's algorithm for lower HTTP latency.
set_opt_result = ::setsockopt(this->_socket_fd, IPPROTO_TCP, TCP_NODELAY, &opt, sizeof(opt));
if (set_opt_result == -1) throw std::system_error(errno, std::generic_category(), "setting TCP_NODELAY failed");
This forces the server to send data immediately instead of delaying small packets to batch them together.
- Accept hot path optimization inside the Server::accept function:
if (!this->_log_ip) [[likely]] return ::accept(this->_socket_fd, nullptr, nullptr);
Passing nullptr when IP logging is disabled saves CPU cycles by sparing the kernel from copying the peer address back to user space.
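To check that the option actually took effect, it can be read back with getsockopt. A minimal sketch, assuming two hypothetical helpers that are not part of the repository:

```cpp
#include <cassert>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <unistd.h>

// Hypothetical helper: turns Nagle's algorithm off for a TCP socket.
bool enable_nodelay(int fd) {
    int opt = 1;
    return ::setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &opt, sizeof(opt)) == 0;
}

// Hypothetical helper: reports whether TCP_NODELAY is currently set.
bool nodelay_enabled(int fd) {
    int value = 0;
    socklen_t len = sizeof(value);
    if (::getsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &value, &len) != 0) return false;
    return value != 0;
}
```

The option is set per socket, so where it is applied (listening socket or each accepted connection) depends on the platform's inheritance behaviour; reading it back on an accepted connection is a cheap way to confirm.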
- Remove unnecessary initializations.
sockaddr_in client_addr; // from sockaddr_in client_addr {};
// ...
char ip_str[INET_ADDRSTRLEN]; // from char ip_str[INET_ADDRSTRLEN] = {0};
// ...
The functions that assign values to them are going to overwrite them anyway.
[cpp-web-server] (master) > wrk -t2 -c400 -d10s http://localhost:8888/
Running 10s test @ http://localhost:8888/
2 threads and 400 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 15.31ms 2.61ms 52.98ms 95.07%
Req/Sec 3.85k 288.60 4.25k 84.50%
76528 requests in 10.04s, 7.74MB read
Socket errors: connect 151, read 0, write 0, timeout 0
Requests/sec: 7622.41
Transfer/sec: 789.04KB
[cpp-web-server] (master) > wrk -t2 -c400 -d10s -s scripts/wrk-get.lua http://localhost:8888/
Running 10s test @ http://localhost:8888/
2 threads and 400 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 16.55ms 2.33ms 34.63ms 82.80%
Req/Sec 2.96k 187.33 3.46k 81.50%
58932 requests in 10.04s, 5.96MB read
Socket errors: connect 151, read 0, write 0, timeout 0
Requests/sec: 5871.20
Transfer/sec: 607.76KB
[cpp-web-server] (master) > wrk -t2 -c400 -d10s -s scripts/wrk-post.lua http://localhost:8888/
Running 10s test @ http://localhost:8888/
2 threads and 400 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 142.90ms 43.44ms 210.26ms 78.86%
Req/Sec 284.78 161.37 690.00 64.49%
5195 requests in 10.05s, 16.25KB read
Socket errors: connect 151, read 5033, write 1298, timeout 0
Non-2xx or 3xx responses: 5099
Requests/sec: 517.00
Transfer/sec: 1.62KB
The next set of optimizations focused on memory layout, data structures, and further syscall reduction. It can be found at commit 0df647a50f601d8bb49bea62152b827ac0a756bd.
- Aligning the Struct Layout
enum HttpMethod : uint8_t {
HTTP_GET,
HTTP_HEAD,
HTTP_POST,
HTTP_PUT,
HTTP_DELETE,
HTTP_CONNECT,
HTTP_OPTIONS,
HTTP_TRACE,
HTTP_PATCH,
HTTP_UNKNOWN = 255,
};
Giving the enum an explicit underlying type (enum HttpMethod : uint8_t) shrinks it to 1 byte, where a plain enum typically occupies 4. Adding HTTP_UNKNOWN = 255 gives the parser a cheap default state for detecting unsupported HTTP methods.
class Request {
// constants
private:
inline static constexpr uint32_t HEADERS_USUAL_SIZE = 4096; // 99% of headers will be this length
inline static constexpr uint32_t HEADERS_MAX_SIZE = 65536; // 64KB
inline static constexpr uint32_t USUAL_NUMBER_OF_HEADERS = 25;
inline static constexpr uint32_t BODY_MAX_SIZE = 10485760; // 10MB
// aligned members
private:
std::string _request_raw;
std::string_view _headers_raw;
int _client_fd;
public:
HttpMethod method;
std::vector<HeaderType> headers;
std::string_view path;
std::string_view protocol;
std::string_view body;
// ...
};
By reordering the class members, we eliminate wasted padding. Placing the 4-byte _client_fd right next to the 1-byte method allows the compiler to pack them into a single 8-byte slot right before the 8-byte aligned headers vector begins. This shrinks the overall object size, reducing memory pressure and improving cache locality.
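The effect of member order on object size is easy to check with sizeof. A minimal sketch using two hypothetical structs (not the real Request class) that hold the same three members in different orders; the byte counts in the comments assume a typical 64-bit ABI where uint64_t is 8-byte aligned:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Small member first: the compiler must pad before the 8-byte member.
struct PaddedLayout {
    std::uint8_t method;     // 1 byte, then 7 bytes of padding
    std::uint64_t handle;    // 8 bytes
    std::int32_t client_fd;  // 4 bytes, then 4 bytes of tail padding
};

// Largest member first: the small members share one 8-byte slot.
struct ReorderedLayout {
    std::uint64_t handle;    // 8 bytes
    std::int32_t client_fd;  // 4 bytes
    std::uint8_t method;     // 1 byte, then 3 bytes of tail padding
};

// Reordering never makes the struct larger.
static_assert(sizeof(ReorderedLayout) <= sizeof(PaddedLayout),
              "reordered layout should not grow");
```

On such an ABI the first struct occupies 24 bytes and the second 16, with identical contents.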
- Data-Oriented Design (Vector vs. Hash Map)
// Replaced this:
std::unordered_map<std::string_view, std::string_view> headers;
// With this:
using HeaderNameType = std::string_view;
using HeaderValueType = std::string_view;
using HeaderType = std::pair<HeaderNameType, HeaderValueType>;
std::vector<HeaderType> headers;
// And in the constructor:
Request::Request(int client_fd) : _client_fd(client_fd), method(HTTP_UNKNOWN) {
this->_request_raw.reserve(HEADERS_USUAL_SIZE);
this->headers.reserve(USUAL_NUMBER_OF_HEADERS);
}
Swapping std::unordered_map for a std::vector of pairs is a performance win. For small collections (like 25 HTTP headers), the overhead of hashing a string, dealing with bucket collisions, and chasing pointers through separately allocated nodes is usually far slower than a linear scan over a contiguous block of memory in a std::vector. Reserving the space in the constructor also avoids reallocations during parsing as long as a request stays within USUAL_NUMBER_OF_HEADERS headers.
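With a vector, a header lookup becomes a plain linear scan. A minimal sketch, assuming hypothetical iequals and find_header helpers; the case-insensitive comparison reflects that HTTP header names are case-insensitive and is an assumption about how the lookup might be written, not a description of the repository's code:

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string_view>
#include <utility>
#include <vector>

using HeaderType = std::pair<std::string_view, std::string_view>;

// ASCII-only case-insensitive comparison (header names are ASCII tokens).
bool iequals(std::string_view a, std::string_view b) {
    if (a.size() != b.size()) return false;
    auto lower = [](unsigned char c) {
        return (c >= 'A' && c <= 'Z') ? static_cast<unsigned char>(c + 32) : c;
    };
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (lower(a[i]) != lower(b[i])) return false;
    }
    return true;
}

// Linear scan over contiguous storage; returns the first matching value.
std::optional<std::string_view> find_header(const std::vector<HeaderType>& headers,
                                            std::string_view name) {
    for (const auto& [key, value] : headers) {
        if (iequals(key, name)) return value;
    }
    return std::nullopt;
}
```

For a couple of dozen entries the whole scan touches a few cache lines, which is the trade-off the paragraph above relies on.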
- HTTP Method Switch Trick
std::string_view method_str = req_line.substr(0, first_space);
if (method_str.empty()) [[unlikely]] return RequestParseError_MalformedRequest;
switch (method_str[0]) {
case 'G': if (method_str == "GET") this->method = HTTP_GET; break;
case 'P':
if (method_str == "POST") this->method = HTTP_POST;
else if (method_str == "PUT") this->method = HTTP_PUT;
else if (method_str == "PATCH") this->method = HTTP_PATCH;
break;
case 'H': if (method_str == "HEAD") this->method = HTTP_HEAD; break;
case 'D': if (method_str == "DELETE") this->method = HTTP_DELETE; break;
case 'C': if (method_str == "CONNECT") this->method = HTTP_CONNECT; break;
case 'O': if (method_str == "OPTIONS") this->method = HTTP_OPTIONS; break;
case 'T': if (method_str == "TRACE") this->method = HTTP_TRACE; break;
}
if (this->method == HTTP_UNKNOWN) [[unlikely]] return RequestParseError_MalformedRequest;
Replacing the long if-else if string-comparison chain with a switch on the first character (method_str[0]) lets the compiler emit a jump table or a short compare chain. Since most HTTP methods start with a distinct letter (only POST, PUT, and PATCH share 'P'), we skip almost all the string comparisons.
- Gather I/O (writev)
iovec iov[2];
iov[0].iov_base = header_buf;
iov[0].iov_len = static_cast<size_t>(header_len);
int iovcnt = 1;
if (!resp_body.empty()) {
iov[1].iov_base = const_cast<char*>(resp_body.data());
iov[1].iov_len = resp_body.size();
iovcnt = 2;
}
int iov_index = 0;
while (iov_index < iovcnt) {
ssize_t written = ::writev(this->_client_fd, &iov[iov_index], iovcnt - iov_index);
if (written <= 0) [[unlikely]] return;
size_t bytes_to_advance = static_cast<size_t>(written);
while (iov_index < iovcnt && bytes_to_advance > 0) {
if (bytes_to_advance >= iov[iov_index].iov_len) {
bytes_to_advance -= iov[iov_index].iov_len;
iov_index++;
} else {
iov[iov_index].iov_base = static_cast<char*>(iov[iov_index].iov_base) + bytes_to_advance;
iov[iov_index].iov_len -= bytes_to_advance;
bytes_to_advance = 0;
}
}
}
Replacing the two send() calls with a single writev() over an iovec array hands the header buffer and the body buffer to the kernel in one call, without first copying them into one giant string. In the common case, where the socket accepts everything at once, this halves the number of syscalls per response.
[cpp-web-server] (master) > wrk -t2 -c400 -d10s http://localhost:8888/
Running 10s test @ http://localhost:8888/
2 threads and 400 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 12.54ms 3.86ms 62.90ms 93.58%
Req/Sec 4.30k 560.50 4.93k 86.87%
85574 requests in 10.06s, 8.65MB read
Socket errors: connect 151, read 0, write 0, timeout 0
Requests/sec: 8504.73
Transfer/sec: 0.86MB
[cpp-web-server] (master) > wrk -t2 -c400 -d10s -s scripts/wrk-get.lua http://localhost:8888/
Running 10s test @ http://localhost:8888/
2 threads and 400 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 15.80ms 2.00ms 35.43ms 96.65%
Req/Sec 3.36k 201.54 3.76k 79.50%
66978 requests in 10.03s, 6.77MB read
Socket errors: connect 151, read 0, write 0, timeout 0
Requests/sec: 6675.49
Transfer/sec: 691.02KB
[cpp-web-server] (master) > wrk -t2 -c400 -d10s -s scripts/wrk-post.lua http://localhost:8888/
Running 10s test @ http://localhost:8888/
2 threads and 400 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 141.95ms 41.95ms 200.42ms 79.84%
Req/Sec 263.54 155.84 666.00 67.24%
3998 requests in 10.09s, 10.98KB read
Socket errors: connect 151, read 3890, write 1334, timeout 0
Non-2xx or 3xx responses: 3916
Requests/sec: 396.19
Transfer/sec: 1.09KB
The next optimization was to stop running the whole server on a single thread and let the kernel distribute incoming connections between multiple listener sockets. It can be found at commit 4f8e4dc2c5264e49f7e2b1cbbdd63b862db8c2ce.
- Link pthreads
set(CMAKE_THREAD_PREFER_PTHREAD TRUE)
set(THREADS_PREFER_PTHREAD_FLAG TRUE)
find_package(Threads REQUIRED)
# ...
target_link_libraries(server ${CMAKE_THREAD_LIBS_INIT})
Since we're now using std::thread, we need to link the executable with the system threading library.
- Allow multi-threaded kernel load balancing
set_opt_result = ::setsockopt(this->_socket_fd, SOL_SOCKET, SO_REUSEPORT, &opt, sizeof(opt));
if (set_opt_result == -1) throw std::system_error(errno, std::generic_category(), "setting SO_REUSEPORT options failed");
SO_REUSEPORT allows multiple server sockets to bind to the same port. This lets each worker thread have its own listening socket, and the kernel can distribute incoming connections between them.
- Spawn one listener per hardware thread
unsigned int num_threads = std::thread::hardware_concurrency();
if (num_threads == 0) num_threads = 8;
std::print("Starting server on {} hardware threads using SO_REUSEPORT...\n", num_threads);
std::vector<std::thread> workers;
workers.reserve(num_threads);
for (unsigned int i = 0; i < num_threads; i++) workers.emplace_back(listener);
for (auto& t : workers) t.join();
Instead of running one server loop on the main thread, we now create one worker per hardware thread. Each worker runs its own listener() function, which creates its own Server instance and accepts connections independently.
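What each worker's Server instance does at the socket level can be sketched in isolation. The following assumes Linux (where SO_REUSEPORT is available) and two hypothetical helpers, open_reuseport_listener and bound_port, that are not part of the repository:

```cpp
#include <arpa/inet.h>
#include <cassert>
#include <cstdint>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

// Hypothetical helper: opens a listening TCP socket on 127.0.0.1:port with
// SO_REUSEPORT set. Returns the fd, or -1 on failure. Passing port 0 lets
// the kernel pick a free port.
int open_reuseport_listener(std::uint16_t port) {
    const int fd = ::socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) return -1;
    int opt = 1;
    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(port);
    // The option must be set before bind() on every socket sharing the port.
    if (::setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &opt, sizeof(opt)) != 0 ||
        ::bind(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) != 0 ||
        ::listen(fd, SOMAXCONN) != 0) {
        ::close(fd);
        return -1;
    }
    return fd;
}

// Hypothetical helper: returns the port a bound socket ended up on, or 0.
std::uint16_t bound_port(int fd) {
    sockaddr_in addr{};
    socklen_t len = sizeof(addr);
    if (::getsockname(fd, reinterpret_cast<sockaddr*>(&addr), &len) != 0) return 0;
    return ntohs(addr.sin_port);
}
```

Calling open_reuseport_listener twice with the same port yields two independent listening sockets. Without SO_REUSEPORT, the second bind() would fail with EADDRINUSE.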
- Ignore SIGPIPE
std::signal(SIGPIPE, SIG_IGN);
When clients disconnect early, writing to the socket can trigger SIGPIPE. writev() takes no flags argument, so the MSG_NOSIGNAL used with send() earlier does not cover that path. Since early disconnects are normal under load testing, we ignore the signal and let the write fail with EPIPE instead of killing the whole process.
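The behaviour can be reproduced without a real client by using a socketpair. A minimal sketch, assuming a hypothetical write_to_closed_peer function written only for this demonstration:

```cpp
#include <cassert>
#include <cerrno>
#include <csignal>
#include <sys/socket.h>
#include <unistd.h>

// Hypothetical demo: writes to a socket whose peer is already closed.
// Returns the errno reported by write(), 0 if the write succeeded,
// or -1 if the socket pair could not be created.
int write_to_closed_peer() {
    std::signal(SIGPIPE, SIG_IGN);  // without this, the write below kills the process
    int fds[2];
    if (::socketpair(AF_UNIX, SOCK_STREAM, 0, fds) != 0) return -1;
    ::close(fds[1]);  // simulate a client that disconnected early
    const ssize_t n = ::write(fds[0], "x", 1);
    const int err = (n < 0) ? errno : 0;
    ::close(fds[0]);
    return err;
}
```

With the signal ignored, the failed write surfaces as an ordinary EPIPE error that the send path can handle by dropping the connection.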
- Read headers directly into the request string
size_t current_size = this->_request_raw.size();
ssize_t actual_bytes_read = 0;
this->_request_raw.resize_and_overwrite(current_size + HEADERS_USUAL_SIZE, [&](char* buf, size_t) {
actual_bytes_read = ::read(this->_client_fd, buf + current_size, HEADERS_USUAL_SIZE);
if (actual_bytes_read <= 0) return current_size;
return current_size + static_cast<size_t>(actual_bytes_read);
});
if (actual_bytes_read <= 0) [[unlikely]] return RequestParseError_SocketError;
headers_end = this->_request_raw.find("\r\n\r\n", (search_start >= 3) ? search_start - 3 : 0);
if (headers_end != std::string::npos) [[likely]] break;
search_start = this->_request_raw.size();
The old version read into a stack buffer and then appended that buffer into _request_raw. This version uses resize_and_overwrite and reads directly into the final string storage.
[cpp-web-server] (master) > wrk -t2 -c400 -d10s http://localhost:8888/
Running 10s test @ http://localhost:8888/
2 threads and 400 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 11.54ms 3.28ms 41.19ms 82.84%
Req/Sec 4.41k 336.64 5.05k 85.50%
87774 requests in 10.04s, 8.87MB read
Socket errors: connect 151, read 0, write 0, timeout 0
Requests/sec: 8744.03
Transfer/sec: 0.88MB
[cpp-web-server] (master) > wrk -t2 -c400 -d10s -s scripts/wrk-get.lua http://localhost:8888/
Running 10s test @ http://localhost:8888/
2 threads and 400 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 16.30ms 2.73ms 52.87ms 92.90%
Req/Sec 3.19k 230.22 3.50k 86.50%
63543 requests in 10.05s, 6.42MB read
Socket errors: connect 151, read 0, write 0, timeout 0
Requests/sec: 6323.34
Transfer/sec: 654.56KB
[cpp-web-server] (master) > wrk -t2 -c400 -d10s -s scripts/wrk-post.lua http://localhost:8888/
Running 10s test @ http://localhost:8888/
2 threads and 400 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 874.63ms 122.41ms 1.06s 93.20%
Req/Sec 71.42 19.56 141.00 75.00%
1411 requests in 10.05s, 146.06KB read
Socket errors: connect 151, read 112, write 0, timeout 0
Requests/sec: 140.38
Transfer/sec: 14.53KB