Non-blocking server in C/Linux

In my previous post, I created an echo server application that

  • Opens a socket that listens to incoming TCP connections
  • When accepting a connection from a client, reads and echoes back a string terminated with a newline
  • Cleans up and exits when it receives a signal

The purpose of that example was to demonstrate how to handle both incoming connections and signals simultaneously in a C/Linux application. The caveat, as discussed in the post, was that the connection handling subroutine was not integrated with the main loop. Because only the main loop handled signals, the application would not behave correctly if a signal was received while serving a client.

I thought it would be a fun exercise to design an echo server that overcomes these problems, using the low-level Linux APIs. The server presented in this post expands the previous design by:

  • Allowing multiple concurrent connections at once
  • Preventing the server from getting stuck by using non-blocking APIs and polling each client socket from the main loop

I’ll also discuss some C tricks (data encapsulation, using a switch statement to implement coroutines) that have proper abstractions in high-level languages typically used to implement non-blocking server applications.

Although the function of the server remains almost trivial (echo back whatever the client sends to it), the non-blocking flow brings quite a few complications. It’s not impossible that I have made errors, so feedback is always welcome! I also passed on some (micro-)optimization opportunities in favor of clarity.

Let’s get started!

The main loop

The full source code is available on GitHub. There was plenty of sausage making, so I’m not even going to pretend that I wrote this in clean stages. Instead, here is the implementation of a non-blocking, concurrent, signal-handling echo server in C/Linux:

/* In addition to the standard headers (<assert.h>, <fcntl.h>, <poll.h>,
 * <signal.h>, <stdio.h>, <string.h>, <sys/signalfd.h>, <sys/socket.h>,
 * <unistd.h>), this excerpt relies on constants (MAX_CONNECTIONS, SERVER_FD,
 * SIGNAL_FD, FIRST_CONNECTION, POLLFDS) and helpers (create_server(),
 * handle_error()) defined in the full source. */
int main()
{
    sigset_t sigset;
    struct signalfd_siginfo siginfo;
    int server_fd, socket_fd, signal_fd, i, total_connections, connection_completed, flags;
    short revents, events_out;
    struct pollfd pollfds[MAX_CONNECTIONS + 2];
    struct context* connection;
    struct context* connections[MAX_CONNECTIONS];

    /* Setting up signalfd to read signals is explained in:
     * https://www.jmoisio.eu/en/blog/2020/04/20/handling-signals-correctly-in-a-linux-application/ */

    sigemptyset(&sigset);
    sigaddset(&sigset, SIGTERM);
    sigprocmask(SIG_SETMASK, &sigset, NULL);

    server_fd = create_server();
    pollfds[SERVER_FD].fd = server_fd;
    pollfds[SERVER_FD].events = POLLIN;

    signal_fd = signalfd(-1, &sigset, 0);
    pollfds[SIGNAL_FD].fd = signal_fd;
    pollfds[SIGNAL_FD].events = POLLIN;

    /* Setting up the connection objects. Initially there are no connections so
     * add dummy entries to the polling list. */

    for (i = FIRST_CONNECTION; i < POLLFDS; ++i) {
        pollfds[i].fd = -1;
        pollfds[i].events = 0;
    }
    memset(connections, 0, sizeof(connections));
    total_connections = 0;

    while (1) {
        for (i = 0; i < POLLFDS; ++i) {
            pollfds[i].revents = 0;
        }
        if (poll(pollfds, POLLFDS, -1) < 0) {
            handle_error("poll");
        }

        /* Handle an incoming connection. */
        if (pollfds[SERVER_FD].revents & POLLERR) {
            handle_error("server failure");
        } else if (pollfds[SERVER_FD].revents & POLLIN) {
            /* Accept an incoming connection. Immediately set nonblocking mode. */
            if ((socket_fd = accept(server_fd, NULL, NULL)) < 0) {
                handle_error("accept");
            }
            /* O_NONBLOCK is a file *status* flag, so we need F_GETFL/F_SETFL
             * (F_GETFD/F_SETFD manipulate file *descriptor* flags instead). */
            flags = fcntl(socket_fd, F_GETFL, 0);
            if (fcntl(socket_fd, F_SETFL, flags | O_NONBLOCK) < 0) {
                handle_error("fcntl");
            }
            /* Create context for the connection. Setup pollfds for the socket
             * just created. */
            for (i = 0; i < MAX_CONNECTIONS; ++i) {
                if (!connections[i]) {
                    connection = create_connection(socket_fd, &events_out);
                    if (!connection) {
                        handle_error("create_connection");
                    }
                    connections[i] = connection;
                    pollfds[FIRST_CONNECTION + i].fd = socket_fd;
                    pollfds[FIRST_CONNECTION + i].events = events_out;
                    ++total_connections;
                    assert(total_connections <= MAX_CONNECTIONS);
                    break;
                }
            }
            /* If we reached the maximum number of concurrent connections,
             * remove the server socket from the polling list. The negation is a
             * neat trick explained in the poll() man page:
             * https://man7.org/linux/man-pages/man2/poll.2.html */
            if (total_connections == MAX_CONNECTIONS) {
                assert(pollfds[SERVER_FD].fd > 0);
                pollfds[SERVER_FD].fd = -pollfds[SERVER_FD].fd;
            }
        }

        /* Check if a signal was received. If it was, read the signal info and
         * break away from the event loop. */
        if (pollfds[SIGNAL_FD].revents & POLLERR) {
            handle_error("signal_fd failure");
        } else if (pollfds[SIGNAL_FD].revents & POLLIN) {
            if (read(signal_fd, &siginfo, sizeof(siginfo)) != sizeof(siginfo)) {
                handle_error("read siginfo");
            }
            break;
        }

        /* Handle connections. */
        for (i = 0; i < MAX_CONNECTIONS; ++i) {
            revents = pollfds[i + FIRST_CONNECTION].revents;
            if (revents & POLLERR) {
                handle_error("socket failure");
            } else if (revents) {
                connection = connections[i];
                events_out = 0;
                connection_completed = 0;
                /* For each connection that has events, we call
                 * handle_connection() with the context object, and in the next
                 * round listen to the events returned by the handler -- a sort
                 * of coroutine! */
                if (handle_connection(connection, revents, &events_out, &connection_completed)) {
                    handle_error("handle_connection");
                }
                pollfds[i + FIRST_CONNECTION].events = events_out;
                /* If a connection was completed, free the context object. If
                 * the number of connections was capped, the server is now free
                 * to serve more clients, so add the server socket back to the
                 * polling list. */
                if (connection_completed) {
                    destroy_connection(connection);
                    connections[i] = NULL;
                    pollfds[i + FIRST_CONNECTION].fd = -1;
                    if (total_connections == MAX_CONNECTIONS) {
                        assert(pollfds[SERVER_FD].fd < 0);
                        pollfds[SERVER_FD].fd = -pollfds[SERVER_FD].fd;
                    }
                    --total_connections;
                    assert(total_connections >= 0);
                }
            }
        }
    }

    /* We're done! Just clean up and exit. */
    fprintf(stderr, "Exiting via %s\n", strsignal(siginfo.ssi_signo));
    close(signal_fd);
    close(server_fd);
    for (i = 0; i < MAX_CONNECTIONS; ++i) {
        if (connections[i]) {
            destroy_connection(connections[i]);
        }
    }
    return 0;
}

There is a lot of code, but let’s go through it. Before the main loop, we initialize the signal mask, create the server socket, and set up the initial pollfd entries. The main loop handles polling, accepting connections, and receiving signals just like in my previous post. The most notable difference in these bits is that the sockets for client connections are put into non-blocking mode with the fcntl() call. We’ll come back to this later when we see how the handling of client connections is implemented.
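
As a side note, Linux also offers the accept4() system call, which can put the accepted socket into non-blocking mode atomically, saving the separate fcntl() round trip. A minimal sketch of a drop-in replacement for the accept lines above (requires _GNU_SOURCE):

#define _GNU_SOURCE
#include <sys/socket.h>

/* Equivalent to accept() followed by fcntl(fd, F_SETFL, flags | O_NONBLOCK) */
socket_fd = accept4(server_fd, NULL, NULL, SOCK_NONBLOCK);
if (socket_fd < 0) {
    handle_error("accept4");
}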

Note that there is a fixed maximum number of connections. When the maximum is reached, the server stops polling the listening socket for incoming connections until an existing connection is released.

The major new bit is handling client connections. Instead of handling every client connection from start to finish in a (blocking) subroutine, the sockets are integrated into the main loop. The echo “protocol” is almost trivial, but still consists of receiving a message (possibly in multiple parts if the message is big or the client is slow), and sending it back to the client (again, possibly in multiple parts). That’s why we need to keep the context of each connection around for as long as it lasts. Note how the handle_connection() function accepts two “out” arguments (events_out and connection_completed) used to signal which events need to be polled next, and whether the connection is completed and may be released. Although this is a simple example, it is a starting point for a powerful and feature-rich event loop framework.

Note that there are certainly opportunities for improvement. The error handling is brutal: the server aborts at the first sign of trouble. A real server wouldn’t want to do that just because one of its many client connections fails. Also, nothing prevents a malicious client from performing a DoS attack by opening multiple simultaneous connections and then doing nothing: there are no time-outs after which the server closes an idle connection. We assume all clients are benevolent.

Object-oriented design in C

Is C an object-oriented language? Well, no, but that doesn’t prevent object-oriented design wherever it makes sense. Just like nothing prevents you from writing bad code in a language that emphasizes object-oriented design.

The state of each connection is stored in a context object. context is an abstract data type. The create_connection() function creates the context object (constructor). After that, the main loop interacts with the connection using a fixed set of functions (methods). When the main loop finishes the connection, the destroy_connection() function destroys the object and releases the related resources (destructor).

Because struct context is forward declared (instead of defined) in the header file included in the main compilation unit, its exact representation is unknown there. The main loop only holds an opaque pointer to the connection context. It’s a prime example of the encapsulation advocated in object-oriented programming.
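
To make this concrete, the header included by the main compilation unit only needs declarations along these lines (the file name is illustrative; the prototypes match the functions used above):

/* connection.h -- the public interface of the connection module */
struct context; /* forward declaration: the layout stays hidden */

struct context* create_connection(int socket_fd, short* events_out);
int handle_connection(struct context* ctx, short revents, short* events_out,
                      int* connection_completed);
void destroy_connection(struct context* ctx);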

If you want to know a lot more about advanced object-oriented concepts (inheritance, object polymorphism) in C, I can recommend Object-Oriented Programming with ANSI-C by Schreiner.

Event loops and coroutines

What does the concrete implementation of the context object and its methods look like? Let’s concentrate on the create_connection(), handle_connection(), and destroy_connection() methods that I discussed in the previous section.
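
The struct context type itself is private to the connection module, so it is not shown in the listing below. Based on how the fields are used, it looks roughly like this (the field types and the buffer size constant are my guesses; see the full source for the exact definition):

enum state { READING, WRITING, DONE };

struct context {
    int fd;             /* the client socket */
    enum state state;   /* where handle_connection() resumes execution */
    char buf[BUF_SIZE]; /* buffer holding the client's message */
    size_t bytes;       /* bytes read/written so far in the current state */
    char* buf_end;      /* position of the terminating newline, if found */
};

With the state laid out, here are the three methods: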

struct context* create_connection(int socket_fd, short* events_out)
{
    struct context* ctx = (struct context*)malloc(sizeof(struct context));
    if (ctx) {
        ctx->fd = socket_fd;
        ctx->state = READING;
        memset(ctx->buf, 0, sizeof(ctx->buf));
        ctx->bytes = 0;
        ctx->buf_end = NULL;
        *events_out = POLLIN;
    }
    return ctx;
}

int handle_connection(struct context* ctx, short revents, short* events_out, int* connection_completed)
{
    ssize_t result, max_bytes;

    /* POLLHUP means that the other end has closed the connection (hung up!). No
       need to continue. */
    if (revents & POLLHUP) {
        *connection_completed = 1;
        return 0;
    }

    /* The switch statement jumps to the right position based on the state
     * stored in the context. In a way the handle_connection() function together
     * with the context object holding the state of the connection form a
     * coroutine: When the coroutine needs to wait for an event (the socket
     * becomes readable/writable), it awaits by telling the caller which events
     * are interesting and returning. The caller will call handle_connection()
     * again when the event happens and thanks to the state variable,
     * handle_connection() will know where to pick the execution up! */

    assert(ctx);
    switch (ctx->state) {
    case READING:
        max_bytes = sizeof(ctx->buf) - ctx->bytes - 1;
        /* Because the socket is in nonblocking mode, we may not get everything
         * at once. If we don't receive the linefeed ending the message, just
         * try again later. */
        result = read(ctx->fd, ctx->buf + ctx->bytes, max_bytes);
        if (result < 0) {
            /* EAGAIN/EWOULDBLOCK just means there is nothing to read right
             * now; anything else is a real error. (Requires <errno.h>.) */
            if (errno == EAGAIN || errno == EWOULDBLOCK) {
                *events_out = POLLIN;
                *connection_completed = 0;
                return 0;
            }
            return -1;
        }
        if (result == 0) {
            /* End of stream (or a full buffer without a newline): no further
             * progress is possible, so complete the connection. */
            *connection_completed = 1;
            return 0;
        }
        ctx->bytes += result;
        ctx->buf_end = memchr(ctx->buf, '\n', ctx->bytes);
        if (ctx->buf_end) {
            ctx->state = WRITING;
            ctx->bytes = 0;
            *events_out = POLLOUT;
        } else {
            *events_out = POLLIN;
            *connection_completed = 0;
            break;
        }
        // fallthrough
    case WRITING:
        max_bytes = strlen(ctx->buf) - ctx->bytes;
        /* Similarly to reading, a write may not accept all the bytes at
         * once, and we may need to wait for the socket to become writable
         * again. */
        result = write(ctx->fd, ctx->buf + ctx->bytes, max_bytes);
        if (result < 0) {
            if (errno == EAGAIN || errno == EWOULDBLOCK) {
                /* The socket cannot accept more data right now. */
                *events_out = POLLOUT;
                *connection_completed = 0;
                return 0;
            }
            return -1;
        }
        ctx->bytes += result;
        if (result == max_bytes) {
            ctx->state = DONE;
        } else {
            *events_out = POLLOUT;
            *connection_completed = 0;
            break;
        }
        // fallthrough
    case DONE:
        *events_out = 0;
        *connection_completed = 1;
    }

    return 0;
}

void destroy_connection(struct context* ctx)
{
    assert(ctx);
    close(ctx->fd);
    free(ctx);
}

The create_connection() method allocates a context object from the heap and tells the caller which events it waits for (always starting with POLLIN in our application). The most notable fields are the file descriptor used to access the client socket, the buffer that stores the client’s message, and an enumeration describing the state of the connection. destroy_connection() is the mirror image of the constructor, closing the socket and freeing the allocated memory.

The handle_connection() function does most of the heavy lifting. It accepts a context object, does something based on the state of the connection, and signals to the caller what it wants to do next (read from the socket, write to the socket, or finish).

This is similar to the async/await pattern in some high-level languages and asynchronous IO frameworks. The idea in those languages is to use a dedicated keyword to distinguish between regular functions and coroutine functions. Regular functions live in the call stack and complete before the caller can proceed. Coroutine functions return coroutines that are suspended and resumed multiple times before completion. In Python you could write something like this:

async def handle_connection():
    message = await receive_message()
    await send_message(message)

The above function handle_connection(), thanks to the async/await syntax, doesn’t execute immediately but returns a coroutine object. A coroutine object itself has a somewhat cumbersome interface, but luckily Python also provides the asyncio framework that knows how to deal with them. Generally, coroutines are tied to an event loop that knows how to drive them concurrently.

The C programming language does not provide a language-level abstraction for coroutines, so I rolled my own coroutine interface. There is an execution context, including the state field that tells where execution picks up the next time the coroutine is resumed. When the call returns, the caller can figure out from the “out” arguments whether the coroutine has finished, and if not, when it should be resumed. As per common C programming practice, the return value of the function itself is reserved for an error code.
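
Distilled from handle_connection(), the switch-based coroutine pattern looks something like this (the names here are illustrative, not from the actual source):

enum coro_state { STEP_ONE, STEP_TWO, FINISHED };

struct coro {
    enum coro_state state;
    /* ...plus whatever data must survive across suspensions... */
};

/* Returns 1 when the coroutine has finished, 0 when it suspended itself and
 * should be resumed later. */
int coro_resume(struct coro* c)
{
    switch (c->state) {
    case STEP_ONE:
        /* ...do some work... */
        c->state = STEP_TWO;
        return 0; /* suspend; resume from STEP_TWO next time */
    case STEP_TWO:
        /* ...do the remaining work... */
        c->state = FINISHED;
        /* fallthrough */
    case FINISHED:
        break;
    }
    return 1;
}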

Blocking IO versus non-blocking IO

Note how the non-blocking reads and writes differ from their blocking counterparts. A non-blocking read() returns whatever data is currently available, which may be less than requested, and fails with EAGAIN if there is nothing to read at all. This isn’t fundamentally different from a blocking read(), which may also return fewer bytes than requested simply because the message is shorter than the buffer. So handle_connection() keeps on reading until a newline is encountered before entering the writing state.
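
The possible outcomes are worth spelling out. Here is an illustrative helper (not part of the server) that classifies them:

#include <errno.h>
#include <unistd.h>

/* Returns the number of bytes read, 0 on end of stream, -1 with errno set. */
ssize_t try_read(int fd, char* buf, size_t count)
{
    ssize_t result = read(fd, buf, count);
    if (result > 0) {
        /* Got some bytes, possibly fewer than requested. */
    } else if (result == 0) {
        /* End of stream: the peer has closed the connection. */
    } else if (errno == EAGAIN || errno == EWOULDBLOCK) {
        /* Nothing to read right now: wait for POLLIN and try again. */
    } else {
        /* A real error. */
    }
    return result;
}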

We need to be even more careful with non-blocking writes. The user of a blocking write() would expect that every byte requested is eventually accepted. This isn’t the case with a non-blocking write(), which may accept anything between 0 and N bytes when N bytes are requested. So the connection remains in the writing state until all the bytes have been written.
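
For contrast, a caller using a blocking socket would typically wrap write() in a loop like the illustrative helper below. In the non-blocking design, the same loop is effectively unrolled across multiple handle_connection() calls, with ctx->bytes remembering the progress:

#include <unistd.h>

/* Keep calling write() until the whole buffer has been accepted. Even a
 * blocking write() may accept fewer bytes than requested, for example when
 * interrupted by a signal after a partial write. */
ssize_t write_all(int fd, const char* buf, size_t count)
{
    size_t written = 0;
    while (written < count) {
        ssize_t result = write(fd, buf + written, count - written);
        if (result < 0) {
            return -1;
        }
        written += (size_t)result;
    }
    return (ssize_t)written;
}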

Alternative APIs for non-blocking IO

The main loop in this example was written around the poll() system call. If we had to be prepared to handle tens of thousands of concurrent connections, scanning through the entire list of events wouldn’t scale. That’s why there are more sophisticated variants of poll() that require a constant amount of work from the userspace application to detect the ready file descriptors. In Linux, the API is called epoll.
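
A minimal sketch of the same kind of event loop built around epoll (illustrative, not part of the server; error handling omitted):

#include <sys/epoll.h>

void run_event_loop(int server_fd)
{
    struct epoll_event ev, events[16];
    int i, n;
    int epoll_fd = epoll_create1(0);

    ev.events = EPOLLIN;
    ev.data.fd = server_fd;
    epoll_ctl(epoll_fd, EPOLL_CTL_ADD, server_fd, &ev);

    while (1) {
        /* Unlike poll(), only the ready descriptors are returned. */
        n = epoll_wait(epoll_fd, events, 16, -1);
        for (i = 0; i < n; ++i) {
            /* events[i].data.fd identifies the ready descriptor; dispatch
             * to accept() or handle_connection() accordingly. */
        }
    }
}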

There are also separate asynchronous APIs that do not use the read() and write() system calls. The POSIX AIO API offloads the reads and the writes to the kernel. With the AIO API, an application can request to be notified of a completed call via a signal or a callback invoked in an unspecified thread, or it can ignore the notification altogether.
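
For illustration, submitting a read with POSIX AIO could look roughly like this (a sketch: the program must also install a handler for the chosen signal, and older glibc versions require linking with -lrt):

#include <aio.h>
#include <signal.h>
#include <string.h>

/* Ask the kernel to read from fd into buf, raising SIGUSR1 on completion. */
void submit_read(int fd, char* buf, size_t count)
{
    static struct aiocb cb;

    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;
    cb.aio_buf = buf;
    cb.aio_nbytes = count;
    cb.aio_sigevent.sigev_notify = SIGEV_SIGNAL;
    cb.aio_sigevent.sigev_signo = SIGUSR1;

    aio_read(&cb); /* returns immediately; the kernel performs the read */
    /* Later, after the notification: aio_return(&cb) yields the result. */
}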