# Understanding The Memcached Source Code-Event Driven II

slab allocator (I, II, III) is the core module of the cache system, which largely determines how efficient the bottleneck resource, memory, can be utilized. The other 3 parts, namely,

LRU algorithm (I , II , III) for entry expiration; and an

event driven model (I , II - this article , III) based on libevent; and the

consistent harsh (not complete) for data distribution,

are built around it.

In classic multithreading, blocking I/O operations constrain the maximum number of requests a server can handle. Hence asynchronous event driven model is used to eliminate the throughput bottleneck. As such, the synchronous and potentially slow process is divided into logic segments that are free of I/O, and are executed asynchronously.

When it comes to asynchronization, extra space is required to store contexts. This is because the logic segments, that could be associated with different sessions, are executed in an interleaved way. For instance, in the case when asynchronization is implemented (emulated) using synchronous multithreading, the “extra space” is in the form of thread stack. Whilst contexts are maintained in user land in event driven.

conn is the representative of those contexts in Memcached.

# Core data structure - conn

## Properties in discussion

fd - the file descriptor a event is rooted. used by last post

state - the main focus of this post

rbuf - read buffer address. used by try_read_network

rcurr - address of unprocessed data. used by try_read_network

rsize - current size of the read buffer. used by try_read_network

rbytes - size of data to be processed (it is also used as an indicator for leftover data in various places). initialised by try_read_network, updated by process_get_command, used by try_read_command

last_cmd_time - updated when start processing a command. used by try_read_network

ilist - the item list that is associated with the context; icurr and ileft indicate the current entry and number of entries left. used by process_get_command, conn_release_items

iov - the actual storage for pointers of output data, which is used by msglist; iovsize, and iovused are its allocated size and used size respectively. initialised by process_command, used by add_iov, ensure_iov_space

Here the data structures (struct msghdr and struct iovec) is required by sendmsg. The relevant text about the API is pasted bellow.

The msg_iov and msg_iovlen fields of message specify zero or more buffers containing the data to be sent. msg_iov points to an array of iovec structures; msg_iovlen shall be set to the dimension of this array. In each iovec structure, the iov_base field specifies a storage area and the iov_len field gives its size in bytes. Some of these sizes can be zero. The data from each storage area indicated by msg_iov is sent in turn.

msglist - the list that stores the struct msghdr themselves; msgsize and msgused are its allocated size and used size respectively; msgbytes indicates totall size of the output data size; msgcurr points to the index that has been processed (written).

Yet nothing is bette than a chart to demonstrate the data structures and the layout in memory.

## State switch

An event triggers cascading changes of states which in turn invokes various procedures, before drive machine relinquishes control and waits for a new event arrival. In last post, we have seen this process on dispatch thread, in which

1) conn_listening is triggered by a new connection;

2) dispatch_conn_new is invoked, which transfer the new accepted fd, as well as succeeding events to one of the worker threads;

3) dispatch thread gives up CPU and waits for new “new connection” events.

In this post, we are going to see more complex state switches that effectively link together the procedures we discussed in LRU III,

The state of a given session is represented by conn.state of its associated context.

and this time we are going to adopt a similar approach as LRU III, i.e., sending telnet commands to a Memcached instance, to navigate the outermost layer of the Memcached application.

We will also switch ON the convenient verbose with -vvv to better observe the internal state transitions.

Firstly (as usual) we telnet to the Memcached instance, and add some items

Here 36 is the accepted fd. As mentioned, the following operations will be on this fd.

Next we send the exact same read command to the Memcached instance as in LRU III

As mentioned in last post, the initial state of worker threads is

## conn_new_cmd

so we get started from here.

1) nreqs (settings.reqs_per_event) is the maximum requests one event loop iteration should handle. Note that the threshold is needed because new connections will not be handled if one event loop iteration takes too long to complete. Note also that the connection being interrupted will be fired again and get the chance to enter the drive machine with a “read” event since the descriptor is set with EV_PERSIST in the last post.

2) Initializes the relevant properties in the context for a new command.

2a) If there are leftover data, then switch to conn_parse_cmd directly.

2b) If it is a fresh session, then switch to conn_waiting.

3) Yield the current iteration when the threshold is reached.

## conn_waiting

Simply reset the descriptor with the original flags (i.e., EV_READ, EV_PERSIST), update the state of the context to the next hop (conn_read), and relinquish the CPU.

1) Read from the file descriptor and save the data to the context.

2) Switch to the next state, conn_parse_cmd.

Here the while (1) is used to handle logic flow for buffer expanding instead of loop.

1) Move rcurr to the beginning of the read buffer.

2) If the data size exceeds the read buffer size, try expanding the buffer (for at most 4 times).

3) Calculate the available buffer space for reading from the socket, and update rbytes accordingly.

3a) Goto 2) if the buffer is full.

3b) Return READ_DATA_RECEIVED, which switches the state to conn_parse_cmd in the state machine pass through.

## conn_parse_cmd

1) Determine the protocol type, in this case is ascii_prot.

~) Verbose message we saw in the beginning.

2) Trim all the '\n' and '\r' in the end, store the position of the command last character to el, and store the command end to cont.

3) Update last_cmd_time.

4) Call process_command which locates the “get” command and call process_get_command. In process_command, a) tokenize_command is a string parsing method that stores command (i.e., “get”) in tokens[COMMAND_TOKEN] and key (i.e., test) in tokens[KEY_TOKEN]; b) initialization of msgcurr, msgused, iovused; c) initialization other fields in add_msghdr; and d) process_get_command is the next step.

5) Update rbytes with the length of the command that has been processed (cont - c->rcurr).

6) Move the rcurr to the unprocessed data located at end of the command portion.

Before the logic reaches process_get_command, an entry should be initialised in msglist for the current command.

1) Expand the msglist when required.

2) Point to the next empty entry in msglist with msg.

3) Initialise the entry pointed by msg. Here the critical operation is msg->msg_iov = &c->iov[c->iovused]; which links the msglist to the specific entry in iov. (Figure - msglist & iov)

4) Initialise msgbytes to 0.

5) Update msgused accordingly.

### process_get_command

We have seen this method in beginning of LRU III. This time, we will complete its pass through with the context of event driven.

1) Iterate through key token array. Here we got one key token ‘test’.

2) Call item_get for the item pointer.

3) Increase the ilist size if it is full, and . Here ilist stores the item being processed. In the end of the current command processing, this list is used to batch release the items reference counts.

4) add_iov prepares the output of this session.

5) Call item_update to manipulate the LRU lists.

6) Link the item currently being processed to ilist, and update the associated fields.

7) Move on to the next state conn_mwrite.

This method initialised an entry on iov list and add it to the last in-use item in msglist (Figure - msglist & iov).

1) Get the tail of the in use portion of msglist.

2) Expend iov if necessary.

3) Initialize the iov_base and iov_len fields within the iov entry. Note that the msg_iov has been linked to the position of specific entry in iov, hence operations on msg_iov change the content of iov as well.

4) Update msgbytes with the total item size.

5, 6) Update iovused and msg_iovlen accordingly.

7) Handle MTU with the assistance of do while loop.

## conn_mwrite

Before explaining the logic process of conn_mwrite state, we look at the essential within first, which is

### transmit

As the essential method of state conn_mwrite processing, transmit goes through the msglist (starting from 0, the initial value) and tries its best to send out all the pending data accumulated in the current session. This is done within itself or in subsequent passes through the event loop. Only when blocking operation is indicated by EAGAIN or EWOULDBLOCK, the state machine stops the current event loop iteration, and the same session will be resumed when the buffer space becomes available again.

1) If the msg_iovlen is 0, the writing of msgcurr slot has finished, hence move to the next slot.

2) Call sendmsg and move msg_iov, iov_base and iov_len according to the data length (res) that has been written successfully. This leads to case b) of the state processing.

3) As mentioned, EAGAIN or EWOULDBLOCK returned by sendmsg leads to case c) of state processing.

4) Errors other than the above two lead to case c) of state processing.

5) c->msgcurr >= c->msgused means write of all data of the session finished, which leads to b) of the state processing.

### Back to state processing

According to the result of transmit, the logic flows to the following 3 branches,

a) If the result is TRANSMIT_COMPLETE, 1) finalise the current command processing with conn_release_items; 2) switch the state to conn_new_cmd which 3) eventually falls to conn_waiting and, as discussed, finishes the current event loop.

b) If the result is TRANSMIT_INCOMPLETE and TRANSMIT_HARD_ERROR, the state machine keeps the same state, and the subsequent passes through the event loop continues consuming more data in msglist. Unlike read operation, TRANSMIT_INCOMPLETE does not lead to immediate event loop finish because write operation does not block until buffer is full.

c) TRANSMIT_SOFT_ERROR means the buffer is full, hence finish the current event loop iteration straight away.

1) all items ownership are batched released here at the end the (read) command processing.