Understanding The Memcached Source Code - Consistent Hashing

The slab allocator (I, II, III) is the core module of the cache system; it largely determines how efficiently the bottleneck resource, memory, is utilized. The other three parts, namely,

the LRU algorithm (I, II, III) for entry expiration;

the event driven model (I, II, III) based on libevent; and

consistent hashing (this article) for data distribution,

are built around it.

We have covered most of the mechanisms of memcached itself. We now know that memcached is a pure single-node, in-memory cache system: the instances know nothing about each other. Hence clients need to handle load balancing themselves.

As a load balancing mechanism, plain hashing has an intrinsic problem when applied to the storage layer. As a result, variants such as consistent hashing are used instead. In addition, rendezvous hashing is also used to tackle one drawback of consistent hashing.

This post focuses on consistent hashing.

The problem consistent hashing solves

The problem is complete rehash. In fact, we have already examined this phenomenon in the hash table mechanism. To recap, whenever the capacity of the hash table changes, the positions of all entries become invalid and need to be recalculated. In the context of a distributed cache system, for instance, in a cluster of five nodes, entry A's associated node is calculated as n = Hash(A) % 5. If we add one node to the cluster, the divisor becomes six - Hash(A) % 6. And this change applies to all entries - a complete rehash.
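To see the scale of the damage, here is a small, hypothetical sketch (not memcached code) that counts how many keys would change node when a five-node cluster grows to six:

#include <cstdio>
#include <functional>
#include <string>

// Hypothetical sketch: count how many of 10,000 keys map to a different
// node when a sixth node joins a five-node cluster. With modulo placement,
// the vast majority of keys move.
int main()
{
  std::hash<std::string> hasher;
  int moved = 0;
  const int total = 10000;

  for (int i = 0; i < total; ++i)
  {
    auto h = hasher("key:" + std::to_string(i));
    if (h % 5 != h % 6)   // node index before vs. after adding one node
      ++moved;
  }
  std::printf("%d of %d keys moved (%.0f%%)\n", moved, total, 100.0 * moved / total);
  return 0;
}

Roughly five out of six keys end up on a different node, even though only one node was added.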

Operationally, such a rehash means a complete cool-down of the cache system. Worse, the cool-down is triggered by routine operations such as adding a node (to scale up) or removing a failed node, which happen fairly frequently.

The problem is stated as is. However, it can be inaccurate in practice, because it is not necessary to use the exact number of nodes as the divisor. Rather, we can use a larger divisor and explicitly distribute the load by hash range. Maybe another standalone article.

How consistent hashing solves the problem

In general, server-side design and development favours gradual change and a well-defined boundary of impact. Consistent hashing offers both: only the entries associated with the changed nodes need to be rehashed. How does it work?

Basically, consistent hashing resets the rule that associates the hashing space of entries with nodes. We already know that in the plain hashing approach, that rule is the modulo operation, and we just examined how it causes problems. In consistent hashing, the rule, which is free of complete rehash, is proximity. It works better because, in the event of a capacity change, the relative positions of most hash values (calculated from nodes and entries) stay the same. A picture could spare us some more words; a short code sketch follows the figures.

Initial state

When adding a new node

Rehash when adding a new node

When removing a node

Rehash when removing a node
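Before reading libmemcached's code, here is a minimal, hypothetical sketch of the proximity rule (not libmemcached's implementation): both nodes and keys are hashed onto the same space, and a key belongs to the first node point at or after its hash, wrapping around at the end.

#include <cassert>
#include <cstdint>
#include <functional>
#include <map>
#include <string>

// Minimal consistent-hashing ring, for illustration only.
// Nodes are hashed to points on the ring; a key is owned by the first node
// whose point is >= the key's hash (the proximity rule), wrapping around.
class Ring
{
public:
  void add_node(const std::string &node) { ring_[hash(node)] = node; }
  void remove_node(const std::string &node) { ring_.erase(hash(node)); }

  const std::string &locate(const std::string &key) const
  {
    assert(!ring_.empty());                  // needs at least one node
    auto it = ring_.lower_bound(hash(key));  // first point >= key's hash
    if (it == ring_.end())                   // past the last point: wrap around
      it = ring_.begin();
    return it->second;
  }

private:
  static uint64_t hash(const std::string &s) { return std::hash<std::string>{}(s); }

  std::map<uint64_t, std::string> ring_;     // point -> node, kept sorted
};

When a node is added or removed, only the keys falling between that node's point and its predecessor change owner; everything else keeps its position relative to the remaining points. Real implementations also place many points per node to even out the load, which is what the points_per_server factor in libmemcached's allocation below hints at.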

Next we read some code.

Implementation

This time we read libmemcached, a C/C++ memcached client library. The version in use is 1.0.18.

Let’s navigate directly to the algorithm.

static uint32_t dispatch_host(const Memcached *ptr, uint32_t hash)
{
  switch (ptr->distribution)
  {
  ...
  case MEMCACHED_DISTRIBUTION_CONSISTENT:
  ...
    {
      uint32_t num= ptr->ketama.continuum_points_counter;
      WATCHPOINT_ASSERT(ptr->ketama.continuum);

      memcached_continuum_item_st *begin, *end, *left, *right, *middle;
      begin= left= ptr->ketama.continuum; // ---------------------------> 1)
      end= right= ptr->ketama.continuum + num; // ----------------------> 1)

      while (left < right) // ------------------------------------------> 2)
      {
        middle= left + (right - left) / 2;
        if (middle->value < hash)
          left= middle + 1;
        else
          right= middle;
      }
      if (right == end)
        right= begin;
      return right->index;
    }
  ...
  }
}
dispatch_host@hash.cc

1) Here we only need to know that ketama.continuum is a sorted array of memcached_continuum_item_st, whose values are precalculated when the system initializes and whenever a node is added or removed.


...
new_ptr= libmemcached_xrealloc(ptr, ptr->ketama.continuum,
                               (live_servers + MEMCACHED_CONTINUUM_ADDITION) * points_per_server,
                               memcached_continuum_item_st);
...

hosts.cc:188


...
struct memcached_continuum_item_st
{
  uint32_t index;
  uint32_t value;
};
...

continuum.hh


...
for (uint32_t host_index= 0; host_index < memcached_server_count(ptr); ++host_index)
{
  ...
  uint32_t value= hashkit_digest(&ptr->hashkit, sort_host, (size_t)sort_host_length);
  ptr->ketama.continuum[continuum_index].index= host_index;
  ptr->ketama.continuum[continuum_index++].value= value;
  ...
}
...

hosts.cc:322
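For the binary search in dispatch_host to be valid, the continuum has to end up sorted by value, which libmemcached takes care of after this loop. A hypothetical comparator for such a sort (not the library's exact code; memcached_continuum_item_st is the struct from continuum.hh above) would look like this:

#include <cstdint>
#include <cstdlib>

// Hypothetical comparator: orders continuum points by 'value' ascending,
// suitable for qsort(continuum, count, sizeof(memcached_continuum_item_st), ...).
static int continuum_point_cmp(const void *a, const void *b)
{
  uint32_t left = static_cast<const memcached_continuum_item_st *>(a)->value;
  uint32_t right = static_cast<const memcached_continuum_item_st *>(b)->value;

  if (left < right) return -1;
  if (left > right) return 1;
  return 0;
}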

2) Binary-search for the server associated with the hash: find the first continuum point whose value is not less than the hash, wrapping around to the first point if the hash is larger than every point. Note that the hash is calculated from the key.
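The loop above is a standard lower-bound binary search. Re-expressed with the standard library (a hypothetical equivalent, not the library's code), it reads:

#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical re-expression of the search in dispatch_host.
struct point { uint32_t index; uint32_t value; };  // mirrors memcached_continuum_item_st

uint32_t pick_server(const std::vector<point> &continuum, uint32_t hash)
{
  // First point whose value is >= hash ...
  auto it = std::lower_bound(continuum.begin(), continuum.end(), hash,
                             [](const point &p, uint32_t h) { return p.value < h; });
  // ... wrapping to the first point, just like "if (right == end) right= begin;".
  if (it == continuum.end())
    it = continuum.begin();
  return it->index;  // assumes a non-empty continuum
}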


uint32_t memcached_generate_hash(const memcached_st *shell, const char *key, size_t key_length)
{
  const Memcached* ptr= memcached2Memcached(shell);
  if (ptr)
  {
    return dispatch_host(ptr, _generate_hash_wrapper(ptr, key, key_length));
  }

  return UINT32_MAX;
}

memcached_generate_hash@hash.cc
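For completeness, here is a rough usage sketch of the public API (hypothetical hosts; double-check the behavior flags and signatures against the 1.0.18 headers): the client switches to the consistent distribution, and memcached_generate_hash then returns the index of the server that owns a key.

#include <libmemcached/memcached.h>
#include <cstdio>
#include <cstring>

int main()
{
  memcached_st *memc = memcached_create(NULL);

  // Three hypothetical servers; replace with real hosts.
  memcached_server_add(memc, "10.0.0.1", 11211);
  memcached_server_add(memc, "10.0.0.2", 11211);
  memcached_server_add(memc, "10.0.0.3", 11211);

  // Switch from the default modulo distribution to the consistent one.
  memcached_behavior_set(memc, MEMCACHED_BEHAVIOR_DISTRIBUTION,
                         MEMCACHED_DISTRIBUTION_CONSISTENT);

  const char *key = "user:42";
  uint32_t server = memcached_generate_hash(memc, key, std::strlen(key));
  std::printf("key %s -> server index %u\n", key, server);

  memcached_free(memc);
  return 0;
}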

That's it. Did I make a serious mistake, or miss out on anything important? Or did you simply like the read? Link me on -- I'd be chuffed to hear your feedback.