Redis源码 - Eviction策略

本文主要想聊聊防止内存溢出的数据淘汰相关机制

基于redis 4.x版本source code

淘汰工作机制

触发时机

当在redis.conf文件中设置'maxmemory'来限制服务器使用的最大内存时

freeMemoryIfNeeded()函数会在处理命令之前被调用

freeMemoryIfNeeded函数的目标是释放足够的内存，使Redis保持在配置的内存限制下

工作机制

freeMemoryIfNeeded函数计算应该释放多少字节以使Redis保持在限制之下

并进入循环，根据配置的策略选择最佳的键来驱逐

如果释放了所有需要的字节以返回到限制以下，函数返回C_OK，否则返回C_ERR，并且阻止执行会导致服务器使用更多内存的命令，如Set command

什么样的操作可以阻止freeMemoryIfNeeded执行呢？

Lua 脚本在超时运行 server.lua_timedout
Redis server 正在加载数据 server.loading
没有设置maxmemory server.maxmemory

淘汰策略

当 redis 超出了内存的使用限制 maxmemory，server在处理命令时会触发 redis 内部的数据淘汰机制。淘汰目标数据一共有两种：

数据库所有(key-value)数据。
数据库所有被设置了过期时间的(key-value)数据

针对这两个目标，有几个相应的策略：

随机淘汰
先淘汰到期或快到期数据
近似 LRU 算法
近似 LFU 算法

Redis config 配置：

noeviction: New values aren’t saved when memory limit is reached. When a database uses replication, this applies to the primary database
allkeys-lru: Keeps most recently used keys; removes least recently used (LRU) keys
allkeys-lfu: Keeps frequently used keys; removes least frequently used (LFU) keys
volatile-lru: Removes least recently used keys with the expire field set to true.
volatile-lfu: Removes least frequently used keys with the expire field set to true.
allkeys-random: Randomly removes keys to make space for the new data added.
volatile-random: Randomly removes keys with expire field set to true.
volatile-ttl: Removes keys with expire field set to true and the shortest remaining time-to-live (TTL) value.

如果选择volatile-lru, volatile-lfu, volatile-random, and volatile-ttl 策略，当没有key被eviction时，表现与noeviction一致

noeviction

noeviction 配置允许我们不使用淘汰策略淘汰数据

Redis允许读，但禁止大部分写命令，返回 oomerr 错误

只有少数写命令可以执行，例如删除命令 del，hdel，unlink 这些能降低内存使用的写命令

随机淘汰

volatile-random，allkeys-random 这两个随机淘汰机制相对比较简单，随机从库中挑选数据进行淘汰

 /* volatile-random and allkeys-random policy */
        else if (server.maxmemory_policy == MAXMEMORY_ALLKEYS_RANDOM ||
                 server.maxmemory_policy == MAXMEMORY_VOLATILE_RANDOM)
        {
            /* When evicting a random key, we try to evict a key for
             * each DB, so we use the static 'next_db' variable to
             * incrementally visit all DBs. */
            for (i = 0; i < server.dbnum; i++) {
                j = (++next_db) % server.dbnum;
                db = server.db+j;
                dict = (server.maxmemory_policy == MAXMEMORY_ALLKEYS_RANDOM) ?
                        db->dict : db->expires;
                if (dictSize(dict) != 0) {
                    de = dictGetRandomKey(dict);
                    bestkey = dictGetKey(de);
                    bestdbid = j;
                    break;
                }
            }
        }


/* Return a random entry from the hash table. Useful to
 * implement randomized algorithms */
dictEntry *dictGetRandomKey(dict *d)
{
    ......
    if (dictIsRehashing(d)) {
        do {
            /* We are sure there are no elements in indexes from 0
             * to rehashidx-1 */
            h = d->rehashidx + (random() % (d->ht[0].size +
                                            d->ht[1].size -
                                            d->rehashidx));
            he = (h >= d->ht[0].size) ? d->ht[1].table[h - d->ht[0].size] :
                                      d->ht[0].table[h];
        } while(he == NULL);
    } else {
        do {
            h = random() & d->ht[0].sizemask;
            he = d->ht[0].table[h];
        } while(he == NULL);
    }
    .....
    return he;
}

采样机制

在谈其他淘汰策略之前，先了解下redis的采样

redis 作为一个内存数据库，里面保存了大量数据，可以根据到期时间（ttl），lru 或 lfu 进行数据淘汰，严格来说，需要维护一些数据结构才能准确筛选出目标数据。比如，以LRU为例：

需要为所有的数据维护一个链表
每当有新数据插入或是现有数据被再次访问时，需要执行多次链表操作

而外增加了操作和内存空间

同时，在合理使用redis前提下， maxmemory 触发的概率并不是很高，尤其是数据比较少的情况，有可能永远不会触发

为了一个概率低的场景去维护一些数据结构，带来了复杂和浪费。所以 redis 通过采样的方法，近似的数据淘汰策略😄-值得学习

采样方法：遍历数据库，每个数据库随机采集maxmemory_samples个样本，放进一个样本池中（数组）。样本池中的样本 idle 值从低到高排序。数据淘汰策略将会每次淘汰 idle 最高的那个数据。因为样本池大小是有限制的（EVPOOL_SIZE），所以采集的样本要根据自己的 idle 值大小或池中是否有空位来确定是否能成功插入到样本池中。如果池中没有空位或被插入样本的idle 值都小于池子中的数据，那插入将会失败。所以池子中一直存储着idle最大，最大几率被淘汰的那些数据样本

关于 Redis LRU 算法可以通过更改每次驱逐检查的样本数量来调整算法的精度。这个参数由以下配置指令控制： maxmemory-samples 5

注意，Redis 近似LRU只是一个模型，用于预测给定键在将来被访问的可能性。此外，如果您的数据访问模式与幂律（power law）非常相似，那么大多数访问将在LRU近似算法能够很好处理的键集合中

在模拟中，我们发现使用幂律访问模式时，真正的LRU和Redis近似之间的差异微乎其微或不存在。

然而，您可以将样本大小提高到10，以牺牲一些额外的CPU使用量，以更接近真正的LRU，并检查这是否会影响到您的缓存未命中率

/**
为了提高LRU近似的质量，在freeMemoryIfNeeded()调用中采用一组候选sample用于驱逐的键。

驱逐pool中的条目按空闲时间排序，将较长的空闲时间放在右侧（升序）。

当使用LFU策略时，使用反向频率指示而不是空闲时间，这样我们仍然根据较大的值进行驱逐（较大的逆频率意味着驱逐访问频率最低的键）。

空条目的键指针设置为NULL
 */
#define EVPOOL_SIZE 16
#define EVPOOL_CACHED_SDS_SIZE 255
struct evictionPoolEntry {
    unsigned long long idle;    /* Object idle time (inverse frequency for LFU) */
    sds key;                    /* Key name. */
    sds cached;                 /* Cached SDS object for key name. */
    int dbid;                   /* Key DB number. */
};

👆是evictionPoolEntry数据结构

具体采样过程： dictGetSomeKeys 从sampledict中随机选取key至samples 最终插入至evictionPoolEntry

void evictionPoolPopulate(int dbid, dict *sampledict, dict *keydict, struct evictionPoolEntry *pool) {
    int j, k, count;
    dictEntry *samples[server.maxmemory_samples];

    count = dictGetSomeKeys(sampledict,samples,server.maxmemory_samples);
    for (j = 0; j < count; j++) {
        ...
        if (server.maxmemory_policy & MAXMEMORY_FLAG_LRU) {
            // lru 近似算法，淘汰长时间没有使用的数据。
            idle = estimateObjectIdleTime(o);
        } else if (server.maxmemory_policy & MAXMEMORY_FLAG_LFU) {
            // 淘汰使用频率比较小的数据。
            idle = 255-LFUDecrAndReturn(o);
        } else if (server.maxmemory_policy == MAXMEMORY_VOLATILE_TTL) {
            // 淘汰最快过期数据。
            idle = ULLONG_MAX - (long)dictGetVal(de);
        } else {
            serverPanic("Unknown eviction policy in evictionPoolPopulate()");
        }
        // 将采集的 key，填充到 pool 数组中
        // 在 pool 数组中，寻找合适到位置。pool[k].key == NULL 或者 idle < pool[k].idle
        k = 0;
        while (k < EVPOOL_SIZE &&
               pool[k].key &&
               pool[k].idle < idle) k++;

        if (k == 0 && pool[EVPOOL_SIZE-1].key != NULL) {
            // pool 已满，当前采样没能找到合适位置插入。
            continue;
        } else if (k < EVPOOL_SIZE && pool[k].key == NULL) {
            // 找到合适位置插入，不需要移动数组其它元素。
        } else {
            // 找到数组中间位置，需要移动数据。
            if (pool[EVPOOL_SIZE-1].key == NULL) {
                // 数组还有空间，数据从插入位置向右移动。
                sds cached = pool[EVPOOL_SIZE-1].cached;
                memmove(pool+k+1,pool+k,
                    sizeof(pool[0])*(EVPOOL_SIZE-k-1));
                pool[k].cached = cached;
            } else {
                // 数组右边已经没有空间，那么删除 idle 最小的元素。
                k--;
                sds cached = pool[0].cached;
                if (pool[0].key != pool[0].cached) sdsfree(pool[0].key);
                memmove(pool,pool+1,sizeof(pool[0])*k);
                pool[k].cached = cached;
            }
        }

        // 内存的分配和销毁开销大，pool 缓存空间比较小的 key，方便内存重复使用。
        int klen = sdslen(key);
        if (klen > EVPOOL_CACHED_SDS_SIZE) {
            pool[k].key = sdsdup(key);
        } else {
            memcpy(pool[k].cached,key,klen+1);
            sdssetlen(pool[k].cached,klen);
            pool[k].key = pool[k].cached;
        }
        pool[k].idle = idle;
        pool[k].dbid = dbid;
    }
}

淘汰TTL数据

redisDb 用 expires 字典保存了 key 对应的过期时间

typedef struct redisDb {
    dict *dict;                 /* The keyspace for this DB */
    dict *expires;              /* Timeout of keys with a timeout set */
    ...
} redisDb;

dictGetSomeKeys 方法从expires获取要淘汰的key进行淘汰

计算idle的方式： ULLONG_MAX - (long)dictGetVal(de);

dictGetVal(de) 时间越小，越快到期，idle 就越大，越容易从样本池中淘汰

LRU淘汰

volatile-ttl 淘汰最快到期的数据，存在缺陷：有可能把活跃的数据先淘汰了。

所以可以采用 allkeys-lru 和 volatile-lru 策略，根据当前时间与上一次访问的时间间隔，间隔越小说明越活跃。通过采样，用近似 lru 算法淘汰那些没有使用的数据

#define LRU_BITS 24
#define LRU_CLOCK_MAX ((1<<LRU_BITS)-1) /* Max value of obj->lru */
#define LRU_CLOCK_RESOLUTION 1000 /* LRU clock resolution in ms */

typedef struct redisObject {
    unsigned type:4;
    unsigned encoding:4;
    unsigned lru:LRU_BITS; /* LRU time (relative to global lru_clock) or
                            * LFU data (least significant 8 bits frequency
                            * and most significant 16 bits access time). */
    int refcount;
    void *ptr;
} robj;

redisObject 成员 lru 保存了一个 24 bit 的系统访问数据时间戳。保存 lru 时间精度是秒
创建对象时给予LRU 时间戳

robj *createObject(int type, void *ptr) {
    robj *o = zmalloc(sizeof(*o));
    ...
    //如果缓存替换策略是LFU，那么将lru变量设置为LFU的计数值
    if (server.maxmemory_policy & MAXMEMORY_FLAG_LFU) {
        o->lru = (LFUGetTimeInMinutes()<<8) | LFU_INIT_VAL;
    } else {
        o->lru = LRU_CLOCK();   //否则，调用LRU_CLOCK函数获取LRU时钟值
    }
    return o;
}

访问对应数据时，更新 lru 时间

/* Low level key lookup API, not actually called directly from commands
 * implementations that should instead rely on lookupKeyRead(),
 * lookupKeyWrite() and lookupKeyReadWithFlags(). */
robj *lookupKey(redisDb *db, robj *key, int flags) {
    dictEntry *de = dictFind(db->dict,key->ptr);
    if (de) {
        robj *val = dictGetVal(de);

        /* Update the access time for the ageing algorithm.
         * Don't do it if we have a saving child, as this will trigger
         * a copy on write madness. */
        if (server.rdb_child_pid == -1 &&
            server.aof_child_pid == -1 &&
            !(flags & LOOKUP_NOTOUCH))
        {
            if (server.maxmemory_policy & MAXMEMORY_FLAG_LFU) {
                unsigned long ldt = val->lru >> 8;
                unsigned long counter = LFULogIncr(val->lru & 255);
                val->lru = (ldt << 8) | counter;
            } else {
                val->lru = LRU_CLOCK();//更新时间
            }
        }
        return val;
    } else {
        return NULL;
    }
}

近似 lru 淘汰长时间没使用数据 dictGetSomeKeys 方法从下面的dict获取要淘汰的key进行淘汰，放置Pool

volatile-lru： redisDb->expires
allkeys-lru： redisDb->dict

if (server.maxmemory_policy & MAXMEMORY_FLAG_LRU) {
            // lru 近似算法，淘汰长时间没有使用的数据。
            idle = estimateObjectIdleTime(o);

/* Given an object returns the min number of milliseconds the object was never
 * requested, using an approximated LRU algorithm. */
unsigned long long estimateObjectIdleTime(robj *o) {
    unsigned long long lruclock = LRU_CLOCK();
    if (lruclock >= o->lru) {
        return (lruclock - o->lru) * LRU_CLOCK_RESOLUTION;
    } else {
        return (lruclock + (LRU_CLOCK_MAX - o->lru)) *
                    LRU_CLOCK_RESOLUTION;
    }
}
/* This function is used to obtain the current LRU clock.
 * If the current resolution is lower than the frequency we refresh the
 * LRU clock (as it should be in production servers) we return the
 * precomputed value, otherwise we need to resort to a system call. */
unsigned int LRU_CLOCK(void) {
    unsigned int lruclock;
    if (1000/server.hz <= LRU_CLOCK_RESOLUTION) {
        atomicGet(server.lruclock,lruclock);//1s内无需重新计算，精度1s，减少系统调用
    } else {
        lruclock = getLRUClock();
    }
    return lruclock;
}
unsigned int getLRUClock(void) {
    return (mstime()/LRU_CLOCK_RESOLUTION) & LRU_CLOCK_MAX;
}

idle 越大越先被淘汰

近似LRU缺点：

并非绝对精确，某些更长时间未使用的key 可能更晚时间淘汰
软件研发还是一个tradeoff过程，达到目标即可

近似LFU淘汰

👆的LRU算法还有一个缺点：

B 的访问频率明显要比 A 高，显然 B 要比 A 热度更高。然而 lru 算法会把 B 数据淘汰掉

    ~~~~~A~~~~~A~~~~~A~~~~A~~~~~A~~~~~A~~|  
    ~~B~~B~~B~~B~~B~~B~~B~~B~~B~~B~~B~~B~|  
    ~~~~~~~~~~C~~~~~~~~~C~~~~~~~~~C~~~~~~|  
    ~~~~~D~~~~~~~~~~D~~~~~~~~~D~~~~~~~~~D|

redis 又引入了一种新的算法，近似 lfu 算法，反映数值访问频率，也就是数据访问热度。它重复利用了 redisObject 结构 lru 成员

unsigned lru:LRU_BITS 24位

      16 bits      8 bits
      +----------------+--------+
      + Last decr time | LOG_C  |
      +----------------+--------+

前 16 bits 用来存储上一个访问衰减时间（ldt），后 8 bits 用来存储衰减计数频率(counter)

那衰减时间和计数到底有什么用呢？其实是在一个时间段内，访问频率越高，计数counter就越大（计数最大值为 255）。我们通过计数的大小判断数据的热度idle

赋值过程如下：

创建对象 o->lru 精确至分钟，返回16bits值

robj *createObject(int type, void *ptr) {
    ......
    if (server.maxmemory_policy & MAXMEMORY_FLAG_LFU) {
        o->lru = (LFUGetTimeInMinutes()<<8) | LFU_INIT_VAL;
    }
    ......
}
/* Return the current time in minutes, just taking the least significant
 * 16 bits. The returned time is suitable to be stored as LDT (last decrement
 * time) for the LFU implementation. */
unsigned long LFUGetTimeInMinutes(void) {
    return (server.unixtime/60) & 65535;
}

访问触发频率更新，更新 lfu 数据，时间和计数的结合val->lru =(ldt << 8) | counter

/* Low level key lookup API, not actually called directly from commands
 * implementations that should instead rely on lookupKeyRead(),
 * lookupKeyWrite() and lookupKeyReadWithFlags(). */
robj *lookupKey(redisDb *db, robj *key, int flags) {
    .....
            if (server.maxmemory_policy & MAXMEMORY_FLAG_LFU) {
                unsigned long ldt = val->lru >> 8;
                unsigned long counter = LFULogIncr(val->lru & 255);
                val->lru = (ldt << 8) | counter;
            } 
    .....
}

/* Logarithmically increment a counter. The greater is the current counter value
 * the less likely is that it gets really implemented. Saturate it at 255. */
uint8_t LFULogIncr(uint8_t counter) {
    if (counter == 255) return 255;//最大值255
    double r = (double)rand()/RAND_MAX;
    double baseval = counter - LFU_INIT_VAL;
    if (baseval < 0) baseval = 0;
    double p = 1.0/(baseval*server.lfu_log_factor+1);
    if (r < p) counter++;
    return counter;
}

LFULogIncr计数器统计访问频率

这里面没有使用常规原子计算，而是使用概率计算
当数据被访问次数越多，那么随机数落在某个数据段的概率就越大。计数增加的可能性就越高。 redis 作者添加了控制因子 lfu_log_factor，当因子越大，那计数增长速度就越缓慢
redis.conf 配置lfu_log_factor
- lfu-log-factor 10
- 衰减时间是显而易见的，它是一个计数器在被采样后发现比该值更老时应该衰减的分钟数。特殊值0表示：我们永远不会衰减计数器 -->
- 计数器lfu-log-factor改变了饱和频率计数器所需的命中次数，该计数器的范围仅在0-255之间
- 当factor因子越大，计数增长速度就越缓慢

近似 lfu 淘汰使用频率比较低的数据

void evictionPoolPopulate(int dbid, dict *sampledict, dict *keydict, struct evictionPoolEntry *pool) {
    ...
    if (server.maxmemory_policy & MAXMEMORY_FLAG_LRU) {
        // 淘汰使用频率比较小的数据。
        idle = 255-LFUDecrAndReturn(o);
    }
    ...
}
/* If the object decrement time is reached, decrement the LFU counter and
 * update the decrement time field. Return the object frequency counter.
 *
 * This function is used in order to scan the dataset for the best object
 * to fit: as we check for the candidate, we incrementally decrement the
 * counter of the scanned objects if needed. */
#define LFU_DECR_INTERVAL 1
unsigned long LFUDecrAndReturn(robj *o) {
    unsigned long ldt = o->lru >> 8;
    unsigned long counter = o->lru & 255;
    if (LFUTimeElapsed(ldt) >= server.lfu_decay_time && counter) {
        if (counter > LFU_INIT_VAL*2) {
            counter /= 2;
            if (counter < LFU_INIT_VAL*2) counter = LFU_INIT_VAL*2;
        } else {
            counter--;
        }
        o->lru = (LFUGetTimeInMinutes()<<8) | counter;
    }
    return counter;
}
/* Given an object last decrement time, compute the minimum number of minutes
 * that elapsed since the last decrement. Handle overflow (ldt greater than
 * the current 16 bits minutes time) considering the time as wrapping
 * exactly once. */
unsigned long LFUTimeElapsed(unsigned long ldt) {
    unsigned long now = LFUGetTimeInMinutes();
    if (now >= ldt) return now-ldt;
    return 65535-ldt+now;
}

LFUDecrAndReturn 衰减计数
- LFUTimeElapsed 值越大，counter 就越小。也就是说，两次访问的时间间隔越大，计数的递减就越厉害
- 这个递减速度会受到衰减时间因子（lfu_decay_time）影响。可以在redis.conf文件中调节，默认为 1

总结

采样近似淘汰策略，巧妙避免了维护额外的数据结构，达到差不多的效果，这个思路值得学习
采样算法，根据样本的 idle 值进行数据淘汰，所以当我们采用一种采样算法时，不要密集地设置大量相似的 idle 数据，否则效率也是很低的
- 比如设置ttl，可以固定time+random随机几秒
当redis 内存到达 maxmemory，触发了数据淘汰，但是多次操作后，内存始终无法成功降到阈值以下，那么大部分写命令将会被禁止执行
如果发生大量淘汰的情况，那么处理客户端请求就会发生延迟，影响性能
键值对的 LRU 时钟值，不是直接通过调用 getLRUClock 函数（syscall）。 Redis 这种对性能要求极高的数据库，用一个定时任务（serverCron 函数），以固定频率触发系统调用获取机器时钟，然后把机器时钟挂到 server 的全局变量下，这相当于维护了一个本地缓存，当需要获取时钟时，直接从全局变量获取即可，节省了大量的系统调用开销
Redis 计算实例内存时，不会把主从复制的缓冲区计算在内，因此Redis进程需要更多的内存，maxmemory不等于redis的实际内存使用情况。仔细合理设置maxmemory，留有空间。

参考

redis文档

morRandom notes on improving the Redis LRU algorithm

Redis source code