Before studying jemalloc it helps to know a bit about glibc malloc. jemalloc has no notion of 'unlinking' or 'frontlinking'. It was first used in FreeBSD, and the Firefox browser later adopted it as its memory allocator; jemalloc is open source (source code). glibc malloc is the default allocator on Linux. Although the two come from different systems, they are not completely different and share many similarities; both allocators support multithreading.
jemalloc assumes that items allocated together tend to be used together. It supports SMP systems and concurrent multithreading; that multithreading support relies on multiple 'arenas', and the first time a thread calls into the allocator it is associated with a particular arena.
There are only three possible algorithms for assigning a thread to an arena (a small sketch follows the list):
1. When TLS is available, a hash of the thread ID is used.
2. When TLS is unavailable and MALLOC_BALANCE is defined, a built-in linear congruential pseudo-random number generator is used.
3. Otherwise, the traditional round-robin algorithm is used.
In the latter two cases, the thread-arena association does not remain fixed over the thread's entire lifetime.
jemalloc divides memory into chunks, all of the same size, and all data is stored inside chunks. Chunks are further divided into runs, which serve allocation requests of a given size; a run keeps track of which of its regions are free and which are in use.
1. arena: jemalloc's core allocation management area. On a multi-core system, 4 * cores arenas are created by default, and threads are assigned to arenas in a round-robin fashion for their allocations.
2. chunk: the region where allocation actually takes place; the current default size is 4 MB. A chunk is managed in units of pages (4 KB by default); the first few pages of each chunk (6 by default) record the state of all the following pages (for example, free or already allocated), and the remaining pages are used for the actual allocations.
3. bin: manages allocations of the various size classes; the smallest bin, for example, manages 8-byte allocations, and each bin manages a different, progressively larger size. jemalloc's bins play a role similar to ptmalloc's bins.
4. run: each bin actually performs its allocations by operating on its current run. A run is simply a region inside a chunk whose size is a multiple of the page size, determined by the bin it belongs to; the run of the 8-byte bin, for example, occupies a single page, from which an 8-byte block can be picked and handed out. The run's metadata (for example, how many blocks are still available) is stored at the very beginning of the run.
5. tcache: a thread's private cache, enabled by default. Allocation therefore looks in the tcache first, and only on a miss does it fall through to the normal allocation path.
jemalloc divides allocation sizes into three categories:
Small objects: sizes spaced at 8 bytes, 16 bytes, 32 bytes and so on, all smaller than a page.
Large objects: sizes in multiples of a page, in an arithmetic progression, smaller than a chunk.
Huge objects: sizes that are integer multiples of the chunk size.
Small and large objects are managed by arenas; huge objects are managed by a red-black tree shared between threads.
On a 64-bit system the default partitioning is as follows (a toy classifier follows the list):
Small: [8], [16, 32, 48, …, 128], [192, 256, 320, …, 512], [768, 1024, 1280, …, 3840] (1–57344 bytes, in 44 size classes)
Large: [4 KiB, 8 KiB, 12 KiB, …, 4072 KiB] (58345 bytes–4 MB)
Huge: [4 MiB, 8 MiB, 12 MiB, …] (integer multiples of 4 MB)
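A tiny classifier that mirrors the table above (4 KiB pages, 4 MiB chunks). The constants and function names here are only illustrative; in jemalloc the mapping is generated into size_classes.h and queried through size2index()/s2u().
<pre><code>#include &lt;stddef.h&gt;
#include &lt;stdio.h&gt;

#define PAGE_SZ    4096UL
#define CHUNK_SZ   (4UL * 1024 * 1024)
#define SMALL_MAX  3840UL                    /* last small bin in the table above */

typedef enum { SIZE_SMALL, SIZE_LARGE, SIZE_HUGE } size_category;

static size_category classify(size_t size)
{
	if (size <= SMALL_MAX)
		return SIZE_SMALL;               /* served from a bin's run */
	if (size <= CHUNK_SZ)
		return SIZE_LARGE;               /* whole pages inside a chunk */
	return SIZE_HUGE;                        /* one or more dedicated chunks */
}

/* Large requests are rounded up to a multiple of the page size. */
static size_t large_usable_size(size_t size)
{
	return (size + PAGE_SZ - 1) & ~(PAGE_SZ - 1);
}

int main(void)
{
	printf("%d %d %d\n", classify(16), classify(8192), classify(8u << 20));
	printf("%zu\n", large_usable_size(5000));   /* -> 8192 */
	return 0;
}</code></pre>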
接下來介紹它們之間的關(guān)系。每個(gè)arena有一個(gè)bin數(shù)組,根據(jù)機(jī)器配置不同它的具體結(jié)構(gòu)也不同,由相應(yīng)的size_class.h中的宏定義決定,而每個(gè)bin會(huì)通過它對應(yīng)的正在運(yùn)行的run進(jìn)行操作來進(jìn)行分配的。每個(gè)tcahe有一個(gè)對應(yīng)的arena,它本身也有一個(gè)bin數(shù)組(稱為tbin),前面的部分與arena的bin數(shù)組是對應(yīng)的,但它長度更大一些,因?yàn)樗鼤?huì)緩存一些更大的塊;而且它也沒有對應(yīng)的run的概念,因?yàn)樗蛔鼍彺妫瑀nny它只有一個(gè)avail數(shù)組來存儲(chǔ)被緩存的空間的地址。像筆者機(jī)器上的tcahe在arena最大的3584字節(jié)的bin的基礎(chǔ)上,后面還有8個(gè)bin,分別對應(yīng)4K,8K,12K一直到32K。
這里想重點(diǎn)介紹一下chunk與run的關(guān)系。之前提到chunk默認(rèn)是4M,而run是在chunk中進(jìn)行實(shí)際分配的操作對象,每次有新的分配請求時(shí)一旦tcache無法滿足要求,就要通過run進(jìn)行操作,如果沒有對應(yīng)的run存在就要新建一個(gè),哪怕只分配一個(gè)塊,比如只申請一個(gè)8字節(jié)的塊,也會(huì)生成一個(gè)大小為一個(gè)page(默認(rèn)4K)的run;再申請一個(gè)16字節(jié)的塊,又會(huì)生成一個(gè)大小為4096字節(jié)的run。run的具體大小由它對應(yīng)的bin決定,但一定是page的整數(shù)倍。因此實(shí)際上每個(gè)chunk就被分成了一個(gè)個(gè)的run。
<p>arena:</p><pre><code>struct arena_s {
/* This arena's index within the arenas array. */
unsigned ind;
/*
* Number of threads currently assigned to this arena. This field is
* protected by arenas_lock.
*/
unsigned nthreads;
/*
* There are three classes of arena operations from a locking
* perspective:
* 1) Thread assignment (modifies nthreads) is protected by arenas_lock.
* 2) Bin-related operations are protected by bin locks.
* 3) Chunk- and run-related operations are protected by this mutex.
*/
malloc_mutex_t lock;
arena_stats_t stats;
/*
* List of tcaches for extant threads associated with this arena.
* Stats from these are merged incrementally, and at exit if
* opt_stats_print is enabled.
*/
ql_head(tcache_t) tcache_ql;
uint64_t prof_accumbytes;
/*
* PRNG state for cache index randomization of large allocation base
* pointers.
*/
uint64_t offset_state;
dss_prec_t dss_prec;
/*
* In order to avoid rapid chunk allocation/deallocation when an arena
* oscillates right on the cusp of needing a new chunk, cache the most
* recently freed chunk. The spare is left in the arena's chunk trees
* until it is deleted.
*
* There is one spare chunk per arena, rather than one spare total, in
* order to avoid interactions between multiple threads that could make
* a single spare inadequate.
*/
arena_chunk_t *spare;
/* Minimum ratio (log base 2) of nactive:ndirty. */
ssize_t lg_dirty_mult;
/* Number of pages in active runs and huge regions. */
size_t nactive;
/*
* Current count of pages within unused runs that are potentially
* dirty, and for which madvise(... MADV_DONTNEED) has not been called.
* By tracking this, we can institute a limit on how much dirty unused
* memory is mapped for each arena.
*/
size_t ndirty;
/*
* Size/address-ordered tree of this arena's available runs. The tree
* is used for first-best-fit run allocation.
*/
arena_avail_tree_t runs_avail;
/*
* Unused dirty memory this arena manages. Dirty memory is conceptually
* tracked as an arbitrarily interleaved LRU of dirty runs and cached
* chunks, but the list linkage is actually semi-duplicated in order to
* avoid extra arena_chunk_map_misc_t space overhead.
*
* LRU-----------------------------------------------------------MRU
*
* /-- arena ---\
* | |
* | |
* |------------| /- chunk -\
* ...->|chunks_cache|<--------------------------->| /----\ |<--...
* |------------| | |node| |
* | | | | | |
* | | /- run -\ /- run -\ | | | |
* | | | | | | | | | |
* | | | | | | | | | |
* |------------| |-------| |-------| | |----| |
* ...->|runs_dirty |<-->|rd |<-->|rd |<---->|rd |<----...
* |------------| |-------| |-------| | |----| |
* | | | | | | | | | |
* | | | | | | | \----/ |
* | | \-------/ \-------/ | |
* | | | |
* | | | |
* \------------/ \---------/
*/
arena_runs_dirty_link_t runs_dirty;
extent_node_t chunks_cache;
/* Extant huge allocations. */
ql_head(extent_node_t) huge;
/* Synchronizes all huge allocation/update/deallocation. */
malloc_mutex_t huge_mtx;
/*
* Trees of chunks that were previously allocated (trees differ only in
* node ordering). These are used when allocating chunks, in an attempt
* to re-use address space. Depending on function, different tree
* orderings are needed, which is why there are two trees with the same
* contents.
*/
extent_tree_t chunks_szad_cache;
extent_tree_t chunks_ad_cache;
extent_tree_t chunks_szad_mmap;
extent_tree_t chunks_ad_mmap;
extent_tree_t chunks_szad_dss;
extent_tree_t chunks_ad_dss;
malloc_mutex_t chunks_mtx;
/* Cache of nodes that were allocated via base_alloc(). */
ql_head(extent_node_t) node_cache;
malloc_mutex_t node_cache_mtx;
/*
* User-configurable chunk allocation/deallocation/purge functions.
*/
chunk_alloc_t *chunk_alloc;
chunk_dalloc_t *chunk_dalloc;
chunk_purge_t *chunk_purge;
/* bins is used to store trees of free regions. */
arena_bin_t bins[NBINS];
};</code></pre>
<p>The bin structure inside an arena</p><pre><code>struct arena_bin_s {
/*
* All operations on runcur, runs, and stats require that lock be
* locked. Run allocation/deallocation are protected by the arena lock,
* which may be acquired while holding one or more bin locks, but not
* vise versa.
*/
malloc_mutex_t lock;
/*
* Current run being used to service allocations of this bin's size
* class.
*/
arena_run_t *runcur;
/*
* Tree of non-full runs. This tree is used when looking for an
* existing run when runcur is no longer usable. We choose the
* non-full run that is lowest in memory; this policy tends to keep
* objects packed well, and it can also help reduce the number of
* almost-empty chunks.
*/
arena_run_tree_t runs;
/* Bin statistics. */
malloc_bin_stats_t stats;
};</code></pre>
<p>tcache</p><pre><code>struct tcache_s {
ql_elm(tcache_t) link; /* Used for aggregating stats. */
uint64_t prof_accumbytes;/* Cleared after arena_prof_accum(). */
unsigned ev_cnt; /* Event count since incremental GC. */
index_t next_gc_bin; /* Next bin to GC. */
tcache_bin_t tbins[1]; /* Dynamically sized. */
/*
* The pointer stacks associated with tbins follow as a contiguous
* array. During tcache initialization, the avail pointer in each
* element of tbins is initialized to point to the proper offset within
* this array.
*/
};</code></pre>
<p>The bin structure inside a tcache</p><pre><code>struct tcache_bin_s {
tcache_bin_stats_t tstats;
int low_water; /* Min # cached since last GC. */
unsigned lg_fill_div; /* Fill (ncached_max >> lg_fill_div). */
unsigned ncached; /* # of cached objects. */
/*
* To make use of adjacent cacheline prefetch, the items in the avail
* stack goes to higher address for newer allocations. avail points
* just above the available space, which means that
* avail[-ncached, ... -1] are available items and the lowest item will
* be allocated first.
*/
void **avail; /* Stack of available objects. */
};</code></pre>
<p>run</p><pre><code>struct arena_run_s {
/* Index of bin this run is associated with. */
index_t binind;
/* Number of free regions in run. */
unsigned nfree;
/* Per region allocated/deallocated bitmap. */
bitmap_t bitmap[BITMAP_GROUPS_MAX];
};</code></pre>
<p>chunk</p><pre><code>struct arena_chunk_s {
/*
* A pointer to the arena that owns the chunk is stored within the node.
* This field as a whole is used by chunks_rtree to support both
* ivsalloc() and core-based debugging.
*/
extent_node_t node;
/*
* Map of pages within chunk that keeps track of free/large/small. The
* first map_bias entries are omitted, since the chunk header does not
* need to be tracked in the map. This omission saves a header page
* for common chunk sizes (e.g. 4 MiB).
*/
arena_chunk_map_bits_t map_bits[1]; /* Dynamically sized. */
};</code></pre>
Memory allocation
From the point of view of the allocation path, jemalloc requests fall into four classes:
1. small: if the requested size is no larger than the largest small size class handled by the arena's bins (SMALL_MAXCLASS), the allocation goes through the thread's tcache. The size is first mapped to a tbin (a 2-byte request, for example, falls into the smallest 8-byte tbin). If the tbin holds cached space, it is handed out directly; otherwise a run is allocated for the corresponding arena bin, the addresses of some of that run's blocks are pushed into the avail array of the tcache's tbin (effectively caching a batch of 8-byte blocks), and one address is then taken from the avail array and returned;
2. large: if the requested size is larger than SMALL_MAXCLASS but no larger than the largest block the tcache can cache (tcache_maxclass), the allocation also goes through the thread's tcache, but differently: the corresponding tbin is checked for a cached block and, if one exists, it is returned; otherwise a region of the appropriate whole-page size is carved directly out of a chunk (when that region is later freed, it does go into the corresponding tbin of the tcache);
3. large: if the requested size is larger than tcache_maxclass but no larger than the chunk size (4 MB by default), the allocation proceeds as in case 2, except that the tcache is not involved;
4. huge (size > chunk): if the request is larger than the chunk size, it is allocated directly via mmap.
Function call chain: je_malloc (jemalloc.c) -> imalloc_body (jemalloc.c) -> imalloc (jemalloc_internal.h) -> iallocztm (jemalloc_internal.h) -> arena_malloc (arena.h)
<p>Defined in arena.h</p><pre><code>JEMALLOC_ALWAYS_INLINE void *
arena_malloc(tsd_t *tsd, arena_t *arena, size_t size, bool zero,
tcache_t *tcache)
{
assert(size != 0);
arena = arena_choose(tsd, arena);
if (unlikely(arena == NULL))
return (NULL);
if (likely(size <= SMALL_MAXCLASS)) {
if (likely(tcache != NULL)) {
return (tcache_alloc_small(tsd, arena, tcache, size,
zero));
} else
return (arena_malloc_small(arena, size, zero));
} else if (likely(size <= arena_maxclass)) {
/*
* Initialize tcache after checking size in order to avoid
* infinite recursion during tcache initialization.
*/
if (likely(tcache != NULL) && size <= tcache_maxclass) {
return (tcache_alloc_large(tsd, arena, tcache, size,
zero));
} else
return (arena_malloc_large(arena, size, zero));
} else
return (huge_malloc(tsd, arena, size, zero, tcache));
}</code></pre>
If the requested size falls into the small object range and the thread's tcache is available:
<pre><code>JEMALLOC_ALWAYS_INLINE void *
tcache_alloc_easy(tcache_bin_t *tbin)
{
void *ret;
if (unlikely(tbin->ncached == 0)) {
tbin->low_water = -1;
return (NULL);
}
tbin->ncached--;
if (unlikely((int)tbin->ncached < tbin->low_water))
tbin->low_water = tbin->ncached;
ret = tbin->avail[tbin->ncached];
return (ret);
}
JEMALLOC_ALWAYS_INLINE void *
tcache_alloc_small(tsd_t *tsd, arena_t *arena, tcache_t *tcache, size_t size,
bool zero)
{
void *ret;
index_t binind;
size_t usize;
tcache_bin_t *tbin;
binind = size2index(size);
assert(binind < NBINS);
tbin = &tcache->tbins[binind];
usize = index2size(binind);
ret = tcache_alloc_easy(tbin);
if (unlikely(ret == NULL)) {
ret = tcache_alloc_small_hard(tsd, arena, tcache, tbin, binind);
if (ret == NULL)
return (NULL);
}
	/* ......... ( chunk > size > tcache ) */
return (ret);
}
void *
tcache_alloc_small_hard(tsd_t *tsd, arena_t *arena, tcache_t *tcache,
tcache_bin_t *tbin, index_t binind)
{
void *ret;
arena_tcache_fill_small(arena, tbin, binind, config_prof ?
tcache->prof_accumbytes : 0);
if (config_prof)
tcache->prof_accumbytes = 0;
ret = tcache_alloc_easy(tbin);
return (ret);
}
void
arena_tcache_fill_small(arena_t *arena, tcache_bin_t *tbin, index_t binind,
uint64_t prof_accumbytes)
{
unsigned i, nfill;
arena_bin_t *bin;
arena_run_t *run;
void *ptr;
assert(tbin->ncached == 0);
if (config_prof && arena_prof_accum(arena, prof_accumbytes))
prof_idump();
bin = &arena->bins[binind];
malloc_mutex_lock(&bin->lock);
for (i = 0, nfill = (tcache_bin_info[binind].ncached_max >>
tbin->lg_fill_div); i < nfill; i++) {
if ((run = bin->runcur) != NULL && run->nfree > 0)
ptr = arena_run_reg_alloc(run, &arena_bin_info[binind]);
else
ptr = arena_bin_malloc_hard(arena, bin);
if (ptr == NULL) {
/*
* OOM. tbin->avail isn't yet filled down to its first
* element, so the successful allocations (if any) must
* be moved to the base of tbin->avail before bailing
* out.
*/
if (i > 0) {
memmove(tbin->avail, &tbin->avail[nfill - i],
i * sizeof(void *));
}
break;
}
if (config_fill && unlikely(opt_junk_alloc)) {
arena_alloc_junk_small(ptr, &arena_bin_info[binind],
true);
}
/* Insert such that low regions get used first. */
tbin->avail[nfill - 1 - i] = ptr;
}
if (config_stats) {
bin->stats.nmalloc += i;
bin->stats.nrequests += tbin->tstats.nrequests;
bin->stats.curregs += i;
bin->stats.nfills++;
tbin->tstats.nrequests = 0;
}
malloc_mutex_unlock(&bin->lock);
tbin->ncached = i;
}</code></pre>
tcache_alloc_easy hands out memory already cached in the tcache; if that fails, tcache_alloc_small_hard is called to allocate from the corresponding run in the arena, and the blocks obtained are stored in the tcache's avail array, effectively refilling the tcache.
<pre><code>/* Re-fill bin->runcur, then call arena_run_reg_alloc(). */
static void *
arena_bin_malloc_hard(arena_t *arena, arena_bin_t *bin)
{
void *ret;
index_t binind;
arena_bin_info_t *bin_info;
arena_run_t *run;
binind = arena_bin_index(arena, bin);
bin_info = &arena_bin_info[binind];
bin->runcur = NULL;
run = arena_bin_nonfull_run_get(arena, bin);
if (bin->runcur != NULL && bin->runcur->nfree > 0) {
/*
* Another thread updated runcur while this one ran without the
* bin lock in arena_bin_nonfull_run_get().
*/
assert(bin->runcur->nfree > 0);
ret = arena_run_reg_alloc(bin->runcur, bin_info);
if (run != NULL) {
arena_chunk_t *chunk;
/*
* arena_run_alloc_small() may have allocated run, or
* it may have pulled run from the bin's run tree.
* Therefore it is unsafe to make any assumptions about
* how run has previously been used, and
* arena_bin_lower_run() must be called, as if a region
* were just deallocated from the run.
*/
chunk = (arena_chunk_t *)CHUNK_ADDR2BASE(run);
if (run->nfree == bin_info->nregs)
arena_dalloc_bin_run(arena, chunk, run, bin);
else
arena_bin_lower_run(arena, chunk, run, bin);
}
return (ret);
}
if (run == NULL)
return (NULL);
bin->runcur = run;
assert(bin->runcur->nfree > 0);
return (arena_run_reg_alloc(bin->runcur, bin_info));
}
JEMALLOC_INLINE_C void *
arena_run_reg_alloc(arena_run_t *run, arena_bin_info_t *bin_info)
{
void *ret;
unsigned regind;
arena_chunk_map_misc_t *miscelm;
void *rpages;
assert(run->nfree > 0);
assert(!bitmap_full(run->bitmap, &bin_info->bitmap_info));
regind = bitmap_sfu(run->bitmap, &bin_info->bitmap_info);
miscelm = arena_run_to_miscelm(run);
rpages = arena_miscelm_to_rpages(miscelm);
ret = (void *)((uintptr_t)rpages + (uintptr_t)bin_info->reg0_offset +
(uintptr_t)(bin_info->reg_interval * regind));
run->nfree--;
return (ret);
}</code></pre>
arena_tcache_fill_small checks whether the bin's current run in the arena is usable; if not, it calls arena_bin_malloc_hard to obtain a new run and then arena_run_reg_alloc to allocate from it. Note that a run uses a bitmap to track the state of its regions. Compared with a free list, a bitmap has several advantages: the first free region can be found quickly, and allocated regions stay tightly packed; the allocator metadata is kept separate from application data, reducing interference from application data; and very small allocation regions are supported better.
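To illustrate the bitmap idea, here is a minimal one-level sketch of "find the first free region and mark it used", which is what bitmap_sfu() does. jemalloc's real bitmap is hierarchical so the scan stays fast even for large runs; the names below are invented for the example, and it relies on the GCC/Clang __builtin_ctzll builtin.
<pre><code>#include &lt;stdint.h&gt;
#include &lt;stdio.h&gt;

#define NREGS   192                      /* regions in this toy run */
#define NGROUPS (NREGS / 64)

/* Bit set = region still free (inverted relative to jemalloc, for brevity). */
static uint64_t freemap[NGROUPS];

static int alloc_region(void)
{
	for (int g = 0; g < NGROUPS; g++) {
		if (freemap[g] != 0) {
			int bit = __builtin_ctzll(freemap[g]); /* first free bit */
			freemap[g] &= ~(1ULL << bit);          /* mark allocated */
			return g * 64 + bit;
		}
	}
	return -1; /* run is full */
}

static void free_region(int regind)
{
	freemap[regind / 64] |= 1ULL << (regind % 64);
}

int main(void)
{
	for (int g = 0; g < NGROUPS; g++)
		freemap[g] = ~0ULL;              /* everything free initially */
	int a = alloc_region(), b = alloc_region();
	free_region(a);
	printf("%d %d %d\n", a, b, alloc_region()); /* low regions reused first */
	return 0;
}</code></pre>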
Back in arena_malloc_small (the no-tcache path), the logic is similar to arena_tcache_fill_small, so it is not repeated here.
Next, consider the case where the requested memory is a large object.
When the requested size is no larger than the largest object the tcache can cache and a tcache is available, tcache_alloc_large is called:
<pre><code>JEMALLOC_ALWAYS_INLINE void *
tcache_alloc_large(tsd_t *tsd, arena_t *arena, tcache_t *tcache, size_t size,
bool zero)
{
void *ret;
index_t binind;
size_t usize;
tcache_bin_t *tbin;
binind = size2index(size);
usize = index2size(binind);
assert(usize <= tcache_maxclass);
assert(binind < nhbins);
tbin = &tcache->tbins[binind];
ret = tcache_alloc_easy(tbin);
if (unlikely(ret == NULL)) {
/*
* Only allocate one large object at a time, because it's quite
* expensive to create one and not use it.
*/
ret = arena_malloc_large(arena, usize, zero);
if (ret == NULL)
return (NULL);
} else {
if (config_prof && usize == LARGE_MINCLASS) {
arena_chunk_t *chunk =
(arena_chunk_t *)CHUNK_ADDR2BASE(ret);
size_t pageind = (((uintptr_t)ret - (uintptr_t)chunk) >>
LG_PAGE);
arena_mapbits_large_binind_set(chunk, pageind,
BININD_INVALID);
}
if (likely(!zero)) {
if (config_fill) {
if (unlikely(opt_junk_alloc))
memset(ret, 0xa5, usize);
else if (unlikely(opt_zero))
memset(ret, 0, usize);
}
} else
memset(ret, 0, usize);
if (config_stats)
tbin->tstats.nrequests++;
if (config_prof)
tcache->prof_accumbytes += usize;
}
tcache_event(tsd, tcache);
return (ret);
}</code></pre>
As you can see, the allocator again tries the tcache first; if the tcache has a free block it is returned, otherwise arena_malloc_large is called to allocate directly from the arena.
<pre><code>void *
arena_malloc_large(arena_t *arena, size_t size, bool zero)
{
void *ret;
size_t usize;
uint64_t r;
uintptr_t random_offset;
arena_run_t *run;
arena_chunk_map_misc_t *miscelm;
UNUSED bool idump;
/* Large allocation. */
usize = s2u(size);
malloc_mutex_lock(&arena->lock);
if (config_cache_oblivious) {
/*
* Compute a uniformly distributed offset within the first page
* that is a multiple of the cacheline size, e.g. [0 .. 63) * 64
* for 4 KiB pages and 64-byte cachelines.
*/
prng64(r, LG_PAGE - LG_CACHELINE, arena->offset_state,
UINT64_C(6364136223846793009), UINT64_C(1442695040888963409));
random_offset = ((uintptr_t)r) << LG_CACHELINE;
} else
random_offset = 0;
run = arena_run_alloc_large(arena, usize + large_pad, zero);
if (run == NULL) {
malloc_mutex_unlock(&arena->lock);
return (NULL);
}
miscelm = arena_run_to_miscelm(run);
ret = (void *)((uintptr_t)arena_miscelm_to_rpages(miscelm) +
random_offset);
if (config_stats) {
index_t index = size2index(usize) - NBINS;
arena->stats.nmalloc_large++;
arena->stats.nrequests_large++;
arena->stats.allocated_large += usize;
arena->stats.lstats[index].nmalloc++;
arena->stats.lstats[index].nrequests++;
arena->stats.lstats[index].curruns++;
}
if (config_prof)
idump = arena_prof_accum_locked(arena, usize);
malloc_mutex_unlock(&arena->lock);
if (config_prof && idump)
prof_idump();
if (!zero) {
if (config_fill) {
if (unlikely(opt_junk_alloc))
memset(ret, 0xa5, usize);
else if (unlikely(opt_zero))
memset(ret, 0, usize);
}
}
return (ret);
}
static arena_run_t *
arena_run_first_best_fit(arena_t *arena, size_t size)
{
size_t search_size = run_quantize_first(size);
arena_chunk_map_misc_t *key = (arena_chunk_map_misc_t *)
(search_size | CHUNK_MAP_KEY);
arena_chunk_map_misc_t *miscelm =
arena_avail_tree_nsearch(&arena->runs_avail, key);
if (miscelm == NULL)
return (NULL);
return (&miscelm->run);
}
static arena_run_t *
arena_run_alloc_large_helper(arena_t *arena, size_t size, bool zero)
{
arena_run_t *run = arena_run_first_best_fit(arena, s2u(size));
if (run != NULL)
arena_run_split_large(arena, run, size, zero);
return (run);
}
static arena_run_t *
arena_run_alloc_large(arena_t *arena, size_t size, bool zero)
{
arena_chunk_t *chunk;
arena_run_t *run;
assert(size <= arena_maxrun);
assert(size == PAGE_CEILING(size));
/* Search the arena's chunks for the lowest best fit. */
run = arena_run_alloc_large_helper(arena, size, zero);
if (run != NULL)
return (run);
/*
* No usable runs. Create a new chunk from which to allocate the run.
*/
chunk = arena_chunk_alloc(arena);
if (chunk != NULL) {
run = &arena_miscelm_get(chunk, map_bias)->run;
arena_run_split_large(arena, run, size, zero);
return (run);
}
/*
* arena_chunk_alloc() failed, but another thread may have made
* sufficient memory available while this one dropped arena->lock in
* arena_chunk_alloc(), so search one more time.
*/
return (arena_run_alloc_large_helper(arena, size, zero));
}</code></pre>
arena_malloc_large first computes a random offset and then calls arena_run_alloc_large to obtain a run. Inside arena_run_alloc_large, arena_run_first_best_fit looks up a usable run in the size/address-ordered runs_avail tree; if none is found, arena_chunk_alloc creates a new chunk and the run is carved out of it.
In arena_chunk_alloc, the allocator first checks whether the arena's spare field is set. spare caches the most recently freed chunk, so if such a chunk exists arena_chunk_init_spare is called, otherwise arena_chunk_init_hard. arena_chunk_init_spare is trivial: it simply hands the spare chunk back out;
<pre><code>static arena_chunk_t *
arena_chunk_alloc(arena_t *arena)
{
arena_chunk_t *chunk;
if (arena->spare != NULL)
chunk = arena_chunk_init_spare(arena);
else {
chunk = arena_chunk_init_hard(arena);
if (chunk == NULL)
return (NULL);
}
/* Insert the run into the runs_avail tree. */
arena_avail_insert(arena, chunk, map_bias, chunk_npages-map_bias);
return (chunk);
}</code></pre>
arena_chunk_init_hard in turn calls arena_chunk_alloc_internal:
<pre><code>static arena_chunk_t *
arena_chunk_alloc_internal(arena_t *arena, bool *zero)
{
arena_chunk_t *chunk;
if (likely(arena->chunk_alloc == chunk_alloc_default)) {
chunk = chunk_alloc_cache(arena, NULL, chunksize, chunksize,
zero, true);
if (chunk != NULL && arena_chunk_register(arena, chunk,
*zero)) {
chunk_dalloc_cache(arena, chunk, chunksize);
return (NULL);
}
} else
chunk = NULL;
if (chunk == NULL)
chunk = arena_chunk_alloc_internal_hard(arena, zero);
if (config_stats && chunk != NULL) {
arena->stats.mapped += chunksize;
arena->stats.metadata_mapped += (map_bias << LG_PAGE);
}
return (chunk);
}</code></pre>
arena_chunk_alloc_internal first calls chunk_alloc_cache, which allocates a chunk by searching the arena's extent trees; these are red-black trees holding chunks that were previously allocated and then returned to the cache. arena_chunk_register is then called to initialize and register the chunk. If this path fails, arena_chunk_alloc_internal_hard (shown after the sketch below) allocates a brand-new chunk.
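The recycling idea behind chunk_alloc_cache() can be sketched as follows; a sorted array stands in for the size/address-ordered red-black trees (chunks_szad_*), splitting of oversized extents is omitted, and the names are invented for the example.
<pre><code>#include &lt;stddef.h&gt;

typedef struct { void *addr; size_t size; int in_use; } toy_extent_t;

#define NCACHED 64
static toy_extent_t cache[NCACHED];      /* kept sorted by (size, addr) */

static void *recycle_chunk(size_t size)
{
	/* First fit in (size, addr) order == smallest sufficient size,
	 * lowest address -- the same policy as searching the szad tree. */
	for (int i = 0; i < NCACHED; i++) {
		if (!cache[i].in_use && cache[i].size >= size) {
			cache[i].in_use = 1;
			return cache[i].addr;    /* real code would split any remainder */
		}
	}
	return NULL;    /* miss: fall back to asking the OS for a new chunk */
}</code></pre>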
<pre><code>static arena_chunk_t *
arena_chunk_alloc_internal_hard(arena_t *arena, bool *zero)
{
arena_chunk_t *chunk;
chunk_alloc_t *chunk_alloc = arena->chunk_alloc;
chunk_dalloc_t *chunk_dalloc = arena->chunk_dalloc;
malloc_mutex_unlock(&arena->lock);
chunk = (arena_chunk_t *)chunk_alloc_wrapper(arena, chunk_alloc, NULL,
chunksize, chunksize, zero);
if (chunk != NULL && arena_chunk_register(arena, chunk, *zero)) {
chunk_dalloc_wrapper(arena, chunk_dalloc, (void *)chunk,
chunksize);
chunk = NULL;
}
malloc_mutex_lock(&arena->lock);
return (chunk);
}
void *
chunk_alloc_wrapper(arena_t *arena, chunk_alloc_t *chunk_alloc, void *new_addr,
size_t size, size_t alignment, bool *zero)
{
void *ret;
ret = chunk_alloc(new_addr, size, alignment, zero, arena->ind);
if (ret == NULL)
return (NULL);
if (config_valgrind && chunk_alloc != chunk_alloc_default)
JEMALLOC_VALGRIND_MAKE_MEM_UNDEFINED(ret, chunksize);
return (ret);
}</code></pre>
Ultimately a new chunk is obtained from the system through the chunk_alloc function, followed by the same initialization and registration steps. That completes the large allocation path.
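For the final OS-level step, a simplified stand-in for an mmap-based chunk allocation looks roughly like this: over-allocate, then trim the mapping so the returned chunk is aligned to the chunk size. This is only a sketch of the general technique, not the real chunk_alloc path, which has more retry and error-handling logic.
<pre><code>#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;
#include &lt;sys/mman.h&gt;

#define CHUNK_SIZE (4UL * 1024 * 1024)

static void *chunk_mmap_aligned(size_t size, size_t alignment)
{
	size_t alloc = size + alignment;     /* over-allocate to guarantee alignment */
	char *p = mmap(NULL, alloc, PROT_READ | PROT_WRITE,
	    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return NULL;

	uintptr_t addr    = (uintptr_t)p;
	uintptr_t aligned = (addr + alignment - 1) & ~((uintptr_t)alignment - 1);
	size_t    lead    = aligned - addr;
	size_t    trail   = alloc - lead - size;

	if (lead != 0)
		munmap(p, lead);                         /* trim the front */
	if (trail != 0)
		munmap((char *)aligned + size, trail);   /* trim the back */
	return (void *)aligned;
}</code></pre>
Called as chunk_mmap_aligned(CHUNK_SIZE, CHUNK_SIZE), this returns a chunk whose address is a multiple of its own size, which is what lets CHUNK_ADDR2BASE() recover the owning chunk from any interior pointer, as seen in arena_dalloc later.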
Now let's look at the huge case:
<pre><code>void *
huge_malloc(tsd_t *tsd, arena_t *arena, size_t size, bool zero,
tcache_t *tcache)
{
size_t usize;
usize = s2u(size);
if (usize == 0) {
/* size_t overflow. */
return (NULL);
}
return (huge_palloc(tsd, arena, usize, chunksize, zero, tcache));
}
void *
huge_palloc(tsd_t *tsd, arena_t *arena, size_t usize, size_t alignment,
bool zero, tcache_t *tcache)
{
void *ret;
extent_node_t *node;
bool is_zeroed;
/* Allocate one or more contiguous chunks for this request. */
/* Allocate an extent node with which to track the chunk. */
node = ipallocztm(tsd, CACHELINE_CEILING(sizeof(extent_node_t)),
CACHELINE, false, tcache, true, arena);
if (node == NULL)
return (NULL);
/*
* Copy zero into is_zeroed and pass the copy to chunk_alloc(), so that
* it is possible to make correct junk/zero fill decisions below.
*/
is_zeroed = zero;
/* ANDROID change */
#if !defined(__LP64__)
/* On 32 bit systems, using a per arena cache can exhaust
* virtual address space. Force all huge allocations to
* always take place in the first arena.
*/
arena = a0get();
#else
arena = arena_choose(tsd, arena);
#endif
/* End ANDROID change */
if (unlikely(arena == NULL) || (ret = arena_chunk_alloc_huge(arena,
usize, alignment, &is_zeroed)) == NULL) {
idalloctm(tsd, node, tcache, true);
return (NULL);
}
extent_node_init(node, arena, ret, usize, is_zeroed);
if (huge_node_set(ret, node)) {
arena_chunk_dalloc_huge(arena, ret, usize);
idalloctm(tsd, node, tcache, true);
return (NULL);
}
/* Insert node into huge. */
malloc_mutex_lock(&arena->huge_mtx);
ql_elm_new(node, ql_link);
ql_tail_insert(&arena->huge, node, ql_link);
malloc_mutex_unlock(&arena->huge_mtx);
if (zero || (config_fill && unlikely(opt_zero))) {
if (!is_zeroed)
memset(ret, 0, usize);
} else if (config_fill && unlikely(opt_junk_alloc))
memset(ret, 0xa5, usize);
return (ret);
}
void *
arena_chunk_alloc_huge(arena_t *arena, size_t usize, size_t alignment,
bool *zero)
{
void *ret;
chunk_alloc_t *chunk_alloc;
size_t csize = CHUNK_CEILING(usize);
malloc_mutex_lock(&arena->lock);
/* Optimistically update stats. */
if (config_stats) {
arena_huge_malloc_stats_update(arena, usize);
arena->stats.mapped += usize;
}
arena->nactive += (usize >> LG_PAGE);
chunk_alloc = arena->chunk_alloc;
if (likely(chunk_alloc == chunk_alloc_default)) {
ret = chunk_alloc_cache(arena, NULL, csize, alignment, zero,
true);
} else
ret = NULL;
malloc_mutex_unlock(&arena->lock);
if (ret == NULL) {
ret = arena_chunk_alloc_huge_hard(arena, chunk_alloc, usize,
alignment, zero, csize);
}
if (config_stats && ret != NULL)
stats_cactive_add(usize);
return (ret);
}
static void *
arena_chunk_alloc_huge_hard(arena_t *arena, chunk_alloc_t *chunk_alloc,
size_t usize, size_t alignment, bool *zero, size_t csize)
{
void *ret;
ret = chunk_alloc_wrapper(arena, chunk_alloc, NULL, csize, alignment,
zero);
if (ret == NULL) {
/* Revert optimistic stats updates. */
malloc_mutex_lock(&arena->lock);
if (config_stats) {
arena_huge_malloc_stats_update_undo(arena, usize);
arena->stats.mapped -= usize;
}
arena->nactive -= (usize >> LG_PAGE);
malloc_mutex_unlock(&arena->lock);
}
return (ret);
}</code></pre>
For a huge request, huge_palloc first allocates an extent_node; as the comment notes, this node is used to track the newly allocated chunk. It then selects an arena and calls arena_chunk_alloc_huge to do the actual allocation.
Once again chunk_alloc_cache, which we met earlier, is used to allocate the chunk; if that fails, arena_chunk_alloc_huge_hard takes over, going through the already familiar chunk_alloc_wrapper.
In short:
Small memory (small class): thread-cache bin -> arena bin (taking the bin lock) -> ask the OS
Medium memory (large class): thread cache (for sizes the tcache can hold) -> arena runs (taking the arena lock) -> ask the OS
Large memory (huge class): mmap'ed directly and organized as chunks, maintained in the global huge red-black tree (with caching)
Memory deallocation
The deallocation flow largely mirrors allocation. With the tcache enabled, freed blocks are cached; without it they are released directly (for blocks no larger than a chunk, the state of the corresponding pages is updated and the run is reclaimed; blocks larger than a chunk are munmap'ed directly). The interesting question is when jemalloc actually returns memory to the operating system, since ptmalloc's top_chunk mechanism (see Hua Ting's article for details) can prevent memory from being returned at all. As it stands, besides huge allocations being munmap'ed directly, jemalloc has two other mechanisms for releasing memory:
- when a free discovers that all of a chunk's memory is dirty (allocated and then freed), the whole chunk is released;
- when an arena's page usage crosses a threshold, its dirty pages are purged via madvise. Concretely, the threshold means the arena's dirty pages amount to at least one chunk and exceed 1/2^opt_lg_dirty_mult of the active pages (1/32 by default). Active pages are the pages of runs currently in use, and dirty pages are those among them that have been allocated and then freed again. A sketch of this check follows.
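Here is a sketch of that purge decision, reusing the arena field names shown earlier (nactive, ndirty, lg_dirty_mult); the struct and helper names are invented, and the real logic lives in jemalloc's arena purge path rather than anything this simple.
<pre><code>#include &lt;stddef.h&gt;
#include &lt;sys/mman.h&gt;
#include &lt;sys/types.h&gt;

#define LG_PAGE      12
#define CHUNK_NPAGES ((4UL * 1024 * 1024) >> LG_PAGE)

struct toy_arena {
	size_t  nactive;        /* pages in active runs */
	size_t  ndirty;         /* unused but not yet purged pages */
	ssize_t lg_dirty_mult;  /* purge when ndirty > nactive >> lg_dirty_mult */
};

static int should_purge(const struct toy_arena *a)
{
	if (a->lg_dirty_mult < 0)            /* purging disabled */
		return 0;
	return a->ndirty > CHUNK_NPAGES &&
	    a->ndirty > (a->nactive >> a->lg_dirty_mult);
}

/* Return a dirty run's pages to the OS without unmapping the chunk. */
static void purge_run(void *run_addr, size_t npages)
{
	madvise(run_addr, npages << LG_PAGE, MADV_DONTNEED);
}</code></pre>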
Function call chain when freeing memory:
je_free -> ifree -> iqalloc -> idalloctm -> arena_dalloc
<pre><code>JEMALLOC_ALWAYS_INLINE void
arena_dalloc(tsd_t *tsd, void *ptr, tcache_t *tcache)
{
arena_chunk_t *chunk;
size_t pageind, mapbits;
assert(ptr != NULL);
chunk = (arena_chunk_t *)CHUNK_ADDR2BASE(ptr);
if (likely(chunk != ptr)) {
pageind = ((uintptr_t)ptr - (uintptr_t)chunk) >> LG_PAGE;
#if defined(__ANDROID__)
/* Verify the ptr is actually in the chunk. */
if (unlikely(pageind < map_bias || pageind >= chunk_npages)) {
__libc_fatal_no_abort("Invalid address %p passed to free: invalid page index", ptr);
return;
}
#endif
mapbits = arena_mapbits_get(chunk, pageind);
assert(arena_mapbits_allocated_get(chunk, pageind) != 0);
#if defined(__ANDROID__)
/* Verify the ptr has been allocated. */
if (unlikely((mapbits & CHUNK_MAP_ALLOCATED) == 0)) {
__libc_fatal("Invalid address %p passed to free: value not allocated", ptr);
}
#endif
if (likely((mapbits & CHUNK_MAP_LARGE) == 0)) {
/* Small allocation. */
if (likely(tcache != NULL)) {
index_t binind = arena_ptr_small_binind_get(ptr,
mapbits);
tcache_dalloc_small(tsd, tcache, ptr, binind);
} else {
arena_dalloc_small(extent_node_arena_get(
&chunk->node), chunk, ptr, pageind);
}
} else {
size_t size = arena_mapbits_large_size_get(chunk,
pageind);
assert(config_cache_oblivious || ((uintptr_t)ptr &
PAGE_MASK) == 0);
if (likely(tcache != NULL) && size - large_pad <=
tcache_maxclass) {
tcache_dalloc_large(tsd, tcache, ptr, size -
large_pad);
} else {
arena_dalloc_large(extent_node_arena_get(
&chunk->node), chunk, ptr);
}
}
} else
huge_dalloc(tsd, ptr, tcache);
}</code></pre>
References:
內(nèi)存分配器jemalloc學(xué)習(xí)筆記
jemalloc原理分析
ptmalloc,tcmalloc和jemalloc內(nèi)存分配策略研究
幾種malloc實(shí)現(xiàn)原理 ptmalloc(glibc) && tcmalloc(google) && jemalloc(facebook)