RDMA achieves far higher network performance than traditional TCP/IP through two core techniques: kernel bypass and protocol-stack offload. Yet despite this performance advantage, very few production workloads actually run on RDMA today. Two main factors hold back large-scale deployment:
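- Most major internet companies choose RoCE (RDMA over Converged Ethernet) as their RDMA deployment option. RoCE (in its routable v2 form) is essentially RDMA over UDP and cannot by itself guarantee lossless delivery. A RoCE deployment therefore needs additional flow-control and congestion-control mechanisms, such as PFC and ECN, to keep the underlying network lossless, which makes large-scale rollout challenging. In addition, vendor support for hardware congestion control is still incomplete, and compatibility problems exist.
- RDMA exposes a programming interface completely different from sockets, so existing applications must be modified to use it. The native RDMA APIs (verbs/RDMA_CM) are fairly complex, and developing against them requires a deep understanding of RDMA, so the learning cost is high.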
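To reduce the cost of porting applications, we decided to develop an RDMA communication library built directly on ibverbs and RDMA_CM, avoiding calls into other third-party libraries. This article summarizes the event notification mechanisms in RDMA programming.
Traditional socket programming usually implements event notification with I/O multiplexing (select, poll, epoll, and so on). Can RDMA event notification be built on I/O multiplexing as well? The answer is yes.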
1. RDMA_CM API (for connection setup)
In RDMA programming, an RDMA connection can be established directly through the RDMA_CM API.
Looking at rdma_create_id (the listing below is abridged), its main job is to create an rdma_cm_id object and register it with the kernel driver.
int rdma_create_id(struct rdma_event_channel *channel,
struct rdma_cm_id **id, void *context,
enum rdma_port_space ps)
{
enum ibv_qp_type qp_type = (ps == RDMA_PS_IPOIB || ps == RDMA_PS_UDP) ?
IBV_QPT_UD : IBV_QPT_RC;
ret = ucma_init(); //query all IB devices and store them in the global cma_dev_array; check whether AF_IB is supported
struct cma_id_private *id_priv =
ucma_alloc_id(channel, context, ps, qp_type); //allocate and initialize id_priv; if no rdma_event_channel was passed in, create one with rdma_create_event_channel
CMA_INIT_CMD_RESP(&cmd, sizeof cmd, CREATE_ID, &resp, sizeof resp);
cmd.uid = (uintptr_t) id_priv;
cmd.ps = ps;
cmd.qp_type = qp_type;
ret = write(id_priv->id.channel->fd, &cmd, sizeof cmd); //register id_priv with the kernel driver (not analyzed further here)
*id = &id_priv->id; //return the rdma_cm_id object
}
The rdma_cm_id data structure is defined as follows:
struct rdma_cm_id {
struct ibv_context *verbs; //ibv_open_device
struct rdma_event_channel *channel; //created by rdma_create_event_channel; used for connection setup
void *context; //user specified context
struct ibv_qp *qp; //created by rdma_create_qp, which calls ibv_create_qp underneath
struct rdma_route route;
enum rdma_port_space ps; //RDMA_PS_IPOIB or RDMA_PS_UDP or RDMA_PS_TCP
uint8_t port_num; //port number
struct rdma_cm_event *event; //rdma_cm event associated with this id
struct ibv_comp_channel *send_cq_channel; //created by ibv_create_comp_channel; used for data transfer
struct ibv_cq *send_cq; //send CQ, usually the same CQ as recv_cq
struct ibv_comp_channel *recv_cq_channel; //created by ibv_create_comp_channel; used for data transfer
struct ibv_cq *recv_cq; //recv CQ, usually the same CQ as send_cq
struct ibv_srq *srq;
struct ibv_pd *pd; //allocated by ibv_alloc_pd
enum ibv_qp_type qp_type; //IBV_QPT_RC or IBV_QPT_UD
};
If no rdma_event_channel was created beforehand, creating an rdma_cm_id requires calling rdma_create_event_channel.
struct rdma_event_channel *rdma_create_event_channel(void)
{
struct rdma_event_channel *channel;
if (ucma_init()) //a static local variable ensures initialization runs only once
return NULL;
channel = malloc(sizeof *channel); //allocate the rdma_event_channel
if (!channel)
return NULL;
channel->fd = open("/dev/infiniband/rdma_cm", O_RDWR | O_CLOEXEC); //an rdma_event_channel is essentially just an fd
if (channel->fd < 0) {
goto err;
}
return channel;
err:
free(channel);
return NULL;
}
rdma_event_channel is defined as follows:
struct rdma_event_channel {
int fd;
};
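Because the event channel is nothing more than a wrapped fd, it can be handed to any readiness API. The sketch below illustrates that idea without requiring RDMA hardware: `struct demo_channel` and the `demo_*` functions are hypothetical names, and the read end of a pipe stands in for the fd that rdma_create_event_channel would open.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/* Hypothetical stand-in for struct rdma_event_channel: nothing but an fd. */
struct demo_channel {
    int fd;
};

/* Wait up to timeout_ms for the channel to become readable.
 * Returns 1 if POLLIN is set, 0 on timeout, -1 on error. */
int demo_channel_wait(const struct demo_channel *ch, int timeout_ms)
{
    struct pollfd pfd;
    int rc;

    assert(ch != NULL);
    pfd.fd = ch->fd;
    pfd.events = POLLIN;
    pfd.revents = 0;

    rc = poll(&pfd, 1, timeout_ms);
    if (rc < 0)
        return -1;
    return (rc > 0 && (pfd.revents & POLLIN)) ? 1 : 0;
}

/* Self-check: wrap the read end of a pipe in a demo_channel, optionally
 * post one "event" byte, and report what demo_channel_wait sees. */
int demo_channel_selftest(int post_event)
{
    int p[2];
    int ready = -1;

    if (pipe(p) != 0)
        return -1;
    struct demo_channel ch = { .fd = p[0] };
    if (!post_event || write(p[1], "x", 1) == 1)
        ready = demo_channel_wait(&ch, 0);
    close(p[0]);
    close(p[1]);
    return ready;
}
```

With a real channel, the same `poll` call would be issued on `channel->fd`, and a POLLIN result would be followed by rdma_get_cm_event rather than a read.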
1.1 Native RDMA_CM event notification (blocking)
static int cma_handler(struct rdma_cm_id *cma_id, struct rdma_cm_event *event);
ret = rdma_get_cm_event(channel, &event); //blocks until an rdma_cm event occurs
if (!ret) {
ret = cma_handler(event->id, event); //handle the event
rdma_ack_cm_event(event); //ack the event
}
static int cma_handler(struct rdma_cm_id *cma_id, struct rdma_cm_event *event) {
int ret = 0;
switch (event->event)
{
case RDMA_CM_EVENT_ADDR_RESOLVED:
ret = addr_handler(cma_id->context);
break;
case RDMA_CM_EVENT_MULTICAST_JOIN:
ret = join_handler(cma_id->context, &event->param.ud);
break;
case RDMA_CM_EVENT_ADDR_ERROR:
case RDMA_CM_EVENT_ROUTE_ERROR:
case RDMA_CM_EVENT_MULTICAST_ERROR:
printf("mckey: event: %s, error: %d\n", rdma_event_str(event->event), event->status);
connect_error();
ret = event->status;
break;
case RDMA_CM_EVENT_DEVICE_REMOVAL:
/* Cleanup will occur after test completes. */
break;
default:
break;
}
return ret;
}
As the handler shows, the rdma_cm fd only reports connection-management events and never data-transfer events, so rdma_cm events are used solely to signal connection-related activity:
enum rdma_cm_event_type {
RDMA_CM_EVENT_ADDR_RESOLVED,
RDMA_CM_EVENT_ADDR_ERROR,
RDMA_CM_EVENT_ROUTE_RESOLVED,
RDMA_CM_EVENT_ROUTE_ERROR,
RDMA_CM_EVENT_CONNECT_REQUEST,
RDMA_CM_EVENT_CONNECT_RESPONSE,
RDMA_CM_EVENT_CONNECT_ERROR,
RDMA_CM_EVENT_UNREACHABLE,
RDMA_CM_EVENT_REJECTED,
RDMA_CM_EVENT_ESTABLISHED,
RDMA_CM_EVENT_DISCONNECTED,
RDMA_CM_EVENT_DEVICE_REMOVAL,
RDMA_CM_EVENT_MULTICAST_JOIN,
RDMA_CM_EVENT_MULTICAST_ERROR,
RDMA_CM_EVENT_ADDR_CHANGE,
RDMA_CM_EVENT_TIMEWAIT_EXIT
};
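To make the event list more concrete, here is a minimal sketch of how an active (client) side typically reacts to the main events: address resolution leads to route resolution, which leads to the connect call, which ends in an established connection. The `DEMO_*` enums and `demo_client_next_action` are hypothetical local stand-ins so that the example compiles without librdmacm; the comments name the real calls a client would make at each step.

```c
#include <assert.h>

/* Local stand-ins mirroring a few rdma_cm_event_type values (hypothetical names). */
enum demo_cm_event {
    DEMO_ADDR_RESOLVED,
    DEMO_ROUTE_RESOLVED,
    DEMO_ESTABLISHED,
    DEMO_DISCONNECTED,
    DEMO_ADDR_ERROR,
    DEMO_ROUTE_ERROR,
    DEMO_CONNECT_ERROR,
    DEMO_UNREACHABLE,
    DEMO_REJECTED
};

/* What the client should do next (hypothetical names). */
enum demo_action {
    DEMO_DO_RESOLVE_ROUTE,
    DEMO_DO_CONNECT,
    DEMO_DO_START_IO,
    DEMO_DO_TEARDOWN,
    DEMO_DO_FAIL
};

/* Map a connection-management event to the client's next step. */
enum demo_action demo_client_next_action(enum demo_cm_event ev)
{
    assert(ev >= DEMO_ADDR_RESOLVED && ev <= DEMO_REJECTED);
    switch (ev) {
    case DEMO_ADDR_RESOLVED:
        return DEMO_DO_RESOLVE_ROUTE; /* real code: rdma_resolve_route() */
    case DEMO_ROUTE_RESOLVED:
        return DEMO_DO_CONNECT;       /* real code: rdma_create_qp(), then rdma_connect() */
    case DEMO_ESTABLISHED:
        return DEMO_DO_START_IO;      /* connection is up; post sends and receives */
    case DEMO_DISCONNECTED:
        return DEMO_DO_TEARDOWN;      /* real code: rdma_disconnect(), release resources */
    default:
        return DEMO_DO_FAIL;          /* every error or rejection event ends the attempt */
    }
}
```

A function like this would be called from the handler after rdma_get_cm_event returns, with the real event type translated to the matching case.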
1.2 I/O multiplexing with poll/epoll (non-blocking)
Unlike a traditional socket fd, the rdma_cm fd only ever raises POLLIN, which signals that an rdma_cm event has occurred; the specific event type must be retrieved with rdma_get_cm_event.
/* change the blocking mode of the rdma_cm event channel */
flags = fcntl(cm_id->channel->fd, F_GETFL);
rc = fcntl(cm_id->channel->fd, F_SETFL, flags | O_NONBLOCK); //set the rdma_cm fd to non-blocking
if (rc < 0) {
fprintf(stderr, "Failed to change file descriptor of rdma_cm Event Channel\n");
return -1;
}
struct pollfd my_pollfd;
int ms_timeout = 10;
/*
* poll the channel until it has an event and sleep ms_timeout
* milliseconds between any iteration
*/
my_pollfd.fd = cm_id->channel->fd;
my_pollfd.events = POLLIN; //only POLLIN needs to be watched; it means an rdma_cm event has occurred
my_pollfd.revents = 0;
do {
rc = poll(&my_pollfd, 1, ms_timeout); //returns when an event arrives or the timeout expires
} while (rc == 0);
/* Note: a wake-up from poll only says that some rdma_cm event occurred; the event itself must still be fetched with rdma_get_cm_event. */
ret = rdma_get_cm_event(cm_id->channel, &event);
if (!ret) {
ret = cma_handler(event->id, event); //處理收到的事件
rdma_ack_cm_event(event); //ack event
}
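One detail the listing above leaves implicit: once the fd is non-blocking, several events may be queued behind a single POLLIN, so a common pattern is to keep fetching until the call reports that nothing is left. The sketch below shows that drain loop on a pipe used as a stand-in for the rdma_cm fd (all `demo_*` names are hypothetical). It assumes, as is usual for non-blocking fds, that an empty queue is reported through `EAGAIN`; with the real API the `read` would be replaced by rdma_get_cm_event, the handler, and rdma_ack_cm_event.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Put fd into non-blocking mode, as the text does for the rdma_cm channel fd. */
int demo_set_nonblock(int fd)
{
    int flags = fcntl(fd, F_GETFL);
    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/* Consume one-byte "events" until the fd reports EAGAIN.
 * Returns the number of events consumed, or -1 on EOF or a hard error. */
int demo_drain_events(int fd)
{
    int n = 0;

    for (;;) {
        char ev;
        ssize_t r = read(fd, &ev, 1); /* stand-in for get event + handle + ack */
        if (r == 1) {
            n++;
            continue;
        }
        if (r < 0 && errno == EINTR)
            continue;
        if (r < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
            return n; /* queue is empty: go back to poll/epoll */
        return -1;
    }
}

/* Self-check: queue nevents bytes on a pipe and drain them. */
int demo_drain_selftest(int nevents)
{
    int p[2];
    int got = -1;
    int i;

    assert(nevents >= 0 && nevents <= 16);
    if (pipe(p) != 0)
        return -1;
    if (demo_set_nonblock(p[0]) == 0) {
        for (i = 0; i < nevents; i++)
            if (write(p[1], "e", 1) != 1)
                break;
        if (i == nevents)
            got = demo_drain_events(p[0]);
    }
    close(p[0]);
    close(p[1]);
    return got;
}
```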
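2. verbs API (for data transfer)
As the previous section shows, the rdma_cm fd only carries connection-related events and cannot deliver data-transfer events.
In RDMA, data transfer is performed by the NIC hardware without CPU involvement. When the NIC finishes a transfer, it posts a CQE (completion queue entry) to the CQ (completion queue) describing how the transfer completed.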
struct ibv_cq *ibv_create_cq(struct ibv_context *context, int cqe,
void *cq_context, struct ibv_comp_channel *channel, int comp_vector)
# Purpose: create a CQ; every QP has an associated send CQ and recv CQ.
# A CQ can be shared by the send queue and recv queue of one QP, or by several different QPs.
# Note: a CQ is only a queue and has no built-in event notification. To get notifications, pass a channel object.
The verbs API provides an interface for creating an ibv_comp_channel:
struct ibv_comp_channel *ibv_create_comp_channel(struct ibv_context *context)
# Purpose: create a completion channel, used to notify the user that a new completion queue entry (CQE) has been written to the CQ.
struct ibv_comp_channel {
struct ibv_context *context;
int fd;
int refcnt;
};
2.1 Native verbs event notification (blocking)
struct ibv_context *context;
struct ibv_cq *cq;
void *ev_ctx = NULL; /* can be initialized with other values for the CQ context */
/* Create a CQ, which is associated with a Completion Event Channel */
cq = ibv_create_cq(context, 1, ev_ctx, channel, 0);
if (!cq) {
fprintf(stderr, "Failed to create CQ\n");
return -1;
}
/* Request notification before any completion can be created (to prevent races) */
ret = ibv_req_notify_cq(cq, 0);
if (ret) {
fprintf(stderr, "Couldn't request CQ notification\n");
return -1;
}
/* The following code will be called each time you need to read a Work Completion */
struct ibv_cq *ev_cq;
void *ev_ctx;
struct ibv_wc wc;
int ret;
int ne;
/* Wait for the Completion event */
ret = ibv_get_cq_event(channel, &ev_cq, &ev_ctx); //blocks until a completion event arrives; ev_cq points to the CQ that produced it
if (ret) {
fprintf(stderr, "Failed to get CQ event\n");
return -1;
}
/* Ack the event */
ibv_ack_cq_events(ev_cq, 1);
/* Request notification upon the next completion event */
ret = ibv_req_notify_cq(ev_cq, 0);
if (ret) {
fprintf(stderr, "Couldn't request CQ notification\n");
return -1;
}
/* Empty the CQ: poll all of the completions from the CQ (if any exist) */
do {
ne = ibv_poll_cq(cq, 1, &wc);
if (ne < 0) {
fprintf(stderr, "Failed to poll completions from the CQ: ret = %d\n",
ne);
return -1;
}
/* there may be an extra event with no completion in the CQ */
if (ne == 0)
continue;
if (wc.status != IBV_WC_SUCCESS) {
fprintf(stderr, "Completion with status 0x%x was found\n",
wc.status);
return -1;
}
} while (ne);
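The ordering in the listing above (re-arm with ibv_req_notify_cq first, then empty the CQ) is what prevents a lost wake-up, and it is also why an event can occasionally arrive with nothing left in the CQ. The toy model below is only an illustration of that reasoning, not the verbs API: `struct demo_cq` and the `demo_*` functions are hypothetical, and the model assumes one-shot notification that fires only for completions added after arming.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of a CQ with one-shot notification (not the real struct ibv_cq). */
struct demo_cq {
    int pending; /* completions waiting in the queue */
    int armed;   /* set by demo_arm, cleared when an event fires */
    int events;  /* events delivered to the completion channel */
};

/* Models ibv_req_notify_cq: ask for an event on the next new completion. */
void demo_arm(struct demo_cq *cq)
{
    assert(cq != NULL);
    cq->armed = 1;
}

/* Models the NIC adding a completion; fires an event only if armed. */
void demo_complete(struct demo_cq *cq)
{
    assert(cq != NULL);
    cq->pending++;
    if (cq->armed) {
        cq->events++;
        cq->armed = 0;
    }
}

/* Models polling the CQ until it is empty; returns how many were taken. */
int demo_drain(struct demo_cq *cq)
{
    int n;

    assert(cq != NULL);
    n = cq->pending;
    cq->pending = 0;
    return n;
}

/* Let one completion land in the window between the two steps.
 * Returns 1 if it is stranded: still queued, with no event to wake us. */
int demo_stranded(int arm_first)
{
    struct demo_cq cq = { 0, 0, 0 };

    if (arm_first) {
        demo_arm(&cq);
        demo_complete(&cq); /* arrives between arm and drain */
        demo_drain(&cq);    /* picked up here; its event is the "extra" one */
    } else {
        demo_drain(&cq);
        demo_complete(&cq); /* arrives between drain and arm: no event fires */
        demo_arm(&cq);
    }
    return cq.pending > 0 && cq.events == 0;
}
```

In the arm-first case the completion is processed and merely leaves behind a spurious event; in the drain-first case it sits in the queue until some later completion happens to trigger a notification.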
2.2 I/O multiplexing with poll/epoll (non-blocking)
Use fcntl to set channel->fd to non-blocking; poll/epoll/select can then watch channel->fd for POLLIN, which means a new completion event has been generated for the CQ. Once woken, the program does not need to read or write as it would with a traditional socket (the data has already been DMAed directly into user-space buffers). Instead it calls ibv_poll_cq and processes each CQE.
struct ibv_context *context;
struct ibv_cq *cq;
void *ev_ctx = NULL; /* can be initialized with other values for the CQ context */
/* Create a CQ, which is associated with a Completion Event Channel */
cq = ibv_create_cq(context, 1, ev_ctx, channel, 0);
if (!cq) {
fprintf(stderr, "Failed to create CQ\n");
return -1;
}
/* Request notification before any completion can be created (to prevent races) */
ret = ibv_req_notify_cq(cq, 0);
if (ret) {
fprintf(stderr, "Couldn't request CQ notification\n");
return -1;
}
/* The following code will be called only once, after the Completion Event Channel
was created,to change the blocking mode of the completion channel */
int flags = fcntl(channel->fd, F_GETFL);
rc = fcntl(channel->fd, F_SETFL, flags | O_NONBLOCK);
if (rc < 0) {
fprintf(stderr, "Failed to change file descriptor of Completion Event Channel\n");
return -1;
}
/* The following code will be called each time you need to read a Work Completion */
struct pollfd my_pollfd;
struct ibv_cq *ev_cq;
void *ev_ctx;
struct ibv_wc wc;
int ne;
int ms_timeout = 10;
/*
* poll the channel until it has an event and sleep ms_timeout
* milliseconds between any iteration
*/
my_pollfd.fd = channel->fd;
my_pollfd.events = POLLIN; //only POLLIN needs to be watched; it means a new completion event has occurred
my_pollfd.revents = 0;
do {
rc = poll(&my_pollfd, 1, ms_timeout); //returns when a completion event arrives or the timeout expires
} while (rc == 0);
if (rc < 0) {
fprintf(stderr, "poll failed\n");
return -1;
}
ev_cq = cq;
/* Wait for the completion event */
ret = ibv_get_cq_event(channel, &ev_cq, &ev_ctx); //fetch the completion event. With level-triggered epoll, ibv_get_cq_event must be called to consume the event, otherwise epoll keeps waking up
if (ret) {
fprintf(stderr, "Failed to get cq_event\n");
return -1;
}
/* Ack the event */
ibv_ack_cq_events(ev_cq, 1); //ack the completion event
/* Request notification upon the next completion event */
ret = ibv_req_notify_cq(ev_cq, 0);
if (ret) {
fprintf(stderr, "Couldn't request CQ notification\n");
return -1;
}
/* Empty the CQ: poll all of the completions from the CQ (if any exist) */
do {
ne = ibv_poll_cq(cq, 1, &wc);
if (ne < 0) {
fprintf(stderr, "Failed to poll completions from the CQ: ret = %d\n",
ne);
return -1;
}
/* there may be an extra event with no completion in the CQ */
if (ne == 0)
continue;
if (wc.status != IBV_WC_SUCCESS) {
fprintf(stderr, "Completion with status 0x%x was found\n",
wc.status);
return -1;
}
} while (ne);
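The remark about level-triggered epoll can be checked with a small experiment. The sketch below is Linux-only and uses a pipe as a stand-in for the completion channel fd; `demo_lt_wakeups` is a hypothetical helper. It queues one pending "event" and counts how many consecutive `epoll_wait` calls report the fd as ready, with and without consuming the event (the step that ibv_get_cq_event plus ibv_ack_cq_events performs in real code).

```c
#include <assert.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Queue one event on a pipe registered with level-triggered epoll, then
 * call epoll_wait `rounds` times. If `consume` is non-zero, the event is
 * read after the first wake-up. Returns the number of wake-ups, or -1. */
int demo_lt_wakeups(int consume, int rounds)
{
    int p[2];
    int ep;
    int woke = 0;
    int i;
    struct epoll_event ev = { .events = EPOLLIN };
    struct epoll_event out;

    assert(rounds > 0);
    if (pipe(p) != 0)
        return -1;
    ep = epoll_create1(0);
    if (ep < 0) {
        close(p[0]);
        close(p[1]);
        return -1;
    }
    ev.data.fd = p[0];
    if (epoll_ctl(ep, EPOLL_CTL_ADD, p[0], &ev) != 0 ||
        write(p[1], "c", 1) != 1) {
        woke = -1;
        goto out;
    }
    for (i = 0; i < rounds; i++) {
        if (epoll_wait(ep, &out, 1, 0) == 1) {
            woke++;
            if (consume) {
                char c;
                /* stand-in for ibv_get_cq_event() + ibv_ack_cq_events() */
                if (read(p[0], &c, 1) != 1)
                    break;
            }
        }
    }
out:
    close(ep);
    close(p[0]);
    close(p[1]);
    return woke;
}
```

Consuming the event yields a single wake-up, while leaving it in place makes every subsequent `epoll_wait` return immediately.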
3. rpoll implementation (rsocket)
rsocket is a submodule shipped with the rdma_cm library (librdmacm) that offers an RDMA API closely mirroring the socket interface. This section analyzes the implementation of rpoll.
rpoll can watch both RDMA fds and ordinary socket fds, but for RDMA fds it currently supports only four events: POLLIN, POLLOUT, POLLHUP, and POLLERR.
/*
* Note that we may receive events on an rsocket that may not be reported
* to the user (e.g. connection events or credit updates). Process those
* events, then return to polling until we find ones of interest.
*/
int rpoll(struct pollfd *fds, nfds_t nfds, int timeout)
{
struct timeval s, e;
struct pollfd *rfds;
uint32_t poll_time = 0;
int ret;
do {
ret = rs_poll_check(fds, nfds); //actively check whether any event is pending
if (ret || !timeout) //return immediately if an event is pending or timeout is 0
return ret;
if (!poll_time)
gettimeofday(&s, NULL);
gettimeofday(&e, NULL);
poll_time = (e.tv_sec - s.tv_sec) * 1000000 +
(e.tv_usec - s.tv_usec) + 1;
} while (poll_time <= polling_time); //busy-poll for up to polling_time; return as soon as an event shows up, otherwise fall through to the blocking path
rfds = rs_fds_alloc(nfds); //allocate a shadow pollfd array, rfds, to pass to the native poll
if (!rfds)
return ERR(ENOMEM);
do {
ret = rs_poll_arm(rfds, fds, nfds); //arm every verbs fd and rewrite the watched events to POLLIN
if (ret)
break;
ret = poll(rfds, nfds, timeout); //call the native OS poll
if (ret <= 0)
break;
ret = rs_poll_events(rfds, fds, nfds); //translate completion events or rdma_cm events into concrete poll events
} while (!ret);
return ret;
}
rpoll calls rs_poll_check to poll for pending events:
static int rs_poll_check(struct pollfd *fds, nfds_t nfds)
{
struct rsocket *rs;
int i, cnt = 0;
for (i = 0; i < nfds; i++) {
rs = idm_lookup(&idm, fds[i].fd); //look up the rsocket object for this fd
if (rs)
fds[i].revents = rs_poll_rs(rs, fds[i].events, 1, rs_poll_all);
//check whether the rsocket has pending events and synthesize revents by hand
else
poll(&fds[i], 1, 0); //ordinary fd: poll once without blocking to see whether an event is pending
if (fds[i].revents)
cnt++;
}
return cnt;
}
static int rs_poll_rs(struct rsocket *rs, int events,
int nonblock, int (*test)(struct rsocket *rs))
{
struct pollfd fds;
short revents;
int ret;
check_cq:
if ((rs->type == SOCK_STREAM) && ((rs->state & rs_connected) ||
(rs->state == rs_disconnected) || (rs->state & rs_error))) {
rs_process_cq(rs, nonblock, test); //calls ibv_poll_cq to walk the CQEs
//for a send CQE, the handler can return the send buffer to the memory pool;
//for a recv CQE, the handler can update the readable data length, address, and so on
revents = 0;
if ((events & POLLIN) && rs_conn_have_rdata(rs)) //receive buffer has data: raise POLLIN
revents |= POLLIN;
if ((events & POLLOUT) && rs_can_send(rs)) //send buffer is writable: raise POLLOUT
revents |= POLLOUT;
if (!(rs->state & rs_connected)) {
if (rs->state == rs_disconnected)
revents |= POLLHUP; //disconnected: raise POLLHUP
else
revents |= POLLERR; //raise POLLERR
}
return revents;
} else if (rs->type == SOCK_DGRAM) { //datagram (UDP-style) logic, not covered here
ds_process_cqs(rs, nonblock, test);
revents = 0;
if ((events & POLLIN) && rs_have_rdata(rs))
revents |= POLLIN;
if ((events & POLLOUT) && ds_can_send(rs))
revents |= POLLOUT;
return revents;
}
if (rs->state == rs_listening) { //rdma_cm fd
fds.fd = rs->cm_id->channel->fd;
fds.events = events; //the requested events are passed through here rather than forced to POLLIN; why?
fds.revents = 0;
poll(&fds, 1, 0); //poll once without blocking, then return
return fds.revents;
}
if (rs->state & rs_opening) {
ret = rs_do_connect(rs);
if (ret && (errno == EINPROGRESS)) {
errno = 0;
} else {
goto check_cq;
}
}
if (rs->state == rs_connect_error) {
revents = 0;
if (events & POLLOUT)
revents |= POLLOUT;
if (events & POLLIN)
revents |= POLLIN;
revents |= POLLERR;
return revents;
}
return 0;
}
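The connected-stream branch of rs_poll_rs boils down to deriving poll flags from connection state rather than from the fd. A minimal sketch of that mapping follows; `struct demo_conn` and `demo_revents` are hypothetical simplifications, with plain integer fields standing in for the checks done by rs_conn_have_rdata and rs_can_send.

```c
#include <assert.h>
#include <poll.h>
#include <stddef.h>

/* Hypothetical, simplified view of a connection's state. */
struct demo_conn {
    int connected;    /* non-zero while the connection is up */
    int disconnected; /* non-zero after an orderly disconnect */
    int rx_bytes;     /* received bytes not yet read by the application */
    int tx_credits;   /* send slots currently available */
};

/* Synthesize revents for the requested events from connection state. */
short demo_revents(const struct demo_conn *c, short events)
{
    short revents = 0;

    assert(c != NULL);
    if ((events & POLLIN) && c->rx_bytes > 0)
        revents = (short)(revents | POLLIN);  /* data is ready to read */
    if ((events & POLLOUT) && c->tx_credits > 0)
        revents = (short)(revents | POLLOUT); /* room to send */
    if (!c->connected)                        /* reported even if not requested */
        revents = (short)(revents | (c->disconnected ? POLLHUP : POLLERR));
    return revents;
}
```

As with the native poll, POLLHUP and POLLERR are reported regardless of which events the caller asked for.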
After busy-polling for polling_time without seeing any event, and with the timeout not yet expired, rpoll calls rs_poll_arm, which does two things: 1) it arms every verbs fd (ibv_req_notify_cq); 2) it rewrites the watched events of all RDMA-related fds to POLLIN before handing them to the native poll function.
static int rs_poll_arm(struct pollfd *rfds, struct pollfd *fds, nfds_t nfds)
{
struct rsocket *rs;
int i;
for (i = 0; i < nfds; i++) {
rs = idm_lookup(&idm, fds[i].fd);
if (rs) { //RDMA-related fd
fds[i].revents = rs_poll_rs(rs, fds[i].events, 0, rs_is_cq_armed);
if (fds[i].revents)
return 1;
if (rs->type == SOCK_STREAM) {
if (rs->state >= rs_connected)
rfds[i].fd = rs->cm_id->recv_cq_channel->fd; //verbs fd, reports data-transfer events
else
rfds[i].fd = rs->cm_id->channel->fd; //rdma_cm fd, reports connection events
} else {
rfds[i].fd = rs->epfd;
}
rfds[i].events = POLLIN; //watched events are always rewritten to POLLIN
} else { //ordinary fd
rfds[i].fd = fds[i].fd;
rfds[i].events = fds[i].events;
}
rfds[i].revents = 0;
}
return 0;
}
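The fd substitution in rs_poll_arm can be isolated into a small, testable routine. The sketch below is a simplified illustration under assumed names (`struct demo_rsock`, `demo_lookup`, `demo_build_shadow` are all hypothetical, and a linear table replaces the real index map): it fills a shadow pollfd array in which each RDMA-backed entry is replaced by its underlying fd and watched for POLLIN only, while ordinary fds are copied unchanged.

```c
#include <assert.h>
#include <poll.h>
#include <stddef.h>

/* Hypothetical record for an RDMA-backed socket; not the real struct rsocket. */
struct demo_rsock {
    int app_fd;    /* fd value handed to the application */
    int connected; /* non-zero once the connection is established */
    int cm_fd;     /* stand-in for cm_id->channel->fd */
    int cq_fd;     /* stand-in for cm_id->recv_cq_channel->fd */
};

/* Linear lookup standing in for idm_lookup; returns NULL for ordinary fds. */
const struct demo_rsock *demo_lookup(const struct demo_rsock *tab, size_t ntab, int fd)
{
    for (size_t i = 0; i < ntab; i++)
        if (tab[i].app_fd == fd)
            return &tab[i];
    return NULL;
}

/* Fill shadow[i] with what the native poll should actually watch for fds[i]. */
void demo_build_shadow(const struct demo_rsock *tab, size_t ntab,
                       const struct pollfd *fds, struct pollfd *shadow, size_t nfds)
{
    assert(fds != NULL && shadow != NULL);
    for (size_t i = 0; i < nfds; i++) {
        const struct demo_rsock *rs = demo_lookup(tab, ntab, fds[i].fd);
        if (rs) {
            /* connected: completion channel fd; otherwise: rdma_cm fd */
            shadow[i].fd = rs->connected ? rs->cq_fd : rs->cm_fd;
            shadow[i].events = POLLIN; /* both fds only ever signal POLLIN */
        } else {
            shadow[i].fd = fds[i].fd;  /* ordinary fd: pass through as-is */
            shadow[i].events = fds[i].events;
        }
        shadow[i].revents = 0;
    }
}
```

Even a caller that asked only for POLLOUT ends up waiting on POLLIN for the underlying fd, because send-side readiness is also learned from completions.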
If the native poll sees an event before the timeout expires, rs_poll_events is called:
static int rs_poll_events(struct pollfd *rfds, struct pollfd *fds, nfds_t nfds)
{
struct rsocket *rs;
int i, cnt = 0;
for (i = 0; i < nfds; i++) {
if (!rfds[i].revents) //no event on this fd, skip it
continue;
rs = idm_lookup(&idm, fds[i].fd);
if (rs) {
fastlock_acquire(&rs->cq_wait_lock);
if (rs->type == SOCK_STREAM)
rs_get_cq_event(rs); //calls ibv_get_cq_event
else
ds_get_cq_event(rs);
fastlock_release(&rs->cq_wait_lock);
fds[i].revents = rs_poll_rs(rs, fds[i].events, 1, rs_poll_all); //synthesize revents by hand
} else {
fds[i].revents = rfds[i].revents; //ordinary fd: pass the events through unchanged
}
if (fds[i].revents)
cnt++;
}
return cnt;
}
In summary, rpoll works in two steps:
- Busy-poll for up to polling_time, checking whether any event is pending.
- If no event shows up within polling_time, register the verbs/rdma_cm fds with the native OS poll, rewrite the watched events to POLLIN, and call the native poll. A wake-up on a verbs/rdma_cm fd only means that a completion event or an rdma_cm event occurred; it cannot be returned to the user as-is. Additional logic is needed to decide whether to report an event at all, and which one.
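The two steps above amount to a spin-then-block wait. The sketch below shows the general shape under simplifying assumptions: `demo_hybrid_wait`, `demo_check_fn`, and `demo_countdown_check` are hypothetical names, readiness is tested through a caller-supplied callback, and the re-check loop that rpoll runs after the native poll wakes up is left out.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <poll.h>
#include <time.h>

/* Caller-supplied readiness test: returns > 0 when an event is pending. */
typedef int (*demo_check_fn)(void *arg);

/* Microseconds elapsed since *start on the monotonic clock. */
static long demo_elapsed_us(const struct timespec *start)
{
    struct timespec now;

    clock_gettime(CLOCK_MONOTONIC, &now);
    return (long)(now.tv_sec - start->tv_sec) * 1000000L +
           (now.tv_nsec - start->tv_nsec) / 1000L;
}

/* Step 1: busy-poll with check() for up to spin_us microseconds.
 * Step 2: if nothing was found, block in the native poll().
 * Returns 1 if found while spinning, 2 if poll() reported readiness,
 * 0 on timeout, -1 on error. */
int demo_hybrid_wait(demo_check_fn check, void *arg,
                     struct pollfd *pfds, nfds_t nfds,
                     long spin_us, int timeout_ms)
{
    struct timespec start;
    int rc;

    assert(check != NULL);
    clock_gettime(CLOCK_MONOTONIC, &start);
    do {
        if (check(arg) > 0)
            return 1;
    } while (demo_elapsed_us(&start) <= spin_us);

    rc = poll(pfds, nfds, timeout_ms);
    if (rc < 0)
        return -1;
    return rc > 0 ? 2 : 0;
}

/* Example check: reports ready once the counter behind arg reaches zero. */
int demo_countdown_check(void *arg)
{
    int *left = arg;

    if (*left > 0) {
        (*left)--;
        return 0;
    }
    return 1;
}
```

Spinning first keeps latency low when events arrive quickly, and falling back to a blocking poll avoids burning a CPU core while the connection is idle.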
4. Summary
In RDMA programming, the mainstream approach today is to establish connections with rdma_cm and then transfer data with verbs.
rdma_cm and ibverbs each create their own fd, and the two fds have different jobs. The rdma_cm fd reports connection-related events, while the verbs fd reports that new completions are available. Polling the rdma_cm fd directly with poll/epoll only ever yields POLLIN, which means an rdma_cm event has occurred. Polling the verbs fd directly with poll/epoll likewise only yields POLLIN, which means a new completion is available.
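Since both fds report nothing but POLLIN, a single loop can watch them together and dispatch on which one fired. The sketch below illustrates this with two pipes standing in for the rdma_cm fd and the completion channel fd; `demo_poll_both`, its self-check, and the `DEMO_*_READY` flags are hypothetical names, and the comments indicate which real calls would follow each wake-up.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <poll.h>
#include <unistd.h>

enum {
    DEMO_CM_READY = 1, /* connection-management event pending */
    DEMO_CQ_READY = 2  /* completion event pending */
};

/* Watch both fds for POLLIN; returns a bitmask of DEMO_*_READY, or -1. */
int demo_poll_both(int cm_fd, int cq_fd, int timeout_ms)
{
    struct pollfd pfds[2];
    int mask = 0;

    assert(cm_fd >= 0 && cq_fd >= 0);
    pfds[0].fd = cm_fd;
    pfds[0].events = POLLIN;
    pfds[0].revents = 0;
    pfds[1].fd = cq_fd;
    pfds[1].events = POLLIN;
    pfds[1].revents = 0;

    if (poll(pfds, 2, timeout_ms) < 0)
        return -1;
    if (pfds[0].revents & POLLIN)
        mask |= DEMO_CM_READY; /* real code: rdma_get_cm_event(), handle, ack */
    if (pfds[1].revents & POLLIN)
        mask |= DEMO_CQ_READY; /* real code: ibv_get_cq_event(), re-arm, ibv_poll_cq() */
    return mask;
}

/* Self-check with two pipes standing in for the two channel fds. */
int demo_poll_both_selftest(int post_cm, int post_cq)
{
    int cm[2];
    int cq[2];
    int mask = -1;

    if (pipe(cm) != 0)
        return -1;
    if (pipe(cq) != 0) {
        close(cm[0]);
        close(cm[1]);
        return -1;
    }
    if ((!post_cm || write(cm[1], "c", 1) == 1) &&
        (!post_cq || write(cq[1], "q", 1) == 1))
        mask = demo_poll_both(cm[0], cq[0], 0);
    close(cm[0]);
    close(cm[1]);
    close(cq[0]);
    close(cq[1]);
    return mask;
}
```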