Core idea
Let operators replay a TCP trace repeatedly and deterministically, so that arbitrary tools can be applied to the replay for diagnosis. Oh, if TCP can be replayed, different bottlenecks can be quantified, and without modifying the application!
Main challenge
The butterfly effect: a small timing variation causes a chain reaction between TCP and the network that drives the system to a completely different state in the replay!
The classic per-connection approach is to collect a packet trace and trace through the TCP execution to find the root cause, but this does not scale to the huge number of connections in a large data center. Deterministic replay is a standard diagnosis technique, so the paper asks: can we do deterministic replay for TCP?
Why it is hard: TCP's behavior is tightly coupled with the rest of the networked system:
- application
- network
- other TCP connections traversing through a common switch
- kernel
Hence the butterfly effect: in the replay, packet send times inevitably differ from the original run, and these tiny differences compound through interactions with clock drift, context switching, kernel scheduling, and cache state. The switch then takes different actions; because TCP is ACK-clocked, TCP behaves differently (e.g., the window evolves differently), which changes the flow rate, which changes the switch queues, and the loop continues... Solution: replay each TCP connection separately, by identifying the minimal set of signals that capture all the interactions between a TCP connection and the application and the network, and recording these signals at the hosts in a lightweight manner. Multiple connections can then be replayed independently at the same time (a rough sketch of such a per-connection record follows).
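As a concrete illustration of "minimal set of signals", here is a hedged sketch of what a lightweight per-connection record could hold. The field names are hypothetical (not from the paper or its code); the individual signals are detailed in the Record section below.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class ConnectionRecord:
    """Hypothetical per-connection record for deterministic TCP replay.
    Only host-side signals are kept; no per-switch state is recorded."""
    socket_calls: List[Tuple[float, str, int]] = field(default_factory=list)          # (time, call name, byte count)
    kernel_events: List[Tuple[float, str]] = field(default_factory=list)              # handler calls, variable reads, lock order
    dropped_ip_ids: List[int] = field(default_factory=list)                           # drops inferred from IP_ID gaps
    ecn_marked_ip_ids: List[int] = field(default_factory=list)                        # packets that arrived with CE marks
    sampled_send_times: List[Tuple[int, float, float]] = field(default_factory=list)  # (packet index, send time, rate)
```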
Record
- capture the interactions of a TCP connection with the application: socket calls
- capture the interactions of a TCP connection with the kernel: TCP handler function calls from the kernel, reads of kernel variables, and thread scheduling (record when each thread acquires the lock).
- capture the interactions of a TCP connection with the network: there is no need to record every switch's actions, only the end result. At the receiver, detect packet drops by checking whether the IP_ID fields are continuous, and detect ECN by checking the ECN bits (see the sketch after this list).
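A hedged sketch of the receiver-side inference in the last bullet. It assumes the sender increments IP_ID by one per packet of the connection and that packets are not reordered; the function name is my own, not the paper's.

```python
def detect_network_signals(received_headers):
    """Infer drops and ECN marks for one connection at the receiver.

    `received_headers` is a list of (ip_id, ecn_bits) pairs in arrival
    order. A gap in consecutive IP_IDs is treated as dropped packets;
    ecn_bits == 0b11 (Congestion Experienced) is treated as an ECN mark.
    """
    drops, ecn_marks = [], []
    prev_id = None
    for ip_id, ecn_bits in received_headers:
        if prev_id is not None:
            gap = (ip_id - prev_id) % 65536   # IP_ID wraps at 16 bits
            for k in range(1, gap):
                drops.append((prev_id + k) % 65536)
        if ecn_bits == 0b11:
            ecn_marks.append(ip_id)
        prev_id = ip_id
    return drops, ecn_marks

# Example: IP_ID 102 never arrived (a drop), and 103 carried a CE mark.
print(detect_network_signals([(100, 0b10), (101, 0b10), (103, 0b11)]))
# -> ([102], [103])
```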
Replay
- Replay packet send times: the send times must first be recorded accurately, but recording all of them has a high storage overhead, so DETER samples them. Rate-based sampling: gap-based sampling fails to sample packets when the packet rate changes, so instead of recording the burst length, DETER records the packet rate. When the sending time inferred from the recorded packet rate is wrong (i.e., the difference from the actual time is above the threshold th), we sample a new packet (see the sketch after this list).
- Replay switch queues: recording every enqueue and dequeue of every queue is far too complex, so simply push all the packets into the network at the right times, with special handling for dropped packets.
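A hedged sketch of the rate-based sampling idea from the first bullet, under my own simplifying assumptions (one rate per sample, threshold `th` in seconds); it is an illustration of the technique, not DETER's implementation.

```python
def rate_based_sample(send_times, th):
    """Sample packet send times for replay.

    Each sample is (packet index, send time, packet rate). During replay,
    the send time of packet i is inferred from the most recent sample as
    t0 + (i - i0) / rate; a new sample is recorded only when that inference
    is off by more than `th`, which keeps the stored trace small.
    Assumes at least two recorded send times, in seconds.
    """
    rate = 1.0 / max(send_times[1] - send_times[0], 1e-9)  # seed rate from the first gap
    samples = [(0, send_times[0], rate)]
    for i, actual in enumerate(send_times[1:], start=1):
        i0, t0, r0 = samples[-1]
        inferred = t0 + (i - i0) / r0
        if abs(inferred - actual) > th:
            # Inference drifted too far: take a new sample and refresh the rate.
            new_rate = (i - i0) / max(actual - t0, 1e-9)
            samples.append((i, actual, new_rate))
    return samples
```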
Implementation and evaluation
The implementation is very solid, with lots of detail.
Through TCP replay on Spark, a datacenter workload, and a testbed, the authors found some interesting problems:
- Spark application CPU bottleneck: the tail latency of flows from HDFS is mostly caused by the receiver limit, because their receive windows frequently reach zero. For flows shorter than 1MB, tail latency is mostly caused by packet drops (RTO or fast retransmission (FR)). For flows longer than 10MB, tail latency is mostly caused by the receive window frequently reaching zero (Rwnd=0). The flows in the range [100KB,1MB] are of particular interest, because most of their tail latencies (18 out of 24) are caused by multiple delayed ACKs (40ms delays). TCP explicitly delays the ACK because the free space in the receive buffer is shrinking, which suggests that the root cause is the application not reading data out of the receive buffer in time.
- Datacenter workload, root cause of short flows' tail FCT: RTO is not the main root cause of tail latency. RTO is a widely discussed reason for tail latency, but it is actually rare in this experiment, because when there are multiple requests on the same connection, later requests help recover the packet losses of earlier requests, so TCP loss recovery is effective in this scenario. Instead, fast retransmission (FR) is delayed for tens of milliseconds, because the connection only starts FR after receiving 10 dupACKs (normally 3), and many small flows do not generate enough ACKs! The bug: when out-of-order packets are received, the dupACK threshold keeps increasing (see the sketch after this list).
- Testbed, root cause of RTOs: not enough dupACKs to trigger fast retransmission, plus an overly large receive buffer size: a large buffer causes ACKs to keep carrying the window-update flag with an ever larger window size, and TCP does not count window-update ACKs as duplicate ACKs.
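A hedged toy model of the two dupACK pitfalls in the last two bullets (the adaptive threshold and the exclusion of window-update ACKs); it mimics Linux-like behavior in spirit but is not the kernel code, and the class and method names are mine.

```python
class DupAckCounter:
    """Toy dupACK counter illustrating why fast retransmission stalls."""

    def __init__(self, dupthresh=3):
        self.dupthresh = dupthresh
        self.snd_una = 0        # highest cumulatively acknowledged sequence
        self.last_window = 0    # last advertised receive window
        self.dupacks = 0

    def on_reordering(self, reorder_degree):
        # Observed reordering raises the threshold to avoid spurious FR;
        # once it reaches ~10, short flows rarely produce enough dupACKs.
        self.dupthresh = max(self.dupthresh, reorder_degree)

    def on_ack(self, ack_seq, window):
        """Return True when fast retransmission should fire."""
        if ack_seq > self.snd_una:
            self.snd_una, self.last_window, self.dupacks = ack_seq, window, 0
            return False
        if window != self.last_window:
            # Window-update ACK: not counted as a duplicate, so a receiver
            # that keeps enlarging its advertised window can starve FR.
            self.last_window = window
            return False
        self.dupacks += 1
        return self.dupacks >= self.dupthresh
```

With the threshold pushed to 10, a short flow that only elicits a handful of ACKs after a loss never reaches it and has to wait for an RTO, which matches the findings above.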
Switch queue replay: validated on a testbed (with FPGAs) and in simulations (with ns3).
- Diagnosing an RTO problem from queue information: two flows each lost a whole window of data at once, triggering RTOs. The switch uses a shared buffer: the threshold of any queue is proportional to the total free buffer size, so if the switch buffer utilization suddenly increases, the threshold shrinks, which causes a temporary blackhole at the almost-full queues (see the sketch below).
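A hedged sketch of the proportional-threshold behavior described above (the classic dynamic-threshold shared-buffer scheme; the alpha value and buffer sizes here are made up for illustration, not taken from the paper's switch).

```python
def queue_threshold(total_buffer, queue_lens, alpha=1.0):
    """Each queue may only grow while its length stays below
    alpha * (free buffer); arrivals to a queue above the threshold are dropped."""
    free = total_buffer - sum(queue_lens)
    return alpha * free

# A queue holding 80 KB is fine while the rest of the buffer is empty...
print(queue_threshold(1_000_000, [80_000, 0, 0, 0]))                    # 920000.0
# ...but once other queues fill up, the threshold drops below 80 KB and
# the almost-full queue becomes a temporary blackhole until it drains.
print(queue_threshold(1_000_000, [80_000, 300_000, 300_000, 250_000]))  # 70000.0
```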
Limitations:
Requires kernel modifications, and both the sender and the receiver must deploy it.