Practice note: repeat the core word, not the whole sentence.
we tracked down the root cause (tracked down: found)
you noticed that the receive/send packet drops (RX/TX drops) were highly concentrated on VLAN 100 (key phrase: highly concentrated)
which is the traffic channel we specifically use for RoCE v2 storage and computing. (key phrase: traffic channel)
We tracked the switch counters and found a huge surge in CNPs (Congestion Notification Packets). (key phrases: switch counters; surge)
It turns out that this is a typical PFC (Priority Flow Control) deadlock. (key words: typical; deadlock)
The receiving end on another node in Rack 12 responded too slowly, causing buffer backlogs, which then had a cascading effect through the network architecture and ultimately stalled the distributed Megatron-LM training task on Node 42.
    cascade [kæˈskeɪd]: chain reaction. ultimately [ˈʌltɪmətli]: in the end (sound-alike: 奥体模特里). stall [stɔːl]: stop. Megatron [ˈmeɡətrɑːn] (sound-alike: 卖个床).
we cleared the queue by temporarily toggling the PFC configuration of the congested interfaces, which released the buffer deadlock. (congested [kənˈdʒestɪd] interfaces)
Then, we implemented a permanent fix: we updated our Ansible playbook to fine-tune the ECN threshold on the leaf switches. (fine-tune [tuːn]: make small adjustments)
We decreased the ECN marking threshold (decreased: lowered)
This enables the host's transport layer to smoothly reduce its transmission rate without triggering a full pause-frame storm. (key phrase: transport layer)
Let's find a time to discuss this matter next week. (casual meaning: let's sync up on this)
The error counters on ibstat have also completely subsided. (ibstat, sound-alike for "stat": 司大; subside [səbˈsaɪd]: die down)
ensure that the new ECN tuning remains stable under peak tensor-parallelism communication. (peak: highest load; parallelism [ˈpærəlelɪzəm])
Thank you for your rapid assistance with the switch metrics just now! (rapid [ˈræpɪd])
cropped up in Availability Zone C (cropped up: appeared unexpectedly; key phrase: Availability Zone)
The task initialization [ɪˌnɪʃələˈzeɪʃ(ə)n] went smoothly, but as soon as it entered the first gradient [ˈɡreɪdiənt] synchronization [ˌsɪŋkrənəˈzeɪʃ(ə)n] phase [feɪz], the entire [ɪnˈtaɪr] pipeline got stuck [stʌk]. (key phrases: initialization; gradient synchronization phase; got stuck: hung)
we started seeing severe [sɪˈvɪr] packet loss and TCP retransmission timeouts on the host side. (key phrase: severe packet loss)
I had them conduct a standard network connectivity test.
(conduct: carry out)
when we ran path MTU discovery using ping with the "do not fragment" bit set (key phrases: do not fragment; path MTU discovery)
someone replaced a faulty line card on spine-switch-03 (key phrase: faulty line card)
the new interface defaulted back to an MTU of 1500 instead of our standard 9000-byte jumbo frame setting. (key phrase: jumbo frame)
solve: find the cause and a fix. resolve: restore service after a fault.
I logged into the spine switch and updated the MTU
ensure the configuration took effect correctly. (effect [ɪˈfekt]; link "took" and "effect" when speaking)
once the MTU across the network architecture was uniformly set to 9000 (uniformly: consistently everywhere)
I'm going to submit a post-mortem work order so that the automation [ˌɔːtəˈmeɪʃ(ə)n] team can add a dynamic [daɪˈnæmɪk] verification rule to our CI/CD pipeline.
You can clear the high-priority alerts on the dashboard now.
I found that our path is going through a sub-optimal, high-latency third-party transit carrier, with the packet loss rate fluctuating between 5% and 8% (fluctuate [ˈflʌktʃueɪt]: vary up and down)
during a routine routing table refresh (routine [ruːˈtiːn]: regular, scheduled)
our low-latency dedicated MPLS circuit has failed over to the backup public network VPN tunnel. (dedicated [ˈdedɪkeɪtɪd]: reserved for one purpose; circuit [ˈsɜːrkɪt]; tunnel [ˈtʌnl])
To solve this problem and resume smooth data transmission (resume /rɪˈzuːm/: restart, restore)
I implemented asymmetric [ˌeɪsɪˈmetrɪk] routing intervention [ˌɪntərˈvenʃn]
forcing the traffic to switch back to the dedicated fiber-optic link. (fiber /ˈfaɪbər/)
explicitly [ɪkˈsplɪsətli] increased the BGP Local Preference [ˈprefrəns] attribute [ˈætrɪbjuːt] of the main line
flush out the cached [kæʃt] incorrect paths.
the moment routing convergence [kənˈvɜːrdʒəns] completed
Routing convergence: the state in which, after a routing change, every router has learned of the change.
the synchronization [ˌsɪŋkrənəˈzeɪʃ(ə)n] delay has dropped to within 2 minutes.
We just launched [lɔːntʃt] a large-scale batch-processing task to handle a massive [ˈmæsɪv] dataset through the Granite-Embedding model (launched, sound-alike: 浪吃)
triggered a large-scale scaling event.
our centralized [ˈsentrəlaɪzd] DNS server has completely stalled while handling internal service discovery requests.
the CPU utilization of the DNS pod has flatlined at the 100% mark.
Downstream vLLM sidecar containers are failing their initial [ɪˈnɪʃl] health checks due to their inability [ˌɪnəˈbɪləti] to resolve local database endpoints and are in the CrashLoopBackOff state. (CrashLoopBackOff: stuck in a crash-and-restart loop)
It seems that the sudden traffic from 500 new Container Network Interfaces (CNIs) attempting to register [ˈredʒɪstər] and perform service discovery simultaneously [ˌsaɪməlˈteɪniəsli] in a short period has amounted to an unintended Distributed Denial-of-Service (DDoS) attack on our internal resolver architecture. (perform: carry out; simultaneously: at the same time; denial [dɪˈnaɪəl]: refusal)
we've just implemented an emergency patch.
maximum [ˈmæksɪməm] Queries [ˈkwɪriz] Per Second (sound-alike: 马克思梦亏锐字)
preventing abnormal [æbˈnɔːrml] worker nodes from overwhelming [ˌoʊvərˈwelmɪŋ] the domain [doʊˈmeɪn] name resolution [ˌrezəˈluːʃn] of the entire cluster
DaemonSet (sound-alike: 弟们赛特)
This can directly intercept [ˌɪntərˈsept] DNS queries at the local host level and cache common Kubernetes service lookups locally.
the moment the DaemonSet completed the rolling update
The vLLM pods have now successfully resolved their endpoints and transitioned to the Running state.
The batch embedding task has finally started to progress.
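The notes mention capping "maximum Queries Per Second" so that one abnormal worker node cannot overwhelm cluster-wide name resolution, but they do not say how the cap was implemented. A token bucket is one common way to enforce such a limit. The sketch below is only an illustration under that assumption; the class name, the rate, and the burst size are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    """Per-client limiter: about `rate` queries per second, bursts up to `burst`."""
    rate: float         # tokens added per second (the assumed QPS cap)
    burst: float        # bucket capacity (largest allowed burst)
    tokens: float = 0.0
    last: float = 0.0   # time of the previous call, in seconds

    def __post_init__(self) -> None:
        self.tokens = self.burst  # start with a full bucket

    def allow(self, now: float) -> bool:
        """Return True if a query arriving at time `now` may proceed."""
        elapsed = max(0.0, now - self.last)
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def count_allowed(bucket: TokenBucket, arrivals: list[float]) -> int:
    """Count how many of the queries arriving at `arrivals` are admitted."""
    return sum(bucket.allow(t) for t in arrivals)


if __name__ == "__main__":
    node = TokenBucket(rate=10.0, burst=5.0)   # hypothetical per-node limits
    storm = [0.0] * 100                        # 100 queries in the same instant
    print(count_allowed(node, storm), "of", len(storm), "queries admitted")
```

With a limiter like this in front of the resolver, a node that fires a burst of lookups only gets its burst allowance through, and the rest are rejected or retried later instead of piling onto the central DNS pod.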
Over the next 30 minutes, I'll continue to track the cache hit-ratio metric to ensure that the local daemons are properly bearing the load. (ratio /ˈreɪʃioʊ/, sound-alike: 锐谁欧)
thanks for coming online and synchronizing [ˈsɪŋkrənaɪzɪŋ]
the automated pipeline initiated [ɪˈnɪʃieɪtɪd] an All-Reduce communication step across 200 NVIDIA H100 nodes for a massive [ˈmæsɪv] Llama-3.1 fine-tuning task. (initiated: started; massive: very large)
Right after the workload climbed, our monitoring dashboards all turned amber with alerts.
indicating severe BGP route flapping and significant packet loss on the core backbone [ˈbækboʊn] network. (BGP: Border Gateway Protocol; route flapping: routes repeatedly going up and down)
toggling [ˈtɑːɡlɪŋ] between "UP" and "DOWN" every few seconds
I started checking the telemetry data
interface status [ˈsteɪtəs]
a loose [luːs] optical module [ˈmɑːdʒuːl] (sound-alike: 毛猪欧) or a physical fiber link failure
but this flapping was occurring [əˈkɜːrɪŋ] simultaneously [ˌsaɪməlˈteɪniəsli] on multiple [ˈmʌltɪpl] (sound-alike: 谋踢剖) independent links.
It turns out that during the All-Reduce step, the highly synchronized and explosive /ɪkˈsploʊsɪv/ RDMA traffic completely overwhelmed the internal queues of the switches
the massive traffic load caused BGP Keepalive packets to be unceremoniously /ˌʌnˌserəˈmoʊniəsli/ dropped in the buffer queue (Keepalive: heartbeat message)
As neighboring switches didn't receive the heartbeat (Keepalive) messages in a timely manner, they determined that the peer had crashed.
so they withdrew the routes and then immediately tried to rebuild them as soon as the traffic eased slightly, creating a catastrophic [ˌkætəˈstrɑːfɪk] cycle of route flapping.
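The flapping mechanism described above (Keepalives dropped during a traffic burst, so the neighbor decides the peer is down) can be illustrated with a tiny model of a BGP hold timer. This is a simplified sketch, not real router behavior: it only looks at gaps between received Keepalives, and the timer values, the burst window, and the convention that the hold time is three times the Keepalive interval are assumptions chosen for illustration.

```python
def received_keepalives(interval: float, burst_start: float,
                        burst_end: float, end: float) -> list[float]:
    """Keepalives are sent every `interval` seconds up to `end`;
    any sent while the congestion burst is active are assumed dropped."""
    times = []
    t = interval
    while t <= end:
        if not (burst_start <= t <= burst_end):
            times.append(t)
        t += interval
    return times


def hold_timer_expiries(received: list[float], hold_time: float,
                        end: float) -> list[float]:
    """Return the times at which the neighbor's hold timer would expire,
    i.e. whenever more than `hold_time` passes without hearing a Keepalive."""
    expiries = []
    last_heard = 0.0
    for t in sorted(received) + [end]:
        if t - last_heard > hold_time:
            expiries.append(last_heard + hold_time)
        last_heard = t
    return expiries


if __name__ == "__main__":
    # Hypothetical congestion burst lasting from t=11 s to t=22 s.
    aggressive = received_keepalives(3.0, 11.0, 22.0, 30.0)
    relaxed = received_keepalives(10.0, 11.0, 22.0, 30.0)
    print("3 s keepalive / 9 s hold:", hold_timer_expiries(aggressive, 9.0, 30.0))
    print("10 s keepalive / 30 s hold:", hold_timer_expiries(relaxed, 30.0, 30.0))
```

In this toy model the short timers lose the session during the burst while the longer timers ride it out, which is the intuition behind giving the control plane a wider time window.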
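To stabilize [ˈsteɪbəlaɪz] the network architecture
we've just carried out manual intervention
adjusted the Keepalive interval to 10 seconds
This gives the control plane a more generous [ˈdʒenərəs] time window to survive the momentary [ˈmoʊmənteri] buffer starvation [stɑːrˈveɪʃn] during the heavy AI collective [kəˈlektɪv] communication phase. (momentary: brief; starvation: being starved of resources; collective communication: group operations such as All-Reduce)
In addition [əˈdɪʃn], we adjusted the Control Plane Policing (CoPP) policy to strictly assign BGP protocol [ˈproʊtəkɑːl] traffic to the absolute priority queue, ensuring that routing Keepalive packets can completely bypass the standard data-plane buffers.
The training task has successfully resumed its training epoch [ˈiːpɒk] without throwing any transmission errors.
the next large-scale synchronization burst [bɜːrst].

I. "Do Not Fragment" and "Path MTU Discovery"
In networking, the MTU (Maximum Transmission Unit) works like a weight limit for trucks on a road. Ordinary Ethernet usually has an MTU of 1500 bytes (a small van). High-performance networks in AI data centers usually enable jumbo frames, where the MTU can reach 9000 bytes (a big truck), to move large AI weight and gradient data quickly.

1. What is "Do Not Fragment" (DF)?
Normally, if a 9000-byte "big truck" reaches a "small intersection" that only allows 1500 bytes, the router at that intersection unloads the cargo and repacks it into several 1500-byte vans. This process is called fragmentation.
Fragmentation and reassembly are very CPU-intensive, so in latency-sensitive AI training we do not allow fragmentation at all. When sending packets we therefore set the "Do Not Fragment" (DF) flag. It means: "I want the parcel at exactly this size. If any intersection on the way cannot take it, drop it; do not split it."

2. What is "Path MTU Discovery" (PMTUD)?
Traffic from server A to server B may cross many switches and routers, so the MTU of the whole path is set by its narrowest bottleneck (the shortest-stave-of-the-barrel effect). Path MTU discovery uses the DF behavior to test how large a vehicle the entire road can carry.
Example: suppose your server is configured with MTU 9000 and you want to check that the path to the storage server is clear end to end. In a terminal, run:
    ping -M do -s 8972 10.0.0.5
-M do means "do not fragment". -s 8972 is the payload size. Adding the ICMP header (8 bytes) and the IP header (20 bytes) gives exactly 9000 bytes in total.
Reading the result:
Case A (healthy): the command gets replies, so every switch on the path supports 9000-byte packets.
Case B (fault): the command returns the error "frag needed and DF set", or the packets vanish without any reply (packet loss). Either way, some switch in the middle (for example, the line card replaced last night) is limited to 1500 bytes, and the large packets are being dropped at that hop. This is why the Qwen-2.5 training job hung in the previous dialogue.

II. What exactly does "the entire interface breakout group" mean?
In AI data centers, switch chips run at very high per-port speeds to maximize bandwidth; a single port can carry 400G or 800G. Sometimes, however, we do not need one very fat 400G pipe. Instead we split it into four 100G pipes that connect to four different servers. This technique is called breakout (port splitting, one port into many).
1.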
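What is an Interface Breakout Group?
When you split one physical port on a switch (for example, Port 1) from 400G into four 100G logical ports (for example, Port 1/1, Port 1/2, Port 1/3, Port 1/4), the four resulting sub-interfaces still physically share the same hardware slot on the chip. Together they form a breakout group.
2. Example
The previous dialogue mentioned that someone replaced a line card on the spine switch and the new interface defaulted back to an MTU of 1500. That physical port connected several servers in the same rack through a one-to-four breakout cable. Changing the MTU to 9000 one interface at a time would be slow, and it would be easy to miss one. So I went into the configuration of the parent physical port and applied a bulk operation to the entire interface breakout group:
    # Pseudocode example (not a specific vendor's syntax)
    interface Eth 1/1              # enter the parent port
      breakout 4x100g              # confirm it is split into four 100G sub-interfaces
    interface range Eth 1/1/1-4    # select all four sub-interfaces of the breakout group
      mtu 9000                     # set the MTU of the whole group to 9000 in one step
      shutdown
      no shutdown                  # bounce the interfaces together so the configuration takes effect
Summary: the sentence means that the engineer did not fix the links one by one. Instead, on the switch, they corrected every sub-interface behind that one-to-four breakout cable back to the 9000-byte jumbo frame setting in a single pass, which resolved the AI training hang caused by large packets being unable to get through.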
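As a small companion to the two explanations above, the sketch below does two things: it derives the ping payload size for a DF test from an MTU (MTU minus the 20-byte IPv4 header and the 8-byte ICMP header), and it lists which members of a breakout group do not have the expected MTU. The interface names and the configured-MTU table are hypothetical sample data, not output from a real switch.

```python
IPV4_HEADER_BYTES = 20   # IPv4 header without options
ICMP_HEADER_BYTES = 8    # ICMP echo header


def df_ping_payload(mtu: int) -> int:
    """Payload size to pass to `ping -s` so the whole packet equals `mtu`."""
    return mtu - IPV4_HEADER_BYTES - ICMP_HEADER_BYTES


def breakout_members(parent: str, lanes: int) -> list[str]:
    """Names of the sub-interfaces in a breakout group (naming is illustrative)."""
    return [f"{parent}/{i}" for i in range(1, lanes + 1)]


def mtu_mismatches(configured: dict[str, int], parent: str,
                   lanes: int, expected: int) -> list[str]:
    """Sub-interfaces whose MTU is missing or differs from `expected`."""
    return [name for name in breakout_members(parent, lanes)
            if configured.get(name) != expected]


if __name__ == "__main__":
    # Hypothetical state after a line-card swap: one lane reset, one unknown.
    configured = {"Eth1/1/1": 9000, "Eth1/1/2": 1500, "Eth1/1/3": 9000}
    print("ping payload for MTU 9000:", df_ping_payload(9000))
    print("needs fixing:", mtu_mismatches(configured, "Eth1/1", 4, 9000))
```

Checking the whole group at once mirrors the bulk fix described above: any sub-interface that was reset or never reported shows up in one list instead of being discovered one link at a time.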