    Physical memory failure causes CPU lockup

    Symptoms:

    While a memory-intensive program had been running for a relatively long time, the terminal repeatedly printed the warning “BUG: soft lockup - CPU#? stuck for ?s”.
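To see how often the warning appears and on which CPUs, the kernel log can be scanned for it. The following is only a minimal sketch: the function names `count_soft_lockups` and `read_kernel_log` are made up for illustration, and it assumes the message has the form quoted above and that `dmesg` is available and readable by the current user.

```python
import re
import subprocess
from collections import Counter

# Matches lines such as "BUG: soft lockup - CPU#3 stuck for 22s!".
# A hyphen or an en dash is accepted, because the dash is often altered
# when log text is pasted into a web page.
SOFT_LOCKUP = re.compile(r"BUG: soft lockup [-\u2013] CPU#(\d+) stuck for (\d+)s")


def count_soft_lockups(log_text):
    """Return a Counter mapping CPU number to the number of warnings seen."""
    counts = Counter()
    for match in SOFT_LOCKUP.finditer(log_text):
        counts[int(match.group(1))] += 1
    return counts


def read_kernel_log():
    """Return the kernel ring buffer as text (assumption: dmesg is permitted)."""
    result = subprocess.run(["dmesg"], capture_output=True, text=True, check=True)
    return result.stdout
```

Calling `count_soft_lockups(read_kernel_log())` on the affected machine gives a per-CPU tally. Warnings spread across many CPUs would fit the memory explanation at least as well as a single misbehaving driver would.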

    Meanwhile, remote terminals such as ssh became extremely slow to respond, or stopped responding altogether.

    After checking the status of processes (with top or ps), I found that, besides my program, several system processes were also consuming a large amount of CPU time, including ksoftirqd, kthreadd, kondemand, flush-??? and kdmflush.
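The same check can be scripted instead of read off the screen. This is a rough sketch, not the exact commands I ran: `parse_ps`, `busiest` and `run_ps` are hypothetical helper names, and it assumes a `ps` that accepts `-eo pid,pcpu,comm` and prints one header line.

```python
import subprocess


def parse_ps(ps_output):
    """Parse `ps -eo pid,pcpu,comm` output into (pid, cpu_percent, command) tuples."""
    rows = []
    for line in ps_output.splitlines()[1:]:  # skip the header line
        parts = line.split(None, 2)
        if len(parts) != 3:
            continue  # ignore blank or malformed lines
        pid, pcpu, comm = parts
        rows.append((int(pid), float(pcpu), comm.strip()))
    return rows


def busiest(rows, limit=5):
    """Return the `limit` rows with the highest CPU percentage, highest first."""
    return sorted(rows, key=lambda row: row[1], reverse=True)[:limit]


def run_ps():
    """Run ps and return its text output (assumption: procps-style ps on PATH)."""
    result = subprocess.run(
        ["ps", "-eo", "pid,pcpu,comm"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

`busiest(parse_ps(run_ps()))` lists the heaviest CPU consumers. Kernel threads such as ksoftirqd appearing next to the user program is the pattern described above.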

    Sometimes a machine in this condition would still boot normally and show no symptoms right after booting.

    Cause:

    These symptoms appeared on my workstation many times, and on some of those occasions memory error warnings showed up intermittently during reboot. I had not made any manual changes to the system, including its modules and drivers, so it is highly unlikely that the symptoms were caused by an actual kernel error.

    I believe these problems originated from physical memory failure. When memory cannot be read or written properly, an instruction can get stuck while it is being fetched or while its data is being fetched. This can stall the CPU, and the watchdog then mistakes the stall for a soft lockup.

    Solution:

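    My solution was simple and crude: remove every faulty memory module. When there is no warning information, you have to find the faulty module by elimination. On a machine with dual CPUs and dual-channel memory, this usually means removing one pair of modules in corresponding slots at a time.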
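The elimination procedure can be written down as a small loop. This is only a sketch of the bookkeeping: `find_suspect_pairs` is a hypothetical name, and `lockup_occurs` stands for a manual test run (install the given pairs, run the memory-heavy program, watch for the warning) that cannot actually be automated. It also assumes that only one pair is faulty; with several faulty pairs, removing a single pair will not make the lockups stop.

```python
def find_suspect_pairs(pairs, lockup_occurs):
    """Return the pairs whose removal makes the lockups disappear.

    pairs         -- list of module pairs, e.g. [("A1", "B1"), ("A2", "B2")]
    lockup_occurs -- hypothetical callback: takes the list of pairs left
                     installed and returns True if soft lockups still appear
    """
    suspects = []
    for pair in pairs:
        # Take one pair out and test with everything else still installed.
        remaining = [other for other in pairs if other != pair]
        if not lockup_occurs(remaining):
            suspects.append(pair)
    return suspects
```

In practice each call to `lockup_occurs` is a shutdown, a hardware change and a long test run, so it pays to note the result of every configuration as you go.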