    Physical memory failure causes CPU lockup

    Symptoms:

    After a memory-intensive program had been running for a relatively long time, the terminal repeatedly printed the warning “BUG: soft lockup - CPU#? stuck for ?s”.
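    To see how often the warning appears and on which CPUs, the kernel log can be summarized with a small script. This is only a sketch: the function name `summarize_lockups` and the sample lines are hypothetical, and the pattern assumes the message follows the shape quoted above, with an optional process name after it.

```python
import re
from collections import Counter

# Matches lines such as "BUG: soft lockup - CPU#3 stuck for 22s!".
# Both "-" and "–" are accepted, since blog software often rewrites the dash.
LOCKUP_RE = re.compile(r"BUG: soft lockup\s*[-–]\s*CPU#(\d+) stuck for (\d+)s")


def summarize_lockups(log_text):
    """Return {cpu: (warning_count, longest_stall_seconds)} for a log text."""
    counts = Counter()
    longest = {}
    for match in LOCKUP_RE.finditer(log_text):
        cpu = int(match.group(1))
        seconds = int(match.group(2))
        counts[cpu] += 1
        longest[cpu] = max(longest.get(cpu, 0), seconds)
    return {cpu: (counts[cpu], longest[cpu]) for cpu in counts}


if __name__ == "__main__":
    # Hypothetical sample lines, not taken from the affected workstation.
    sample = (
        "kernel: BUG: soft lockup - CPU#3 stuck for 22s! [myprog:1234]\n"
        "kernel: BUG: soft lockup - CPU#3 stuck for 61s! [myprog:1234]\n"
        "kernel: BUG: soft lockup - CPU#0 stuck for 23s! [kdmflush:321]\n"
    )
    for cpu, (count, worst) in sorted(summarize_lockups(sample).items()):
        print(f"CPU#{cpu}: {count} warning(s), longest stall {worst}s")
```

    The text passed in would typically be the saved output of `dmesg` or a system log file. Warnings spread across many CPUs rather than tied to one are at least consistent with a memory problem, although they do not prove it.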

    Meanwhile, remote terminals such as ssh became extremely slow to respond, or stopped responding altogether.

    Checking the running processes (with top or ps) showed that, besides my program, several system processes were also using a large amount of CPU time, including ksoftirqd, kthreadd, kondemand, flush-???, and kdmflush.

    Sometimes a machine in this condition could still boot normally, and it showed no symptoms right after booting.

    Cause:

    These symptoms appeared on my workstation many times, and on several of those occasions memory error warnings showed up intermittently during reboot. I had not modified the system in any way, including its modules and drivers, so it is highly unlikely that the symptoms were caused by an actual kernel error.

    After checking, I am confident that these problems originated from physical memory failure. When memory cannot be read or written properly, some instructions cannot complete their instruction fetch or data fetch. This stalls the CPU, and the watchdog mistakes the stall for a soft lockup.
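    On machines with ECC memory and a loaded EDAC driver, Linux exposes per-controller error counters in sysfs, which can support or weaken the memory-failure theory. The sketch below reads them. It assumes the standard `/sys/devices/system/edac/mc` layout with `ce_count` and `ue_count` files; whether the workstation described here had ECC memory at all is not known, and the function name is hypothetical.

```python
from pathlib import Path


def read_edac_counts(edac_root="/sys/devices/system/edac/mc"):
    """Return {controller: {"ce": int|None, "ue": int|None}} from EDAC sysfs.

    "ce" is the corrected error count, "ue" the uncorrected error count.
    An empty dict means the directory is missing, for example because the
    memory is not ECC or no EDAC driver is loaded.
    """
    results = {}
    root = Path(edac_root)
    if not root.is_dir():
        return results
    for controller in sorted(root.glob("mc[0-9]*")):
        counts = {}
        for key, filename in (("ce", "ce_count"), ("ue", "ue_count")):
            try:
                counts[key] = int((controller / filename).read_text().strip())
            except (OSError, ValueError):
                counts[key] = None  # counter unreadable on this system
        results[controller.name] = counts
    return results


if __name__ == "__main__":
    counts = read_edac_counts()
    if not counts:
        print("No EDAC memory controllers found")
    for name, values in counts.items():
        print(f"{name}: corrected={values['ce']} uncorrected={values['ue']}")
```

    Non-zero counters point toward faulty memory, but zero counters do not rule it out, since non-ECC memory reports nothing through this interface.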

    Solution:

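    My solution is simple and crude: remove every faulty memory module. When there is no warning to identify the faulty module, you have to find it by elimination. On machines with dual CPUs and dual-channel memory, this usually means removing a pair of modules in corresponding slots each time.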
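    The elimination procedure can be written down as a simple loop. This is a sketch under two stated assumptions: only one group of modules is faulty, and each trial means physically removing the group, rerunning the workload, and reporting whether the soft lockup warnings stayed away. The slot labels and the `passes_without` callback are hypothetical.

```python
def find_suspect_groups(groups, passes_without):
    """Remove one group of modules at a time and record which removals help.

    groups: list of tuples of slot labels that are removed together,
        for example one pair of corresponding slots per tuple.
    passes_without: callable that takes the removed group and returns True
        when the workload then ran without soft lockup warnings.
    Returns the groups whose removal made the symptoms disappear.
    """
    suspects = []
    for group in groups:
        # All other modules stay installed for this trial.
        if passes_without(group):
            suspects.append(group)
    return suspects


if __name__ == "__main__":
    # Hypothetical slot labels; real layouts depend on the motherboard.
    pairs = [("A1", "B1"), ("A2", "B2"), ("A3", "B3")]

    # Simulated trial for illustration: pretend the module in B2 is bad.
    # In practice this step is a manual hardware swap and a test run.
    faulty = {"B2"}

    def simulated_trial(removed):
        return faulty.issubset(removed)

    print(find_suspect_groups(pairs, simulated_trial))
```

    If more than one group is faulty, no single removal clears the symptoms and this loop returns an empty list. In that case, testing with only one group installed at a time is a reasonable variation.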
