Missing 2 cpu's
#1
I sent a Maya scene to render on my Onyx2 and noticed I only had 6 CPUs active, although 8 are installed. I went back to the console, and it appears I might have some bad memory that's disabling the CPUs. I wasn't able to copy/paste from the console, so I typed this out; hopefully I didn't miss anything.


Code:
DIAG RESULTS:
/hw/module/1/slot/n1/node/mem: MEMBANK(S) 6 disabled
reason: Bank 6: some DIMMs failed mem test.
/hw/module/1/slot/n2/node/cpu/0: CPU A disabled
Reason: PROM copied to memory (bank 0) is bad.
/hw/module/1/slot/n2/node/cpu/1: CPU B disabled
Reason: PROM copied to memory (bank 0) is bad.
/hw/module/1/slot/n2/node/mem: MEMBANK(S) 1 disabled
Reason: Bank 1: Some DIMMS failed mem test.
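
To see at a glance which node board each message refers to, the output above can be grouped by the module and slot in its hardware path. This is only a rough sketch: the function name `disabled_by_slot` and the regular expression are made up for illustration, and the input is the hand-typed text from this post, not a guaranteed console format.

```python
import re
from collections import defaultdict

# Diagnostic lines exactly as typed out in the post (hand-copied from the console).
DIAG = """\
/hw/module/1/slot/n1/node/mem: MEMBANK(S) 6 disabled
reason: Bank 6: some DIMMs failed mem test.
/hw/module/1/slot/n2/node/cpu/0: CPU A disabled
Reason: PROM copied to memory (bank 0) is bad.
/hw/module/1/slot/n2/node/cpu/1: CPU B disabled
Reason: PROM copied to memory (bank 0) is bad.
/hw/module/1/slot/n2/node/mem: MEMBANK(S) 1 disabled
Reason: Bank 1: Some DIMMS failed mem test.
"""

# Hypothetical pattern: captures module number, slot name, component path
# and the description of what was disabled.
PATH_RE = re.compile(r"^/hw/module/(\d+)/slot/(n\d+)/node/(\S+): (.+) disabled$")


def disabled_by_slot(text):
    """Group disabled components by (module, slot), each paired with its reason."""
    result = defaultdict(list)
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    for i, line in enumerate(lines):
        m = PATH_RE.match(line)
        if not m:
            continue
        module, slot, _component, what = m.groups()
        reason = ""
        # The reason is on the following line; its capitalisation varies.
        if i + 1 < len(lines) and lines[i + 1].lower().startswith("reason:"):
            reason = lines[i + 1].split(":", 1)[1].strip()
        result[(int(module), slot)].append((what, reason))
    return dict(result)


if __name__ == "__main__":
    for (module, slot), items in sorted(disabled_by_slot(DIAG).items()):
        print(f"module {module}, slot {slot}:")
        for what, reason in items:
            print(f"  {what}: {reason}")
```

Grouped this way, slot n1 has one disabled memory bank, while slot n2 has one disabled bank and both disabled CPUs.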
   
The system shows 15GB active. It's been a while, but if I remember correctly there's 16GB installed.
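
As a quick sanity check, the missing 1GB can be compared against the two disabled banks. This is a back-of-the-envelope sketch, not a statement about the actual DIMM configuration: it assumes the 16GB figure is right, that all banks are the same size, and that there are four node boards (8 CPUs at two per board).

```python
# Figures quoted in the post; the 16 GB total is from the poster's memory.
installed_gb = 16
active_gb = 15
disabled_banks = 2  # bank 6 on n1 and bank 1 on n2, per the diag output
node_boards = 4     # assumption: 8 CPUs at 2 per node board

missing_gb = installed_gb - active_gb

# Assumption: all banks are the same size, so each disabled bank
# accounts for an equal share of the missing memory.
gb_per_bank = missing_gb / disabled_banks
banks_total = installed_gb / gb_per_bank
banks_per_node = banks_total / node_boards

print(f"missing: {missing_gb} GB, implied bank size: {gb_per_bank} GB")
print(f"implied banks: {banks_total:.0f} total, {banks_per_node:.0f} per node board")
```

Under those assumptions the numbers are at least self-consistent: two disabled banks of 0.5GB each would account for exactly the 1GB that is missing.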

Should I pull node 1 and try reseating the memory? Any other suggestions?

When I check the node LEDs, it appears that node 2 is the one that's inop.

[Image: 20190207_235606_zps4m9x3dps.jpg]

 Onyx2 Rack Origin 2000 Deskside  Octane2
vladio
O2

Posts: 15
Threads: 5
Joined: Jan 2019
02-08-2019, 05:11 AM
#2
RE: Missing 2 cpu's
Reseating is always an option
MrWeedster
O2

Posts: 16
Threads: 2
Joined: Jan 2019
02-08-2019, 05:13 AM
#3
RE: Missing 2 cpu's
(02-08-2019, 05:13 AM)MrWeedster Wrote:  Reseating is always an option

I'm definitely going to try tomorrow, but I'm unsure which node to pull. The log says node 1, but the LEDs on the node boards make it look like node 2 is inop.

I pulled node 2 yesterday and reseated all the memory; going by the LEDs that are lit, I assumed it was the problem child.

 Onyx2 Rack Origin 2000 Deskside  Octane2
vladio
O2

Posts: 15
Threads: 5
Joined: Jan 2019
02-08-2019, 05:21 AM
#4
RE: Missing 2 cpu's
You can often revive disabled DIMMs by cleaning their contact edges with a cotton swab and some alcohol, but you're not going to 'revive' an XIO compression connector by 'reseating' it, unless maybe by tightening the hex screws on a really loose connection. Also, I've never seen the interconnects between the CPUs and nodeboards go bad. I would leave the CPUs well alone and clean the memory (all of it, if you have the time, because otherwise this will repeat itself with another bank). Afterwards you have to manually re-enable the CPUs and memory from the PROM prompt ('enableall' or 'enable all') and reset. With a little luck, the memory and CPUs will stay enabled.

Once it all seems to work again, you can do a run of 'extra heavy' boot-time diagnostics to make sure. The level of diagnostics is selected on a DIP switch bank on the MSC; see https://wiki.preterhuman.net/MSC#DIP_Switch_Settings for the settings. If you go for 'Manufacturing diagnostics', you need to hook up the console to the DB9 near the power breaker on the back. 'Heavy diagnostics' is more or less the same but talks over the regular console port.
jan-jaap
SGI Collector

Posts: 241
Threads: 10
Joined: Jun 2018
02-08-2019, 08:22 AM
#5
RE: Missing 2 cpu's
Thanks for the suggestion. Are the nodes numbered 0-3 or 1-4? In the OS I've seen them as 0-3, but on the back they're physically labeled 1-4. I guess I'm asking which module to pull: by the LEDs it's the second from the right, but that's labeled node 2. If I pull that node, will the system boot without it? I was thinking I could pull it, boot, and see if I get the same error.

 Onyx2 Rack Origin 2000 Deskside  Octane2
vladio
O2

Posts: 15
Threads: 5
Joined: Jan 2019
02-08-2019, 01:30 PM