Uncorrectable pci express error embedded device bus 0 device 0 function 0 error status 0x00100000

Sean Paul’s Blog

Under promising and overdelivering.

Uncorrectable PCI Express Error

We’ve recently re-purposed a Proliant ML350p Gen8 server as a basic file server; after installing the operating system (Windows Server 2016) and the Support Pack for Proliant (SPP) the performance degraded massively. This wasn’t the worst of it…

Once I connected to the server using the ILO Remote Console it caused a Blue Screen of Death instantly with NMI_HARDWARE_FAILURE as the reason.

Naturally this shouldn’t happen – even after getting back in to Windows this happened again after launching the ILO RC.

After checking the IML (Integrated Management Logs) within ILO the following was logged:

As this server had been working perfectly fine for the past number of years it seemed a little strange that it was recommending we replace the processor!

As this seemed to be a little more firmware related I updated the BIOS to the latest version (01/2018) and re-flashed the ILO firmware again just in case. Same happened again but revealed a different error within the IML

Uncorrectable PCI Express Error (Embedded device, Bus 0, Device 28, Function 7, Error status 0x00100000)

After a quick google search I came across an article on the HPE Community Forum which indicated that it could be the Matrox Graphics driver causing the issue. This would make sense as the performance was terrible / choppy at best.

Sure enough after removing this driver and rebooting the system performance was back to normal and the ILO RC wasn’t crashing the server anymore.

Источник

Frank’s Tech Support

Virtualization. Networking. Sandwiches.

Diagnosing a PSOD

Last week, one of my hosts purple-screened. This seems like a bad thing, but it’s really not, and it does happen sometimes. It’s good practice to determine the root cause in case it’s something likely to happen again.

Believe it or not, purple screens are really a good thing. Your system is trying to save you from much worse. What is generally happening when you get a “Purple Screen of Death” is that some piece of hardware or software is misbehaving to the point that you are going to start experiencing data corruption and so the entire system halts to protect it from itself.

In my case, my PSOD came from a non-maskable interrupt. This is a special interrupt that the system is not allowed to ignore. It’s a signal that something critical just happened and Bad Things will ensue unless immediate action is taken. In the short term, it means your system just crashed. In the long term, it just saved you from potentially major data corruption. Bad, and yet good. On my HP server, NMI errors are generated by the hpnmi driver (which you should have installed as part of your VMware HP driver package to all of your HP ESXi hosts). This driver will keep an eye on your HP hardware and generate an NMI in the event of a catastrophic failure.

To begin tracking down the cause of my crash, I first took a complete set of log files via the VMware vCenter console. My generated bundle was about 800MB. To do this, in your vCenter Infrastructure Client, you’d click on File, Export System Logs. Browse through your tree and select your host. Make sure you include vCenter Server and Client as well for a complete picture.

Читайте также:  Application error file corrupted

Now that I have a full set of logs, it’s time to start digging into them. If you’re not an experienced log reader, this process can be extremely daunting. You may find tools like the VMware Log Insight tool to be helpful. Unfortunately, this is not a free tool, but it does work very well.

Since I never got to see the actual PSOD screen before my fellow support engineer restarted the host, I wasn’t really sure what I was looking for. However, I had the e-mail alert from when the host went down so I had a target time and date. My PSOD was somewhere around 23:49 UTC. My first stop was /var/log/vmkernel.log, which should tell me everything the vmkernel was up to.

2014-05-30T23:37:30.314Z cpu8:9295)FS3Misc: 1465: Long VMFS rsv time on ‘SANB-ISO1’ (held for 264 msecs). # R: 1, # W: 1 bytesXfer: 9 sectors
VMB: 49: mbMagic: 2badb002, mbInfo 0x101158
VMB: 54: flags a6d

Ok, nothing there. Normal log entries right before the reboot lines. Let’s try vpxa.log. That’s the host agents.

2014-05-30T23:47:16.908Z [3F4D3B90 verbose ‘vpxavpxaInvtHost’ opID=WFU-fe82c6fb] [HostChanged] Found update for tracked MoRef vim.HostSystem:ha-host
2014-05-30T23:47:16.909Z [3F4D3B90 verbose ‘VpxaHalCnxHostagent’ opID=WFU-fe82c6fb] [WaitForUpdatesDone] Starting next WaitForUpdates() call to hostd
2014-05-30T23:47:16.909Z [3F4D3B90 verbose ‘VpxaHalCnxHostagent’ opID=WFU-fe82c6fb] [WaitForUpdatesDone] Completed callback
Section for VMware ESX, pid=9549, version=5.1.0, build=1743533, option=Release
2014-05-31T00:10:15.434Z [FFC066D0 info ‘Default’] Logging uses fast path: false

Still nothing! Normal operation right up until the reboot when it started logging again. Alright, let’s get nasty. We’ll go straight to the dump file at /var/core/vmkernel-zdump. In this case, the log export reconstructed the dump file for me automatically but it still left all of the FRAG parts. Lots and lots of binary data, but what’s this I see right around 23:47, two minutes before my e-mail:

2014-05-30T23:47:38.642Z cpu0:9115)WARNING: NMI: 952: LINT1 motherboard interrupt

That doesn’t look good. Scroll a bit to get passed the stack dumps and I find the actual PSOD:

2014-05-30T23:47:38.819Z cpu0:9115)@BlueScreen: Panic requested by one or more 3rd party NMI handlers

Alright, so I have a purple screen requested by a third party NMI. The only third party NMI installed is HPNMI, as mentioned at the beginning of this post. HPNMI has a useful feature in that it will write logs to the iLO (integrated lights out), which is isolated from the running hypervisor on the host. Let’s go see what iLO has to say! I fire up my web browser and open my iLO Integrated Management Log:

Uh oh. That’s a hardware failure.

Uncorrectable PCI Express Error (Embedded device, Bus 0, Device 9, Function 0, Error status 0x00200000)

Let’s see if we can figure out exactly what failed. I’ll SSH into my host so that I can enumerate the PCI Express bus and figure out what’s on Bus 0 Device 9, built into the motherboard. For this, I’ll use lspci.

# lspci -v
*output suppressed*

00:00:09.0 PCI bridge Bridge: Intel Corporation 7500/5520/5500/X58 I/O Hub PCI Express Root Port 9 [PCIe RP[00:00:09.0]]
Class 0604: 8086:3410

Bus 0 Device 9 is the Intel PCI Express Root Port controller. Dang. This means something on the motherboard itself went south. I was hoping for a bad NIC or something.

According to HP, the only way to be sure about this problem is to replace the entire motherboard. Unfortunately, this host is out of warranty and the HP Care Pack has run out. My only hope now is that this is a one-time error and not a symptom of a serious issue. I have since moved my more critical VMs off of this host and demoted it to “Spare host with non-critical VMs only”. Still usable, but not exactly trustworthy.

Читайте также:  Sql server error string data right truncation

I hope this write-up encourages you to dig into your own PSOD’s instead of just rebooting and hoping for the best. The log files can be extremely intimidating but even if you only understand 5% of the information they display, it may be enough to give you insight into your issue. Thanks for reading!

Источник

HP PROLIANT DL180 GEN9 AND PCI Express error

New Member

Hi,
My HP HP Smart Array P440 Controller on PCI Express Slot1 and i am getting below error

Uncorrectable PCI Express Error (Slot 1, Bus 0, Device 3, Function 0, Error status 0x00000020)
Unrecoverable System Error (NMI) has occurred. System Firmware will log additional details in a separate IML entry if possible
PCI Bus Error (Slot 0, Bus 0, Device 3, Function 0)

i want to know that is that about proxmox driver or other than this.

need any advice.

fireon

Famous Member

What says the IML log? You can read this easy with the HPtools.

apt-key adv —recv-keys —keyserver keyserver.ubuntu.com 527BC53A2689B887
apt-key adv —recv-keys —keyserver keyserver.ubuntu.com FADD8D64B1275EA3
apt-get update
apt-get dist-upgrade
apt-get install hp-health hpssacli hponcfg

Or you can read important things in ILO.

Best Regards
Fireon

Deutsch PVE Dokumentation: http://deepdoc.at
DEEPDOC.AT — enjoy your brain

New Member

thanks for your reply.

i wrote what ILO said.

Critical PCI Bus 12/29/2015 18:15 12/29/2015 18:15 1 PCI Bus Error (Slot 0, Bus 0, Device 3, Function 0)

above lines from my HP DL180 GEN9 ILO.

fireon

Famous Member

Sorry, your 3 images are not visible in your post. Is in the ILO generally an HW Error? When yes. You should contact HP Support for warranty. Can you post Screenshot from ILO status and IML?

Does the system is running normaly or is only this message in IML your problem? Or you are not able to install the system?

Best Regards
Fireon

Deutsch PVE Dokumentation: http://deepdoc.at
DEEPDOC.AT — enjoy your brain

New Member

No image inmy post!?

my system works perfectly, but some times 4-5 days period, proxmox and also al my Virtual Apliance down, theni chack ILO i see these error linesa my P440 Smart Array controller and its pci controller

PCI-E Slot 1 HP Smart Array P440 Controller 749797-001 PDNMF0ARH7U2BX B 3.52

error is :
Critical PCI Bus 12/29/2015 18:15 12/29/2015 18:15 1 Uncorrectable PCI Express Error (Slot 1, Bus 0, Device 3, Function 0, Error status 0x00000020)
Critical System Error 12/29/2015 18:15 12/29/2015 18:15 1 Unrecoverable System Error (NMI) has occurred. System Firmware will log additional details in a separate IML entry if possible
Critical PCI Bus 12/29/2015 18:15 12/29/2015 18:15 1 PCI Bus Error (Slot 0, Bus 0, Device 3, Function 0)

Источник

HpCISSs2 crashing DL370 G6 with Server 2012

So before anyone says it I know the DL370 G6 doesn’t officially support Server 2012 but the P410i are the same in the G7’s and that appears to be the issue.

Anyway, every now and then my both my DL370 G6’s BSOD / what ever Microsoft call it now and the a bug check shows:

Читайте также:  Permerror spf permanent error

Probably caused by : HpCISSs2.sys ( HpCISSs2+4c74 )

After doing some googling, I found lots of places saying check the BIOS version, controller firmware etc. which are all up-to-date.

I’ve looked and looked but can’t work out why its doing it, it hadn’t done it for about two weeks then last night one of them died and then again this morning.

No reason, backups wern’t running, no updates installed, no llamas eating cables in the server room, it just doesn’t make sense and its making me crazzzzzyyy and look bad 🙁

Anyone got any thoughts?

akp982 is an IT service provider.

19 Replies

akp982 is an IT service provider.

Full Bug Check

DRIVER_IRQL_NOT_LESS_OR_EQUAL (d1)
An attempt was made to access a pageable (or completely invalid) address at an
interrupt request level (IRQL) that is too high. This is usually
caused by drivers using improper addresses.
If kernel debugger is available get stack backtrace.
Arguments:
Arg1: fffffa88193a2110, memory referenced
Arg2: 000000000000000b, IRQL
Arg3: 0000000000000000, value 0 = read operation, 1 = write operation
Arg4: fffff88000e1ec74, address which referenced memory

Page 7d306f not present in the dump file. Type «.hh dbgerr004» for details
Page 7d3000 not present in the dump file. Type «.hh dbgerr004» for details
Page 7d3000 not present in the dump file. Type «.hh dbgerr004» for details
Page 7d3000 not present in the dump file. Type «.hh dbgerr004» for details
Page 7d3000 not present in the dump file. Type «.hh dbgerr004» for details
Page 7d3000 not present in the dump file. Type «.hh dbgerr004» for details
Page 7d3000 not present in the dump file. Type «.hh dbgerr004» for details
Page 7d3000 not present in the dump file. Type «.hh dbgerr004» for details
Page 7d3000 not present in the dump file. Type «.hh dbgerr004» for details

READ_ADDRESS: fffffa88193a2110 Nonpaged pool

FAULTING_IP:
HpCISSs2+4c74
fffff880`00e1ec74 4c8b3cc8 mov r15,qword ptr [rax+rcx*8]

TRAP_FRAME: fffff88003a564b0 — (.trap 0xfffff88003a564b0)
NOTE: The trap frame does not contain all registers.
Some register values may be zeroed or incorrect.
rax=fffffa80193a2118 rbx=0000000000000000 rcx=00000000ffffffff
rdx=0000000000000001 rsi=0000000000000000 rdi=0000000000000000
rip=fffff88000e1ec74 rsp=fffff88003a56640 rbp=0000000000000000
r8=00000000000003b7 r9=fffffa80193a4118 r10=0000000000000000
r11=00000000000002e3 r12=0000000000000000 r13=0000000000000000
r14=0000000000000000 r15=0000000000000000
iopl=0 nv up ei ng nz ac po nc
HpCISSs2+0x4c74:
fffff880`00e1ec74 4c8b3cc8 mov r15,qword ptr [rax+rcx*8] ds:fffffa88`193a2110=.
Resetting default scope

EXCEPTION_RECORD: fffff8027ce230e0 — (.exr 0xfffff8027ce230e0)
ExceptionAddress: 0b74f68445a472c7
ExceptionCode: 73c53b48
ExceptionFlags: dc034903
NumberParameters: 608996680
Parameter[0]: 40244c8b480002b1
Parameter[1]: 00018c10e8cc3348
Parameter[2]: 5b8b4950245c8d4c
Parameter[3]: e38b49506b8b4948
Parameter[4]: 5c415d415e415f41
Parameter[5]: 90909090c3595e5f
Parameter[6]: 9090909090909090
Parameter[7]: 9090909090909090
Parameter[8]: 6c894808245c8948
Parameter[9]: 5718247489481024
Parameter[10]: 48ea8b4820ec8348
Parameter[11]: 0000d8be8b48f18b
Parameter[12]: 000000d09e8b4800
Parameter[13]: 4c7856ff504e8b48
Parameter[14]: 4800240c83f0d88b

LAST_CONTROL_TRANSFER: from fffff8027ceca769 to fffff8027cecb440

STACK_TEXT:
fffff880`03a56368 fffff802`7ceca769 : 00000000`0000000a fffffa88`193a2110 00000000`0000000b 00000000`00000000 : nt!KeBugCheckEx
fffff880`03a56370 fffff802`7cec8fe0 : 00000000`00000000 fffffa80`19395010 00000000`00000000 fffff880`03a564b0 : nt!KiBugCheckDispatch+0x69
fffff880`03a564b0 fffff880`00e1ec74 : 00000000`00000246 00000000`00f088db fffff880`03a56740 fffff780`00000320 : nt!KiPageFault+0x260
fffff880`03a56640 fffff880`00e2024c : fffffa80`00000000 00000000`00000000 fffffa80`00000000 fffff802`00000000 : HpCISSs2+0x4c74
fffff880`03a56770 fffff802`7cec3c66 : fffde635`1845368c 00000000`00000100 fffde632`4d41c43c fffff880`03a56860 : HpCISSs2+0x624c
fffff880`03a567e0 fffff802`7ce30be2 : fffff802`7ce230e0 00000000`00000000 00000000`00000000 00000000`00000000 : nt!KiInterruptDispatch+0x1d6
fffff880`03a56978 fffff802`7ce230e0 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : hal!HalpTscQueryCounter+0x2
fffff880`03a56980 fffff880`01f94ae8 : 00000000`00000001 00000000`00000001 00000000`00000000 00000000`00000000 : hal!HalpTimerStallExecutionProcessor+0x131
fffff880`03a56a10 fffff880`01f94d52 : fffffa80`19934af0 00000000`00000001 fffffa80`18cb5000 00000000`00000000 : hpqilo2+0x6ae8
fffff880`03a56a40 fffff880`01f95055 : fffff880`03a56af8 fffffa80`19934b73 00000000`00000000 00000000`00000000 : hpqilo2+0x6d52
fffff880`03a56a70 fffff880`01f965ba : 00000000`00000001 00000000`00000000 fffffa80`18cb5000 fffffa80`19836280 : hpqilo2+0x7055
fffff880`03a56ad0 fffff880`01f90c13 : fffffa80`19934b48 00000000`00000000 00000000`00000080 00000000`484c5200 : hpqilo2+0x85ba
fffff880`03a56c70 fffff802`7ce9e045 : f919479b`0410f00d 304a19e2`d4101e53 3b2bb69b`82818ad5 e87a9567`edc931a6 : hpqilo2+0x2c13
fffff880`03a56d50 fffff802`7cf52766 : fffff880`017e5180 fffffa80`19836280 fffff880`017f1540 fffffa80`189c4980 : nt!PspSystemThreadStartup+0x59
fffff880`03a56da0 00000000`00000000 : fffff880`03a57000 fffff880`03a51000 00000000`00000000 00000000`00000000 : nt!KiStartSystemThread+0x16

FOLLOWUP_IP:
HpCISSs2+4c74
fffff880`00e1ec74 4c8b3cc8 mov r15,qword ptr [rax+rcx*8]

Источник

Smartadm.ru
Adblock
detector