28.09.2017
Munich. While routinely testing our embedded RTOS, Symobi, on several platforms, including two tests with AMD's Ryzen CPU, we came across some strange behavior.
Because this behavior occurred only on the Ryzen platform, we initially suspected an incompatibility in Symobi that needed to be resolved, so we went straight to work to get to the bottom of it.
We discovered that the problem did not originate with Symobi, as we had initially suspected. Instead, we found that Symobi was running into two independent effects on the Ryzen platform:
A. Symobi would sporadically crash under indeterminable conditions. We were able to reproduce this issue, but not deterministically. It's important to note that neither the system nor the hardware was subjected to any extreme conditions.
B. Symobi's device drivers appeared to freeze. It turns out that in some cases they were not receiving an interrupt signal after sending device commands, and as a result they remained in a waiting state.
Because we develop our own operating system, we have deeper insight into the system than most other software developers, along with the means to analyze what is really going on in the hardware. What we have unearthed leads us to believe that AMD, in addition to the bugs already reported in the past months, still has more to fix.
Here's what our analysis revealed:
A. Both Ryzen CPUs showed sporadic problems with hardware task switching while SMT was turned on. As a matter of fact, they crashed right in the midst of executing the task switch. When SMT was turned off, everything went smoothly. AMD has already had some issues with their newly introduced SMT. While the issue occurred randomly, we were still able to reproduce it. The system was not subjected to extreme operating conditions during any of our tests.
B. The IRQ issue, as it turns out, originates in the chipset rather than the CPU. While running AMD's A320 and B350 chipsets in PIC mode (versus APIC mode), the interrupt trigger mode could not be set to “level”, which is needed to share IRQs among several devices. In addition, while running in “edge” mode, we discovered that the signal levels seemed to be inverted. This led to lost IRQs, and consequently the driver did not receive any reply to its device commands.
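To illustrate the trigger-mode setting mentioned in B: on PC-compatible chipsets in legacy PIC mode, the edge/level selection per IRQ line is conventionally made through the Edge/Level Control Register (ELCR) at I/O ports 0x4D0 (IRQ 0-7) and 0x4D1 (IRQ 8-15), where a set bit selects level triggering. The sketch below is not Symobi code. It assumes that conventional register layout, and all helper names are hypothetical. It only computes register values and shows the kind of read-back comparison that could reveal a mode bit that does not take effect; the actual port I/O is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Conventional ELCR I/O ports on PC-compatible chipsets (assumption:
 * the chipset under test follows this legacy layout). */
#define ELCR_PORT_MASTER 0x4D0u /* IRQ 0-7  */
#define ELCR_PORT_SLAVE  0x4D1u /* IRQ 8-15 */

typedef enum { TRIGGER_EDGE = 0, TRIGGER_LEVEL = 1 } trigger_mode;

/* I/O port of the ELCR byte that holds the bit for this IRQ line. */
static uint16_t elcr_port(unsigned irq)
{
    assert(irq < 16);
    return irq < 8 ? ELCR_PORT_MASTER : ELCR_PORT_SLAVE;
}

/* Bit mask of this IRQ line within its ELCR byte. */
static uint8_t elcr_bit(unsigned irq)
{
    assert(irq < 16);
    return (uint8_t)(1u << (irq & 7u));
}

/* Compute the new ELCR byte: set the bit for level, clear it for edge. */
static uint8_t elcr_apply(uint8_t current, unsigned irq, trigger_mode mode)
{
    uint8_t bit = elcr_bit(irq);
    return mode == TRIGGER_LEVEL ? (uint8_t)(current | bit)
                                 : (uint8_t)(current & (uint8_t)~bit);
}

/* Hypothetical check: returns 1 if the value read back from the register
 * disagrees with the requested value for this IRQ line. */
static int elcr_mode_mismatch(uint8_t requested, uint8_t read_back, unsigned irq)
{
    uint8_t bit = elcr_bit(irq);
    return (requested & bit) != (read_back & bit);
}
```

In a real driver, the value returned by `elcr_apply` would be written to the port returned by `elcr_port` and then read back; a mismatch for a shared line such as IRQ 11 would be one way a "level" setting that does not stick could show up.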
We have, of course, contacted AMD to inform them of our findings and are currently awaiting their response. We will keep you updated. Stay tuned!