I am tracking a regression in the Linux kernel that affects systems with AMD GPUs (RDNA3+) and results in a hard freeze that can only be recovered from with a forced shutdown. The regression was introduced in 6.10 by commit e356d321d024, which changed how the MES (microengine scheduler in GPU firmware) handles fence/synchronization for WAIT_REG_MEM. As a result, the MES stops responding to messages from the driver.
If you are experiencing such issues on a similar system (Krackan Point hardware, RDNA3+, gfx11), please reply in this thread and share your experience and system specs.
The system on which I am experiencing this:
Lenovo IdeaPad Slim 5 16AKP10 (model 83HY)
CPU: AMD Ryzen AI 7 350 (Krackan Point, 4nm)
GPU: AMD Radeon 860M (gfx11 / Krackan Point)
BIOS: R0CN22WW (latest available)
OS: Linux Mint 22.3
Kernel: 6.17.x
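If you want to reply with comparable specs, here is a minimal Python sketch that collects the model, BIOS version and running kernel. It assumes a Linux system that exposes DMI strings under /sys/class/dmi/id; the function names read_dmi and system_report are just illustrative, and CPU/GPU details are not covered, so add those by hand.

```python
import platform
from pathlib import Path

# Standard sysfs location for DMI/SMBIOS strings on Linux.
DMI_DIR = Path("/sys/class/dmi/id")


def read_dmi(field, dmi_dir=DMI_DIR):
    """Return one DMI field as a stripped string, or 'unknown' if unreadable."""
    try:
        return (Path(dmi_dir) / field).read_text().strip()
    except OSError:
        return "unknown"


def system_report(dmi_dir=DMI_DIR):
    """Build a short plain-text summary suitable for pasting into a reply."""
    name = read_dmi("product_name", dmi_dir)
    version = read_dmi("product_version", dmi_dir)
    lines = [
        f"Model:  {name} ({version})",
        f"BIOS:   {read_dmi('bios_version', dmi_dir)}",
        f"Kernel: {platform.release()}",
    ]
    return "\n".join(lines)
```

Running print(system_report()) prints three lines you can paste into a reply. Which of the two model fields holds the marketing name and which holds the model code varies by vendor, so both are shown.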
In my case, I see this fairly often because of another issue with this system that causes a cascading pile-up of UCSI events. But this hard freeze can also manifest through regular, and particularly extended, usage.
UPDATE: I found that a patch for this issue was submitted to 6.19.10. I have installed the 6.19.14 kernel and am now monitoring.
NOTE: This took a little finagling. I’m running Linux Mint 22.3, which is based on Ubuntu 24.04.3 LTS, so it seemed reasonable to obtain the kernel from the Ubuntu PPA. But it turns out that Linux Mint does not use Ubuntu’s run-parts; it uses Debian’s. (While this was unexpected, it isn’t surprising given Linux Mint’s leaning towards Debian’s stability with its LMDE offering.)
Using the Ubuntu run-parts failed the install because it uses an unrecognized arg= parameter. I temporarily replaced run-parts with a version that just forced success. The kernel install then completed, and I restored the original run-parts. Because run-parts had been bypassed, I had to build the initramfs manually and fix the symlinks for grub.
After sudo update-grub and a reboot, the device is now running the 6.19.14 kernel. I am now monitoring to see whether this issue is resolved and, hopefully, that no new issues appear with this out-of-band kernel.
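For anyone who wants a quick check of whether a machine sits in the version range discussed in this thread, here is a minimal Python sketch. It assumes, based only on the posts above, that the regression starts at 6.10 and that the fix first appears in 6.19.10; the constants FIRST_AFFECTED and FIRST_FIXED and both function names are my own illustration. Distribution kernels often backport fixes, and their version strings do not always carry the upstream patch level, so treat the result as a hint, not a verdict.

```python
import platform
import re

# Assumptions taken from this thread, not verified against the kernel tree.
FIRST_AFFECTED = (6, 10, 0)  # regression reported as introduced in 6.10
FIRST_FIXED = (6, 19, 10)    # fix reported as present from 6.19.10


def parse_release(release):
    """Turn a string such as '6.19.14-061914-generic' into (6, 19, 14)."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    major, minor, patch = match.groups()
    # A missing patch level (e.g. '6.10') is treated as 0.
    return int(major), int(minor), int(patch or 0)


def in_reported_range(release):
    """True if the version lies between the assumed first-bad and first-fixed releases."""
    return FIRST_AFFECTED <= parse_release(release) < FIRST_FIXED
```

To test the running kernel, call in_reported_range(platform.release()).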
Hi!
I have the same CPU/GPU SoC on my ThinkPad X13 Gen 6.
I run the MX-Moksha version of MX-Linux (Debian Trixie derivative that runs the liquorix-6.18 kernel and Moksha, a fork of Enlightenment). I’ve just installed the liquorix-7.0.9-2 kernel, the first liquorix to include the patch you mentioned.
Unfortunately, I still have the freeze.
To check whether our symptoms are identical: when my laptop freezes, only the GUI environment freezes. The mouse pointer can still be moved around freely, and the system itself keeps running (I can go to the CLI with a CTRL-ALT-F1 and interact normally there).
Is it the same for you?
Thanks!
Similar experience though not exactly the same. Disappointing the new kernel didn’t fix it for you.
The key error is this (from journalctl):
amdgpu: MES failed to respond to msg=MISC (WAIT_REG_MEM)
amdgpu: failed to reg_write_reg_wait
workqueue: amdgpu_tlb_fence_work [amdgpu] hogged CPU for >13333us
The behavior I experience is: I’ll notice the mouse stutter. It moves, but not smoothly. This is the first sign that the system is being overwhelmed. Occasionally, if I stop interacting right away and let it sit for ~30 seconds, it will recover. But usually, it still fails. Pressing a key gives no response. After a few seconds, the mouse freezes and I know it’s dead.
This thread has more or less become the lead on this issue. Note that the latest post (as of now) reports success after moving to the 7.0 kernel.
I have not tried this but will if my kernel update doesn’t fix it. I’ll know tomorrow after being in the office for a few hours.
I’ll wait for your test, but I’m starting to think we’re not facing the same issue (though I’ve had a problem where it stuttered like you describe, with 100% CPU workload for no apparent reason).
@Trenien a warm welcome to our forum and thanks for sharing your observations. Have you been able to arrive at a solution for this or have you had to choose a completely different environment?
I’ll be using it all day tomorrow and, given recent history, it should definitely manifest after a few hours of regular office work. I will report back at end of the work day what I find.
In the meantime, you should share what errors you are seeing. They should be pretty obvious at the bottom of the journal created in the session that freezes – visible via journalctl -b -1 after you restart following the forced shutdown. If you aren’t sure about the specific errors, you can share the entire log via journalctl -b -1 --no-pager | nc termbin.com 9999 and share the link.
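If the journal is long, a small filter can save some scrolling. The Python sketch below pulls the previous boot's journal with the same journalctl -b -1 --no-pager command and keeps only the lines that match the three error messages quoted earlier in this thread. The SIGNATURES tuple and the function names are my own; your log may word things differently, so treat an empty result as "nothing matched these strings", not as proof the issue is absent.

```python
import subprocess

# Substrings taken from the error lines quoted earlier in this thread.
SIGNATURES = (
    "MES failed to respond to msg=",
    "failed to reg_write_reg_wait",
    "amdgpu_tlb_fence_work",
)


def find_mes_errors(journal_text):
    """Return the journal lines that contain any of the known signatures."""
    return [
        line
        for line in journal_text.splitlines()
        if any(sig in line for sig in SIGNATURES)
    ]


def previous_boot_journal():
    """Return the full journal of the previous boot as text."""
    result = subprocess.run(
        ["journalctl", "-b", "-1", "--no-pager"],
        capture_output=True,
        text=True,
        check=True,  # raise if journalctl exits with an error
    )
    return result.stdout
```

After rebooting from a freeze, print("\n".join(find_mes_errors(previous_boot_journal()))) shows just the matching lines. Depending on your setup, you may need to run it with sudo or as a member of the systemd-journal group to see kernel messages.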
I have marked my post on the 6.19.14 kernel as the solution. I went all day without a single error related to MES. I’ll report back later if there are any changes in status. But going all day with no sign of the issue is pretty good confirmation.
As a plus, it even eliminated a couple of other small issues I’d been dealing with.
However, be aware this will likely move your system out of normal lanes for updates until this or a later kernel comes along. I’m willing to deal with that to make my system workable. And I’m willing to offer some guidance to others if someone has the same issue.
I’ve had one freeze since updating the kernel, so it’s significantly better than before. Not perfect though.
I ran into a problem getting the journal. I’ll post it the next time it freezes.
Thanks
UPDATE: Kernel 6.19.14 is EOL as of 22 April 26 and the next stable is 7.0.x. My plan is to continue with 6.19.14 for a solid ten full days* of office use. If I reach that without any sign of the issue returning, I will then move to the 7.0.x kernel (next stable release). Given history (and that Linux Mint is revisiting their release strategy), I expect this will happen long before 7.0.x shows up in Linux Mint 23. But doing so will put me in line to receive 23 when it arrives.
*Ten days may seem excessive, but given how this issue appeared and then gradually increased in frequency, I just want to ensure I’ve allowed adequate time to feel confident it is fixed.
UPDATE: Ubuntu now has 7.0 in the HWE channel and it became available for Linux Mint. I have moved to it and will test it for a solid seven days of office work to confirm. But, based upon three solid days working with 6.19.14 and having zero errors, my confidence is at 90%.