In the Linux kernel, the following vulnerability has been resolved:
thermal: core: Address thermal zone removal races with resume
Since thermal_zone_pm_complete() and thermal_zone_device_resume()
re-initialize the poll_queue delayed work for the given thermal zone,
the cancel_delayed_work_sync() in thermal_zone_device_unregister()
may miss some already running work items and the thermal zone may
be freed prematurely [1].
There are two failing scenarios that both start with
running thermal_pm_notify_complete() right before invoking
thermal_zone_device_unregister() for one of the thermal zones.
In the first scenario, there is a work item already running for
the given thermal zone when thermal_pm_notify_complete() calls
thermal_zone_pm_complete() for that thermal zone and it continues to
run when thermal_zone_device_unregister() starts. Since the poll_queue
delayed work has been re-initialized by thermal_pm_notify_complete(), the
running work item will be missed by the cancel_delayed_work_sync() in
thermal_zone_device_unregister() and if it continues to run past the
freeing of the thermal zone object, a use-after-free will occur.
In the second scenario, thermal_zone_device_resume() queued up by
thermal_pm_notify_complete() runs right after the thermal_zone_exit()
called by thermal_zone_device_unregister() has returned. The poll_queue
delayed work is re-initialized by it before cancel_delayed_work_sync() is
called by thermal_zone_device_unregister(), so it may continue to run
after the freeing of the thermal zone object, which also leads to a
use-after-free.
Address the first failing scenario by ensuring that no thermal work
items will be running when thermal_pm_notify_complete() is called.
For this purpose, first move the cancel_delayed_work() call from
thermal_zone_pm_complete() to thermal_zone_pm_prepare() to prevent
new work from entering the workqueue going forward. Next, switch
over to using a dedicated workqueue for thermal events and update
the code in thermal_pm_notify() to flush that workqueue after
thermal_pm_notify_prepare() has returned, which will take care of
all leftover thermal work already on the workqueue (that leftover
work would do nothing useful anyway because all of the thermal zones
have been flagged as suspended).
The second failing scenario is addressed by adding a tz->state check
to thermal_zone_device_resume() to prevent it from re-initializing
the poll_queue delayed work if the thermal zone is going away.
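The effect of the tz->state check can be illustrated with a toy model.
The Python sketch below is not kernel code: ToyZone, its fields and
leftover_after_unregister() are hypothetical names, and the model assumes
that re-initializing the delayed work simply makes an already running item
invisible to a later synchronous cancel, which is a simplification of the
behavior described above.

```python
from dataclasses import dataclass, field


@dataclass
class ToyZone:
    """Toy stand-in for a thermal zone; all names here are hypothetical."""
    going_away: bool = False                     # models the tz->state check
    tracked: set = field(default_factory=set)    # items a sync cancel can see
    running: set = field(default_factory=set)    # items actually executing

    def start_work(self, name):
        # A work item begins executing and is tracked by the delayed work.
        self.tracked.add(name)
        self.running.add(name)

    def resume(self, check_state):
        # With the check, a zone that is going away is left untouched.
        if check_state and self.going_away:
            return
        # Modeled re-initialization: tracking is reset, but items that
        # are already executing keep running.
        self.tracked = set()

    def cancel_sync(self):
        # Modeled synchronous cancel: stops only the items it can see.
        self.running -= self.tracked
        self.tracked = set()


def leftover_after_unregister(check_state):
    """Return the work items still running once the zone would be freed."""
    tz = ToyZone()
    tz.start_work("poll_work")   # a work item is already executing
    tz.going_away = True         # unregistration has flagged the zone
    tz.resume(check_state)       # late resume path runs
    tz.cancel_sync()             # unregistration cancels the delayed work
    return tz.running            # anything left would touch freed memory
```

In this model, leftover_after_unregister(False) leaves "poll_work" running
past the point where the zone would be freed, while
leftover_after_unregister(True) leaves nothing behind.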
Note that the above changes will also facilitate relocating the suspend
and resume of thermal zones closer to the suspend and resume of devices,
respectively.
CVE-2026-31731 is a use-after-free vulnerability in the Linux kernel's thermal core, triggered when a thermal zone is unregistered while the system is resuming. Because the resume path re-initializes the zone's poll_queue delayed work, cancel_delayed_work_sync() can miss a work item that is still running, and that item may then access the thermal zone object after it has been freed. This can lead to kernel crashes and potentially to privilege escalation. The CVSS score is 7.8 and no public exploits are currently known. Systems running affected kernel versions remain at significant risk, particularly in the data center and critical infrastructure environments common in Saudi Arabia.
IMMEDIATE ACTIONS:
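1. Identify all systems running Linux kernel versions 7.0-rc1 through 7.0-rc6 using the 'uname -r' command
2. Assess exposure in your infrastructure (particularly virtualized and data center environments): both failing scenarios require a system resume that is immediately followed by the unregistration of a thermal zone, so prioritize hosts that suspend or hibernate and hosts where thermal zones can be removed at runtime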
3. Implement enhanced monitoring for kernel panic logs and thermal subsystem errors
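The version check in step 1 can be scripted across a fleet. The following
Python sketch is one possible way to do it; is_in_affected_range() is a
hypothetical helper, and the accepted release formats ("7.0-rcN" and
"7.0.0-rcN", optionally followed by a local suffix) are an assumption about
how 'uname -r' reports these kernels.

```python
import re

# Assumed formats: "7.0-rc3", "7.0.0-rc3", "7.0.0-rc3+" and similar.
_RC_PATTERN = re.compile(r"^7\.0(?:\.0)?-rc(\d+)")


def is_in_affected_range(release: str) -> bool:
    """Return True if a 'uname -r' string lies in 7.0-rc1 through 7.0-rc6."""
    match = _RC_PATTERN.match(release.strip())
    if match is None:
        return False
    # Only release candidates 1 through 6 are listed as affected.
    return 1 <= int(match.group(1)) <= 6
```

On the host itself, platform.release() from the Python standard library
returns the same string as 'uname -r' and can be passed to the helper
directly. A match only indicates that the version string falls in the listed
range; it does not prove that the fix is absent.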
PATCHING GUIDANCE:
1. Apply the latest stable kernel patch that includes the thermal zone removal race condition fix
2. For RHEL/CentOS systems: Update to patched kernel-* packages via 'yum update kernel'
3. For Ubuntu/Debian systems: Update to patched linux-image packages via 'apt update && apt upgrade'
4. For SUSE systems: Apply kernel security updates via 'zypper patch', or update the installed kernel package directly (for example 'zypper update kernel-default')
5. Test patches in non-production environments first, particularly for mission-critical systems
6. Schedule coordinated patching during maintenance windows to minimize service disruption
COMPENSATING CONTROLS (if immediate patching unavailable):
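1. Disable thermal zone polling if not critical to operations. The command 'echo 0 > /sys/class/thermal/thermal_zone*/polling_interval' is unlikely to work as written: the documented thermal sysfs interface does not list a polling_interval attribute, and a shell redirect cannot target a glob that matches several files. Confirm which attributes the running kernel exposes and write to each zone individually
2. Where feasible, avoid system suspend and hibernation on affected hosts, since both failing scenarios start on the resume path; scheduled restarts alone do not close the race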
3. Monitor /var/log/kern.log and dmesg for use-after-free warnings and thermal subsystem errors
4. Increase kernel watchdog timeout to reduce false-positive panic triggers
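Before relying on any sysfs-based control, it helps to confirm that the
attribute actually exists on each zone. The Python sketch below does that;
zones_with_attribute() is a hypothetical helper, and the default root path
assumes the usual /sys/class/thermal layout.

```python
from pathlib import Path


def zones_with_attribute(attribute, root="/sys/class/thermal"):
    """Map each thermal zone name to whether it exposes `attribute`."""
    result = {}
    # Each zone appears as a thermal_zone<N> directory under the root.
    for zone in sorted(Path(root).glob("thermal_zone*")):
        result[zone.name] = (zone / attribute).exists()
    return result
```

Calling zones_with_attribute("polling_interval") on a target host shows
whether the attribute named in control 1 is present at all. An empty result
means no thermal zones were found under the given root.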
DETECTION RULES:
1. Monitor for kernel messages: 'BUG: unable to handle page fault for address' combined with thermal subsystem references
2. Alert on repeated kernel panics with stack traces containing 'thermal_zone_device_resume' or 'thermal_zone_pm_complete'
3. Track system uptime anomalies indicating unexpected reboots
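4. Monitor dmesg for 'use-after-free' reports that reference the thermal subsystem; such reports are normally emitted only by kernels built with KASAN, so their absence on other kernels does not rule out the bug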
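Rules 1, 2 and 4 can be combined into a simple text check. The Python sketch
below is a minimal illustration, not a complete detection rule:
should_alert() is a hypothetical helper, it uses only the strings quoted in
the rules above, and it treats the whole captured log as one window rather
than correlating individual stack traces.

```python
# Strings taken from the detection rules above.
FAULT_MARKERS = (
    "BUG: unable to handle page fault for address",
    "use-after-free",
)
THERMAL_SYMBOLS = (
    "thermal_zone_device_resume",
    "thermal_zone_pm_complete",
)


def should_alert(log_text: str) -> bool:
    """True when a fault marker and a thermal symbol both appear."""
    has_fault = any(marker in log_text for marker in FAULT_MARKERS)
    has_thermal = any(symbol in log_text for symbol in THERMAL_SYMBOLS)
    # Require both, so unrelated faults and benign symbol mentions
    # do not trigger an alert on their own.
    return has_fault and has_thermal
```

The input can be the captured output of dmesg or the contents of
/var/log/kern.log. Because the check spans the whole text, a long log with
an unrelated fault and an unrelated thermal message would also match, so
narrowing the window per boot or per trace is a sensible refinement.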