CVE |
Vendors |
Products |
Updated |
CVSS v3.1 |
In the Linux kernel, the following vulnerability has been resolved:
drm/amdgpu: change vm->task_info handling
This patch changes the handling and lifecycle of vm->task_info object.
The major changes are:
- vm->task_info is a dynamically allocated ptr now, and its uasge is
reference counted.
- introducing two new helper funcs for task_info lifecycle management
- amdgpu_vm_get_task_info: reference counts up task_info before
returning this info
- amdgpu_vm_put_task_info: reference counts down task_info
- last put to task_info() frees task_info from the vm.
This patch also does logistical changes required for existing usage
of vm->task_info.
V2: Do not block all the prints when task_info not found (Felix)
V3: Fixed review comments from Felix
- Fix wrong indentation
- No debug message for -ENOMEM
- Add NULL check for task_info
- Do not duplicate the debug messages (ti vs no ti)
- Get first reference of task_info in vm_init(), put last
in vm_fini()
V4: Fixed review comments from Felix
- fix double reference increment in create_task_info
- change amdgpu_vm_get_task_info_pasid
- additional changes in amdgpu_gem.c while porting |
In the Linux kernel, the following vulnerability has been resolved:
drm/amdkfd: range check cp bad op exception interrupts
Due to a CP interrupt bug, bad packet garbage exception codes are raised.
Do a range check so that the debugger and runtime do not receive garbage
codes.
Update the user api to guard exception code type checking as well. |
In the Linux kernel, the following vulnerability has been resolved:
amd/amdkfd: sync all devices to wait all processes being evicted
If there are more than one device doing reset in parallel, the first
device will call kfd_suspend_all_processes() to evict all processes
on all devices, this call takes time to finish. other device will
start reset and recover without waiting. if the process has not been
evicted before doing recover, it will be restored, then caused page
fault. |
In the Linux kernel, the following vulnerability has been resolved:
drm/amdgpu: Skip do PCI error slot reset during RAS recovery
Why:
The PCI error slot reset maybe triggered after inject ue to UMC multi times, this
caused system hang.
[ 557.371857] amdgpu 0000:af:00.0: amdgpu: GPU reset succeeded, trying to resume
[ 557.373718] [drm] PCIE GART of 512M enabled.
[ 557.373722] [drm] PTB located at 0x0000031FED700000
[ 557.373788] [drm] VRAM is lost due to GPU reset!
[ 557.373789] [drm] PSP is resuming...
[ 557.547012] mlx5_core 0000:55:00.0: mlx5_pci_err_detected Device state = 1 pci_status: 0. Exit, result = 3, need reset
[ 557.547067] [drm] PCI error: detected callback, state(1)!!
[ 557.547069] [drm] No support for XGMI hive yet...
[ 557.548125] mlx5_core 0000:55:00.0: mlx5_pci_slot_reset Device state = 1 pci_status: 0. Enter
[ 557.607763] mlx5_core 0000:55:00.0: wait vital counter value 0x16b5b after 1 iterations
[ 557.607777] mlx5_core 0000:55:00.0: mlx5_pci_slot_reset Device state = 1 pci_status: 1. Exit, err = 0, result = 5, recovered
[ 557.610492] [drm] PCI error: slot reset callback!!
...
[ 560.689382] amdgpu 0000:3f:00.0: amdgpu: GPU reset(2) succeeded!
[ 560.689546] amdgpu 0000:5a:00.0: amdgpu: GPU reset(2) succeeded!
[ 560.689562] general protection fault, probably for non-canonical address 0x5f080b54534f611f: 0000 [#1] SMP NOPTI
[ 560.701008] CPU: 16 PID: 2361 Comm: kworker/u448:9 Tainted: G OE 5.15.0-91-generic #101-Ubuntu
[ 560.712057] Hardware name: Microsoft C278A/C278A, BIOS C2789.5.BS.1C11.AG.1 11/08/2023
[ 560.720959] Workqueue: amdgpu-reset-hive amdgpu_ras_do_recovery [amdgpu]
[ 560.728887] RIP: 0010:amdgpu_device_gpu_recover.cold+0xbf1/0xcf5 [amdgpu]
[ 560.736891] Code: ff 41 89 c6 e9 1b ff ff ff 44 0f b6 45 b0 e9 4f ff ff ff be 01 00 00 00 4c 89 e7 e8 76 c9 8b ff 44 0f b6 45 b0 e9 3c fd ff ff <48> 83 ba 18 02 00 00 00 0f 84 6a f8 ff ff 48 8d 7a 78 be 01 00 00
[ 560.757967] RSP: 0018:ffa0000032e53d80 EFLAGS: 00010202
[ 560.763848] RAX: ffa00000001dfd10 RBX: ffa0000000197090 RCX: ffa0000032e53db0
[ 560.771856] RDX: 5f080b54534f5f07 RSI: 0000000000000000 RDI: ff11000128100010
[ 560.779867] RBP: ffa0000032e53df0 R08: 0000000000000000 R09: ffffffffffe77f08
[ 560.787879] R10: 0000000000ffff0a R11: 0000000000000001 R12: 0000000000000000
[ 560.795889] R13: ffa0000032e53e00 R14: 0000000000000000 R15: 0000000000000000
[ 560.803889] FS: 0000000000000000(0000) GS:ff11007e7e800000(0000) knlGS:0000000000000000
[ 560.812973] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 560.819422] CR2: 000055a04c118e68 CR3: 0000000007410005 CR4: 0000000000771ee0
[ 560.827433] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 560.835433] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[ 560.843444] PKRU: 55555554
[ 560.846480] Call Trace:
[ 560.849225] <TASK>
[ 560.851580] ? show_trace_log_lvl+0x1d6/0x2ea
[ 560.856488] ? show_trace_log_lvl+0x1d6/0x2ea
[ 560.861379] ? amdgpu_ras_do_recovery+0x1b2/0x210 [amdgpu]
[ 560.867778] ? show_regs.part.0+0x23/0x29
[ 560.872293] ? __die_body.cold+0x8/0xd
[ 560.876502] ? die_addr+0x3e/0x60
[ 560.880238] ? exc_general_protection+0x1c5/0x410
[ 560.885532] ? asm_exc_general_protection+0x27/0x30
[ 560.891025] ? amdgpu_device_gpu_recover.cold+0xbf1/0xcf5 [amdgpu]
[ 560.898323] amdgpu_ras_do_recovery+0x1b2/0x210 [amdgpu]
[ 560.904520] process_one_work+0x228/0x3d0
How:
In RAS recovery, mode-1 reset is issued from RAS fatal error handling and expected
all the nodes in a hive to be reset. no need to issue another mode-1 during this procedure. |
In the Linux kernel, the following vulnerability has been resolved:
drm/amdgpu: Reset IH OVERFLOW_CLEAR bit
Allows us to detect subsequent IH ring buffer overflows as well. |
In the Linux kernel, the following vulnerability has been resolved:
amdkfd: use calloc instead of kzalloc to avoid integer overflow
This uses calloc instead of doing the multiplication which might
overflow. |
In the Linux kernel, the following vulnerability has been resolved:
drm/amdgpu: Fix variable 'mca_funcs' dereferenced before NULL check in 'amdgpu_mca_smu_get_mca_entry()'
Fixes the below:
drivers/gpu/drm/amd/amdgpu/amdgpu_mca.c:377 amdgpu_mca_smu_get_mca_entry() warn: variable dereferenced before check 'mca_funcs' (see line 368)
357 int amdgpu_mca_smu_get_mca_entry(struct amdgpu_device *adev,
enum amdgpu_mca_error_type type,
358 int idx, struct mca_bank_entry *entry)
359 {
360 const struct amdgpu_mca_smu_funcs *mca_funcs =
adev->mca.mca_funcs;
361 int count;
362
363 switch (type) {
364 case AMDGPU_MCA_ERROR_TYPE_UE:
365 count = mca_funcs->max_ue_count;
mca_funcs is dereferenced here.
366 break;
367 case AMDGPU_MCA_ERROR_TYPE_CE:
368 count = mca_funcs->max_ce_count;
mca_funcs is dereferenced here.
369 break;
370 default:
371 return -EINVAL;
372 }
373
374 if (idx >= count)
375 return -EINVAL;
376
377 if (mca_funcs && mca_funcs->mca_get_mca_entry)
^^^^^^^^^
Checked too late! |
In the Linux kernel, the following vulnerability has been resolved:
drm/amdgpu: fix use-after-free bug
The bug can be triggered by sending a single amdgpu_gem_userptr_ioctl
to the AMDGPU DRM driver on any ASICs with an invalid address and size.
The bug was reported by Joonkyo Jung <joonkyoj@yonsei.ac.kr>.
For example the following code:
static void Syzkaller1(int fd)
{
struct drm_amdgpu_gem_userptr arg;
int ret;
arg.addr = 0xffffffffffff0000;
arg.size = 0x80000000; /*2 Gb*/
arg.flags = 0x7;
ret = drmIoctl(fd, 0xc1186451/*amdgpu_gem_userptr_ioctl*/, &arg);
}
Due to the address and size are not valid there is a failure in
amdgpu_hmm_register->mmu_interval_notifier_insert->__mmu_interval_notifier_insert->
check_shl_overflow, but we even the amdgpu_hmm_register failure we still call
amdgpu_hmm_unregister into amdgpu_gem_object_free which causes access to a bad address.
The following stack is below when the issue is reproduced when Kazan is enabled:
[ +0.000014] Hardware name: ASUS System Product Name/ROG STRIX B550-F GAMING (WI-FI), BIOS 1401 12/03/2020
[ +0.000009] RIP: 0010:mmu_interval_notifier_remove+0x327/0x340
[ +0.000017] Code: ff ff 49 89 44 24 08 48 b8 00 01 00 00 00 00 ad de 4c 89 f7 49 89 47 40 48 83 c0 22 49 89 47 48 e8 ce d1 2d 01 e9 32 ff ff ff <0f> 0b e9 16 ff ff ff 4c 89 ef e8 fa 14 b3 ff e9 36 ff ff ff e8 80
[ +0.000014] RSP: 0018:ffffc90002657988 EFLAGS: 00010246
[ +0.000013] RAX: 0000000000000000 RBX: 1ffff920004caf35 RCX: ffffffff8160565b
[ +0.000011] RDX: dffffc0000000000 RSI: 0000000000000004 RDI: ffff8881a9f78260
[ +0.000010] RBP: ffffc90002657a70 R08: 0000000000000001 R09: fffff520004caf25
[ +0.000010] R10: 0000000000000003 R11: ffffffff8161d1d6 R12: ffff88810e988c00
[ +0.000010] R13: ffff888126fb5a00 R14: ffff88810e988c0c R15: ffff8881a9f78260
[ +0.000011] FS: 00007ff9ec848540(0000) GS:ffff8883cc880000(0000) knlGS:0000000000000000
[ +0.000012] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ +0.000010] CR2: 000055b3f7e14328 CR3: 00000001b5770000 CR4: 0000000000350ef0
[ +0.000010] Call Trace:
[ +0.000006] <TASK>
[ +0.000007] ? show_regs+0x6a/0x80
[ +0.000018] ? __warn+0xa5/0x1b0
[ +0.000019] ? mmu_interval_notifier_remove+0x327/0x340
[ +0.000018] ? report_bug+0x24a/0x290
[ +0.000022] ? handle_bug+0x46/0x90
[ +0.000015] ? exc_invalid_op+0x19/0x50
[ +0.000016] ? asm_exc_invalid_op+0x1b/0x20
[ +0.000017] ? kasan_save_stack+0x26/0x50
[ +0.000017] ? mmu_interval_notifier_remove+0x23b/0x340
[ +0.000019] ? mmu_interval_notifier_remove+0x327/0x340
[ +0.000019] ? mmu_interval_notifier_remove+0x23b/0x340
[ +0.000020] ? __pfx_mmu_interval_notifier_remove+0x10/0x10
[ +0.000017] ? kasan_save_alloc_info+0x1e/0x30
[ +0.000018] ? srso_return_thunk+0x5/0x5f
[ +0.000014] ? __kasan_kmalloc+0xb1/0xc0
[ +0.000018] ? srso_return_thunk+0x5/0x5f
[ +0.000013] ? __kasan_check_read+0x11/0x20
[ +0.000020] amdgpu_hmm_unregister+0x34/0x50 [amdgpu]
[ +0.004695] amdgpu_gem_object_free+0x66/0xa0 [amdgpu]
[ +0.004534] ? __pfx_amdgpu_gem_object_free+0x10/0x10 [amdgpu]
[ +0.004291] ? do_syscall_64+0x5f/0xe0
[ +0.000023] ? srso_return_thunk+0x5/0x5f
[ +0.000017] drm_gem_object_free+0x3b/0x50 [drm]
[ +0.000489] amdgpu_gem_userptr_ioctl+0x306/0x500 [amdgpu]
[ +0.004295] ? __pfx_amdgpu_gem_userptr_ioctl+0x10/0x10 [amdgpu]
[ +0.004270] ? srso_return_thunk+0x5/0x5f
[ +0.000014] ? __this_cpu_preempt_check+0x13/0x20
[ +0.000015] ? srso_return_thunk+0x5/0x5f
[ +0.000013] ? sysvec_apic_timer_interrupt+0x57/0xc0
[ +0.000020] ? srso_return_thunk+0x5/0x5f
[ +0.000014] ? asm_sysvec_apic_timer_interrupt+0x1b/0x20
[ +0.000022] ? drm_ioctl_kernel+0x17b/0x1f0 [drm]
[ +0.000496] ? __pfx_amdgpu_gem_userptr_ioctl+0x10/0x10 [amdgpu]
[ +0.004272] ? drm_ioctl_kernel+0x190/0x1f0 [drm]
[ +0.000492] drm_ioctl_kernel+0x140/0x1f0 [drm]
[ +0.000497] ? __pfx_amdgpu_gem_userptr_ioctl+0x10/0x10 [amdgpu]
[ +0.004297] ? __pfx_drm_ioctl_kernel+0x10/0x10 [d
---truncated--- |
In the Linux kernel, the following vulnerability has been resolved:
drm/amd/display: Fix possible underflow for displays with large vblank
[Why]
Underflow observed when using a display with a large vblank region
and low refresh rate
[How]
Simplify calculation of vblank_nom
Increase value for VBlankNomDefaultUS to 800us |
In the Linux kernel, the following vulnerability has been resolved:
drm/amdgpu: install stub fence into potential unused fence pointers
When using cpu to update page tables, vm update fences are unused.
Install stub fence into these fence pointers instead of NULL
to avoid NULL dereference when calling dma_fence_wait() on them. |
In the Linux kernel, the following vulnerability has been resolved:
erofs: Fix detection of atomic context
Current check for atomic context is not sufficient as
z_erofs_decompressqueue_endio can be called under rcu lock
from blk_mq_flush_plug_list(). See the stacktrace [1]
In such case we should hand off the decompression work for async
processing rather than trying to do sync decompression in current
context. Patch fixes the detection by checking for
rcu_read_lock_any_held() and while at it use more appropriate
!in_task() check than in_atomic().
Background: Historically erofs would always schedule a kworker for
decompression which would incur the scheduling cost regardless of
the context. But z_erofs_decompressqueue_endio() may not always
be in atomic context and we could actually benefit from doing the
decompression in z_erofs_decompressqueue_endio() if we are in
thread context, for example when running with dm-verity.
This optimization was later added in patch [2] which has shown
improvement in performance benchmarks.
==============================================
[1] Problem stacktrace
[name:core&]BUG: sleeping function called from invalid context at kernel/locking/mutex.c:291
[name:core&]in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 1615, name: CpuMonitorServi
[name:core&]preempt_count: 0, expected: 0
[name:core&]RCU nest depth: 1, expected: 0
CPU: 7 PID: 1615 Comm: CpuMonitorServi Tainted: G S W OE 6.1.25-android14-5-maybe-dirty-mainline #1
Hardware name: MT6897 (DT)
Call trace:
dump_backtrace+0x108/0x15c
show_stack+0x20/0x30
dump_stack_lvl+0x6c/0x8c
dump_stack+0x20/0x48
__might_resched+0x1fc/0x308
__might_sleep+0x50/0x88
mutex_lock+0x2c/0x110
z_erofs_decompress_queue+0x11c/0xc10
z_erofs_decompress_kickoff+0x110/0x1a4
z_erofs_decompressqueue_endio+0x154/0x180
bio_endio+0x1b0/0x1d8
__dm_io_complete+0x22c/0x280
clone_endio+0xe4/0x280
bio_endio+0x1b0/0x1d8
blk_update_request+0x138/0x3a4
blk_mq_plug_issue_direct+0xd4/0x19c
blk_mq_flush_plug_list+0x2b0/0x354
__blk_flush_plug+0x110/0x160
blk_finish_plug+0x30/0x4c
read_pages+0x2fc/0x370
page_cache_ra_unbounded+0xa4/0x23c
page_cache_ra_order+0x290/0x320
do_sync_mmap_readahead+0x108/0x2c0
filemap_fault+0x19c/0x52c
__do_fault+0xc4/0x114
handle_mm_fault+0x5b4/0x1168
do_page_fault+0x338/0x4b4
do_translation_fault+0x40/0x60
do_mem_abort+0x60/0xc8
el0_da+0x4c/0xe0
el0t_64_sync_handler+0xd4/0xfc
el0t_64_sync+0x1a0/0x1a4
[2] Link: https://lore.kernel.org/all/20210317035448.13921-1-huangjianan@oppo.com/ |
In the Linux kernel, the following vulnerability has been resolved:
fs/ntfs3: Add length check in indx_get_root
This adds a length check to guarantee the retrieved index root is legit.
[ 162.459513] BUG: KASAN: use-after-free in hdr_find_e.isra.0+0x10c/0x320
[ 162.460176] Read of size 2 at addr ffff8880037bca99 by task mount/243
[ 162.460851]
[ 162.461252] CPU: 0 PID: 243 Comm: mount Not tainted 6.0.0-rc7 #42
[ 162.461744] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[ 162.462609] Call Trace:
[ 162.462954] <TASK>
[ 162.463276] dump_stack_lvl+0x49/0x63
[ 162.463822] print_report.cold+0xf5/0x689
[ 162.464608] ? unwind_get_return_address+0x3a/0x60
[ 162.465766] ? hdr_find_e.isra.0+0x10c/0x320
[ 162.466975] kasan_report+0xa7/0x130
[ 162.467506] ? _raw_spin_lock_irq+0xc0/0xf0
[ 162.467998] ? hdr_find_e.isra.0+0x10c/0x320
[ 162.468536] __asan_load2+0x68/0x90
[ 162.468923] hdr_find_e.isra.0+0x10c/0x320
[ 162.469282] ? cmp_uints+0xe0/0xe0
[ 162.469557] ? cmp_sdh+0x90/0x90
[ 162.469864] ? ni_find_attr+0x214/0x300
[ 162.470217] ? ni_load_mi+0x80/0x80
[ 162.470479] ? entry_SYSCALL_64_after_hwframe+0x63/0xcd
[ 162.470931] ? ntfs_bread_run+0x190/0x190
[ 162.471307] ? indx_get_root+0xe4/0x190
[ 162.471556] ? indx_get_root+0x140/0x190
[ 162.471833] ? indx_init+0x1e0/0x1e0
[ 162.472069] ? fnd_clear+0x115/0x140
[ 162.472363] ? _raw_spin_lock_irqsave+0x100/0x100
[ 162.472731] indx_find+0x184/0x470
[ 162.473461] ? sysvec_apic_timer_interrupt+0x57/0xc0
[ 162.474429] ? indx_find_buffer+0x2d0/0x2d0
[ 162.474704] ? do_syscall_64+0x3b/0x90
[ 162.474962] dir_search_u+0x196/0x2f0
[ 162.475381] ? ntfs_nls_to_utf16+0x450/0x450
[ 162.475661] ? ntfs_security_init+0x3d6/0x440
[ 162.475906] ? is_sd_valid+0x180/0x180
[ 162.476191] ntfs_extend_init+0x13f/0x2c0
[ 162.476496] ? ntfs_fix_post_read+0x130/0x130
[ 162.476861] ? iput.part.0+0x286/0x320
[ 162.477325] ntfs_fill_super+0x11e0/0x1b50
[ 162.477709] ? put_ntfs+0x1d0/0x1d0
[ 162.477970] ? vsprintf+0x20/0x20
[ 162.478258] ? set_blocksize+0x95/0x150
[ 162.478538] get_tree_bdev+0x232/0x370
[ 162.478789] ? put_ntfs+0x1d0/0x1d0
[ 162.479038] ntfs_fs_get_tree+0x15/0x20
[ 162.479374] vfs_get_tree+0x4c/0x130
[ 162.479729] path_mount+0x654/0xfe0
[ 162.480124] ? putname+0x80/0xa0
[ 162.480484] ? finish_automount+0x2e0/0x2e0
[ 162.480894] ? putname+0x80/0xa0
[ 162.481467] ? kmem_cache_free+0x1c4/0x440
[ 162.482280] ? putname+0x80/0xa0
[ 162.482714] do_mount+0xd6/0xf0
[ 162.483264] ? path_mount+0xfe0/0xfe0
[ 162.484782] ? __kasan_check_write+0x14/0x20
[ 162.485593] __x64_sys_mount+0xca/0x110
[ 162.486024] do_syscall_64+0x3b/0x90
[ 162.486543] entry_SYSCALL_64_after_hwframe+0x63/0xcd
[ 162.487141] RIP: 0033:0x7f9d374e948a
[ 162.488324] Code: 48 8b 0d 11 fa 2a 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 008
[ 162.489728] RSP: 002b:00007ffe30e73d18 EFLAGS: 00000206 ORIG_RAX: 00000000000000a5
[ 162.490971] RAX: ffffffffffffffda RBX: 0000561cdb43a060 RCX: 00007f9d374e948a
[ 162.491669] RDX: 0000561cdb43a260 RSI: 0000561cdb43a2e0 RDI: 0000561cdb442af0
[ 162.492050] RBP: 0000000000000000 R08: 0000561cdb43a280 R09: 0000000000000020
[ 162.492459] R10: 00000000c0ed0000 R11: 0000000000000206 R12: 0000561cdb442af0
[ 162.493183] R13: 0000561cdb43a260 R14: 0000000000000000 R15: 00000000ffffffff
[ 162.493644] </TASK>
[ 162.493908]
[ 162.494214] The buggy address belongs to the physical page:
[ 162.494761] page:000000003e38a3d5 refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x37bc
[ 162.496064] flags: 0xfffffc0000000(node=0|zone=1|lastcpupid=0x1fffff)
[ 162.497278] raw: 000fffffc0000000 ffffea00000df1c8 ffffea00000df008 0000000000000000
[ 162.498928] raw: 0000000000000000 0000000000240000 00000000ffffffff 0000000000000000
[ 162.500542] page dumped becau
---truncated--- |
In the Linux kernel, the following vulnerability has been resolved:
wifi: ath12k: Avoid NULL pointer access during management transmit cleanup
Currently 'ar' reference is not added in skb_cb.
Though this is generally not used during transmit completion
callbacks, on interface removal the remaining idr cleanup callback
uses the ar pointer from skb_cb from management txmgmt_idr. Hence fill them
during transmit call for proper usage to avoid NULL pointer dereference.
Tested-on: QCN9274 hw2.0 PCI WLAN.WBE.1.0.1-00029-QCAHKSWPL_SILICONZ-1 |
In the Linux kernel, the following vulnerability has been resolved:
mm: fix zswap writeback race condition
The zswap writeback mechanism can cause a race condition resulting in
memory corruption, where a swapped out page gets swapped in with data that
was written to a different page.
The race unfolds like this:
1. a page with data A and swap offset X is stored in zswap
2. page A is removed off the LRU by zpool driver for writeback in
zswap-shrink work, data for A is mapped by zpool driver
3. user space program faults and invalidates page entry A, offset X is
considered free
4. kswapd stores page B at offset X in zswap (zswap could also be
full, if so, page B would then be IOed to X, then skip step 5.)
5. entry A is replaced by B in tree->rbroot, this doesn't affect the
local reference held by zswap-shrink work
6. zswap-shrink work writes back A at X, and frees zswap entry A
7. swapin of slot X brings A in memory instead of B
The fix:
Once the swap page cache has been allocated (case ZSWAP_SWAPCACHE_NEW),
zswap-shrink work just checks that the local zswap_entry reference is
still the same as the one in the tree. If it's not the same it means that
it's either been invalidated or replaced, in both cases the writeback is
aborted because the local entry contains stale data.
Reproducer:
I originally found this by running `stress` overnight to validate my work
on the zswap writeback mechanism, it manifested after hours on my test
machine. The key to make it happen is having zswap writebacks, so
whatever setup pumps /sys/kernel/debug/zswap/written_back_pages should do
the trick.
In order to reproduce this faster on a vm, I setup a system with ~100M of
available memory and a 500M swap file, then running `stress --vm 1
--vm-bytes 300000000 --vm-stride 4000` makes it happen in matter of tens
of minutes. One can speed things up even more by swinging
/sys/module/zswap/parameters/max_pool_percent up and down between, say, 20
and 1; this makes it reproduce in tens of seconds. It's crucial to set
`--vm-stride` to something other than 4096 otherwise `stress` won't
realize that memory has been corrupted because all pages would have the
same data. |
In the Linux kernel, the following vulnerability has been resolved:
drm/amdgpu: fix calltrace warning in amddrm_buddy_fini
The following call trace is observed when removing the amdgpu driver, which
is caused by that BOs allocated for psp are not freed until removing.
[61811.450562] RIP: 0010:amddrm_buddy_fini.cold+0x29/0x47 [amddrm_buddy]
[61811.450577] Call Trace:
[61811.450577] <TASK>
[61811.450579] amdgpu_vram_mgr_fini+0x135/0x1c0 [amdgpu]
[61811.450728] amdgpu_ttm_fini+0x207/0x290 [amdgpu]
[61811.450870] amdgpu_bo_fini+0x27/0xa0 [amdgpu]
[61811.451012] gmc_v9_0_sw_fini+0x4a/0x60 [amdgpu]
[61811.451166] amdgpu_device_fini_sw+0x117/0x520 [amdgpu]
[61811.451306] amdgpu_driver_release_kms+0x16/0x30 [amdgpu]
[61811.451447] devm_drm_dev_init_release+0x4d/0x80 [drm]
[61811.451466] devm_action_release+0x15/0x20
[61811.451469] release_nodes+0x40/0xb0
[61811.451471] devres_release_all+0x9b/0xd0
[61811.451473] __device_release_driver+0x1bb/0x2a0
[61811.451476] driver_detach+0xf3/0x140
[61811.451479] bus_remove_driver+0x6c/0xf0
[61811.451481] driver_unregister+0x31/0x60
[61811.451483] pci_unregister_driver+0x40/0x90
[61811.451486] amdgpu_exit+0x15/0x447 [amdgpu]
For smu v13_0_2, if the GPU supports xgmi, refer to
commit f5c7e7797060 ("drm/amdgpu: Adjust removal control flow for smu v13_0_2"),
it will run gpu recover in AMDGPU_RESET_FOR_DEVICE_REMOVE mode when removing,
which makes all devices in hive list have hw reset but no resume except the
basic ip blocks, then other ip blocks will not call .hw_fini according to
ip_block.status.hw.
Since psp_free_shared_bufs just includes some software operations, so move
it to psp_sw_fini. |
In the Linux kernel, the following vulnerability has been resolved:
drm/amdkfd: Fix an illegal memory access
In the kfd_wait_on_events() function, the kfd_event_waiter structure is
allocated by alloc_event_waiters(), but the event field of the waiter
structure is not initialized; When copy_from_user() fails in the
kfd_wait_on_events() function, it will enter exception handling to
release the previously allocated memory of the waiter structure;
Due to the event field of the waiters structure being accessed
in the free_waiters() function, this results in illegal memory access
and system crash, here is the crash log:
localhost kernel: RIP: 0010:native_queued_spin_lock_slowpath+0x185/0x1e0
localhost kernel: RSP: 0018:ffffaa53c362bd60 EFLAGS: 00010082
localhost kernel: RAX: ff3d3d6bff4007cb RBX: 0000000000000282 RCX: 00000000002c0000
localhost kernel: RDX: ffff9e855eeacb80 RSI: 000000000000279c RDI: ffffe7088f6a21d0
localhost kernel: RBP: ffffe7088f6a21d0 R08: 00000000002c0000 R09: ffffaa53c362be64
localhost kernel: R10: ffffaa53c362bbd8 R11: 0000000000000001 R12: 0000000000000002
localhost kernel: R13: ffff9e7ead15d600 R14: 0000000000000000 R15: ffff9e7ead15d698
localhost kernel: FS: 0000152a3d111700(0000) GS:ffff9e855ee80000(0000) knlGS:0000000000000000
localhost kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
localhost kernel: CR2: 0000152938000010 CR3: 000000044d7a4000 CR4: 00000000003506e0
localhost kernel: Call Trace:
localhost kernel: _raw_spin_lock_irqsave+0x30/0x40
localhost kernel: remove_wait_queue+0x12/0x50
localhost kernel: kfd_wait_on_events+0x1b6/0x490 [hydcu]
localhost kernel: ? ftrace_graph_caller+0xa0/0xa0
localhost kernel: kfd_ioctl+0x38c/0x4a0 [hydcu]
localhost kernel: ? kfd_ioctl_set_trap_handler+0x70/0x70 [hydcu]
localhost kernel: ? kfd_ioctl_create_queue+0x5a0/0x5a0 [hydcu]
localhost kernel: ? ftrace_graph_caller+0xa0/0xa0
localhost kernel: __x64_sys_ioctl+0x8e/0xd0
localhost kernel: ? syscall_trace_enter.isra.18+0x143/0x1b0
localhost kernel: do_syscall_64+0x33/0x80
localhost kernel: entry_SYSCALL_64_after_hwframe+0x44/0xa9
localhost kernel: RIP: 0033:0x152a4dff68d7
Allocate the structure with kcalloc, and remove redundant 0-initialization
and a redundant loop condition check. |
In the Linux kernel, the following vulnerability has been resolved:
drm/amdgpu: fix ttm_bo calltrace warning in psp_hw_fini
The call trace occurs when the amdgpu is removed after
the mode1 reset. During mode1 reset, from suspend to resume,
there is no need to reinitialize the ta firmware buffer
which caused the bo pin_count increase redundantly.
[ 489.885525] Call Trace:
[ 489.885525] <TASK>
[ 489.885526] amdttm_bo_put+0x34/0x50 [amdttm]
[ 489.885529] amdgpu_bo_free_kernel+0xe8/0x130 [amdgpu]
[ 489.885620] psp_free_shared_bufs+0xb7/0x150 [amdgpu]
[ 489.885720] psp_hw_fini+0xce/0x170 [amdgpu]
[ 489.885815] amdgpu_device_fini_hw+0x2ff/0x413 [amdgpu]
[ 489.885960] ? blocking_notifier_chain_unregister+0x56/0xb0
[ 489.885962] amdgpu_driver_unload_kms+0x51/0x60 [amdgpu]
[ 489.886049] amdgpu_pci_remove+0x5a/0x140 [amdgpu]
[ 489.886132] ? __pm_runtime_resume+0x60/0x90
[ 489.886134] pci_device_remove+0x3e/0xb0
[ 489.886135] __device_release_driver+0x1ab/0x2a0
[ 489.886137] driver_detach+0xf3/0x140
[ 489.886138] bus_remove_driver+0x6c/0xf0
[ 489.886140] driver_unregister+0x31/0x60
[ 489.886141] pci_unregister_driver+0x40/0x90
[ 489.886142] amdgpu_exit+0x15/0x451 [amdgpu] |
In the Linux kernel, the following vulnerability has been resolved:
drm/amdgpu: Fix a null pointer access when the smc_rreg pointer is NULL
In certain types of chips, such as VEGA20, reading the amdgpu_regs_smc file could result in an abnormal null pointer access when the smc_rreg pointer is NULL. Below are the steps to reproduce this issue and the corresponding exception log:
1. Navigate to the directory: /sys/kernel/debug/dri/0
2. Execute command: cat amdgpu_regs_smc
3. Exception Log::
[4005007.702554] BUG: kernel NULL pointer dereference, address: 0000000000000000
[4005007.702562] #PF: supervisor instruction fetch in kernel mode
[4005007.702567] #PF: error_code(0x0010) - not-present page
[4005007.702570] PGD 0 P4D 0
[4005007.702576] Oops: 0010 [#1] SMP NOPTI
[4005007.702581] CPU: 4 PID: 62563 Comm: cat Tainted: G OE 5.15.0-43-generic #46-Ubunt u
[4005007.702590] RIP: 0010:0x0
[4005007.702598] Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6.
[4005007.702600] RSP: 0018:ffffa82b46d27da0 EFLAGS: 00010206
[4005007.702605] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffa82b46d27e68
[4005007.702609] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff9940656e0000
[4005007.702612] RBP: ffffa82b46d27dd8 R08: 0000000000000000 R09: ffff994060c07980
[4005007.702615] R10: 0000000000020000 R11: 0000000000000000 R12: 00007f5e06753000
[4005007.702618] R13: ffff9940656e0000 R14: ffffa82b46d27e68 R15: 00007f5e06753000
[4005007.702622] FS: 00007f5e0755b740(0000) GS:ffff99479d300000(0000) knlGS:0000000000000000
[4005007.702626] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[4005007.702629] CR2: ffffffffffffffd6 CR3: 00000003253fc000 CR4: 00000000003506e0
[4005007.702633] Call Trace:
[4005007.702636] <TASK>
[4005007.702640] amdgpu_debugfs_regs_smc_read+0xb0/0x120 [amdgpu]
[4005007.703002] full_proxy_read+0x5c/0x80
[4005007.703011] vfs_read+0x9f/0x1a0
[4005007.703019] ksys_read+0x67/0xe0
[4005007.703023] __x64_sys_read+0x19/0x20
[4005007.703028] do_syscall_64+0x5c/0xc0
[4005007.703034] ? do_user_addr_fault+0x1e3/0x670
[4005007.703040] ? exit_to_user_mode_prepare+0x37/0xb0
[4005007.703047] ? irqentry_exit_to_user_mode+0x9/0x20
[4005007.703052] ? irqentry_exit+0x19/0x30
[4005007.703057] ? exc_page_fault+0x89/0x160
[4005007.703062] ? asm_exc_page_fault+0x8/0x30
[4005007.703068] entry_SYSCALL_64_after_hwframe+0x44/0xae
[4005007.703075] RIP: 0033:0x7f5e07672992
[4005007.703079] Code: c0 e9 b2 fe ff ff 50 48 8d 3d fa b2 0c 00 e8 c5 1d 02 00 0f 1f 44 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 0f 05 <48> 3d 00 f0 ff ff 77 56 c3 0f 1f 44 00 00 48 83 e c 28 48 89 54 24
[4005007.703083] RSP: 002b:00007ffe03097898 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[4005007.703088] RAX: ffffffffffffffda RBX: 0000000000020000 RCX: 00007f5e07672992
[4005007.703091] RDX: 0000000000020000 RSI: 00007f5e06753000 RDI: 0000000000000003
[4005007.703094] RBP: 00007f5e06753000 R08: 00007f5e06752010 R09: 00007f5e06752010
[4005007.703096] R10: 0000000000000022 R11: 0000000000000246 R12: 0000000000022000
[4005007.703099] R13: 0000000000000003 R14: 0000000000020000 R15: 0000000000020000
[4005007.703105] </TASK>
[4005007.703107] Modules linked in: nf_tables libcrc32c nfnetlink algif_hash af_alg binfmt_misc nls_ iso8859_1 ipmi_ssif ast intel_rapl_msr intel_rapl_common drm_vram_helper drm_ttm_helper amd64_edac t tm edac_mce_amd kvm_amd ccp mac_hid k10temp kvm acpi_ipmi ipmi_si rapl sch_fq_codel ipmi_devintf ipm i_msghandler msr parport_pc ppdev lp parport mtd pstore_blk efi_pstore ramoops pstore_zone reed_solo mon ip_tables x_tables autofs4 ib_uverbs ib_core amdgpu(OE) amddrm_ttm_helper(OE) amdttm(OE) iommu_v 2 amd_sched(OE) amdkcl(OE) drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops cec rc_core drm igb ahci xhci_pci libahci i2c_piix4 i2c_algo_bit xhci_pci_renesas dca
[4005007.703184] CR2: 0000000000000000
[4005007.703188] ---[ en
---truncated--- |
In the Linux kernel, the following vulnerability has been resolved:
drm/amdgpu/vkms: fix a possible null pointer dereference
In amdgpu_vkms_conn_get_modes(), the return value of drm_cvt_mode()
is assigned to mode, which will lead to a NULL pointer dereference
on failure of drm_cvt_mode(). Add a check to avoid null pointer
dereference. |
In the Linux kernel, the following vulnerability has been resolved:
drm/amdgpu: Fix potential null pointer derefernce
The amdgpu_ras_get_context may return NULL if device
not support ras feature, so add check before using. |