CVE |
Vendors |
Products |
Updated |
CVSS v3.1 |
In the Linux kernel, the following vulnerability has been resolved:
cpufreq: Init completion before kobject_init_and_add()
In cpufreq_policy_alloc(), it will call uninitialed completion in
cpufreq_sysfs_release() when kobject_init_and_add() fails. And
that will cause a crash such as the following page fault in complete:
BUG: unable to handle page fault for address: fffffffffffffff8
[..]
RIP: 0010:complete+0x98/0x1f0
[..]
Call Trace:
kobject_put+0x1be/0x4c0
cpufreq_online.cold+0xee/0x1fd
cpufreq_add_dev+0x183/0x1e0
subsys_interface_register+0x3f5/0x4e0
cpufreq_register_driver+0x3b7/0x670
acpi_cpufreq_init+0x56c/0x1000 [acpi_cpufreq]
do_one_initcall+0x13d/0x780
do_init_module+0x1c3/0x630
load_module+0x6e67/0x73b0
__do_sys_finit_module+0x181/0x240
do_syscall_64+0x35/0x80
entry_SYSCALL_64_after_hwframe+0x63/0xcd |
In the Linux kernel, the following vulnerability has been resolved:
memory: pl353-smc: Fix refcount leak bug in pl353_smc_probe()
The break of for_each_available_child_of_node() needs a
corresponding of_node_put() when the reference 'child' is not
used anymore. Here we do not need to call of_node_put() in
fail path as '!match' means no break.
While the of_platform_device_create() will created a new
reference by 'child' but it has considered the refcounting. |
In the Linux kernel, the following vulnerability has been resolved:
ice: fix Rx page leak on multi-buffer frames
The ice_put_rx_mbuf() function handles calling ice_put_rx_buf() for each
buffer in the current frame. This function was introduced as part of
handling multi-buffer XDP support in the ice driver.
It works by iterating over the buffers from first_desc up to 1 plus the
total number of fragments in the frame, cached from before the XDP program
was executed.
If the hardware posts a descriptor with a size of 0, the logic used in
ice_put_rx_mbuf() breaks. Such descriptors get skipped and don't get added
as fragments in ice_add_xdp_frag. Since the buffer isn't counted as a
fragment, we do not iterate over it in ice_put_rx_mbuf(), and thus we don't
call ice_put_rx_buf().
Because we don't call ice_put_rx_buf(), we don't attempt to re-use the
page or free it. This leaves a stale page in the ring, as we don't
increment next_to_alloc.
The ice_reuse_rx_page() assumes that the next_to_alloc has been incremented
properly, and that it always points to a buffer with a NULL page. Since
this function doesn't check, it will happily recycle a page over the top
of the next_to_alloc buffer, losing track of the old page.
Note that this leak only occurs for multi-buffer frames. The
ice_put_rx_mbuf() function always handles at least one buffer, so a
single-buffer frame will always get handled correctly. It is not clear
precisely why the hardware hands us descriptors with a size of 0 sometimes,
but it happens somewhat regularly with "jumbo frames" used by 9K MTU.
To fix ice_put_rx_mbuf(), we need to make sure to call ice_put_rx_buf() on
all buffers between first_desc and next_to_clean. Borrow the logic of a
similar function in i40e used for this same purpose. Use the same logic
also in ice_get_pgcnts().
Instead of iterating over just the number of fragments, use a loop which
iterates until the current index reaches to the next_to_clean element just
past the current frame. Unlike i40e, the ice_put_rx_mbuf() function does
call ice_put_rx_buf() on the last buffer of the frame indicating the end of
packet.
For non-linear (multi-buffer) frames, we need to take care when adjusting
the pagecnt_bias. An XDP program might release fragments from the tail of
the frame, in which case that fragment page is already released. Only
update the pagecnt_bias for the first descriptor and fragments still
remaining post-XDP program. Take care to only access the shared info for
fragmented buffers, as this avoids a significant cache miss.
The xdp_xmit value only needs to be updated if an XDP program is run, and
only once per packet. Drop the xdp_xmit pointer argument from
ice_put_rx_mbuf(). Instead, set xdp_xmit in the ice_clean_rx_irq() function
directly. This avoids needing to pass the argument and avoids an extra
bit-wise OR for each buffer in the frame.
Move the increment of the ntc local variable to ensure its updated *before*
all calls to ice_get_pgcnts() or ice_put_rx_mbuf(), as the loop logic
requires the index of the element just after the current frame.
Now that we use an index pointer in the ring to identify the packet, we no
longer need to track or cache the number of fragments in the rx_ring. |
In the Linux kernel, the following vulnerability has been resolved:
wifi: wilc1000: avoid buffer overflow in WID string configuration
Fix the following copy overflow warning identified by Smatch checker.
drivers/net/wireless/microchip/wilc1000/wlan_cfg.c:184 wilc_wlan_parse_response_frame()
error: '__memcpy()' 'cfg->s[i]->str' copy overflow (512 vs 65537)
This patch introduces size check before accessing the memory buffer.
The checks are base on the WID type of received data from the firmware.
For WID string configuration, the size limit is determined by individual
element size in 'struct wilc_cfg_str_vals' that is maintained in 'len' field
of 'struct wilc_cfg_str'. |
In the Linux kernel, the following vulnerability has been resolved:
crypto: ccp - Always pass in an error pointer to __sev_platform_shutdown_locked()
When
9770b428b1a2 ("crypto: ccp - Move dev_info/err messages for SEV/SNP init and shutdown")
moved the error messages dumping so that they don't need to be issued by
the callers, it missed the case where __sev_firmware_shutdown() calls
__sev_platform_shutdown_locked() with a NULL argument which leads to
a NULL ptr deref on the shutdown path, during suspend to disk:
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: Oops: 0000 [#1] SMP NOPTI
CPU: 0 UID: 0 PID: 983 Comm: hib.sh Not tainted 6.17.0-rc4+ #1 PREEMPT(voluntary)
Hardware name: Supermicro Super Server/H12SSL-i, BIOS 2.5 09/08/2022
RIP: 0010:__sev_platform_shutdown_locked.cold+0x0/0x21 [ccp]
That rIP is:
00000000000006fd <__sev_platform_shutdown_locked.cold>:
6fd: 8b 13 mov (%rbx),%edx
6ff: 48 8b 7d 00 mov 0x0(%rbp),%rdi
703: 89 c1 mov %eax,%ecx
Code: 74 05 31 ff 41 89 3f 49 8b 3e 89 ea 48 c7 c6 a0 8e 54 a0 41 bf 92 ff ff ff e8 e5 2e 09 e1 c6 05 2a d4 38 00 01 e9 26 af ff ff <8b> 13 48 8b 7d 00 89 c1 48 c7 c6 18 90 54 a0 89 44 24 04 e8 c1 2e
RSP: 0018:ffffc90005467d00 EFLAGS: 00010282
RAX: 00000000ffffff92 RBX: 0000000000000000 RCX: 0000000000000000
^^^^^^^^^^^^^^^^
and %rbx is nice and clean.
Call Trace:
<TASK>
__sev_firmware_shutdown.isra.0
sev_dev_destroy
psp_dev_destroy
sp_destroy
pci_device_shutdown
device_shutdown
kernel_power_off
hibernate.cold
state_store
kernfs_fop_write_iter
vfs_write
ksys_write
do_syscall_64
entry_SYSCALL_64_after_hwframe
Pass in a pointer to the function-local error var in the caller.
With that addressed, suspending the ccp shows the error properly at
least:
ccp 0000:47:00.1: sev command 0x2 timed out, disabling PSP
ccp 0000:47:00.1: SEV: failed to SHUTDOWN error 0x0, rc -110
SEV-SNP: Leaking PFN range 0x146800-0x146a00
SEV-SNP: PFN 0x146800 unassigned, dumping non-zero entries in 2M PFN region: [0x146800 - 0x146a00]
...
ccp 0000:47:00.1: SEV-SNP firmware shutdown failed, rc -16, error 0x0
ACPI: PM: Preparing to enter system sleep state S5
kvm: exiting hardware virtualization
reboot: Power down
Btw, this driver is crying to be cleaned up to pass in a proper I/O
struct which can be used to store information between the different
functions, otherwise stuff like that will happen in the future again. |
In the Linux kernel, the following vulnerability has been resolved:
ext4: add EXT4_IGET_BAD flag to prevent unexpected bad inode
There are many places that will get unhappy (and crash) when ext4_iget()
returns a bad inode. However, if iget the boot loader inode, allows a bad
inode to be returned, because the inode may not be initialized. This
mechanism can be used to bypass some checks and cause panic. To solve this
problem, we add a special iget flag EXT4_IGET_BAD. Only with this flag
we'd be returning bad inode from ext4_iget(), otherwise we always return
the error code if the inode is bad inode.(suggested by Jan Kara) |
In the Linux kernel, the following vulnerability has been resolved:
drm: bridge: anx7625: Fix NULL pointer dereference with early IRQ
If the interrupt occurs before resource initialization is complete, the
interrupt handler/worker may access uninitialized data such as the I2C
tcpc_client device, potentially leading to NULL pointer dereference. |
In the Linux kernel, the following vulnerability has been resolved:
zram: fix slot write race condition
Parallel concurrent writes to the same zram index result in leaked
zsmalloc handles. Schematically we can have something like this:
CPU0 CPU1
zram_slot_lock()
zs_free(handle)
zram_slot_lock()
zram_slot_lock()
zs_free(handle)
zram_slot_lock()
compress compress
handle = zs_malloc() handle = zs_malloc()
zram_slot_lock
zram_set_handle(handle)
zram_slot_lock
zram_slot_lock
zram_set_handle(handle)
zram_slot_lock
Either CPU0 or CPU1 zsmalloc handle will leak because zs_free() is done
too early. In fact, we need to reset zram entry right before we set its
new handle, all under the same slot lock scope. |
In the Linux kernel, the following vulnerability has been resolved:
smb: client: let recv_done verify data_offset, data_length and remaining_data_length
This is inspired by the related server fixes. |
In the Linux kernel, the following vulnerability has been resolved:
ksmbd: smbdirect: validate data_offset and data_length field of smb_direct_data_transfer
If data_offset and data_length of smb_direct_data_transfer struct are
invalid, out of bounds issue could happen.
This patch validate data_offset and data_length field in recv_done. |
In the Linux kernel, the following vulnerability has been resolved:
RDMA/core: Make sure "ib_port" is valid when access sysfs node
The "ib_port" structure must be set before adding the sysfs kobject,
and reset after removing it, otherwise it may crash when accessing
the sysfs node:
Unable to handle kernel NULL pointer dereference at virtual address 0000000000000050
Mem abort info:
ESR = 0x96000006
Exception class = DABT (current EL), IL = 32 bits
SET = 0, FnV = 0
EA = 0, S1PTW = 0
Data abort info:
ISV = 0, ISS = 0x00000006
CM = 0, WnR = 0
user pgtable: 4k pages, 48-bit VAs, pgdp = 00000000e85f5ba5
[0000000000000050] pgd=0000000848fd9003, pud=000000085b387003, pmd=0000000000000000
Internal error: Oops: 96000006 [#2] PREEMPT SMP
Modules linked in: ib_umad(O) mlx5_ib(O) nfnetlink_cttimeout(E) nfnetlink(E) act_gact(E) cls_flower(E) sch_ingress(E) openvswitch(E) nsh(E) nf_nat_ipv6(E) nf_nat_ipv4(E) nf_conncount(E) nf_nat(E) nf_conntrack(E) nf_defrag_ipv6(E) nf_defrag_ipv4(E) mst_pciconf(O) ipmi_devintf(E) ipmi_msghandler(E) ipmb_dev_int(OE) mlx5_core(O) mlxfw(O) mlxdevm(O) auxiliary(O) ib_uverbs(O) ib_core(O) mlx_compat(O) psample(E) sbsa_gwdt(E) uio_pdrv_genirq(E) uio(E) mlxbf_pmc(OE) mlxbf_gige(OE) mlxbf_tmfifo(OE) gpio_mlxbf2(OE) pwr_mlxbf(OE) mlx_trio(OE) i2c_mlxbf(OE) mlx_bootctl(OE) bluefield_edac(OE) knem(O) ip_tables(E) ipv6(E) crc_ccitt(E) [last unloaded: mst_pci]
Process grep (pid: 3372, stack limit = 0x0000000022055c92)
CPU: 5 PID: 3372 Comm: grep Tainted: G D OE 4.19.161-mlnx.47.gadcd9e3 #1
Hardware name: https://www.mellanox.com BlueField SoC/BlueField SoC, BIOS BlueField:3.9.2-15-ga2403ab Sep 8 2022
pstate: 40000005 (nZcv daif -PAN -UAO)
pc : hw_stat_port_show+0x4c/0x80 [ib_core]
lr : port_attr_show+0x40/0x58 [ib_core]
sp : ffff000029f43b50
x29: ffff000029f43b50 x28: 0000000019375000
x27: ffff8007b821a540 x26: ffff000029f43e30
x25: 0000000000008000 x24: ffff000000eaa958
x23: 0000000000001000 x22: ffff8007a4ce3000
x21: ffff8007baff8000 x20: ffff8007b9066ac0
x19: ffff8007bae97578 x18: 0000000000000000
x17: 0000000000000000 x16: 0000000000000000
x15: 0000000000000000 x14: 0000000000000000
x13: 0000000000000000 x12: 0000000000000000
x11: 0000000000000000 x10: 0000000000000000
x9 : 0000000000000000 x8 : ffff8007a4ce4000
x7 : 0000000000000000 x6 : 000000000000003f
x5 : ffff000000e6a280 x4 : ffff8007a4ce3000
x3 : 0000000000000000 x2 : aaaaaaaaaaaaaaab
x1 : ffff8007b9066a10 x0 : ffff8007baff8000
Call trace:
hw_stat_port_show+0x4c/0x80 [ib_core]
port_attr_show+0x40/0x58 [ib_core]
sysfs_kf_seq_show+0x8c/0x150
kernfs_seq_show+0x44/0x50
seq_read+0x1b4/0x45c
kernfs_fop_read+0x148/0x1d8
__vfs_read+0x58/0x180
vfs_read+0x94/0x154
ksys_read+0x68/0xd8
__arm64_sys_read+0x28/0x34
el0_svc_common+0x88/0x18c
el0_svc_handler+0x78/0x94
el0_svc+0x8/0xe8
Code: f2955562 aa1603e4 aa1503e0 f9405683 (f9402861) |
In the Linux kernel, the following vulnerability has been resolved:
bpf: Propagate error from htab_lock_bucket() to userspace
In __htab_map_lookup_and_delete_batch() if htab_lock_bucket() returns
-EBUSY, it will go to next bucket. Going to next bucket may not only
skip the elements in current bucket silently, but also incur
out-of-bound memory access or expose kernel memory to userspace if
current bucket_cnt is greater than bucket_size or zero.
Fixing it by stopping batch operation and returning -EBUSY when
htab_lock_bucket() fails, and the application can retry or skip the busy
batch as needed. |
In the Linux kernel, the following vulnerability has been resolved:
coresight: cti: Fix hang in cti_disable_hw()
cti_enable_hw() and cti_disable_hw() are called from an atomic context
so shouldn't use runtime PM because it can result in a sleep when
communicating with firmware.
Since commit 3c6656337852 ("Revert "firmware: arm_scmi: Add clock
management to the SCMI power domain""), this causes a hang on Juno when
running the Perf Coresight tests or running this command:
perf record -e cs_etm//u -- ls
This was also missed until the revert commit because pm_runtime_put()
was called with the wrong device until commit 692c9a499b28 ("coresight:
cti: Correct the parameter for pm_runtime_put")
With lock and scheduler debugging enabled the following is output:
coresight cti_sys0: cti_enable_hw -- dev:cti_sys0 parent: 20020000.cti
BUG: sleeping function called from invalid context at drivers/base/power/runtime.c:1151
in_atomic(): 1, irqs_disabled(): 128, non_block: 0, pid: 330, name: perf-exec
preempt_count: 2, expected: 0
RCU nest depth: 0, expected: 0
INFO: lockdep is turned off.
irq event stamp: 0
hardirqs last enabled at (0): [<0000000000000000>] 0x0
hardirqs last disabled at (0): [<ffff80000822b394>] copy_process+0xa0c/0x1948
softirqs last enabled at (0): [<ffff80000822b394>] copy_process+0xa0c/0x1948
softirqs last disabled at (0): [<0000000000000000>] 0x0
CPU: 3 PID: 330 Comm: perf-exec Not tainted 6.0.0-00053-g042116d99298 #7
Hardware name: ARM LTD ARM Juno Development Platform/ARM Juno Development Platform, BIOS EDK II Sep 13 2022
Call trace:
dump_backtrace+0x134/0x140
show_stack+0x20/0x58
dump_stack_lvl+0x8c/0xb8
dump_stack+0x18/0x34
__might_resched+0x180/0x228
__might_sleep+0x50/0x88
__pm_runtime_resume+0xac/0xb0
cti_enable+0x44/0x120
coresight_control_assoc_ectdev+0xc0/0x150
coresight_enable_path+0xb4/0x288
etm_event_start+0x138/0x170
etm_event_add+0x48/0x70
event_sched_in.isra.122+0xb4/0x280
merge_sched_in+0x1fc/0x3d0
visit_groups_merge.constprop.137+0x16c/0x4b0
ctx_sched_in+0x114/0x1f0
perf_event_sched_in+0x60/0x90
ctx_resched+0x68/0xb0
perf_event_exec+0x138/0x508
begin_new_exec+0x52c/0xd40
load_elf_binary+0x6b8/0x17d0
bprm_execve+0x360/0x7f8
do_execveat_common.isra.47+0x218/0x238
__arm64_sys_execve+0x48/0x60
invoke_syscall+0x4c/0x110
el0_svc_common.constprop.4+0xfc/0x120
do_el0_svc+0x34/0xc0
el0_svc+0x40/0x98
el0t_64_sync_handler+0x98/0xc0
el0t_64_sync+0x170/0x174
Fix the issue by removing the runtime PM calls completely. They are not
needed here because it must have already been done when building the
path for a trace.
[ Fix build warnings ] |
In the Linux kernel, the following vulnerability has been resolved:
tls: make sure to abort the stream if headers are bogus
Normally we wait for the socket to buffer up the whole record
before we service it. If the socket has a tiny buffer, however,
we read out the data sooner, to prevent connection stalls.
Make sure that we abort the connection when we find out late
that the record is actually invalid. Retrying the parsing is
fine in itself but since we copy some more data each time
before we parse we can overflow the allocated skb space.
Constructing a scenario in which we're under pressure without
enough data in the socket to parse the length upfront is quite
hard. syzbot figured out a way to do this by serving us the header
in small OOB sends, and then filling in the recvbuf with a large
normal send.
Make sure that tls_rx_msg_size() aborts strp, if we reach
an invalid record there's really no way to recover. |
In the Linux kernel, the following vulnerability has been resolved:
macintosh: fix possible memory leak in macio_add_one_device()
Afer commit 1fa5ae857bb1 ("driver core: get rid of struct device's
bus_id string array"), the name of device is allocated dynamically. It
needs to be freed when of_device_register() fails. Call put_device() to
give up the reference that's taken in device_initialize(), so that it
can be freed in kobject_cleanup() when the refcount hits 0.
macio device is freed in macio_release_dev(), so the kfree() can be
removed. |
In the Linux kernel, the following vulnerability has been resolved:
ALSA: usb-audio: Fix potential memory leaks
When the driver hits -ENOMEM at allocating a URB or a buffer, it
aborts and goes to the error path that releases the all previously
allocated resources. However, when -ENOMEM hits at the middle of the
sync EP URB allocation loop, the partially allocated URBs might be
left without released, because ep->nurbs is still zero at that point.
Fix it by setting ep->nurbs at first, so that the error handler loops
over the full URB list. |
In the Linux kernel, the following vulnerability has been resolved:
dm-stripe: fix a possible integer overflow
There's a possible integer overflow in stripe_io_hints if we have too
large chunk size. Test if the overflow happened, and if it did, don't set
limits->io_min and limits->io_opt; |
In the Linux kernel, the following vulnerability has been resolved:
smb: client: let smbd_destroy() call disable_work_sync(&info->post_send_credits_work)
In smbd_destroy() we may destroy the memory so we better
wait until post_send_credits_work is no longer pending
and will never be started again.
I actually just hit the case using rxe:
WARNING: CPU: 0 PID: 138 at drivers/infiniband/sw/rxe/rxe_verbs.c:1032 rxe_post_recv+0x1ee/0x480 [rdma_rxe]
...
[ 5305.686979] [ T138] smbd_post_recv+0x445/0xc10 [cifs]
[ 5305.687135] [ T138] ? srso_alias_return_thunk+0x5/0xfbef5
[ 5305.687149] [ T138] ? __kasan_check_write+0x14/0x30
[ 5305.687185] [ T138] ? __pfx_smbd_post_recv+0x10/0x10 [cifs]
[ 5305.687329] [ T138] ? __pfx__raw_spin_lock_irqsave+0x10/0x10
[ 5305.687356] [ T138] ? srso_alias_return_thunk+0x5/0xfbef5
[ 5305.687368] [ T138] ? srso_alias_return_thunk+0x5/0xfbef5
[ 5305.687378] [ T138] ? _raw_spin_unlock_irqrestore+0x11/0x60
[ 5305.687389] [ T138] ? srso_alias_return_thunk+0x5/0xfbef5
[ 5305.687399] [ T138] ? get_receive_buffer+0x168/0x210 [cifs]
[ 5305.687555] [ T138] smbd_post_send_credits+0x382/0x4b0 [cifs]
[ 5305.687701] [ T138] ? __pfx_smbd_post_send_credits+0x10/0x10 [cifs]
[ 5305.687855] [ T138] ? __pfx___schedule+0x10/0x10
[ 5305.687865] [ T138] ? __pfx__raw_spin_lock_irq+0x10/0x10
[ 5305.687875] [ T138] ? queue_delayed_work_on+0x8e/0xa0
[ 5305.687889] [ T138] process_one_work+0x629/0xf80
[ 5305.687908] [ T138] ? srso_alias_return_thunk+0x5/0xfbef5
[ 5305.687917] [ T138] ? __kasan_check_write+0x14/0x30
[ 5305.687933] [ T138] worker_thread+0x87f/0x1570
...
It means rxe_post_recv was called after rdma_destroy_qp().
This happened because put_receive_buffer() was triggered
by ib_drain_qp() and called:
queue_work(info->workqueue, &info->post_send_credits_work); |
In the Linux kernel, the following vulnerability has been resolved:
net/mlx5e: Harden uplink netdev access against device unbind
The function mlx5_uplink_netdev_get() gets the uplink netdevice
pointer from mdev->mlx5e_res.uplink_netdev. However, the netdevice can
be removed and its pointer cleared when unbound from the mlx5_core.eth
driver. This results in a NULL pointer, causing a kernel panic.
BUG: unable to handle page fault for address: 0000000000001300
at RIP: 0010:mlx5e_vport_rep_load+0x22a/0x270 [mlx5_core]
Call Trace:
<TASK>
mlx5_esw_offloads_rep_load+0x68/0xe0 [mlx5_core]
esw_offloads_enable+0x593/0x910 [mlx5_core]
mlx5_eswitch_enable_locked+0x341/0x420 [mlx5_core]
mlx5_devlink_eswitch_mode_set+0x17e/0x3a0 [mlx5_core]
devlink_nl_eswitch_set_doit+0x60/0xd0
genl_family_rcv_msg_doit+0xe0/0x130
genl_rcv_msg+0x183/0x290
netlink_rcv_skb+0x4b/0xf0
genl_rcv+0x24/0x40
netlink_unicast+0x255/0x380
netlink_sendmsg+0x1f3/0x420
__sock_sendmsg+0x38/0x60
__sys_sendto+0x119/0x180
do_syscall_64+0x53/0x1d0
entry_SYSCALL_64_after_hwframe+0x4b/0x53
Ensure the pointer is valid before use by checking it for NULL. If it
is valid, immediately call netdev_hold() to take a reference, and
preventing the netdevice from being freed while it is in use. |
In the Linux kernel, the following vulnerability has been resolved:
xen/gntdev: Accommodate VMA splitting
Prior to this commit, the gntdev driver code did not handle the
following scenario correctly with paravirtualized (PV) Xen domains:
* User process sets up a gntdev mapping composed of two grant mappings
(i.e., two pages shared by another Xen domain).
* User process munmap()s one of the pages.
* User process munmap()s the remaining page.
* User process exits.
In the scenario above, the user process would cause the kernel to log
the following messages in dmesg for the first munmap(), and the second
munmap() call would result in similar log messages:
BUG: Bad page map in process doublemap.test pte:... pmd:...
page:0000000057c97bff refcount:1 mapcount:-1 \
mapping:0000000000000000 index:0x0 pfn:...
...
page dumped because: bad pte
...
file:gntdev fault:0x0 mmap:gntdev_mmap [xen_gntdev] readpage:0x0
...
Call Trace:
<TASK>
dump_stack_lvl+0x46/0x5e
print_bad_pte.cold+0x66/0xb6
unmap_page_range+0x7e5/0xdc0
unmap_vmas+0x78/0xf0
unmap_region+0xa8/0x110
__do_munmap+0x1ea/0x4e0
__vm_munmap+0x75/0x120
__x64_sys_munmap+0x28/0x40
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x61/0xcb
...
For each munmap() call, the Xen hypervisor (if built with CONFIG_DEBUG)
would print out the following and trigger a general protection fault in
the affected Xen PV domain:
(XEN) d0v... Attempt to implicitly unmap d0's grant PTE ...
(XEN) d0v... Attempt to implicitly unmap d0's grant PTE ...
As of this writing, gntdev_grant_map structure's vma field (referred to
as map->vma below) is mainly used for checking the start and end
addresses of mappings. However, with split VMAs, these may change, and
there could be more than one VMA associated with a gntdev mapping.
Hence, remove the use of map->vma and rely on map->pages_vm_start for
the original start address and on (map->count << PAGE_SHIFT) for the
original mapping size. Let the invalidate() and find_special_page()
hooks use these.
Also, given that there can be multiple VMAs associated with a gntdev
mapping, move the "mmu_interval_notifier_remove(&map->notifier)" call to
the end of gntdev_put_map, so that the MMU notifier is only removed
after the closing of the last remaining VMA.
Finally, use an atomic to prevent inadvertent gntdev mapping re-use,
instead of using the map->live_grants atomic counter and/or the map->vma
pointer (the latter of which is now removed). This prevents the
userspace from mmap()'ing (with MAP_FIXED) a gntdev mapping over the
same address range as a previously set up gntdev mapping. This scenario
can be summarized with the following call-trace, which was valid prior
to this commit:
mmap
gntdev_mmap
mmap (repeat mmap with MAP_FIXED over the same address range)
gntdev_invalidate
unmap_grant_pages (sets 'being_removed' entries to true)
gnttab_unmap_refs_async
unmap_single_vma
gntdev_mmap (maps the shared pages again)
munmap
gntdev_invalidate
unmap_grant_pages
(no-op because 'being_removed' entries are true)
unmap_single_vma (For PV domains, Xen reports that a granted page
is being unmapped and triggers a general protection fault in the
affected domain, if Xen was built with CONFIG_DEBUG)
The fix for this last scenario could be worth its own commit, but we
opted for a single commit, because removing the gntdev_grant_map
structure's vma field requires guarding the entry to gntdev_mmap(), and
the live_grants atomic counter is not sufficient on its own to prevent
the mmap() over a pre-existing mapping. |