Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

5.4.210 velinux #3

Closed

Conversation

liuxuepinggit
Copy link
Collaborator

No description provided.

iuliana-prodan and others added 30 commits November 22, 2022 22:43
Added support for batch requests, per crypto engine.
A new callback is added, do_batch_requests, which executes a
batch of requests. This has the crypto_engine structure as argument
(for cases when more than one crypto-engine is used).
The crypto_engine_alloc_init_and_set function, initializes
crypto-engine, but also, sets the do_batch_requests callback.
On crypto_pump_requests, if do_batch_requests callback is
implemented in a driver, this will be executed. The link between
the requests will be done in driver, if possible.
do_batch_requests is available only if the hardware has support
for multiple request.

Signed-off-by: Iuliana Prodan <[email protected]>
Signed-off-by: Herbert Xu <[email protected]>
(cherry picked from commit 8d90822)
Signed-off-by: lei he <[email protected]>
Fix wrong test data at testmgr.h, it seems to be caused
by ignoring the last '\0' when calling sizeof.

Signed-off-by: Lei He <[email protected]>
Signed-off-by: Herbert Xu <[email protected]>
(cherry picked from commit 39ef085)
Signed-off-by: lei he <[email protected]>
According to the BER encoding rules, integer value should be encoded
as two's complement, and if the highest bit of a positive integer
is 1, should add a leading zero-octet.

The kernel's built-in RSA algorithm cannot recognize negative numbers
when parsing keys, so it can pass this test case.

Export the key to file and run the following command to verify the
fix result:

  openssl asn1parse -inform DER -in /path/to/key/file

Signed-off-by: Lei He <[email protected]>
Signed-off-by: Herbert Xu <[email protected]>
(cherry picked from commit a988701)
Signed-off-by: lei he <[email protected]>
Base on the lastest virtio crypto spec, define VIRTIO_CRYPTO_NOSPC.

Reviewed-by: Gonglei <[email protected]>
Signed-off-by: zhenwei pi <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Michael S. Tsirkin <[email protected]>
(cherry picked from commit 13d640a)
Signed-off-by: lei he <[email protected]>
Since crypto is a modern-only device,
tag config space fields as having little endian-ness.

Signed-off-by: Michael S. Tsirkin <[email protected]>
Reviewed-by: Cornelia Huck <[email protected]>
(cherry picked from commit 24bcf35)
Signed-off-by: lei he <[email protected]>
Introduce asymmetric service definition, asymmetric operations and
several well known algorithms.

Co-developed-by: lei he <[email protected]>
Signed-off-by: lei he <[email protected]>
Signed-off-by: zhenwei pi <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Michael S. Tsirkin <[email protected]>
Reviewed-by: Gonglei <[email protected]>
(cherry picked from commit 24e1959)
Signed-off-by: lei he <[email protected]>
The functions get_online_cpus() and put_online_cpus() have been
deprecated during the CPU hotplug rework. They map directly to
cpus_read_lock() and cpus_read_unlock().

Replace deprecated CPU-hotplug functions with the official version.
The behavior remains unchanged.

Cc: Gonglei <[email protected]>
Cc: "Michael S. Tsirkin" <[email protected]>
Cc: Jason Wang <[email protected]>
Cc: Herbert Xu <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
Signed-off-by: Herbert Xu <[email protected]>
(cherry picked from commit d01a9f7)
Signed-off-by: lei he <[email protected]>
Support rsa & pkcs1pad(rsa,sha1) with priority 150.

Test with QEMU built-in backend, it works fine.
1, The self-test framework of crypto layer works fine in guest kernel
2, Test with Linux guest(with asym support), the following script
test(note that pkey_XXX is supported only in a newer version of keyutils):
  - both public key & private key
  - create/close session
  - encrypt/decrypt/sign/verify basic driver operation
  - also test with kernel crypto layer(pkey add/query)

All the cases work fine.

rm -rf *.der *.pem *.pfx
modprobe pkcs8_key_parser # if CONFIG_PKCS8_PRIVATE_KEY_PARSER=m
rm -rf /tmp/data
dd if=/dev/random of=/tmp/data count=1 bs=226

openssl req -nodes -x509 -newkey rsa:2048 -keyout key.pem -out cert.pem -subj "/C=CN/ST=BJ/L=HD/O=qemu/OU=dev/CN=qemu/[email protected]"
openssl pkcs8 -in key.pem -topk8 -nocrypt -outform DER -out key.der
openssl x509 -in cert.pem -inform PEM -outform DER -out cert.der

PRIV_KEY_ID=`cat key.der | keyctl padd asymmetric test_priv_key @s`
echo "priv key id = "$PRIV_KEY_ID
PUB_KEY_ID=`cat cert.der | keyctl padd asymmetric test_pub_key @s`
echo "pub key id = "$PUB_KEY_ID

keyctl pkey_query $PRIV_KEY_ID 0
keyctl pkey_query $PUB_KEY_ID 0

echo "Enc with priv key..."
keyctl pkey_encrypt $PRIV_KEY_ID 0 /tmp/data enc=pkcs1 >/tmp/enc.priv
echo "Dec with pub key..."
keyctl pkey_decrypt $PRIV_KEY_ID 0 /tmp/enc.priv enc=pkcs1 >/tmp/dec
cmp /tmp/data /tmp/dec

echo "Sign with priv key..."
keyctl pkey_sign $PRIV_KEY_ID 0 /tmp/data enc=pkcs1 hash=sha1 > /tmp/sig
echo "Verify with pub key..."
keyctl pkey_verify $PRIV_KEY_ID 0 /tmp/data /tmp/sig enc=pkcs1 hash=sha1

echo "Enc with pub key..."
keyctl pkey_encrypt $PUB_KEY_ID 0 /tmp/data enc=pkcs1 >/tmp/enc.pub
echo "Dec with priv key..."
keyctl pkey_decrypt $PRIV_KEY_ID 0 /tmp/enc.pub enc=pkcs1 >/tmp/dec
cmp /tmp/data /tmp/dec

echo "Verify with pub key..."
keyctl pkey_verify $PUB_KEY_ID 0 /tmp/data /tmp/sig enc=pkcs1 hash=sha1

Conflicts & Solve:
- drivers/crypto/virtio/Kconfig
    Except that 'select CRYPTO_SKCIPHER' is removed, all changes from HEAD
    and cherry-picked commit are both kept.
- drivers/crypto/virito/virtio_crypto_core.c
    There is an upstream commit that changed macro 'virito_cread' to
    'virito_cread_le'. Since this commit introduced a lot of
    changes (and conflicts), here just modified the cherry-picked
    commit, let it keep use 'virito_cread'.
- drivers/crypto/virtio/virtio_crypto_akcipher_algs.c
    Replace kfree_sensitive with kzfree, see:
    https://patchwork.kernel.org/project/target-devel/patch/[email protected]/

[1 compiling warning during development]
Reported-by: kernel test robot <[email protected]>

Co-developed-by: lei he <[email protected]>
Signed-off-by: lei he <[email protected]>
Signed-off-by: zhenwei pi <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Reviewed-by: Gonglei <[email protected]>
Signed-off-by: Nathan Chancellor <[email protected]> #Kconfig tweaks
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Michael S. Tsirkin <[email protected]>
(cherry picked from commit 59ca6c9)
Signed-off-by: lei he <[email protected]>
Suggested by Gonglei, rename virtio_crypto_algs.c to
virtio_crypto_skcipher_algs.c. Also minor changes for function name.
Thus the function of source files get clear: skcipher services in
virtio_crypto_skcipher_algs.c and akcipher services in
virtio_crypto_akcipher_algs.c.

Signed-off-by: zhenwei pi <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Michael S. Tsirkin <[email protected]>
Reviewed-by: Gonglei <[email protected]>
(cherry picked from commit ea993de)
Signed-off-by: lei he <[email protected]>
Use temporary variable to make code easy to read and maintain.
	/* Pad cipher's parameters */
        vcrypto->ctrl.u.sym_create_session.op_type =
                cpu_to_le32(VIRTIO_CRYPTO_SYM_OP_CIPHER);
        vcrypto->ctrl.u.sym_create_session.u.cipher.para.algo =
                vcrypto->ctrl.header.algo;
        vcrypto->ctrl.u.sym_create_session.u.cipher.para.keylen =
                cpu_to_le32(keylen);
        vcrypto->ctrl.u.sym_create_session.u.cipher.para.op =
                cpu_to_le32(op);
-->
	sym_create_session = &ctrl->u.sym_create_session;
	sym_create_session->op_type = cpu_to_le32(VIRTIO_CRYPTO_SYM_OP_CIPHER);
	sym_create_session->u.cipher.para.algo = ctrl->header.algo;
	sym_create_session->u.cipher.para.keylen = cpu_to_le32(keylen);
	sym_create_session->u.cipher.para.op = cpu_to_le32(op);

The new style shows more obviously:
- the variable we want to operate.
- an assignment statement in a single line.

Conflicts & Solve:
- drivers/crypto/virito/virtio_crypto_skcipher_algs.c
    Use 'kzfree' instead of 'kfree_sensitive' because currently
    kfree_sensitive is not imported, see:
    'https://patchwork.kernel.org/project/target-devel/patch/[email protected]/'

Cc: Michael S. Tsirkin <[email protected]>
Cc: Jason Wang <[email protected]>
Cc: Gonglei <[email protected]>
Reviewed-by: Gonglei <[email protected]>
Signed-off-by: zhenwei pi <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>
(cherry picked from commit 6fd763d)
Signed-off-by: lei he <[email protected]>
Originally, all of the control requests share a single buffer(
ctrl & input & ctrl_status fields in struct virtio_crypto), this
allows queue depth 1 only, the performance of control queue gets
limited by this design.

In this patch, each request allocates request buffer dynamically, and
free buffer after request, so the scope protected by ctrl_lock also
get optimized here.
It's possible to optimize control queue depth in the next step.

A necessary comment is already in code, still describe it again:
/*
 * Note: there are padding fields in request, clear them to zero before
 * sending to host to avoid to divulge any information.
 * Ex, virtio_crypto_ctrl_request::ctrl::u::destroy_session::padding[48]
 */
So use kzalloc to allocate buffer of struct virtio_crypto_ctrl_request.

Conflicts & Solve:
- drivers/crypto/virtio/virtio_crypto_skcipher_algs.c
    1. Use changes from cherry-picked commit
    2. replace the kfree_sensitive with kzfree

Potentially dereferencing uninitialized variables:
Reported-by: kernel test robot <[email protected]>
Reported-by: Dan Carpenter <[email protected]>

Cc: Michael S. Tsirkin <[email protected]>
Cc: Jason Wang <[email protected]>
Cc: Gonglei <[email protected]>
Reviewed-by: Gonglei <[email protected]>
Signed-off-by: zhenwei pi <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>
(cherry picked from commit 0756ad1)
Signed-off-by: lei he <[email protected]>
Originally, after submitting request into virtio crypto control
queue, the guest side polls the result from the virt queue. This
works like following:
    CPU0   CPU1               ...             CPUx  CPUy
     |      |                                  |     |
     \      \                                  /     /
      \--------spin_lock(&vcrypto->ctrl_lock)-------/
                           |
                 virtqueue add & kick
                           |
                  busy poll virtqueue
                           |
              spin_unlock(&vcrypto->ctrl_lock)
                          ...

There are two problems:
1, The queue depth is always 1, the performance of a virtio crypto
   device gets limited. Multi user processes share a single control
   queue, and hit spin lock race from control queue. Test on Intel
   Platinum 8260, a single worker gets ~35K/s create/close session
   operations, and 8 workers get ~40K/s operations with 800% CPU
   utilization.
2, The control request is supposed to get handled immediately, but
   in the current implementation of QEMU(v6.2), the vCPU thread kicks
   another thread to do this work, the latency also gets unstable.
   Tracking latency of virtio_crypto_alg_akcipher_close_session in 5s:
        usecs               : count     distribution
         0 -> 1          : 0        |                        |
         2 -> 3          : 7        |                        |
         4 -> 7          : 72       |                        |
         8 -> 15         : 186485   |************************|
        16 -> 31         : 687      |                        |
        32 -> 63         : 5        |                        |
        64 -> 127        : 3        |                        |
       128 -> 255        : 1        |                        |
       256 -> 511        : 0        |                        |
       512 -> 1023       : 0        |                        |
      1024 -> 2047       : 0        |                        |
      2048 -> 4095       : 0        |                        |
      4096 -> 8191       : 0        |                        |
      8192 -> 16383      : 2        |                        |
This means that a CPU may hold vcrypto->ctrl_lock as long as 8192~16383us.

To improve the performance of control queue, a request on control queue
waits completion instead of busy polling to reduce lock racing, and gets
completed by control queue callback.
    CPU0   CPU1               ...             CPUx  CPUy
     |      |                                  |     |
     \      \                                  /     /
      \--------spin_lock(&vcrypto->ctrl_lock)-------/
                           |
                 virtqueue add & kick
                           |
      ---------spin_unlock(&vcrypto->ctrl_lock)------
     /      /                                  \     \
     |      |                                  |     |
    wait   wait                               wait  wait

Test this patch, the guest side get ~200K/s operations with 300% CPU
utilization.

Cc: Michael S. Tsirkin <[email protected]>
Cc: Jason Wang <[email protected]>
Cc: Gonglei <[email protected]>
Reviewed-by: Gonglei <[email protected]>
Signed-off-by: zhenwei pi <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>
(cherry picked from commit 977231e)
Signed-off-by: lei he <[email protected]>
For some akcipher operations(eg, decryption of pkcs1pad(rsa)),
the length of returned result maybe less than akcipher_req->dst_len,
we need to recalculate the actual dst_len through the virt-queue
protocol.

Cc: Michael S. Tsirkin <[email protected]>
Cc: Jason Wang <[email protected]>
Cc: Gonglei <[email protected]>
Reviewed-by: Gonglei <[email protected]>
Signed-off-by: lei he <[email protected]>
Signed-off-by: zhenwei pi <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>
(cherry picked from commit a36bd0a)
Signed-off-by: lei he <[email protected]>
Enable retry for virtio-crypto-dev, so that crypto-engine
can process cipher-requests parallelly.

Cc: Michael S. Tsirkin <[email protected]>
Cc: Jason Wang <[email protected]>
Cc: Gonglei <[email protected]>
Reviewed-by: Gonglei <[email protected]>
Signed-off-by: lei he <[email protected]>
Signed-off-by: zhenwei pi <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>
(cherry picked from commit 4e0d352)
Signed-off-by: lei he <[email protected]>
Now, in crypto-engine, if hardware queue is full (-ENOSPC),
requeue request regardless of MAY_BACKLOG flag.
If hardware throws any other error code (like -EIO, -EINVAL,
-ENOMEM, etc.) only MAY_BACKLOG requests are enqueued back into
crypto-engine's queue, since the others can be dropped.
The latter case can be fatal error, so those cannot be recovered from.
For example, in CAAM driver, -EIO is returned in case the job descriptor
is broken, so there is no possibility to fix the job descriptor.
Therefore, these errors might be fatal error, so we shouldn’t
requeue the request. This will just be pass back and forth between
crypto-engine and hardware.

Fixes: 6a89f49 ("crypto: engine - support for parallel requests based on retry mechanism")
Signed-off-by: Iuliana Prodan <[email protected]>
Reported-by: Horia Geantă <[email protected]>
Reviewed-by: Horia Geantă <[email protected]>
Signed-off-by: Herbert Xu <[email protected]>
(cherry picked from commit d1c72f6)
Signed-off-by: lei he <[email protected]>
According to PKCS#1 standard, the 'otherPrimeInfos' field contains
the information for the additional primes r_3, ..., r_u, in order.
It shall be omitted if the version is 0 and shall contain at least
one instance of OtherPrimeInfo if the version is 1, see:
https://www.rfc-editor.org/rfc/rfc3447#page-44
Replace the version number '1' with 0, otherwise, some drivers may
not pass the run-time tests.

Signed-off-by: lei he <[email protected]>
(cherry picked from commit 0bb8f12)
Signed-off-by: lei he <[email protected]>
…stat of cgroup v2

commit 673520f upstream.

There are already statistics of {pgscan,pgsteal}_kswapd and
{pgscan,pgsteal}_direct of memcg event here, but now only the sum of the
two is displayed in memory.stat of cgroup v2.

In order to obtain more accurate information during monitoring and
debugging, and to align with the display in /proc/vmstat, it better to
display {pgscan,pgsteal}_kswapd and {pgscan,pgsteal}_direct separately.

Also, for forward compatibility, we still display pgscan and pgsteal items
so that it won't break existing applications.

[[email protected]: add comment for memcg_vm_event_stat (suggested by Michal)]
  Link: https://lkml.kernel.org/r/[email protected]
[[email protected]: fix the doc, thanks to Johannes]
  Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Qi Zheng <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Acked-by: Roman Gushchin <[email protected]>
Acked-by: Muchun Song <[email protected]>
Acked-by: Shakeel Butt <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
commit bcb1704 upstream.

Two new statistics are introduced to show the internal of burst feature
and explain why burst helps or not.

nr_bursts:  number of periods bandwidth burst occurs
burst_time: cumulative wall-time (in nanoseconds) that any cpus has
	    used above quota in respective periods

Co-developed-by: Shanpei Chen <[email protected]>
Signed-off-by: Shanpei Chen <[email protected]>
Co-developed-by: Tianchen Ding <[email protected]>
Signed-off-by: Tianchen Ding <[email protected]>
Signed-off-by: Huaixin Chang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Daniel Jordan <[email protected]>
Acked-by: Tejun Heo <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Hao Jia <[email protected]>
commit 7ebc7dc upstream.

Since commit d6a3b24 these three lines are merged into one by the
RST processor, making it hard to read. Use bullet points to separate
the entries, like it's done in other similar places.

Signed-off-by: Kir Kolyshkin <[email protected]>
Acked-by: Tejun Heo <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jonathan Corbet <[email protected]>
Signed-off-by: Hao Jia <[email protected]>
commit d73df88 upstream.

Basic description of usage and effect for CFS Bandwidth Control Burst.

Co-developed-by: Shanpei Chen <[email protected]>
Signed-off-by: Shanpei Chen <[email protected]>
Co-developed-by: Tianchen Ding <[email protected]>
Signed-off-by: Tianchen Ding <[email protected]>
Signed-off-by: Huaixin Chang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Daniel Jordan <[email protected]>
Acked-by: Tejun Heo <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Hao Jia <[email protected]>
commit a6d210d upstream.

Using bitops makes bit masks and shifts easier to read.

Tested-by: Brad Campbell <[email protected]>
Tested-by: Bernhard Gebetsberger <[email protected]>
Tested-by: Holger Kiehl <[email protected]>
Tested-by: Michael Larabel <[email protected]>
Tested-by: Jonathan McDowell <[email protected]>
Tested-by: Ken Moffat <[email protected]>
Tested-by: Darren Salt <[email protected]>
Signed-off-by: Guenter Roeck <[email protected]>
Signed-off-by: Xie Haocheng <[email protected]>
 commit d547552 upstream.

Convert driver to use devm_hwmon_device_register_with_info to simplify
the code and to reduce its size.

Old size (x86_64):
   text	   data	    bss	    dec	    hex	filename
   8247	   4488	     64	  12799	   31ff	drivers/hwmon/k10temp.o
New size:
   text	   data	    bss	    dec	    hex	filename
   6778	   2792	     64	   9634	   25a2	drivers/hwmon/k10temp.o

Tested-by: Brad Campbell <[email protected]>
Tested-by: Bernhard Gebetsberger <[email protected]>
Tested-by: Holger Kiehl <[email protected]>
Tested-by: Michael Larabel <[email protected]>
Tested-by: Jonathan McDowell <[email protected]>
Tested-by: Ken Moffat <[email protected]>
Tested-by: Darren Salt <[email protected]>
Signed-off-by: Guenter Roeck <[email protected]>
Signed-off-by: Xie Haocheng <[email protected]>
commit c757938 upstream.

Zen2 reports reporting temperatures per CPU die (called Core Complex Dies,
or CCD, by AMD). Add support for it to the k10temp driver.

Tested-by: Brad Campbell <[email protected]>
Tested-by: Bernhard Gebetsberger <[email protected]>
Tested-by: Holger Kiehl <[email protected]>
Tested-by: Michael Larabel <[email protected]>
Tested-by: Jonathan McDowell <[email protected]>
Tested-by: Ken Moffat <[email protected]>
Tested-by: Darren Salt <[email protected]>
Signed-off-by: Guenter Roeck <[email protected]>
Signed-off-by: Xie Haocheng <[email protected]>
commit b00647c upstream.

Ryzen CPUs report core and SoC voltages and currents. Add support
for it to the k10temp driver.

For the time being, only report voltages and currents for Ryzen
CPUs. Threadripper and EPYC appear to use a different mechanism.

Tested-by: Brad Campbell <[email protected]>
Tested-by: Bernhard Gebetsberger <[email protected]>
Tested-by: Holger Kiehl <[email protected]>
Tested-by: Michael Larabel <[email protected]>
Tested-by: Jonathan McDowell <[email protected]>
Tested-by: Ken Moffat <[email protected]>
Tested-by: Darren Salt <[email protected]>
Signed-off-by: Guenter Roeck <[email protected]>
Signed-off-by: Xie Haocheng <[email protected]>
commit 70831c8 upstream.

The maximum Tdie or Tctl is not published for Ryzen CPUs. What is
known, however, is that the traditional value of 70 degrees C is no
longer correct. On top of that, the limit applies to Tctl, not to Tdie.
Displaying it in either context is meaningless, confusing, and wrong.
Stop doing it.

Tested-by: Brad Campbell <[email protected]>
Tested-by: Holger Kiehl <[email protected]>
Tested-by: Michael Larabel <[email protected]>
Tested-by: Jonathan McDowell <[email protected]>
Tested-by: Ken Moffat <[email protected]>
Signed-off-by: Guenter Roeck <[email protected]>
Signed-off-by: Xie Haocheng <[email protected]>
commit 9c4a38f upstream.

Show thermal and SVI registers for Family 17h CPUs.

Tested-by: Sebastian Reichel <[email protected]>
Signed-off-by: Guenter Roeck <[email protected]>
Signed-off-by: Xie Haocheng <[email protected]>
commit fd8bdb2 upstream.

In HWiNFO, we see support for Tccd1, Tccd3, Tccd5, and Tccd7 temperature
sensors on Zen2 based Threadripper CPUs. Checking register maps on
Threadripper 3970X confirms SMN register addresses and values for those
sensors.

Register values observed in an idle system:

0x059950: 00000000 00000abc 00000000 00000ad8
0x059960: 00000000 00000ade 00000000 00000ae4

Under load:

0x059950: 00000000 00000c02 00000000 00000c14
0x059960: 00000000 00000c30 00000000 00000c22

More analysis shows that EPYC CPUs support up to 8 CCD temperature
sensors. EPYC 7601 supports three CCD temperature sensors. Unlike
Zen2 CPUs, the register space in Zen1 CPUs supports a maximum of four
sensors, so only search for a maximum of four sensors on Zen1 CPUs.

On top of that, in thm_10_0_sh_mask.h in the Linux kernel, we find
definitions for THM_DIE{1-3}_TEMP__VALID_MASK, set to 0x00000800, as well
as matching SMN addresses. This lets us conclude that bit 11 of the
respective registers is a valid bit. With this assumption, the temperature
offset is now 49 degrees C. This conveniently matches the documented
temperature offset for Tdie, again suggesting that above registers indeed
report temperatures sensor values. Assume that bit 11 is indeed a valid
bit, and add support for the additional sensors.

With this patch applied, output from 3970X (idle) looks as follows:

k10temp-pci-00c3
Adapter: PCI adapter
Tdie:         +55.9°C
Tctl:         +55.9°C
Tccd1:        +39.8°C
Tccd3:        +43.8°C
Tccd5:        +43.8°C
Tccd7:        +44.8°C

Tested-by: Michael Larabel <[email protected]>
Signed-off-by: Guenter Roeck <[email protected]>
Signed-off-by: Xie Haocheng <[email protected]>
commit b02c685 upstream.

Traditionally, the temperature displayed by k10temp was Tctl.
On Family 17h CPUs, Tdie was displayed instead. To reduce confusion,
Tctl was added later as second temperature. This resulted in Tdie
being reported as temp1_input, and Tctl as temp2_input. This is
different to non-Ryzen CPUs, where Tctl is displayed as temp1_input.
Swap temp1_input and temp2_input on Family 17h CPUs, such that Tctl
is now reported as temp1_input and Tdie is reported as temp2_input,
to align with other CPUs, streamline the code, and make it less
confusing. Coincidentally, this also aligns the code with its
documentation, which states that Tdie is reported as temp2_input.

Signed-off-by: Guenter Roeck <[email protected]>
Signed-off-by: Xie Haocheng <[email protected]>
commit 6046524 upstream.

Use a bit map to describe if temperature channels are supported,
and use it for all temperature channels. Use a separate flag,
independent of Tdie support, to indicate if the system is running
on a Ryzen CPU.

Signed-off-by: Guenter Roeck <[email protected]>
Signed-off-by: Xie Haocheng <[email protected]>
commit 0e786f3 upstream.

Fix the following sparse warning:

drivers/hwmon/k10temp.c:189:12: warning: symbol 'k10temp_temp_label' was
not declared. Should it be static?
drivers/hwmon/k10temp.c:202:12: warning: symbol 'k10temp_in_label' was
not declared. Should it be static?
drivers/hwmon/k10temp.c:207:12: warning: symbol 'k10temp_curr_label' was
not declared. Should it be static?

Reported-by: Hulk Robot <[email protected]>
Signed-off-by: Jason Yan <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Guenter Roeck <[email protected]>
Signed-off-by: Xie Haocheng <[email protected]>
Dawei Li and others added 23 commits October 8, 2023 20:29
commit ce4b815 upstream.

s_inodes is superblock-specific resource, which should be
protected by sb's specific lock s_inode_list_lock.

Link: https://lore.kernel.org/r/TYCP286MB23238380DE3B74874E8D78ABCA299@TYCP286MB2323.JPNP286.PROD.OUTLOOK.COM
Fixes: 7d41963 ("erofs: Support sharing cookies in the same domain")
Reviewed-by: Yue Hu <[email protected]>
Reviewed-by: Jia Zhu <[email protected]>
Reviewed-by: Jingbo Xu <[email protected]>
Signed-off-by: Dawei Li <[email protected]>
Signed-off-by: Gao Xiang <[email protected]>
Signed-off-by: Jia Zhu <[email protected]>
commit 39bfcb8 upstream.

When erofs instance is remounted with fsid or domain_id mount option
specified, the original fsid and domain_id string pointer in sbi->opt
is directly overridden with the fsid and domain_id string in the new
fs_context, without freeing the original fsid and domain_id string.
What's worse, when the new fsid and domain_id string is transferred to
sbi, they are not reset to NULL in fs_context, and thus they are freed
when remount finishes, while sbi is still referring to these strings.

Reconfiguration for fsid and domain_id seems unusual. Thus clarify this
restriction explicitly and dump a warning when users are attempting to
do this.

Besides, to fix the use-after-free issue, move fsid and domain_id from
erofs_mount_opts to outside.

Fixes: c6be2bd ("erofs: register fscache volume")
Fixes: 8b7adf1 ("erofs: introduce fscache-based domain")
Signed-off-by: Jingbo Xu <[email protected]>
Reviewed-by: Jia Zhu <[email protected]>
Reviewed-by: Chao Yu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Gao Xiang <[email protected]>
Signed-off-by: Jia Zhu <[email protected]>
commit 37020bb upstream.

The xarray iteration only holds the RCU read lock and thus may encounter
XA_RETRY_ENTRY if there's process modifying the xarray concurrently.
This will cause oops when referring to the invalid entry.

Fix this by adding the missing xas_retry(), which will make the
iteration wind back to the root node if XA_RETRY_ENTRY is encountered.

Fixes: d435d53 ("erofs: change to use asynchronous io for fscache readpage/readahead")
Suggested-by: David Howells <[email protected]>
Reviewed-by: Gao Xiang <[email protected]>
Reviewed-by: Jia Zhu <[email protected]>
Signed-off-by: Jingbo Xu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Gao Xiang <[email protected]>
Signed-off-by: Jia Zhu <[email protected]>
commit 27f2a2d upstream.

When shared domain is enabled, doing mount twice with the same fsid and
domain_id will trigger sysfs warning as shown below:

 sysfs: cannot create duplicate filename '/fs/erofs/d0,meta.bin'
 CPU: 15 PID: 1051 Comm: mount Not tainted 6.1.0-rc6+ openvelinux#1
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
 Call Trace:
  <TASK>
  dump_stack_lvl+0x38/0x49
  dump_stack+0x10/0x12
  sysfs_warn_dup.cold+0x17/0x27
  sysfs_create_dir_ns+0xb8/0xd0
  kobject_add_internal+0xb1/0x240
  kobject_init_and_add+0x71/0xa0
  erofs_register_sysfs+0x89/0x110
  erofs_fc_fill_super+0x98c/0xaf0
  vfs_get_super+0x7d/0x100
  get_tree_nodev+0x16/0x20
  erofs_fc_get_tree+0x20/0x30
  vfs_get_tree+0x24/0xb0
  path_mount+0x2fa/0xa90
  do_mount+0x7c/0xa0
  __x64_sys_mount+0x8b/0xe0
  do_syscall_64+0x30/0x60
  entry_SYSCALL_64_after_hwframe+0x46/0xb0

The reason is erofs_fscache_register_cookie() doesn't guarantee the primary
data blob (aka fsid) is unique in the shared domain and
erofs_register_sysfs() invoked by the second mount will fail due to the
duplicated fsid in the shared domain and report warning.

It would be better to check the uniqueness of fsid before doing
erofs_register_sysfs(), so adding a new flags parameter for
erofs_fscache_register_cookie() and doing the uniqueness check if
EROFS_REG_COOKIE_NEED_NOEXIST is enabled.

After the patch, the error in dmesg for the duplicated mount would be:

 erofs: ...: erofs_domain_register_cookie: XX already exists in domain YY

Reviewed-by: Jia Zhu <[email protected]>
Reviewed-by: Jingbo Xu <[email protected]>
Reviewed-by: Chao Yu <[email protected]>
Signed-off-by: Hou Tao <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Fixes: 7d41963 ("erofs: Support sharing cookies in the same domain")
Signed-off-by: Gao Xiang <[email protected]>
Signed-off-by: Jia Zhu <[email protected]>
commit 0130e4e upstream.

erofs over fscache doesn't support the compressed layout yet. It will
cause NULL crash if there are compressed inodes contained when working
in fscache mode.

So far in the erofs based container image distribution scenarios
(RAFS v6), the compressed RAFS v6 images are downloaded and then
decompressed on demand as an uncompressed erofs image. Then the erofs
image is mounted in fscache mode for containers to use. IOWs, currently
compressed data is decompressed on the userspace side instead and
uncompressed erofs images will be finally cached.

The fscache support for the compressed layout is still under
development and it will be used for runtime decompression feature.
Anyway, to avoid the potential crash, let's leave the compressed inodes
unsupported in fscache mode until we support it later.

Fixes: 1442b02 ("erofs: implement fscache-based data read for non-inline layout")
Signed-off-by: Jeffle Xu <[email protected]>
Reviewed-by: Gao Xiang <[email protected]>
Reviewed-by: Chao Yu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Gao Xiang <[email protected]>
Signed-off-by: Jia Zhu <[email protected]>
commit e6d9f9b upstream.

For unmapped range, the returned map.m_llen is zero, and thus the
calculated count is unexpected zero.

Prior to the refactoring introduced by commit 1ae9470 ("erofs:
clean up .read_folio() and .readahead() in fscache mode"), only the
readahead routine suffers from this. With the refactoring of making
.read_folio() and .readahead() calling one common routine, both
read_folio and readahead have this issue now.

Fix this by calculating count separately in unmapped condition.

Fixes: c665b39 ("erofs: implement fscache-based data readahead")
Fixes: 1ae9470 ("erofs: clean up .read_folio() and .readahead() in fscache mode")
Signed-off-by: Jingbo Xu <[email protected]>
Reviewed-by: Gao Xiang <[email protected]>
Reviewed-by: Chao Yu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Gao Xiang <[email protected]>
Signed-off-by: Jia Zhu <[email protected]>
…_lock

commit 598ad0b upstream.

Taking sb_writers whilst holding mmap_lock isn't allowed and will result in
a lockdep warning like that below.  The problem comes from cachefiles
needing to take the sb_writers lock in order to do a write to the cache,
but being asked to do this by netfslib called from readpage, readahead or
write_begin[1].

Fix this by always offloading the write to the cache off to a worker
thread.  The main thread doesn't need to wait for it, so deadlock can be
avoided.

This can be tested by running the quick xfstests on something like afs or
ceph with lockdep enabled.

WARNING: possible circular locking dependency detected
5.15.0-rc1-build2+ #292 Not tainted
------------------------------------------------------
holetest/65517 is trying to acquire lock:
ffff88810c81d730 (mapping.invalidate_lock#3){.+.+}-{3:3}, at: filemap_fault+0x276/0x7a5

but task is already holding lock:
ffff8881595b53e8 (&mm->mmap_lock#2){++++}-{3:3}, at: do_user_addr_fault+0x28d/0x59c

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> openvelinux#2 (&mm->mmap_lock#2){++++}-{3:3}:
       validate_chain+0x3c4/0x4a8
       __lock_acquire+0x89d/0x949
       lock_acquire+0x2dc/0x34b
       __might_fault+0x87/0xb1
       strncpy_from_user+0x25/0x18c
       removexattr+0x7c/0xe5
       __do_sys_fremovexattr+0x73/0x96
       do_syscall_64+0x67/0x7a
       entry_SYSCALL_64_after_hwframe+0x44/0xae

-> openvelinux#1 (sb_writers#10){.+.+}-{0:0}:
       validate_chain+0x3c4/0x4a8
       __lock_acquire+0x89d/0x949
       lock_acquire+0x2dc/0x34b
       cachefiles_write+0x2b3/0x4bb
       netfs_rreq_do_write_to_cache+0x3b5/0x432
       netfs_readpage+0x2de/0x39d
       filemap_read_page+0x51/0x94
       filemap_get_pages+0x26f/0x413
       filemap_read+0x182/0x427
       new_sync_read+0xf0/0x161
       vfs_read+0x118/0x16e
       ksys_read+0xb8/0x12e
       do_syscall_64+0x67/0x7a
       entry_SYSCALL_64_after_hwframe+0x44/0xae

-> #0 (mapping.invalidate_lock#3){.+.+}-{3:3}:
       check_noncircular+0xe4/0x129
       check_prev_add+0x16b/0x3a4
       validate_chain+0x3c4/0x4a8
       __lock_acquire+0x89d/0x949
       lock_acquire+0x2dc/0x34b
       down_read+0x40/0x4a
       filemap_fault+0x276/0x7a5
       __do_fault+0x96/0xbf
       do_fault+0x262/0x35a
       __handle_mm_fault+0x171/0x1b5
       handle_mm_fault+0x12a/0x233
       do_user_addr_fault+0x3d2/0x59c
       exc_page_fault+0x85/0xa5
       asm_exc_page_fault+0x1e/0x30

other info that might help us debug this:

Chain exists of:
  mapping.invalidate_lock#3 --> sb_writers#10 --> &mm->mmap_lock#2

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&mm->mmap_lock#2);
                               lock(sb_writers#10);
                               lock(&mm->mmap_lock#2);
  lock(mapping.invalidate_lock#3);

 *** DEADLOCK ***

1 lock held by holetest/65517:
 #0: ffff8881595b53e8 (&mm->mmap_lock#2){++++}-{3:3}, at: do_user_addr_fault+0x28d/0x59c

stack backtrace:
CPU: 0 PID: 65517 Comm: holetest Not tainted 5.15.0-rc1-build2+ #292
Hardware name: ASUS All Series/H97-PLUS, BIOS 2306 10/09/2014
Call Trace:
 dump_stack_lvl+0x45/0x59
 check_noncircular+0xe4/0x129
 ? print_circular_bug+0x207/0x207
 ? validate_chain+0x461/0x4a8
 ? add_chain_block+0x88/0xd9
 ? hlist_add_head_rcu+0x49/0x53
 check_prev_add+0x16b/0x3a4
 validate_chain+0x3c4/0x4a8
 ? check_prev_add+0x3a4/0x3a4
 ? mark_lock+0xa5/0x1c6
 __lock_acquire+0x89d/0x949
 lock_acquire+0x2dc/0x34b
 ? filemap_fault+0x276/0x7a5
 ? rcu_read_unlock+0x59/0x59
 ? add_to_page_cache_lru+0x13c/0x13c
 ? lock_is_held_type+0x7b/0xd3
 down_read+0x40/0x4a
 ? filemap_fault+0x276/0x7a5
 filemap_fault+0x276/0x7a5
 ? pagecache_get_page+0x2dd/0x2dd
 ? __lock_acquire+0x8bc/0x949
 ? pte_offset_kernel.isra.0+0x6d/0xc3
 __do_fault+0x96/0xbf
 ? do_fault+0x124/0x35a
 do_fault+0x262/0x35a
 ? handle_pte_fault+0x1c1/0x20d
 __handle_mm_fault+0x171/0x1b5
 ? handle_pte_fault+0x20d/0x20d
 ? __lock_release+0x151/0x254
 ? mark_held_locks+0x1f/0x78
 ? rcu_read_unlock+0x3a/0x59
 handle_mm_fault+0x12a/0x233
 do_user_addr_fault+0x3d2/0x59c
 ? pgtable_bad+0x70/0x70
 ? rcu_read_lock_bh_held+0xab/0xab
 exc_page_fault+0x85/0xa5
 ? asm_exc_page_fault+0x8/0x30
 asm_exc_page_fault+0x1e/0x30
RIP: 0033:0x40192f
Code: ff 48 89 c3 48 8b 05 50 28 00 00 48 85 ed 7e 23 31 d2 4b 8d 0c 2f eb 0a 0f 1f 00 48 8b 05 39 28 00 00 48 0f af c2 48 83 c2 01 <48> 89 1c 01 48 39 d5 7f e8 8b 0d f2 27 00 00 31 c0 85 c9 74 0e 8b
RSP: 002b:00007f9931867eb0 EFLAGS: 00010202
RAX: 0000000000000000 RBX: 00007f9931868700 RCX: 00007f993206ac00
RDX: 0000000000000001 RSI: 0000000000000000 RDI: 00007ffc13e06ee0
RBP: 0000000000000100 R08: 0000000000000000 R09: 00007f9931868700
R10: 00007f99318689d0 R11: 0000000000000202 R12: 00007ffc13e06ee0
R13: 0000000000000c00 R14: 00007ffc13e06e00 R15: 00007f993206a000

Fixes: 726218f ("netfs: Define an interface to talk to a cache")
Signed-off-by: David Howells <[email protected]>
Tested-by: Jeff Layton <[email protected]>
cc: Jan Kara <[email protected]>
cc: [email protected]
cc: [email protected]
Link: https://lore.kernel.org/r/[email protected]/ [1]
Link: https://lore.kernel.org/r/163887597541.1596626.2668163316598972956.stgit@warthog.procyon.org.uk/ # v1
Signed-off-by: Jia Zhu <[email protected]>
commit 1122f40 upstream.

For now, enqueuing and dequeuing on-demand requests all start from
idx 0, this makes request distribution unfair. In the weighty
concurrent I/O scenario, the request stored in higher idx will starve.

Searching requests cyclically in cachefiles_ondemand_daemon_read,
makes distribution fairer.

Fixes: c838305 ("cachefiles: notify the user daemon when looking up cookie")
Reported-by: Yongqing Li <[email protected]>
Signed-off-by: Xin Yin <[email protected]>
Signed-off-by: David Howells <[email protected]>
Reviewed-by: Jeffle Xu <[email protected]>
Reviewed-by: Gao Xiang <[email protected]>
Link: https://lore.kernel.org/r/[email protected]/ # v1
Link: https://lore.kernel.org/r/[email protected]/ # v2
Signed-off-by: Jia Zhu <[email protected]>
commit c93ccd6 upstream.

The cache_size field of copen is specified by the user daemon.
If cache_size < 0, then the OPEN request is expected to fail,
while copen itself shall succeed. However, returning 0 is indeed
unexpected when cache_size is an invalid error code.

Fix this by returning error when cache_size is an invalid error code.

Changes
=======
v4: update the code suggested by Dan
v3: update the commit log suggested by Jingbo.

Fixes: c838305 ("cachefiles: notify the user daemon when looking up cookie")
Signed-off-by: Sun Ke <[email protected]>
Suggested-by: Jeffle Xu <[email protected]>
Suggested-by: Dan Carpenter <[email protected]>
Signed-off-by: David Howells <[email protected]>
Reviewed-by: Gao Xiang <[email protected]>
Reviewed-by: Jingbo Xu <[email protected]>
Reviewed-by: Dan Carpenter <[email protected]>
Link: https://lore.kernel.org/r/[email protected]/ # v2
Link: https://lore.kernel.org/r/[email protected]/ # v3
Link: https://lore.kernel.org/r/[email protected]/ # v4
Signed-off-by: Jia Zhu <[email protected]>
... in prep for the following failover feature for the on-demand mode of
Cachefiles.

Signed-off-by: Jia Zhu <[email protected]>
The cachefiles framework use dio for local cache files filling
and reading, which may affects the performance for on-demand IO
path.

Change to use buffer IO for cache files filling, and first try
to find data in the pagecache during cache files reading. After
the pagecache for cache files is recycled, we will not suffer from
double caching issue.

Signed-off-by: Xin Yin <[email protected]>
Prior to fscache refactoring, the volume key in fscache_volume is a
string with trailing NUL; while after refactoring, the volume key in
fscache_cookie is actually a string without trailing NUL.

Thus the current volume key setup for cachefiles_open may cause oops
since it attempts to access volume key from the bad address.  This can
be reproduced by specifying "fsid" with 10 characters, e.g.
"-o fsid=abigdomain".

Fix this by determining if volume key is stored in volume->key or
volume->inline_key by checking volume->key_len, rather than
volume_key_size (which is actually volume->key_len plus 1).

Reported-by: Jia Zhu <[email protected]>
Fixes: c838305 ("cachefiles: notify the user daemon when looking up cookie")
Signed-off-by: Jingbo Xu <[email protected]>
Reviewed-by: Gao Xiang <[email protected]>
Acked-by: Joseph Qi <[email protected]>
Link: https://gitee.com/anolis/cloud-kernel/pulls/881
Signed-off-by: Jia Zhu <[email protected]>
Previously, @ondemand_id field was used not only to identify ondemand
state of the object, but also to represent the index of the xarray.
This commit introduces @State field to decouple the role of @ondemand_id
and adds helpers to access it.

Signed-off-by: Jia Zhu <[email protected]>
Reviewed-by: Xin Yin <[email protected]>
Reviewed-by: Jingbo Xu <[email protected]>
…ject

We'll introduce a @work_struct field for @object in subsequent patches,
it will enlarge the size of @object.
As the result of that, this commit extracts ondemand info field from
@object.

Signed-off-by: Jia Zhu <[email protected]>
Reviewed-by: Jingbo Xu <[email protected]>
…bject is closed

When an anonymous fd is closed by user daemon, if there is a new read
request for this file comes up, the anonymous fd should be re-opened
to handle that read request rather than fail it directly.

1. Introduce reopening state for objects that are closed but have
   inflight/subsequent read requests.
2. No longer flush READ requests but only CLOSE requests when anonymous
   fd is closed.
3. Enqueue the reopen work to workqueue, thus user daemon could get rid
   of daemon_read context and handle that request smoothly. Otherwise,
   the user daemon will send a reopen request and wait for itself to
   process the request.

Signed-off-by: Jia Zhu <[email protected]>
Reviewed-by: Xin Yin <[email protected]>
Reviewed-by: Jingbo Xu <[email protected]>
…in ondemand mode

Don't trigger EPOLLIN when there are only reopening read requests in
xarray.

Suggested-by: Xin Yin <[email protected]>
Signed-off-by: Jia Zhu <[email protected]>
Reviewed-by: Jingbo Xu <[email protected]>
…nd read requests

Previously, in ondemand read scenario, if the anonymous fd was closed by
user daemon, inflight and subsequent read requests would return EIO.
As long as the device connection is not released, user daemon can hold
and restore inflight requests by setting the request flag to
CACHEFILES_REQ_NEW.

Suggested-by: Gao Xiang <[email protected]>
Signed-off-by: Jia Zhu <[email protected]>
Signed-off-by: Xin Yin <[email protected]>
Reviewed-by: Jingbo Xu <[email protected]>
CONFIG_CACHEFILES_ONDEMAND=y
CONFIG_EROFS_FS=m
CONFIG_EROFS_FS_ONDEMAND=y

Signed-off-by: Jia Zhu <[email protected]>
Func kthread_create returns ERR_PTR(-ENOMEM) or ERR_PTR(-EINTR) when it
failed. Those values are not null.

Fixes: a7f33d7e ("nbd: Don't use workqueue to create recv threads")
Signed-off-by: Sheng Zhao <[email protected]>
There are some peripheral deivces of Intel and Qualcomm are working on
sanqingshan 2nd board, such as I2C bus/uart/usb etc. We need some
kernel modules to support them.

Signed-off-by: Duan Yayong <[email protected]>
ssif_info->client is NULL, so there is a NULL pointer dereference. Fix it.

Fixes: 4fff7b116321 ("bytedance: ipmi: ssif: add timeout for I2C client")
Signed-off-by: Muchun Song <[email protected]>
enable selinux and apparmor configs,
set kernel boot param selinux=0 and apparmor=0 by default.

Signed-off-by: huteng.ht <[email protected]>
PvsNarasimha pushed a commit to PvsNarasimha/kernel that referenced this pull request Dec 27, 2024
commit 99d4850 upstream

Found by leak sanitizer:
```
==1632594==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 21 byte(s) in 1 object(s) allocated from:
    #0 0x7f2953a7077b in __interceptor_strdup ../../../../src/libsanitizer/asan/asan_interceptors.cpp:439
    openvelinux#1 0x556701d6fbbf in perf_env__read_cpuid util/env.c:369
    openvelinux#2 0x556701d70589 in perf_env__cpuid util/env.c:465
    openvelinux#3 0x55670204bba2 in x86__is_amd_cpu arch/x86/util/env.c:14
    #4 0x5567020487a2 in arch__post_evsel_config arch/x86/util/evsel.c:83
    #5 0x556701d8f78b in evsel__config util/evsel.c:1366
    openvelinux#6 0x556701ef5872 in evlist__config util/record.c:108
    openvelinux#7 0x556701cd6bcd in test__PERF_RECORD tests/perf-record.c:112
    openvelinux#8 0x556701cacd07 in run_test tests/builtin-test.c:236
    openvelinux#9 0x556701cacfac in test_and_print tests/builtin-test.c:265
    openvelinux#10 0x556701cadddb in __cmd_test tests/builtin-test.c:402
    openvelinux#11 0x556701caf2aa in cmd_test tests/builtin-test.c:559
    openvelinux#12 0x556701d3b557 in run_builtin tools/perf/perf.c:323
    openvelinux#13 0x556701d3bac8 in handle_internal_command tools/perf/perf.c:377
    openvelinux#14 0x556701d3be90 in run_argv tools/perf/perf.c:421
    openvelinux#15 0x556701d3c3f8 in main tools/perf/perf.c:537
    openvelinux#16 0x7f2952a46189 in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58

SUMMARY: AddressSanitizer: 21 byte(s) leaked in 1 allocation(s).
```

Fixes: f7b58cb ("perf mem/c2c: Add load store event mappings for AMD")
Signed-off-by: Ian Rogers <[email protected]>
Acked-by: Ravi Bangoria <[email protected]>
Tested-by: Arnaldo Carvalho de Melo <[email protected]>
Cc: Adrian Hunter <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ravi Bangoria <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
guojinhui-liam pushed a commit that referenced this pull request Jan 6, 2025
…mory

commit a7800aa80ea4d5356b8474c2302812e9d4926fa6 upstream

Introduce an ioctl(), KVM_CREATE_GUEST_MEMFD, to allow creating file-based
memory that is tied to a specific KVM virtual machine and whose primary
purpose is to serve guest memory.

A guest-first memory subsystem allows for optimizations and enhancements
that are kludgy or outright infeasible to implement/support in a generic
memory subsystem.  With guest_memfd, guest protections and mapping sizes
are fully decoupled from host userspace mappings.   E.g. KVM currently
doesn't support mapping memory as writable in the guest without it also
being writable in host userspace, as KVM's ABI uses VMA protections to
define the allow guest protection.  Userspace can fudge this by
establishing two mappings, a writable mapping for the guest and readable
one for itself, but that’s suboptimal on multiple fronts.

Similarly, KVM currently requires the guest mapping size to be a strict
subset of the host userspace mapping size, e.g. KVM doesn’t support
creating a 1GiB guest mapping unless userspace also has a 1GiB guest
mapping.  Decoupling the mappings sizes would allow userspace to precisely
map only what is needed without impacting guest performance, e.g. to
harden against unintentional accesses to guest memory.

Decoupling guest and userspace mappings may also allow for a cleaner
alternative to high-granularity mappings for HugeTLB, which has reached a
bit of an impasse and is unlikely to ever be merged.

A guest-first memory subsystem also provides clearer line of sight to
things like a dedicated memory pool (for slice-of-hardware VMs) and
elimination of "struct page" (for offload setups where userspace _never_
needs to mmap() guest memory).

More immediately, being able to map memory into KVM guests without mapping
said memory into the host is critical for Confidential VMs (CoCo VMs), the
initial use case for guest_memfd.  While AMD's SEV and Intel's TDX prevent
untrusted software from reading guest private data by encrypting guest
memory with a key that isn't usable by the untrusted host, projects such
as Protected KVM (pKVM) provide confidentiality and integrity *without*
relying on memory encryption.  And with SEV-SNP and TDX, accessing guest
private memory can be fatal to the host, i.e. KVM must be prevent host
userspace from accessing guest memory irrespective of hardware behavior.

Attempt #1 to support CoCo VMs was to add a VMA flag to mark memory as
being mappable only by KVM (or a similarly enlightened kernel subsystem).
That approach was abandoned largely due to it needing to play games with
PROT_NONE to prevent userspace from accessing guest memory.

Attempt #2 to was to usurp PG_hwpoison to prevent the host from mapping
guest private memory into userspace, but that approach failed to meet
several requirements for software-based CoCo VMs, e.g. pKVM, as the kernel
wouldn't easily be able to enforce a 1:1 page:guest association, let alone
a 1:1 pfn:gfn mapping.  And using PG_hwpoison does not work for memory
that isn't backed by 'struct page', e.g. if devices gain support for
exposing encrypted memory regions to guests.

Attempt #3 was to extend the memfd() syscall and wrap shmem to provide
dedicated file-based guest memory.  That approach made it as far as v10
before feedback from Hugh Dickins and Christian Brauner (and others) led
to it demise.

Hugh's objection was that piggybacking shmem made no sense for KVM's use
case as KVM didn't actually *want* the features provided by shmem.  I.e.
KVM was using memfd() and shmem to avoid having to manage memory directly,
not because memfd() and shmem were the optimal solution, e.g. things like
read/write/mmap in shmem were dead weight.

Christian pointed out flaws with implementing a partial overlay (wrapping
only _some_ of shmem), e.g. poking at inode_operations or super_operations
would show shmem stuff, but address_space_operations and file_operations
would show KVM's overlay.  Paraphrashing heavily, Christian suggested KVM
stop being lazy and create a proper API.

Link: https://lore.kernel.org/all/[email protected]
Link: https://lore.kernel.org/all/[email protected]
Link: https://lore.kernel.org/all/[email protected]
Link: https://lore.kernel.org/all/[email protected]
Link: https://lore.kernel.org/all/[email protected]
Link: https://lore.kernel.org/all/[email protected]
Link: https://lore.kernel.org/all/20230418-anfallen-irdisch-6993a61be10b@brauner
Link: https://lore.kernel.org/all/[email protected]
Link: https://lore.kernel.org/linux-mm/20230306191944.GA15773@monkey
Link: https://lore.kernel.org/linux-mm/[email protected]
Cc: Fuad Tabba <[email protected]>
Cc: Vishal Annapurve <[email protected]>
Cc: Ackerley Tng <[email protected]>
Cc: Jarkko Sakkinen <[email protected]>
Cc: Maciej Szmigiero <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Quentin Perret <[email protected]>
Cc: Michael Roth <[email protected]>
Cc: Wang <[email protected]>
Cc: Liam Merwick <[email protected]>
Cc: Isaku Yamahata <[email protected]>
Co-developed-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
Co-developed-by: Yu Zhang <[email protected]>
Signed-off-by: Yu Zhang <[email protected]>
Co-developed-by: Chao Peng <[email protected]>
Signed-off-by: Chao Peng <[email protected]>
Co-developed-by: Ackerley Tng <[email protected]>
Signed-off-by: Ackerley Tng <[email protected]>
Co-developed-by: Isaku Yamahata <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
Co-developed-by: Paolo Bonzini <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Co-developed-by: Michael Roth <[email protected]>
Signed-off-by: Michael Roth <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <[email protected]>
Reviewed-by: Fuad Tabba <[email protected]>
Tested-by: Fuad Tabba <[email protected]>
Reviewed-by: Xiaoyao Li <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Signed-off-by: Rui Qi <[email protected]>

BP CAUTION:
1. Only changes to the kernel were merged!
2. Since we do not maintain the KVM module in the kernel tree, all fix patches
involving KVM will not be merged. Here, only a mark is made to allow the CI search fix job to pass

commit 2098acaf24455698c149b27f0347eb4ddc6d2058 upstream.
commit c31745d2c508796a0996c88bf2e55f552d513f65 upstream.
commit e563592224e02f87048edee3ce3f0da16cceee88 upstream.
PvsNarasimha pushed a commit to PvsNarasimha/kernel that referenced this pull request Jan 7, 2025
commit 99d4850 upstream

Found by leak sanitizer:
```
==1632594==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 21 byte(s) in 1 object(s) allocated from:
    #0 0x7f2953a7077b in __interceptor_strdup ../../../../src/libsanitizer/asan/asan_interceptors.cpp:439
    openvelinux#1 0x556701d6fbbf in perf_env__read_cpuid util/env.c:369
    openvelinux#2 0x556701d70589 in perf_env__cpuid util/env.c:465
    openvelinux#3 0x55670204bba2 in x86__is_amd_cpu arch/x86/util/env.c:14
    #4 0x5567020487a2 in arch__post_evsel_config arch/x86/util/evsel.c:83
    #5 0x556701d8f78b in evsel__config util/evsel.c:1366
    openvelinux#6 0x556701ef5872 in evlist__config util/record.c:108
    openvelinux#7 0x556701cd6bcd in test__PERF_RECORD tests/perf-record.c:112
    openvelinux#8 0x556701cacd07 in run_test tests/builtin-test.c:236
    openvelinux#9 0x556701cacfac in test_and_print tests/builtin-test.c:265
    openvelinux#10 0x556701cadddb in __cmd_test tests/builtin-test.c:402
    openvelinux#11 0x556701caf2aa in cmd_test tests/builtin-test.c:559
    openvelinux#12 0x556701d3b557 in run_builtin tools/perf/perf.c:323
    openvelinux#13 0x556701d3bac8 in handle_internal_command tools/perf/perf.c:377
    openvelinux#14 0x556701d3be90 in run_argv tools/perf/perf.c:421
    openvelinux#15 0x556701d3c3f8 in main tools/perf/perf.c:537
    openvelinux#16 0x7f2952a46189 in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58

SUMMARY: AddressSanitizer: 21 byte(s) leaked in 1 allocation(s).
```

Fixes: f7b58cb ("perf mem/c2c: Add load store event mappings for AMD")
Signed-off-by: Ian Rogers <[email protected]>
Acked-by: Ravi Bangoria <[email protected]>
Tested-by: Arnaldo Carvalho de Melo <[email protected]>
Cc: Adrian Hunter <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ravi Bangoria <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
PvsNarasimha pushed a commit to PvsNarasimha/kernel that referenced this pull request Jan 7, 2025
commit 99d4850 upstream

Found by leak sanitizer:
```
==1632594==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 21 byte(s) in 1 object(s) allocated from:
    #0 0x7f2953a7077b in __interceptor_strdup ../../../../src/libsanitizer/asan/asan_interceptors.cpp:439
    openvelinux#1 0x556701d6fbbf in perf_env__read_cpuid util/env.c:369
    openvelinux#2 0x556701d70589 in perf_env__cpuid util/env.c:465
    openvelinux#3 0x55670204bba2 in x86__is_amd_cpu arch/x86/util/env.c:14
    #4 0x5567020487a2 in arch__post_evsel_config arch/x86/util/evsel.c:83
    #5 0x556701d8f78b in evsel__config util/evsel.c:1366
    openvelinux#6 0x556701ef5872 in evlist__config util/record.c:108
    openvelinux#7 0x556701cd6bcd in test__PERF_RECORD tests/perf-record.c:112
    openvelinux#8 0x556701cacd07 in run_test tests/builtin-test.c:236
    openvelinux#9 0x556701cacfac in test_and_print tests/builtin-test.c:265
    openvelinux#10 0x556701cadddb in __cmd_test tests/builtin-test.c:402
    openvelinux#11 0x556701caf2aa in cmd_test tests/builtin-test.c:559
    openvelinux#12 0x556701d3b557 in run_builtin tools/perf/perf.c:323
    openvelinux#13 0x556701d3bac8 in handle_internal_command tools/perf/perf.c:377
    openvelinux#14 0x556701d3be90 in run_argv tools/perf/perf.c:421
    openvelinux#15 0x556701d3c3f8 in main tools/perf/perf.c:537
    openvelinux#16 0x7f2952a46189 in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58

SUMMARY: AddressSanitizer: 21 byte(s) leaked in 1 allocation(s).
```

Fixes: f7b58cb ("perf mem/c2c: Add load store event mappings for AMD")
Signed-off-by: Ian Rogers <[email protected]>
Acked-by: Ravi Bangoria <[email protected]>
Tested-by: Arnaldo Carvalho de Melo <[email protected]>
Cc: Adrian Hunter <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ravi Bangoria <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
PvsNarasimha pushed a commit to PvsNarasimha/kernel that referenced this pull request Jan 7, 2025
commit 99d4850 upstream

Found by leak sanitizer:
```
==1632594==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 21 byte(s) in 1 object(s) allocated from:
    #0 0x7f2953a7077b in __interceptor_strdup ../../../../src/libsanitizer/asan/asan_interceptors.cpp:439
    openvelinux#1 0x556701d6fbbf in perf_env__read_cpuid util/env.c:369
    openvelinux#2 0x556701d70589 in perf_env__cpuid util/env.c:465
    openvelinux#3 0x55670204bba2 in x86__is_amd_cpu arch/x86/util/env.c:14
    #4 0x5567020487a2 in arch__post_evsel_config arch/x86/util/evsel.c:83
    #5 0x556701d8f78b in evsel__config util/evsel.c:1366
    openvelinux#6 0x556701ef5872 in evlist__config util/record.c:108
    openvelinux#7 0x556701cd6bcd in test__PERF_RECORD tests/perf-record.c:112
    openvelinux#8 0x556701cacd07 in run_test tests/builtin-test.c:236
    openvelinux#9 0x556701cacfac in test_and_print tests/builtin-test.c:265
    openvelinux#10 0x556701cadddb in __cmd_test tests/builtin-test.c:402
    openvelinux#11 0x556701caf2aa in cmd_test tests/builtin-test.c:559
    openvelinux#12 0x556701d3b557 in run_builtin tools/perf/perf.c:323
    openvelinux#13 0x556701d3bac8 in handle_internal_command tools/perf/perf.c:377
    openvelinux#14 0x556701d3be90 in run_argv tools/perf/perf.c:421
    openvelinux#15 0x556701d3c3f8 in main tools/perf/perf.c:537
    openvelinux#16 0x7f2952a46189 in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58

SUMMARY: AddressSanitizer: 21 byte(s) leaked in 1 allocation(s).
```

Fixes: f7b58cb ("perf mem/c2c: Add load store event mappings for AMD")
Signed-off-by: Ian Rogers <[email protected]>
Acked-by: Ravi Bangoria <[email protected]>
Tested-by: Arnaldo Carvalho de Melo <[email protected]>
Cc: Adrian Hunter <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ravi Bangoria <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
PvsNarasimha pushed a commit to PvsNarasimha/kernel that referenced this pull request Jan 30, 2025
commit 99d4850 upstream

Found by leak sanitizer:
```
==1632594==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 21 byte(s) in 1 object(s) allocated from:
    #0 0x7f2953a7077b in __interceptor_strdup ../../../../src/libsanitizer/asan/asan_interceptors.cpp:439
    openvelinux#1 0x556701d6fbbf in perf_env__read_cpuid util/env.c:369
    openvelinux#2 0x556701d70589 in perf_env__cpuid util/env.c:465
    openvelinux#3 0x55670204bba2 in x86__is_amd_cpu arch/x86/util/env.c:14
    #4 0x5567020487a2 in arch__post_evsel_config arch/x86/util/evsel.c:83
    #5 0x556701d8f78b in evsel__config util/evsel.c:1366
    openvelinux#6 0x556701ef5872 in evlist__config util/record.c:108
    openvelinux#7 0x556701cd6bcd in test__PERF_RECORD tests/perf-record.c:112
    openvelinux#8 0x556701cacd07 in run_test tests/builtin-test.c:236
    openvelinux#9 0x556701cacfac in test_and_print tests/builtin-test.c:265
    openvelinux#10 0x556701cadddb in __cmd_test tests/builtin-test.c:402
    openvelinux#11 0x556701caf2aa in cmd_test tests/builtin-test.c:559
    openvelinux#12 0x556701d3b557 in run_builtin tools/perf/perf.c:323
    openvelinux#13 0x556701d3bac8 in handle_internal_command tools/perf/perf.c:377
    openvelinux#14 0x556701d3be90 in run_argv tools/perf/perf.c:421
    openvelinux#15 0x556701d3c3f8 in main tools/perf/perf.c:537
    openvelinux#16 0x7f2952a46189 in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58

SUMMARY: AddressSanitizer: 21 byte(s) leaked in 1 allocation(s).
```

Fixes: f7b58cb ("perf mem/c2c: Add load store event mappings for AMD")
Signed-off-by: Ian Rogers <[email protected]>
Acked-by: Ravi Bangoria <[email protected]>
Tested-by: Arnaldo Carvalho de Melo <[email protected]>
Cc: Adrian Hunter <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ravi Bangoria <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
Signed-off-by: PvsNarasimha <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.