Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fdtable: fix longstanding layering violations #75348

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

cfriedt
Copy link
Member

@cfriedt cfriedt commented Jul 3, 2024

Although fdtable sat a layer below posix, as part of the base os, it was reaching up a layer for POSIX types, creating another dependency cycle.

Instead of forcing POSIX types to be used, simply add the native Zephyr types, k_off_t and k_ssize_t. Remove <sys/types.h> and <fcntl.h> inclusion below the POSIX layer, where appropriate.

Generally, it would be best to use only ISO or native zephyr types below POSIX.

(Would have fixed #75205 with a very small workaround)
Fixes #75513
Fixes #10436

Update:
To future-proof Zephyr in terms of file, disk, and device compatibility, a Kconfig option was added that allows the user to explicitly use a 64-bit k_off_t, even if the underlying hardware is only 32-bit.

The question is whether we want future generations to pay off our debt + even more interest, or whether we are willing to pay the bill ourselves. This has already been ignored for 2 LTS releases.

Directly before an LTS release probably isn't a super appropriate time for a fix like this to go in, but then again, maybe it's the most appropriate time for it to go in 🤷‍♂️

@cfriedt
Copy link
Member Author

cfriedt commented Jul 3, 2024

twister -p native_sim -p qemu_riscv64 -p qemu_x86 -p qemu_x86_64 -p qemu_cortex_a53 -p mps2/an385 -T tests/posix
...
INFO    - Total complete:  270/ 270  100%  skipped:   84, failed:    0, error:    0
INFO    - 45 test scenarios (270 test instances) selected, 84 configurations skipped (58 by static filter, 26 at runtime).
INFO    - 186 of 270 test configurations passed (100.00%), 0 failed, 0 errored, 84 skipped with 0 warnings in 403.48 seconds
INFO    - In total 7029 test cases were executed, 3759 skipped on 6 out of total 21 platforms (28.57%)
INFO    - 181 test configurations executed on platforms, 5 test configurations were only built.

A lot of build-only targets being skipped here..

twister --build-only -T tests/net/ -T tests/lib/c_lib
...
INFO    - Total complete: 7042/7042  100%  skipped: 5895, failed:    0, error:    0
INFO    - 250 test scenarios (7042 test instances) selected, 5895 configurations skipped (5868 by static filter, 27 at runtime).
INFO    - 1147 of 7042 test configurations passed (100.00%), 0 failed, 0 errored, 5895 skipped with 0 warnings in 2589.39 seconds
INFO    - 0 test configurations executed on platforms, 1147 test configurations were only built.
twister -i -p  mimxrt1170_evk/mimxrt1176/cm7 -p mm_feather -p nrf9131ek/nrf9131/ns -s tests/posix/common/portability.posix.common.newlib -s tests/posix/common/portability.posix.common.tls.newlib -s tests/posix/fs/portability.posix.fs.newlib -s tests/posix/fs/portability.posix.fs.tls.newlib -s tests/posix/headers/portability.posix.headers.newlib.with_posix_api -s tests/lib/c_lib/thrd/libraries.libc.c11_threads.newlib -s tests/lib/c_lib/thrd/libraries.libc.c11_threads.newlib_nano
...
INFO    - Total complete:   21/  21  100%  skipped:    2, failed:    0, error:    0
INFO    - 7 test scenarios (21 test instances) selected, 2 configurations skipped (0 by static filter, 2 at runtime).
INFO    - 19 of 21 test configurations passed (100.00%), 0 failed, 0 errored, 2 skipped with 0 warnings in 127.21 seconds
INFO    - In total 696 test cases were executed, 210 skipped on 3 out of total 7 platforms (42.86%)
INFO    - 0 test configurations executed on platforms, 19 test configurations were only built.

@cfriedt cfriedt force-pushed the declare-off-t-and-ssize-t-in-fdtable-h branch from 468003b to 398403c Compare July 3, 2024 02:01
@cfriedt cfriedt requested a review from ycsin July 3, 2024 02:08
@cfriedt cfriedt marked this pull request as ready for review July 3, 2024 02:54
@zephyrbot zephyrbot added area: POSIX POSIX API Library area: Base OS Base OS Library (lib/os) labels Jul 3, 2024
@cfriedt cfriedt added the Hotfix Fix for issues blocking development, i.e. CI issues, tests failing in CI, etc. label Jul 3, 2024
Copy link
Member

@aescolar aescolar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although I understand what and why you are doing this I don't think this is the solution.
Defining a type in the "wrong" header to avoid including another header from that one will end up with an even more messy situation, as now issues will depend on the order of header inclusion by users of those headers.
It is not safe to redefine these types (as you do not know what the C library is defining those to), and attempting to guard those definitions with the C library internal guard names is going to not work with other libraries or when those guards names change.

include/zephyr/sys/fdtable.h Outdated Show resolved Hide resolved
@aescolar aescolar requested a review from keith-packard July 3, 2024 07:20
@cfriedt
Copy link
Member Author

cfriedt commented Jul 3, 2024

@aescolar - this was the "path of least resistance" change that is non-destructive. The better change would be to simply not use inappropriate types in the base OS (i.e. simply change return values from ssize_t to long). However, that would have greater repercussions as several function signatures in and out of tree would need to be corrected.

@keith-packard
Copy link
Collaborator

keith-packard commented Jul 3, 2024

While painful, I think the only 'conforming' solution is to avoid POSIX types entirely and either use existing C types or create your own Zephyr-specific ones. For ssize_t, you could use ptrdiff_t and then introduce casts in the POSIX layer between those -- I've never heard of a target for which those don't use the same underlying representation.

For off_t, I don't think there's any C replacement you could use with the right semantics. fpos_t would be great as an abstraction but that's not required to be an integer type. I think you need to create a zephyr-specific type here. I would really encourage you to use a 64-bit type and avoid all of the disasters that come from 32-bit file size adventures.

C libraries for Zephyr could then adopt this new OS type for their own off_t and fpos_t definitions so that we would be assured of 64-bit file support through the whole stack. C libraries without such changes would need range checks across the OS abstraction layer to catch problems at runtime (sigh).

Reading through the picolibc code, I'm afraid it defines off_t as long. I'll go fix that so that it will always use 64-bits. That way it will match the Zephyr OS type you should create.

@cfriedt
Copy link
Member Author

cfriedt commented Jul 5, 2024

Thanks for your help with this @keith-packard 🪁

@cfriedt cfriedt force-pushed the declare-off-t-and-ssize-t-in-fdtable-h branch from 398403c to 55290f0 Compare July 6, 2024 13:13
@cfriedt cfriedt force-pushed the declare-off-t-and-ssize-t-in-fdtable-h branch from 55290f0 to f633095 Compare July 6, 2024 13:26
@cfriedt cfriedt changed the title fdtable: avoid pulling in posix header & types fdtable: remove inappropriate use of POSIX off_t and ssize_t Jul 6, 2024
@cfriedt cfriedt removed the Hotfix Fix for issues blocking development, i.e. CI issues, tests failing in CI, etc. label Jul 6, 2024
Chris Friedt added 2 commits July 12, 2024 06:35
Update sof to use k_ssize_t instead of the POSIX ssize_t, thus
avoiding layering violations in core parts of Zephyr.

Signed-off-by: Chris Friedt <[email protected]>
A shortcut was taken many years ago by "cherry-picking" the
POSIX off_t and ssize_t types for use throughout Zephyr.

Additionally, the POSIX header fcntl.h, as well as constants
defined in that header, were being used inappropriately
throughout Zephyr.

Doing so created a dependency cycle: below POSIX's position
in the stack, code depended on POSIX; above POSIX's position
in the stack, code depends on the lower parts.

Such dependency cycles usually result in fragility and
instability of the software stack.

Use the newly defined k_off_t and k_ssize_t types throughout
Zephyr's stack, where off_t and ssize_t were previously used
inappropriately.

Additionally, use ZVFS-prefixed constants instead of their
POSIX counterparts.

Additionally, where appropriate, ensure the POSIX fcntl.h
header is prefixed with <zephyr/posix/fcntl.h>.

We effectively create a mutual dependency to resolve the
cyclic dependency, as described in GitHub issue zephyrproject-rtos#51211.

For clarification, it is inappropriate to use POSIX types or
functions within the kernel, core OS, OS services, or really
anywhere that is not equal to or above POSIX's position in the
software stack.

In other words, if a library uses POSIX, Zephyr's POSIX
implementation cannot depend on that library.

Similarly, if a system service uses POSIX, Zephyr's POSIX
implementation cannot depend on that system service.

Signed-off-by: Chris Friedt <[email protected]>
@cfriedt cfriedt force-pushed the declare-off-t-and-ssize-t-in-fdtable-h branch from 502a493 to 641a504 Compare July 12, 2024 11:15
@cfriedt
Copy link
Member Author

cfriedt commented Jul 12, 2024

Copy link
Collaborator

@sylvioalves sylvioalves left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From ESP32 side and memory discussion, LGTM.

Copy link
Contributor

@andyross andyross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ooph, that's a lot of change. Might as well +1 just so we can get this in. In the longer term, maybe we should look at moving usage from ssize_t (AIUI a POSIX type) to ptrdiff_t (ISO C). Not sure whether off_t has an equivalent in the standard library, though? Probably not, so I guess a k_off_t is unavoidable.

Copy link
Contributor

@andyross andyross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops forgot vote

ssize_t len;
k_ssize_t len;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this really correct? This driver is for native_sim, and some 10 lines later the return value from the host libc's read() is assigned to it.

Copy link
Member Author

@cfriedt cfriedt Jul 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The definitions should be identical (as far as the underlying type goes). Otherwise, there would be warnings / errors.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't doubt that. I was mainly wondering where a native_sim driver like this conceptually sits in the software stack. It's calling the host OS APIs (like read()) so in that sense it could be seen to be above POSIX (in which case ssize_t would be more appropriate), however the driver also accesses Zephyr kernel APIs. Anyway, this isn't necessarily that critical (as you say, the types should be the same), which is why I didn't want to block this PR by requesting changes, but I do think it would be good to understand where native_sim driver code sits in the software stack.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, it's kind of a recursion. I would still say those drivers sit below (Zephyr's) POSIX API.

The namespace issue is kind of an unfortunate side-effect (not without solutions, ofc)

@andyross
Copy link
Contributor

Not sure whether off_t has an equivalent in the standard library, though?

Duh, I forgot stdio. So interestingly fread/fwrite() use size_t as the buffer offset type, so effectively in all real circumstances that demands off_t be the same. Interestingly fseek() never got the memo and uses a bare long as the offset type. But there, POSIX got huffy and defined fseeko() which takes off_t, so we don't get to assume anything.

@cfriedt
Copy link
Member Author

cfriedt commented Jul 12, 2024

Technically size_t would be the same already for 64 bit (without the sign)

@keith-packard
Copy link
Collaborator

Duh, I forgot stdio. So interestingly fread/fwrite() use size_t as the buffer offset type, so effectively in all real circumstances that demands off_t be the same.

Not at all -- size_t refers to the in-memory storage, which is limited to the address space. off_t covers file sizes, not memory, which C doesn't really talk about much, aside from fseek as you say. However, there's also fpos_t as used by fgetpos and fsetpos. C has weasel words for that allowing it to be a struct type, which allows file sizes beyond the scope of any integer type, but sadly doesn't map neatly to off_t.

On 32-bit Linux systems, size_t is 32-bits, but (most of the time), off_t is 64-bits.

@cfriedt
Copy link
Member Author

cfriedt commented Jul 12, 2024

Aren't standards wonderful?? 😂🙃

@nashif
Copy link
Member

nashif commented Jul 16, 2024

The PR is reaching the limit of what GH can handle, opening it in the browser just takes forever :(

from the commit message of the largest commit:

Additionally, the POSIX header fcntl.h, as well as constants
defined in that header, were being used inappropriately
throughout Zephyr.

Can this be done in its own commit?

Additionally, use ZVFS-prefixed constants instead of their POSIX counterparts.

ditto

Additionally, where appropriate, ensure the POSIX fcntl.h header is prefixed with <zephyr/posix/fcntl.h>.

ditto

@@ -866,6 +866,23 @@ config XIP
supply a linker command file when building your image. Enabling this
option increases both the code and data footprint of the image.

config OFFSET_64BIT
bool "Use 64-bit offsets"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would be great if this went into its own section in Kconfig (menu), i.e. do not put alongside other features and make it obvious is it a type, maybe others will be added to the same menu later.

OFFSET_64BIT can be confusing, maybe make it obvious we are talking about a type here: TYPE_OFFSET_64BIT.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. The closest thing I found to this so far is TIMEOUT_64BIT. Should that be moved as well? It also has to do with a type (k_timeout_t).

@de-nordic
Copy link
Collaborator

de-nordic commented Jul 16, 2024

Not sure whether off_t has an equivalent in the standard library, though?

Duh, I forgot stdio. So interestingly fread/fwrite() use size_t as the buffer offset type, so effectively in all real circumstances that demands off_t be the same. Interestingly fseek() never got the memo and uses a bare long as the offset type. But there, POSIX got huffy and defined fseeko() which takes off_t, so we don't get to assume anything.

fread/fwrite operate on memory objects rather than buffers. The POSIX write operates on buffers, but still knows it can only write as much as size_t.
I think that when the fread/fwrite was designed you would still mostly write on tape, so you would rather do a lot of sequential operations, on same types of objects, and then fseek to some other location, and again do some sequential operations.
Tape could be much longer then your address space so using long for fseek made sense, as this operates on external address space which may not fit in size_t.
POSIX was designed later than C, to stop Sys V vs 4BSD API division in systems getting too annoying.
It is funny that if you have true POSIX system, this means that lseek would be native function using off_t, for offset, and the fseek implemented on top would have to convert from long to off_t on call.

Copy link

This pull request has been marked as stale because it has been open (more than) 60 days with no activity. Remove the stale label or add a comment saying that you would like to have the label removed otherwise this pull request will automatically be closed in 14 days. Note, that you can always re-open a closed pull request at any time.

@github-actions github-actions bot added the Stale label Sep 18, 2024
@mmahadevan108
Copy link
Collaborator

@cfriedt, can you help rebase.

@github-actions github-actions bot removed the Stale label Sep 26, 2024
@mmahadevan108 mmahadevan108 modified the milestones: v4.0.0, v4.1.0 Nov 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: Base OS Base OS Library (lib/os) area: POSIX POSIX API Library DNM This PR should not be merged (Do Not Merge) manifest manifest-sof treewide 🧹
Projects
Status: Todo