WIP: FS/GS-based TLS access abstraction #257
base: master
Conversation
Signed-off-by: Joe Richey <[email protected]>
Thanks a lot for proposing this! The design proposed here looks good to me overall. The assembly comparison is really cool; it's great to see that the use of inline assembly still leads to optimized code. As I understand it, there is also native LLVM support for thread-local data, which Rust exposes through the unstable `#[thread_local]` attribute.

One thing that I'm curious about is how the performance of your proposal compares to storing just a pointer:

```asm
example::one_page:
        xor eax, eax
        mov rax, qword ptr gs:[rax]
        mov rax, qword ptr [rax + 2784]
        ret
```

Without a pointer, we have to load the full page array first before accessing the field, which the compiler cannot optimize away: https://godbolt.org/z/hMrWdsE1Y
I think for values > 8 bytes it makes more sense to go through a pointer, for the performance reasons mentioned above. Or maybe it is possible to adjust the macro to also support loading a specific array element or nested struct field directly? We should also think about values that contain padding bytes. For example, does the compiler require them to have a specific value (e.g. 0)?
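The padding question above can be made concrete on the host: a `#[repr(C)]` struct with a `u64` followed by a `u32` carries four trailing padding bytes, which a byte-wise copy through GS would also transfer. A minimal sketch (the struct and field names here are hypothetical, chosen to match the layouts discussed in this thread):

```rust
use std::mem::{align_of, size_of};

// A #[repr(C)] layout like the TLS structs discussed in this thread:
// 8 bytes for `a`, 4 bytes for `b`, then 4 trailing padding bytes so
// the struct's size is a multiple of its 8-byte alignment.
#[repr(C)]
struct Example {
    a: u64,
    b: u32,
}

fn main() {
    assert_eq!(align_of::<Example>(), 8);
    // 8 (a) + 4 (b) + 4 (padding) = 16
    assert_eq!(size_of::<Example>(), 16);
    println!("size = {}", size_of::<Example>());
}
```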
Given that this is probably a very costly operation (the optimizer won't remove any of the
The We might be able to make the
I think I prefer the term TLS over "per cpu" since the fs/gs value is typically set per thread (and updated accordingly by the context switching code).
I'm fine with creating a separate |
Amazing!! It would be nice to be able to do this without needing to use fancy macros, because that could lead to lots of inlined code, right? Personally I think something like this would be efficient:

```rust
static LOCAL: CpuLocal<Something> = CpuLocal::new(CpuLocalBase::FSBase, || Something::new_uninit());

fn test() {
    LOCAL.get_mut().something = 10;
    // ^^^^^^^^^^^
    // Returns &mut Something
    *LOCAL.get_mut() = NEW_THING;
}

impl<T> CpuLocal<T> {
    pub fn new(base: CpuLocalBase, lazy: fn() -> T) -> CpuLocal<T>;
    pub fn get_mut(&self) -> &mut T;
    pub fn get(&self) -> &T;
    pub fn get_offset(&self) -> VirtAddr;
}
```
|
I agree, the macros should only be needed if you need maximal performance. The basic idea would be to have the TLS structure contain a pointer to itself (as @phil-opp mentioned above, stored at some fixed offset). This is what Linux does and what Fuchsia does on x86_64 (other architectures don't need such a pointer). We would then have the following API:

```rust
#[repr(C)]
pub struct Value<T> {
    ptr: *const T,
    val: T,
}

impl<T> Value<T> {
    const fn new(val: T) -> Self;
}

pub struct Wrapper<S, T>(PhantomData<(S, *mut Value<T>)>);

impl<S: Segment64, T> Wrapper<S, T> {
    pub const fn new() -> Self {
        Self(PhantomData)
    }

    /// Needs to be called on each thread to initialize the segment base
    pub unsafe fn init(&self, new: *mut Value<T>) -> *mut Value<T> {
        let old = S::read_base().as_mut_ptr();
        S::write_base(VirtAddr::from_ptr(new));
        old
    }

    /// Only safe to call if `init()` has been called on this thread
    pub unsafe fn with<R>(&'static self, f: impl FnOnce(&mut T) -> R) -> R {
        let base = S::read_u64_at_offset_zero() as *const Value<T>;
        let r: &mut T = &mut (*base).val;
        f(r)
    }
}
```

@Andy-Python-Programmer would that sort of API work?

EDIT: I don't think there's a way to have some sort of lazy initialization function in the constructor. We wouldn't know when the value had been initialized, so we wouldn't know when to call the constructor. We can't use an |
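The self-pointer trick above can be exercised on the host by standing in for the segment register with an ordinary `thread_local!` cell. This is purely a mock (real code would write the base via `Segment64`, e.g. `wrgsbase`); the point is to show why reading "offset 0" yields a usable pointer to the whole block:

```rust
use std::cell::Cell;
use std::ptr;

// Mirrors the `Value<T>` layout from the comment above:
// the self-pointer sits at offset 0, the payload right after it.
#[repr(C)]
struct Value<T> {
    ptr: *const Value<T>,
    val: T,
}

thread_local! {
    // Mock stand-in for the GS base register.
    static FAKE_GS_BASE: Cell<*mut Value<u64>> = Cell::new(ptr::null_mut());
}

/// Mirrors `Wrapper::init`: point the (fake) segment base at the block.
unsafe fn init(new: *mut Value<u64>) {
    (*new).ptr = new; // set up the self-pointer at offset 0
    FAKE_GS_BASE.with(|b| b.set(new));
}

/// Mirrors `Wrapper::with`: "gs:[0]" is the u64 stored at the base
/// address, i.e. the self-pointer, which we dereference to reach `val`.
unsafe fn with<R>(f: impl FnOnce(&mut u64) -> R) -> R {
    let base_addr = FAKE_GS_BASE.with(|b| b.get());
    let base = *(base_addr as *const *mut Value<u64>);
    f(&mut (*base).val)
}

fn main() {
    let mut block = Value { ptr: ptr::null(), val: 0u64 };
    unsafe {
        init(&mut block);
        with(|v| *v = 42);
        assert_eq!(with(|v| *v), 42);
    }
    println!("ok");
}
```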
Ah I see. The current read method takes a field (`($wrapper:path, $field:tt)`). What if you want to compare two CPU-local `Arc`s?

```rust
if Arc::ptr_eq(&new_process, &previous_process) {
    return false;
}
```

This is quite common, as we want the current process to be local to the CPU |
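For reference, `Arc::ptr_eq` compares allocations rather than contents, which is exactly what the scheduler check above relies on. A small self-contained sketch (the variable names are illustrative):

```rust
use std::sync::Arc;

fn main() {
    let current = Arc::new(String::from("process"));
    let same = Arc::clone(&current); // same allocation
    let other = Arc::new(String::from("process")); // equal contents, different allocation

    // ptr_eq is true only for clones of the same Arc allocation.
    assert!(Arc::ptr_eq(&current, &same));
    assert!(!Arc::ptr_eq(&current, &other));
    println!("ok");
}
```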
I guess you would need to copy them into the appropriate context. We could also add an |
All good! |
I think I'm going to initially submit a PR that adds the non-macro changes, and then do the macro stuff in a follow-up PR. The code generated is fairly good for both regardless, so there's only a marginal benefit to the macro approach. For example, consider this setup for GS-based thread-local storage:

```rust
#[repr(C)]
struct ThreadLocal {
    a: u64,
    b: u32,
}

const LOCAL: Wrapper<GS, ThreadLocal> = Wrapper::new();
```

Using the `with` method:

```rust
pub fn slow_get_a() -> u64 {
    unsafe { LOCAL.with(|t| t.a) }
}

pub fn slow_set_b(v: u32) {
    unsafe { LOCAL.with(|t| t.b = v) }
}
```

gives code that uses two `mov` instructions per access:

```asm
example::slow_get_a:
        mov rax, qword ptr gs:[0]
        mov rax, qword ptr [rax]
        ret

example::slow_set_b:
        mov rax, qword ptr gs:[0]
        mov dword ptr [rax + 8], edi
        ret
```

Using the macros:

```rust
pub fn fast_get_a() -> u64 {
    unsafe { tls_read!(LOCAL.a) }
}

pub fn fast_set_b(v: u32) {
    unsafe { tls_write!(LOCAL.b, v) }
}
```

gives code equal to the C equivalent:

```asm
example::fast_get_a:
        mov rax, qword ptr gs:[8]
        ret

example::fast_set_b:
        mov dword ptr gs:[16], edi
        ret
```
|
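The `gs:[8]` and `gs:[16]` offsets in the macro output line up with the `Value<ThreadLocal>` layout from earlier in the thread (an 8-byte self-pointer at offset 0, then the payload). This can be checked on the host with `offset_of!` (stable since Rust 1.77); the types below restate the ones from the thread:

```rust
use std::mem::offset_of;

#[repr(C)]
struct ThreadLocal {
    a: u64,
    b: u32,
}

// Restates the Value<T> layout from the API sketch above.
#[repr(C)]
struct Value<T> {
    ptr: *const T,
    val: T,
}

fn main() {
    // The self-pointer occupies bytes 0..8, so the payload starts at 8.
    assert_eq!(offset_of!(Value<ThreadLocal>, val), 8);
    // `a` lands at gs:[8] and `b` at gs:[16], matching the assembly.
    assert_eq!(offset_of!(Value<ThreadLocal>, val) + offset_of!(ThreadLocal, a), 8);
    assert_eq!(offset_of!(Value<ThreadLocal>, val) + offset_of!(ThreadLocal, b), 16);
    println!("ok");
}
```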
The macro actually produces more optimised assembly. I see |
Sounds good to me! |
This is a fairly naive implementation, where we allocate a per-CPU region from CPU_SPACE for each CPU (which is currently just the first CPU, as we don't have multi-CPU support yet), then we store a pointer to the beginning of that space in the GS base. To load the per-CPU data, we simply read the address back out of the GS base and dereference it to access the data. This isn't as efficient as the approach being discussed in rust-osdev/x86_64#257, but it works well enough for now.
Signed-off-by: SlyMarbo <[email protected]>
I was trying to use FS/GS-based Thread Local Storage (TLS) and I noticed we don't have an easy way to view the FS or GS register as a structure (a common pattern in OS dev).
This patch would allow for the following code:
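The original example did not survive here; based on the names used later in the thread (`tls_read!`, `tls_write!`), usage presumably looked roughly like the sketch below. The macro definitions in this sketch are host-testable stand-ins over a `thread_local!`, not the inline-assembly versions from the patch, and the `PerCpu` field is hypothetical:

```rust
use std::cell::RefCell;

#[derive(Default)]
struct PerCpu {
    ticks: u64, // hypothetical field for illustration
}

thread_local! {
    // Stand-in for the GS-based storage (mock only).
    static PER_CPU: RefCell<PerCpu> = RefCell::new(PerCpu::default());
}

// Mock of the proposed read macro: load a single field.
macro_rules! tls_read {
    ($field:ident) => {
        PER_CPU.with(|p| p.borrow().$field)
    };
}

// Mock of the proposed write macro: store a single field.
macro_rules! tls_write {
    ($field:ident, $val:expr) => {
        PER_CPU.with(|p| p.borrow_mut().$field = $val)
    };
}

fn main() {
    tls_write!(ticks, 7);
    assert_eq!(tls_read!(ticks), 7);
    println!("ok");
}
```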
I'm not sure exactly what we want from the API, but I wanted to get feedback on the design sooner rather than later. Issues to discuss:

- Does the `PerCpu` struct have to implement some sort of trait (and should `tls_read!(PER_CPU)` work)?
- Naming (`Tls` vs `Wrapper`, `tls_read!` vs `per_cpu_read!`)
- Where should this live (`instructions::tls`? `instructions::segmentation`?)

This implementation seems to compile to fairly efficient code. See this Rust implementation vs a C implementation (using `__seg_gs`). The C implementation is still slightly better, which isn't surprising given that `__seg_gs` has LLVM support (on the other hand, I was able to get `__seg_gs` to crash clang, so it's quite unstable).

TODO: Docs, tests, finish design

Signed-off-by: Joe Richey [email protected]