allow integer types to be any range #3806
Comments
I'm worried this might result in a code explosion as different types are passed to a generic. Idea: input to a generic could be "marked" for which properties are used.
@andrewrk i cannot understand how you got the required size. Since a struct is a product type, MyFancyType has a total of 2 * 6 * 9 * 4 = 432 possible states, which needs 9 bits (2 bytes).
@Rocknest my math is wrong, thanks for the correction 👍
wouldn't it be

```zig
pub const Int = struct {
    is_signed: bool,
    bits: comptime_int,
    from: comptime_int,
    to: comptime_int,
};
```

because of the
Yes except only
should this be marked a breaking proposal as well?
You still need the
You're thinking of a different, already planned issue, which is zig keeping track of additional information beyond the type, for expressions:

```zig
test "runtime value hints" {
    var a: u64 = 99999;
    // `a` is runtime known because it is `var`.
    const b: u64 = a % 256;
    const c: u8 = b; // allowed without @intCast because `b` has a runtime value hint
}
```
i don't think so. take for example the WebSocket protocol: the opcodes are always 4 bits, but 0xB to 0xF are reserved for future use, so effectively they are not allowed.
That sounds like a use case for a non-exhaustive enum:

```zig
enum(u4) {
    Continuation = 0,
    Text = 1,
    Binary = 2,
    Close = 8,
    Ping = 9,
    Pong = 10,
    _,
}
```
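To make that concrete, here is how the reserved opcodes interact with a non-exhaustive enum in current Zig; the `0xB` probe and the test itself are mine, not part of the comment, and it assumes a 0.11+ compiler for `@enumFromInt`:

```zig
const Opcode = enum(u4) {
    Continuation = 0,
    Text = 1,
    Binary = 2,
    Close = 8,
    Ping = 9,
    Pong = 10,
    _,
};

test "reserved opcodes are representable but unnamed" {
    const raw: u4 = 0xB; // reserved for future use by the WebSocket spec
    const op: Opcode = @enumFromInt(raw);
    switch (op) {
        .Continuation, .Text, .Binary, .Close, .Ping, .Pong => unreachable,
        _ => {}, // every reserved value lands in the '_' prong
    }
}
```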
This is very nice and fancy, but I would like to note that it might have a negative effect on performance. This is definitely more efficient memory-wise, so for storing it (on disk) this is perfect. However, for passing it around in code in this format, it would mean that every time there is a calculation, bit fiddling comes into play to extract the value and put it back. Just something to think about.
This is not correct, except with

Also in many cases if you were to use
Yes, sorry I was thinking about align(0) cases. For the other cases this proposal only adds range checks, right?
I'm a bit sceptical about this as I've had experience of the opposite. It probably depends a lot on the circumstances. Which is why I don't think a compiler should be sneaky and perform bit fiddling behind your back. However if it just applies to align(0), it is something which you ask for so it's fine. Somehow I feel though that this is more something to do with optionals and other constructs which have more information than arbitrary numbers and enums, right? It's more like in this example:
Where each case has a special number or range of numbers, such as:
And I don't think it goes for your other example where you have four, what to me seem independent, enums:
Aren't there 4 * 2 * 6 * 9 = 432 possible combinations of those four enums? Which means you still need 9 bits (or two bytes)? In general I do think it would be great if the optionals can work more generically. Not having to write all the -1 checks would be cool. Especially since sometimes people work with indices instead of pointers and the special value trick is quite common.
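For what it's worth, the 9-bit figure can be checked mechanically; a throwaway test, where 2, 6, 9, and 4 are the variant counts of the four enums being discussed:

```zig
const std = @import("std");

test "432 combined states fit in 9 bits, not 8" {
    const states: u16 = 2 * 6 * 9 * 4; // product of the four enums' variant counts
    try std.testing.expect(std.math.log2_int_ceil(u16, states) == 9);
}
```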
Note that integer ranges have a long history in some other languages like Pascal and Ada. There they are heavily used and can be used as ranges for arrays. This leads to some nice code where you can use ranged for loops without any possibility of going out of bounds. It has been decades since I used Pascal or Ada in anger, but even back then I seem to remember that most run-time checks could be dropped, as the compiler could prove that the use was safe. A very naive compiler would do the checks all the time, but existing compilers show that the possible overhead is quite low. As long as the programmer is aware that this could have a cost (such as choosing different arithmetic with overflow wrapping), I think this is fine.

The driving case here, a more compact representation of enums and optionals, is a very strong one IMHO. Having just optionals use the same space as non-optional types would be a huge argument to get programmers to use them. The cache argument is a strong one too. A cache miss can cost 25-50 clock cycles (if you hit L2, worse if you fall all the way to RAM) and you can run a lot of bit twiddling code in that time.
I like the core of this proposal, but there are a few things I'm worried about.

First, I feel like the optimizations are too implicit. You have to hope that you left the correct number of values open. If you do it wrong, there is no indication that you have messed up, you just have bigger types than you expected. As code changes over time, it's easy for additional values to be added to enums that are sensitive to this, which may break these optimizations without warning. I'm worried that, because it's so implicit, general zig advice may become "put explicit bounds on every integer type you use, so that the compiler can optimize it". This could create a lot of additional cognitive load when reading and writing zig code.

Does

It also seems to me like working with these types would in many cases require

Finally, while applying range information to type resolution would solve #747, it also introduces a problem when assigning directly.

There is a precedent for an opaque optional optimization in Zig:

Here are some off-the-cuff suggestions. There's definitely room for improvement for these, but I think they do a good job of showing the direction I'm looking.

Range-checked Explicit Integers

These are the same as the original proposal, but the integer bit-width is specified by the programmer. For example,

Integer Unions

An integer union (or alternatively, "partition" or "ranged enum") would be a union of non-overlapping ranged ints. It is guaranteed to be the size of its backing integer type.
This is very similar to the union example in the original proposal, but with this, if the range is specified incorrectly or a new value is added, there would be a compile error instead of a drastic change in representation. If the integer type is not specified, it would be inferred from the ranges as the smallest type that can represent the range. But the ability to specify the bit width and receive an error if it can't be represented is valuable. Ranged unions could also support the '_' option from extensible unions.

Explicit Sentinel Optionals

If an optional type could be specially constructed by specifying a sentinel, this would solve sentinel checks in a way that guarantees the optimization. Something like

None of these counter-proposals are perfect, but I like that they take more direct control over the optimization. E.g. instead of thinking "I'd like to optimize this optional to use a sentinel value, I should put ranges on this int type everywhere it's used", I can instead think "I'd like to optimize this optional to use a sentinel value, I should tell zig to optimize this optional to use a sentinel value."
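To make the "tell zig to use a sentinel" idea concrete, here is a rough userland approximation in today's Zig; `SentinelOptional`, its method names, and the `maxInt(usize)` choice are all mine for illustration, not anything proposed above:

```zig
const std = @import("std");

/// Userland sketch of an "explicit sentinel optional": one value of T is
/// reserved to mean "null", so the wrapper is exactly @sizeOf(T) instead
/// of @sizeOf(?T).
fn SentinelOptional(comptime T: type, comptime sentinel: T) type {
    return struct {
        raw: T = sentinel,

        const Self = @This();

        pub fn init(value: T) Self {
            std.debug.assert(value != sentinel); // the sentinel itself is not a legal payload
            return .{ .raw = value };
        }

        pub fn unwrap(self: Self) ?T {
            return if (self.raw == sentinel) null else self.raw;
        }
    };
}

test "sentinel optional costs no extra space" {
    const OptIndex = SentinelOptional(usize, std.math.maxInt(usize));
    try std.testing.expect(@sizeOf(OptIndex) == @sizeOf(usize));
    try std.testing.expect((OptIndex{}).unwrap() == null);
    try std.testing.expect(OptIndex.init(42).unwrap().? == 42);
}
```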
Sorry, small nitpick. It's not actually bits, it's unused values. For example in the pointer case

I agree with you that it would be cleaner if it's more explicit.
Sorry, yeah, my wording was unclear. I'm trying to convey that the simpler guarantee of a bit for optionals of non-power-of-two integers is easier to guarantee and think about than an inferred sentinel, and that something like
If we make it so an optional ranged type interprets any out-of-range value as null, then:

```zig
// Zero-cost wrapper for an ABI that uses negative values as error codes
const Result = fn (Int: Type) Type {
    return union {
        // Nullable limits to avoid generating unnecessary bounds checks
        success: @Ranged(Int, 0, null),
        failure: @Ranged(Int, null, -1),
    };
};
```

Define canonical

Possible optimisation: allow nesting of

```zig
// Can be optimised to u32 with tag
// -- don't have to set global rules
const OptionalU32 = ?@Ranged(u32, null, maxInt(u32));

// Don't need another range -- canonical null of
// OptionalU32 is inner null, anything else is outer null
const OptionalOptionalU32 = ?OptionalU32;

// ;)
const Bool = ?@Ranged(u1, 1, null);
const True: Bool = 1;
const False: Bool = null;

const And = fn (T: Type, a: ?T, b: ?T) ?T {
    return if (a) { b } else { null };
};
const Or = fn (T: Type, a: ?T, b: ?T) ?T {
    return a orelse b;
};
const Not = fn (a: Bool) Bool {
    return if (a) { null } else { 1 };
};
```

Pro
You could use
We already have a ranged integer type. It is called
Thanks for looking into it. I agree with everything you said, and to add to that, one thing I'm concerned about is how easy or difficult it will be to choose integer types for function parameters. I think we just have to fuck around and find out.
I have implemented such a family of types in C++: https://github.com/davidstone/bounded-integer/blob/main/bounded-readme.md. I have not experienced any code size or run-time performance issues (and sometimes I get improvements), and I have a moderate-sized code base that uses it.

From the documentation:

```cpp
bounded::integer<1, 100> const x = f();
bounded::integer<-3, 7> const y = g();
auto z = x + y;
static_assert(std::same_as<decltype(z), bounded::integer<-2, 107>>, "Type of z incorrect.");
```

After using this, I feel crippled when I'm forced to use the types of integers that come with most programming languages. The main drawback is increased compile times, but a big portion of that is due to the cost of C++ template instantiation. I'm not sure how that differs in Zig and whether you'd be implementing it in the compiler.
@davidstone Thank you for this, it's a very useful data point. I have no reason to believe an implementation of this would lead to any significant changes in compile times - the only thing I can think of is that when dealing with particularly large types, you may be performing a fair bit more bigint arithmetic, but that would only matter for types with bounds larger than probably
shoutouts to ziglang/zig#3806
When the slice-by-length start position is runtime-known, it is likely protected by a runtime-known condition and therefore a compile error is less appropriate than a runtime panic check. This is demonstrated in the json code that was updated and then reverted in this commit. When #3806 is implemented, this decision can be reassessed. Revert "std: work around compiler unable to evaluate condition at compile time" Revert "frontend: comptime array slice-by-length OOB detection" This reverts commit 7741aca. This reverts commit 2583b38.
This reverts commit cb5a6be. I deeply apologize for the churn. This change is problematic given that we do not have ranged integers (yet? see #3806). In the meantime, this type needs to be `usize`, matching the length and index types for all std lib data structures. Users who want to save memory should not use heap-allocated BoundedArray values, since it is inherently memory-inefficient. Use a different memory layout instead. If #3806 is accepted and implemented, the length value can become an integer with the appropriate range, without the footgun. If that proposal is not accepted, len type will remain a usize.
It seems worth observing that this offers a possible solution to the felt need in #17039/#14704. Andrew made a comment to that effect in the first of those issues; I wanted to add some notes about it here.

If and when integers are defined as a range, rather than a width, it could be worthwhile to make this legal:

```zig
for (u8) |byte| {
    // ...
}
```

If I saw this I would have no question about what values `byte` would take. If that's too weird, it could take a builtin:

```zig
for (@range(u8)) |byte| {
    // ...
}
```

But I think that in a world where integers are ranges, it would be eloquent for them to coerce to an iteration of that range in a `for` loop.

This brings me to a point which is tangential to the main thrust of this comment, yet necessary for what follows: ranged integers should be defined inclusively, as closed intervals. I feel that this accords with how we reckon with integers already: I think of a `u8` as holding 0 through 255, inclusive.

I do understand why using the exclusive range felt more natural, of course, but we really want it to be inclusive: most of this comment serves as the argument for that, although this point is not why I came to this issue today. That would make `u8` the same as `@Int(0, 255)`.

This is how Ada does it, and with reason. Ada also has a 'modular' type, which also specifies wraparound arithmetic for the type. Since this uses modulus, it mentions the first unsigned integer which isn't in the range. Zig uses explicit modular operators, so this is probably not a useful concept for us. Another data point is the C standard, Section 7.20.1. It doesn't use interval notation at all, but it defines macros for limits in terms of minimum and maximum (inclusive) values.

Inclusive is the better convention in its own right, and also best compatible with the

I broadly agree with not having two range syntaxes for a

There's another place this coercion could apply: switch statements.

```zig
fn byteSwitch(b: u8) void {
    const a_range: type = @Int(0, 0x7f);
    const b_range: type = @Int(0xc2, 0xdf);
    // ...
    switch (b) {
        a_range => ...,
        b_range => ...,
    }
}
```

These custom types coerce to ranges in the `switch`. For the same reason that it would be confusing to have an exclusive range in a switch statement, it would be confusing to have an exclusive range in a ranged integer definition.

Part of why I wrote this out was to be able to link to it later, but also on the premise that it's good to have discussions of the positive (or negative) consequences of a feature in the issue tracking it. I don't see a lot of point in writing the possible coercion features up as a separate issue until or unless this one is tagged 'accepted', although I'm willing to do so.
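For contrast with the `for (u8)` idea above, this is roughly what exhaustively visiting every `u8` value takes in current Zig (the kind of workaround #17039/#14704 describe); the `u9` counter is one common way to avoid the overflow at 255, not the only one:

```zig
const std = @import("std");

test "visiting every u8 value today" {
    var count: usize = 0;
    var i: u9 = 0; // a u8 counter would overflow before the loop could terminate
    while (i <= std.math.maxInt(u8)) : (i += 1) {
        const byte: u8 = @intCast(i);
        _ = byte; // stand-in for real per-value work
        count += 1;
    }
    try std.testing.expect(count == 256);
}
```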
I also wanted to respond separately to @SpexGuy's comment, since it has coherent implementation concerns which haven't yet been completely addressed. That was four years ago, and Spex might have a completely different take now; I want to acknowledge that before continuing. Zig has also changed quite a bit in that period of time.

For me, the benefit of this proposal is lifting ranges up into the type system, and the niche optimizations are a nice bonus. I think we can come up with rules which are general, easy to understand, and cover all cases of optional integer types, whether by themselves or in a packed struct. At the limit, someone concerned if they left a niche in their enum type can use a test.

Another solution is this:

```zig
const ThreeEnum = enum(@Int(1, 3)) {
    fee,
    fie,
    foe,
};
```

This seems like something which would have to be legal if integer ranges are just integers. This guarantees (according to what follows) an optional sentinel at zero.

It seems to me that the solution to the limited-range problem of assigning comptime ints is something we'd have to do already: you need to provide a type for a

Niche Optimization of Ranged Integer Types

I propose a rule like the following: a ranged integer type has an automatic bitwidth, which is the minimum necessary to hold every value in the range. If there are any negative values in the range, this is a signed width, otherwise, unsigned. If there are representable values in that width which are not part of the range, then the first value higher than the maximum, if it exists, is the optional sentinel, otherwise, the first value lower than the minimum. If there are no representable values left in the bitwidth (these are the integers we have already) then the original bitwidth is sign-extended by one bit, and the first value of the new width which is higher than the maximum of the old width is the optional sentinel.

I think this covers all the bases. Part of what Spex was addressing is combinatoric difficulties with the

This rules out some of the more exotic 'optimizations', which would be space-optimizing only, and at the expense of everything else. For example, [-3, 240] could be represented in eight bits, but this rule would require it to have a bitwidth of `i9`.

Even in the event of following in the footsteps of Pascal and Ada by having arrays with widths and indices defined by a ranged integer, the offsetting of an index into that array should be at the point of use, not in the storage value of a ranged integer used for indexing that array. This is the tip of a huge iceberg, which I will now veer away from.

Alternative version of the rule, which is actually better: for an unsigned range, if there are available values higher than the maximum, the highest value which fits in the bitwidth is chosen, otherwise a bit is added and the sentinel is the maximum value of the new width. For a signed range, if there are available values lower than the minimum, the lowest possible value of the width is chosen, otherwise the highest possible value, otherwise it is sign-extended by one and the lowest value of the new bitwidth is the optional sentinel.

My little bait-and-switch here is because that formula is significantly harder to understand, and looks worse at a first glance. But it's better, because it always chooses all 1 bits when it can, and when it can't, it chooses 0 followed by all 1s. This is the opposite of

There might be a case for adding "if zero is not in the range, then zero is the sentinel" but I wouldn't make it. Someone else might, but I'm sensitive to the fact that the second version of the rule is more complex, and therefore harder to understand. I'll acknowledge that zero is a good value to be working with when it comes time to emit machine code, however. I could also be wrong about the more complex rule being better for the compiler, but it would be easier to debug, and that's not nothing.

This neglects the actual size which the integer will take, and that's intentional. A

As a note, Zig's documentation doesn't specify how non-natural-width integers are represented on the machine, and perhaps that's intentional, but maybe it should.

Back to the comment I'm replying to: I don't think we want either of "bring your own bitwidth" or "bring your own sentinel". That's creating opportunities for incompatibility where there otherwise would not be.
It seems to me that even in the packed case, a width wider than the minimum dictated by the rules here can be emulated by adding a pad field after the ranged integer. It also brings the weirdness that a

The major reason to even define what the sentinel will be is debugging; it isn't a value that user code has native access to, although presumably this would reveal it:

```zig
const opt_ranged: ?@Int(1, 14) = null;
const as_cast: u8 = @bitCast(opt_ranged);
```

And I think the fact that the above would be possible, and debugging, are good enough reasons to have the sentinel value be a specified part of the language.

This may be irrelevant, given that

I'd argue that even for packed structs, trying to minimize space in a range like [256, 511] should be resisted. If it's imperative that values of that nature fit in eight bits and not nine, it's reasonable to do that with arithmetic. Things would just get weird otherwise.

The partitioned integer is a cool idea though. I haven't a clue how feasible it is, but it's probably its own issue rather than this one.
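To make the first version of the rule above easier to check, here is a comptime sketch restricted to non-negative ranges; `sentinelFor` is a hypothetical helper written only for illustration, not proposed syntax:

```zig
const std = @import("std");

// Minimum unsigned width for the range, then: first free value above the
// range, else first free value below it, else widen by one bit.
fn sentinelFor(comptime min: comptime_int, comptime max: comptime_int) comptime_int {
    comptime var bits = 1;
    while ((1 << bits) - 1 < max) : (bits += 1) {}
    const top = (1 << bits) - 1; // largest value representable in the minimum width
    if (max < top) return max + 1; // room above the range
    if (min > 0) return min - 1; // otherwise room below it
    return top + 1; // the range fills the width: widen by one bit
}

test "sentinel choices follow the rule" {
    try std.testing.expect(sentinelFor(1, 13) == 14); // room above the range in a u4
    try std.testing.expect(sentinelFor(1, 3) == 0); // u2 full above, 0 free below
    try std.testing.expect(sentinelFor(0, 255) == 256); // whole u8 used: widen to 9 bits
}
```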
You'll have to forgive that all of my example code here uses C++ syntax; I'm not familiar enough with Zig to be confident giving the Zig equivalent.

My C++ library defines the range bounds inclusively.

I agree that we should make the representation of the integer type match its value. In other words,

My integer type also works with a compressed optional implementation -- it's split into the general optional type and an associated traits class to manage the optimization for integer types specifically. The general idea is similar to what was outlined here: use the unused values as sentinel values. My implementation uses the least representable value as the first tombstone value and counts up from there, skipping over the valid value range. So if

I much prefer the design of my library, where the user says "I want an integer in this range, you figure out how to store it".
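As an illustration of the tombstone numbering described above, expressed in Zig rather than C++; the `[5, 10]` range stored in a `u4` is my own example, not taken from the library's documentation:

```zig
const std = @import("std");

// Tombstone 0 is the least representable value; numbering counts up and
// skips over the valid range [5, 10].
fn tombstoneValue(index: u4) u4 {
    std.debug.assert(index < 10); // 16 storage values - 6 valid values = 10 tombstones
    const min_valid: u4 = 5;
    const max_valid: u4 = 10;
    return if (index < min_valid) index else index + (max_valid - min_valid) + 1;
}

test "tombstones skip the valid range" {
    try std.testing.expect(tombstoneValue(4) == 4); // below the range
    try std.testing.expect(tombstoneValue(5) == 11); // jumps past [5, 10]
    try std.testing.expect(tombstoneValue(9) == 15); // last available storage value
}
```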
It's probably not clear by casual inspection and I don't know how optional support works in Zig, but in my C++ library, the type
Just chiming in that there's also a nice interaction between this and the "named integer types" shown in https://ziglang.org/devlog/2024/#2024-11-04 for defining named types with massive but specific ranges of valid values.
Zig already has ranged integer types; however, the range is required to be a signed or unsigned power of 2. This proposal is for generalizing them further, to allow any arbitrary range.

Let's consider some reasons to do this:

One common practice for C developers is to use `-1` or `MAX_UINT32` (and related) constants as an in-bound indicator of metadata. For example, the stage1 compiler uses a `size_t` field to indicate the ABI size of a type, but the value `SIZE_MAX` is used to indicate that the size is not yet computed.

In Zig we want people to use Optionals for this, but there's a catch: the in-bound special value uses less memory for the type. In Zig on 64-bit targets, `@sizeOf(usize) == 8` and `@sizeOf(?usize) == 16`. That's a huge cost to pay, for something that could take up 0 bits of information if you are willing to give up a single value inside the range of a `usize`.

With ranged integers, this could be made type-safe:

Now, not only do we have the Optionals feature of zig protecting against accidentally using a very large integer when it is supposed to indicate `null`, but we also have the compile error helping out with range checks. One can choose to deal with the larger ranged value by handling the possibility, and returning an error, or with `@intCast`, which inserts a handy safety check.

How about if there are 2 special values rather than 1? Here, size of N would be 4 bytes.

Let's consider another example, with enums. Enums allow defining a set of possible values for a type:

There are 3 possible values of this type, so Zig chooses to use `u2` for the tag type. It will require 1 byte to represent it, wasting 6 bits. If you wrap it in an optional, that will be 16 bits to represent something that, according to information theory, requires only 2 bits. And Zig's hands are tied; because currently each field requires ABI alignment, each byte is necessary.

If #3802 is accepted and implemented, and the `is_null` bit of optionals becomes `align(0)`, then `?E` can remain 1 byte, and `?E` in a struct with `align(0)` will take up 3 bits.

However, consider if the enum was allowed to choose a ranged integer type. It would choose `@Int(0, 3)`. Wrapped in an optional, it actually could choose to use the integer value 3 as the `is_null` bit. Then `?E` in a struct will take up 2 bits.

Again, assuming #3802 is implemented, Zig would even be able to "flatten" several enums into the same integer:

If you add up all the bits of the tag types, it comes out to 10, meaning that the size of MyFancyType would have to be 2 bytes. However, with ranged integers as tag types, zig would be able to flatten out all the enum tag values into one byte. In fact there are only 21 total tags here, leaving room for 235 more total tags before MyFancyType would have to gain another byte of size.

This proposal would solve #747. Peer type resolution of comptime ints would produce a ranged integer:

With optional pointers, Zig has an optimization to use the zero address as the null value. The `allowzero` property can be used to indicate that the address 0 is valid. This is effectively treating the address as a ranged integer type! This optimization for optional pointers could now be described in userland types.

One possible extension to this proposal would be to allow pointer types to override the address integer type. Rather than `allowzero`, which is single purpose, they could do something like this:

This would also introduce type-safety to using more than just 0x0 as a special pointer address, which is perfectly acceptable on most hosted operating systems, and also typically set up in freestanding environments as well. Typically, the entire first page of memory is unmapped, and often the virtual address space is limited to 48 bits, making `@Int(os.page_size, 1 << 48)` a good default address type for pointers on many targets! Combining this with the fact that pointers also have alignment bits to play with, this would give Zig's type system the ability to pack a lot of data into pointers which are annotated with `align(0)`.

What about two's complement wrapping math operations? Two's complement only works on powers-of-two integer types. Wrapping math operations would not be allowed on non-power-of-two integer types. Compile error.
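For reference, the size cost described at the start of the proposal is easy to observe with the current compiler; a minimal check, assuming a 64-bit target:

```zig
const std = @import("std");

test "an optional usize currently doubles the storage size" {
    // On 64-bit targets the payload is 8 bytes and the is_null flag pads
    // out to another 8 bytes because of ABI alignment.
    if (@sizeOf(usize) != 8) return error.SkipZigTest;
    try std.testing.expect(@sizeOf(?usize) == 16);
}
```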