# Formerly in adv-r
Something interesting occurs if we use `obj_size()` to systematically explore the size of an integer vector. The code below computes and plots the memory usage of integer vectors ranging in length from 0 to 50 elements. You might expect that the size of an empty vector would be zero and that memory usage would grow proportionately with length. Neither of those things is true! This isn't just an artefact of integer vectors: every length-0 vector occupies 40 bytes of memory. \index{vectors!size of}
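The measurement code itself was not preserved in this copy; a minimal reconstruction, assuming `obj_size()` from the lobstr package, looks like this:

```r
# Reconstruction of the measurement described above (assumes the lobstr
# package is installed; obj_size() is lobstr's memory-size function).
library(lobstr)

# Size in bytes of integer vectors of length 0 to 50.
sizes <- sapply(0:50, function(n) as.numeric(obj_size(seq_len(n))))

# Step plot: size does not start at zero, and grows in jumps.
plot(0:50, sizes, xlab = "Length", ylab = "Size (bytes)", type = "s")
```

Exact sizes vary slightly across R versions and platforms, but the two surprises described above (a non-zero intercept and irregular growth) show up everywhere.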
Those 40 bytes are used to store four components possessed by every object in R:

* Object metadata (4 bytes). These metadata store the base type (e.g. integer)
  and information used for debugging and memory management.

* Two pointers: one to the next object in memory and one to the previous
  object (2 * 8 bytes). This doubly-linked list makes it easy for internal
  R code to loop through every object in memory.

* A pointer to the attributes (8 bytes).
All vectors have three additional components: \indexc{SEXP}

* The length of the vector (4 bytes). By using only 4 bytes, you might expect
  that R could only support vectors up to $2 ^ {4 \times 8 - 1}$ ($2 ^ {31}$, about
  two billion) elements. But in R 3.0.0 and later, you can actually have
  vectors up to $2 ^ {52}$ elements. Read [R-internals][long-vectors] to see how
  support for long vectors was added without having to change the size of this
  field. \index{long vectors} \index{atomic vectors!long}

* The "true" length of the vector (4 bytes). This is basically never used,
  except when the object is the hash table used for an environment. In that
  case, the true length represents the allocated space, and the length
  represents the space currently used.

* The data (variable number of bytes). An empty vector has 0 bytes of data.
  Numeric vectors occupy 8 bytes per element, integer vectors 4 bytes, and
  complex vectors 16 bytes.
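These per-element costs are easy to check empirically. The helper below is illustrative (it is not from the original text): it estimates the marginal cost per element by comparing the sizes of two vectors large enough to sit outside the small-vector pool discussed later, again assuming lobstr's `obj_size()`:

```r
library(lobstr)

# Marginal bytes per element, estimated by comparing two vector lengths
# that are both well past the small-vector pool.
per_elem <- function(make) {
  as.numeric(obj_size(make(1000)) - obj_size(make(500))) / 500
}

per_elem(integer)  # typically 4
per_elem(numeric)  # typically 8
per_elem(complex)  # typically 16
```

Using the *difference* of two sizes cancels out the fixed 40-byte overhead, so only the data cost remains.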
If you're keeping count, you'll notice that this only adds up to 36 bytes. The remaining 4 bytes are used for padding so that each component starts on an 8-byte (= 64-bit) boundary. Most CPU architectures require pointers to be aligned in this way, and even if they don't require it, accessing non-aligned pointers tends to be rather slow. (If you're interested, you can read more about it in C structure packing.)
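As a sanity check on the arithmetic, the numbers below simply restate the components listed above:

```r
# Header components of a length-0 vector on a 64-bit build, as listed above.
metadata   <- 4      # base type plus debugging/GC information
pointers   <- 2 * 8  # next and previous object in the doubly-linked list
attributes <- 8      # pointer to the attributes
length     <- 4      # length of the vector
true_len   <- 4      # "true" length of the vector

components <- metadata + pointers + attributes + length + true_len
components      # 36 bytes of components ...
components + 4  # ... plus 4 bytes of padding = 40
```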
This explains the intercept on the graph. But why does the memory size grow irregularly? To understand why, you need to know a little bit about how R requests memory from the operating system. Requesting memory (with `malloc()`) is a relatively expensive operation. Having to request memory every time a small vector is created would slow R down considerably. Instead, R asks for a big block of memory and then manages that block itself. This block is called the small vector pool and is used for vectors less than 128 bytes long. For efficiency and simplicity, it only allocates vectors that are 8, 16, 32, 48, 64, or 128 bytes long. If we adjust our previous plot to remove the 40 bytes of overhead, we can see that those values correspond to the jumps in memory use.
```r
plot(0:50, sizes - 40, xlab = "Length",
  ylab = "Bytes excluding overhead", type = "n")
abline(h = 0, col = "grey80")
abline(h = c(8, 16, 32, 48, 64, 128), col = "grey80")
abline(a = 0, b = 4, col = "grey90", lwd = 4)
lines(sizes - 40, type = "s")
```
Beyond 128 bytes, it no longer makes sense for R to manage vectors. After all, allocating big chunks of memory is something that operating systems are very good at. Beyond 128 bytes, R will ask for memory in multiples of 8 bytes. This ensures good alignment.
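A quick way to see this 8-byte rounding (an illustrative check, not from the original text): past the pool, adding a single 4-byte integer element changes the reported size by a multiple of 8 bytes, never by 4:

```r
library(lobstr)

# Two integer vectors past the 128-byte small-vector pool. Their data
# portions differ by 4 bytes, but allocations are rounded to multiples of 8,
# so the reported sizes differ by 0 or 8 bytes.
delta <- as.numeric(obj_size(integer(101)) - obj_size(integer(100)))
delta %% 8 == 0
```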
### Exercises

1.  Repeat the analysis above for numeric, logical, and complex vectors.

1.  If a data frame has one million rows, and three variables (two numeric,
    and one integer), how much space will it take up? Work it out from theory,
    then verify your work by creating a data frame and measuring its size.

1.  Compare the sizes of the elements in the following two lists. Each
    contains basically the same data, but one contains vectors of small
    strings while the other contains a single long string.
1.  Which takes up more memory: a factor (`x`) or the equivalent character
    vector (`as.character(x)`)? Why?

1.  Explain the difference in size between `1:5` and `list(1:5)`.