-
Notifications
You must be signed in to change notification settings - Fork 393
Containers: When to Use Which
The C++ standard library includes a number of container (i.e., data structure) templates: array
, vector
, list
, stack
, queue
, deque
, map
, set
, unordered_map
, and unordered_set
. The use of good data structures is the key to good performance. So ... when to use each one of these? First, a quick lesson about asymptotic behavior and fixed overheads.
You've probably heard that "std::map
is optimized for search". That is the first part of a larger sentence. The complete version is "std::map
is optimized for search relative to other containers when the number of elements in the container is large, and the larger the number of elements the greater its relative advantage over the other containers." This is the concept of asymptotic cost, which in computer science is written as "O(f(N))" and called "big-O" notation (N
is the number of elements in the data structure).
Searching for an element in a std::vector
has asymptotic behavior of O(N)
, the cost is proportional to N
, the number of elements. Searching for an element in a std::map
has asymptotic behavior of O(lg(N))
, the cost is proportional to lg(N)
(i.e., log base 2 of N). In practice, the growth functions are more complicated than just N
or lg(N)
. There is usually some constant factor applied. For instance, in a std::vector
you typically only have to search half the elements before finding the one you need and so the function is really N/2
, not N
. For similar reasons, the average cost of searching a std::map
is lg(N/2)
or lg(N)-1
, not just lg(N)
. There are also constant factors involved with traversing the data structure that may be different for different data structures, so the costs are really Cv*(N/2)
and Cm*(lg(N)-1)
. But when N
gets very large, the only thing that matters is N
so that is the only part that is written in the notation. Suppose Cv
(the cost of traversing a std::vector
is 1) and Cm
(the cost of traversing a node in std::map
is 3). When N
is 1,000,000, the cost of finding an element in a std::vector
is 500,000, the cost of finding the same element in std::map
is 60. The difference increases as N
gets even larger. When N
is 1,000,000,000, the cost of finding something in a std::vector
is 500,000,000 and the cost of finding it in a std::map
is 90!
Here's the thing about asymptotic behavior and EnergyPlus. In EnergyPlus we never deal with containers of 1,000,000 elements. What is the largest number of elements we deal with? 10,000? 1,000? The costs for searching a std::vector
and a std::map
each with 1,000 elements are 500 and 30, respectively. When you go to 100 elements, the costs become 50 and 20. When you get to 10 elements they are 5 and 9--if you have 10 elements, it's faster to search a std::vector
than a std::map
, not to mention to create a std::vector
than a std::map
! When thinking about which container or data structure to use, it's important not only to think about asymptotic behavior but where that asymptotic behavior takes over (usually somewhere between 10 and 100 elements for most data structures) and whether EnergyPlus is likely to ever hit that crossover point. If you are only expecting 10-20 elements, use the simpler data structure with the lower constant overhead factor. Save the "fancy" data structures that are optimized for certain operations for containers with many elements.
The best container--it least as far as EnergyPlus use cases are concerned but more generally too I suspect--is [std::array
]. (https://en.cppreference.com/w/cpp/container/array). std::array
is an array whose size is fixed at compile-time. This does limit the use of std::array
, but when array size is statically known, it's the best option. std::array
is a low-overhead, pointer-safe veneer on top of a C-style array. The memory footprint of a std::array
is exactly that of a C-style array. Even the number of elements is not saved, it's part of the std::array
template type instance as opposed to individual object instances--yes, std::array
of the same element type but different element counts are considered different types!! There is no secondary heap allocation, construction, or destruction associated with a std::array
. In all other containers, the actual data element storage is allocated dynamically and lives on "the heap" regardless of where the container object itself lives, stack, heap, or static global memory. In a std::array
the data-element storage is the container object. If a std::array
is declared to be on the stack, the elements it contains are also on the stack. Zero-overhead memory management is one advantage of std::array
.
But there's more! Because there is no dynamic memory management, a std::array
can be made constexpr
assuming the elements it contains can also be made constexpr
. constexpr
is one of the great new features of C++ (learn more about it here). constexpr
data is written into a special section in the executable and is loaded into memory for use with no actual code needing to run. Look at the FluidProperties module to see how useful this can be.
We have already hinted at one of the traditional limitations of std::array
, the fact that std::array
of different sizes are considered different types even if they contain the same type of element. This makes using std::array
's as function parameters difficult. Thankfully, C++20 provides a fix (and in the meantime, we are using that fix via the C++ guideline support library). That fix is the std::span
(or for now gsl::span
) template. gsl::span
is a lightweight, storage-less reference template for std::array
, std::vector
, and C-style arrays and pointers. It has the same relationship to std::array
as std::string_view
has to std::string
. gsl::span
contains a pointer to storage and a size element. When a parameter is delcared gsl::span
and an std::array
is passed as an argument, the compiler initializes gsl::span
with a pointer to the std::array
and the number of elements (which is part of the type so the compiler knows it). Within the callee, the number of elements can be retrieved using the .size()
method. Look here for an example of how gsl::span
is used.
If you actually look at the example, you saw that gsl::span
is declared as a value parameter. Aren't objects supposed to be passed by const & rather than by value? In general, yes. But there are exceptions to every rule, and the exceptions to this one are std::string_view
and gsl::span
, both of which have only two members (it's actually faster overall to pass an object with two elements by value than to pass a reference to the object and then access both members in the callee) and because both are essentially references and passing references by reference is considered gauche in StackOverflow circles.
TL;DR std::span
/gsl::span
has removed whatever barriers were remaining to using std::array
. Use this container wherever possible!
There are a significant number of arrays whose sizes are known at compile time, but an even larger number of arrays whose sizes are determined by the contents of the IDF file. For these, you can use std::vector
, EPVector
(a thin veneer on top of std::vector
that provides Fortran-like 1-based indexing functionality), or ObjexxFCL::Array1D
(a from-scratch library that provides all the functionality of Fortran arrays). The goal for EnergyPlus should be to prune Fortran-specific functionality and gradually migrate from ObjexxFCL::Array1D
to EPVector
and finally to std::vector
, this last transition also requires moving to 0-based array indexing. You should give preference to std::vector
, EPVector
, and ObjexxFCL::Array1D
in that order, but for the time being all are fine.
There is more to say about multi-dimensional arrays and arrays of objects vs. arrays of scalars, but those will require separate pages.
As alluded to at the top of this page, these containers do have some real world uses, although not nearly as many as one would think. std::map
and std::unordered_map
are good for large dynamic collections that have to be searched relatively frequently. std::map
is implemented as "red-black" (i.e., self-rebalancing) binary tree. std::unordered_map
is implemented as a hash table (in my experience, hash tables are by far the second most useful real-world data structure after arrays). There is a bit of constant additive overhead associated with inserting into and searching a hash table that is associated with computing the hash function of the element, but overall insertion and search are faster because with a good hash function a hash table is naturally self-balancing. If traversing the elements in some order (e.g., alphabetic) is important, use std::map
otherwise use std::unordered_map
.
A good use of std::unordered_map
is as a temporary data structure used to link permanent data structures together during initialization. Many EnergyPlus objects reference Zone
by name. Once the Zone
array is created and sorted, a std::unordered_map<std::string_view, int> ZoneNameUCtoNum
can be created to help other objects quickly translate Zone
names to indices.
I honestly cannot think of too many EnergyPlus use cases for these data structures. If you find yourself using one of them, ask yourself why.
If you've been programming in C++ for a while, you will have been bitten by this counter-intuitive issue. Don't worry, we all have. Multiple times. What is "this issue" exactly? "This" is the fact that C++ has two polymorphism mechanisms, class
and template
, and although they can be used together as constructs, their individual polymorphism mechanisms are not compatible and do not compose in the way you would intuitively expect. Specifically:
struct BaseGlobalStruct {
};
struct WindowData : BaseGlobalStruct {
};
std::vector<BaseGlobalStruct> baseGlobalStructVec;
std::vector<WindowData> windowDataVec;
Forget for a second that BaseGlobalStruct
is an abstract class (i.e., it has a pure virtual function that cannot be called) and cannot be instantiated. If it could be, would windowDataVec
be a subtype of baseGlobalStructVec
? Would you be able to write a function that specifies a std::vector<BaseGlobalStruct> &
as an argument and pass a std::vector<WindowData>
as a parameter to that function? No and no. No template
instantiation is a subtype of any other template
instantiation even if the template is identical and the template
parameters have a subtype relationship. TANSTATS--there ain't no such thing as template
subtyping.
If you want to exploit element polymorphism within a container, the container has to be defined as holding pointers to the base class, even if it only contains pointers to a subclass.
std::vector<BaseGlobalStruct *> baseGlobalStructPVec;
std::vector<BaseGlobalStruct *> windowDataPVec; // contains pointers to WindowData objects but defined in terms of the parent BaseGlobalStruct
Note, this is a characteristic of all templates. Array1D
and EPVector
will have the same issue.