\chapter{Introduction}
\label{sec-introduction}
Production of integrated circuits has advanced at a consistent and rapid pace for over half a century, doubling transistor density every one to two years in accordance with Moore's law. Only in recent years has the cadence slowed as we approach fundamental physical limits.
While Moore's law is most commonly associated with improvements in performance, it also allowed us to reduce the size of computers: doing the same thing in ever smaller packages. This trend has not been as smooth as the continuous improvements in performance. While any improvement in performance is a direct advantage, a small reduction in size usually is not. However, it enables revolutions at certain thresholds: moving from room-sized computers to home computers and PCs in every home, scaling them down further to portable/laptop computers, and eventually handhelds and smartphones. Once established, each of these then benefitted from Moore's law to improve their capabilities, but the truly disruptive moments are when miniaturisation allowed whole new application areas to emerge.
The term 'ubiquitous computing' was coined by Mark Weiser, predicting in 1991 that computing would move from a dedicated device on the desktop, to devices all around us, from alarm clocks to coffee makers \cite{Weiser:1991wz}. Around the turn of the century, we were able to scale down useful, working devices to the size of a few millimetres \cite{Warneke:2001ui}. This led to the start of research into Wireless Sensor Networks (WSN): many small and inexpensive sensor nodes, often called 'motes', working together to perform continuous automated sensing tasks.
Many promising WSN applications were proposed, ranging from military applications \cite{Arora:2004}, to precision agriculture \cite{Langendoen:2006un}, habitat monitoring \cite{Mainwaring:2002wb}, and environmental monitoring \cite{WernerAllen:2006ta, Chang:2010ek}. While the applications vary greatly, the hardware platforms used to build WSN applications are all quite similar, and usually very resource-constrained.
Sensor nodes are typically small and battery powered, and many applications require a lifetime measured in weeks or months rather than hours, so maintaining a very low power consumption is critical. To achieve this, the CPUs used for sensor nodes are kept very simple. While they lack most of the advanced features found in modern desktop CPUs, they typically do have several sleep modes, allowing them to reduce power consumption by over 99.99\% \cite{Atmel:ATmega128Datasheet}. Extremely long battery life is possible by keeping the CPU in sleep mode for most of the time, only occasionally waking up to perform its sensing task. Since RAM requires power to maintain its state even in sleep mode, sensor nodes are usually limited to only a few KB of RAM, a full six orders of magnitude less than most modern computers.
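To illustrate how strongly duty cycling determines battery life, consider the following back-of-the-envelope model. All numbers are hypothetical but in the range of typical parts (8 mA active, 1 \textmu A in power-down, a 2000 mAh battery), and the radio and sensors are ignored entirely, so this is an upper bound on CPU-limited lifetime, not a prediction for any real deployment.

```python
def battery_life_hours(duty_cycle, active_ma=8.0, sleep_ua=1.0,
                       capacity_mah=2000.0):
    """Battery lifetime in hours under duty cycling, CPU only.

    duty_cycle is the fraction of time the CPU is active; the radio and
    sensors are deliberately ignored (hypothetical illustration).
    """
    avg_ma = duty_cycle * active_ma + (1 - duty_cycle) * sleep_ua / 1000.0
    return capacity_mah / avg_ma

print(round(battery_life_hours(1.0)))   # always active: ~250 hours
print(round(battery_life_hours(0.01)))  # 1% duty cycle: tens of thousands of hours
```

Even at a 1\% duty cycle the average current drops by two orders of magnitude, which is why keeping the CPU asleep dominates every other software design decision on these platforms.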
In 2001 Pister predicted that by 2010, the size of these devices would be reduced to a cubic centimetre, and cost to less than a dollar \cite{Pister:2001vr}. While the first prediction has come true \cite{Wang:2014cq}, the latter so far has not. Future improvements in IC technology may allow more powerful devices at the same level of cost and power consumption, but for many applications an increase in battery lifetime or a reduction in cost may be more valuable, and may enable new applications not possible at the current level of technology.
Thus, much of the research into WSN is about the trade-offs involved in achieving useful functionality in as small a space as possible, gradually exploring the design space between capabilities, accuracy and performance on one side, and their cost in terms of memory and power consumption on the other. New protocols were developed at every layer in an application, optimising them for the specific constraints of sensor nodes. This includes lightweight MAC protocols for radio communication, trading latency for energy by turning off the radio as often as possible \cite{Ye:2002uv, vanDam:2018tr}, lightweight operating systems and virtual machines, trading functionality for reduced size and complexity \cite{Levis:2004ws, Gu:2006ww, Han:2005th, Levis:2002ku, Brouwers:2009cj}, lightweight routing and data aggregation \cite{Intanogonwiwat:2018wz, Braginsky:2002wg}, lightweight data compression and reprogramming techniques, trading CPU cycles for a reduction in transmitted bits \cite{Marcelloni:2009ja, Reijers:2003ww}, lightweight localisation, trading accuracy for reduced complexity \cite{Niculescu:2001bl, Savarese:2002tx, Savvides:2002uf}, etc.
What these have in common is that they all revisit classic computer science problems, and adjust them to fit on a sensor node, making trade-offs to optimise for power consumption, either directly by reducing the time the processor or radio is active, or indirectly by reducing code size and memory consumption enough for them to run on the extremely low power, but very resource-constrained CPUs.
\section{Internet-of-Things}
\label{sec-introduction-iot}
Recent research into the Internet-of-Things (IoT) focuses on connecting many everyday objects and building smart applications with them. In this vision, similar to Weiser's ubiquitous computing, any object could be connected to the internet, and cooperate to achieve useful goals. For example, a house with a smart air-conditioning system may use sensors in each room, weather forecast information downloaded from the internet, past data on how the house responds to weather changes, and the user's current location, and combine all this information to conserve energy while making sure the house is at a comfortable temperature when the user gets home.
While IoT and WSN overlap and the two terms are sometimes used interchangeably, an important difference is that in WSN research, applications typically consist of a large number of homogeneous and resource-constrained nodes, whereas IoT devices come in a wide range, with vastly different performance characteristics, cost, and power requirements.
On one end of this spectrum are devices like the Intel Edison and Raspberry Pi. These are the result of another decade of miniaturisation since the beginning of WSN research, and are basically a complete PC in a very small form factor. They are powerful enough to run a normal operating system like Linux, but relatively expensive and power hungry. On the other end are traditional WSN CPUs like the Atmel ATmega or Texas Instruments MSP430: much less powerful, but also much cheaper and low power enough to potentially last for months or years on a single battery. Since both classes of devices have such different characteristics, solutions that are appropriate for one usually do not work for the other.
A second important difference between WSN and IoT applications is that in WSN applications, the network is usually dedicated to a specific task and the hardware is an integral part of the design of the application. In the broadest IoT vision, the smart devices in the user's environment cooperate to implement new applications, but these devices may come from many different vendors and may not be specifically designed for the application the user wants to run. Coming back to the example from Weiser's paper \cite{Weiser:1991wz}, it is unlikely a user would be willing to buy a matching pair of a coffee maker and an alarm clock, just so that they will work together to have his coffee ready in the morning. The challenge for IoT is to allow different smart coffee makers and smart alarm clocks to be programmed in such a way to enable this application.
Thus, many IoT applications are inherently heterogeneous, and as Gu points out \cite{Gu:2006ww}, even when powerful devices like the Raspberry Pi are used, it is not unusual for low power devices to be included to form a hybrid network and take advantage of their extremely long battery lifetime. One of the main challenges then becomes how to programme these networks of IoT devices.
\section{Virtual machines}
The use of virtual machines has been common in desktop computing for a long time, with Java and .Net being the most well-known examples. There are several advantages to using VMs, the most obvious one being platform independence. Java enables a vast number of different models of Android phones to run the same applications. In the heterogeneous environments IoT applications are expected to run in, a VM can significantly ease deployment if the same programme can run on any node, regardless of its hardware platform. A second advantage is that a VM can offer a safe execution environment, preventing buggy or malicious code from disabling the device.
% A final often mentioned advantage is ease of programming. Many VMs allow the developer to write programmes at a higher level of abstraction than the bare-metal C programming that is still common for sensor nodes. However, it could be argued that this could also be achieved using more high level languages that compile to native code, so the two advantages of VMs that we will focus on in this dissertation are safety and platform independent reprogramming.
Since the early days of WSN research, many VMs, some based on Java and .Net, have been developed to run on resource-constrained sensor nodes. They manage to pack an impressive set of features on such a limited platform, but sacrifice performance, and usually do not provide a safe execution environment.
\subsection{Performance degradation}
\label{sec-introduction-performance}
The VMs for which we have found concrete performance data are shown in Table \ref{tbl-slowdown-for-sensornode-vms}. The best case, the 4x slowdown seen in one of SensorScheme's benchmarks, is a tiny benchmark that does a single call to a random number generator, so it only tells us a function call takes about 3 times longer than generating the random number. Apart from this single data point, all interpreting VMs are between one and two orders of magnitude slower than native code.
\begin{table}
\caption{Slowdown for interpreting sensor node VMs}
\label{tbl-slowdown-for-sensornode-vms}
\begin{tabular}{lllr} % NO SIMULATION DATA
\toprule
VM & Source & Platform & Performance vs. native C \\
\midrule
\midrule
Darjeeling & Delft University & ATmega128 & 30x-113x slower \cite{Brouwers:2009cj} \\
& ~~ of Technology & & \\
TakaTuka & University of Freiburg & Mica2 (AVR) and & 230x slower \cite{Ellul:2012thesis} \\
& & JCreate (MSP) & \\
TinyVM & Yonsei University & ATmega128 & 14x-72x slower \cite{Hong:2012wj} \\
DVM & UCLA & ATmega128L & 108x slower \cite{Balani:2006} \\
& & & 555x slower \cite{Kumar:2007ge} \\
SensorScheme & University of Twente & MSP430 & 4x-105x slower \cite{Evers:2010ur} \\
% Ellul's work \cite{Ellul:2012thesis} & University of Southampton & MSP430 & 2.2x-9x slower \\
\bottomrule
\end{tabular}
\end{table}
In many scenarios such a slowdown is not acceptable, for two reasons. First, for many tasks, such as periodic sensing, there is a hard limit on the amount of time that can be spent on each measurement, and an application may not be able to tolerate a slowdown of this magnitude. For applications that sample close to the maximum rate a node can process, \emph{any} reduction in performance directly translates to a reduction in the sampling rate.
Perhaps more importantly, one of the main reasons for using such tiny devices is their extremely low power consumption. In many applications the CPU is expected to be in sleep mode most of the time, so little energy is spent on the CPU compared to communication or sensors. However, if the slowdown incurred by a VM means the CPU has to stay in active mode 10 to 100 times longer, this means 10 to 100 times more energy is spent on the CPU and it may suddenly become the dominant factor and reduce battery lifetime.
To illustrate this we will look at three concrete examples below.
\subsubsection{Mercury}
Few WSN or IoT applications report a detailed breakdown of their power consumption. One that does is a platform for motion analysis called Mercury \cite{Lorincz:2009kt}. The data reported in their paper is copied in Table \ref{tbl-mercury-energy}. The greatest energy consumer is the sampling of a gyroscope, at 53,163 μJ. Only 1,664 μJ is spent in the CPU on application code for an activity recognition filter and feature extraction. When multiplied by 10 or 100 however, the CPU becomes a significant, or even by far the largest energy consumer.
\begin{table}
\caption[Energy consumption breakdown for the Mercury motion analysis application]{Energy consumption breakdown for the Mercury motion analysis application, source: \cite{Lorincz:2009kt}}
\label{tbl-mercury-energy}
\begin{tabular}{lr} % NO SIMULATION DATA
\toprule
Component & Energy (μJ) \\
\midrule
\midrule
Sampling accel & 2,805 \\
CPU (activity filter) & 946 \\
Radio listen (LPL, 4\% duty cycle) & 2,680 \\
Time sync protocol (FTSP) & 125 \\
Sampling gyro & 53,163 \\
Log raw samples to flash & 2,590 \\
Read raw samples from flash & 3,413 \\
Transmit raw samples & 19,958 \\
\midrule
Compute features & 718 \\
Log features to flash & 34 \\
Read features from flash & 44 \\
Transmit features & 249 \\
\midrule
512-point FFT & 12,920 \\
\bottomrule
\end{tabular}
\end{table}
Table \ref{tbl-mercury-energy} also shows that transmitting raw data is a major energy consumer. To reduce this, Mercury has the option of first extracting features from the raw sensor data, and transmitting these instead, achieving a 1:60 compression. Mercury has five feature detection algorithms built in: maximum peak-to-peak amplitude; mean; RMS; peak velocity; and RMS of the jerk time series, but the authors note that the exact feature extractor may be customised by an application.
This is the kind of code we may want to update at a later time, where using a VM could be useful to provide safety and platform independence. However, at more than a 35x slowdown in the feature extraction algorithm, which is in the range seen for most VMs, this would be pointless, because more energy would then be spent in the CPU than would be saved on transmission.
Finally, a more complex operation such as a 512-point FFT costs 12,920 μJ. For tasks like this, even a slowdown by a much smaller factor will have a significant impact on the total energy consumption.
\subsubsection{Lossless compression}
As a second example we consider lossless data compression. Since the radio is often one of the major power consumers on mobile devices, and one of the main tasks of sensor nodes is collecting and transmitting data, compressing this data before it is sent is an attractive option to conserve energy spent on transmission. However, energy must also be spent on CPU cycles during compression. Barr and Asanovi\'c have analysed this trade-off for five compression algorithms on the Skiff research platform, with hardware similar to the Compaq iPAQ. The break-even point was found to be at about 1000 instructions for each bit saved, beyond which the energy spent on compression starts to outweigh the energy saved on transmission. In their experiments, several combinations of compression algorithm and input data exceeded this point, so that compression led to an increase in total energy consumption compared to sending uncompressed data \cite{Barr:2006vg}.
Most of the traditional algorithms, such as the ones considered by Barr and Asanovi\'c, are too complex to run on a sensor node, so specialised compression algorithms have been developed for sensor nodes. One such algorithm is LEC \cite{Marcelloni:2009ja}, a simple lossless compression algorithm that can be implemented in very little code and only needs to maintain a few bytes of state. We will use a rough calculation to show why performance is also a concern for the simpler compression algorithms developed for sensor nodes.
Using the power consumption data from the datasheet for the Atmel ATmega128 CPU \cite{Atmel:ATmega128Datasheet}, we estimate the energy per CPU cycle in active mode. Running at 8 MHz and 2.7 V, the ATmega consumes 7.5 mA.
\begin{equation}
\frac{7.5\,\mathrm{mA} \times 2.7\,\mathrm{V}}{8\,\mathrm{MHz}} = 2.53\,\mathrm{nJ/cycle}
\end{equation}
We can do a similar calculation to determine the energy per bit for the Chipcon CC2420 IEEE 802.15.4 transceiver \cite{Chipcon:CC2420Datasheet}. This radio can transmit at 250 kbps, but the small frame size of the 802.15.4 protocol introduces a relatively large overhead, and Latr\'e et al. calculate a maximum throughput of 163 kbps \cite{Latre:2006wr}. The CC2420 can also operate at 2.7 V, and consumes between 8.5 and 17.4 mA depending on transmission power. Using the higher value, so that compression will be more worthwhile, this yields
\begin{equation}
\frac{17.4\,\mathrm{mA} \times 2.7\,\mathrm{V}}{163\,\mathrm{kbps}} = 288.2\,\mathrm{nJ/bit}
\end{equation}
As a result, we can spend $288.2/2.53 \approx 114$ cycles per bit to reduce the size of the transmitted data, and still conserve energy using compression.
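The break-even calculation is easy to reproduce. The sketch below recomputes the per-cycle and per-bit energy figures from the same datasheet values used in the equations above; no new measurements are introduced.

```python
def energy_per_cycle(current_a, voltage_v, freq_hz):
    """Energy per CPU cycle in joules: P = I*V, divided by cycles per second."""
    return current_a * voltage_v / freq_hz

def energy_per_bit(current_a, voltage_v, bitrate_bps):
    """Energy per transmitted bit in joules: P = I*V, divided by bits per second."""
    return current_a * voltage_v / bitrate_bps

e_cycle = energy_per_cycle(7.5e-3, 2.7, 8e6)   # ATmega128: ~2.53 nJ/cycle
e_bit = energy_per_bit(17.4e-3, 2.7, 163e3)    # CC2420:    ~288.2 nJ/bit
break_even = e_bit / e_cycle                   # cycles we may spend per saved bit
print(round(break_even))                       # ~114
```

Any compression scheme that spends fewer than roughly 114 cycles per bit it saves is a net win on this hardware combination.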
We implemented the LEC compression algorithm and used it to compress a dataset of 256 16-bit ECG measurements \cite{physionet-ecg-data}, or 4096 bits of data. The results are shown in Table \ref{tbl-lec-energy}. LEC compression reduced the dataset to 1840 bits, saving 2256 bits, or 650 μJ on transmitting the data, at the expense of 246 μJ extra energy spent in the CPU.
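For reference, the essence of such a difference-based encoder fits in a few lines. The sketch below only computes the size of the compressed output, not the bitstream itself, and it is an illustration rather than the reference implementation: the prefix-length table (LEC reuses the JPEG table for DC coefficient differences) and the raw transmission of the first sample are assumptions here, so it will not reproduce the exact figures in Table \ref{tbl-lec-energy}.

```python
def lec_style_compressed_bits(samples, sample_bits=16):
    """Estimate the output size in bits of an LEC-style difference encoder.

    Each sample is replaced by its difference from the previous one; a
    difference whose magnitude needs n bits costs prefix_len[n] + n bits.
    The prefix lengths follow the JPEG DC-difference table (assumed, not
    verified against the LEC paper); the first sample is sent raw.
    """
    prefix_len = [2, 3, 3, 3, 3, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
    bits = sample_bits           # first sample transmitted uncompressed
    prev = samples[0]
    for s in samples[1:]:
        n = abs(s - prev).bit_length()  # bit group of the difference
        bits += prefix_len[n] + n
        prev = s
    return bits

# A slowly varying signal compresses well: small differences, short codes.
readings = [1000 + (i % 4) for i in range(256)]
print(lec_style_compressed_bits(readings), "vs", 256 * 16, "bits raw")
```

The entire encoder state is one previous sample and a bit counter, which is why LEC fits comfortably in a few bytes of RAM on a sensor node.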
\begin{table}
\caption{LEC compression energy savings}
\label{tbl-lec-energy}
\begin{tabular}{lr} % UPDATED 20180327
\toprule
Energy consumption \\
~~~ ATmega128 per cycle & 2.53 nJ/cycle \\
~~~ CC2420 per transmitted bit & 288.2 nJ/bit \\
\\
LEC compression \\
~~~ bits saved & 2256 bits \\
~~~ cycles spent & 97052 cycles\\
~~~ cycles per bit & 43 cycles/bit \\
\\
Energy saved & 650 μJ \\
Energy expended & 246 μJ \\
Ratio saved/expended & 2.6x \\
\bottomrule
\end{tabular}
\end{table}
This shows that for this combination of hardware and sensor data, LEC compression is effective in reducing total energy consumption. However, the energy saved is only 2.6x more than the extra energy expended on the CPU. This means that if running the compression algorithm in a VM slows it down by a factor of more than 2.6x, this would tip the balance in favour of simply sending the raw data.
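The break-even slowdown follows directly from the figures in Table \ref{tbl-lec-energy}. A short sketch (the only modelling assumption is that CPU energy scales linearly with the slowdown factor):

```python
E_CYCLE = 2.53e-9   # J per ATmega128 cycle (from the datasheet calculation)
E_BIT = 288.2e-9    # J per CC2420 transmitted bit

bits_saved = 2256       # 4096 raw bits reduced to 1840 compressed bits
cycles_spent = 97052    # cycles used by LEC on the ECG dataset

energy_saved = bits_saved * E_BIT       # ~650 uJ saved on transmission
energy_spent = cycles_spent * E_CYCLE   # ~246 uJ spent on the CPU

# A VM slowdown of factor k scales the CPU cost to k * energy_spent,
# so compression stops paying off once k exceeds saved/spent.
max_slowdown = energy_saved / energy_spent
print(round(max_slowdown, 1))           # ~2.6
```

At one to two orders of magnitude slowdown, as seen for the interpreting VMs in Table \ref{tbl-slowdown-for-sensornode-vms}, this margin is exceeded many times over.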
Where exactly this break-even point lies depends on many factors. The CPU and radio used in this calculation are common in typical sensor nodes: the widely used Telos platform \cite{Polastre:2005ut} uses the CC2420 radio, and the ATmega CPU is found in many Arduinos. But many parameters will affect the results. Power consumption is roughly linear in clock frequency, so at the same voltage the cost per cycle will be similar, but at lower frequencies the CPU can operate at a lower voltage, which lowers the cost per cycle. The cost per bit depends on many factors, including link quality. A bad link will increase the cost per bit due to retransmissions, but if the node transmits at a lower power, for example to reduce the number of neighbours and collisions, the cost per bit will be lower than calculated.
This calculation is another example of a situation where the slowdown caused by current sensor node VMs means compression is not worthwhile. For Mercury's feature extraction, the break-even point is around 35x slowdown, while for this case of LEC compression it is at only 2.6x. The exact numbers depend on the application, but it is clear that a slowdown of one to two orders of magnitude will affect battery lifetime in many applications.
\subsubsection{Amulet}
A final example to motivate both the benefit of using a VM and the need for good performance is the smart watch platform Amulet \cite{Hester:2016je}. Amulet aims to provide a week-long or month-long battery life time. To do so it uses a typical low-power resource-constrained CPU, the MSP430.
Amulet supports multiple concurrent applications. Currently these are written in Amulet C, and are compiled together into a single firmware image that is loaded onto a watch. However, the authors envision a future with multiple vendors of Amulet devices, who will no doubt use different hardware platforms, and many developers submitting applications to a sort of app store. In such a scenario the platform independence a VM can provide is a valuable property.
One of the main design goals of Amulet is long battery life, and although the authors do not provide a breakdown of energy consumption per component like Mercury, they do note that “Energy consumption is significantly impacted by the fraction of time the application microcontroller (MSP430) is active”. This suggests a large performance overhead would significantly reduce battery lifetime.
\begin{table}
\caption[Sleep and active time for Amulet applications]{Sleep and active time for Amulet applications, source: \cite{Hester:2016je}}
\label{tbl-amulet-applications-sleep}
\begin{tabular}{lrrr} % NO SIMULATION DATA
\toprule
Application & \%Sleep & \%OS & \%App \\
\midrule
\midrule
Clock & 98.1 & 0.9 & 1.0 \\
EMA & 98.2 & 1.0 & 0.8 \\
Heart rate & 91.1 & 0.9 & 8.0 \\
Pedometer & 93.8 & 2.2 & 4.0 \\
Pedometer+HR & 87.5 & 1.9 & 10.6 \\
Pedometer+HR+Clock & 85.4 & 2.8 & 11.8 \\
\bottomrule
\end{tabular}
\end{table}
The paper provides an overview of the percentage of time spent in sleep mode, in the OS, or in application code for a few different applications. This is reproduced in Table \ref{tbl-amulet-applications-sleep}. The percentage of time the CPU is executing application code varies between 0.8\% and 11.8\%. Combined with the one to two orders of magnitude slowdown seen in Table \ref{tbl-slowdown-for-sensornode-vms}, there are many cases where the slowdown of a VM would mean the CPU has to be active for more than 100\% of the time: it simply cannot complete all its tasks in time.
For the most expensive configuration that spends 11.8\% executing application code, this happens at a slowdown of roughly 8.5x. The lighter applications can tolerate more overhead, but the highest slowdowns seen in Table \ref{tbl-slowdown-for-sensornode-vms} are a problem for even the lightest application.
\subsubsection{Ahead-of-Time compilation}
Thus, a better performing VM is needed, preferably one that performs as close to native performance as possible. Translating bytecode to native code is a common technique to improve performance in desktop VMs. Translation can occur at three moments: offline, ahead-of-time (AOT), or just-in-time (JIT).
JIT compilers translate only the necessary parts of bytecode at run time, just before they are executed. They are common on desktops and on more powerful mobile devices, but are impractical on sensor node platforms, some of which can only execute code from flash memory. This means a JIT compiler would have to write to flash memory at run time, which is expensive and would cause unacceptable delays. There are nodes that can execute code from RAM, but the small amount of RAM present on sensor nodes means a JIT compiler would either have to allocate a large part of this scarce resource to the compiled code cache, or frequently recompile the same methods if they get flushed when the cache overflows \cite{Ellul:2012thesis}.
Translating to native code offline, before it is sent to the node, has the advantage that more resources are available for the compilation process. We do not have a Java compiler that compiles to our sensor node's native code to test the resulting performance, but we would expect it to come close to compiled C code in many cases. However, doing so, even if only for small, performance-critical sections of code, sacrifices two of the key advantages of using a VM: the host now needs knowledge of the target platform, and needs to prepare a different binary for each type of CPU used in the network, and for the node it will be more difficult to provide a safe execution environment when it receives binary code.
Therefore, this dissertation will focus on the middle option: translating the bytecode to native code on the node itself, at load time. We will build on previous work by Joshua Ellul \cite{Ellul:2012thesis} on AOT translation on sensor nodes. This approach reduces performance overhead to a slowdown of 3x to 20x, significantly faster than the interpreting VMs, but not fast enough for LEC compression to be worthwhile in our example. Unfortunately, it also results in an increase in the size of the stored programmes of up to 5.5x, which limits the size of the programmes that can be loaded on a node.
\subsection{Safety}
\label{sec-introduction-safety}
Low-cost low-power sensor node CPUs have a very simple architecture. They typically do not have a memory management unit (MMU) or privileged execution modes to isolate processes. Instead, the entire address range is accessible from any part of the code running on the device.
At the same time, sensor node code can be quite complex. While programming in a high-level language can reduce the risk of programming errors, the limited resources on a sensor device often still force us to use more low-level approaches to fit as much functionality and data on a device as possible. For example, we may store data in simple byte arrays instead of using more expensive objects, and in a few cases we have had to explicitly set a variable to \mycode{null} to allow an object to be garbage collected earlier than it otherwise would have been. In such an environment, mistakes are easily made, and with full access to the entire address space they can have catastrophic consequences. A second threat comes from malicious code. As IoT applications become more widespread, so do the attacks against them, and the unprotected execution environment of sensor node CPUs makes them an attractive target.
To guard against both buggy code and malicious attacks, the application should be executed in a sandboxed manner and isolated from the VM itself. Specifically, we want to guarantee that untrusted code cannot:
\begin{enumerate}
\item Write to memory outside the range assigned by the VM.
\item Perform actions it does not have permission for.
\item Retain control of the CPU indefinitely.
\end{enumerate}
Note that these guarantees do not assure the correctness of the application itself: buggy code may still corrupt its own state. More fine-grained checks can be useful to protect the state of the application and can speed up the development process by detecting bugs earlier. Safe TinyOS \cite{Cooprider:2007ub} adds run-time checks to detect illegal writes, and can do so efficiently by analysing the source code before it is compiled. However, this depends on the host and does not protect against malicious code being sent to the device.
Our approach depends only on the correctness of the VM, and guarantees it can always regain control of the node and terminate any misbehaving application before it executes an illegal write or performs an action it is not permitted to.
\subsubsection{Amulet}
The Amulet \cite{Hester:2016je} smart watch platform is also a good motivating example for the need for a safe execution environment. Since it aims to run multiple concurrent applications possibly developed by different developers, it is important to isolate these applications from each other and from the OS.
Amulet does this by using a restricted dialect of C, Amulet C, which has several limitations that make it easier to guarantee safety. For example, there is no access to arbitrary memory locations through pointers, no goto statements, no dynamic memory allocation, and no recursion. The Amulet compiler then adds runtime checks where static safety checks are not sufficient.
The authors do not provide any data on the overhead caused by the restrictions in Amulet C and the added run time checks, but it is significant enough to motivate them to investigate the use of memory protection units found on some recent CPUs to reduce this overhead \cite{Hardin:2017cq}. However, these MPUs are only available on a limited number of CPUs.
To use a VM in a project such as Amulet, it must be able to provide the same level of isolation, and do so at an acceptable cost.
\section{Scope}
\label{sec-introduction-scope}
Internet-of-Things devices come in a wide range with varying capabilities. The larger IoT platforms are powerful enough to run standard operating systems, tools and languages, and VMs are well established as a good way to programme devices powerful enough to run advanced JIT compilers. This dissertation focuses specifically on small sensor nodes for which no such standards exist. These platforms have the following characteristics:
\begin{itemize}
\item Separate data and programme memory: Memory is split into RAM for data, and flash memory for code. While some devices can execute code from both RAM and flash \cite{TexasInstrumentsIncorporated:MSP430F1611Datasheet}, others cannot \cite{Atmel:ATmega128Datasheet}.
\item Very limited memory: Since volatile memory consumes energy even when the CPU is in sleep mode, it is typically restricted to 10 KB of RAM or less. More non-volatile flash memory is available to hold programmes, but at 16 to 256 KB this is still very limited.
\item Low complexity CPUs: While usually rich in IO to drive actuators and read from sensors, these CPUs are otherwise very simple in design to save cost and power. Instructions usually take a fixed number of cycles; since memory is on-chip, access times are constant and fast; and there are no complicating factors such as deep pipelines, caches, or branch predictors.
\item Very limited energy budget: Typical usage scenarios demand a long battery lifetime, since frequent replacement or recharging would be too impractical or costly. All aspects of WSN software design are therefore focused on minimising energy consumption.
\end{itemize}
The focus on sensor nodes raises the question of how platform independent CapeVM is, since IoT applications may mix both classes, and the ability to use a single binary to programme multiple classes of devices is one of the main advantages of using a VM. While the optimisations proposed in this dissertation are specifically developed to work on a resource-constrained sensor device, the bytecode is very close to standard JVM bytecode, so a more powerful device would have no problem running it just as efficiently as it runs normal Java code.
\begin{figure}
\centering
\includegraphics[width=0.8\linewidth]{compilation-process-highlevel.eps}
\caption{High-level overview of the compilation process}
\label{fig-compilation-process-highlevel}
\end{figure}
\paragraph{AOT compiler}
Figure \ref{fig-compilation-process-highlevel} shows the high level process from source code to native code running on the node. The host PC will compile source code to bytecode, which is uploaded to the target node. Instead of interpreting this bytecode, our VM will translate it at load time and store the resulting native code in flash, which is then executed.
We will see that in order to achieve good performance, an optimising source-to-bytecode compiler is necessary to generate high quality bytecode. However, the focus in this dissertation is on the AOT compiler running on the sensor node, and on what performance it can deliver, given good quality bytecode.
Although we will make some changes to the way the bytecode is produced, building a full optimising source-to-bytecode compiler is outside the scope of this dissertation. To determine what level of performance is possible, some manual optimisations are applied to the source code to produce better quality bytecode, most of which an optimising compiler could be expected to do automatically.
\paragraph{Source language}
Note that in Figure \ref{fig-compilation-process-highlevel} we use ``source code'' and ``bytecode'' instead of Java and JVM bytecode, since the use of Java was motivated only by the availability of existing work to build upon, most notably the Darjeeling VM \cite{Brouwers:2009cj}, not because we believe Java to be a particularly good choice. The techniques described in this dissertation are not specific to Java, but can be applied to any language that compiles to a stack-based bytecode.
We will see that in fact Java, in its current form, is not well suited for sensor node applications, and we will end this dissertation with a number of suggestions on how to remedy some of the issues we encountered.
\newpage
\section{Research questions and contributions}
\label{sec-introduction-research-questions}
There are clear advantages to using virtual machines, especially in heterogeneous and dynamic IoT scenarios, but these come at a cost. On sensor nodes this cost is especially significant, since these devices are already very resource-constrained and cannot use many of the optimisations found in VMs on larger devices.
The main research question of this dissertation is whether virtual machines are a suitable means to programme resource-constrained sensor nodes from three perspectives:
\begin{enumerate}
\item[a.] Performance: How close can an Ahead-of-Time compiling sensor node VM come to native C performance, and what are the trade-offs?
\item[b.] Safety: Can a VM be an efficient way to provide a safe, sandboxed execution environment on a sensor node?
\item[c.] Language: Is Java a suitable language for a sensor node VM, and how may it be improved?
\end{enumerate}
\noindent
This dissertation makes the following contributions:
\begin{enumerate}
\item We identify the major sources of overhead when using the Ahead-of-Time compilation technique described by Ellul and Martinez \cite{Ellul:2010iw, Ellul:2012thesis}.
\item Using the results of this analysis, we propose a set of eleven optimisations to address each source of overhead. These include improvements to Ellul's AOT approach, modifications to the VM's bytecode, and a lightweight alternative to standard Java method invocation.
\item We show that in addition to these improvements to the AOT compiler, better optimisation in the Java to bytecode compiler is critical to achieving good performance.
\item We describe a number of checks that are sufficient for the VM to provide a safe, sandboxed execution environment, and show that most of these checks can be done at load time, reducing the overhead of run-time checks.
\item We evaluate our optimisations using a set of benchmarks with varying characteristics, including the commonly used \mybench{CoreMark} benchmark \cite{coremark} and a number of real-world sensor node applications. We show these optimisations:
\begin{itemize}
\item Reduce the size of the generated code by 40\%, allowing larger programmes to be loaded, and quickly compensating for the increase in VM size due to these optimisations,
\item Allow constant data to be placed in flash memory, enabling applications with constant tables that otherwise could not be implemented,
\item Eliminate 91\% of the performance overhead caused by the VM's stack-based architecture, and 82\% of performance overhead overall.
\end{itemize}
\item Using the same benchmarks, we evaluate the cost of our safety checks, and show this cost to be comparable to, or lower than, that of the two existing native code approaches for providing a safe execution environment on a sensor node, while additionally providing platform independence.
\item Finally, we identify a number of aspects of the Java language and virtual machine that ultimately make it a bad match for sensor nodes, and propose ways to address these issues in future sensor node VMs.
\end{enumerate}
\section{Dissertation outline}
Chapter 2 introduces necessary background knowledge on wireless sensor networks, Java and the Java virtual machine, and AOT compilation.
Chapter 3 discusses the state of the art in improving performance for sensor node VMs and providing a safe execution environment.
Chapter 4 describes the global design of CapeVM.
Chapter 5 first analyses the sources of overhead for the current state of the art in sensor node AOT compilers, and then presents a set of optimisations to address each of these sources. Where there were multiple options to implement an optimisation, we describe alternatives and motivate our choice.
Chapter 6 presents a set of checks that allow CapeVM to provide the sandbox safety guarantees described in Section \ref{sec-introduction-safety}, and shows how the more structured design of the VM's bytecode, compared to native code, allows for most of these checks to be done at load time.
Chapter 7 evaluates the effect of the optimisations presented in Chapter \ref{sec-optimisations} and the cost of the safety checks presented in Chapter \ref{sec-safety}.
Chapter 8 describes a number of issues we encountered while doing this work, which show that standard Java is not the best choice for a sensor node VM, and suggests ways to address these issues in future sensor node VMs.
Chapter 9 concludes this work.
\section{List of publications}
Part of the work presented in this dissertation is based on previously published reports or conference proceedings:
\begin{itemize}
\item N. Reijers, K. J. Lin, Y. C. Wang, C. S. Shih, and J. Y. Hsu. Design of an Intelligent Middleware for Flexible Sensor Configuration in M2M Systems. \emph{SENSORNETS: Proceedings of the Second International Conference on Sensor Networks}, Feb. 2013.
\item N. Reijers and C. S. Shih. Ahead-of-Time Compilation of Stack-Based JVM Bytecode on Resource-Constrained Devices. \emph{EWSN: Proceedings of the Fourteenth International Conference on Embedded Wireless Systems and Networks}, Feb. 2017.
\item N. Reijers and C. S. Shih. Improved Ahead-of-Time Compilation of Stack-Based JVM Bytecode on Resource-Constrained Devices. \emph{arXiv:1712.05590}, Dec. 2017.
\item N. Reijers, J. Ellul, and C. S. Shih. Making sensor node virtual machines work for real-world applications. \emph{IEEE Embedded Systems Letters}, in press.
\end{itemize}
\section{Naming}
In the literature, various names are used for our target devices. The word \emph{mote} was common in early research on wireless sensor networks, as was \emph{sensor node}, or simply \emph{device}, although the latter is also used for larger, more powerful devices. In this dissertation we use the terms \emph{sensor node} and \emph{node} interchangeably to refer solely to the type of severely resource-constrained device used in both WSN and IoT applications.
When reprogramming these nodes, a new programme must be sent to them by a more powerful device. This role is referred to in the literature by various names, including \emph{server}, \emph{gateway}, \emph{master}, \emph{controller}, or \emph{host}, depending on the exact design of the network and the way it is reconfigured. For the work in this dissertation these differences are not relevant, and we will use \emph{host} to refer to the source of the code that is uploaded to the sensor node, which is assumed to be a more powerful device with desktop-class processing capabilities.
Since our VM is an Ahead-of-Time compiler, the Java source code is transformed into native code in two steps. We will use \emph{compile time} to refer to the compilation of Java code to bytecode on the host, and \emph{translation time} to refer to the translation of this bytecode into native code on the device.
Finally, we follow Dalvik in naming our virtual machine after a coastal town: in this case, the beautiful city of Cape Town, where parts of CapeVM were developed over the course of two trips.
\begin{figure}
\centering
\includegraphics[width=\linewidth,angle=270]{capetown.JPG}
\caption{Cape Town workplace}
\label{fig-capetown}
\end{figure}