520-optimisations-java-source.tex

\section{Manually optimising the Java source code}
\label{sec-optimisations-manual-java-source-optimisation}
As shown in Figure \ref{fig-translation-process}, our current implementation uses three steps to translate Java source code to CapeVM bytecode: the standard Java compiler, the ProGuard optimiser, and the modified Darjeeling infuser. None of these do any complex optimisations. 

In a future version, these three steps should be merged into an 'optimising infuser' which uses all the normal, well-known optimisation techniques to produce better quality bytecode, but at the moment we do not have the resources to build such an optimising infuser.

Since our goal is to determine what level of performance is possible on a sensor node, the Java source was manually optimised to get better quality JVM bytecode from \mycode{javac}. While these changes are not an automatic optimisation we developed, we find it important to mention them explicitly and analyse their impact, since many developers may expect these optimisations to happen automatically and without this step it would be impossible to reproduce our results.

% We have been careful to limit ourselves to 'fair' optimisations, by which we mean optimisations that an optimising infuser could reasonably be expected to do automatically, given some basic, conservative assumptions about the performance model. 

The most common optimisations that were done are:
\begin{itemize}
	\item Store the result of expressions calculated in a loop in a temporary variable, if it is known the result will be the same for each iteration.
	\item Since array and object field access is relatively expensive and not cached by the mark loop optimisation discussed in Section \ref{sec-optimisation-markloops}, prefer to minimise array and object access by using a temporary local variable, if the value may be used again soon.
	\item Manually inline \texttt{\#define}s and functions that were inlined by \mycode{avr-gcc} in the C version of the benchmark.
	\item Use 16-bit variables for array indexes where possible.
	\item Use bit shifts for multiplication and division by a power of two.
\end{itemize}

We will examine the effect of some other optimisations on the \mybench{CoreMark} benchmark in Section \ref{sec-evaluation-coremark}.

\paragraph{Temporary variables}
The first two optimisations generate extra store instructions, which means they may not always be beneficial if the value is never used again. However, a value often only needs to be reused only once for it to be faster to store it in a local variable than to calculate it twice or do two array accesses. If we use the mark loops optimisation discussed in Section \ref{sec-optimisation-markloops}, in many cases the variable may be stored in registers, making accessing them either very cheap, or completely free.

\paragraph{Manual inlining}
ProGuard automatically inlines methods that are only called from a single location, or small methods called from multiple locations. In these cases it simply replaces the \mycode{INVOKE} instruction with the callee's body, prepended with \mycode{STORE} instructions to pop the parameters off the stack and initialise the callee's local variables.

Manual inlining often results in better code because it may not be necessary to store the parameters in local variables if they are only used once. We therefore manually inline all small methods that were either a \texttt{\#define} in the C version of our benchmarks, or a function that was inlined by \mycode{avr-gcc}. Again, it is easy to imagine that an optimising infuser should be able to come to the same result automatically.


\paragraph{Platform independence}
Assuming an optimising infuser does raise the question how platform independent the resulting code is. If the infuser has more specific knowledge about the target platform, it can produce better code for that platform, but, while it should still run anywhere, this may not be as efficient on other platforms.

The optimisations described here are based on very conservative assumptions, and are not specific to the ATmega CPU. Therefore, they should work well for most programmes, independent of the hardware they are running on.

\paragraph{Example} Listing \ref{lst-manual-optimisation} shows an example of these manual optimisations, applied to the \mybench{bubble sort} benchmark. To have a fair comparison, we applied exactly the same optimisations to the C versions of our benchmarks, but here this had little or no effect on the performance.

\begin{listing}
\centering

\begin{minipage}[t]{0.47\textwidth}
\centering
  \begin{minted}{java}
// ORIGINAL
public static void bsort(int[] numbers)
{
  short NUMBERS=(short)numbers.length;
  for (short i=0; i<NUMBERS; i++)
  {
    for (short j=0; j<NUMBERS-i-1; j++)
    {
      if (numbers[j]>numbers[j+1])
      {
        int temp = numbers[j];
        numbers[j] = numbers[j+1];
        numbers[j+1] = temp;
      }
    }
  }
}
\end{minted}
\end{minipage}
\hfill
\begin{minipage}[t]{0.52\textwidth}
\centering
\begin{minted}[linenos=false]{java}
// MANUALLY OPTIMISED
public static void bsort(int[] numbers)
{
  short NUMBERS=(short)numbers.length;
  for (short i=0; i<NUMBERS; i++)
  {
    short x=(short)(NUMBERS-i-1);
    short j_plus_1 = 1;
    for (short j=0; j<x; j++)
    {
      int val_at_j = numbers[j];
      int val_at_j_plus_1 = numbers[j_plus_1];
      if (val_at_j>val_at_j_plus_1)
      {
        numbers[j] = val_at_j_plus_1;
        numbers[j_plus_1] = val_at_j;
      }
      j_plus_1++;
    }
  }
}
\end{minted}
\end{minipage}
\caption{Optimisation of the bubble sort benchmark}
\label{lst-manual-optimisation}
\end{listing}