\section{Analysis of the results}
For the analysis of the results I opted for \textbf{Python}, and in particular the \textbf{numpy} and \textbf{matplotlib} libraries, to analyze and plot the results obtained before.\\
At first I decided to use gnuplot to plot the results, but since the charts are quite populated I found matplotlib more practical for this purpose.\\
I divided this step into three scripts that also correspond to three different conceptual phases, which is why I preferred to keep them separate:
\begin{itemize}
\item Preprocessing: elimination of some faulty power measurements.
\item Average Computation: computation of the \textit{average} and the \textit{standard error} across all the runs of each benchmark.
\item Plot: the actual creation of the final charts.
\end{itemize}
Let's look at each phase in more detail.
\subsection{Preprocessing}
Unfortunately there is a bug in the measurement utility that occasionally stalls the process taking the measurement, leading to a blank power measurement value in the \textit{.csv} file where the utility stores the results, or worse, to a duplication of the previous value in the \textit{total.dat} file.\\
To detect this, we first scan the complete file looking for duplicated values on consecutive lines. There is a small chance that the duplicated values are actually correct: this sometimes happens with the \textit{backprop} and \textit{bfs} benchmarks, which have very similar run times and consequently very similar power measurements. We exclude this situation from the preprocessing by looking at the time values: if their difference is less than 2.5 seconds we do not purge the power consumption value from the results. In all other cases a duplicate is a symptom of a problem with the measurement, so we purge the second identical power measurement record, as sketched below.
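A minimal sketch of this check, assuming a hypothetical \lstinline{total.dat} layout with a benchmark name, a time value and a power value per line (the actual script may use different column names and parsing):
\begin{lstlisting}[language=Python]
import csv

TIME_TOLERANCE = 2.5  # seconds: below this, equal power values are plausible

def purge_duplicates(rows):
    """Drop a record whose power value duplicates the previous one,
    unless the two run times are close enough to make it plausible."""
    cleaned = rows[:1]
    for prev, curr in zip(rows, rows[1:]):
        duplicate = curr["power"] == prev["power"]
        similar_time = abs(curr["time"] - prev["time"]) < TIME_TOLERANCE
        if duplicate and not similar_time:
            continue  # symptom of a stalled measurement: purge the record
        cleaned.append(curr)
    return cleaned

with open("total.dat") as f:
    rows = [{"name": r[0], "time": float(r[1]), "power": float(r[2])}
            for r in csv.reader(f)]
rows = purge_duplicates(rows)
\end{lstlisting}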
\subsection{Average Computation}
At this point we can proceed to compute the \textbf{average} and the \textbf{standard error} of the measurements belonging to the same benchmark.\\
It is in fact advisable to repeat each benchmark an adequate number of times, so that the final value is not affected by temporary high load of the system caused by factors external to the benchmark, and so that we also get an approximate idea of the accuracy of the results. In the remainder of this report the results have been obtained by averaging \textbf{10 runs} of each benchmark on each platform supported by the machine.\\
I took advantage of \textit{numpy} for this analysis: the \textit{average} function gives the mean, while the standard error of the mean over $n$ runs follows from \textit{std} as the standard deviation divided by $\sqrt{n}$.
The results are then stored in two files named \textit{average.csv} and \textit{stderr.csv}, each with one entry per benchmark containing respectively the average and the standard error of the execution time and power consumption.
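A minimal sketch of this computation, assuming each benchmark's runs are collected as (time, power) pairs; the run values and the \lstinline{summarize} helper below are illustrative placeholders, not the measured results:
\begin{lstlisting}[language=Python]
import numpy as np

def summarize(runs):
    """Average and standard error of the mean, per column."""
    data = np.array(runs, dtype=float)  # shape: (n_runs, 2)
    avg = np.average(data, axis=0)
    stderr = np.std(data, axis=0, ddof=1) / np.sqrt(len(data))
    return avg, stderr

# Illustrative placeholder runs for one benchmark, not real data
avg, err = summarize([(7.31, 12.6), (7.18, 12.1), (7.40, 13.0)])
with open("average.csv", "a") as f:
    f.write("bfs,%.3f,%.1f\n" % (avg[0], avg[1]))
with open("stderr.csv", "a") as f:
    f.write("bfs,%.3f,%.1f\n" % (err[0], err[1]))
\end{lstlisting}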
\subsection{Plot}
The creation of the final charts is conducted with the help of \textit{matplotlib}, starting from the data already produced by the \textit{analyze} script in the \lstinline{average.csv} and \lstinline{stderr.csv} files.\\
Since the benchmarks have really different durations and energy consumption, putting the data of all the benchmarks in a single plot does not make much sense: comparing a bar with a height of 180 seconds with another with a height of 2 seconds can't give much information.\\
So as a first thing I decided to split the benchmarks into \textbf{two categories}: one with the benchmarks that have a duration on CPU of less than \textbf{30 seconds}, and the other with the remaining benchmarks.\\
I then organized the benchmarks in a bar chart having on the \textbf{x-axis} the various benchmarks, with \textbf{three bars} per benchmark, one for each platform on which it has been run (CPU, GPU 4 cores, GPU 2 cores).\\
In this way we can easily spot the differences between the various platforms. To keep the charts readable there are two charts per category, one for the execution time and one for the power consumption, as sketched below.\\
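A minimal sketch of one such grouped bar chart with error bars; the benchmark names and values here are illustrative placeholders, not the measured results:
\begin{lstlisting}[language=Python]
import numpy as np
import matplotlib.pyplot as plt

platforms = ["CPU", "GPU 4 cores", "GPU 2 cores"]
names = ["bench-a", "bench-b", "bench-c"]  # placeholder benchmarks
avg = np.array([[7.3, 5.0, 6.4],           # one row per platform
                [7.2, 7.8, 1.4],
                [7.2, 10.2, 1.5]])
err = np.array([[0.1, 0.4, 1.0],
                [0.3, 0.3, 0.1],
                [0.3, 0.4, 0.4]])

x = np.arange(len(names))  # one group of three bars per benchmark
width = 0.25
fig, ax = plt.subplots()
for i, platform in enumerate(platforms):
    ax.bar(x + i * width, avg[i], width, yerr=err[i], label=platform)
ax.set_xticks(x + width)        # center the labels under each group
ax.set_xticklabels(names)
ax.set_ylabel("Execution time (s)")
ax.legend()
fig.savefig("times-short.png")
\end{lstlisting}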
\subsection{Charts}
This section contains the charts representing the data produced. Remember that all the benchmarks have been executed \textbf{10 times} per platform, per device, so the bars actually show averages, with the standard error represented on each bar.
Below are the charts for the benchmark executions on the ODROIDXU3:
\begin{figure}[H]
\centering
\caption{Execution time ODROIDXU3, \textit{short} category}
\includegraphics[width=1\textwidth]{ODROIDXU3/times-short}
\end{figure}
\begin{figure}[H]
\centering
\caption{Execution time ODROIDXU3, \textit{long} category}
\includegraphics[width=1\textwidth]{ODROIDXU3/times-long}
\label{fig:odroid3timelong}
\end{figure}
\begin{figure}[H]
\centering
\caption{Power consumption ODROIDXU3, \textit{short} category}
\includegraphics[width=1\textwidth]{ODROIDXU3/power-short}
\end{figure}
\begin{figure}[H]
\centering
\caption{Power consumption ODROIDXU3, \textit{long} category}
\includegraphics[width=1\textwidth]{ODROIDXU3/power-long}
\label{fig:odroid3powerlong}
\end{figure}
Below are the charts for the benchmark executions on the ODROIDXU4:
\begin{figure}[H]
\centering
\caption{Execution time ODROIDXU4, \textit{short} category}
\includegraphics[width=1\textwidth]{ODROIDXU4/times-short}
\end{figure}
\begin{figure}[H]
\centering
\caption{Execution time ODROIDXU4, \textit{long} category}
\includegraphics[width=1\textwidth]{ODROIDXU4/times-long}
\label{fig:odroid4timelong}
\end{figure}
\begin{figure}[H]
\centering
\caption{Power consumption ODROIDXU4, \textit{short} category}
\includegraphics[width=1\textwidth]{ODROIDXU4/power-short}
\end{figure}
\begin{figure}[H]
\centering
\caption{Power consumption ODROIDXU4, \textit{long} category}
\includegraphics[width=1\textwidth]{ODROIDXU4/power-long}
\label{fig:odroid4powerlong}
\end{figure}
I also measured the execution time (only) of the benchmarks on my laptop, whose characteristics are reported in the Hardware section~\ref{sec:hardware}. I couldn't take power consumption measurements, and even if I could have, measuring the power consumption of such a complex machine wouldn't have made much sense.
Below are the charts for the benchmark executions on my laptop (X86), CPU only, execution time only:
\begin{figure}[H]
\centering
\caption{Execution time on Thinkpad X1, \textit{short} category}
\includegraphics[width=1\textwidth]{X86/times-short}
\end{figure}
\begin{figure}[H]
\centering
\caption{Execution time on Thinkpad X1, \textit{long} category}
\includegraphics[width=1\textwidth]{X86/times-long}
\label{fig:x86timelong}
\end{figure}
\subsection{Comments on the results}
Let's analyze first the results obtained on the ODROIDXU3. We can clearly see that on average running a benchmark on the GPU (especially on the one with 4 cores) is really beneficial for the execution time, as shown in \ref{fig:odroid3timelong}. This is particularly evident for the benchmarks of the \textit{long} category, where we have more time to see the effects of the increased computational power.
On the \textit{short} category instead in some cases the roles are inverted, but this can be explained by the additional overhead that a GPU computation brings: the buffers on which the OpenCL kernels work must be copied between the central memory and the memory of the GPU, and on really short benchmarks this additional overhead outweighs the reduced execution time.\\
\smallskip
What is really surprising is the power consumption of the GPU with 4 cores (\ref{fig:odroid3powerlong}). We can clearly see that for basically all the benchmarks the power consumption is drastically reduced in the case of a GPU computation. The exceptions are \textit{gaussian} and \textit{dwt2d}: gaussian implements the Gaussian elimination algorithm on a matrix, an inherently sequential task that does not give much opportunity for parallelization, while dwt2d is so short that it probably suffers from the cost of copying the buffers between the CPU memory and the GPU one. Often even the computation on the GPU with only 2 cores, while presenting a higher execution time than the CPU, can achieve a better power efficiency. We can get a hint of this by noticing that when running the benchmarks on the GPU the fan of the board spins significantly more rarely than when running them on the CPU: the SOC is dissipating less heat, a clear clue that the board is drawing less power.\\
\smallskip
The results for the ODROIDXU4 present more or less the same trends, which is expected since the hardware configuration of the two boards is basically identical.
What we can notice, especially for the benchmarks of the \textit{long} category, is a reduction in execution time and power consumption when moving from the ODROIDXU3 to the ODROIDXU4. At first this result may seem strange, since as we've said the SOC on the two boards should be identical, but we can explain it by noticing that while the XU3 was running the OS from an SD card, the XU4 was running it from an eMMC, which is considerably faster than even a good SD card. This, together with the fact that the benchmarks often work on huge input files that have to be loaded from storage, explains why the execution time decreased so much.\\
\smallskip
Let's have a detailed look at the results in \ref{fig:odroid4timelong}: for streamcluster we have an execution time on GPU of 60 seconds, while on the CPU we have more than double that, and the same trend holds for the power consumption (again more than double). For leukocyte the execution time on the CPU is 10 times higher than on the GPU, probably due to the intrinsically parallel nature of the benchmark.\\
The main exception to this trend is the gaussian benchmark, where the GPU computation is always at a disadvantage, probably due to the structure of the benchmark, which is not very parallel.
Another thing worth highlighting is that, especially on \textit{long} benchmarks, the standard error computed on the power consumption is quite high. This means that external factors may have influenced the measurements, even though during the tests no other task was running on the board. The scheduling mechanism of the Linux kernel probably caused this high variance, so in the future it may be interesting to look deeper into this aspect and try to isolate the possible causes.\\
\smallskip
As a final note we can see that the execution times on the X86 platform are significantly lower (\ref{fig:x86timelong}), but this is something we could in a way forecast. An X86 machine (even with only 4 cores) has a lot of additional support hardware in the CPU and a \textbf{CISC} architecture, and achieves performance considerably superior to a \textbf{RISC} one.\\
But if we look at the results of some benchmarks (e.g.\ cfd or leukocyte), and if we assume we could optimally parallelize the execution across all the devices available on the ODROID (not considering contention on the central memory), we can imagine that the final execution time could be comparable to, or at least of the same order as, that of the X86 architecture, let alone the power consumption, which would probably be one order of magnitude lower.\\
Obviously we can't be sure of this, and more research and experiments should be conducted in this direction.\\
But even without speculation, we can notice that for example the leukocyte benchmark is always really fast on the ARM GPU, even faster than on the X86 machine. This is because the benchmark is intrinsically designed to be run on a GPU platform (it consists of processing image frames), and thus we can conclude that our little ODROID can beat a \textbf{beast} such as an Intel i5 CPU, under certain assumptions.
\subsection{Detailed results}
In this section we provide the detailed results obtained with the ODROIDXU3.
\begin{table}[H]
\caption{Results on CPU}
\begin{center}
\begin{tabular}{||c c c c c||}
\hline
Name & Ex. time avg. (s) & Power avg. & Ex. time stderr (s) & Power stderr\\ [0.1ex]
\hline\hline
bfs & 7.305 & 12.6 & 0.109 & 0.6\\
gaussian & 5.033 & 9.7 & 0.371 & 1.2\\
hybridsort & 6.367 & 8.6 & 1.037 & 1.3\\
nn & 4.969 & 8.3 & 0.502 & 0.7\\
dwt2d & 1.293 & 1.3 & 0.532 & 0.5\\
lavaMD & 13.425 & 43.1 & 0.696 & 1.6\\
streamcluster & 248.779 & 416.0 & 15.785 & 67.0\\
cfd & 123.94 & 362.0 & 4.612 & 12.5\\
kmeans & 71.942 & 193.5 & 4.715 & 16.5\\
pathfinder & 32.635 & 103.1 & 0.84 & 1.8\\
particlefilter & 35.226 & 113.8 & 1.691 & 3.5\\
backprop & 8.307 & 14.3 & 0.9 & 1.6\\
srad & 11.294 & 36.1 & 0.839 & 2.1\\
leukocyte & 86.971 & 233.7 & 3.705 & 19.2\\
nw & 2.135 & 3.2 & 0.557 & 0.9\\
lud & 6.498 & 21.6 & 0.905 & 1.2\\
hotspot & 42.938 & 142.0 & 1.152 & 3.1\\
\hline
\end{tabular}
\end{center}
\end{table}
\begin{table}[H]
\caption{Results on GPU (4 cores)}
\begin{center}
\begin{tabular}{||c c c c c||}
\hline
Name & Ex. time avg. (s) & Power avg. & Ex. time stderr (s) & Power stderr\\ [0.1ex]
\hline\hline
bfs & 7.237 & 13.5 & 0.321 & 2.0\\
gaussian & 7.751 & 14.7 & 0.344 & 1.8\\
hybridsort & 1.434 & 1.7 & 0.131 & 0.5\\
nn & 4.509 & 7.8 & 0.055 & 0.9\\
dwt2d & 2.315 & 3.2 & 0.3 & 0.9\\
lavaMD & 2.781 & 4.4 & 0.029 & 0.7\\
streamcluster & 195.689 & 282.8 & 16.543 & 38.0\\
cfd & 90.693 & 174.1 & 1.69 & 58.0\\
kmeans & 79.049 & 168.5 & 2.063 & 15.9\\
pathfinder & 31.747 & 51.6 & 0.064 & 8.0\\
particlefilter & 31.599 & 56.1 & 0.272 & 19.1\\
backprop & 7.947 & 13.6 & 0.678 & 2.1\\
srad & 5.038 & 8.3 & 0.169 & 0.9\\
leukocyte & 11.284 & 16.6 & 0.863 & 4.0\\
nw & 2.808 & 4.1 & 0.262 & 1.1\\
lud & 4.099 & 7.8 & 0.058 & 0.9\\
hotspot & 33.946 & 63.1 & 0.08 & 7.8\\
\hline
\end{tabular}
\end{center}
\end{table}
\begin{table}[H]
\caption{Results on GPU (2 cores)}
\begin{center}
\begin{tabular}{||c c c c c||}
\hline
Name & Ex. time avg. (s) & Power avg. & Ex. time stderr (s) & Power stderr\\ [0.1ex]
\hline\hline
bfs & 7.217 & 13.4 & 0.274 & 1.5\\
gaussian & 10.181 & 17.1 & 0.357 & 2.1\\
hybridsort & 1.482 & 1.9 & 0.38 & 0.3\\
nn & 4.457 & 7.0 & 0.032 & 0.0\\
dwt2d & 2.335 & 3.2 & 0.371 & 0.8\\
lavaMD & 4.57 & 6.1 & 0.025 & 0.3\\
streamcluster & 203.219 & 241.7 & 10.673 & 67.2\\
cfd & 157.451 & 265.5 & 0.81 & 22.0\\
kmeans & 109.956 & 188.2 & 1.189 & 9.9\\
pathfinder & 62.599 & 76.7 & 0.049 & 13.0\\
particlefilter & 60.217 & 83.2 & 0.396 & 15.5\\
backprop & 8.051 & 13.6 & 0.462 & 1.9\\
srad & 8.133 & 12.3 & 0.13 & 1.1\\
leukocyte & 17.193 & 19.2 & 0.4 & 6.3\\
nw & 2.823 & 3.8 & 0.075 & 0.4\\
lud & 7.41 & 10.8 & 0.045 & 0.4\\
hotspot & 66.436 & 100.3 & 0.051 & 11.0\\
\hline
\end{tabular}
\end{center}
\end{table}
\pagebreak