
REPORT: Added results analysis and conclusion

Added section on the analysis of the results and conclusions.

Minor fixes all over the report.

Added charts of the results.

Added some tables with detailed results.
Andrea Gussoni 7 years ago
parent
commit
06c52571ee

BIN
report/source/figures/vector/ODROIDXU3/power-long.pdf


BIN
report/source/figures/vector/ODROIDXU3/power-short.pdf


BIN
report/source/figures/vector/ODROIDXU3/times-long.pdf


BIN
report/source/figures/vector/ODROIDXU3/times-short.pdf


BIN
report/source/figures/vector/ODROIDXU4/power-long.pdf


BIN
report/source/figures/vector/ODROIDXU4/power-short.pdf


BIN
report/source/figures/vector/ODROIDXU4/times-long.pdf


BIN
report/source/figures/vector/ODROIDXU4/times-short.pdf


BIN
report/source/figures/vector/X86/times-long.pdf


BIN
report/source/figures/vector/X86/times-short.pdf


+ 42 - 0
report/source/project_bibliography.bib

@@ -45,3 +45,45 @@
     url       = "http://www.gnuplot.info/",
     keywords  = "Plot"
 }
+
+@online{odroidxu3website,
+    title     = "ODROIDXU3 specifications webpage",
+    url       = "http://www.hardkernel.com/main/products/prdt_info.php?g_code=g140448267127",
+    keywords  = "OpenCL, ODROID"
+}
+
+@online{power1website,
+    title     = "SmartPower1 product page",
+    url       = "http://www.hardkernel.com/main/products/prdt_info.php?g_code=G137361754360",
+    keywords  = "OpenCL, ODROID"
+}
+
+@online{power2website,
+    title     = "SmartPower2 product page",
+    url       = "http://www.hardkernel.com/main/products/prdt_info.php?g_code=G148048570542",
+    keywords  = "OpenCL, ODROID"
+}
+
+@online{linuxscheduling,
+    title     = "Linux Scheduling",
+    url       = "https://www.usenix.org/legacy/publications/library/proceedings/usenix01/freenix01/full_papers/alicherry/alicherry_html/node5.html",
+    keywords  = "Linux, ODROID"
+}
+
+@online{khronoswebsite,
+    title     = "Khronos website",
+    url       = "https://www.khronos.org/registry/OpenCL/sdk/1.2/docs/man/xhtml/",
+    keywords  = "Linux, ODROID, OpenCL"
+}
+
+@online{rodiniarepo,
+    title     = "Rodinia Custom Repository",
+    url       = "http://gogs.heisenberg.ovh/andreagus/rodinia-benchmark.git",
+    keywords  = "ODROID, OpenCL"
+}
+
+@online{powerrepo,
+    title     = "SmartPower Custom Repository",
+    url       = "https://bitbucket.org/zanella_michele/odroid_smartpower_bridge",
+    keywords  = "ODROID, OpenCL"
+}

+ 3 - 0
report/source/report.tex

@@ -5,6 +5,7 @@
 \usepackage[a4paper,bindingoffset=0.2in,left=.5in,right=.5in,top=1in,bottom=1in,footskip=.5in]{geometry}
 \usepackage[english]{babel}
 \usepackage{graphicx}
+\usepackage{float}
 \usepackage{lipsum}
 \usepackage[bitstream-charter]{mathdesign}
 \usepackage[T1]{fontenc}
@@ -110,6 +111,8 @@ sorting=ynt
 
 \pagebreak
 
+\listoffigures
+\listoftables
 \printbibliography
 
 \end{document}

+ 25 - 1
report/source/sections/conclusions.tex

@@ -1 +1,25 @@
-\section{Conclusions}
+\section{Conclusions and Future Work}
+\subsection{OpenCL and Heterogeneous Computing}
+We have seen that even a quite affordable and small ARM board has real potential in running computationally intensive applications. At a fraction of the cost of an X86 platform we can achieve really surprising results, especially remembering that the power consumption is very limited.\\
+This, together with applications written for \textbf{heterogeneous computing platforms}, can really pave the way for embedded devices running heavy tasks at a fraction of the cost of a standard X86 platform. The development of the Linux Kernel for ARM devices has advanced a lot in recent years, and now we have a lot of alternatives for running a Linux-based distribution on an ARM device, so having our favorite platform available is no longer a problem.\\
+What I think is really important is having a common programming language or framework that makes it possible to exploit all the computational power provided by a board like an ODROID with the minimum possible porting effort.\\
+\smallskip
+\\
+I had never worked with OpenCL before, but the approach that underlies the project is really promising, since once we have a well written OpenCL kernel we can basically run it on different types of devices without additional effort. If we imagine this applied to a dedicated board with a lot of GPU devices and a \textit{small} CPU that only serves for running the OS and dispatching the tasks, we can easily obtain power-efficient devices able to run heavy tasks. In addition to this we could also have other types of accelerators (such as FPGAs, co-processors, cryptographic accelerators) that could benefit from this type of architecture.
+\\
+There are also other examples of platforms and programming paradigms oriented to parallel and heterogeneous computing (such as CUDA or OpenMP), but in my opinion OpenCL suits the environment of embedded and low-power platforms really well.\\
+The main challenge will be to exploit as much as possible all the computational power provided by these kinds of boards: we have seen that using a GPU can reduce by half or more the execution time of graphically oriented applications, and not exploiting all the hardware available on our device is really a waste.\\
+Another main goal should be to understand when a task is more efficient on a GPU/CPU/accelerator, and dispatch it according to that criterion. For example, in the case of our benchmarks we would like a policy that dispatches the streamcluster task to the GPU and the gaussian one to the CPU, where each is more efficient.
+
+\subsection{Possible future extensions}
+A natural continuation of this project would be to investigate how much the performance of the benchmarks is affected by the read/write speed of the main storage. Some benchmarks work on really huge input files, so it is probable that this significantly affects the final data that we get from the benchmarks.\\
+We could try to run the benchmarks on inputs of different sizes, or track how much of the time is used for reading the files and how much for doing the actual computation. We could also try to preallocate the data in main memory, for example using a \textbf{tmpfs} file system, re-execute the benchmarks, and see the differences.
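+A quick sketch of the \textbf{tmpfs} idea (mount point and data path are hypothetical):
+\begin{lstlisting}
+  sudo mkdir -p /mnt/ramdisk
+  sudo mount -t tmpfs -o size=512M tmpfs /mnt/ramdisk
+  cp -r rodinia/data /mnt/ramdisk/   # inputs now read from RAM
+\end{lstlisting}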
+\medskip
+\\
+Another possible extension could be to investigate how the benchmark execution times and power consumption are affected by the scheduling policy of the Linux kernel. We could try to change the scheduling class \cite{linuxscheduling} when executing the benchmarks and see how the results are affected by these changes.
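+For instance, a run under the real-time FIFO class could be launched with something like \lstinline{chrt --fifo 99 ./run-cpu} (using the standard \textit{chrt} utility; the run script name is the one used elsewhere in this report) and compared against a run under the default class.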
+\smallskip
+\\
+Another interesting thing to do would be to compare the power consumption of an X86 machine with that of the ODROID, but of course we would need an appropriate criterion to fairly compare the two measurements.
+\smallskip
+\\ Another interesting experiment would be to execute the benchmarks only on the big or only on the LITTLE cores provided by the CPU. In this manner we could see how much the performance varies when using the low-power cores versus the high-power ones, or better understand whether the LITTLE cores are in some way a bottleneck for the application.
+\pagebreak

+ 18 - 14
report/source/sections/introduction.tex

@@ -1,41 +1,45 @@
 \section{Introduction}
-The main purpouse of this document is to sum up the work done during the project for the \textit{Advanced Operating Systems} course.
+The main purpose of this document is to sum up the work done during the development of the project for the \textit{Advanced Operating Systems} course.
 
 \subsection{Problem statement}
 Quoting the project assignment of the project, the goal is to \textit{Compile a OpenCL runtime (pocl) on a ARM board. Run some benchmarks. Provide a comparison in terms of execution time, power/energy consumption.}\\
 Let's characterize more in detail the hardware and software used for the project.
 
 \subsection{Hardware}
+\label{sec:hardware}
 
 \subsubsection{ARM Board}
-The main ARM board used for the project is a \textbf{ODROID-XU3} produced by \textbf{Hardkernel co., Ltd.} \cite{hardkernelwebsite} provided by the \textbf{HEAP lab}.\\
-It mounts a CPU belongin to the \textbf{Arm big.LITTLE} series, in particular a Samsung Exynos5422 Cortex\texttrademark-A15 2Ghz and Cortex\texttrademark-A7 Octa core CPU.\\
-The GPU is a \textbf{Mali-T628 MP6} that is certified for OpenGL ES 3.1/2.0/1.1 and OpenCL 1.2 Full profile.\\
-This configuration of the board is equipped with \textbf{2Gbyte LPDDR3 RAM} PoP stacked.
-\href{http://www.hardkernel.com/main/products/prdt_info.php?g_code=g140448267127}{You can visit the product page for further details on the hardware.}\\
-I also used an \textbf{ODROID-XU4} of my own to adavance in the completion of the project during the summer. I opted for this model since the previous model wasn't available from the producer, and the SOC platform (CPU and GPU) is identical with respect to the \textbf{ODOID-XU3} except for small differences with ports and integrated peripherals, and I don't expect that this have influenced the results of the benchmarks, also because the final results proposed here have been \textbf{always} produced with the board present in \textbf{laboratory}.\\
-There is a small chance that problems may arise from the fact that I manly tested the auto-deployment scripts on my personal board during the summer when the University was closed, keep in mind that if there are some problems with the deploy scripts, since they may simply be differences on packages names or something like this.
+The main ARM board used for the project is an \textbf{ODROID-XU3} produced by \textbf{Hardkernel co., Ltd.} \cite{hardkernelwebsite} provided by the \textbf{HEAP lab}.\\
+The specifications are:
+\begin{itemize}
+  \item A CPU belonging to the \textbf{Arm big.LITTLE} series, in particular a Samsung Exynos5422 octa-core CPU with Cortex\texttrademark-A15 2GHz and Cortex\texttrademark-A7 cores.
+  \item A \textbf{Mali-T628 MP6} GPU that is certified for OpenGL ES 3.1/2.0/1.1 and OpenCL 1.2 Full profile.
+  \item \textbf{2GByte LPDDR3 RAM}, PoP stacked.
+\end{itemize}
+You can visit the product page \cite{odroidxu3website} for further details on the hardware.\\
+I also used an \textbf{ODROID-XU4} of my own to advance the completion of the project during the summer. I opted for this model since the previous model wasn't available from the producer, and the SOC platform (CPU and GPU) is identical to that of the \textbf{ODROID-XU3}, except for small differences in ports and integrated peripherals; I don't expect this to have influenced the results of the benchmarks, also because the final results proposed here have \textbf{always} been produced with the board present in the \textbf{laboratory}.\\
+There is a small chance that problems may arise from the fact that I mainly tested the auto-deployment scripts on my personal board during the summer, when the University was closed. Keep that in mind if there are some problems with the deploy scripts, since it may simply be a difference in package names or a broken dependency.
 
 \subsubsection{Power Measurement}
-For the energy consumption measurements I used the \href{http://www.hardkernel.com/main/products/prdt_info.php?g_code=G137361754360}{Hardkernel Smart Power} provided me in the laboratory. I also had available an \href{http://www.hardkernel.com/main/products/prdt_info.php?g_code=G148048570542}{Hardkernel Smart Power 2} but unfortunately it wasn't compatible with the measurement software(detailed explanation in the software paragraph).
+For the energy consumption measurements I used the \textbf{Hardkernel Smart Power} \cite{power1website} provided to me by the laboratory. I also had available a \textbf{Hardkernel Smart Power 2} \cite{power2website}, but unfortunately it wasn't compatible with the measurement software (a detailed explanation is in the software paragraph \ref{sec:smartpower}).
 
 \subsubsection{x86 Platform}
-The comparison of performances with an \textbf{x86} platform have been made on a Thinkpad X1 Carbon 3rd gen. that mounts and \href{https://ark.intel.com/products/85212/Intel-Core-i5-5200U-Processor-3M-Cache-up-to-2_70-GHz}{\textbf{Intel i5 5200U CPU}} and 8 GB of ram.
+The comparison of performance with the \textbf{x86} platform has been made on a Thinkpad X1 Carbon 3rd gen. that mounts an \href{https://ark.intel.com/products/85212/Intel-Core-i5-5200U-Processor-3M-Cache-up-to-2_70-GHz}{\textbf{Intel i5 5200U CPU}} and 8 GB of RAM.
 
 \subsection{Software}
 In this section we will describe the software component used for the development of the project.
 
 \subsubsection{OS}
-For what concerns the OS used during the development, I used the \textbf{Ubuntu 16.04.2 Kernel 4.9} image downloaded from the \href{http://odroid.com/dokuwiki/doku.php?id=en:xu3_release_linux_ubuntu_k49}{Hardkernel site}. I then used the suggested utility called \textbf{Etcher} \cite{etcherwebsite} to flash the image to the eMMC of the ODROID-XU4. I assume that also the flash of the ODROID-XU3 has been done in a similar way.
+For what concerns the OS used during the development, I used the \textbf{Ubuntu 16.04.2 Kernel 4.9} image downloaded from the \href{http://odroid.com/dokuwiki/doku.php?id=en:xu3_release_linux_ubuntu_k49}{Hardkernel site}. I used the suggested utility called \textbf{Etcher} \cite{etcherwebsite} to flash the image to the eMMC of the ODROID-XU4; I assume that the ODROID-XU3 was flashed in a similar way.
 
 \subsubsection{OpenCL Runtime}
-For the benchmarks we actually used two OpenCL runtimes.The one used for the integrated Mali GPU is provided in the repository directly by the Hardkernel developers, and can be installed via the \textbf{mali-fbdev} package.\\
-Instead for the CPU we manually fetched and compiled the runtime provided by the \textbf{Portable Computing Language (\textbf{pocl})} \cite{poclwebsite} work group, \href{http://portablecl.org/downloads/pocl-0.14.tar.gz}{version 0.14}.
+For the benchmarks we actually used two OpenCL runtimes. The one used for the integrated Mali GPU is provided in the distribution repositories directly by the Hardkernel developers, and can be installed via the \textbf{mali-fbdev} package.\\
+Instead for the CPU platform we manually fetched and compiled the runtime provided by the \textbf{Portable Computing Language (\textbf{pocl})} \cite{poclwebsite} work group, \href{http://portablecl.org/downloads/pocl-0.14.tar.gz}{version 0.14}.
 
 \subsubsection{Benchmark Suite}
 The benchmark suite used is the \href{https://www.cs.virginia.edu/~skadron/wiki/rodinia/index.php/Rodinia:Accelerating_Compute-Intensive_Applications_with_Accelerators}{\textbf{Rodinia Benchmark Suite}, version 3.1}. This suite includes a lot of benchmarks specifically designed for systems that provide accelerators, and thus belong to the \textbf{heterogeneous computer systems} category. In fact the benchmarks provides parallelization features for three of the main parallel computing paradigms, that are \textbf{OpenMP, CUDA, and OpenCL}. We will of course use only the OpenCL benchmarks. The project has been started and it is mantained by the Computer Science Department of \textbf{University of Virginia} \cite{virginiawebsite}.
 
 \subsubsection{Result Analysis}
-For what concerns the gathering and the analysis of the results obtained by the benchmarks, I mainly take advantage of \textbf{Bash} and \textbf{Python(2)} scripts to collect the results, and also of \textbf{Gnuplot} \cite{gnuplotwebsite} to create graphs representing the results.
+For what concerns the gathering and the analysis of the results obtained from the runs of the benchmarks, I mainly took advantage of \textbf{Bash} and \textbf{Python(v2)} scripts to collect the results, and of \textbf{Gnuplot} \cite{gnuplotwebsite} to create graphs representing the results.
 
 \pagebreak

+ 217 - 0
report/source/sections/results.tex

@@ -1 +1,218 @@
 \section{Analysis of the results}
+For the analysis of the results I opted to use \textbf{Python}, and in particular the \textbf{numpy} and \textbf{matplotlib} libraries, to analyze and plot the results obtained before.\\
+At first I decided to use gnuplot to plot the results, but since the charts are quite populated I found matplotlib to be more practical for this purpose.\\
+I divided this step into three scripts that correspond to different conceptual phases, and for this reason I preferred to keep them separate.
+We have three phases:
+\begin{itemize}
+  \item Preprocessing: elimination of some faulty power measurements.
+  \item Average Computation: we compute the \textit{average} and the \textit{stderr} across all the runs of each benchmark.
+  \item Plot: the actual creation of the final charts.
+\end{itemize}
+
+Let's look at each phase in more detail.
+
+\subsection{Preprocessing}
+Unfortunately there is a bug in the measurement utility that on some occasions creates a stall of the process taking the measurement, leading to a blank power measurement value in the \textit{.csv} file where the utility stores the results or, worse, to a duplication of the previous value in the total.dat file.\\
+To check this, the first thing to do is look at the complete file and find duplicated values on consecutive lines. There is a small chance that the values are correct: for example, this sometimes happens with the \textit{backprop} and \textit{bfs} benchmarks, which have really similar run times, meaning that the power measurements will also be very similar. We exclude this situation from our preprocessing by looking at the time values: if their difference is less than 2.5 seconds, we do not purge the power consumption value from the results. In all the other cases a duplicate is a symptom of a problem with the measurement, so we purge the second identical power measurement record.
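+\smallskip
+\\
+A minimal sketch of this purging logic, assuming the records have already been parsed into a list of \textit{(time, power)} pairs (the record format and function name are illustrative, not the exact script):
+\begin{lstlisting}
+# Sketch: drop a power value identical to the previous one unless
+# the two run times are close enough to make the duplicate plausible.
+def purge_duplicates(records, time_threshold=2.5):
+    if not records:
+        return []
+    cleaned = [records[0]]
+    for prev, curr in zip(records, records[1:]):
+        same_power = curr[1] == prev[1]
+        close_times = abs(curr[0] - prev[0]) < time_threshold
+        if same_power and not close_times:
+            continue  # symptom of a stalled measurement: purge it
+        cleaned.append(curr)
+    return cleaned
+\end{lstlisting}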
+
+\subsection{Average Computation}
+At this point we can proceed to compute the \textbf{average} and the \textbf{standard error} of the measurements belonging to the same benchmark.\\
+It is in fact advisable to repeat each benchmark an adequate number of times, so that the final value is not affected by temporary high load of the system caused by factors external to the benchmark, and also to get an approximate idea of the accuracy of the results. In the rest of this report the results have been obtained by averaging \textbf{10 runs} of each benchmark for each platform supported by the machine.\\
+I took advantage of \textit{numpy} for this analysis, using its \textit{average} function and computing the standard error as the sample standard deviation divided by the square root of the number of runs.
+The results are then stored in two files named \textit{average.csv} and \textit{stderr.csv} in which we have an entry per benchmark containing respectively the average and the standard error of time and power consumption.
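+\smallskip
+\\
+A minimal sketch of this aggregation, assuming the runs of one benchmark are stacked in a numpy array with one \textit{(time, power)} row per run (names are illustrative):
+\begin{lstlisting}
+import numpy as np
+
+def aggregate(runs):
+    # runs: array of shape (n_runs, 2), columns = (time, power)
+    avg = np.average(runs, axis=0)
+    # standard error = sample standard deviation / sqrt(n_runs)
+    stderr = np.std(runs, axis=0, ddof=1) / np.sqrt(runs.shape[0])
+    return avg, stderr
+\end{lstlisting}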
+
+\subsection{Plot}
+The creation of the final charts is conducted with the help of \textit{matplotlib}, starting from the data already produced by the \textit{analyze} script in the \lstinline{average.csv} and \lstinline{stderr.csv} files.\\
+Since the benchmarks have really different durations and energy consumptions, having the data of all the benchmarks in a single plot does not make much sense: comparing a bar with a height of 180 seconds with another with a height of 2 seconds can't give much information.\\
+So the first thing I did was to split the benchmarks into \textbf{two categories}: one with the benchmarks that have a duration on CPU of less than \textbf{30 seconds}, and the other with the remaining benchmarks.\\
+I then organized the benchmarks in a bar chart having the various benchmarks on the \textbf{x-axis}, with \textbf{three bars} per benchmark, one for each platform on which it has been run (CPU, GPU 4 cores, GPU 2 cores).\\
+In this way we can easily spot the differences between the various platforms. Obviously, to keep the charts readable, there are two charts: one for the execution time and one for the power consumption.\\
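+A minimal sketch of this grouped-bar layout follows (the data below is a placeholder subset of the real averages; the actual script reads \lstinline{average.csv} and \lstinline{stderr.csv}):
+\begin{lstlisting}
+# Sketch: three bars per benchmark (CPU, GPU 4 cores, GPU 2 cores)
+# with the standard error drawn on each bar.
+import numpy as np
+import matplotlib.pyplot as plt
+
+benchmarks = ["bfs", "gaussian", "nn"]        # placeholder subset
+platforms = ["CPU", "GPU 4 cores", "GPU 2 cores"]
+avg = np.array([[7.3, 7.2, 7.2],              # rows: benchmarks
+                [5.0, 7.8, 10.2],             # cols: platforms
+                [5.0, 4.5, 4.5]])
+err = np.array([[0.1, 0.3, 0.3],
+                [0.4, 0.3, 0.4],
+                [0.5, 0.1, 0.1]])
+
+x = np.arange(len(benchmarks))
+width = 0.25
+for i, platform in enumerate(platforms):
+    plt.bar(x + i * width, avg[:, i], width,
+            yerr=err[:, i], capsize=3, label=platform)
+plt.xticks(x + width, benchmarks)
+plt.ylabel("Execution time [s]")
+plt.legend()
+plt.savefig("times-short.pdf")
+\end{lstlisting}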
+
+\subsection{Charts}
+This section contains the charts representing the data produced. Remember that all the benchmarks have been executed \textbf{10 times} per platform, per device, so these are actually averages, with the standard error represented on each bar.
+
+The following are the charts relative to the benchmark executions on the ODROIDXU3:
+
+\begin{figure}[H]
+  \centering
+  \caption{Execution time ODROIDXU3, \textit{short} category}
+  \includegraphics[width=1\textwidth]{ODROIDXU3/times-short}
+\end{figure}
+
+\begin{figure}[H]
+  \centering
+  \caption{Execution time ODROIDXU3, \textit{long} category}
+  \includegraphics[width=1\textwidth]{ODROIDXU3/times-long}
+  \label{fig:odroid3timelong}
+\end{figure}
+
+\begin{figure}[H]
+  \centering
+  \caption{Power consumption ODROIDXU3, \textit{short} category}
+  \includegraphics[width=1\textwidth]{ODROIDXU3/power-short}
+\end{figure}
+
+\begin{figure}[H]
+  \centering
+  \caption{Power consumption ODROIDXU3, \textit{long} category}
+  \includegraphics[width=1\textwidth]{ODROIDXU3/power-long}
+  \label{fig:odroid3powerlong}
+\end{figure}
+
+The following are the charts relative to the benchmark executions on the ODROIDXU4:
+
+\begin{figure}[H]
+  \centering
+  \caption{Execution time ODROIDXU4, \textit{short} category}
+  \includegraphics[width=1\textwidth]{ODROIDXU4/times-short}
+\end{figure}
+
+\begin{figure}[H]
+  \centering
+  \caption{Execution time ODROIDXU4, \textit{long} category}
+  \includegraphics[width=1\textwidth]{ODROIDXU4/times-long}
+  \label{fig:odroid4timelong}
+\end{figure}
+
+\begin{figure}[H]
+  \centering
+  \caption{Power consumption ODROIDXU4, \textit{short} category}
+  \includegraphics[width=1\textwidth]{ODROIDXU4/power-short}
+\end{figure}
+
+\begin{figure}[H]
+  \centering
+  \caption{Power consumption ODROIDXU4, \textit{long} category}
+  \includegraphics[width=1\textwidth]{ODROIDXU4/power-long}
+  \label{fig:odroid4powerlong}
+\end{figure}
+
+
+I also measured only the execution time of the benchmarks on my laptop, whose characteristics are reported in the Hardware section \ref{sec:hardware} (I couldn't take power consumption measurements, and even if I could have, taking the power consumption of such a complex machine wouldn't have made much sense).
+Here are the charts relative to the benchmark executions on my laptop (X86), CPU only, execution time only:
+
+\begin{figure}[H]
+  \centering
+  \caption{Execution time on Thinkpad X1, \textit{short} category}
+  \includegraphics[width=1\textwidth]{X86/times-short}
+\end{figure}
+
+\begin{figure}[H]
+  \centering
+  \caption{Execution time on Thinkpad X1, \textit{long} category}
+  \includegraphics[width=1\textwidth]{X86/times-long}
+  \label{fig:x86timelong}
+\end{figure}
+
+\subsection{Comments on the results}
+Let's first analyze the results obtained on the ODROIDXU3. We can clearly see that on average running a benchmark on the GPU (especially on the one with 4 cores) is really beneficial for the execution time (Figure \ref{fig:odroid3timelong}). This is particularly evident for the benchmarks which belong to the \textit{long} category, where we have more time to see the effects of the increased computational power.
+On the \textit{short} category instead, in some cases we have an inversion of the roles, but this can be explained by the additional overhead that a GPU computation brings. In fact we need to copy the buffers on which the OpenCL kernels will work between the central memory and the memory of the GPU, and on benchmarks that are really short this additional overhead compensates for the reduced execution time.\\
+\smallskip
+\\
+What is really surprising is the power consumption of the GPU with 4 cores (Figure \ref{fig:odroid3powerlong}). We can clearly see that for basically all the benchmarks the power consumption is drastically reduced in the case of a GPU computation. The exceptions are \textit{gaussian} and \textit{dwt2d}, but we can explain them: gaussian does not offer much opportunity for parallelization, since it implements the Gaussian elimination algorithm on a matrix, which is per se a sequential task, while dwt2d is a task so short that it probably suffers from the overhead of copying the buffers between the CPU memory and the GPU one. Often even the computation with the GPU composed of only 2 cores, while presenting an execution time higher than the one on CPU, can achieve a better power efficiency. We can get a hint of this by noticing that when running the benchmarks on the GPU the fan of the board spins significantly more rarely than when running them on the CPU. This means that the SOC is dissipating less heat, a clear clue that the board is drawing less power.\\
+\smallskip
+\\
+The results for the ODROIDXU4 also present more or less the same trends, which is expected since the hardware configuration of the two boards is basically identical.
+What we can notice instead is that, especially for the benchmarks belonging to the \textit{long} category, we have a reduction in execution time and power consumption when moving from the ODROIDXU3 to the ODROIDXU4. At first this result may seem strange, since as we've said the SOC on the two boards should be identical, but I think we can explain it by noticing that while the XU3 was running the OS from an SD card, the XU4 was running it from an eMMC, whose speed is considerably superior to that of even a good SD card. This, together with the fact that the benchmarks often work on huge input files that we have to load from storage, explains why the execution time decreased so much.\\
+\smallskip
+Let's have a detailed look at the results in Figure \ref{fig:odroid4timelong}: we can see that for streamcluster we have an execution time on GPU of 60 seconds, while on the CPU we have more than double that, and the same trend holds for the power consumption (more than double). For leukocyte instead we have an execution time 10 times higher on the CPU with respect to the GPU, probably due to the intrinsically parallel nature of the benchmark.\\
+The main exception to this trend is the gaussian benchmark. In this case the GPU computation is always disadvantaged, probably due to the structure of the benchmark, which is not very parallel.
+\\
+Another thing worth highlighting is that, especially on \textit{long} benchmarks, the standard error computed on the power consumption is quite high. This means that external factors may have influenced the measurements, even though during the tests we obviously had no other task running on the board. Probably the scheduling mechanism of the Linux Kernel caused this high variance, so in the future it may be interesting to dig deeper into this aspect and try to isolate the possible causes.\\
+\smallskip
+\\
+As a final note we can see that the execution times on the X86 platform are significantly lower (Figure \ref{fig:x86timelong}), but this is something that we could in a way forecast. An X86 machine (even with only 4 cores) has a lot of additional support hardware in the CPU and a \textbf{CISC} architecture, and can achieve performance considerably superior to that of a \textbf{RISC} one.\\
+But if we look at the results of some benchmarks (e.g. cfd or leukocyte), and if we assume we are able to optimally parallelize the execution across all the devices available on the ODROID (not considering contention on the central memory), we can imagine that the final execution time could be comparable to, or at least of the same order as, that of the X86 architecture, let alone the power consumption, which would probably be one order of magnitude lower.\\
+Obviously we can't be sure of this, and more research and experiments should be conducted in this direction.\\
+But even without speculation, we can notice that for example the leukocyte benchmark is always really fast on the ARM GPU, even faster than on the X86 machine. This is because the benchmark is intrinsically designed to be run on a GPU platform (it consists of processing image frames), and thus we can conclude that our little ODROID can beat a \textbf{beast} such as an Intel i5 CPU, under certain assumptions.
+
+\subsection{Detailed results}
+In this section we provide the detailed results obtained with the ODROIDXU3.
+\begin{table}[H]
+  \caption{Results on CPU}
+  \begin{center}
+  \begin{tabular}{||c c c c c||}
+  \hline
+  Name & Ex. time avg. [s] & Power avg. & Ex. time stderr [s] & Power stderr\\ [0.1ex]
+  \hline\hline
+  bfs & 7.305 & 12.6 & 0.109 & 0.6\\
+  gaussian & 5.033 & 9.7 & 0.371 & 1.2\\
+  hybridsort & 6.367 & 8.6 & 1.037 & 1.3\\
+  nn & 4.969 & 8.3 & 0.502 & 0.7\\
+  dwt2d & 1.293 & 1.3 & 0.532 & 0.5\\
+  lavaMD & 13.425 & 43.1 & 0.696 & 1.6\\
+  streamcluster & 248.779 & 416.0 & 15.785 & 67.0\\
+  cfd & 123.94 & 362.0 & 4.612 & 12.5\\
+  kmeans & 71.942 & 193.5 & 4.715 & 16.5\\
+  pathfinder & 32.635 & 103.1 & 0.84 & 1.8\\
+  particlefilter & 35.226 & 113.8 & 1.691 & 3.5\\
+  backprop & 8.307 & 14.3 & 0.9 & 1.6\\
+  srad & 11.294 & 36.1 & 0.839 & 2.1\\
+  leukocyte & 86.971 & 233.7 & 3.705 & 19.2\\
+  nw & 2.135 & 3.2 & 0.557 & 0.9\\
+  lud & 6.498 & 21.6 & 0.905 & 1.2\\
+  hotspot & 42.938 & 142.0 & 1.152 & 3.1\\
+  \hline
+  \end{tabular}
+  \end{center}
+\end{table}
+
+\begin{table}[H]
+  \caption{Results on GPU (4 cores)}
+  \begin{center}
+  \begin{tabular}{||c c c c c||}
+  \hline
+  Name & Ex. time avg. [s] & Power avg. & Ex. time stderr [s] & Power stderr\\ [0.1ex]
+  \hline\hline
+  bfs & 7.237 & 13.5 & 0.321 & 2.0\\
+  gaussian & 7.751 & 14.7 & 0.344 & 1.8\\
+  hybridsort & 1.434 & 1.7 & 0.131 & 0.5\\
+  nn & 4.509 & 7.8 & 0.055 & 0.9\\
+  dwt2d & 2.315 & 3.2 & 0.3 & 0.9\\
+  lavaMD & 2.781 & 4.4 & 0.029 & 0.7\\
+  streamcluster & 195.689 & 282.8 & 16.543 & 38.0\\
+  cfd & 90.693 & 174.1 & 1.69 & 58.0\\
+  kmeans & 79.049 & 168.5 & 2.063 & 15.9\\
+  pathfinder & 31.747 & 51.6 & 0.064 & 8.0\\
+  particlefilter & 31.599 & 56.1 & 0.272 & 19.1\\
+  backprop & 7.947 & 13.6 & 0.678 & 2.1\\
+  srad & 5.038 & 8.3 & 0.169 & 0.9\\
+  leukocyte & 11.284 & 16.6 & 0.863 & 4.0\\
+  nw & 2.808 & 4.1 & 0.262 & 1.1\\
+  lud & 4.099 & 7.8 & 0.058 & 0.9\\
+  hotspot & 33.946 & 63.1 & 0.08 & 7.8\\
+  \hline
+  \end{tabular}
+  \end{center}
+\end{table}
+
+\begin{table}[H]
+  \caption{Results on GPU (2 cores)}
+  \begin{center}
+  \begin{tabular}{||c c c c c||}
+  \hline
+  Name & Ex. time avg. [s] & Power avg. & Ex. time stderr [s] & Power stderr\\ [0.1ex]
+  \hline\hline
+  bfs & 7.217 & 13.4 & 0.274 & 1.5\\
+  gaussian & 10.181 & 17.1 & 0.357 & 2.1\\
+  hybridsort & 1.482 & 1.9 & 0.38 & 0.3\\
+  nn & 4.457 & 7.0 & 0.032 & 0.0\\
+  dwt2d & 2.335 & 3.2 & 0.371 & 0.8\\
+  lavaMD & 4.57 & 6.1 & 0.025 & 0.3\\
+  streamcluster & 203.219 & 241.7 & 10.673 & 67.2\\
+  cfd & 157.451 & 265.5 & 0.81 & 22.0\\
+  kmeans & 109.956 & 188.2 & 1.189 & 9.9\\
+  pathfinder & 62.599 & 76.7 & 0.049 & 13.0\\
+  particlefilter & 60.217 & 83.2 & 0.396 & 15.5\\
+  backprop & 8.051 & 13.6 & 0.462 & 1.9\\
+  srad & 8.133 & 12.3 & 0.13 & 1.1\\
+  leukocyte & 17.193 & 19.2 & 0.4 & 6.3\\
+  nw & 2.823 & 3.8 & 0.075 & 0.4\\
+  lud & 7.41 & 10.8 & 0.045 & 0.4\\
+  hotspot & 66.436 & 100.3 & 0.051 & 11.0\\
+  \hline
+  \end{tabular}
+  \end{center}
+\end{table}
+
+\pagebreak

+ 66 - 42
report/source/sections/work.tex

@@ -1,24 +1,26 @@
 \section{Summary Of The Work}
 
 \subsection{Becoming familiar with the OpenCL framework}
-Before starting the project I never worked with OpenCL, so before starting the work I decided to research information through the documentation available online.
+Before starting the project I had never worked with OpenCL, so I decided to research information through the documentation available online to get a grasp of how an OpenCL application works. I used the documentation provided by the \textbf{Khronos Group} \cite{khronoswebsite} as the main source of information about the C++ OpenCL Wrapper API.\\
 In the meantime I tried to compile and play with \textbf{pocl} on my laptop, just to understand how to start from an OpenCL application and run it on the hardware.
 My main reference has been the pocl website, in particular the documentation \cite{poclwebsite} on the pocl project website.\\
 I had some previous experience with the \textbf{LLVM }framework \cite{llvmwebsite}, that pocl uses for compiling the runtime, so this part was not too difficult to manage, also because the required version of LLVM (3.8) is the default version shipped by the Ubuntu distribution, but anyway I had it already compiled on my machine.\\
 Once I had the runtime compiled and ready for my laptop, I moved to becoming familiar with the designated benchmark suite.\\
-The first impact with the benchmark suite has been a little problematic since, for reasons that will be more clear when reading the section dedicated to the modifications made at the \textbf{Rodinia Benchmark Suite}, the suite is tailored for running on GPU, and since the pocl runtime on my laptop only exposed a CPU device, I wasn't able to run a single benchmark, and not having yet developed the skills necessary to debug and work with the C++ OpenCL Wrapper API, I was having some difficulties.\\
+\medskip
+\\
+The first impact with the benchmark suite has been a little problematic since, for reasons that will become clearer when reading the section dedicated to the modifications made to the \textbf{Rodinia Benchmark Suite} \ref{sec:benchmark}, the suite is tailored for running on GPUs, and since the pocl runtime on my laptop only exposed a CPU device, I wasn't able to run even a single benchmark out of the box; not having yet developed the skills necessary to debug and work with the C++ OpenCL Wrapper API, I was having some difficulties.\\
 For this reason I decided to begin with something simpler, and I searched for other Benchmark Suites online. I searched for a little bit and found the \textbf{ViennaCL} \cite{viennawebsite} suite.\\
 This time things went better, and after some experiments and tentatives I managed to run some benchmarks of the suite on my laptop, and reading the code I began to understand how the initialization and run of an OpenCL platform worked.\\
-Also during the documentation phase I become aware of the existence of the \textbf{Beignet} project, an Open Source OpenCL implementation to support the integrated GPUs on Intel chipset, so I had the opportunity to experiment a little also with a GPU device even before working on the board.\\
+During the documentation phase I became aware of the existence of the \textbf{Beignet} project, an Open Source OpenCL implementation supporting the integrated GPUs on Intel chipsets, so I had the opportunity to experiment a little with a GPU device even before working on the board.\\
 At this point I felt that I had the prerequisites to start working with the \textbf{ODROID}, so I began the work on the board.
 
 \subsection{Build of the runtime}
 The first challenge to tackle was the retrieval and compilation of the OpenCL runtimes.\\
-The runtime for the \textbf{Mali GPU} is already provided in the Hardkernel repository, so a simple \lstinline{sudo apt-get install mali-fbdev} does the trick.
-For what concenrs the Pocl runtime instead we need to start from scratch.\\
-The first thing to do is to retrieve the last version of the OpenCL runtime (currently version 0.14) from the \href{http://portablecl.org/downloads/pocl-0.14.tar.gz}{website}.
-The next thing to do is to decompress the archive of with simple \lstinline{tar xvfz pocl-0.14.tar.gz}.\\
-Pocl take adavante of \textbf{LLVM} to build itself, so we need to install a few dependencies from the package manager before being able to compile it. We can find at the \href{http://portablecl.org/docs/html/install.html}{dedicated page} on the official wiki a list of all the packages needed for the build. Basically we need LLVM and a bunch of development package of it, CMake to build the Makefiles, the standard utilities for compiling (gcc, lex, bison), and some packages to have an Installable client driver (\textbf{ICD}) to be able to load the appropriate OpenCL at runtime.\\
+The runtime for the \textbf{Mali GPU} is already provided in the Hardkernel repository, so a simple \lstinline{sudo apt-get install mali-fbdev} does the trick.\\
+For what concerns the pocl runtime instead we need to start from scratch.\\
+The first thing to do is to retrieve the latest version of the OpenCL runtime (currently the latest available version is 0.14) from the \href{http://portablecl.org/downloads/pocl-0.14.tar.gz}{website}.
+The next thing to do is to decompress the archive with a simple \lstinline{tar xvfz pocl-0.14.tar.gz}.\\
+Pocl takes advantage of \textbf{LLVM} to build itself, so we need to install a few dependencies from the package manager before being able to compile it. We can find at the \href{http://portablecl.org/docs/html/install.html}{dedicated page} on the official wiki a list of all the packages needed for the build. Basically we need LLVM and a bunch of its development packages, CMake to generate the Makefiles, the standard utilities for compiling (gcc, lex, bison), and some packages to have an Installable Client Driver (\textbf{ICD}), in order to be able to load the appropriate OpenCL implementation at runtime.\\
 What we need to do on our system is basically:
 \bigskip
 
@@ -30,30 +32,37 @@ sudo apt-get install -y vim build-essential flex bison libtool libncurses5* git-
 \end{lstlisting}
 \bigskip
 
-At this point we can proceed and build pocl. To to that we enter the directory with the sources and create a folder called \textit{build} in which we will have all the compiled stuff. At this point we take advantage of \textbf{CMake} for actually preparing our folder for the build. Usually a \lstinline{cmake ../} should suffice, but on the ODROID we have a little problem.\\
+At this point we can proceed and build pocl. To do that we enter the directory with the sources and create a folder called \textit{build} in which we will have all the compiled stuff. At this point we take advantage of \textbf{CMake} for actually preparing our folder for the build. Usually a \lstinline{cmake ../} should suffice, but on the ODROID we have a little problem.\\
+\smallskip
+\\
 Since our CPU is composed of four cortex a7 and four cortex a15 cores, CMake can't by itself understand what is the target CPU to use for the build. Luckily the two types of cores shares the \textbf{same ISA}, so we can explicitly tell CMake to use the cortex a15 as a target type of cpu. All we have to do is to launch \lstinline{cmake -DLLC\_HOST\_CPU=cortex-a15 ../} .\\
 At this point we are ready for the build, just type \lstinline{make -j8} and we are done. We can also run some tests with \lstinline{ctest -j8}, just to be sure that everything went smooth, and finally install the runtime in the system with \lstinline{sudo make install}. At this point if everything went fine we will have a \lstinline{pocl.icd} file in \lstinline{/etc/OpenCL/vendors/}, and running \lstinline{clinfo} we should be able to see our brand new OpenCL runtime.\\
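+For reference, the whole sequence described above boils down to the following (assuming the archive extracts to a \lstinline{pocl-0.14} folder):
+\begin{lstlisting}
+  tar xvfz pocl-0.14.tar.gz
+  cd pocl-0.14 && mkdir build && cd build
+  cmake -DLLC_HOST_CPU=cortex-a15 ../
+  make -j8
+  ctest -j8          # optional sanity check
+  sudo make install  # installs pocl.icd in /etc/OpenCL/vendors/
+\end{lstlisting}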
 
-Additionally in order to be able to use the runtime for the \textbf{Mali GPU} we additionally need to place a file containing:
+Additionally in order to be able to use the runtime for the \textbf{Mali GPU} we need to place a file named \lstinline{mali.icd} containing:
 
 \begin{lstlisting}
   /usr/lib/arm-linux-gnueabihf/mali-egl/libOpenCL.so
 \end{lstlisting}
 
-in a file named \lstinline{mali.icd} at the path \lstinline{/etc/OpenCL/vendors/}.\\
-This should conclude the part regarding the OpenCL runtime deploy, and at this point we should be able to see both the CPU Pocl platform with an eight core device and the Mali GPU platform with two devices of four and two cores respectively.
+at the path \lstinline{/etc/OpenCL/vendors/}.\\
+This concludes the part regarding the OpenCL runtime deployment: invoking \lstinline{clinfo} at this point, we should be able to see both the CPU pocl platform with an eight-core device and the Mali GPU platform with two devices, of four and two cores respectively.
 
 \subsection{Build of the power measurement utility}
+\label{sec:smartpower}
 At this point we should get and compile the utility for measuring the power consumption of the board. The utility used is a modified version of the official utility provided by Hardkernel, that simply stores the consumption detected in a csv file, that we can later use for results analysis and plotting.
-For building the utility we start from \href{https://bitbucket.org/zanella_michele/odroid_smartpower_bridge}{this repository}.\\
-The use of the utility has been kindly granted to me by \textit{Michele Zanella}, who is the main maintainer of the utility. He also helped me understanding how to make the utility work on the board, and he helped me debugging a problem with the setup of the USB interface and kindly agreed to publish on his repository a dedicated branch were all the unnecessary Qt dependencies have been removed.\\
+For building the utility we start from this repository \cite{powerrepo}.\\
+\smallskip
+\\
+The use of the utility has been kindly granted to me by \textit{Michele Zanella}, who is its main maintainer. He also helped me understand how to make the utility work on the board and debug a problem with the setup of the USB interface, and kindly agreed to publish on his repository a dedicated branch where all the unnecessary Qt dependencies have been removed.\\
 As first step we can retrieve the repository with the following bash command:
 
 \begin{lstlisting}
   git clone https://bitbucket.org/zanella_michele/odroid_smartpower_bridge
 \end{lstlisting}
 
-At this point we should switch to the \textbf{no\_qt} branch with a simple \lstinline{git checkout no_qt}. In this branch all the non essential dependencies to Qt libraries have been removed, in order to avoid cluttering the board with the full KDE framework for just storing an integer representing the consumption. Of course if we want to have available the original GUI interface we need to compile the version present on the \textbf{master} branch.\\
+At this point we should switch to the \textbf{no\_qt} branch with a simple \lstinline{git checkout no_qt}. In this branch all the non-essential dependencies on Qt libraries have been removed, in order to avoid cluttering the board with the full KDE framework just for storing an integer representing the consumption. Of course, if we want the original GUI interface available, we need to compile the version present on the \textbf{master} branch.\\
+\smallskip
+\\
 Unfortunately the HIDAPI library provided with the sources of the utility has been already compiled for x86 and stored in the repository, causing an error when trying to link the utility.\\
 To avoid this we need to recompile the library, by entering the HIDAPI folder and giving the following commands:
 
@@ -71,7 +80,7 @@ At this point enter the smartpower folder and compile the utility with:
   make
 \end{lstlisting}
 
-At this point we should have in the \lstinline{linux} folder a binary named \lstinline{SmartPower}, this self-contained binary is the utility that we need. Please take care to install also the dependencies necessary for building this utility, in particular \lstinline{qt4-qmake libqt4-dev libusb-1.0-0-dev}.\\
+Now we should have in the \lstinline{linux} folder a binary named \lstinline{SmartPower}; this self-contained binary is the utility that we need. Please also take care to install the dependencies necessary for building this utility, in particular \lstinline{qt4-qmake libqt4-dev libusb-1.0-0-dev}.\\
 
 In addition, in order to be able to communicate through USB to the device even if we are not root, we need to add a file name \lstinline{99-hiid.rules} in the path \lstinline{/etc/udev/rules.d/} containing the following:
 
@@ -80,27 +89,39 @@ In addition, in order to be able to communicate through USB to the device even i
 SUBSYSTEM=="usb", ATTRS{idVendor}=="04d8", ATTRS{idProduct}=="003f", MODE="0666"
 \end{lstlisting}
 
-Reached this point we should be able to take power measurements. To test it simply launch the \textbf{SmartPower} binary with as argument the file in which you want to store the results, let it run for a while and then stop it with a \textbf{SIGUSR1} signal. In the file you should find the power consumption (double check it with the display of the power measurement device). Also take into account that there is a known bug with the software, meaning that sometimes the utility is not able to retrieve the consumption and the process become a zombie process in the system. Take into consideration this if you have trouble in taking measurements, and before starting a new measurement please be sure that no other SmartPower process is running.
+\smallskip
+\\
+At this point we should be able to take power measurements. To test this, simply launch the \textbf{SmartPower} binary with the file in which you want to store the results as its argument, let it run for a while, and then stop it with a \textbf{SIGUSR1} signal. In the file you should find the power consumption (double check it against the display of the power measurement device). Also take into account that there is a known bug in the software: sometimes the utility is not able to retrieve the consumption and the process becomes a zombie in the system. Keep this in mind if you have trouble taking measurements, and before starting a new measurement please make sure that no other SmartPower process is running.\\
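+\smallskip
+\\
+A minimal measurement session then looks like this (file names are arbitrary; \lstinline{run-gpu} stands for the workload under measurement, here one of the run scripts described later):
+\begin{lstlisting}
+  ./SmartPower measurements.csv &   # start logging to the csv file
+  SP_PID=$!
+  ./run-gpu                         # workload under measurement
+  kill -SIGUSR1 $SP_PID             # stop the measurement
+\end{lstlisting}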
+\medskip
+\\
+I also had the new version of the SmartPower device, but unfortunately they changed the interface and it is no longer possible to read the measurements via USB with the utility.
 
 \subsection{Build of the benchmarks}
 For what concerns the benchmarks, we start from the vanilla \textbf{Rodinia 3.1} benchmark suite, taken directly from the site of Virginia University \cite{virginiawebsite} (you need to register on the site, and then you'll receive via mail a link to the real download page).
-Unfortunately the benchmarks are \textbf{not ready for running}.\\
+Unfortunately the benchmarks are \textbf{not ready for running out of the box}.\\
 Some of them presents some bugs, and you need to apply a lot of fixes and modifications to successfully run them on the ODROID. Since the modifications are really big (I estimate that making the benchmarks usable has in fact taken most of the time of the development of the project), I opted for creating a repository that I initialized with the sources of the benchmarks and on which I worked.\\
-You can find \textbf{the repository} at \href{http://gogs.heisenberg.ovh/andreagus/rodinia-benchmark.git}{this url}. There are multiple branches on the repository since I worked in parallel on CPU and GPU benchmarks to make them work, and later I tried to merge all the results in a single branch to use for the benchmarks.\\
+You can find \textbf{the repository} here \cite{rodiniarepo}. There are multiple branches on the repository since I worked in parallel on CPU and GPU benchmarks to make them work, and later I tried to merge all the results in a single branch to use for the benchmarks.\\
+\smallskip
+\\
 In addition to bugs and other problems the main difficulty was that the creator of the benchmarks \textbf{hard-coded} in the source the OpenCL platform, device and type of device to use. This meant that if you wanted to run benchmarks on different OpenCL devices you had to manually modify the source, recompile the benchmark and run it. At the beginning of the development I also followed this approach and specialized a different branch for running the benchmarks on CPU or GPU.\\
 But this approach bugged me, since the main advantage and the ultimate goal of having an OpenCL application should be to be able to run it on different devices and accelerators with the minimum effort possible. So in the end I modified heavily the benchmarks in order to take as parameter the platform, the device and the type of device to use. I then added different \textbf{run scripts} that contain the right parameters for each available device.\\
 In this way we \textbf{compile} the benchmarks \textbf{once}, and then at runtime we select the platform and device to use. The selection simply implies to use the \lstinline{run-cpu} or \lstinline{run-gpu} script. In this way we have the more \textit{transparent} interface as possible.
 
 \subsection{Work on the Benchmark Suite}
-In this section I'll try to explain what are the main problems that I found in trying running the Rodinia Suite, and how I overcame the problems.\\
+\label{sec:benchmark}
+In this section I'll try to explain the main problems that I found in trying to run the Rodinia Suite, and how I overcame them.\\
 As said previously I decided to create a new repository containing the benchmark sources in order to keep track of the work and have a better organization over all the code base.\\
 The first two steps where to initialize the repository with the original sources of the suite and then to remove all the \textbf{CUDA} and \textbf{OpenMP} related folders and references. I opted for this strategy and not for completely avoiding inserting them in the repository to facilitate keeping track of all the changes made at the code base, in the eventuality that in the future, when a new official release of Rodinia will be released, we want to re-apply all the changes.\\
+\smallskip
+\\
 The next problem to solve was the fact that all the benchmarks (with the exception of a couple) had hard-coded in the source code the OpenCL platform, device, and type of device to use, meaning that they always expected to find a GPU available on the platform and device with index zero.\\
 The first idea that came to my mind was to create two branches on the repository, one to use with CPU and one to use with GPU. I then proceeded to work in parallel on the two branches modifying the source code of the benchmark to use the right device. This approach worked and in the end I was able to run the benchmarks on the two different types of device.\\
-But this solution didn't really satisfied me, since was in some way \textbf{not coherent} with the OpenCL ultimate goals. Writing an application in OpenCL should give you the possibility to have a portable application that is able to run on different devices with the minimum effort possible. With the branches approach in order to switch from an executable for CPU to one for GPU we needed to switch between the branches a recompile the executable.
-In addition I find this kind of approach really not elegant since the setup and initialization of the OpenCL devices is all done at runtime, so there is not a particular reason for having those parameters hard-coded in the source code. We can in principle pass all those information at runtime when executing the benchmark. So I tried to make another step and, taking inspiration from the couple of benchmarks that already followed this kind of approach, I implemented a platform, device, and device type selection through passing different parameters to the command line.\\
-As a general guideline the convention is to specify a \textit{-p} and an index to specify the platform to use, a \textit{-d} and an index to specify the device, and a \textit{-g} and a boolean with the meaning of using or not a GPU.
-for example if we want to execute a benchmark on platform 0, device 1 and on GPU we need to pass something like this
+\smallskip
+\\
+But this solution didn't really satisfy me, since it was in some way \textbf{not coherent} with OpenCL's ultimate goals. Writing an application in OpenCL should give you the possibility of having a portable application able to run on different devices with the minimum effort possible. With the branches approach, in order to switch from an executable for CPU to one for GPU, we needed to switch between the branches and recompile the executable.
+In addition I find this kind of approach really inelegant, since the setup and initialization of the OpenCL devices is all done at runtime, so there is no particular reason to have those parameters hard-coded in the source code; we can in principle pass all that information at runtime when executing the benchmark. So I tried to take another step and, taking inspiration from the couple of benchmarks that already followed this kind of approach, I implemented platform, device, and device type selection by passing different parameters on the command line.\\
+As a general guideline, the convention is to specify \textit{-p} and an index to select the platform, \textit{-d} and an index to select the device, and \textit{-g} and a boolean indicating whether or not to use a GPU.
+For example, if we want to execute a benchmark on platform 0, device 1, and on GPU, we need to pass something like this:
 
 \begin{lstlisting}
   -p 0 -d 1 -g 1
@@ -111,27 +132,28 @@ Instead if we want to execute on platform 1, device 0 and on CPU we pass somethi
   -p 1 -d 0 -g 0
 \end{lstlisting}
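+
+Internally the selection boils down to an indexed lookup in the enumerated platforms and devices. A minimal sketch of the pattern (not the exact benchmark code; it assumes \lstinline{platform_idx}, \lstinline{device_idx} and \lstinline{use_gpu} have been parsed from the flags above):
+
+\begin{lstlisting}
+/* requires <CL/cl.h> and <stdlib.h> */
+cl_uint np;
+clGetPlatformIDs(0, NULL, &np);                /* query the count   */
+cl_platform_id *plats = malloc(np * sizeof(*plats));
+clGetPlatformIDs(np, plats, NULL);             /* then retrieve all */
+
+cl_device_type type = use_gpu ? CL_DEVICE_TYPE_GPU : CL_DEVICE_TYPE_CPU;
+cl_uint nd;
+clGetDeviceIDs(plats[platform_idx], type, 0, NULL, &nd);
+cl_device_id *devs = malloc(nd * sizeof(*devs));
+clGetDeviceIDs(plats[platform_idx], type, nd, devs, NULL);
+cl_device_id device = devs[device_idx];        /* the -d selection  */
+\end{lstlisting}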
 
-All this made possible the creation of different run scripts for the different types of execution. Look in the benchmarks folder to the various run-something scripts and see how we invoke the benchmark with different parameters in case we want to execute something on the Mali GPU or on the CPU.\\
+All this made it possible to create different run scripts for the different types of execution. Look in the benchmark folders for the various run-something scripts to see how we invoke the benchmark with different parameters depending on whether we want to execute it on the Mali GPU or on the CPU.\\
 In some situations was not possible to do this (parameters already taken or parameter parsing made in a way not compatible with this restructuring), and I'll specify this cases in each subsection explaining in detail the modifications made at the single benchmark. Also consider executing the benchmark binary without parameters (or with \lstinline{-help}) to get an usage summary with all the necessary flags.\\
 I'll now add a subsection for each benchmark trying to detail the modifications introduced with a brief explanation of them.
 
 \subsubsection{Backprop}
-The benchmark didn't use correctly the \lstinline{clGetPlatformIDs} primitive, not retrieving at all the platforms present on the system. Modified this and added parameter parsing for OpenCL stuff. In this case we need to specify the platform, device, and device type in this order without the selectors (e.g. \lstinline{-p}) since the already present argument parsing expects the parameters in a certain order without flags.
+The benchmark didn't use the \lstinline{clGetPlatformIDs} primitive correctly, not retrieving the platforms present on the system at all. Modified this and added parameter parsing for the OpenCL initialization.
 
 \subsubsection{Bfs}
-The benchmark sources imported a \textbf{timer} utility for debug purposes that consisted of ad-hoc X86 assembly instructions to get the time in different execution points. This obviously prevented the compilation on an ARM device. Removed this dependency since we time the execution in a different manner, so we do not use this mechanism. Also in this case the parameters parsing is done as in the Backprop benchmarks.
+The benchmark sources imported a \textbf{timer} utility for debug purposes that consisted of ad-hoc X86 assembly instructions to get the time at different execution points. This obviously prevented compilation on an ARM device. Removed this dependency, since we time the execution in a different manner and do not use this mechanism. The parameter parsing follows the general guidelines.
+
 
 \subsubsection{Cfd}
-This benchmark didn't compile for problems with the import of the \textit{rand()} function, so we fixed this. In addition the platform and device selection was not parametrized, so we also changed this. In this case we use the standard convention on the parameters as explained before.
+This benchmark didn't compile due to problems with the import of the \textit{rand()} function, so we fixed this. In addition the platform and device selection was not parametrized, so we changed this as well. In this case we use the standard parameter convention explained before.
 
 \subsubsection{Dwt2d}
-Implemented the device selection and fixed a bug with a \lstinline{char} variable not compatible with our architecture. Since the -d flag was already taken in this benchmark to specify the dimension we used -i for the device id specification.
+Implemented the device selection and fixed a bug with a \lstinline{char} variable not compatible with our architecture. Since the -d flag was already taken in this benchmark to specify the dimension we use -i for the device id specification.
 
 \subsubsection{Gaussian}
 This benchmark already presented a prototype of platform and device selection. I added the possibility to also select the device type and changed some minor details in the use of the OpenCL primitives.
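 A sketch of what the device type selection boils down to (illustrative, not the exact Gaussian code): the requested type is passed straight to \lstinline{clGetDeviceIDs}.
 \begin{lstlisting}[language=C]
 #include <CL/cl.h>
 
 cl_device_id pick_device(cl_platform_id platform,
                          cl_device_type type, /* CL_DEVICE_TYPE_CPU/_GPU */
                          cl_uint device_idx) {
     cl_uint n = 0;
     clGetDeviceIDs(platform, type, 0, NULL, &n); /* count devices */
     cl_device_id devices[8];
     if (n > 8) n = 8;            /* more than enough for our boards */
     clGetDeviceIDs(platform, type, n, devices, NULL);
     return devices[device_idx < n ? device_idx : 0];
 }
 \end{lstlisting}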
 
 \subsubsection{Heartwall}
-At first we implemented the device selection as in the other case, and reduced the work group size in order to be compatible with the board. Unfortunately in the end the execution on CPU always returned the \lstinline{CL_OUT_OF_HOST_MEMORY} error, and even with the minimum work group size the execution on CPU was not possible. I decided to disable and remove this benchmark since having only the data relative to the execution on GPU made no sense for the final comparative.
+At first I implemented the device selection as in the other cases, and reduced the work group size in order to be compatible with the board. Unfortunately the execution on the CPU always returned the \lstinline{CL_OUT_OF_HOST_MEMORY} error, and even with the minimum work group size the execution on the CPU was not possible. I decided to disable and remove this benchmark, since having only the data relative to the execution on the GPU made no sense for the final comparison.
 
 \subsubsection{Hotspot}
 In this case there was an additional problem with a work group size that was not compatible with the CPU device. I reduced this work group size and implemented the device selection as described before.
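 For the record, the supported sizes can also be queried at runtime instead of being hard-coded; a sketch of the idea (names are illustrative):
 \begin{lstlisting}[language=C]
 #include <CL/cl.h>
 
 /* Clamp our preferred local size to what the device and the
    compiled kernel actually support. */
 size_t pick_local_size(cl_device_id dev, cl_kernel k, size_t preferred) {
     size_t dev_max = 0, krn_max = 0;
     clGetDeviceInfo(dev, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                     sizeof(dev_max), &dev_max, NULL);
     clGetKernelWorkGroupInfo(k, dev, CL_KERNEL_WORK_GROUP_SIZE,
                              sizeof(krn_max), &krn_max, NULL);
     if (preferred > dev_max) preferred = dev_max;
     if (preferred > krn_max) preferred = krn_max;
     return preferred;
 }
 \end{lstlisting}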
@@ -143,11 +165,11 @@ In this benchmark implemented the device selection adding a parameter parsing ro
 In this case the only problem was with the platform retrieval, as in Backprop. I changed this and implemented the device selection as described before.
 
 \subsubsection{LavaMD}
-In this benchmarks we had multiple problems. The first thing was the work group size too big to be handled on our device, so we reduced this.\\
+In this benchmark there were multiple problems. The first was a work group size too big to be handled by our device, so I reduced it.\\
 The other, more subtle, problem was with the size of the parameters passed to the OpenCL kernel. Since the C++ \lstinline{long} type has different sizes on 32-bit and 64-bit architectures (32 and 64 bits respectively), while the \lstinline{long} type in OpenCL code is always 64 bits wide, during the execution of the benchmark we received strange errors indicating problems with the maximum size of the argument.\\
-At first I thought that simply the benchmark was not adequate to be run on this platform, but after receiving similar strange errors with other benchmark I decided to investigate more. After firing up \lstinline{gdb} and some tentatives to understand what caused the \lstinline{SEGFAULT} I decided to go for a step by step execution in parallel on two 32-bit and 64-bit devices. I finally found that the problem was with the \lstinline{clSetKernelArg()} function. In fact I noticed that the the parameter passed to the kernel were different in size, and the kernel always expected arguments multiple of 64-bit.\\
+At first I thought that the benchmark was simply not suited to run on this platform, but after receiving similar strange errors with other benchmarks I decided to investigate further. After firing up \lstinline{gdb} and some attempts to understand what caused the \lstinline{SEGFAULT}, I went for a step-by-step execution in parallel on a 32-bit and a 64-bit device. I finally found that the problem was with the \lstinline{clSetKernelArg()} function: the parameters passed to the kernel differed in size between the two architectures, while the kernel always expected 64-bit-wide arguments.\\
+Once I understood this, I changed the type of the C++ variables corresponding to the arguments from \lstinline{long} to \lstinline{long long}, fixing this bug.\\
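+In code, the problem and the fix look roughly like this (the names are illustrative, not the exact LavaMD variables; \lstinline{cl_long} is the host-side typedef that OpenCL guarantees to be 64 bits wide):
+\begin{lstlisting}[language=C]
+#include <CL/cl.h>
+
+void set_dim_arg(cl_kernel kernel, int boxes1d) {
+    /* Broken on the 32-bit ODROID: C `long` is 4 bytes there, while
+       the kernel's OpenCL `long` parameter is always 8 bytes wide. */
+    long dim = (long)boxes1d * boxes1d * boxes1d;
+    clSetKernelArg(kernel, 0, sizeof(long), &dim);        /* 4-byte arg */
+
+    /* Fix: `long long` (or cl_long) is 64-bit on both hosts. */
+    long long dim64 = (long long)boxes1d * boxes1d * boxes1d;
+    clSetKernelArg(kernel, 0, sizeof(long long), &dim64); /* 8-byte arg */
+}
+\end{lstlisting}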
-I find that this type of bug is really subtle, since for someone not knowing in detail the internals of OpenCL is really difficult to spot and solve. In some way this should be prevented with some coding convention, for example always using the \lstinline{long long} type for 64-bit wide variables. When writing an application that should be portable relying on behavior of the compiler for a specific architecture should not be acceptable.\\
+I find this type of bug really subtle, since for someone who doesn't know the internals of OpenCL in detail it is really difficult to spot and solve a situation like this. In some way this should be prevented by a coding convention, for example always using the \lstinline{long long} type for 64-bit-wide variables. When writing an application that should be portable, relying on the behavior of the compiler for a specific architecture should not be acceptable.\\
 Also in this benchmark we implemented the device selection as described before.
 
 \subsubsection{Leukocyte}
@@ -157,7 +179,7 @@ The first problem with this benchmark was an error with a Makefile target that p
 In this benchmark the main change was the introduction of the device selection as described before. I also fixed the use of the \lstinline{clGetPlatformIDs} primitive to get all the platforms available on the board.
 
 \subsubsection{Nn}
-Also in this case was already present a prototype of platform and device selection as for Nn. Changed some details on the initialization of the OpenCl context to take into account the addition of device type specification.
+Also in this case a prototype of platform and device selection was already present. I changed some details in the initialization of the OpenCL context to take into account the addition of the device type specification.
 
 \subsubsection{Nw}
 Also in this benchmark the main change was the implementation of the device selection; in doing this we also changed the parameter parsing for the already required parameters.
@@ -169,7 +191,7 @@ In this benchmark I mainly implemented the device selection. Take care that in t
 Implemented device selection following the guidelines defined before. In this case the task was a little difficult, since there are a lot of function activations between the parameter parsing and the actual OpenCL context initialization, so a lot of parameters are passed between the modules. The alternative was to use a global object to store the parameters, but I don't like this approach: in case of problems we can't simply debug by looking at the function parameters, but need to trace the state of a global object, which I find inelegant and prone to synchronization errors.
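 A sketch of this approach, with a small options struct passed explicitly down the call chain (names are illustrative):
 \begin{lstlisting}[language=C]
 #include <CL/cl.h>
 
 typedef struct {
     cl_uint        platform_idx;
     cl_uint        device_idx;
     cl_device_type device_type;
 } cl_options;
 
 /* Every module receives the options explicitly as a parameter. */
 static void init_opencl(const cl_options *opts) { (void)opts; /* ... */ }
 
 int main(int argc, char **argv) {
     cl_options opts = { 0, 0, CL_DEVICE_TYPE_GPU };
     /* ... fill opts from argv, then thread it through the modules ... */
     (void)argc; (void)argv;
     init_opencl(&opts);
     return 0;
 }
 \end{lstlisting}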
 
 \subsubsection{Srad}
-In this benchmark we needed to reduced the work group size to be compatible with the ODROID, and implemented the device selection as showed before.
+In this benchmark we needed to reduce the work group size to be compatible with the ODROID, and to implement the device selection as shown before.
 Also in this case the problem regarding the size of the kernel arguments manifested itself, so I changed the size to match the one the OpenCL kernel expects, as done in LavaMD.
 
 \subsubsection{Streamcluster}
@@ -178,9 +200,9 @@ Also in this benchmark we had the same problem already showed for LavaMD and Sra
 \subsubsection{Considerations valid for all the benchmarks}
 Please take into account that the code base of the benchmarks has probably been modified by many different developers, with different styles and approaches to the OpenCL framework.\\
 One problem that you can spot as soon as you look at a single commit is that there is no convention on the use of spaces or tabs (who would have guessed it?), so the code is often misaligned, presents trailing white-space, and is really awful to look at with the editor set up in the wrong way.\\
-To avoid cluttering the commits is a lot of blank space removals, substitutions of tabs with white-space I preferred to disable on my editor all mechanism that corrected this thing and leave the source code with misaligned lined but at least highlighting only the changes really made to the source.\\
-I then tried as much as possible all this things in a later commit that simply tries to fix all this things to obtain a source code not horrible.\\
-I apologize for this inconvenient and I ask you to not look at this problems withing the commits, but I preferred to keep them as little as possible to have a better chance to spot the real modifications made and to get lost in a commit with thousands of line added and removed to fix a tab.
+To avoid cluttering the commits with a lot of blank-space removals and tab-to-space substitutions, I preferred to disable in my editor all the mechanisms that correct these things, leaving the source code with misaligned lines but highlighting only the changes really made to the source.\\
+I then tried to fix these issues as much as possible in a later commit whose only purpose is to make the source code less horrible to look at.\\
+I apologize for this inconvenience and ask you not to mind these problems within the commits: I preferred to keep the commits as small as possible, to have a better chance of spotting the real modifications and to avoid getting lost in a commit with thousands of lines added and removed just to fix a tab.
 
 \subsection{Running the benchmarks}
 At this point we should have a working version of the benchmarks, and we can proceed to run them on our board.
@@ -193,9 +215,9 @@ As the names of the run scripts say:
 \end{itemize}
 We can also use the targets present in the Makefile inside the benchmark directory to conveniently run the whole sequence of benchmarks (a usage example follows the list). We have:
 \begin{itemize}
-  \item \lstinline{OPENCL_BENCHMARK_CPU} to run all the benchmarks on the cpu
-  \item \lstinline{OPENCL_BENCHMARK_GPU_PRIMARY} to run the benchmarks on the GPU device 1
-  \item \lstinline{OPENCL_BENCHMARK_GPU_SECONDARY} to run the benchmarks on the GPU device 1
+  \item \lstinline{OPENCL_BENCHMARK_CPU} to run all the benchmarks on the CPU
+  \item \lstinline{OPENCL_BENCHMARK_GPU_PRIMARY} to run the benchmarks on the GPU device 1 (4 cores)
+  \item \lstinline{OPENCL_BENCHMARK_GPU_SECONDARY} to run the benchmarks on the GPU device 2 (2 cores)
   \item \lstinline{OPENCL_BENCHMARK_ALL} to run the benchmarks on all the three previous devices
   \item \lstinline{OPENCL_BENCHMARK_GPU} to run the benchmarks on the GPU device 1 (kept for compatibility reasons)
 \end{itemize}
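 Assuming the targets are invoked from the benchmark directory, usage looks like this:
 \begin{lstlisting}
 make OPENCL_BENCHMARK_CPU    # all benchmarks, CPU only
 make OPENCL_BENCHMARK_ALL    # CPU and both GPU devices
 \end{lstlisting}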
@@ -207,4 +229,6 @@ The files are basically \textit{csv} files with three columns, and each record i
   \item the name of the benchmark
   \item the run time, expressed in seconds
   \item the energy consumption of the run, expressed in Watt-hours (see the sample record after the list)
-\end{itemize}
+\end{itemize}
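+
+To make the format concrete, here is a hypothetical pair of records (the values are invented, not measured):
+\begin{lstlisting}
+backprop,12.84,0.35
+bfs,7.02,0.19
+\end{lstlisting}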
+
+\pagebreak