
REPORT: Wrote part on work on the benchmark suite

Detailed in a dedicated section the modifications made to the benchmark
suite to make it usable on the ODROID.

Some fixes in the introduction part.

Added a bibliography file.
Andrea Gus 8 years ago
parent
revision
785a792139

+ 47 - 0
report/source/project_bibliography.bib

@@ -0,0 +1,47 @@
+@manual{poclwebsite,
+    title     = "Pocl Documentation",
+    url       = "http://portablecl.org/docs/html/",
+    keywords  = "OpenCL, pocl"
+}
+
+@online{viennawebsite,
+    title     = "ViennaCL Website",
+    url       = "http://viennacl.sourceforge.net/",
+    keywords  = "OpenCL, ViennaCL"
+}
+
+@online{beignetwebsite,
+    title     = "Beignet Website",
+    url       = "https://www.freedesktop.org/wiki/Software/Beignet/",
+    keywords  = "OpenCL, Beignet"
+}
+
+@online{llvmwebsite,
+    title     = "LLVM Website",
+    url       = "https://llvm.org/",
+    keywords  = "OpenCL, LLVM"
+}
+
+@online{virginiawebsite,
+    title     = "University of Virginia Website",
+    url       = "http://lava.cs.virginia.edu/Rodinia/download.htm",
+    keywords  = "OpenCL, Virginia University"
+}
+
+@online{hardkernelwebsite,
+    title     = "HardKernel Website",
+    url       = "http://www.hardkernel.com/",
+    keywords  = "OpenCL, Hardkernel, ODROID"
+}
+
+@online{etcherwebsite,
+    title     = "Etcher Website",
+    url       = "https://etcher.io/",
+    keywords  = "Flash, OS"
+}
+
+@online{gnuplotwebsite,
+    title     = "Gnuplot Website",
+    url       = "http://www.gnuplot.info/",
+    keywords  = "Plot"
+}

+ 7 - 7
report/source/sections/introduction.tex

@@ -8,16 +8,16 @@ Let's characterize more in detail the hardware and software used for the project
 \subsection{Hardware}
 
 \subsubsection{ARM Board}
-The main ARM board used for the project is a \textbf{ODROID-XU3} produced by \textbf{Hardkernel co., Ltd.} provided by the \textbf{HEAP lab}.\\
+The main ARM board used for the project is an \textbf{ODROID-XU3} produced by \textbf{Hardkernel co., Ltd.} \cite{hardkernelwebsite} and provided by the \textbf{HEAP lab}.\\
 It mounts a CPU belonging to the \textbf{Arm big.LITTLE} series, in particular a Samsung Exynos5422 Cortex\texttrademark-A15 2GHz and Cortex\texttrademark-A7 octa-core CPU.\\
 The GPU is a \textbf{Mali-T628 MP6} that is certified for OpenGL ES 3.1/2.0/1.1 and OpenCL 1.2 Full profile.\\
 This configuration of the board is equipped with \textbf{2Gbyte LPDDR3 RAM} PoP stacked.
-\href{http://www.hardkernel.com/main/products/prdt_info.php?g_code=g140448267127}{You can visit the product page for further details on the hardware}\\
+\href{http://www.hardkernel.com/main/products/prdt_info.php?g_code=g140448267127}{You can visit the product page for further details on the hardware.}\\
+I also used an \textbf{ODROID-XU4} of my own to advance the project during the summer. I opted for this model since the previous one wasn't available from the producer, and its SoC platform (CPU and GPU) is identical to the \textbf{ODROID-XU3}'s except for small differences in ports and integrated peripherals. I don't expect this to have influenced the results of the benchmarks, also because the final results proposed here have \textbf{always} been produced with the board present in the \textbf{laboratory}.\\
 There is a small chance that problems may arise from the fact that I mainly tested the auto-deployment scripts on my personal board during the summer, when the University was closed. Keep this in mind if there are problems with the deploy scripts: they may simply be due to differences in package names or similar details.
 
 \subsubsection{Power Measurement}
-For the energy consumption measurements I used the \href{http://www.hardkernel.com/main/products/prdt_info.php?g_code=G137361754360}{Hardkernel Smart Power} provided me in the laboratory. I also had available an \href{http://www.hardkernel.com/main/products/prdt_info.php?g_code=G148048570542}{Hardkernel Smart Power 2} but unfortunately it wasn't compatible with the measurement software(detailed in the software paragraph).
+For the energy consumption measurements I used the \href{http://www.hardkernel.com/main/products/prdt_info.php?g_code=G137361754360}{Hardkernel Smart Power} provided to me in the laboratory. I also had an \href{http://www.hardkernel.com/main/products/prdt_info.php?g_code=G148048570542}{Hardkernel Smart Power 2} available, but unfortunately it wasn't compatible with the measurement software (detailed explanation in the software paragraph).
 
 \subsubsection{x86 Platform}
 The comparison of performance with an \textbf{x86} platform has been made on a Thinkpad X1 Carbon 3rd gen., which mounts an \href{https://ark.intel.com/products/85212/Intel-Core-i5-5200U-Processor-3M-Cache-up-to-2_70-GHz}{\textbf{Intel i5 5200U CPU}} and 8 GB of RAM.
@@ -26,16 +26,16 @@ The comparison of performances with an \textbf{x86} platform have been made on a
 In this section we will describe the software components used for the development of the project.
 
 \subsubsection{OS}
-For what concerns the OS used during the development, I used the \textbf{Ubuntu 16.04.2 Kernel 4.9} image downloaded from the \href{http://odroid.com/dokuwiki/doku.php?id=en:xu3_release_linux_ubuntu_k49}{Hardkernel site}. I then used the suggested utility called \href{https://etcher.io/}{Etcher} to flash the image to the eMMC of the ODROID-XU4. I assume that also the flash of the ODROID-XU3 has been done in a similar way.
+For what concerns the OS used during the development, I used the \textbf{Ubuntu 16.04.2 Kernel 4.9} image downloaded from the \href{http://odroid.com/dokuwiki/doku.php?id=en:xu3_release_linux_ubuntu_k49}{Hardkernel site}. I then used the suggested utility \textbf{Etcher} \cite{etcherwebsite} to flash the image to the eMMC of the ODROID-XU4. I assume that the ODROID-XU3 was flashed in a similar way.
 
 \subsubsection{OpenCL Runtime}
 For the benchmarks we actually used two OpenCL runtimes. The one used for the integrated Mali GPU is provided in the repository directly by the Hardkernel developers, and can be installed via the \textbf{mali-fbdev} package.\\
-Instead for the CPU we manually fetched and compiled the runtime provided by the \href{http://portablecl.org/}{Portable Computing Language (\textbf{pocl})}, version 0.14.
+For the CPU, instead, we manually fetched and compiled the runtime provided by the \textbf{Portable Computing Language (pocl)} \cite{poclwebsite} working group, \href{http://portablecl.org/downloads/pocl-0.14.tar.gz}{version 0.14}.
 
 \subsubsection{Benchmark Suite}
-The benchmark suite used is the \href{https://www.cs.virginia.edu/~skadron/wiki/rodinia/index.php/Rodinia:Accelerating_Compute-Intensive_Applications_with_Accelerators}{\textbf{Rodinia Benchmark Suite}, version 3.1}. This suite includes a lot of benchmarks specifically designed for systems that provide accelerators, and thus belong to the \textbf{heterogeneous computer systems} category. In fact the benchmarks provides parallelization features for three of the main parallel computing paradigms, that are \textbf{OpenMP, CUDA, and OpenCL}. We will of course use only the OpenCL benchmarks. The project has been started and it is mantained by the Computer Science Department of \href{https://engineering.virginia.edu/departments/computer-science}{\textbf{University of Virginia}}
+The benchmark suite used is the \href{https://www.cs.virginia.edu/~skadron/wiki/rodinia/index.php/Rodinia:Accelerating_Compute-Intensive_Applications_with_Accelerators}{\textbf{Rodinia Benchmark Suite}, version 3.1}. This suite includes many benchmarks specifically designed for systems that provide accelerators, and thus belong to the \textbf{heterogeneous computer systems} category. In fact the benchmarks provide parallelization features for three of the main parallel computing paradigms: \textbf{OpenMP, CUDA, and OpenCL}. We will of course use only the OpenCL benchmarks. The project was started and is maintained by the Computer Science Department of the \textbf{University of Virginia} \cite{virginiawebsite}.
 
 \subsubsection{Result Analysis}
-For what concerns the gathering and the analysis of the results obtained by the benchmarks, I mainly take advantage of \textbf{Bash} and \textbf{Python} scripts to collect the results, and also of \textbf{Gnuplot} to create graphs representing the results.
+For what concerns the gathering and analysis of the results obtained by the benchmarks, I mainly take advantage of \textbf{Bash} and \textbf{Python (2)} scripts to collect the results, and of \textbf{Gnuplot} \cite{gnuplotwebsite} to create graphs representing them.
 
 \pagebreak

+ 101 - 8
report/source/sections/work.tex

@@ -59,20 +59,113 @@ At this point enter the smartpower folder and compile the utility with:
   make
 \end{lstlisting}
 
-At this point we should have in the \lstinline{linux} folder a binary named \lstinline{SmartPower} that is the utility that we need. Please take care to install also the dependencies necessary for building this utility, in particular \lstinline{qt4-qmake libqt4-dev libusb-1.0-0-dev}.\\
+At this point we should have in the \lstinline{linux} folder a binary named \lstinline{SmartPower}; this self-contained binary is the utility that we need. Please take care to also install the dependencies necessary for building this utility, in particular \lstinline{qt4-qmake libqt4-dev libusb-1.0-0-dev}.\\
 
-In addition, in order to be able to communicate through USB to the device even if we are not root we need to add a file name \lstinline{99-hiid.rules} in the path \lstinline{/etc/udev/rules.d/} containing the following:
+In addition, in order to be able to communicate with the device through USB even when we are not root, we need to add a file named \lstinline{99-hiid.rules} in the path \lstinline{/etc/udev/rules.d/} containing the following:
 
 \begin{lstlisting}
 #HIDAPI/libusb
 SUBSYSTEM=="usb", ATTRS{idVendor}=="04d8", ATTRS{idProduct}=="003f", MODE="0666"
 \end{lstlisting}
 
-Reached this point we should be able to take power measurements. To test it simply launch the \textbf{SmartPower} binary with as argument the file in which you want to store the results, let it run for a while and then stop it with a \textbf{SIGUSR1} signal. In the file you should find the power consumption (double check it with the display of the power measurement device). Also take into account that there is a known bug with the software, meaning that sometimes the utility is not able to retrieve the consumption and the process become a zombie process in the system. Take into consideration this if you have trouble in taking measurements.
+At this point we should be able to take power measurements. To test it, simply launch the \textbf{SmartPower} binary with the file in which you want to store the results as argument, let it run for a while, and then stop it with a \textbf{SIGUSR1} signal. In the file you should find the power consumption (double-check it against the display of the power measurement device). Also take into account that there is a known bug in the software: sometimes the utility is not able to retrieve the consumption and the process becomes a zombie. Keep this in mind if you have trouble taking measurements, and before starting a new measurement please make sure that no other SmartPower process is running.
 
 \subsection{Build of the benchmarks}
-For what concerns the benchmarks, we start from the vanilla Rodinia 3.1 benchmark suite, taken directly from \href{http://lava.cs.virginia.edu/Rodinia/download.htm}{the site} of Virginia University (you need to register on the site, and then you'll receive via mail a link to the real download page).
-Unfortunately the benchmarks are not ready for running. Some of them presents some bugs, and you need to actually apply a lot of fixes and modifications to successfully run them on the ODROID. Since the modifications are really big (I estimate that making the benchmarks usable has in fact taken most of the time of the development of the project), I opted for creating a repository that I initialized with the sources of the benchmarks and on which I worked. You can find the repository at \href{http://gogs.heisenberg.ovh/andreagus/rodinia-benchmark.git}{this url}. There are multiple branches on the repository since I worked in parallel on CPU and GPU benchmarks to make them work, and later I tried to merge all the results in a single branch to use for the benchmarks.\\
-In addition to bugs and other problems the main difficulty was that the creator of the benchmarks hardcoded in the source the OpenCL platform, device and type of device to use. This meant that if you wanted to run benchmarks on different OpenCL devices you had to manually modify the source, recompile the benchmark and run it. At the beginning of the development I also followed this approach and specializing a different branch for running the bencharks on CPU or GPU.\\
-But this approach bugged be since the main advantage and the ultimate goal of having and OpenCl application should be to be able to run it on different devices and accelerators with the minimum effort possible. So in the end I modified heavily the benchmarks in order to take as parameter the platform, the device and the type of device to use. I then added different run scripts that contain the right parameters for each available device.\\
-In this way we compile the benchmarks once, and then at runtime we select the platform and device to use. The selection simply implies to use the \lstinline{run-cpu} or \lstinline{run-gpu} script. In this way we have the more \textit{transparent} interface as possible.
+For what concerns the benchmarks, we start from the vanilla \textbf{Rodinia 3.1} benchmark suite, taken directly from the University of Virginia site \cite{virginiawebsite} (you need to register on the site, and you'll then receive via mail a link to the real download page).
+Unfortunately the benchmarks are \textbf{not ready for running}.\\
+Some of them present bugs, and you need to apply a lot of fixes and modifications to successfully run them on the ODROID. Since the modifications are really extensive (I estimate that making the benchmarks usable has in fact taken most of the development time of the project), I opted for creating a repository that I initialized with the sources of the benchmarks and on which I worked.\\
+You can find \textbf{the repository} at \href{http://gogs.heisenberg.ovh/andreagus/rodinia-benchmark.git}{this URL}. There are multiple branches in the repository, since I worked in parallel on the CPU and GPU benchmarks to make them work, and later merged all the results in a single branch to use for the benchmarks.\\
+In addition to bugs and other problems, the main difficulty was that the creators of the benchmarks \textbf{hard-coded} in the source the OpenCL platform, device, and type of device to use. This meant that if you wanted to run benchmarks on different OpenCL devices you had to manually modify the source, recompile the benchmark, and run it. At the beginning of the development I also followed this approach, specializing a different branch for running the benchmarks on CPU or GPU.\\
+But this approach bugged me, since the main advantage and ultimate goal of an OpenCL application should be the ability to run it on different devices and accelerators with the minimum effort possible. So in the end I heavily modified the benchmarks to take the platform, the device, and the type of device to use as parameters. I then added different \textbf{run scripts} that contain the right parameters for each available device.\\
+In this way we \textbf{compile} the benchmarks \textbf{once}, and then at runtime we select the platform and device to use. The selection simply means using the \lstinline{run-cpu} or \lstinline{run-gpu} script. In this way we have the most \textit{transparent} interface possible.
+
+\subsection{Work on the Benchmark Suite}
+In this section I'll try to explain the main problems that I found in trying to run the Rodinia Suite, and how I overcame them.\\
+As said previously, I decided to create a new repository containing the benchmark sources in order to keep track of the work and have a better organization of the whole code base.\\
+The first two steps were to initialize the repository with the original sources of the suite and then to remove all the \textbf{CUDA} and \textbf{OpenMP} related folders and references. I opted for this strategy, instead of avoiding inserting them in the repository altogether, to facilitate keeping track of all the changes made to the code base, in the eventuality that in the future, when a new official release of Rodinia comes out, we want to re-apply all the changes.\\
+The next problem to solve was the fact that all the benchmarks (with the exception of a couple) had hard-coded in the source code the OpenCL platform, device, and type of device to use, meaning that they always expected to find a GPU available on the platform and device with index zero.\\
+The first idea that came to my mind was to create two branches on the repository, one to use with CPU and one with GPU. I then proceeded to work in parallel on the two branches, modifying the source code of the benchmarks to use the right device. This approach worked, and in the end I was able to run the benchmarks on the two different types of device.\\
+But this solution didn't really satisfy me, since it was in some way \textbf{not coherent} with OpenCL's ultimate goals. Writing an application in OpenCL should give you a portable application able to run on different devices with the minimum effort possible. With the branches approach, in order to switch from an executable for CPU to one for GPU, we needed to switch between the branches and recompile the executable.
+In addition, I find this kind of approach not really elegant, since the setup and initialization of the OpenCL devices is all done at runtime, so there is no particular reason for having those parameters hard-coded in the source code. We can in principle pass all that information at runtime when executing the benchmark. So I tried to take another step and, taking inspiration from the couple of benchmarks that already followed this kind of approach, I implemented platform, device, and device type selection through parameters passed on the command line.\\
+As a general guideline, the convention is to specify \textit{-p} followed by an index to select the platform, \textit{-d} followed by an index to select the device, and \textit{-g} followed by a boolean indicating whether or not to use a GPU.
+For example, if we want to execute a benchmark on platform 0, device 1, and on GPU, we need to pass something like this:
+
+\begin{lstlisting}
+  -p 0 -d 1 -g 1
+\end{lstlisting}
+
+If instead we want to execute on platform 1, device 0, and on CPU, we pass something like this:
+\begin{lstlisting}
+  -p 1 -d 0 -g 0
+\end{lstlisting}
+
+All this made possible the creation of different run scripts for the different types of execution. Look in the benchmark folders at the various run-* scripts to see how we invoke the benchmark with different parameters depending on whether we want to execute on the Mali GPU or on the CPU.\\
+In some situations this was not possible (parameters already taken, or parameter parsing made in a way not compatible with this restructuring); I'll specify these cases in the subsections explaining in detail the modifications made to each benchmark. Also consider executing the benchmark binary without parameters (or with \lstinline{-help}) to get a usage summary with all the necessary flags.\\
+I'll now add a subsection for each benchmark, detailing the modifications introduced with a brief explanation of each.
+
+\subsubsection{Backprop}
+The benchmark didn't use the \lstinline{clGetPlatformIDs} primitive correctly, failing to retrieve the platforms present on the system. We fixed this and added parameter parsing for the OpenCL settings. In this case we need to specify the platform, device, and device type in this order without the selectors (e.g. \lstinline{-p}), since the already present argument parsing expects the parameters in a certain order without flags.
+
+\subsubsection{Bfs}
+The benchmark sources imported a \textbf{timer} utility for debug purposes that consisted of ad-hoc x86 assembly instructions to get the time at different execution points. This obviously prevented compilation on an ARM device. We removed this dependency, since we time the execution in a different manner and do not use this mechanism. Also in this case the parameter parsing is done as in the Backprop benchmark.
+
+\subsubsection{Cfd}
+This benchmark didn't compile due to problems with the import of the \textit{rand()} function, so we fixed this. In addition, the platform and device selection was not parametrized, so we also changed this. In this case we use the standard convention for the parameters as explained before.
+
+\subsubsection{Dwt2d}
+Implemented the device selection and fixed a bug with a \lstinline{char} variable that was not compatible with our architecture.
+
+\subsubsection{Gaussian}
+This benchmark already presented a prototype of platform and device selection. We added the possibility to also select the device type and changed some minor details in the use of the OpenCL primitives.
+
+\subsubsection{Heartwall}
+At first we implemented the device selection as in the other cases, and reduced the work group size in order to be compatible with the board. Unfortunately, in the end the execution on CPU always returned the \lstinline{CL_OUT_OF_HOST_MEMORY} error, and even with the minimum work group size the execution on CPU was not possible. I decided to disable and remove this benchmark, since having only the data relative to the execution on GPU made no sense for the final comparison.
+
+\subsubsection{Hotspot}
+In this case there was an additional problem with a work group size that was not compatible with the CPU device. We reduced this work group size and implemented the device selection as described before.
+
+\subsubsection{Hybridsort}
+In this benchmark we implemented the device selection by adding a parameter parsing routine that works alongside the already present argument parsing routine, since integrating the two was too problematic.
+
+\subsubsection{Kmeans}
+In this case the only problem was with the platform retrieval, as in Backprop. We changed this and implemented device selection as described before.
+
+\subsubsection{LavaMD}
+In this benchmark we had multiple problems. The first was a work group size too big to be handled on our device, so we reduced it.\\
+The other, more subtle problem was with the size of the parameters passed to the OpenCL kernel. Since the C++ \lstinline{long} type has different sizes on 32-bit and 64-bit architectures (respectively 32 and 64 bits), while the \lstinline{long} type in OpenCL code is always 64 bits wide, during the execution of the benchmark we received strange errors indicating problems with the maximum size of the argument.\\
+At first I thought that the benchmark was simply not adequate to be run on this platform, but after receiving similar strange errors with other benchmarks I decided to investigate more. After firing up \lstinline{gdb} and some attempts to understand what caused the \lstinline{SEGFAULT}, I decided to go for a step-by-step execution in parallel on a 32-bit and a 64-bit device. I finally found that the problem was with the \lstinline{clSetKernelArg()} function: I noticed that the parameters passed to the kernel differed in size, while the kernel always expected 64-bit-wide arguments.\\
+Once I understood this, I changed the C++ variables corresponding to the arguments from type \lstinline{long} to type \lstinline{long long}, fixing the bug.\\
+I find this type of bug really subtle, since for someone who doesn't know the internals of OpenCL in detail it is really difficult to spot and solve. In some way this should be prevented with a coding convention, for example always using the \lstinline{long long} type for 64-bit-wide variables. When writing an application that should be portable, relying on the behavior of the compiler for a specific architecture should not be acceptable.\\
+Also in this benchmark we implemented the device selection as described before.
+
+\subsubsection{Leukocyte}
+The first problem with this benchmark was an error in a Makefile target that prevented compilation altogether: Make was erroneously trying to also compile the header files, resulting in an error when linking the final executable. Once this was fixed, the other problems encountered were with the work group size, which needed to be reduced. In addition, the initialization of the OpenCL context was done in an ad-hoc way that was not really functional, so I rewrote it in a more standard way.
+
+\subsubsection{Lud}
+In this benchmark the main change was the introduction of the device selection as described before. We also fixed the use of the \lstinline{clGetPlatformIDs} primitive to get all the platforms available on the board.
+
+\subsubsection{Nn}
+Also in this case a prototype of platform and device selection was already present. We changed some details of the initialization of the OpenCL context to take into account the addition of the device type specification.
+
+\subsubsection{Nw}
+Also in this benchmark the main change was the implementation of the device selection, and in doing this we also changed the parameter parsing for the already required parameters.
+
+\subsubsection{Particlefilter}
+In this benchmark I mainly implemented the device selection. Take care that in this case the argument order is important, for compatibility reasons with the argument parsing already in place.
+
+\subsubsection{Pathfinder}
+Implemented device selection following the guidelines defined before. In this case the task was a little more difficult, since there are a lot of function calls between the parameter parsing and the actual OpenCL context initialization, so a lot of parameters are passed between the modules. The alternative was to use a global object to store the parameters, but I don't like this approach: in case of problems we can't simply look at the function parameters while debugging, but need to trace the state of a global object, which I find inelegant and prone to synchronization errors.
+
+\subsubsection{Srad}
+In this benchmark we needed to reduce the work group size to be compatible with the ODROID, and implemented the device selection as shown before.
+Also in this case the problem regarding the size of the kernel arguments manifested itself, so we changed the size to match the one that the OpenCL kernel expects, as done in LavaMD.
+
+\subsubsection{Streamcluster}
+Also in this benchmark we had the same problem already shown for LavaMD and Srad with the size of the kernel arguments. We fixed this and implemented the device selection, also fixing another bug in the initialization of the OpenCL context with the \lstinline{clCreateContextFromType()} primitive.
+
+\subsubsection{Considerations valid for all the benchmarks}
+Please keep in mind that the code base of the benchmarks has probably been modified by a lot of different developers, with different styles and approaches to the OpenCL framework.\\
+One problem that you can spot as soon as you look at a single commit is that there is no convention on the use of spaces or tabs (who would have guessed it?), so the code is often misaligned, presents trailing white-space, and is really awful to look at with the editor set up in the wrong way.\\
+To avoid cluttering the commits with a lot of blank-space removals and substitutions of tabs with white-space, I preferred to disable in my editor all the mechanisms that correct these things, leaving the source code with misaligned lines but highlighting only the changes really made to the source.\\
+I then fixed as many of these issues as possible in a later commit that simply cleans all this up, to obtain source code that is not horrible to read.\\
+I apologize for this inconvenience and ask you not to focus on these problems within the commits: I preferred to keep the commits as small as possible, to have a better chance of spotting the real modifications made, instead of getting lost in a commit with thousands of lines added and removed just to fix a tab.