|
@@ -1,24 +1,26 @@
|
|
|
\section{Summary Of The Work}
|
|
|
|
|
|
\subsection{Becoming familiar with the OpenCL framework}
|
|
|
-Before starting the project I never worked with OpenCL, so before starting the work I decided to research information through the documentation available online.
|
|
|
+Before starting the project I had never worked with OpenCL, so I decided to research the documentation available online to get a grasp of how an OpenCL application works. I used the documentation provided by the \textbf{Khronos Group} \cite{khronoswebsite} as the main source of information about the C++ OpenCL Wrapper API.\\
|
|
|
In the meantime I tried to compile and experiment with \textbf{pocl} on my laptop, just to understand how to build an OpenCL application and run it on the hardware.
|
|
|
My main reference has been the documentation \cite{poclwebsite} on the pocl project website.\\
|
|
|
I had some previous experience with the \textbf{LLVM} framework \cite{llvmwebsite}, which pocl uses for compiling the runtime, so this part was not too difficult to manage, also because the required version of LLVM (3.8) is the default version shipped by the Ubuntu distribution, and in any case I already had it compiled on my machine.\\
|
|
|
Once I had the runtime compiled and ready for my laptop, I moved to becoming familiar with the designated benchmark suite.\\
|
|
|
-The first impact with the benchmark suite has been a little problematic since, for reasons that will be more clear when reading the section dedicated to the modifications made at the \textbf{Rodinia Benchmark Suite}, the suite is tailored for running on GPU, and since the pocl runtime on my laptop only exposed a CPU device, I wasn't able to run a single benchmark, and not having yet developed the skills necessary to debug and work with the C++ OpenCL Wrapper API, I was having some difficulties.\\
|
|
|
+\medskip
|
|
|
|
|
|
+My first encounter with the benchmark suite was a little problematic. For reasons that will become clearer in the section dedicated to the modifications made to the \textbf{Rodinia Benchmark Suite} (Section \ref{sec:benchmark}), the suite is tailored for running on GPUs, and since the pocl runtime on my laptop only exposed a CPU device, I wasn't able to run even a single benchmark out of the box; not having yet developed the skills necessary to debug and work with the C++ OpenCL Wrapper API, I was having some difficulties.\\
|
|
|
For this reason I decided to begin with something simpler and looked online for other benchmark suites. After a short search I found the \textbf{ViennaCL} \cite{viennawebsite} suite.\\
|
|
|
This time things went better: after some experiments and attempts I managed to run some benchmarks of the suite on my laptop, and by reading the code I began to understand how an OpenCL platform is initialized and run.\\
|
|
|
-Also during the documentation phase I become aware of the existence of the \textbf{Beignet} project, an Open Source OpenCL implementation to support the integrated GPUs on Intel chipset, so I had the opportunity to experiment a little also with a GPU device even before working on the board.\\
|
|
|
+During the documentation phase I became aware of the existence of the \textbf{Beignet} project, an Open Source OpenCL implementation supporting the integrated GPUs on Intel chipsets, so I had the opportunity to experiment a little with a GPU device even before working on the board.\\
|
|
|
At this point I felt that I had the prerequisites to start working with the \textbf{ODROID}, so I began the work on the board.
|
|
|
|
|
|
\subsection{Build of the runtime}
|
|
|
The first challenge to tackle was the retrieval and compilation of the OpenCL runtimes.\\
|
|
|
-The runtime for the \textbf{Mali GPU} is already provided in the Hardkernel repository, so a simple \lstinline{sudo apt-get install mali-fbdev} does the trick.
|
|
|
-For what concenrs the Pocl runtime instead we need to start from scratch.\\
|
|
|
-The first thing to do is to retrieve the last version of the OpenCL runtime (currently version 0.14) from the \href{http://portablecl.org/downloads/pocl-0.14.tar.gz}{website}.
|
|
|
-The next thing to do is to decompress the archive of with simple \lstinline{tar xvfz pocl-0.14.tar.gz}.\\
|
|
|
-Pocl take adavante of \textbf{LLVM} to build itself, so we need to install a few dependencies from the package manager before being able to compile it. We can find at the \href{http://portablecl.org/docs/html/install.html}{dedicated page} on the official wiki a list of all the packages needed for the build. Basically we need LLVM and a bunch of development package of it, CMake to build the Makefiles, the standard utilities for compiling (gcc, lex, bison), and some packages to have an Installable client driver (\textbf{ICD}) to be able to load the appropriate OpenCL at runtime.\\
|
|
|
+The runtime for the \textbf{Mali GPU} is already provided in the Hardkernel repository, so a simple \lstinline{sudo apt-get install mali-fbdev} does the trick.\\
|
|
|
+For what concerns the pocl runtime, instead, we need to start from scratch.\\
|
|
|
+The first thing to do is to retrieve the latest version of the OpenCL runtime (currently 0.14) from the \href{http://portablecl.org/downloads/pocl-0.14.tar.gz}{website}.
|
|
|
+The next thing to do is to decompress the archive with a simple \lstinline{tar xvfz pocl-0.14.tar.gz}.\\
|
|
|
+Pocl takes advantage of \textbf{LLVM} to build itself, so we need to install a few dependencies from the package manager before being able to compile it. The \href{http://portablecl.org/docs/html/install.html}{dedicated page} on the official wiki lists all the packages needed for the build. Basically we need LLVM and a bunch of its development packages, CMake to generate the Makefiles, the standard compilation utilities (gcc, lex, bison), and some packages providing an Installable Client Driver (\textbf{ICD}) loader, in order to be able to load the appropriate OpenCL implementation at runtime.\\
|
|
|
What we need to do on our system is basically:
|
|
|
\bigskip
|
|
|
|
|
@@ -30,30 +32,37 @@ sudo apt-get install -y vim build-essential flex bison libtool libncurses5* git-
|
|
|
\end{lstlisting}
|
|
|
\bigskip
|
|
|
|
|
|
-At this point we can proceed and build pocl. To to that we enter the directory with the sources and create a folder called \textit{build} in which we will have all the compiled stuff. At this point we take advantage of \textbf{CMake} for actually preparing our folder for the build. Usually a \lstinline{cmake ../} should suffice, but on the ODROID we have a little problem.\\
|
|
|
+At this point we can proceed and build pocl. To do that we enter the directory with the sources and create a folder called \textit{build} that will contain all the build artifacts. We then take advantage of \textbf{CMake} to actually prepare our folder for the build. Usually a \lstinline{cmake ../} should suffice, but on the ODROID we have a little problem.\\
|
|
|
+\smallskip
|
|
|
|
|
|
Since our CPU is composed of four Cortex-A7 and four Cortex-A15 cores, CMake can't by itself determine which target CPU to use for the build. Luckily the two types of cores share the \textbf{same ISA}, so we can explicitly tell CMake to use the Cortex-A15 as the target CPU. All we have to do is to launch \lstinline{cmake -DLLC_HOST_CPU=cortex-a15 ../} .\\
|
|
|
At this point we are ready for the build: just type \lstinline{make -j8} and we are done. We can also run some tests with \lstinline{ctest -j8}, just to be sure that everything went smoothly, and finally install the runtime in the system with \lstinline{sudo make install}. If everything went fine we will have a \lstinline{pocl.icd} file in \lstinline{/etc/OpenCL/vendors/}, and running \lstinline{clinfo} we should be able to see our brand new OpenCL runtime.\\
|
|
|
|
|
|
-Additionally in order to be able to use the runtime for the \textbf{Mali GPU} we additionally need to place a file containing:
|
|
|
+Additionally, in order to use the runtime for the \textbf{Mali GPU}, we need to place a file named \lstinline{mali.icd} containing:
|
|
|
|
|
|
\begin{lstlisting}
|
|
|
/usr/lib/arm-linux-gnueabihf/mali-egl/libOpenCL.so
|
|
|
\end{lstlisting}
|
|
|
|
|
|
-in a file named \lstinline{mali.icd} at the path \lstinline{/etc/OpenCL/vendors/}.\\
|
|
|
-This should conclude the part regarding the OpenCL runtime deploy, and at this point we should be able to see both the CPU Pocl platform with an eight core device and the Mali GPU platform with two devices of four and two cores respectively.
|
|
|
+at the path \lstinline{/etc/OpenCL/vendors/}.\\
|
|
|
+This should conclude the part regarding the OpenCL runtime deployment: at this point, invoking \lstinline{clinfo}, we should be able to see both the CPU pocl platform with an eight-core device and the Mali GPU platform with two devices of four and two cores respectively.
|
|
|
|
|
|
\subsection{Build of the power measurement utility}
|
|
|
+\label{sec:smartpower}
|
|
|
At this point we should get and compile the utility for measuring the power consumption of the board. The utility used is a modified version of the official utility provided by Hardkernel, which simply stores the detected consumption in a CSV file that we can later use for results analysis and plotting.
|
|
|
-For building the utility we start from \href{https://bitbucket.org/zanella_michele/odroid_smartpower_bridge}{this repository}.\\
|
|
|
-The use of the utility has been kindly granted to me by \textit{Michele Zanella}, who is the main maintainer of the utility. He also helped me understanding how to make the utility work on the board, and he helped me debugging a problem with the setup of the USB interface and kindly agreed to publish on his repository a dedicated branch were all the unnecessary Qt dependencies have been removed.\\
|
|
|
+For building the utility we start from this repository \cite{powerrepo}.\\
|
|
|
+\smallskip
|
|
|
|
|
|
+The use of the utility has been kindly granted to me by \textit{Michele Zanella}, who is its main maintainer. He also helped me understand how to make the utility work on the board, helped me debug a problem with the setup of the USB interface, and kindly agreed to publish on his repository a dedicated branch where all the unnecessary Qt dependencies have been removed.\\
|
|
|
As a first step we can retrieve the repository with the following bash command:
|
|
|
|
|
|
\begin{lstlisting}
|
|
|
git clone https://bitbucket.org/zanella_michele/odroid_smartpower_bridge
|
|
|
\end{lstlisting}
|
|
|
|
|
|
-At this point we should switch to the \textbf{no\_qt} branch with a simple \lstinline{git checkout no_qt}. In this branch all the non essential dependencies to Qt libraries have been removed, in order to avoid cluttering the board with the full KDE framework for just storing an integer representing the consumption. Of course if we want to have available the original GUI interface we need to compile the version present on the \textbf{master} branch.\\
|
|
|
+At this point we should switch to the \textbf{no\_qt} branch with a simple \lstinline{git checkout no_qt}. In this branch all the non-essential dependencies on Qt libraries have been removed, in order to avoid cluttering the board with the full KDE framework just for storing an integer representing the consumption. Of course, if we want the original GUI interface, we need to compile the version present on the \textbf{master} branch.\\
|
|
|
+\smallskip
|
|
|
|
|
|
Unfortunately the HIDAPI library provided with the sources of the utility was compiled for x86 and stored in the repository, causing an error when trying to link the utility on ARM.\\
|
|
|
To avoid this we need to recompile the library, by entering the HIDAPI folder and giving the following commands:
|
|
|
|
|
@@ -71,7 +80,7 @@ At this point enter the smartpower folder and compile the utility with:
|
|
|
make
|
|
|
\end{lstlisting}
|
|
|
|
|
|
-At this point we should have in the \lstinline{linux} folder a binary named \lstinline{SmartPower}, this self-contained binary is the utility that we need. Please take care to install also the dependencies necessary for building this utility, in particular \lstinline{qt4-qmake libqt4-dev libusb-1.0-0-dev}.\\
|
|
|
+Now we should have in the \lstinline{linux} folder a binary named \lstinline{SmartPower}; this self-contained binary is the utility that we need. Please take care to also install the dependencies necessary for building it, in particular \lstinline{qt4-qmake libqt4-dev libusb-1.0-0-dev}.\\
|
|
|
|
|
|
In addition, in order to be able to communicate through USB with the device even if we are not root, we need to add a file named \lstinline{99-hiid.rules} in the path \lstinline{/etc/udev/rules.d/} containing the following:
|
|
|
|
|
@@ -80,27 +89,39 @@ In addition, in order to be able to communicate through USB to the device even i
|
|
|
SUBSYSTEM=="usb", ATTRS{idVendor}=="04d8", ATTRS{idProduct}=="003f", MODE="0666"
|
|
|
\end{lstlisting}
|
|
|
|
|
|
-Reached this point we should be able to take power measurements. To test it simply launch the \textbf{SmartPower} binary with as argument the file in which you want to store the results, let it run for a while and then stop it with a \textbf{SIGUSR1} signal. In the file you should find the power consumption (double check it with the display of the power measurement device). Also take into account that there is a known bug with the software, meaning that sometimes the utility is not able to retrieve the consumption and the process become a zombie process in the system. Take into consideration this if you have trouble in taking measurements, and before starting a new measurement please be sure that no other SmartPower process is running.
|
|
|
+\smallskip
|
|
|
|
|
|
+Reached this point we should be able to take power measurements. To test it, simply launch the \textbf{SmartPower} binary with the file in which you want to store the results as argument, let it run for a while, and then stop it with a \textbf{SIGUSR1} signal. In the file you should find the power consumption (double check it against the display of the power measurement device). Also take into account that there is a known bug in the software: sometimes the utility is not able to retrieve the consumption and the process becomes a zombie process in the system. Keep this in mind if you have trouble taking measurements, and before starting a new measurement please be sure that no other SmartPower process is running.\\
|
|
|
+\medskip
|
|
|
|
|
|
+I also had the new version of the SmartPower device, but unfortunately its interface has changed, and it is no longer possible to read the measurements via USB with the utility.
|
|
|
|
|
|
\subsection{Build of the benchmarks}
|
|
|
For what concerns the benchmarks, we start from the vanilla \textbf{Rodinia 3.1} benchmark suite, taken directly from the University of Virginia website \cite{virginiawebsite} (you need to register on the site, and then you'll receive via mail a link to the real download page).
|
|
|
-Unfortunately the benchmarks are \textbf{not ready for running}.\\
|
|
|
+Unfortunately the benchmarks are \textbf{not ready to run out of the box}.\\
|
|
|
Some of them present bugs, and you need to apply a lot of fixes and modifications to successfully run them on the ODROID. Since the modifications are really extensive (I estimate that making the benchmarks usable has in fact taken most of the development time of the project), I opted for creating a repository that I initialized with the sources of the benchmarks and on which I worked.\\
|
|
|
-You can find \textbf{the repository} at \href{http://gogs.heisenberg.ovh/andreagus/rodinia-benchmark.git}{this url}. There are multiple branches on the repository since I worked in parallel on CPU and GPU benchmarks to make them work, and later I tried to merge all the results in a single branch to use for the benchmarks.\\
|
|
|
+You can find \textbf{the repository} here \cite{rodiniarepo}. There are multiple branches on the repository since I worked in parallel on CPU and GPU benchmarks to make them work, and later I tried to merge all the results in a single branch to use for the benchmarks.\\
|
|
|
+\smallskip
|
|
|
|
|
|
In addition to bugs and other problems the main difficulty was that the creator of the benchmarks \textbf{hard-coded} in the source the OpenCL platform, device and type of device to use. This meant that if you wanted to run benchmarks on different OpenCL devices you had to manually modify the source, recompile the benchmark and run it. At the beginning of the development I also followed this approach and specialized a different branch for running the benchmarks on CPU or GPU.\\
|
|
|
But this approach bugged me, since the main advantage and the ultimate goal of having an OpenCL application should be the ability to run it on different devices and accelerators with the minimum effort possible. So in the end I heavily modified the benchmarks in order to take as parameters the platform, the device, and the type of device to use. I then added different \textbf{run scripts} that contain the right parameters for each available device.\\
|
|
|
In this way we \textbf{compile} the benchmarks \textbf{once}, and then at runtime we select the platform and device to use. The selection simply means using the \lstinline{run-cpu} or \lstinline{run-gpu} script. In this way we have the most \textit{transparent} interface possible.
|
|
|
|
|
|
\subsection{Work on the Benchmark Suite}
|
|
|
-In this section I'll try to explain what are the main problems that I found in trying running the Rodinia Suite, and how I overcame the problems.\\
|
|
|
+\label{sec:benchmark}
|
|
|
+In this section I'll try to explain the main problems that I found in trying to run the Rodinia Suite, and how I overcame them.\\
|
|
|
As said previously I decided to create a new repository containing the benchmark sources in order to keep track of the work and have a better organization over all the code base.\\
|
|
|
The first two steps were to initialize the repository with the original sources of the suite and then to remove all the \textbf{CUDA} and \textbf{OpenMP} related folders and references. I opted for this strategy, instead of not inserting them in the repository at all, to facilitate keeping track of all the changes made to the code base, in the eventuality that in the future, when a new official release of Rodinia comes out, we want to re-apply all the changes.\\
|
|
|
+\smallskip
|
|
|
|
|
|
The next problem to solve was the fact that all the benchmarks (with the exception of a couple) had hard-coded in the source code the OpenCL platform, device, and type of device to use, meaning that they always expected to find a GPU available on the platform and device with index zero.\\
|
|
|
The first idea that came to my mind was to create two branches on the repository, one to use with the CPU and one with the GPU. I then proceeded to work in parallel on the two branches, modifying the source code of the benchmarks to use the right device. This approach worked, and in the end I was able to run the benchmarks on the two different types of device.\\
|
|
|
-But this solution didn't really satisfied me, since was in some way \textbf{not coherent} with the OpenCL ultimate goals. Writing an application in OpenCL should give you the possibility to have a portable application that is able to run on different devices with the minimum effort possible. With the branches approach in order to switch from an executable for CPU to one for GPU we needed to switch between the branches a recompile the executable.
|
|
|
-In addition I find this kind of approach really not elegant since the setup and initialization of the OpenCL devices is all done at runtime, so there is not a particular reason for having those parameters hard-coded in the source code. We can in principle pass all those information at runtime when executing the benchmark. So I tried to make another step and, taking inspiration from the couple of benchmarks that already followed this kind of approach, I implemented a platform, device, and device type selection through passing different parameters to the command line.\\
|
|
|
-As a general guideline the convention is to specify a \textit{-p} and an index to specify the platform to use, a \textit{-d} and an index to specify the device, and a \textit{-g} and a boolean with the meaning of using or not a GPU.
|
|
|
-for example if we want to execute a benchmark on platform 0, device 1 and on GPU we need to pass something like this
|
|
|
+\smallskip
|
|
|
|
|
|
+But this solution didn't really satisfy me, since it was in some way \textbf{not coherent} with the ultimate goals of OpenCL. Writing an application in OpenCL should give you the possibility to have a portable application that is able to run on different devices with the minimum effort possible. With the branches approach, in order to switch from an executable for CPU to one for GPU, we needed to switch between the branches and recompile the executable.
|
|
|
+In addition I find this kind of approach really not elegant, since the setup and initialization of the OpenCL devices is all done at runtime, so there is no particular reason for having those parameters hard-coded in the source code. We can in principle pass all that information at runtime when executing the benchmark. So I tried to make another step and, taking inspiration from the couple of benchmarks that already followed this kind of approach, I implemented platform, device, and device type selection through parameters passed on the command line.\\
|
|
|
+As a general guideline, the convention is to use \textit{-p} followed by an index to specify the platform, \textit{-d} followed by an index to specify the device, and \textit{-g} followed by a boolean indicating whether or not to use a GPU.
|
|
|
+For example, if we want to execute a benchmark on platform 0, device 1 and on GPU we need to pass something like this
|
|
|
|
|
|
\begin{lstlisting}
|
|
|
-p 0 -d 1 -g 1
|
|
@@ -111,27 +132,28 @@ Instead if we want to execute on platform 1, device 0 and on CPU we pass somethi
|
|
|
-p 1 -d 0 -g 0
|
|
|
\end{lstlisting}
|
|
|
|
|
|
-All this made possible the creation of different run scripts for the different types of execution. Look in the benchmarks folder to the various run-something scripts and see how we invoke the benchmark with different parameters in case we want to execute something on the Mali GPU or on the CPU.\\
|
|
|
+All this made possible the creation of different run scripts for the different types of execution. Look in the benchmarks folder for the various run-something scripts and see how we invoke the benchmark with different parameters in case we want to execute something on the Mali GPU or on the CPU.\\
|
|
|
In some situations it was not possible to do this (parameters already taken, or parameter parsing made in a way not compatible with this restructuring), and I'll specify these cases in each subsection explaining in detail the modifications made to the single benchmark. Also consider executing the benchmark binary without parameters (or with \lstinline{-help}) to get a usage summary with all the necessary flags.\\
|
|
|
I'll now add a subsection for each benchmark trying to detail the modifications introduced with a brief explanation of them.
|
|
|
|
|
|
\subsubsection{Backprop}
|
|
|
-The benchmark didn't use correctly the \lstinline{clGetPlatformIDs} primitive, not retrieving at all the platforms present on the system. Modified this and added parameter parsing for OpenCL stuff. In this case we need to specify the platform, device, and device type in this order without the selectors (e.g. \lstinline{-p}) since the already present argument parsing expects the parameters in a certain order without flags.
|
|
|
+The benchmark didn't use the \lstinline{clGetPlatformIDs} primitive correctly, failing to retrieve the platforms present on the system. I fixed this and added parameter parsing for the OpenCL initialization.
|
|
|
|
|
|
\subsubsection{Bfs}
|
|
|
-The benchmark sources imported a \textbf{timer} utility for debug purposes that consisted of ad-hoc X86 assembly instructions to get the time in different execution points. This obviously prevented the compilation on an ARM device. Removed this dependency since we time the execution in a different manner, so we do not use this mechanism. Also in this case the parameters parsing is done as in the Backprop benchmarks.
|
|
|
+The benchmark sources imported a \textbf{timer} utility for debug purposes that consisted of ad-hoc x86 assembly instructions to get the time at different execution points. This obviously prevented compilation on an ARM device. I removed this dependency, since we time the execution in a different manner and do not use this mechanism. The parameter parsing follows the general guidelines.
|
|
|
+
|
|
|
|
|
|
\subsubsection{Cfd}
|
|
|
-This benchmark didn't compile for problems with the import of the \textit{rand()} function, so we fixed this. In addition the platform and device selection was not parametrized, so we also changed this. In this case we use the standard convention on the parameters as explained before.
|
|
|
+This benchmark didn't compile due to problems with the import of the \textit{rand()} function, so I fixed this. In addition the platform and device selection was not parametrized, so I changed that as well. In this case we use the standard convention on the parameters as explained before.
|
|
|
|
|
|
\subsubsection{Dwt2d}
|
|
|
-Implemented the device selection and fixed a bug with a \lstinline{char} variable not compatible with our architecture. Since the -d flag was already taken in this benchmark to specify the dimension we used -i for the device id specification.
|
|
|
+Implemented the device selection and fixed a bug with a \lstinline{char} variable not compatible with our architecture. Since the \textit{-d} flag was already taken in this benchmark to specify the dimension, we use \textit{-i} for the device id specification.
|
|
|
|
|
|
\subsubsection{Gaussian}
|
|
|
This benchmark already presented a prototype of platform and device selection. Added the possibility to select also the device type and changed some minor details in the use of the OpenCL primitives.
|
|
|
|
|
|
\subsubsection{Heartwall}
|
|
|
-At first we implemented the device selection as in the other case, and reduced the work group size in order to be compatible with the board. Unfortunately in the end the execution on CPU always returned the \lstinline{CL_OUT_OF_HOST_MEMORY} error, and even with the minimum work group size the execution on CPU was not possible. I decided to disable and remove this benchmark since having only the data relative to the execution on GPU made no sense for the final comparative.
|
|
|
+At first I implemented the device selection as in the other cases, and reduced the work group size in order to be compatible with the board. Unfortunately in the end the execution on CPU always returned the \lstinline{CL_OUT_OF_HOST_MEMORY} error, and even with the minimum work group size the execution on CPU was not possible. I decided to disable and remove this benchmark, since having only the data relative to the execution on GPU made no sense for the final comparison.
|
|
|
|
|
|
\subsubsection{Hotspot}
|
|
|
In this case there was an additional problem with a work group size not compatible with the CPU device. I reduced this work group size and implemented the device selection as described before.
|
|
@@ -143,11 +165,11 @@ In this benchmark implemented the device selection adding a parameter parsing ro
|
|
|
In this case the only problem was with the platform retrieval, as in Backprop. I changed this and implemented device selection as described before.
|
|
|
|
|
|
\subsubsection{LavaMD}
|
|
|
-In this benchmarks we had multiple problems. The first thing was the work group size too big to be handled on our device, so we reduced this.\\
|
|
|
+In this benchmark there were multiple problems. The first was a work group size too big to be handled by our device, so I reduced it.\\
|
|
|
The other more subtle problem was with the size of the parameter passed to the OpenCL kernel. Since the C++ \lstinline{long} type has different sizes on 32-bit and 64-bit architectures (respectively 32-bit and 64-bit), while the \lstinline{long} type in OpenCL code is always 64-bit wide, during the execution of the benchmark we received strange errors indicating some problems with the maximum size of the argument.\\
|
|
|
-At first I thought that simply the benchmark was not adequate to be run on this platform, but after receiving similar strange errors with other benchmark I decided to investigate more. After firing up \lstinline{gdb} and some tentatives to understand what caused the \lstinline{SEGFAULT} I decided to go for a step by step execution in parallel on two 32-bit and 64-bit devices. I finally found that the problem was with the \lstinline{clSetKernelArg()} function. In fact I noticed that the the parameter passed to the kernel were different in size, and the kernel always expected arguments multiple of 64-bit.\\
|
|
|
+At first I thought that the benchmark was simply not adequate to be run on this platform, but after receiving similar strange errors with other benchmarks I decided to investigate more. After firing up \lstinline{gdb} and some attempts to understand what caused the \lstinline{SEGFAULT}, I decided to go for a step-by-step execution in parallel on a 32-bit and a 64-bit device. I finally found that the problem was with the \lstinline{clSetKernelArg()} function: the parameters passed to the kernel differed in size, and the kernel always expected 64-bit wide arguments.\\
|
|
|
Once understood this I modified the C++ variables corresponding to the arguments from type \lstinline{long} to type \lstinline{long long}, fixing this bug.\\
|
|
|
-I find that this type of bug is really subtle, since for someone not knowing in detail the internals of OpenCL is really difficult to spot and solve. In some way this should be prevented with some coding convention, for example always using the \lstinline{long long} type for 64-bit wide variables. When writing an application that should be portable relying on behavior of the compiler for a specific architecture should not be acceptable.\\
|
|
|
+I find that this type of bug is really subtle, since for someone not knowing the internals of OpenCL in detail a situation like this is really difficult to spot and solve. In some way this should be prevented with a coding convention, for example always using the \lstinline{long long} type for 64-bit wide variables. When writing an application that should be portable, relying on the behavior of the compiler for a specific architecture should not be acceptable.\\
|
|
|
Also in this benchmark we implemented the device selection as described before.
|
|
|
|
|
|
\subsubsection{Leukocyte}
|
|
@@ -157,7 +179,7 @@ The first problem with this benchmark was an error with a Makefile target that p
|
|
|
In this benchmark the main change was the introduction of device selection as described before. Also fixed the use of the \lstinline{clGetPlatformIDs} primitive to get all the platforms available on the board.
|
|
|
|
|
|
\subsubsection{Nn}
|
|
|
-Also in this case was already present a prototype of platform and device selection as for Nn. Changed some details on the initialization of the OpenCl context to take into account the addition of device type specification.
|
|
|
+Also in this case a prototype of platform and device selection was already present. Changed some details in the initialization of the OpenCL context to take into account the addition of the device type specification.
|
|
|
|
|
|
\subsubsection{Nw}
|
|
|
In this benchmark too the main change was the implementation of device selection, and in doing so we also changed the parsing of the already required parameters.
|
|
@@ -169,7 +191,7 @@ In this benchmark I mainly implemented the device selection. Take care that in t
|
|
|
Implemented device selection following the guidelines defined before. In this case the task was a little difficult since there are many function calls between the parameter parsing and the actual OpenCL context initialization, so a lot of parameters have to be passed between the modules. The alternative was to use a global object to store the parameters, but I don't like this approach: in case of problems we cannot simply look at the function parameters to debug, but have to trace the state of a global object, which I find inelegant and prone to synchronization errors.
|
|
|
|
|
|
\subsubsection{Srad}
|
|
|
-In this benchmark we needed to reduced the work group size to be compatible with the ODROID, and implemented the device selection as showed before.
|
|
|
+In this benchmark we needed to reduce the work group size to be compatible with the ODROID, and to implement device selection as shown before.
|
|
|
In this case too the problem with the size of the kernel arguments manifested itself, so we changed the size to match the one the OpenCL kernel expects, as done in LavaMD.
|
|
|
|
|
|
\subsubsection{Streamcluster}
|
|
@@ -178,9 +200,9 @@ Also in this benchmark we had the same problem already showed for LavaMD and Sra
|
|
|
\subsubsection{Consideration valid for all the benchmarks}
|
|
|
Please take into account that the code base of the benchmarks has probably been modified by many different developers, with different styles and approaches to the OpenCL framework.\\
|
|
|
One problem that you can spot as soon as you look at a single commit is that there is no convention on the use of spaces or tabs (who would have guessed it?), so the code is often misaligned, presents trailing whitespace, and is really awful to look at with the editor set up the wrong way.\\
|
|
|
-To avoid cluttering the commits is a lot of blank space removals, substitutions of tabs with white-space I preferred to disable on my editor all mechanism that corrected this thing and leave the source code with misaligned lined but at least highlighting only the changes really made to the source.\\
|
|
|
-I then tried as much as possible all this things in a later commit that simply tries to fix all this things to obtain a source code not horrible.\\
|
|
|
-I apologize for this inconvenient and I ask you to not look at this problems withing the commits, but I preferred to keep them as little as possible to have a better chance to spot the real modifications made and to get lost in a commit with thousands of line added and removed to fix a tab.
|
|
|
+To avoid cluttering the commits with a lot of blank-space removals and tab-to-whitespace substitutions, I preferred to disable in my editor all the mechanisms that corrected these issues, leaving the source code with misaligned lines but ensuring that only the changes really made to the source are highlighted.\\
|
|
|
+I then tried to fix as many of these issues as possible in a later commit whose only purpose is to make the source code less horrible to look at.\\
|
|
|
+I apologize for this inconvenience and ask you not to mind these problems within the commits: I preferred to keep the commits as small as possible, to have a better chance of spotting the real modifications and to avoid getting lost in a commit with thousands of lines added and removed just to fix a tab.
|
|
|
|
|
|
\subsection{Running the benchmarks}
|
|
|
At this point we should have a working version of the benchmarks, so we can proceed to run them on our board.
|
|
@@ -193,9 +215,9 @@ As the names of the run scripts say:
|
|
|
\end{itemize}
|
|
|
We can also use the targets in the Makefile inside the benchmark directory to conveniently run all the benchmarks in sequence. We have:
|
|
|
\begin{itemize}
|
|
|
- \item \lstinline{OPENCL_BENCHMARK_CPU} to run all the benchmarks on the cpu
|
|
|
- \item \lstinline{OPENCL_BENCHMARK_GPU_PRIMARY} to run the benchmarks on the GPU device 1
|
|
|
- \item \lstinline{OPENCL_BENCHMARK_GPU_SECONDARY} to run the benchmarks on the GPU device 1
|
|
|
+ \item \lstinline{OPENCL_BENCHMARK_CPU} to run all the benchmarks on the CPU
|
|
|
+ \item \lstinline{OPENCL_BENCHMARK_GPU_PRIMARY} to run the benchmarks on the GPU device 1 (4 cores)
|
|
|
+ \item \lstinline{OPENCL_BENCHMARK_GPU_SECONDARY} to run the benchmarks on the GPU device 2 (2 cores)
|
|
|
\item \lstinline{OPENCL_BENCHMARK_ALL} to run the benchmarks on all three previous devices
|
|
|
\item \lstinline{OPENCL_BENCHMARK_GPU} to run the benchmarks on the GPU device 1 (kept for compatibility reasons)
|
|
|
\end{itemize}
|
|
@@ -207,4 +229,6 @@ The files are basically \textit{csv} files with three columns, and each record i
|
|
|
\item the name of the benchmark
|
|
|
\item the run time expressed in seconds
|
|
|
\item the energy consumption expressed in Watt-hours
|
|
|
-\end{itemize}
|
|
|
+\end{itemize}
|
|
|
+
|
|
|
+\pagebreak
|