\section{Summary Of The Work}
\subsection{Build of the runtime}
The first challenge to tackle was the retrieval and compilation of the OpenCL runtimes.
The runtime for the Mali GPU is already provided in the Hardkernel repository, so a simple \lstinline{sudo apt-get install mali-fbdev} does the trick.
The Pocl runtime, instead, must be built from scratch.
The first step is to retrieve the latest version of the runtime (currently version 0.14) from the \href{http://portablecl.org/downloads/pocl-0.14.tar.gz}{website}.
The next step is to decompress the archive with a simple \lstinline{tar xvfz pocl-0.14.tar.gz}.\\
Pocl takes advantage of \textbf{LLVM} to build itself, so we need to install a few dependencies from the package manager before being able to compile it. The \href{http://portablecl.org/docs/html/install.html}{dedicated page} of the official wiki lists all the packages needed for the build. Basically we need LLVM and a bunch of its development packages, CMake to generate the Makefiles, the standard compilation utilities (gcc, flex, bison), and some packages providing an Installable Client Driver (ICD) loader, so that the appropriate OpenCL implementation can be selected at runtime.\\
What we need to do on our system is basically:
\begin{lstlisting}
sudo apt-get update && sudo apt-get upgrade -y
sudo apt-get install -y vim build-essential flex bison libtool \
libncurses5* git-core htop cmake libhwloc-dev libclang-3.8-dev \
clang-3.8 llvm-3.8-dev zlib1g ocl-icd-libopencl1 clinfo \
libglew-dev time gnuplot ocl-icd-dev ocl-icd-opencl-dev \
qt4-qmake libqt4-dev libusb-1.0-0-dev
\end{lstlisting}
At this point we can proceed and build Pocl. To do that we enter the directory with the sources and create a folder called \textit{build} that will contain all the compiled files. We then take advantage of CMake to prepare the folder for the build. Usually a \lstinline{cmake ../} should suffice, but on the ODROID we have a little problem: since our CPU is composed of four Cortex-A7 and four Cortex-A15 cores, CMake cannot figure out by itself which target CPU to use for the build. Luckily the two types of cores share the same ISA, so we can explicitly tell CMake to target the Cortex-A15. All we have to do is launch \lstinline{cmake -DLLC_HOST_CPU=cortex-a15 ../}.\\
We are now ready for the build: just type \lstinline{make -j8} and we are done. We can then run some tests with \lstinline{ctest -j8} to be sure that everything went smoothly, and finally install the runtime in the system with \lstinline{sudo make install}. If everything went fine we will have a \lstinline{pocl.icd} file in \lstinline{/etc/OpenCL/vendors/}, and running \lstinline{clinfo} we should be able to see our brand new OpenCL runtime.\\
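In summary, starting from the directory with the Pocl sources, the whole build boils down to:
\begin{lstlisting}
mkdir build && cd build
cmake -DLLC_HOST_CPU=cortex-a15 ../
make -j8
ctest -j8
sudo make install
\end{lstlisting}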
In addition, in order to be able to use the runtime for the Mali GPU, we need to create a file named \lstinline{mali.icd} at the path \lstinline{/etc/OpenCL/vendors/} containing the single line:
\begin{lstlisting}
/usr/lib/arm-linux-gnueabihf/mali-egl/libOpenCL.so
\end{lstlisting}
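One quick way to create it, assuming \lstinline{sudo} is available, is:
\begin{lstlisting}
echo /usr/lib/arm-linux-gnueabihf/mali-egl/libOpenCL.so | \
sudo tee /etc/OpenCL/vendors/mali.icd
\end{lstlisting}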
This concludes the deployment of the OpenCL runtimes: at this point we should be able to see both the CPU Pocl platform, with one eight-core device, and the Mali GPU platform, with two devices of four and two cores respectively.
\subsection{Build of the power measurement utility}
At this point we should get and compile the utility for measuring the power consumption of the board. The utility used is a modified version of the official one provided by Hardkernel, which simply stores the detected consumption in a CSV file that we can later use for analysis and plotting of the results.
To build the utility we start from \href{https://bitbucket.org/zanella_michele/odroid_smartpower_bridge}{the repository} containing it, which has been kindly provided to me by \textit{Michele Zanella}.
In bash commands:
\begin{lstlisting}
git clone https://bitbucket.org/zanella_michele/odroid_smartpower_bridge
\end{lstlisting}
At this point we should switch to the \textbf{no\_qt} branch with a simple \lstinline{git checkout no_qt}. In this branch all the non-essential dependencies on the Qt libraries have been removed, in order to avoid cluttering the board with the full KDE framework just to store an integer representing the consumption.\\
Unfortunately the HIDAPI library provided with the sources of the utility has already been compiled for x86 and stored in the repository, causing an error when trying to link the utility. To avoid this we need to recompile the library, by entering the HIDAPI folder and giving the following commands:
\begin{lstlisting}
qmake
make clean
make
\end{lstlisting}
At this point enter the smartpower folder and compile the utility with:
\begin{lstlisting}
qmake
make clean
make
\end{lstlisting}
At this point we should have in the \lstinline{linux} folder a binary named \lstinline{SmartPower}: this self-contained binary is the utility we need. Please take care to also install the dependencies necessary for building it, in particular \lstinline{qt4-qmake libqt4-dev libusb-1.0-0-dev}.\\
In addition, in order to be able to communicate with the device through USB even when we are not root, we need to add a file named \lstinline{99-hiid.rules} in the path \lstinline{/etc/udev/rules.d/} containing the following:
\begin{lstlisting}
#HIDAPI/libusb
SUBSYSTEM=="usb", ATTRS{idVendor}=="04d8", ATTRS{idProduct}=="003f", MODE="0666"
\end{lstlisting}
At this point we should be able to take power measurements. To test it, simply launch the \textbf{SmartPower} binary passing as argument the file in which to store the results, let it run for a while, and then stop it with a \textbf{SIGUSR1} signal. In the file you should find the measured power consumption (double-check it against the display of the power measurement device). Also take into account a known bug in the software: sometimes the utility is not able to retrieve the consumption and the process becomes a zombie. Keep this in mind if you have trouble taking measurements, and before starting a new measurement make sure that no other SmartPower process is running.
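As a sketch, a measurement session from the shell could look like the following (the output file name is just an example):
\begin{lstlisting}
./SmartPower measurements.csv &
# ... run the benchmark under test ...
kill -USR1 $!
\end{lstlisting}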
\subsection{Build of the benchmarks}
For what concerns the benchmarks, we start from the vanilla \textbf{Rodinia 3.1} benchmark suite, taken directly from the University of Virginia website \cite{virginiawebsite} (you need to register on the site, and you'll then receive via mail a link to the real download page).
Unfortunately the benchmarks are \textbf{not ready to run}.\\
Some of them present bugs, and a lot of fixes and modifications are needed to successfully run them on the ODROID. Since the modifications are substantial (making the benchmarks usable has in fact taken most of the development time of the project), I opted for creating a repository initialized with the sources of the benchmarks, on which I then worked.\\
You can find \textbf{the repository} at \href{http://gogs.heisenberg.ovh/andreagus/rodinia-benchmark.git}{this url}. There are multiple branches in the repository, since I worked in parallel on the CPU and GPU benchmarks to make them work, and later tried to merge all the results into a single branch to use for the measurements.\\
In addition to bugs and other problems, the main difficulty was that the creators of the benchmarks \textbf{hard-coded} in the sources the OpenCL platform, device, and type of device to use. This meant that to run the benchmarks on different OpenCL devices you had to manually modify the source, recompile the benchmark, and run it. At the beginning of the development I also followed this approach and specialized a different branch for running the benchmarks on the CPU or on the GPU.\\
But this approach bugged me, since the main advantage and ultimate goal of having an OpenCL application should be the ability to run it on different devices and accelerators with the minimum effort possible. So in the end I heavily modified the benchmarks to take as parameters the platform, the device, and the device type to use. I then added different \textbf{run scripts} that contain the right parameters for each available device.\\
In this way we \textbf{compile} the benchmarks \textbf{once}, and then at runtime we select the platform and device to use; the selection simply amounts to using the \lstinline{run-cpu} or \lstinline{run-gpu} script. This gives us the most \textit{transparent} interface possible.
\subsection{Work on the Benchmark Suite}
In this section I'll explain the main problems I found in trying to run the Rodinia suite, and how I overcame them.\\
As said previously, I decided to create a new repository containing the benchmark sources, in order to keep track of the work and have a better organization of the whole code base.\\
The first two steps were to initialize the repository with the original sources of the suite and then to remove all the \textbf{CUDA} and \textbf{OpenMP} related folders and references. I opted for this strategy, rather than not inserting them in the repository at all, to facilitate keeping track of all the changes made to the code base, in case in the future, when a new official release of Rodinia comes out, we want to re-apply all the changes.\\
The next problem to solve was the fact that all the benchmarks (with the exception of a couple) had the OpenCL platform, device, and type of device to use hard-coded in the source code, meaning that they always expected to find a GPU available on the platform and device with index zero.\\
The first idea that came to my mind was to create two branches in the repository, one to use with the CPU and one to use with the GPU. I then proceeded to work in parallel on the two branches, modifying the source code of the benchmarks to use the right device. This approach worked, and in the end I was able to run the benchmarks on the two different types of device.\\
But this solution didn't really satisfy me, since it was in some way \textbf{not coherent} with OpenCL's ultimate goals. Writing an application in OpenCL should give you a portable application able to run on different devices with the minimum effort possible. With the branches approach, in order to switch from a CPU executable to a GPU one, we needed to switch between the branches and recompile the executable.
In addition, I find this kind of approach really inelegant, since the setup and initialization of the OpenCL devices is all done at runtime, so there is no particular reason to have those parameters hard-coded in the source code: we can in principle pass all that information at runtime when executing the benchmark. So I tried to take another step and, taking inspiration from the couple of benchmarks that already followed this kind of approach, I implemented platform, device, and device type selection through command-line parameters.\\
As a general guideline, the convention is to specify \textit{-p} followed by an index to select the platform, \textit{-d} followed by an index to select the device, and \textit{-g} followed by a boolean indicating whether or not to use a GPU.
For example, if we want to execute a benchmark on platform 0, device 1, on the GPU, we pass something like this:
\begin{lstlisting}
-p 0 -d 1 -g 1
\end{lstlisting}
Instead, if we want to execute on platform 1, device 0, on the CPU, we pass something like this:
\begin{lstlisting}
-p 1 -d 0 -g 0
\end{lstlisting}
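For reference, here is a minimal sketch of how these parameters can drive the device selection on the host side (the code is illustrative, with error checking omitted; it is not the exact code in the repository):
\begin{lstlisting}
#include <CL/cl.h>

/* Sketch: build a context from -p/-d/-g style parameters. */
cl_context create_context(cl_uint platform_idx, cl_uint device_idx,
                          int use_gpu)
{
    cl_platform_id platforms[8];
    cl_uint num_platforms;
    clGetPlatformIDs(8, platforms, &num_platforms);

    cl_device_type type = use_gpu ? CL_DEVICE_TYPE_GPU
                                  : CL_DEVICE_TYPE_CPU;
    cl_device_id devices[8];
    cl_uint num_devices;
    clGetDeviceIDs(platforms[platform_idx], type, 8,
                   devices, &num_devices);

    cl_int err;
    return clCreateContext(NULL, 1, &devices[device_idx],
                           NULL, NULL, &err);
}
\end{lstlisting}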
All this made possible the creation of different run scripts for the different types of execution. Look in the benchmark folders at the various run-something scripts to see how we invoke the benchmarks with different parameters depending on whether we want to execute on the Mali GPU or on the CPU; a sketch of such a script follows.\\
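As an illustration, a \lstinline{run-gpu} script might look like the following (the benchmark name is hypothetical, and the actual scripts in the repository may differ in the indices they pass):
\begin{lstlisting}
#!/bin/bash
# Select platform 0, device 1, GPU device type
# (matching the example above).
./benchmark -p 0 -d 1 -g 1 "$@"
\end{lstlisting}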
In some situations this was not possible (parameter letters already taken, or parameter parsing done in a way not compatible with this restructuring); I'll point out these cases in the subsections detailing the modifications made to each benchmark. Also consider executing the benchmark binary without parameters (or with \lstinline{-help}) to get a usage summary with all the necessary flags.\\
I'll now add a subsection for each benchmark, detailing the modifications introduced with a brief explanation of each.
\subsubsection{Backprop}
The benchmark didn't use the \lstinline{clGetPlatformIDs} primitive correctly, failing to retrieve the platforms present on the system at all. I fixed this and added parameter parsing for the OpenCL selection. In this case we need to specify the platform, device, and device type in this order and without the selectors (e.g. \lstinline{-p}), since the pre-existing argument parsing expects positional parameters without flags.
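For reference, the usual idiom for this primitive (a sketch of the standard two-call pattern, not necessarily the exact fix committed) is to first query the number of platforms and then retrieve them:
\begin{lstlisting}
cl_uint num_platforms;
clGetPlatformIDs(0, NULL, &num_platforms);        /* query the count */

cl_platform_id *platforms =
    malloc(num_platforms * sizeof(cl_platform_id));
clGetPlatformIDs(num_platforms, platforms, NULL); /* retrieve them */
\end{lstlisting}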
\subsubsection{Bfs}
The benchmark sources imported a \textbf{timer} utility for debugging purposes that consisted of ad-hoc x86 assembly instructions to read the time at different execution points, which obviously prevented compilation on an ARM device. I removed this dependency, since we time the execution in a different manner and do not use this mechanism. Also in this case the parameter parsing works as in the Backprop benchmark.
\subsubsection{Cfd}
This benchmark didn't compile because of problems with the import of the \textit{rand()} function, which we fixed. In addition, the platform and device selection was not parametrized, so we changed this as well. In this case we use the standard parameter convention explained before.
\subsubsection{Dwt2d}
Implemented the device selection and fixed a bug with a \lstinline{char} variable not compatible with our architecture.
\subsubsection{Gaussian}
This benchmark already presented a prototype of platform and device selection. I added the possibility to also select the device type and changed some minor details in the use of the OpenCL primitives.
\subsubsection{Heartwall}
At first we implemented the device selection as in the other cases, and reduced the work-group size in order to be compatible with the board. Unfortunately, in the end the execution on the CPU always returned the \lstinline{CL_OUT_OF_HOST_MEMORY} error, and even with the minimum work-group size execution on the CPU was not possible. I decided to disable and remove this benchmark, since having only the data relative to the GPU execution made no sense for the final comparison.
\subsubsection{Hotspot}
In this case there was an additional problem: a work-group size not compatible with the CPU device. I reduced the work-group size and implemented the device selection as described before.
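Since several benchmarks needed this same work-group size fix, here is a minimal sketch of how the device limit can be queried so that the local size can be clamped accordingly (illustrative code, not the exact patch applied):
\begin{lstlisting}
size_t max_wg_size;
clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                sizeof(max_wg_size), &max_wg_size, NULL);
if (local_size > max_wg_size)
    local_size = max_wg_size; /* clamp to what the device supports */
\end{lstlisting}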
\subsubsection{Hybridsort}
In this benchmark I implemented the device selection by adding a parameter parsing routine that works alongside the already present argument parsing routine, since integrating the two was too problematic.
\subsubsection{Kmeans}
In this case the only problem was with the platform retrieval, as in Backprop. I changed this and implemented the device selection as described before.
\subsubsection{LavaMD}
In this benchmark we had multiple problems. The first was a work-group size too big to be handled by our device, so we reduced it.\\
The other, more subtle, problem was with the size of the parameters passed to the OpenCL kernel. The C++ \lstinline{long} type has different sizes on 32-bit and 64-bit architectures (respectively 32 and 64 bits), while the \lstinline{long} type in OpenCL code is always 64 bits wide, so during the execution of the benchmark we received strange errors indicating problems with the maximum size of an argument.\\
At first I thought that the benchmark was simply not suited to run on this platform, but after receiving similar strange errors with other benchmarks I decided to investigate further. After firing up \lstinline{gdb} and some attempts to understand what caused the \lstinline{SEGFAULT}, I went for a step-by-step execution in parallel on a 32-bit and a 64-bit device. I finally found that the problem was with the \lstinline{clSetKernelArg()} function: the parameters passed to the kernel differed in size between the two architectures, while the kernel always expected 64-bit wide arguments.\\
Once I understood this, I changed the type of the C++ variables corresponding to the arguments from \lstinline{long} to \lstinline{long long}, fixing the bug.\\
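A minimal sketch of the mismatch and of the fix, assuming a kernel parameter declared as \lstinline{long} in OpenCL C (the OpenCL headers also provide \lstinline{cl_long} exactly for this purpose):
\begin{lstlisting}
/* Kernel side (OpenCL C): long is always 64 bits.
     __kernel void k(long n, ...);
   Host side, compiled for 32-bit ARM: */
long n_bad = 42;        /* 32 bits here: size mismatch, the call  */
clSetKernelArg(kernel, 0, sizeof(n_bad), &n_bad);
                        /* fails with CL_INVALID_ARG_SIZE         */

long long n_good = 42;  /* 64 bits on every architecture          */
clSetKernelArg(kernel, 0, sizeof(n_good), &n_good); /* correct    */
\end{lstlisting}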
I find this type of bug really subtle, since for someone who doesn't know the internals of OpenCL in detail it is really difficult to spot and solve. In some way it should be prevented with a coding convention, for example always using the \lstinline{long long} type for 64-bit wide variables. When writing an application that should be portable, relying on the behavior of the compiler for a specific architecture should not be acceptable.\\
Also in this benchmark we implemented the device selection as described before.
\subsubsection{Leukocyte}
The first problem with this benchmark was an error in a Makefile target that prevented compilation altogether: Make was erroneously trying to compile the header files too, resulting in an error when linking the final executable. Once this was fixed, the other problem encountered was the work-group size, which needed to be reduced. In addition, the initialization of the OpenCL context was done in a custom, not really functional way, so I rewrote it in a more standard way.
\subsubsection{Lud}
In this benchmark the main change was the introduction of the device selection as described before. I also fixed the use of the \lstinline{clGetPlatformIDs} primitive to get all the platforms available on the board.
\subsubsection{Nn}
Also in this case a prototype of platform and device selection was already present, as in Gaussian. I changed some details in the initialization of the OpenCL context to take into account the addition of the device type specification.
\subsubsection{Nw}
Also in this benchmark the main change was the implementation of the device selection, and in doing so we also changed the parsing of the already required parameters.
\subsubsection{Particlefilter}
In this benchmark I mainly implemented the device selection. Take care that in this case the argument order is important, for compatibility reasons with the argument parsing already in place.
\subsubsection{Pathfinder}
Implemented device selection following the guidelines defined before. In this case the task was a little more difficult, since there are many function calls between the parameter parsing and the actual OpenCL context initialization, so a lot of parameters have to be passed between the modules. The alternative was to use a global object storing the parameters, but I don't like this approach: in case of problems we can't simply look at the function parameters to spot them, but have to trace the state of a global object, which I find inelegant and prone to synchronization errors.
\subsubsection{Srad}
In this benchmark we needed to reduce the work-group size to be compatible with the ODROID, and implemented the device selection as shown before.
Also in this case the problem regarding the size of the kernel arguments manifested itself, so we changed the sizes to match the ones that the OpenCL kernel expects, as done in LavaMD.
\subsubsection{Streamcluster}
Also in this benchmark we had the same problem with the size of the kernel arguments already seen in LavaMD and Srad. I fixed this and implemented the device selection, also fixing another bug in the initialization of the OpenCL context through the \lstinline{clCreateContextFromType()} primitive.
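For reference, a minimal sketch of the standard use of this primitive (illustrative, not necessarily the exact code committed): the platform has to be passed through the context properties, otherwise the implementation is free to pick an arbitrary one.
\begin{lstlisting}
cl_context_properties props[] = {
    CL_CONTEXT_PLATFORM, (cl_context_properties)platform,
    0 /* list terminator */
};
cl_int err;
cl_context ctx = clCreateContextFromType(props, CL_DEVICE_TYPE_GPU,
                                         NULL, NULL, &err);
\end{lstlisting}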
\subsubsection{Considerations valid for all the benchmarks}
Please take into account that the code base of the benchmarks has probably been modified by a lot of different developers, with different styles and approaches to the OpenCL framework.\\
One problem you can spot as soon as you look at a single commit is that there is no convention on the use of spaces or tabs (who would have guessed it?), so the code is often misaligned, presents trailing white-space, and is really awful to look at with the editor set the wrong way.\\
To avoid cluttering the commits with a lot of blank-space removals and substitutions of tabs with spaces, I preferred to disable in my editor all the mechanisms that correct these things, leaving the source code with misaligned lines but at least highlighting only the changes really made to the source.\\
I then tried as much as possible to fix all these things in a later dedicated commit, to obtain a source code that is not horrible.\\
I apologize for this inconvenience and ask you to overlook these problems within the commits: I preferred to keep the commits as small as possible, to have a better chance of spotting the real modifications and not get lost in a commit with thousands of lines added and removed just to fix a tab.