This web site provides supplementary material for the TACO 2017 paper "Iterative Schedule Optimization for Parallelization in the Polyhedron Model". In the paper, we propose an approach that iteratively optimizes the schedule of a program representable in the Polyhedron Model in order to exploit the computing capacity of a given multi-core hardware. The schedule search space can be explored either at random or in a guided fashion using a genetic algorithm.
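The overall search loop, whether random or guided, can be sketched as follows. This is a deliberately simplified illustration, not Polyite's implementation: schedules are modeled as plain tuples of integer coefficients, and the fitness function, which in Polyite is the measured execution time of the program generated from a schedule, is replaced by a synthetic stand-in.

```python
import random

def random_schedule(n_coeffs, rng):
    # A schedule is abstracted here as a tuple of small integer coefficients.
    return tuple(rng.randint(-2, 2) for _ in range(n_coeffs))

def fitness(schedule):
    # Stand-in for benchmarking: in Polyite, the fitness of a schedule is the
    # measured execution time of the code generated from it. Here we minimize
    # a synthetic function instead.
    return sum((c - 1) ** 2 for c in schedule)

def genetic_search(n_coeffs=6, pop_size=20, generations=30, seed=42):
    rng = random.Random(seed)
    population = [random_schedule(n_coeffs, rng) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness)
        survivors = population[:pop_size // 2]   # elitist selection
        children = []
        while len(survivors) + len(children) < pop_size:
            if rng.random() < 0.5:               # mutate one schedule ...
                parent = list(rng.choice(survivors))
                parent[rng.randrange(n_coeffs)] = rng.randint(-2, 2)
                children.append(tuple(parent))
            else:                                # ... or cross over two schedules
                a, b = rng.sample(survivors, 2)
                cut = rng.randrange(1, n_coeffs)
                children.append(a[:cut] + b[cut:])
        population = survivors + children
    return min(population, key=fitness)

best = genetic_search()
```

With purely random exploration, the loop degenerates to sampling `random_schedule` repeatedly and keeping the best result seen so far.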
Our approach is implemented in the tool Polyite. To extract SCoPs and generate transformed code from the schedules produced by Polyite, we used the polyhedral optimizer Polly, which is built on top of LLVM. Polyite requires a patched version of Polly.
To demonstrate the benefit of our approach, we conducted a series of experiments on several benchmarks of the Polybench 4.1 benchmark suite.
Table 1 lists the benchmarks that we examine in the paper together with some of their characteristics. The raw data can be found here.
| benchmark | # stmts | # deps | max. loop depth | # structure params | % unit vectors | avg. #lines | avg. # rays | avg. # vertices | % rational vertices |
|---|---|---|---|---|---|---|---|---|---|
| 2mm | 4 | 6 | 3 | 7 | 99.84% | 9 | 19 | 1 | 1.00% |
| 3mm | 6 | 10 | 3 | 9 | 99.90% | 10 | 31 | 1 | 0.90% |
| adi | 9 | 64 | 3 | 3 | 83.06% | 4 | 27 | 4 | 2.02% |
| bicg | 2 | 4 | 2 | 3 | 100.00% | 4 | 6 | 1 | 0.67% |
| correlation | 13 | 22 | 3 | 3 | 96.26% | 13 | 40 | 1 | 0.54% |
| covariance | 7 | 12 | 3 | 3 | 99.50% | 4 | 26 | 1 | 0.50% |
| doitgen | 3 | 8 | 4 | 5 | 99.02% | 6 | 1 | 1 | 0.16% |
| fdtd-2d | 4 | 24 | 3 | 4 | 92.08% | 5 | 32 | 3 | 10.16% |
| gemm | 2 | 2 | 3 | 5 | 100.00% | 8 | 7 | 1 | 1.78% |
| gemver | 4 | 6 | 2 | 2 | 99.72% | 3 | 15 | 1 | 1.17% |
| gesummv | 3 | 3 | 2 | 2 | 100.00% | 4 | 7 | 1 | 0.56% |
| heat-3d | 2 | 186 | 4 | 2 | 47.34% | 3 | 15 | 1 | 16.69% |
| jacobi-2d | 2 | 56 | 3 | 3 | 67.64% | 4 | 6 | 1 | 12.42% |
| mvt | 2 | 2 | 2 | 2 | 100.00% | 8 | 2 | 1 | 2.14% |
| seidel-2d | 1 | 59 | 3 | 3 | 85.44% | 4 | 3 | 1 | 9.18% |
| syr2k | 2 | 2 | 3 | 4 | 100.00% | 7 | 7 | 1 | 1.94% |
| syrk | 2 | 2 | 3 | 4 | 100.00% | 7 | 6 | 1 | 2.30% |
| trmm | 2 | 4 | 3 | 4 | 83.95% | 6 | 9 | 1 | 2.15% |
| benchmark | kernel function | region entry point | Polly + OpenMP + Tiling: exec. time in seconds | O3: exec. time in seconds |
|---|---|---|---|---|
| 2mm | kernel_2mm | entry.split | 2.674429 | 39.869263 |
| 3mm | kernel_3mm | entry.split | 3.842146 | 53.016816 |
| adi | kernel_adi | entry.split | 22.684682 | 144.224997 |
| atax | kernel_atax | for.cond7.preheader | 0.011577 | 0.009663 |
| bicg | kernel_bicg | for.cond6.preheader | 0.012659 | 0.013809 |
| cholesky | kernel_cholesky | for.cond5.preheader | 32.691635 | 13.772428 |
| correlation | kernel_correlation | entry.split | 1.536253 | 51.319759 |
| covariance | kernel_covariance | entry.split | 1.596007 | 51.344424 |
| deriche | kernel_deriche | entry.split | 0.82307 | 0.857132 |
| doitgen | kernel_doitgen | entry.split | 1.306154 | 6.051574 |
| durbin | kernel_durbin | entry.split | 0.021078 | 0.02321 |
| fdtd-2d | kernel_fdtd_2d | entry.split | 23.749577 | 26.664058 |
| floyd-warshall | kernel_floyd_warshall | entry.split | compile-error | 220.229997 |
| gemm | kernel_gemm | entry.split | 2.126748 | 15.426021 |
| gemver | kernel_gemver | entry.split | 0.027748 | 0.10245 |
| gesummv | kernel_gesummv | entry.split | 0.010722 | 0.027538 |
| gramschmidt | kernel_gramschmidt | for.body41 | 41.554061 | 41.625859 |
| heat-3d | kernel_heat_3d | for.cond6.preheader | 14.060728 | 61.68343 |
| jacobi-1d | kernel_jacobi_1d | entry.split | 0.01079 | 0.011698 |
| jacobi-2d | kernel_jacobi_2d | entry.split | 28.933752 | 29.84025 |
| ludcmp | kernel_ludcmp | entry.split | 69.222817 | 69.23428 |
| lu | kernel_lu | entry.split | 13.18634 | 71.634541 |
| mvt | kernel_mvt | entry.split | 0.017661 | 0.076668 |
| seidel-2d | kernel_seidel_2d | entry.split | 251.340155 | 234.006023 |
| symm | kernel_symm | entry.split | 27.099688 | 27.319278 |
| syr2k | kernel_syr2k | entry.split | 4.211531 | 87.597581 |
| syrk | kernel_syrk | entry.split | 1.352546 | 16.878636 |
| trisolv | kernel_trisolv | entry.split | 0.027528 | 0.010382 |
| trmm | kernel_trmm | entry.split | 9.171322 | 18.163072 |
In the following, we list the experiments together with their results. The archives contain the raw data in JSON and CSV format, the log files of the experiments (they are necessary to generate some of the plots) and a JSCOP file for each of the schedules.
For each of the configurations evaluated in Experiment 2, we determined the share of schedules that could not be evaluated, broken down by reason. We could not fit the complete data into the paper and therefore show it in Table 3. The raw data can be found here.
| benchmark | configuration | compile timeout | miscompiles | execution fails | wrong output | timeout | healthy |
|---|---|---|---|---|---|---|---|
| 3mm | GA | 0.18% | 0.00% | 5.50% | 0.52% | 2.30% | 91.49% |
| 3mm | random sparse | 0.67% | 0.00% | 17.12% | 1.30% | 7.42% | 73.48% |
| 3mm | random dense | 78.79% | 2.52% | 9.48% | 1.24% | 0.36% | 7.61% |
| 3mm | random mixed | 59.06% | 1.73% | 16.39% | 2.03% | 1.00% | 19.79% |
| 3mm | LeTSeE random | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 100.00% |
| 3mm | LeTSeE random + completion | 0.06% | 0.00% | 0.00% | 0.00% | 77.12% | 22.82% |
| adi | GA | 0.24% | 0.00% | 2.51% | 0.95% | 10.92% | 85.38% |
| adi | random sparse | 0.15% | 0.00% | 4.52% | 1.76% | 5.76% | 87.82% |
| adi | random dense | 2.48% | 0.00% | 23.58% | 18.00% | 4.52% | 51.42% |
| adi | LeTSeE random | 0.00% | 0.00% | 0.00% | 0.00% | 0.61% | 99.39% |
| adi | LeTSeE random + completion | 0.00% | 0.00% | 0.00% | 0.00% | 4.18% | 95.82% |
| correlation | GA | 0.09% | 0.18% | 5.15% | 1.39% | 0.18% | 93.00% |
| correlation | random sparse | 0.36% | 0.58% | 12.64% | 3.21% | 0.70% | 82.52% |
| correlation | random dense | 62.67% | 8.00% | 16.21% | 3.27% | 0.30% | 9.55% |
| correlation | LeTSeE random | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 100.00% |
| correlation | LeTSeE random + completion | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 100.00% |
The histogram from Figure 7 of the paper: histogram.pdf. In the paper, the gray bars are labeled "PollyOpenMPTiling", but they should be labeled "isl". We fixed the label in the PDF provided here.
The raw data of the histogram: histogram_data.csv
In the following, we list the benchmarks together with the raw data produced by the different configurations of iterative optimization, as well as the following items:
In Experiment 2, we ran the best schedule found by the genetic algorithm for each benchmark with one thread and with eight threads to determine the speedup that results from parallelization. The raw data is available: parallelization_speedup.csv
The raw data: data.tar.gz
A plot that shows the median convergence speed of each of the evaluated configurations: plot.pdf
The raw data: data.tar.gz
A plot that shows the median convergence speed of each of the evaluated configurations: plot.pdf
The large number of program versions that we had to benchmark during our experiments forced us to limit each execution time measurement to five repetitions. This is hardly enough to reach statistical significance. We therefore took several precautions to stabilize the measurements, such as disabling Intel Turbo Boost and hyper-threading.
To give you an idea of the stability of our measurements, we analyzed the data from the runs of random sparse in Experiment 3. Per schedule, we calculated the relative standard error of the respective five execution time measurements. Figure 1 shows the results as one box plot per benchmark. One can see that the results are stable for longer-running benchmarks, but the error is considerable for extremely short-running ones. You can download the raw data.
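For reference, the per-schedule statistic behind Figure 1 can be computed as follows; the five measurement values are invented for illustration.

```python
import statistics

def relative_standard_error(measurements):
    """Standard error of the mean, relative to the mean (as a fraction)."""
    n = len(measurements)
    mean = statistics.mean(measurements)
    sem = statistics.stdev(measurements) / n ** 0.5  # standard error of the mean
    return sem / mean

# Five hypothetical execution time measurements (in seconds) of one schedule:
times = [2.71, 2.69, 2.74, 2.70, 2.72]
rse = relative_standard_error(times)  # roughly 0.3% for this stable sample
```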
After the publication of the paper, we detected a fault in the schedule tree import of our adapted version of Polly 3.9. Due to this bug, Polly fails to apply tiling to imported schedules in some situations where the schedule permits tiling.
In order to estimate the impact on the reproducibility of the presented data and the conclusions drawn, we first repeated Experiment 3 for all benchmarks from Polybench 4.1 except lu, ludcmp and nussinov. We found that fixing the bug has little influence on the profitability of the best schedules found by our genetic algorithm and by random exploration with the sparse setting. It enables us to find significantly better schedules for heat-3d and, for the first time, we are able to find profitable schedules for atax and trisolv. The bug fix has no significant influence on the adaptation of the approach by Pouchet et al. (LeTSeE). However, the adapted LeTSeE combined with schedule completion profits from the bug fix.
Further, we repeated Experiment 2. Again, the only configuration that shows significantly different behavior is the adapted LeTSeE combined with schedule completion. Across repeated runs, the effect of schedule completion on LeTSeE is positive. With schedule completion, schedules with higher profitability can be found.
All in all, we are confident that the main conclusions drawn in the paper remain valid. We have updated the publicly available version of the adapted Polly 3.9. We did not reproduce Experiments 5 and 6, as the outcome of Experiment 2 strongly suggests that their previous results remain valid.
In the following, we make all data from our reevaluation available.
We provide updated data for Table 3, here: failure_info.csv
The updated histogram: histogram.pdf
The updated histogram's raw data: histogram_data.csv
The following archive contains, per benchmark, a plot that illustrates the distribution of the performance yielded by the schedules in the genetic algorithms populations: boxplots.tar.gz
We recomputed the p-values from Table 2 of the paper. Table 4 shows the new data. In contrast to what is stated in the paper, there is a significant difference between random LeTSeE and random LeTSeE + completion.
| | GA | isl | random LeTSeE | random LeTSeE + completion |
|---|---|---|---|---|
| isl | 3.0e-07 | - | - | - |
| random LeTSeE | 7.0e-07 | 0.0340 | - | - |
| random LeTSeE + completion | 1.9e-05 | 0.1923 | 0.0055 | - |
| random | 0.2109 | 2.5e-06 | 7.0e-07 | 9.1e-06 |
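For readers who want to recompute such tables: a two-sided Wilcoxon rank-sum (Mann-Whitney U) test, a common choice for comparing two samples of execution times, can be implemented with the Python standard library alone. This page does not restate which test produced Table 2, so treat the following sketch as illustrative rather than as the exact procedure used.

```python
import math
from itertools import chain

def rank_sum_p(xs, ys):
    """Two-sided Wilcoxon rank-sum (Mann-Whitney U) test with normal
    approximation; ties receive average ranks. Returns an approximate p-value."""
    pooled = sorted(chain(((v, 0) for v in xs), ((v, 1) for v in ys)))
    rank_sum_x = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1                                # [i, j) is one group of ties
        avg_rank = (i + 1 + j) / 2                # average of ranks i+1 .. j
        rank_sum_x += sum(avg_rank for k in range(i, j) if pooled[k][1] == 0)
        i = j
    n1, n2 = len(xs), len(ys)
    u = rank_sum_x - n1 * (n1 + 1) / 2            # Mann-Whitney U statistic
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mu) / sigma
    return math.erfc(abs(z) / math.sqrt(2))       # two-sided normal tail

# Clearly separated samples yield a small p-value:
p = rank_sum_p([1.0, 1.1, 1.2, 1.3, 1.4, 0.9, 1.05, 1.15],
               [2.0, 2.1, 2.2, 2.3, 2.4, 1.9, 2.05, 2.15])
```

The normal approximation is only reliable for sample sizes of roughly eight or more per group; for smaller samples an exact test is preferable.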
During random exploration, Polyite did not sample one schedule per search space region at a time, as stated in the paper, but one or two. Further, the choice of whether to mutate one schedule or to cross over two schedules is made with uniform probability. Then, depending on this choice, a mutation operator or a crossover operator is selected uniformly among the respective enabled operators.
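The selection logic just described can be sketched as follows; the two toy operators are hypothetical stand-ins for Polyite's actual operators, which work on polyhedral schedules rather than plain coefficient tuples.

```python
import random

def derive_offspring(population, mutation_ops, crossover_ops, rng):
    # First choice, with uniform probability: mutate one schedule or
    # cross over two schedules.
    if rng.random() < 0.5:
        # Second choice: uniformly select one of the enabled mutation operators.
        mutate = rng.choice(mutation_ops)
        return mutate(rng.choice(population), rng)
    crossover = rng.choice(crossover_ops)
    parent_a, parent_b = rng.sample(population, 2)
    return crossover(parent_a, parent_b, rng)

# Hypothetical toy operators on schedules modeled as coefficient tuples:
def tweak_one_coeff(s, rng):
    i = rng.randrange(len(s))
    return s[:i] + (s[i] + rng.choice((-1, 1)),) + s[i + 1:]

def one_point_crossover(a, b, rng):
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]

rng = random.Random(7)
population = [(0, 0, 0, 0), (1, 1, 1, 1), (2, 2, 2, 2)]
child = derive_offspring(population, [tweak_one_coeff], [one_point_crossover], rng)
```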
In the previous evaluation, LLVM used a generic x86 CPU model. Of course, this was unintended. Therefore, we present new results for all experiments in the paper. The major difference from the previous data is that there is a slight but significant advantage of the genetic algorithm over random exploration.
To demonstrate the benefit of our approach, we conducted a series of experiments on several benchmarks of the Polybench 4.1 benchmark suite.
Table 4 lists the benchmarks that we examine in the paper together with some of their characteristics. The raw data can be found here.
| benchmark | # stmts | # deps | max. loop depth | # structure params | % unit vectors | avg. #lines | avg. # rays | avg. # vertices | % rational vertices |
|---|---|---|---|---|---|---|---|---|---|
| 2mm | 4 | 6 | 3 | 7 | 99.83% | 9 | 19 | 1 | 0.00% |
| 3mm | 6 | 10 | 3 | 9 | 99.89% | 10 | 31 | 1 | 0.00% |
| adi | 9 | 64 | 3 | 3 | 83.79% | 4 | 27 | 4 | 1.10% |
| atax | 3 | 4 | 2 | 3 | 99.88% | 4 | 11 | 1 | 0.00% |
| bicg | 2 | 4 | 2 | 3 | 100.00% | 4 | 6 | 1 | 0.00% |
| cholesky | 4 | 8 | 3 | 2 | 88.76% | 3 | 27 | 4 | 0.00% |
| correlation | 13 | 19 | 3 | 3 | 97.18% | 17 | 37 | 1 | 0.00% |
| covariance | 7 | 12 | 3 | 3 | 99.49% | 4 | 26 | 1 | 0.00% |
| deriche | 11 | 25 | 2 | 3 | 91.54% | 4 | 70 | 26 | 0.00% |
| doitgen | 3 | 8 | 4 | 5 | 99.08% | 6 | 1 | 1 | 0.00% |
| durbin | 10 | 43 | 2 | 1 | 79.12% | 2 | 28 | 4 | 0.76% |
| fdtd-2d | 4 | 24 | 3 | 4 | 91.58% | 5 | 32 | 3 | 0.00% |
| floyd-warshall | 1 | 18 | 3 | 2 | 100.00% | 3 | 1 | 1 | 0.00% |
| gemm | 2 | 2 | 3 | 5 | 100.00% | 8 | 7 | 1 | 0.00% |
| gemver | 4 | 6 | 2 | 2 | 99.73% | 3 | 15 | 1 | 0.00% |
| gesummv | 3 | 3 | 2 | 2 | 100.00% | 4 | 7 | 1 | 0.00% |
| gramschmidt | 9 | 23 | 3 | 3 | 85.47% | 4 | 45 | 4 | 0.17% |
| heat-3d | 2 | 171 | 4 | 2 | 25.80% | 3 | 19 | 2 | 43.40% |
| jacobi-1d | 2 | 16 | 2 | 2 | 76.94% | 3 | 4 | 1 | 10.79% |
| jacobi-2d | 2 | 56 | 3 | 3 | 69.57% | 4 | 6 | 1 | 7.77% |
| lu | 3 | 8 | 3 | 2 | 90.25% | 3 | 23 | 3 | 1.82% |
| ludcmp | 20 | 89 | 3 | 2 | 77.57% | 3 | 66 | 10 | 0.68% |
| mvt | 2 | 2 | 2 | 2 | 100.00% | 8 | 2 | 1 | 0.00% |
| nussinov | 8 | 24 | 3 | 3 | 97.79% | 13 | 3 | 1 | 0.43% |
| seidel-2d | 1 | 59 | 3 | 3 | 85.56% | 4 | 3 | 1 | 0.47% |
| symm | 5 | 21 | 3 | 4 | 92.22% | 5 | 1 | 1 | 0.01% |
| syr2k | 2 | 2 | 3 | 4 | 100.00% | 7 | 7 | 1 | 0.00% |
| syrk | 2 | 2 | 3 | 4 | 100.00% | 7 | 6 | 1 | 0.00% |
| trisolv | 3 | 5 | 2 | 2 | 97.63% | 3 | 8 | 1 | 0.00% |
| trmm | 2 | 4 | 3 | 4 | 84.12% | 6 | 9 | 1 | 0.00% |
| benchmark | O3: exec. time in seconds | Polly + OpenMP + Tiling: exec. time in seconds |
|---|---|---|
| 2mm | 37.751322 | 2.73354 |
| 3mm | 51.325618 | 3.831878 |
| adi | 171.905285 | 22.590693 |
| atax | 0.007051 | 0.007146 |
| bicg | 0.013776 | 0.013137 |
| cholesky | 13.474997 | 5.031873 |
| correlation | 53.112906 | 1.592277 |
| covariance | 53.09311 | 1.587869 |
| deriche | 0.858374 | 0.7997 |
| doitgen | 5.477251 | 1.34941 |
| durbin | 0.015644 | 0.015816 |
| fdtd-2d | 26.813327 | 29.10721 |
| floyd-warshall | 196.720035 | 1328.528475 |
| gemm | 8.947127 | 2.114805 |
| gemver | 0.097731 | 0.025878 |
| gesummv | 0.027541 | 0.008797 |
| gramschmidt | 42.201424 | 5.212398 |
| heat-3d | 37.30712 | 13.927879 |
| jacobi-1d | 0.008496 | 0.009684 |
| jacobi-2d | 32.81812 | 28.960936 |
| lu | 71.531376 | 13.284997 |
| ludcmp | 69.192892 | 71.500008 |
| mvt | 0.077325 | 0.016641 |
| nussinov | 80.979881 | 84.325014 |
| seidel-2d | 233.939541 | 251.483785 |
| symm | 27.41288 | 27.060895 |
| syr2k | 91.404606 | 4.237682 |
| syrk | 18.362794 | 1.33415 |
| trisolv | 0.010217 | 0.028039 |
| trmm | 18.856367 | 9.312357 |
In the following, we list the experiments together with their results. The archives contain the raw data in JSON and CSV format, the log files of the experiments (they are necessary to generate some of the plots) and a JSCOP file for each of the schedules.
The updated histogram: histogram.pdf
The updated histogram's raw data: histogram_data.csv
We recomputed the p-values from Table 2 of the paper. Table 4 shows the new data. In contrast to what is stated in the paper, there is a significant difference between random LeTSeE and random LeTSeE + completion, and between the genetic algorithm and random exploration.
| | GA | isl | random LeTSeE | random LeTSeE + completion |
|---|---|---|---|---|
| isl | 1.862645E-08 | - | - | - |
| random LeTSeE | 1.573935E-06 | 0.05919633 | - | - |
| random LeTSeE + completion | 1.36594E-06 | 0.04077904 | 0.009216491 | - |
| random | 0.04739078 | 2.793968E-08 | 3.327926E-06 | 3.327926E-06 |
The raw data: genetic_op_eval.tar.xz
A plot showing the convergence of the configurations tested: 3mm_genetic_operators_convergence.pdf
The raw data: sim_annealing_eval.tar.xz
A plot showing the convergence of the configurations tested: 3mm-sim-annealing-convergence.pdf
The paper "Iterative Schedule Optimization for Parallelization in the Polyhedron Model" was written by Stefan Ganser, Armin Größlinger, Norbert Siegmund, Sven Apel and Christian Lengauer. For questions regarding the paper, please contact the authors.
Stefan Ganser
Faculty of Computer Science and Mathematics
University of Passau
Innstraße 33
94032 Passau
Germany