Overview

This website provides supplementary material for the TACO 2017 paper "Iterative Schedule Optimization for Parallelization in the Polyhedron Model". In the paper, we propose an approach that iteratively optimizes the schedule of a program representable in the Polyhedron Model in order to exploit the computing capacity of given multi-core hardware. The schedule search space can be explored either at random or in a guided fashion using a genetic algorithm.
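
To make the two exploration modes concrete, the following self-contained sketch contrasts purely random exploration with genetic-algorithm-guided exploration on a toy search space. The schedule representation, the fitness function, and the genetic operators here are invented stand-ins; Polyite actually samples points from the polyhedral schedule search space and uses measured execution times as fitness.

```python
import random

DIM = 8  # toy stand-in for the dimension of a schedule coefficient vector

def sample_random_schedule():
    return [random.randint(-2, 2) for _ in range(DIM)]

def benchmark(schedule):
    # Toy fitness: lower is better. Polyite measures execution times instead.
    return sum((c - 1) ** 2 for c in schedule)

def mutate(s):
    s = list(s)
    s[random.randrange(DIM)] = random.randint(-2, 2)
    return s

def crossover(a, b):
    cut = random.randrange(1, DIM)
    return a[:cut] + b[cut:]

def optimize(generations=20, pop_size=30, guided=True):
    population = [sample_random_schedule() for _ in range(pop_size)]
    best = min(population, key=benchmark)
    for _ in range(generations):
        if guided:
            # Keep the fittest half and refill the population with mutated
            # or recombined variants of the survivors.
            survivors = sorted(population, key=benchmark)[:pop_size // 2]
            children = [mutate(random.choice(survivors))
                        if random.random() < 0.5
                        else crossover(*random.sample(survivors, 2))
                        for _ in range(pop_size - len(survivors))]
            population = survivors + children
        else:
            # Purely random exploration: resample independently.
            population = [sample_random_schedule() for _ in range(pop_size)]
        best = min(population + [best], key=benchmark)
    return best

print(benchmark(optimize(guided=True)), benchmark(optimize(guided=False)))
```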

Our approach is implemented in the tool Polyite. To extract SCoPs and generate transformed code from the schedules produced by Polyite, we used the polyhedral optimizer Polly, which is built on top of LLVM. Polyite requires a patched version of Polly.

Polyite and its dependencies are available from github.com/stganser.
Please find the latest results of the article's evaluation at the bottom of the page.

Benchmarks

To demonstrate the benefit of our approach, we conducted a series of experiments on several benchmarks of the Polybench 4.1 benchmark suite.

Table 1 lists the benchmarks that we examine in the paper together with some of their characteristics. The raw data can be found here.

Table 1: Characteristics of the benchmarks used. The data in columns 6-9 originates from the analysis of P1 of 1000 search space regions per benchmark. The data in column 10 is from an analysis of the schedule coefficient vectors produced during random exploration with the sparse setting in Experiment 3.
benchmark | # stmts | # deps | max. loop depth | # structure params | % unit vectors | avg. # lines | avg. # rays | avg. # vertices | % rational vertices
2mm | 4 | 6 | 3 | 7 | 99.84% | 9 | 19 | 1 | 1.00%
3mm | 6 | 10 | 3 | 9 | 99.90% | 10 | 31 | 1 | 0.90%
adi | 9 | 64 | 3 | 3 | 83.06% | 4 | 27 | 4 | 2.02%
bicg | 2 | 4 | 2 | 3 | 100.00% | 4 | 6 | 1 | 0.67%
correlation | 13 | 22 | 3 | 3 | 96.26% | 13 | 40 | 1 | 0.54%
covariance | 7 | 12 | 3 | 3 | 99.50% | 4 | 26 | 1 | 0.50%
doitgen | 3 | 8 | 4 | 5 | 99.02% | 6 | 1 | 1 | 0.16%
fdtd-2d | 4 | 24 | 3 | 4 | 92.08% | 5 | 32 | 3 | 10.16%
gemm | 2 | 2 | 3 | 5 | 100.00% | 8 | 7 | 1 | 1.78%
gemver | 4 | 6 | 2 | 2 | 99.72% | 3 | 15 | 1 | 1.17%
gesummv | 3 | 3 | 2 | 2 | 100.00% | 4 | 7 | 1 | 0.56%
heat-3d | 2 | 186 | 4 | 2 | 47.34% | 3 | 15 | 1 | 16.69%
jacobi-2d | 2 | 56 | 3 | 3 | 67.64% | 4 | 6 | 1 | 12.42%
mvt | 2 | 2 | 2 | 2 | 100.00% | 8 | 2 | 1 | 2.14%
seidel-2d | 1 | 59 | 3 | 3 | 85.44% | 4 | 3 | 1 | 9.18%
syr2k | 2 | 2 | 3 | 4 | 100.00% | 7 | 7 | 1 | 1.94%
syrk | 2 | 2 | 3 | 4 | 100.00% | 7 | 6 | 1 | 2.30%
trmm | 2 | 4 | 3 | 4 | 83.95% | 6 | 9 | 1 | 2.15%

Baseline Measurements

In Table 2, we show the results of our baseline measurements. The table includes benchmarks that we did not investigate further in the paper. The raw data can be found here.
Table 2: The results of our baseline measurements. The last two columns show the execution time of the kernel function, once compiled with Polly and once with O3 alone.
benchmark | kernel function | region entry point | Polly + OpenMP + Tiling: exec. time in seconds | O3: exec. time in seconds
2mm | kernel_2mm | entry.split | 2.674429 | 39.869263
3mm | kernel_3mm | entry.split | 3.842146 | 53.016816
adi | kernel_adi | entry.split | 22.684682 | 144.224997
atax | kernel_atax | for.cond7.preheader | 0.011577 | 0.009663
bicg | kernel_bicg | for.cond6.preheader | 0.012659 | 0.013809
cholesky | kernel_cholesky | for.cond5.preheader | 32.691635 | 13.772428
correlation | kernel_correlation | entry.split | 1.536253 | 51.319759
covariance | kernel_covariance | entry.split | 1.596007 | 51.344424
deriche | kernel_deriche | entry.split | 0.82307 | 0.857132
doitgen | kernel_doitgen | entry.split | 1.306154 | 6.051574
durbin | kernel_durbin | entry.split | 0.021078 | 0.02321
fdtd-2d | kernel_fdtd_2d | entry.split | 23.749577 | 26.664058
floyd-warshall | kernel_floyd_warshall | entry.split | compile error | 220.229997
gemm | kernel_gemm | entry.split | 2.126748 | 15.426021
gemver | kernel_gemver | entry.split | 0.027748 | 0.10245
gesummv | kernel_gesummv | entry.split | 0.010722 | 0.027538
gramschmidt | kernel_gramschmidt | for.body41 | 41.554061 | 41.625859
heat-3d | kernel_heat_3d | for.cond6.preheader | 14.060728 | 61.68343
jacobi-1d | kernel_jacobi_1d | entry.split | 0.01079 | 0.011698
jacobi-2d | kernel_jacobi_2d | entry.split | 28.933752 | 29.84025
ludcmp | kernel_ludcmp | entry.split | 69.222817 | 69.23428
lu | kernel_lu | entry.split | 13.18634 | 71.634541
mvt | kernel_mvt | entry.split | 0.017661 | 0.076668
seidel-2d | kernel_seidel_2d | entry.split | 251.340155 | 234.006023
symm | kernel_symm | entry.split | 27.099688 | 27.319278
syr2k | kernel_syr2k | entry.split | 4.211531 | 87.597581
syrk | kernel_syrk | entry.split | 1.352546 | 16.878636
trisolv | kernel_trisolv | entry.split | 0.027528 | 0.010382
trmm | kernel_trmm | entry.split | 9.171322 | 18.163072

Experiments

In the following, we list the experiments together with their results. The archives contain the raw data in JSON and CSV format, the log files of the experiments (they are necessary to generate some of the plots) and a JSCOP file for each of the schedules.

Experiment 1

Experiment 2

For each of the configurations evaluated in Experiment 2, we determined the share of schedules that could not be evaluated, for different reasons. We could not fit the complete data into the paper and therefore show it in Table 3. The raw data can be found here.

Table 3: Per benchmark and configuration, the percentage of schedules that could not be benchmarked successfully, broken down by reason.
benchmark | configuration | compile timeout | miscompiles | execution fails | wrong output | timeout | healthy
3mm | GA | 0.18% | 0.00% | 5.50% | 0.52% | 2.30% | 91.49%
3mm | random sparse | 0.67% | 0.00% | 17.12% | 1.30% | 7.42% | 73.48%
3mm | random dense | 78.79% | 2.52% | 9.48% | 1.24% | 0.36% | 7.61%
3mm | random mixed | 59.06% | 1.73% | 16.39% | 2.03% | 1.00% | 19.79%
3mm | LeTSeE random | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 100.00%
3mm | LeTSeE random + completion | 0.06% | 0.00% | 0.00% | 0.00% | 77.12% | 22.82%
adi | GA | 0.24% | 0.00% | 2.51% | 0.95% | 10.92% | 85.38%
adi | random sparse | 0.15% | 0.00% | 4.52% | 1.76% | 5.76% | 87.82%
adi | random dense | 2.48% | 0.00% | 23.58% | 18.00% | 4.52% | 51.42%
adi | LeTSeE random | 0.00% | 0.00% | 0.00% | 0.00% | 0.61% | 99.39%
adi | LeTSeE random + completion | 0.00% | 0.00% | 0.00% | 0.00% | 4.18% | 95.82%
correlation | GA | 0.09% | 0.18% | 5.15% | 1.39% | 0.18% | 93.00%
correlation | random sparse | 0.36% | 0.58% | 12.64% | 3.21% | 0.70% | 82.52%
correlation | random dense | 62.67% | 8.00% | 16.21% | 3.27% | 0.30% | 9.55%
correlation | LeTSeE random | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 100.00%
correlation | LeTSeE random + completion | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 100.00%

Experiments 3 and 4

The histogram from Figure 7 of the paper: histogram.pdf. In the paper, the gray bars are labeled "PollyOpenMPTiling" but should be labeled "isl". We fixed the label in the PDF provided here.

The raw data of the histogram: histogram_data.csv

In the following, we list the benchmarks together with the raw data produced by different configurations of iterative optimization and the following items:

  • A boxplot showing the performance distribution within the generations of the genetic algorithm.
  • The results from analyzing the generations of the genetic algorithm for semantically equivalent schedules. As stated in the paper, our automatic detection of equivalence classes is not able to fully canonicalize schedules. Therefore, it may be possible to merge some of the detected equivalence classes by manual inspection.
  • The results from determining the schedule space regions visited by random sparse and the genetic algorithm.

In Experiment 2, we ran the best schedule found by the genetic algorithm for each benchmark with one thread and with eight threads in order to determine the speedup that results from parallelization. The raw data is available: parallelization_speedup.csv
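
The speedup is the quotient of the two execution times. A minimal sketch with made-up numbers (the real values are in the CSV file above):

```python
# Hypothetical measurements for one benchmark, in seconds.
t_one_thread = 21.4
t_eight_threads = 3.1

speedup = t_one_thread / t_eight_threads  # parallelization speedup
print(f"speedup: {speedup:.2f}x")
```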

Experiment 5

The raw data: data.tar.gz

A plot that shows the median convergence speed of each of the evaluated configurations: plot.pdf

Experiment 6

The raw data: data.tar.gz

A plot that shows the median convergence speed of each of the evaluated configurations: plot.pdf

Stability of Measurements

The large number of program versions that we had to benchmark during our experiments forced us to repeat each execution time measurement only five times. This is hardly enough to reach statistical significance. We took several precautions to stabilize the measurements, such as disabling Intel Turbo Boost and hyper-threading.

To give you an idea of the stability of our measurements, we analyzed the data from the runs of random sparse in Experiment 3. Per schedule, we calculated the relative standard error of the respective five execution time measurements. Figure 1 shows the results using a box plot for each benchmark. One can see that the results are stable for longer-running benchmarks, but that the error is quite large for extremely short-running ones. You can download the raw data.
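
The relative standard error is the standard error of the mean expressed relative to the mean. A minimal sketch with invented measurements:

```python
import statistics

def relative_standard_error(times):
    # Standard error of the mean, divided by the mean.
    mean = statistics.mean(times)
    sem = statistics.stdev(times) / len(times) ** 0.5
    return sem / mean

# Five execution time measurements (seconds) of one schedule; invented values.
measurements = [2.71, 2.69, 2.74, 2.70, 2.72]
print(f"{relative_standard_error(measurements):.4%}")
```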

Box plots showing the standard errors of schedules' execution time measurements.
Figure 1: Box plots showing the standard errors of schedules' execution time measurements per benchmark. In parentheses is the mean execution time of each benchmark's schedules in seconds. The length of the whiskers is 1.5 times the interquartile range. The whiskers may have been truncated to remain within the data set's range. The dots are outliers beyond the range of the whiskers.

Remark

After the publication of the paper, we detected a fault in the schedule tree import of our adapted version of Polly 3.9. Due to this bug, Polly fails to apply tiling to imported schedules in some situations where the schedule permits tiling.

In order to estimate the impact on the reproducibility of the presented data and the conclusions drawn, we first repeated Experiment 3 for all benchmarks from Polybench 4.1 except lu, ludcmp and nussinov. We found that fixing the bug has little influence on the profitability of the best schedules found by our genetic algorithm and by random exploration with the sparse setting. It enables us to find significantly better schedules for heat-3d and, for the first time, profitable schedules for atax and trisolv. The bug fix has no significant influence on the adaptation of the approach by Pouchet et al. (LeTSeE). However, the adapted LeTSeE combined with schedule completion profits from the bug fix.

Further, we repeated Experiment 2. Again, the only configuration that shows significantly different behavior is the adapted LeTSeE combined with schedule completion. Across repeated runs, the effect of schedule completion on LeTSeE is positive: with schedule completion, schedules with higher profitability can be found.

Altogether, we are confident that the main conclusions drawn in the paper remain valid. We have updated the publicly available version of the adapted Polly 3.9. We did not repeat Experiments 5 and 6, as the outcome of Experiment 2 strongly suggests that their previous results remain valid.

In the following, we make all data from our reevaluation available.

Experiment 2

We provide updated data for Table 3, here: failure_info.csv

Experiment 3

The updated histogram: histogram.pdf

The updated histogram's raw data: histogram_data.csv

The following archive contains, per benchmark, a plot that illustrates the distribution of the performance yielded by the schedules in the genetic algorithm's populations: boxplots.tar.gz

We recomputed the p-values from Table 2 of the paper. Table 4 shows the new data. Contrary to what is stated in the paper, there is a significant difference between random LeTSeE and random LeTSeE + completion.

Table 4: p-values obtained from a pair-wise Mann-Whitney U Test quantifying the significance of differences between different configurations of schedule optimization.
| GA | isl | random LeTSeE | random LeTSeE + completion
isl | 3.0e-07 | - | - | -
random LeTSeE | 7.0e-07 | 0.0340 | - | -
random LeTSeE + completion | 1.9e-05 | 0.1923 | 0.0055 | -
random | 0.2109 | 2.5e-06 | 7.0e-07 | 9.1e-06
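
For reference, a minimal sketch of how such a pair-wise comparison can be computed with SciPy; the configuration names are real, but the sample values are invented placeholders for the measured data linked above:

```python
from itertools import combinations
from scipy.stats import mannwhitneyu

# Per configuration, one performance sample per benchmark (invented values).
results = {
    "GA":            [4.1, 3.9, 4.3, 4.0, 4.2],
    "isl":           [2.9, 3.0, 2.8, 3.1, 2.9],
    "random LeTSeE": [3.2, 3.4, 3.1, 3.3, 3.0],
}

# Compare each pair of configurations with a two-sided Mann-Whitney U test.
for (name_a, a), (name_b, b) in combinations(results.items(), 2):
    _, p = mannwhitneyu(a, b, alternative="two-sided")
    print(f"{name_a} vs. {name_b}: p = {p:.4g}")
```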

Remark 2

During random exploration, Polyite did not sample one schedule per search space region at a time, as stated in the paper, but one or two. Further, we choose with uniform probability whether to mutate one schedule or to cross two schedules. Then, depending on this choice, we select uniformly among the enabled mutation operators or the enabled crossover operators.
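
A minimal sketch of this two-stage uniform selection; the operators here are toy stand-ins acting on plain lists, not Polyite's actual schedule operators:

```python
import random

# Toy stand-ins for the enabled genetic operators.
mutation_ops = [
    lambda s: [c + random.choice((-1, 1)) for c in s],   # perturb a vector
    lambda s: s[::-1],                                   # reverse it
]
crossover_ops = [
    lambda a, b: [random.choice(p) for p in zip(a, b)],  # uniform crossover
    lambda a, b: a[:len(a) // 2] + b[len(b) // 2:],      # one-point crossover
]

def next_schedule(population):
    # First choose uniformly between mutation and crossover, then choose
    # uniformly among the enabled operators of the chosen kind.
    if random.random() < 0.5:
        return random.choice(mutation_ops)(random.choice(population))
    return random.choice(crossover_ops)(*random.sample(population, 2))

population = [[1, 0, 2, -1], [0, 1, 1, 0], [2, -1, 0, 1]]
print(next_schedule(population))
```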


Latest Results

In the previous evaluation, LLVM used a generic X86 CPU model. Of course, this was unintended. Therefore, we present new results for all experiments in the paper. The major difference from the previous data is that there is now a slight but significant advantage of the genetic algorithm over random exploration.

Benchmarks

To demonstrate the benefit of our approach, we conducted a series of experiments on several benchmarks of the Polybench 4.1 benchmark suite.

Table 5 lists the benchmarks that we examine in the paper together with some of their characteristics. The raw data can be found here.

Table 5: Characteristics of the benchmarks used. The data originates from the analysis of 1000 search space regions per benchmark.
benchmark | # stmts | # deps | max. loop depth | # structure params | % unit vectors | avg. # lines | avg. # rays | avg. # vertices | % rational vertices
2mm | 4 | 6 | 3 | 7 | 99.83% | 9 | 19 | 1 | 0.00%
3mm | 6 | 10 | 3 | 9 | 99.89% | 10 | 31 | 1 | 0.00%
adi | 9 | 64 | 3 | 3 | 83.79% | 4 | 27 | 4 | 1.10%
atax | 3 | 4 | 2 | 3 | 99.88% | 4 | 11 | 1 | 0.00%
bicg | 2 | 4 | 2 | 3 | 100.00% | 4 | 6 | 1 | 0.00%
cholesky | 4 | 8 | 3 | 2 | 88.76% | 3 | 27 | 4 | 0.00%
correlation | 13 | 19 | 3 | 3 | 97.18% | 17 | 37 | 1 | 0.00%
covariance | 7 | 12 | 3 | 3 | 99.49% | 4 | 26 | 1 | 0.00%
deriche | 11 | 25 | 2 | 3 | 91.54% | 4 | 70 | 26 | 0.00%
doitgen | 3 | 8 | 4 | 5 | 99.08% | 6 | 1 | 1 | 0.00%
durbin | 10 | 43 | 2 | 1 | 79.12% | 2 | 28 | 4 | 0.76%
fdtd-2d | 4 | 24 | 3 | 4 | 91.58% | 5 | 32 | 3 | 0.00%
floyd-warshall | 1 | 18 | 3 | 2 | 100.00% | 3 | 1 | 1 | 0.00%
gemm | 2 | 2 | 3 | 5 | 100.00% | 8 | 7 | 1 | 0.00%
gemver | 4 | 6 | 2 | 2 | 99.73% | 3 | 15 | 1 | 0.00%
gesummv | 3 | 3 | 2 | 2 | 100.00% | 4 | 7 | 1 | 0.00%
gramschmidt | 9 | 23 | 3 | 3 | 85.47% | 4 | 45 | 4 | 0.17%
heat-3d | 2 | 171 | 4 | 2 | 25.80% | 3 | 19 | 2 | 43.40%
jacobi-1d | 2 | 16 | 2 | 2 | 76.94% | 3 | 4 | 1 | 10.79%
jacobi-2d | 2 | 56 | 3 | 3 | 69.57% | 4 | 6 | 1 | 7.77%
lu | 3 | 8 | 3 | 2 | 90.25% | 3 | 23 | 3 | 1.82%
ludcmp | 20 | 89 | 3 | 2 | 77.57% | 3 | 66 | 10 | 0.68%
mvt | 2 | 2 | 2 | 2 | 100.00% | 8 | 2 | 1 | 0.00%
nussinov | 8 | 24 | 3 | 3 | 97.79% | 13 | 3 | 1 | 0.43%
seidel-2d | 1 | 59 | 3 | 3 | 85.56% | 4 | 3 | 1 | 0.47%
symm | 5 | 21 | 3 | 4 | 92.22% | 5 | 1 | 1 | 0.01%
syr2k | 2 | 2 | 3 | 4 | 100.00% | 7 | 7 | 1 | 0.00%
syrk | 2 | 2 | 3 | 4 | 100.00% | 7 | 6 | 1 | 0.00%
trisolv | 3 | 5 | 2 | 2 | 97.63% | 3 | 8 | 1 | 0.00%
trmm | 2 | 4 | 3 | 4 | 84.12% | 6 | 9 | 1 | 0.00%

Baseline Measurements

In Table 6, we show the results of our baseline measurements. The raw data can be found here.
Table 6: The results of our baseline measurements. The two columns show the execution time of the kernel function, once compiled with O3 alone and once with Polly.
benchmark | O3: exec. time in seconds | Polly + OpenMP + Tiling: exec. time in seconds
2mm | 37.751322 | 2.73354
3mm | 51.325618 | 3.831878
adi | 171.905285 | 22.590693
atax | 0.007051 | 0.007146
bicg | 0.013776 | 0.013137
cholesky | 13.474997 | 5.031873
correlation | 53.112906 | 1.592277
covariance | 53.09311 | 1.587869
deriche | 0.858374 | 0.7997
doitgen | 5.477251 | 1.34941
durbin | 0.015644 | 0.015816
fdtd-2d | 26.813327 | 29.10721
floyd-warshall | 196.720035 | 1328.528475
gemm | 8.947127 | 2.114805
gemver | 0.097731 | 0.025878
gesummv | 0.027541 | 0.008797
gramschmidt | 42.201424 | 5.212398
heat-3d | 37.307121 | 3.927879
jacobi-1d | 0.008496 | 0.009684
jacobi-2d | 32.81812 | 28.960936
lu | 71.531376 | 13.284997
ludcmp | 69.192892 | 71.500008
mvt | 0.077325 | 0.016641
nussinov | 80.979881 | 84.325014
seidel-2d | 233.939541 | 251.483785
symm | 27.41288 | 27.060895
syr2k | 91.404606 | 4.237682
syrk | 18.362794 | 1.33415
trisolv | 0.010217 | 0.028039
trmm | 18.856367 | 9.312357

Experiments

In the following, we list the experiments together with their results. The archives contain the raw data in JSON and CSV format, the log files of the experiments (they are necessary to generate some of the plots) and a JSCOP file for each of the schedules.

Experiment 1

The outcome of the experiment did not change.

Experiment 2

The updated histogram: histogram.pdf

The updated histogram's raw data: histogram_data.csv

We recomputed the p-values from Table 2 of the paper. Table 7 shows the new data. Contrary to what is stated in the paper, there is a significant difference between random LeTSeE and random LeTSeE + completion, and between the genetic algorithm and random exploration.

Table 7: p-values obtained from a pair-wise Mann-Whitney U Test quantifying the significance of differences between different configurations of schedule optimization.
| GA | isl | random LeTSeE | random LeTSeE + completion
isl | 1.862645E-08 | - | - | -
random LeTSeE | 1.573935E-06 | 0.05919633 | - | -
random LeTSeE + completion | 1.36594E-06 | 0.04077904 | 0.009216491 | -
random | 0.04739078 | 2.793968E-08 | 3.327926E-06 | 3.327926E-06

Experiment 5

The raw data: genetic_op_eval.tar.xz

A plot showing the convergence of the configurations tested: 3mm_genetic_operators_convergence.pdf

Experiment 6

The raw data: sim_annealing_eval.tar.xz

A plot showing the convergence of the configurations tested: 3mm-sim-annealing-convergence.pdf

Contact

The paper "Iterative Schedule Optimization for Parallelization in the Polyhedron Model" has been written by Stefan Ganser, Armin Größlinger, Norbert Siegmund, Sven Apel and Christian Lengauer. For questions regarding the paper, please contact the authors.

Administrative Contact

Stefan Ganser

Faculty of Computer Science and Mathematics
University of Passau

Innstraße 33
94032 Passau
Germany

stefan.ganser@uni-passau.de

