Overview

This website provides supplementary material for the TACO 2017 paper "Iterative Schedule Optimization for Parallelization in the Polyhedron Model". In the paper, we propose an approach that iteratively optimizes the schedule of a program representable in the Polyhedron Model in order to exploit the computing capacity of given multi-core hardware. The schedule search space can be explored either at random or in a guided fashion using a genetic algorithm.
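
To make the two exploration modes concrete, the following self-contained sketch contrasts purely random exploration with genetic-algorithm-guided exploration on a toy search space. The schedule representation, the fitness function, and the genetic operators here are invented stand-ins; Polyite actually samples points from the polyhedral schedule search space and uses measured execution times as fitness.

```python
import random

DIM = 8  # toy stand-in for the dimension of a schedule coefficient vector

def sample_random_schedule():
    return [random.randint(-2, 2) for _ in range(DIM)]

def benchmark(schedule):
    # Toy fitness: lower is better. Polyite measures execution times instead.
    return sum((c - 1) ** 2 for c in schedule)

def mutate(s):
    s = list(s)
    s[random.randrange(DIM)] = random.randint(-2, 2)
    return s

def crossover(a, b):
    cut = random.randrange(1, DIM)
    return a[:cut] + b[cut:]

def optimize(generations=20, pop_size=30, guided=True):
    population = [sample_random_schedule() for _ in range(pop_size)]
    best = min(population, key=benchmark)
    for _ in range(generations):
        if guided:
            # Keep the fittest half and refill the population with mutated
            # or recombined variants of the survivors.
            survivors = sorted(population, key=benchmark)[:pop_size // 2]
            children = [mutate(random.choice(survivors))
                        if random.random() < 0.5
                        else crossover(*random.sample(survivors, 2))
                        for _ in range(pop_size - len(survivors))]
            population = survivors + children
        else:
            # Purely random exploration: resample independently.
            population = [sample_random_schedule() for _ in range(pop_size)]
        best = min(population + [best], key=benchmark)
    return best

print(benchmark(optimize(guided=True)), benchmark(optimize(guided=False)))
```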

Our approach is implemented in the tool Polyite. To extract SCoPs and generate transformed code from the schedules produced by Polyite, we used the polyhedral optimizer Polly, which is built on top of LLVM. Polyite requires a patched version of Polly.

Polyite and its dependencies are available from github.com/stganser.
Please find the latest results of the article's evaluation at the bottom of the page.

Benchmarks

To demonstrate the benefit of our approach, we conducted a series of experiments on several benchmarks of the Polybench 4.1 benchmark suite.

Table 1 lists the benchmarks that we examine in the paper together with some of their characteristics. The raw data can be found here.

Table 1: Characteristics of the benchmarks used. The data in columns 6-9 originates from the analysis of P1 of 1000 search space regions per benchmark. The data in column 10 is from an analysis of the schedule coefficient vectors produced during random exploration with the sparse setting in Experiment 3.
benchmark | # stmts | # deps | max. loop depth | # structure params | % unit vectors | avg. # lines | avg. # rays | avg. # vertices | % rational vertices
2mm | 4 | 6 | 3 | 7 | 99.84% | 9 | 19 | 1 | 1.00%
3mm | 6 | 10 | 3 | 9 | 99.90% | 10 | 31 | 1 | 0.90%
adi | 9 | 64 | 3 | 3 | 83.06% | 4 | 27 | 4 | 2.02%
bicg | 2 | 4 | 2 | 3 | 100.00% | 4 | 6 | 1 | 0.67%
correlation | 13 | 22 | 3 | 3 | 96.26% | 13 | 40 | 1 | 0.54%
covariance | 7 | 12 | 3 | 3 | 99.50% | 4 | 26 | 1 | 0.50%
doitgen | 3 | 8 | 4 | 5 | 99.02% | 6 | 1 | 1 | 0.16%
fdtd-2d | 4 | 24 | 3 | 4 | 92.08% | 5 | 32 | 3 | 10.16%
gemm | 2 | 2 | 3 | 5 | 100.00% | 8 | 7 | 1 | 1.78%
gemver | 4 | 6 | 2 | 2 | 99.72% | 3 | 15 | 1 | 1.17%
gesummv | 3 | 3 | 2 | 2 | 100.00% | 4 | 7 | 1 | 0.56%
heat-3d | 2 | 186 | 4 | 2 | 47.34% | 3 | 15 | 1 | 16.69%
jacobi-2d | 2 | 56 | 3 | 3 | 67.64% | 4 | 6 | 1 | 12.42%
mvt | 2 | 2 | 2 | 2 | 100.00% | 8 | 2 | 1 | 2.14%
seidel-2d | 1 | 59 | 3 | 3 | 85.44% | 4 | 3 | 1 | 9.18%
syr2k | 2 | 2 | 3 | 4 | 100.00% | 7 | 7 | 1 | 1.94%
syrk | 2 | 2 | 3 | 4 | 100.00% | 7 | 6 | 1 | 2.30%
trmm | 2 | 4 | 3 | 4 | 83.95% | 6 | 9 | 1 | 2.15%

Baseline Measurements

In Table 2, we show the results of our baseline measurements. The table includes benchmarks that we did not investigate further in the paper. The raw data can be found here.
Table 2: The results of our baseline measurements. The last two columns show the execution time of the kernel function, once compiled with Polly and once with O3 alone.
benchmark | kernel function | region entry point | Polly + OpenMP + Tiling: exec. time in seconds | O3: exec. time in seconds
2mm | kernel_2mm | entry.split | 2.674429 | 39.869263
3mm | kernel_3mm | entry.split | 3.842146 | 53.016816
adi | kernel_adi | entry.split | 22.684682 | 144.224997
atax | kernel_atax | for.cond7.preheader | 0.011577 | 0.009663
bicg | kernel_bicg | for.cond6.preheader | 0.012659 | 0.013809
cholesky | kernel_cholesky | for.cond5.preheader | 32.691635 | 13.772428
correlation | kernel_correlation | entry.split | 1.536253 | 51.319759
covariance | kernel_covariance | entry.split | 1.596007 | 51.344424
deriche | kernel_deriche | entry.split | 0.82307 | 0.857132
doitgen | kernel_doitgen | entry.split | 1.306154 | 6.051574
durbin | kernel_durbin | entry.split | 0.021078 | 0.02321
fdtd-2d | kernel_fdtd_2d | entry.split | 23.749577 | 26.664058
floyd-warshall | kernel_floyd_warshall | entry.split | compile error | 220.229997
gemm | kernel_gemm | entry.split | 2.126748 | 15.426021
gemver | kernel_gemver | entry.split | 0.027748 | 0.10245
gesummv | kernel_gesummv | entry.split | 0.010722 | 0.027538
gramschmidt | kernel_gramschmidt | for.body41 | 41.554061 | 41.625859
heat-3d | kernel_heat_3d | for.cond6.preheader | 14.060728 | 61.68343
jacobi-1d | kernel_jacobi_1d | entry.split | 0.01079 | 0.011698
jacobi-2d | kernel_jacobi_2d | entry.split | 28.933752 | 29.84025
ludcmp | kernel_ludcmp | entry.split | 69.222817 | 69.23428
lu | kernel_lu | entry.split | 13.18634 | 71.634541
mvt | kernel_mvt | entry.split | 0.017661 | 0.076668
seidel-2d | kernel_seidel_2d | entry.split | 251.340155 | 234.006023
symm | kernel_symm | entry.split | 27.099688 | 27.319278
syr2k | kernel_syr2k | entry.split | 4.211531 | 87.597581
syrk | kernel_syrk | entry.split | 1.352546 | 16.878636
trisolv | kernel_trisolv | entry.split | 0.027528 | 0.010382
trmm | kernel_trmm | entry.split | 9.171322 | 18.163072

Experiments

In the following, we list the experiments together with their results. The archives contain the raw data in JSON and CSV format, the log files of the experiments (they are necessary to generate some of the plots) and a JSCOP file for each of the schedules.

Experiment 1

Experiment 2

For each of the configurations evaluated in Experiment 2, we determined the share of schedules that could not be evaluated, for different reasons. We could not fit the complete data into the paper and therefore show it in Table 3. The raw data can be found here.

Table 3: Per benchmark and configuration, the percentage of schedules that could not be benchmarked successfully, broken down by reason.
benchmark | configuration | compile timeout | miscompiles | execution fails | wrong output | timeout | healthy
3mm | GA | 0.18% | 0.00% | 5.50% | 0.52% | 2.30% | 91.49%
3mm | random sparse | 0.67% | 0.00% | 17.12% | 1.30% | 7.42% | 73.48%
3mm | random dense | 78.79% | 2.52% | 9.48% | 1.24% | 0.36% | 7.61%
3mm | random mixed | 59.06% | 1.73% | 16.39% | 2.03% | 1.00% | 19.79%
3mm | LeTSeE random | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 100.00%
3mm | LeTSeE random + completion | 0.06% | 0.00% | 0.00% | 0.00% | 77.12% | 22.82%
adi | GA | 0.24% | 0.00% | 2.51% | 0.95% | 10.92% | 85.38%
adi | random sparse | 0.15% | 0.00% | 4.52% | 1.76% | 5.76% | 87.82%
adi | random dense | 2.48% | 0.00% | 23.58% | 18.00% | 4.52% | 51.42%
adi | LeTSeE random | 0.00% | 0.00% | 0.00% | 0.00% | 0.61% | 99.39%
adi | LeTSeE random + completion | 0.00% | 0.00% | 0.00% | 0.00% | 4.18% | 95.82%
correlation | GA | 0.09% | 0.18% | 5.15% | 1.39% | 0.18% | 93.00%
correlation | random sparse | 0.36% | 0.58% | 12.64% | 3.21% | 0.70% | 82.52%
correlation | random dense | 62.67% | 8.00% | 16.21% | 3.27% | 0.30% | 9.55%
correlation | LeTSeE random | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 100.00%
correlation | LeTSeE random + completion | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 100.00%

Experiments 3 and 4

The histogram from Figure 7 of the paper: histogram.pdf. In the paper, the gray bars are labeled "PollyOpenMPTiling" but should be labeled "isl". We fixed the label in the PDF provided here.

The raw data of the histogram: histogram_data.csv

In the following, we list the benchmarks together with the raw data produced by different configurations of iterative optimization and the following items:

  • A boxplot showing the performance distribution within the generations of the genetic algorithm.
  • The results from analyzing the generations of the genetic algorithm for semantically equivalent schedules. As stated in the paper, our automatic detection of equivalence classes is not able to fully canonicalize schedules. Therefore, it may be possible to merge some of the detected equivalence classes by manual inspection.
  • The results from determining the schedule space regions visited by random sparse and the genetic algorithm.

In Experiment 2, we ran the best schedule found by the genetic algorithm for each benchmark with one thread and with eight threads in order to determine the speedup that results from parallelization. The raw data is available: parallelization_speedup.csv
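
The speedup is the quotient of the two execution times. A minimal sketch with made-up numbers (the real values are in the CSV file above):

```python
# Hypothetical measurements for one benchmark, in seconds.
t_one_thread = 21.4
t_eight_threads = 3.1

speedup = t_one_thread / t_eight_threads  # parallelization speedup
print(f"speedup: {speedup:.2f}x")
```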

Experiment 5

The raw data: data.tar.gz

A plot that shows the median convergence speed of each of the evaluated configurations: plot.pdf

Experiment 6

The raw data: data.tar.gz

A plot that shows the median convergence speed of each of the evaluated configurations: plot.pdf

Stability of Measurements

The large number of program versions that we had to benchmark during our experiments forced us to repeat each execution time measurement only five times. This is hardly enough to reach statistical significance. We took several precautions to stabilize the measurements, such as disabling Intel Turbo Boost and hyper-threading.

To give you an idea of the stability of our measurements, we analyzed the data from the runs of random sparse in Experiment 3. Per schedule, we calculated the relative standard error of the respective five execution time measurements. Figure 1 shows the results using a box plot for each benchmark. One can see that the results are stable for longer-running benchmarks, but that the error is quite large for extremely short-running ones. You can download the raw data.
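
The relative standard error is the standard error of the mean expressed relative to the mean. A minimal sketch with invented measurements:

```python
import statistics

def relative_standard_error(times):
    # Standard error of the mean, divided by the mean.
    mean = statistics.mean(times)
    sem = statistics.stdev(times) / len(times) ** 0.5
    return sem / mean

# Five execution time measurements (seconds) of one schedule; invented values.
measurements = [2.71, 2.69, 2.74, 2.70, 2.72]
print(f"{relative_standard_error(measurements):.4%}")
```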

Box plots showing the standard errors of schedules' execution time measurements.
Figure 1: Box plots showing the standard errors of schedules' execution time measurements per benchmark. In parentheses is the mean execution time of each benchmark's schedules in seconds. The length of the whiskers is 1.5 times the interquartile range. The whiskers may have been truncated to remain within the data set's range. The dots are outliers beyond the range of the whiskers.

Remark

After the publication of the paper, we detected a fault in the schedule tree import of our adapted version of Polly 3.9. Due to this bug, Polly fails to apply tiling to imported schedules in some situations where the schedule permits tiling.

In order to estimate the impact on the reproducibility of the presented data and the conclusions drawn, we first repeated Experiment 3 for all benchmarks from Polybench 4.1 except lu, ludcmp and nussinov. We found that fixing the bug has little influence on the profitability of the best schedules found by our genetic algorithm and by random exploration with the sparse setting. It enables us to find significantly better schedules for heat-3d and, for the first time, profitable schedules for atax and trisolv. The bug fix has no significant influence on the adaptation of the approach by Pouchet et al. (LeTSeE). However, the adapted LeTSeE combined with schedule completion profits from the bug fix.

Further, we repeated Experiment 2. Again, the only configuration that shows significantly different behavior is the adapted LeTSeE combined with schedule completion. Across repeated runs, the effect of schedule completion on LeTSeE is positive: with schedule completion, schedules with higher profitability can be found.

Altogether, we are confident that the main conclusions drawn in the paper remain valid. We have updated the publicly available version of the adapted Polly 3.9. We did not repeat Experiments 5 and 6, as the outcome of Experiment 2 strongly suggests that their previous results remain valid.

In the following, we make all data from our reevaluation available.

Experiment 2

We provide updated data for Table 3, here: failure_info.csv

Experiment 3

The updated histogram: histogram.pdf

The updated histogram's raw data: histogram_data.csv

The following archive contains, per benchmark, a plot that illustrates the distribution of the performance yielded by the schedules in the genetic algorithm's populations: boxplots.tar.gz

We recomputed the p-values from Table 2 of the paper. Table 4 shows the new data. Contrary to what is stated in the paper, there is a significant difference between random LeTSeE and random LeTSeE + completion.

Table 4: p-values obtained from a pair-wise Mann-Whitney U Test quantifying the significance of differences between different configurations of schedule optimization.
| GA | isl | random LeTSeE | random LeTSeE + completion
isl | 3.0e-07 | - | - | -
random LeTSeE | 7.0e-07 | 0.0340 | - | -
random LeTSeE + completion | 1.9e-05 | 0.1923 | 0.0055 | -
random | 0.2109 | 2.5e-06 | 7.0e-07 | 9.1e-06
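
For reference, a minimal sketch of how such a pair-wise comparison can be computed with SciPy; the configuration names are real, but the sample values are invented placeholders for the measured data linked above:

```python
from itertools import combinations
from scipy.stats import mannwhitneyu

# Per configuration, one performance sample per benchmark (invented values).
results = {
    "GA":            [4.1, 3.9, 4.3, 4.0, 4.2],
    "isl":           [2.9, 3.0, 2.8, 3.1, 2.9],
    "random LeTSeE": [3.2, 3.4, 3.1, 3.3, 3.0],
}

# Compare each pair of configurations with a two-sided Mann-Whitney U test.
for (name_a, a), (name_b, b) in combinations(results.items(), 2):
    _, p = mannwhitneyu(a, b, alternative="two-sided")
    print(f"{name_a} vs. {name_b}: p = {p:.4g}")
```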

Remark 2

During random exploration, Polyite did not sample one schedule per search space region at a time, as stated in the paper, but one or two. Further, we choose with uniform probability whether to mutate one schedule or to cross two schedules. Then, depending on this choice, we select uniformly among the enabled mutation operators or the enabled crossover operators.
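
A minimal sketch of this two-stage uniform selection; the operators here are toy stand-ins acting on plain lists, not Polyite's actual schedule operators:

```python
import random

# Toy stand-ins for the enabled genetic operators.
mutation_ops = [
    lambda s: [c + random.choice((-1, 1)) for c in s],   # perturb a vector
    lambda s: s[::-1],                                   # reverse it
]
crossover_ops = [
    lambda a, b: [random.choice(p) for p in zip(a, b)],  # uniform crossover
    lambda a, b: a[:len(a) // 2] + b[len(b) // 2:],      # one-point crossover
]

def next_schedule(population):
    # First choose uniformly between mutation and crossover, then choose
    # uniformly among the enabled operators of the chosen kind.
    if random.random() < 0.5:
        return random.choice(mutation_ops)(random.choice(population))
    return random.choice(crossover_ops)(*random.sample(population, 2))

population = [[1, 0, 2, -1], [0, 1, 1, 0], [2, -1, 0, 1]]
print(next_schedule(population))
```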


Latest Results

In the previous evaluation, LLVM used a generic X86 CPU model. Of course, this was unintended. Therefore, we present new results for all experiments in the paper. The major difference from the previous data is that there is now a slight but significant advantage of the genetic algorithm over random exploration.

Benchmarks

To demonstrate the benefit of our approach, we conducted a series of experiments on several benchmarks of the Polybench 4.1 benchmark suite.

Table 5 lists the benchmarks that we examine in the paper together with some of their characteristics. The raw data can be found here.

Table 5: Characteristics of the benchmarks used. The data originates from the analysis of 1000 search space regions per benchmark.
benchmark | # stmts | # deps | max. loop depth | # structure params | % unit vectors | avg. # lines | avg. # rays | avg. # vertices | % rational vertices
2mm | 4 | 6 | 3 | 7 | 99.83% | 9 | 19 | 1 | 0.00%
3mm | 6 | 10 | 3 | 9 | 99.89% | 10 | 31 | 1 | 0.00%
adi | 9 | 64 | 3 | 3 | 83.79% | 4 | 27 | 4 | 1.10%
atax | 3 | 4 | 2 | 3 | 99.88% | 4 | 11 | 1 | 0.00%
bicg | 2 | 4 | 2 | 3 | 100.00% | 4 | 6 | 1 | 0.00%
cholesky | 4 | 8 | 3 | 2 | 88.76% | 3 | 27 | 4 | 0.00%
correlation | 13 | 19 | 3 | 3 | 97.18% | 17 | 37 | 1 | 0.00%
covariance | 7 | 12 | 3 | 3 | 99.49% | 4 | 26 | 1 | 0.00%
deriche | 11 | 25 | 2 | 3 | 91.54% | 4 | 70 | 26 | 0.00%
doitgen | 3 | 8 | 4 | 5 | 99.08% | 6 | 1 | 1 | 0.00%
durbin | 10 | 43 | 2 | 1 | 79.12% | 2 | 28 | 4 | 0.76%
fdtd-2d | 4 | 24 | 3 | 4 | 91.58% | 5 | 32 | 3 | 0.00%
floyd-warshall | 1 | 18 | 3 | 2 | 100.00% | 3 | 1 | 1 | 0.00%
gemm | 2 | 2 | 3 | 5 | 100.00% | 8 | 7 | 1 | 0.00%
gemver | 4 | 6 | 2 | 2 | 99.73% | 3 | 15 | 1 | 0.00%
gesummv | 3 | 3 | 2 | 2 | 100.00% | 4 | 7 | 1 | 0.00%
gramschmidt | 9 | 23 | 3 | 3 | 85.47% | 4 | 45 | 4 | 0.17%
heat-3d | 2 | 171 | 4 | 2 | 25.80% | 3 | 19 | 2 | 43.40%
jacobi-1d | 2 | 16 | 2 | 2 | 76.94% | 3 | 4 | 1 | 10.79%
jacobi-2d | 2 | 56 | 3 | 3 | 69.57% | 4 | 6 | 1 | 7.77%
lu | 3 | 8 | 3 | 2 | 90.25% | 3 | 23 | 3 | 1.82%
ludcmp | 20 | 89 | 3 | 2 | 77.57% | 3 | 66 | 10 | 0.68%
mvt | 2 | 2 | 2 | 2 | 100.00% | 8 | 2 | 1 | 0.00%
nussinov | 8 | 24 | 3 | 3 | 97.79% | 13 | 3 | 1 | 0.43%
seidel-2d | 1 | 59 | 3 | 3 | 85.56% | 4 | 3 | 1 | 0.47%
symm | 5 | 21 | 3 | 4 | 92.22% | 5 | 1 | 1 | 0.01%
syr2k | 2 | 2 | 3 | 4 | 100.00% | 7 | 7 | 1 | 0.00%
syrk | 2 | 2 | 3 | 4 | 100.00% | 7 | 6 | 1 | 0.00%
trisolv | 3 | 5 | 2 | 2 | 97.63% | 3 | 8 | 1 | 0.00%
trmm | 2 | 4 | 3 | 4 | 84.12% | 6 | 9 | 1 | 0.00%

Baseline Measurements

In Table 6, we show the results of our baseline measurements. The raw data can be found here.
Table 6: The results of our baseline measurements. The two columns show the execution time of the kernel function, once compiled with O3 alone and once with Polly.
benchmark | O3: exec. time in seconds | Polly + OpenMP + Tiling: exec. time in seconds
2mm | 37.751322 | 2.73354
3mm | 51.325618 | 3.831878
adi | 171.905285 | 22.590693
atax | 0.007051 | 0.007146
bicg | 0.013776 | 0.013137
cholesky | 13.474997 | 5.031873
correlation | 53.112906 | 1.592277
covariance | 53.09311 | 1.587869
deriche | 0.858374 | 0.7997
doitgen | 5.477251 | 1.34941
durbin | 0.015644 | 0.015816
fdtd-2d | 26.813327 | 29.10721
floyd-warshall | 196.720035 | 1328.528475
gemm | 8.947127 | 2.114805
gemver | 0.097731 | 0.025878
gesummv | 0.027541 | 0.008797
gramschmidt | 42.201424 | 5.212398
heat-3d | 37.307121 | 3.927879
jacobi-1d | 0.008496 | 0.009684
jacobi-2d | 32.81812 | 28.960936
lu | 71.531376 | 13.284997
ludcmp | 69.192892 | 71.500008
mvt | 0.077325 | 0.016641
nussinov | 80.979881 | 84.325014
seidel-2d | 233.939541 | 251.483785
symm | 27.41288 | 27.060895
syr2k | 91.404606 | 4.237682
syrk | 18.362794 | 1.33415
trisolv | 0.010217 | 0.028039
trmm | 18.856367 | 9.312357

Experiments

In the following, we list the experiments together with their results. The archives contain the raw data in JSON and CSV format, the log files of the experiments (they are necessary to generate some of the plots) and a JSCOP file for each of the schedules.

Experiment 1

The outcome of the experiment did not change.

Experiment 2

The updated histogram: histogram.pdf

The updated histogram's raw data: histogram_data.csv

We recomputed the p-values from Table 2 of the paper. Table 7 shows the new data. Contrary to what is stated in the paper, there is a significant difference between random LeTSeE and random LeTSeE + completion, and between the genetic algorithm and random exploration.

Table 7: p-values obtained from a pair-wise Mann-Whitney U Test quantifying the significance of differences between different configurations of schedule optimization.
| GA | isl | random LeTSeE | random LeTSeE + completion
isl | 1.862645E-08 | - | - | -
random LeTSeE | 1.573935E-06 | 0.05919633 | - | -
random LeTSeE + completion | 1.36594E-06 | 0.04077904 | 0.009216491 | -
random | 0.04739078 | 2.793968E-08 | 3.327926E-06 | 3.327926E-06

Experiment 5

The raw data: genetic_op_eval.tar.xz

A plot showing the convergence of the configurations tested: 3mm_genetic_operators_convergence.pdf

Experiment 6

The raw data: sim_annealing_eval.tar.xz

A plot showing the convergence of the configurations tested: 3mm-sim-annealing-convergence.pdf

Contact

The paper "Iterative Schedule Optimization for Parallelization in the Polyhedron Model" has been written by Stefan Ganser, Armin Größlinger, Norbert Siegmund, Sven Apel and Christian Lengauer. For questions regarding the paper, please contact the authors.

Administrative Contact

Stefan Ganser

Faculty of Computer Science and Mathematics
University of Passau

Innstraße 33
94032 Passau
Germany

stefan.ganser@uni-passau.de

