Experience in Optimizing rho and H

Author: Luo Haiwen

Writing Slave Functions in Fortran

The biggest difference between writing slave functions in Fortran and in C is the way arrays are passed (as shown by the passing of the matrix array in the code below). Fortran slave functions use Cray pointers, and the -fcray-pointer flag is required during compilation.

!Master core code
program main
	type :: param_t
		integer*8 :: matrix_loc   !main-memory address of matrix
		integer m,n
		real*8 :: k
	end type param_t
	type(param_t) param
	integer,external :: slave_func
	real*8 matrix(100)
	integer :: m,n
	real*8 :: k
	call athread_init()
	param%matrix_loc=loc(matrix)  !pass the array by its address
	param%m=m
	param%n=n
	param%k=k
	call athread_spawn(slave_func, param)
	call athread_join()
end program main
!Slave core code
subroutine slave_func(param)
	type :: param_t
		integer*8 :: matrix_loc
		integer m,n
		real*8 :: k
	end type param_t
	type(param_t) :: param,local_param
	real*8 gl_matrix(100)
	pointer(matrix_ptr,gl_matrix)           !Cray pointer; requires -fcray-pointer
	integer i
	call crts_dma_get(local_param,param,24) !copy the 24-byte parameter struct into the slave core
	matrix_ptr=local_param%matrix_loc       !point gl_matrix at the main-memory array
	do i=1,100,1
		print *,gl_matrix(i)
	end do
end subroutine slave_func

Optimization of Small Kernel rho

Calculation Process of rho (Described in C Language)
for(int i_my_batch=1;i_my_batch<n_my_batches_work;i_my_batch++)
{
    tab_atom_centered_coords();
    prune_density_matrix_sparse_polar_reduce_memory();
    collect_batch_centers_p2();
    for(int i_index=0;i_index<batches_work_size(i_my_batch);i_index++)
    {
        prune_radial_basis_p2();
        tab_local_geometry_p2();
        tab_trigonom_p0();
        tab_wave_ylm_p0();
        evaluate_radial_functions_p0();
        evaluate_waves_p2();
    }
    dgemm();
    matrix_ddot();
}
Hotspot Analysis
| | tab_atom | prune_density | i_index_loop | dgemm | matrix_ddot | collect_batch |
| --- | --- | --- | --- | --- | --- | --- |
| Proportion | 31% | 25% | 31% | 12% | 1% | 0.2% |
| Clock Cycles | 2.1 billion | 1.7 billion | 2.1 billion | 0.8 billion | 70 million | 17 million |
Optimization of tab_atom_centered_coords
for(int i=0;i<n_centers_integrals;i++)
{
    dir_tab[0][i]=coord_current[0]-coords_center[0][centers_basis_integrals[i]];
    dir_tab[1][i]=coord_current[1]-coords_center[1][centers_basis_integrals[i]];
    dir_tab[2][i]=coord_current[2]-coords_center[2][centers_basis_integrals[i]];
}
  1. Precompute values to reduce discrete memory access. tab_atom_centered_coords runs n_my_batches_work times (once per iteration of the outer loop), and the access pattern coords_center[c][centers_basis_integrals[i]] is identical every time. Precomputing these values once removes the repeated indirection through centers_basis_integrals and turns the discrete accesses to coords_center into contiguous ones (see the sketch after this list).
  2. Offload to the slave cores using OpenACC.
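A minimal C sketch of the precomputation in point 1 is shown below. The function and array names (precompute_centers, coords_pruned) and the flat 3-by-N layout are assumptions for illustration, not the identifiers used in the original code; the gather runs once, after which tab_atom_centered_coords touches only contiguous arrays.

/* Gather the needed columns of coords_center once, so that the per-point
   loop no longer goes through centers_basis_integrals. Illustrative names. */
void precompute_centers(int n_centers_integrals,
                        const int *centers_basis_integrals,
                        const double *coords_center,   /* 3 x n_centers, coordinate fastest */
                        double *coords_pruned)         /* 3 x n_centers_integrals, contiguous */
{
    for(int i=0;i<n_centers_integrals;i++)
    {
        int c=centers_basis_integrals[i];              /* the only remaining indirect access */
        coords_pruned[3*i+0]=coords_center[3*c+0];
        coords_pruned[3*i+1]=coords_center[3*c+1];
        coords_pruned[3*i+2]=coords_center[3*c+2];
    }
}

/* tab_atom_centered_coords then becomes a purely contiguous loop. */
void tab_atom_centered_coords_pruned(int n_centers_integrals,
                                     const double coord_current[3],
                                     const double *coords_pruned,
                                     double *dir_tab)  /* 3 x n_centers_integrals */
{
    for(int i=0;i<n_centers_integrals;i++)
    {
        dir_tab[3*i+0]=coord_current[0]-coords_pruned[3*i+0];
        dir_tab[3*i+1]=coord_current[1]-coords_pruned[3*i+1];
        dir_tab[3*i+2]=coord_current[2]-coords_pruned[3*i+2];
    }
}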
| Before Optimization | After Optimization | Speedup Ratio |
| --- | --- | --- |
| 2.1 billion | 400 million | 5.25 |
Optimization of prune_density_matrix_sparse_polar_reduce_memory
  1. A similar function exists in H; offloading it to the slave cores there gave a speedup of 12.5. In rho, however, more than 60% of the time is spent on discrete memory access. An attempt was made to use non-blocking strided DMA, but a bug appeared after calling the corresponding function and has not yet been resolved.
| Before Optimization | After Optimization | Speedup Ratio |
| --- | --- | --- |
| 1.7 billion | 980 million | 1.7 |
Optimization of matrix_ddot
  1. SIMD.
  2. Double buffering was first implemented with non-blocking DMA, but the result was incorrect, which showed that the computation was finishing before the DMA had completed. Blocking DMA therefore had to be used, which negates the benefit of double buffering. With a memory-access-to-computation ratio of about 1, this kernel is clearly memory-bound.
  3. Load balancing was improved by giving each slave core one contiguous block of work instead of a round-robin distribution (see the sketch after this list).
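Below is a minimal sketch of the contiguous partitioning in point 3, assuming 64 slave cores and an element-indexed workload; my_block is a hypothetical helper, not part of the original code.

/* Slave core my_id (0..63) gets one contiguous slice of the vectors instead
   of every 64th element; the first total_len%64 cores get one extra element. */
void my_block(long total_len, int my_id, long *start, long *len)
{
    long base=total_len/64;                       /* minimum slice per slave core */
    long extra=total_len%64;                      /* number of cores with one extra element */
    *len=base+(my_id<extra ? 1 : 0);
    *start=my_id*base+(my_id<extra ? my_id : extra);
}

Each slave core can then fetch its slice with one large, sequential DMA transfer, whereas a round-robin assignment would fragment the transfers.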
| Before Optimization | After Offloading | SIMD Optimization | Double Buffering | Speedup Ratio |
| --- | --- | --- | --- | --- |
| 70 million | 14.4 million | 13.6 million | 13.6 million | 5.1 |
Final Optimization Result for rho
| Cores | Before Optimization (s) | After Optimization (s) |
| --- | --- | --- |
| 512 | 3.380 | 2.220 |
| 64 | 27.350 | 17.630 |

Optimization of Small Kernel H

Calculation Process of H (Described in C Language)
for(int i_my_batch=1;i_my_batch<n_my_batches_work;i_my_batch++)
{
    tab_atom_centered_coords();
    collect_batch_centers_p2();
    for(int i_index=0;i_index<batches_work_size(i_my_batch);i_index++)
    {
        prune_radial_basis_p2();
        tab_local_geometry_p2();
        tab_trigonom_p0();
        tab_wave_ylm_p0();
        evaluate_radial_functions_p0();
        evaluate_waves_p2();
        evaluate_xc_DFPT();
    }
    evaluate_first_order_H_polar_reduce_memory();
    prune_density_matrix_sparse_from_dense();
}
Eliminating Transposition of wave Array
!Original code
real*8 :: wave(n_max_compute_ham,n_points)
real*8 :: contract(n_points,n_compute_c),wave_t(n_points,n_max_compute_ham)

wave_t=transpose(wave)
do i_compute=1,n_compute_c
	do i_point=1,n_points
		contract(i_point,i_compute)=partition_tab(i_point)*wave_t(i_point,i_compute)*&
								   (-grid_coord(i_point)+v_hartree_gradient(i_point)&
								    +dvxc_drho(i_point)*first_order_rho(i_point))
	end do
end do
call dgemm("T","N",n_compute_c,n_compute_c,n_points,1.d0,contract,n_points,wave_t,n_points,0.d0,&
            first_order_H_dense,n_compute_c)

The above is part of the original code of evaluate_first_order_H_polar_reduce_memory. It transposes the wave array, presumably to improve the locality of the accesses inside the double loop. The transposition itself is expensive, however, and after linking against xmath it took longer than the matrix multiplication, so it was eliminated. To keep the accesses inside the double loop local, the loop order was swapped and the shape of the contract array was changed accordingly. The modified code is shown below:

real*8 :: wave(n_max_compute_ham,n_points)
real*8 :: contract(n_compute_c,n_points)

do i_point=1,n_points
	do i_compute=1,n_compute_c
		contract(i_compute,i_point)=partition_tab(i_point)*wave(i_compute,i_point)*&
								   (-grid_coord(i_point)+v_hartree_gradient(i_point)&
								    +dvxc_drho(i_point)*first_order_rho(i_point))
	end do
end do
call dgemm("N","T",n_compute_c,n_compute_c,n_points,1.d0,contract,n_compute_c,&
            wave,n_max_compute_ham,0.d0,first_order_H_dense,n_compute_c)
| Before Optimization | After Optimization | Speedup Ratio |
| --- | --- | --- |
| 650 million | 320 million | 2 |
matrix_ddot Optimization

This kernel is the double loop shown in the previous section. Based on the experience with rho, neither SIMD nor double buffering was used for it.

| Before Optimization | After Optimization | Speedup Ratio |
| --- | --- | --- |
| 8.9 million | 1.6 million | 5.6 |
prune_density_matrix_sparse_from_dense Optimization
| Before Optimization | After Optimization | Speedup Ratio |
| --- | --- | --- |
| 1.5 billion | 120 million | 12.5 |
Final Optimization Results of H
| Cores | Before Optimization (s) | After Optimization (s) |
| --- | --- | --- |
| 512 | 4.86 | 3.15 |
| 64 | 35.28 | 21.10 |

Large Kernel rho Optimization

Hotspot Analysis
| | tab_atom | prune_density | i_index_loop | dgemm | matrix_ddot | collect_batch |
| --- | --- | --- | --- | --- | --- | --- |
| Clock Cycles | 4 billion | 400 million | 1.75 billion | 26 million | 1.7 million | 25 million |
tab_atom_centered_coords

Optimization Method: as with the small kernel, use a lookup table to turn the discrete memory accesses into contiguous ones, and then fetch the data with DMA operations.

| Before Optimization | After Optimization | Speedup Ratio |
| --- | --- | --- |
| 4 billion | 400 million | 10 |
prune_density_matrix_sparse_polar_reduce_memory

Optimization Method: the matrix is symmetric, so reading only one triangle halves the memory access (sketched below). The remaining discrete accesses can be converted into faster DMA transfers by prefetching the data, making full use of the bandwidth.
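A minimal sketch of the symmetry idea follows, assuming the pruned matrix is assembled into a dense n-by-n block and that an index map gives each element's position in the sparse storage; the names and layout are assumptions for illustration, not the original data structures.

/* Only the upper triangle is read from the sparse storage; each fetched value
   fills both (i,j) and (j,i), halving the number of discrete accesses. */
void fill_symmetric(int n, const double *matrix_sparse,
                    const long *index_map,  /* n*n map into the sparse storage */
                    double *matrix_dense)   /* n x n output, row-major */
{
    for(int j=0;j<n;j++)
    {
        for(int i=0;i<=j;i++)
        {
            double v=matrix_sparse[index_map[(long)j*n+i]];  /* one indirect load */
            matrix_dense[(long)j*n+i]=v;
            matrix_dense[(long)i*n+j]=v;
        }
    }
}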

| Before Optimization | After Optimization | Speedup Ratio |
| --- | --- | --- |
| 400 million | 300 million | 1.3 |
Final Optimization Results of Large Kernel rho
| Cores | Before Optimization (s) | After Optimization (s) |
| --- | --- | --- |
| 512 | 3.51 | 2.32 |
| 64 | 22.31 | 15.23 |

Large Kernel H Optimization

Hotspot Analysis
| Function | Proportion |
| --- | --- |
| tab_atom_centered_coords | 10% |
| prune_radial_basis | 4% |
| tab_local_geometry | |
| tab_trigonom | |
| tab_wave_ylm | |
| evaluate_radial_function | |
| evaluate_waves | |
| pz_lda | |
| dgemm+ddot | |
| pre_reduction | |
| reduction | 82% |

Comparison of Large and Small Kernels

rho
| Cores | Large Kernel (s) | Small Kernel (s) | Original (s) |
| --- | --- | --- | --- |
| 64 | 15.11 | 17.61 | 239.60 |
| 512 | 2.50 | 2.24 | 31.45 |
Comparison of rho Components (only 64-core results are compared because of the load imbalance in the 512-core large kernel)
| (clock cycles) | tab_atom | prune_density | i_index_loop | dgemm | matrix_ddot | collect_batch |
| --- | --- | --- | --- | --- | --- | --- |
| Small Kernel (64 cores) | 3.4 billion | 8.2 billion | 17.1 billion | 100 million | 15 million | 140 million |
| Large Kernel (64 cores) | 2.5 billion | 6.4 billion | 21.1 billion | 360 million | 24 million | 480 million |
Summary

The earlier conclusion was that the small kernel is slightly faster than the large kernel: as the table above shows, at 512 cores the large kernel takes 2.50 seconds while the small kernel takes 2.24 seconds. At 64 cores, however, the large kernel is slightly faster. Measuring the runtime of each slave core of the large kernel showed that the cause is load imbalance among the slave cores.

H
| Cores | Large Kernel (s) | Small Kernel (s) | Original (s) |
| --- | --- | --- | --- |
| 64 | 85.92 | 21.10 | 131.02 |
| 512 | 13.63 | 3.15 | 17.47 |
Comparison of H Components
| (clock cycles) | tab_atom | update first_order_H | i_index_loop | dgemm+ddot | collect_batch |
| --- | --- | --- | --- | --- | --- |
| Small Kernel (64 cores) | 3.4 billion | 980 million | 26.3 billion | 980 million | 140 million |
| Large Kernel (64 cores) | 2.5 billion | 137.2 billion | 9.08 billion | 370 million | 240 million |
Summary

In the update first_order_H step, all 64 slave cores update first_order_H at the same time, which causes data conflicts. Apart from this step, every component of the large kernel performs better than its small-kernel counterpart, so if the large kernel could update first_order_H more efficiently, it would make sense to use it to compute H. Two approaches are currently under consideration: fine-grained locks, or dedicating a few slave cores to updating first_order_H (a generic sketch of the locking idea follows).
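The sketch below uses C11 atomics purely for illustration; the actual synchronization primitive on the Sunway slave cores would be different, and the block size, lock count, and names are assumptions.

#include <stdatomic.h>

/* Each block of 64 consecutive entries of first_order_H maps to one of
   N_LOCKS spinlocks, so two slave cores only serialize when they update
   the same block. Assumes one update stays inside one block. */
#define N_LOCKS 256
static atomic_int h_locks[N_LOCKS];               /* 0 = free, 1 = held */

void update_block(double *first_order_H, const double *contrib,
                  long offset, long len)
{
    int lock_id=(int)((offset/64)%N_LOCKS);       /* block -> lock, illustrative choice */
    while(atomic_exchange_explicit(&h_locks[lock_id], 1, memory_order_acquire))
        ;                                         /* spin until the block is free */
    for(long i=0;i<len;i++)
        first_order_H[offset+i]+=contrib[i];
    atomic_store_explicit(&h_locks[lock_id], 0, memory_order_release);
}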

Some Failed Attempts

Optimization of i_index_loop

In the rho optimization section it can be seen that this loop accounts for 31% of the runtime; it contains seven small functions. The first idea was to run the entire loop on the slave cores, but the runtime increased from 2.1 billion to 3.4 billion clock cycles. Analysis showed that the loop touches too many arrays, many of which are accessed discretely, so even enabling the cache did not help much.

The approach was then changed to offloading the small functions inside the loop individually, choosing tab_local_geometry_p2 and tab_trigonom_p0. tab_local_geometry_p2 involves four arrays, three accessed contiguously and one discretely, and has a significant amount of computation; running it on the slave cores increased the runtime from 200 million to 800 million clock cycles. tab_trigonom_p0 involves only two arrays, both accessed contiguously, and also has some computation; running it on the slave cores increased the runtime from 190 million to 650 million clock cycles. Two reasons were found: first, repeatedly launching the slave cores carries significant overhead; second, the tasks of these two kernels are too small to be worth distributing over 64 slave cores.

In summary, running the whole loop on the slave cores is slowed down by the many discrete memory accesses, while offloading the inner functions individually cannot offset the repeated launch overhead because the tasks are too small. Loop splitting is another option, but it would waste too much space, so this loop cannot be optimized.

Master-Slave Collaboration

There is no data dependency between i_index_loop and prune_density_matrix_sparse_polar_reduce_memory. Since i_index_loop cannot be run on the slave cores (see above), the idea was to run prune_density_matrix_sparse_polar_reduce_memory on the slave cores while i_index_loop runs on the master core, hiding the runtime of prune_density_matrix_sparse_polar_reduce_memory (the intended overlap pattern is sketched below). However, because of xmath, some intrinsic functions called inside i_index_loop, such as log, themselves use the slave cores, so this attempt also failed.
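The overlap that was attempted would look roughly like the following in C, assuming the C athread interface mirrors the Fortran calls used earlier (athread_spawn launches the slave cores and returns immediately, athread_join blocks) and following the usual convention that the slave routine's external symbol carries a slave_ prefix; slave_prune_density and i_index_loop_on_master are illustrative stand-ins, not the original identifiers.

#include <athread.h>

extern void slave_prune_density(void *);          /* slave-core version (illustrative name) */
void i_index_loop_on_master(void *arg);           /* the pruned-basis loop, kept on the master core */

void overlapped_batch(void *param)
{
    athread_spawn(prune_density, param);          /* returns as soon as the slave cores start */
    i_index_loop_on_master(param);                /* runs on the master core in the meantime */
    athread_join();                               /* wait for the slave cores to finish */
}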

Since prune_density_matrix_sparse_polar_reduce_memory cannot overlap with i_index_loop in a master-slave manner, it could instead overlap with tab_atom_centered_coords. The outcome is obvious, though: the speedup of running tab_atom_centered_coords on the slave cores is much greater than 2, so running both functions on the slave cores is better than master-slave collaboration.

Summary

The Sunway OceanLight is severely bandwidth-limited between the master core and the slave cores, so unless a kernel is particularly computation-intensive, the slave cores cannot be fully utilized; kernels that are not computation-intensive can be treated as memory-bound. Even so, offloading a purely memory-bound kernel is not meaningless. Computation-intensive kernels achieve the best speedup on the slave cores; kernels with only contiguous or nearly contiguous memory access achieve a moderate speedup, with ratios between 2 and 12; and kernels with many discrete memory accesses may even slow down on the slave cores.