Written with StackEdit.
12-1-25 Meeting Note
https://github.com/happyandslow/cs_sdk/tree/main/csl-extras-202505230211-4-d9070058/examples/demo/gemv-h2d-multiple-pes-two-tenants-horizonal

https://github.com/happyandslow/cs_sdk/tree/main/csl-extras-202505230211-4-d9070058/examples/demo/gemv-h2d-multiple-pes-two-tenants
Comparing 1*4 kernels side-by-side
2 Iterations with single memcpy and compute
Vertical Layout (2 Iterations)
metric count mean stddev min p25 p50 p75 max
------------------------------------------------------------------------------------------------------------------------------------------
cycle_count 10 46761.4 10208.6 36295 38642.8 43995.5 51811.2 66195
cycles_per_second 10 16047.6 5693.57 10284.4 11218.1 14480.1 19286 25246.4
fabric_x 10 11 0 11 11 11 11 11
fabric_y 10 4 0 4 4 4 4 4
hwtile_count 10 22 0 22 22 22 22 22
idle_ce_cycles 10 24216.7 9535.46 14248 16674.2 22243.5 28409.8 42639
idle_wavelet_cycles 10 24252.7 9535.46 14284 16710.2 22279.5 28445.8 42675
init_time 10 0.0322554 0.0023908 0.029521 0.0306993 0.0311811 0.0329807 0.0366311
iotile_count 10 4 0 4 4 4 4 4
nodes 10 1 0 1 1 1 1 1
nultile_count 10 48 0 48 48 48 48 48
sim_time 10 3.04425 0.46249 2.39171 2.6894 2.9982 3.31186 3.78715
simulated_tile_count 10 26 0 26 26 26 26 26
sunset_count 10 0 0 0 0 0 0 0
threads 10 5 0 5 5 5 5 5
tile_count 10 74 0 74 74 74 74 74
tile_cycles_per_second 10 417236 148033 267394 291670 376482 501435 656406
tile_cycles_per_second_per_thread 10 83447.3 29606.6 53478.7 58333.9 75296.3 100287 131281
total_time 10 3.0765 0.462769 2.42123 2.72191 3.03185 3.34757 3.81781
Horizontal Layout (2 iterations)
metric count mean stddev min p25 p50 p75 max
------------------------------------------------------------------------------------------------------------------------------------------
cycle_count 10 46973.9 9032.91 37021 40133.8 44255 52892.5 63999
cycles_per_second 10 14203.7 4245.05 9039 11317.4 13517.3 16463.7 21244.4
fabric_x 10 15 0 15 15 15 15 15
fabric_y 10 3 0 3 3 3 3 3
hwtile_count 10 17 0 17 17 17 17 17
idle_ce_cycles 10 24212.3 8979.98 14918 17760 20836.5 30180 41510
idle_wavelet_cycles 10 24267.3 8979.98 14973 17815 20891.5 30235 41565
init_time 10 0.0283541 0.00164099 0.0264111 0.0271699 0.0278826 0.0297654 0.03069
iotile_count 10 4 0 4 4 4 4 4
nodes 10 1 0 1 1 1 1 1
nultile_count 10 60 0 60 60 60 60 60
sim_time 10 3.41647 0.470655 2.86544 3.03209 3.42082 3.66098 4.17313
simulated_tile_count 10 21 0 21 21 21 21 21
sunset_count 10 0 0 0 0 0 0 0
threads 10 5 0 5 5 5 5 5
tile_count 10 81 0 81 81 81 81 81
tile_cycles_per_second 10 298279 89146 189819 237666 283864 345737 446133
tile_cycles_per_second_per_thread 10 59655.7 17829.2 37963.8 47533.2 56772.7 69147.3 89226.6
total_time 10 3.44482 0.471059 2.89604 3.06081 3.44816 3.68816 4.20311
10 Iterations with single memcpy and compute
Vertical Layout (10 Iterations)
metric count mean stddev min p25 p50 p75 max
------------------------------------------------------------------------------------------------------------------------------------------
cycle_count 10 125435 6920.86 116842 121031 125154 128450 140549
cycles_per_second 10 10828.6 2037.6 7736.92 9274.07 10493.4 12163.3 13871.1
fabric_x 10 11 0 11 11 11 11 11
fabric_y 10 4 0 4 4 4 4 4
hwtile_count 10 22 0 22 22 22 22 22
idle_ce_cycles 10 23548.5 7149.74 15726 18401.5 22981 26423.8 39442
idle_wavelet_cycles 10 23584.5 7149.74 15762 18437.5 23017 26459.8 39478
init_time 10 0.0693907 0.0166098 0.0303891 0.0658712 0.068144 0.0788336 0.089931
iotile_count 10 4 0 4 4 4 4 4
nodes 10 1 0 1 1 1 1 1
nultile_count 10 48 0 48 48 48 48 48
sim_time 10 11.8887 1.84913 9.12147 11.0402 11.8142 12.995 15.1019
simulated_tile_count 10 26 0 26 26 26 26 26
sunset_count 10 0 0 0 0 0 0 0
threads 10 5 0 5 5 5 5 5
tile_count 10 74 0 74 74 74 74 74
tile_cycles_per_second 10 281543 52977.6 201160 241126 272827 316245 360648
tile_cycles_per_second_per_thread 10 56308.6 10595.5 40232 48225.2 54565.5 63249.1 72129.5
total_time 10 11.958 1.85909 9.18755 11.1099 11.8925 13.061 15.1897
Horizontal Layout (10 Iterations)
metric count mean stddev min p25 p50 p75 max
------------------------------------------------------------------------------------------------------------------------------------------
cycle_count 10 130990 10830.2 119911 121688 128575 135322 149732
cycles_per_second 10 10147 2837.8 7594.06 8277.59 9563.56 11133 17004
fabric_x 10 15 0 15 15 15 15 15
fabric_y 10 3 0 3 3 3 3 3
hwtile_count 10 17 0 17 17 17 17 17
idle_ce_cycles 10 25675 10677.6 14347 16815.5 23227.5 29955.8 43641
idle_wavelet_cycles 10 25730 10677.6 14402 16870.5 23282.5 30010.8 43696
init_time 10 0.066142 0.0157478 0.031801 0.0591087 0.0676404 0.0789888 0.082206
iotile_count 10 4 0 4 4 4 4 4
nodes 10 1 0 1 1 1 1 1
nultile_count 10 60 0 60 60 60 60 60
sim_time 10 13.4398 2.16316 8.80571 12.4819 13.7572 14.8982 15.8689
simulated_tile_count 10 21 0 21 21 21 21 21
sunset_count 10 0 0 0 0 0 0 0
threads 10 5 0 5 5 5 5 5
tile_count 10 81 0 81 81 81 81 81
tile_cycles_per_second 10 213087 59593.7 159475 173829 200835 233792 357083
tile_cycles_per_second_per_thread 10 42617.4 11918.7 31895.1 34765.9 40166.9 46758.5 71416.7
total_time 10 13.5059 2.16122 8.88524 12.5207 13.8348 14.9573 15.926
2 Iterations with single memcpy and 10 x compute
Vertical Layout
metric count mean stddev min p25 p50 p75 max
------------------------------------------------------------------------------------------------------------------------------------------
cycle_count 10 52818.2 10225.7 44366 45959 47990 58884.5 74729
cycles_per_second 10 12798.7 4484.03 8138.38 10123.6 11735.7 13142.1 23798.4
fabric_x 10 11 0 11 11 11 11 11
fabric_y 10 4 0 4 4 4 4 4
hwtile_count 10 22 0 22 22 22 22 22
idle_ce_cycles 10 21807 9669.41 14067 15049.5 17293 27304 42364
idle_wavelet_cycles 10 21843 9669.41 14103 15085.5 17329 27340 42400
init_time 10 0.0406059 0.0103537 0.033648 0.0358059 0.036755 0.040313 0.0687821
iotile_count 10 4 0 4 4 4 4 4
nodes 10 1 0 1 1 1 1 1
nultile_count 10 48 0 48 48 48 48 48
sim_time 10 4.30005 0.660048 3.14009 3.84667 4.45663 4.63277 5.45145
simulated_tile_count 10 26 0 26 26 26 26 26
sunset_count 10 0 0 0 0 0 0 0
threads 10 5 0 5 5 5 5 5
tile_count 10 74 0 74 74 74 74 74
tile_cycles_per_second 10 332766 116585 211598 263215 305129 341694 618757
tile_cycles_per_second_per_thread 10 66553.1 23317 42319.6 52642.9 61025.8 68338.9 123751
total_time 10 4.34065 0.657292 3.17687 3.89049 4.49558 4.67027 5.48642
Horizontal Layout
metric count mean stddev min p25 p50 p75 max
------------------------------------------------------------------------------------------------------------------------------------------
cycle_count 10 55421.7 10357.6 45138 46344.2 53271 59809.2 73621
cycles_per_second 10 11468.1 3787.58 7637.07 9016.43 10047.7 13227.2 18823.8
fabric_x 10 15 0 15 15 15 15 15
fabric_y 10 3 0 3 3 3 3 3
hwtile_count 10 17 0 17 17 17 17 17
idle_ce_cycles 10 24095.4 10124.7 14507 15631.5 21510.5 28366 42914
idle_wavelet_cycles 10 24150.4 10124.7 14562 15686.5 21565.5 28421 42969
init_time 10 0.0400846 0.0130995 0.0309448 0.0326537 0.0374765 0.0400236 0.0757778
iotile_count 10 4 0 4 4 4 4 4
nodes 10 1 0 1 1 1 1 1
nultile_count 10 60 0 60 60 60 60 60
sim_time 10 5.02384 0.687986 3.77337 4.54604 5.01124 5.60599 5.91038
simulated_tile_count 10 21 0 21 21 21 21 21
sunset_count 10 0 0 0 0 0 0 0
threads 10 5 0 5 5 5 5 5
tile_count 10 81 0 81 81 81 81 81
tile_cycles_per_second 10 240830 79539.1 160378 189345 211001 277772 395299
tile_cycles_per_second_per_thread 10 48166 15907.8 32075.7 37869 42200.1 55554.4 79059.8
total_time 10 5.06393 0.692825 3.80431 4.58548 5.04832 5.63884 5.98616
2 Iterations with single memcpy and 50 x compute
Vertical Layout
metric count mean stddev min p25 p50 p75 max
------------------------------------------------------------------------------------------------------------------------------------------
cycle_count 10 93748.4 5242.77 85065 91060 95339.5 96506 102582
cycles_per_second 10 10727.5 1714.83 7826.58 9394.24 10848.7 12077 12801.9
fabric_x 10 11 0 11 11 11 11 11
fabric_y 10 4 0 4 4 4 4 4
hwtile_count 10 22 0 22 22 22 22 22
idle_ce_cycles 10 24431.9 5313.73 15914 22064.5 25640 27515.8 33093
idle_wavelet_cycles 10 24467.9 5313.73 15950 22100.5 25676 27551.8 33129
init_time 10 0.058961 0.0093474 0.0353189 0.0586743 0.059738 0.061956 0.0725899
iotile_count 10 4 0 4 4 4 4 4
nodes 10 1 0 1 1 1 1 1
nultile_count 10 48 0 48 48 48 48 48
sim_time 10 8.89136 1.07938 7.38655 8.17505 8.8842 9.57714 10.8687
simulated_tile_count 10 26 0 26 26 26 26 26
sunset_count 10 0 0 0 0 0 0 0
threads 10 5 0 5 5 5 5 5
tile_count 10 74 0 74 74 74 74 74
tile_cycles_per_second 10 278916 44585.5 203491 244250 282067 314003 332850
tile_cycles_per_second_per_thread 10 55783.3 8917.1 40698.2 48850.1 56413.4 62800.5 66569.9
total_time 10 8.95032 1.0779 7.44884 8.23304 8.94613 9.64683 10.9282
Horizontal Layout
metric count mean stddev min p25 p50 p75 max
------------------------------------------------------------------------------------------------------------------------------------------
cycle_count 10 97647.5 8022.51 87910 93995.5 95196.5 99443.8 117358
cycles_per_second 10 9900.58 2260.3 8225.53 8506.79 9104.97 9779.21 15330.7
fabric_x 10 15 0 15 15 15 15 15
fabric_y 10 3 0 3 3 3 3 3
hwtile_count 10 17 0 17 17 17 17 17
idle_ce_cycles 10 27235.7 7630.04 17905 24340.5 24792.5 29402.2 45592
idle_wavelet_cycles 10 27290.7 7630.04 17960 24395.5 24847.5 29457.2 45647
init_time 10 0.0603226 0.0126066 0.0365732 0.0535103 0.0601815 0.0650843 0.079649
iotile_count 10 4 0 4 4 4 4 4
nodes 10 1 0 1 1 1 1 1
nultile_count 10 60 0 60 60 60 60 60
sim_time 10 10.1091 1.27679 7.6551 9.83759 10.2127 11.199 11.4163
simulated_tile_count 10 21 0 21 21 21 21 21
sunset_count 10 0 0 0 0 0 0 0
threads 10 5 0 5 5 5 5 5
tile_count 10 81 0 81 81 81 81 81
tile_cycles_per_second 10 207912 47466.2 172736 178643 191204 205363 321945
tile_cycles_per_second_per_thread 10 41582.5 9493.25 34547.2 35728.5 38240.9 41072.7 64389
total_time 10 10.1694 1.28014 7.7113 9.8906 10.2826 11.2573 11.496
TODO
- Move right 1x4 to the right bottom corner
- Control + Data color

