
# Project Progress and Log

This is the log of my 100 Days of CUDA challenge and what I implemented each day.

Mentor: https://github.com/hkproj

## Task list

| Day  | Task       | Description | Status |
| ---- | ---------- | ----------- | ------ |
| D001 | Mandatory  | FA2 forward pass: implement the forward pass for FlashAttention-2 | DONE ✅ |
| D005 | Mandatory  | FA2 backward pass: implement the backward pass for FlashAttention-2 | PENDING |
| D010 | Side quest | Chunked cross-entropy loss: fuse the logits layer and the computation of the CE loss by chunks (ref. the Liger Kernel implementation in Triton) | PENDING |

## Progress by day

| Day    | Files |
| ------ | ----- |
| day001 | `flash_attention_fwd.cu`: forward pass of Flash Attention 2 |

## Previous stride

### Short summary

| Day   | Files |
| ----- | ----- |
| day01 | `vecAdd.cu`: parallel vector addition<br>`answers.cu`: answers to the Chapter 2 exercises of PMPP |
| day02 | `matrixMult.cu`: matrix multiplication kernel<br>`grayscale`: color-to-grayscale kernel<br>`imageBlur.cu`: image blur kernel |
| day03 | `answers.cu`: answers to the Chapter 3 exercises of PMPP |
| day04 | `simpleSumReductionKernel.cu`: tree-based sum reduction<br>Learnings: barrier synchronization |
| day05 | `convergentSumReduction.cu`: convergent (less divergent) version of the previous sum reduction<br>Log: exercises of Chapter 4 |
| day06 | `tiledMatMul.cu`: tiled matrix multiplication |
| day07 | `convoluton_2d.cu`: a simple 2D convolution |
| day08 | `convolution_with_caching.cu`: 2D convolution with tiling and caching in constant memory |
| day09 | `matmulEnhanced.cu`: generalized the tiled matrix multiplication with a dynamic 1D shared-memory array and memory coalescing |
| day10 | `ch5_exercises.cu`: solutions to Chapter 5 of PMPP<br>`tile_matrix_transpose.cu`: tiled matrix transpose kernel |
| day11 | `convolution_2d.cu`: tiled convolution |
| day12 | `convolution.cu`: tiled convolution with cached halo cells |

## Summary

### Day 08

Enhanced the 2D convolution to add caching and tiling. Key learnings:

- Intrinsic hardware caching of read-only data placed in constant memory with `__constant__`
- Shared memory
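
A minimal sketch of how the constant-memory part fits together (the 5×5 filter size and all names here are illustrative, not the repo's exact code). The filter lives in `__constant__` memory, which is backed by a dedicated cache and broadcasts cheaply when all threads of a warp read the same element, which is exactly the access pattern of a convolution filter:

```cuda
#include <cuda_runtime.h>

#define FILTER_DIM 5  // assumed 5x5 filter, for illustration

// Read-only filter in constant memory: served by a dedicated on-chip
// cache, broadcast efficiently when a whole warp reads one element.
__constant__ float d_filter[FILTER_DIM * FILTER_DIM];

// Host side, before the launch:
//   cudaMemcpyToSymbol(d_filter, h_filter,
//                      FILTER_DIM * FILTER_DIM * sizeof(float));

__global__ void conv2d_const(const float *in, float *out, int width, int height) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row >= height || col >= width) return;

    int r = FILTER_DIM / 2;
    float acc = 0.0f;
    for (int fy = 0; fy < FILTER_DIM; ++fy)
        for (int fx = 0; fx < FILTER_DIM; ++fx) {
            int y = row - r + fy, x = col - r + fx;
            if (y >= 0 && y < height && x >= 0 && x < width)
                acc += d_filter[fy * FILTER_DIM + fx] * in[y * width + x];
        }
    out[row * width + col] = acc;
}
```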

### Day 09

Enhanced the 2D matrix multiplication with dynamic shared memory and generalization to arbitrary dimensions. Key takeaways from experiments:

- Profile kernel hardware performance with `ncu <executable>`
- Coalesce memory accesses for better memory throughput (use consecutive rather than scattered/strided accesses)
- Guard against garbage-value errors with boundary conditions for arbitrary dimensions
- An appropriate tile size can change performance drastically. Observations from running on Colab's T4 (a sketch of the kernel pattern follows the table):
| Tile size | Non-tiled kernel time (ms) | Tiled kernel time (ms) |
| --------- | -------------------------- | ---------------------- |
| 2         | 41609.312                  | 99787.430              |
| 4         | 16879.109                  | 17574.977              |
| 8         | 8604.168                   | 5561.509               |
| 16        | 5727.267                   | 4158.605               |
| 32        | 4160.248                   | 4791.448               |
| 64        | 0.826                      | 0.347                  |
| 128       | 0.838                      | 0.238                  |
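
(One caveat on the last two rows: with square 2D blocks, tile sizes 64 and 128 would need 4096 and 16384 threads per block, above CUDA's 1024-threads-per-block limit, so those launches most likely failed silently and the sub-millisecond numbers probably do not measure real work; worth re-checking with `cudaGetLastError`.)

Below is a minimal sketch of the dynamic-shared-memory pattern described above. The kernel name, the runtime `TILE` parameter, and the row-major `M×K`/`K×N` layouts are assumptions for illustration, not the repo's exact kernel:

```cuda
// Tiled matmul over arbitrary M, N, K with a dynamically sized
// 1D shared-memory array and boundary guards.
__global__ void tiledMatMul(const float *A, const float *B, float *C,
                            int M, int N, int K, int TILE) {
    extern __shared__ float smem[];           // sized at launch time
    float *As = smem;                         // first TILE*TILE floats: tile of A
    float *Bs = smem + TILE * TILE;           // next TILE*TILE floats: tile of B

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;

    float acc = 0.0f;
    for (int t = 0; t < (K + TILE - 1) / TILE; ++t) {
        int aCol = t * TILE + threadIdx.x;    // consecutive threadIdx.x ->
        int bRow = t * TILE + threadIdx.y;    //   coalesced global loads
        // Boundary guards: pad with zeros instead of reading garbage.
        As[threadIdx.y * TILE + threadIdx.x] =
            (row < M && aCol < K) ? A[row * K + aCol] : 0.0f;
        Bs[threadIdx.y * TILE + threadIdx.x] =
            (bRow < K && col < N) ? B[bRow * N + col] : 0.0f;
        __syncthreads();
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y * TILE + k] * Bs[k * TILE + threadIdx.x];
        __syncthreads();
    }
    if (row < M && col < N)
        C[row * N + col] = acc;
}
```

The shared-memory size goes in the third launch parameter, e.g. `tiledMatMul<<<grid, dim3(TILE, TILE), 2 * TILE * TILE * sizeof(float)>>>(dA, dB, dC, M, N, K, TILE);` with `TILE <= 32` so the block stays within 1024 threads.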

### Day 10

Implemented a tiled matrix transpose kernel and solved the Chapter 5 exercises of PMPP.

Key learnings:

- Optimize for occupancy
- Check whether an application is compute-bound or memory-bound
- Improve arithmetic intensity
- Watch for race conditions among threads of a block in shared-memory access patterns
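
A sketch of the shared-memory transpose pattern (the 32×32 tile and the `+1` column padding, which sidesteps shared-memory bank conflicts, are illustrative choices, not necessarily the repo's exact code). Staging through shared memory lets both the global read and the global write be coalesced:

```cuda
#define TILE_DIM 32

__global__ void tiledTranspose(const float *in, float *out, int width, int height) {
    __shared__ float tile[TILE_DIM][TILE_DIM + 1]; // +1 avoids bank conflicts

    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x]; // coalesced read
    __syncthreads();

    // Swap block indices so the write is also coalesced.
    x = blockIdx.y * TILE_DIM + threadIdx.x;
    y = blockIdx.x * TILE_DIM + threadIdx.y;
    if (x < height && y < width)
        out[y * height + x] = tile[threadIdx.x][threadIdx.y];
}
```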

### Day 11

Added tiling to the 2D convolution kernel.
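
A sketch of the tiled structure under assumed sizes (radius-1 filter, 18×18 input tile; not the repo's exact constants): every thread stages one input element, halo included, into shared memory, and only the interior threads then compute an output element.

```cuda
#define R 1                          // filter radius (assumed)
#define IN_TILE 18                   // blockDim = IN_TILE x IN_TILE
#define OUT_TILE (IN_TILE - 2 * R)   // output elements per block side

__constant__ float F[(2 * R + 1) * (2 * R + 1)];

__global__ void conv2d_tiled(const float *in, float *out, int width, int height) {
    int tx = threadIdx.x, ty = threadIdx.y;         // as ints: halo math can go negative
    int col = (int)blockIdx.x * OUT_TILE + tx - R;  // global coords of the staged
    int row = (int)blockIdx.y * OUT_TILE + ty - R;  //   element (may lie in the halo)

    __shared__ float tileS[IN_TILE][IN_TILE];
    tileS[ty][tx] = (row >= 0 && row < height && col >= 0 && col < width)
                        ? in[row * width + col] : 0.0f; // zero-pad out-of-range halo
    __syncthreads();

    // Only the interior OUT_TILE x OUT_TILE threads produce output.
    if (tx < R || tx >= IN_TILE - R || ty < R || ty >= IN_TILE - R) return;
    if (row < 0 || row >= height || col < 0 || col >= width) return;

    float acc = 0.0f;
    for (int fy = 0; fy < 2 * R + 1; ++fy)
        for (int fx = 0; fx < 2 * R + 1; ++fx)
            acc += F[fy * (2 * R + 1) + fx] * tileS[ty - R + fy][tx - R + fx];
    out[row * width + col] = acc;
}
```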

### Day 12

Added caching for halo cells in the 2D convolution.

Key learnings:

- Constant memory
- L1, L2, and L3 caches
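
A sketch of the cached-halo variant, with illustrative names and sizes rather than the repo's exact code: unlike the day 11 kernel, each block stages only its own output-sized tile in shared memory and reads halo cells straight from global memory, where they usually hit the L2 cache because neighbouring blocks have just loaded them.

```cuda
#define TILE 16   // blockDim = TILE x TILE (assumed)
#define R 1       // filter radius (assumed)

__constant__ float F[(2 * R + 1) * (2 * R + 1)];

__global__ void conv2d_cached_halo(const float *in, float *out, int width, int height) {
    int tx = threadIdx.x, ty = threadIdx.y;
    int col = blockIdx.x * TILE + tx;
    int row = blockIdx.y * TILE + ty;

    // Stage only this block's own TILE x TILE patch -- no halo in shared memory.
    __shared__ float tileS[TILE][TILE];
    tileS[ty][tx] = (row < height && col < width) ? in[row * width + col] : 0.0f;
    __syncthreads();

    if (row >= height || col >= width) return;

    float acc = 0.0f;
    for (int fy = 0; fy < 2 * R + 1; ++fy) {
        for (int fx = 0; fx < 2 * R + 1; ++fx) {
            int gy = row - R + fy, gx = col - R + fx;           // global coords
            if (gy < 0 || gy >= height || gx < 0 || gx >= width) continue;
            int sy = ty - R + fy, sx = tx - R + fx;             // tile coords
            float v = (sy >= 0 && sy < TILE && sx >= 0 && sx < TILE)
                          ? tileS[sy][sx]        // interior: shared memory
                          : in[gy * width + gx]; // halo: global read, likely an L2 hit
            acc += F[fy * (2 * R + 1) + fx] * v;
        }
    }
    out[row * width + col] = acc;
}
```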
