Parallel programming can be intimidating, but it doesn't need to be! There's a new paradigm for parallel programming that's newcomer-friendly, highly productive, and performant: tile-based programming models.
In this example-driven talk, we'll introduce you to tile-based programming in Python, C++, and Rust. We'll present [cuTile](https://github.com/NVIDIA/cutile-python), NVIDIA's new tile programming stack, and [Tile IR](https://github.com/NVIDIA/cuda-tile), the new compiler stack that it is built on. You'll learn about recently announced CUDA Tile features, including multi-GPU communication, interoperability with traditional CUDA SIMT, and support for more diverse kernels like convolutions and stencils. We'll compare and contrast tile-based models with traditional parallel programming models. You'll see examples from a variety of domains, including HPC stencils, sparse matrix-vector multiplication (SpMV), a conjugate gradient (CG) solver, and AI models from [TileGym](https://github.com/NVIDIA/TileGym).
Tile programming has its roots in HPC libraries, such as [NWChem’s TCE](https://nwchemgit.github.io/TCE.html), [BLIS](https://github.com/flame/blis), and [ATLAS](https://math-atlas.sourceforge.net/). In recent years, this paradigm has grown in popularity for GPU programming in languages such as [Triton](https://openai.com/index/triton/), [JAX/Pallas](https://docs.jax.dev/en/latest/pallas/index.html), and [Warp](https://nvidia.github.io/warp/modules/tiles.html).
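To give a flavor of the idea, here is a minimal sketch of a blocked matrix multiply in plain Python. It is not the cuTile, Triton, or BLIS API; the names `TILE`, `load_tile`, `tile_matmul_accumulate`, and `tiled_matmul` are hypothetical and chosen only for illustration. The point is the shape of the code: the outer loops pick an output tile, and the work inside is expressed as operations on whole tiles rather than on individual elements.

```python
# Illustrative only: plain Python, not the cuTile API.
TILE = 2  # hypothetical tile edge length, kept small for readability


def load_tile(m, row, col, size):
    """Copy a size x size tile of matrix m starting at (row, col)."""
    return [r[col:col + size] for r in m[row:row + size]]


def tile_matmul_accumulate(acc, a_tile, b_tile):
    """acc += a_tile @ b_tile, treated as one whole-tile operation."""
    for i, a_row in enumerate(a_tile):
        for j in range(len(b_tile[0])):
            acc[i][j] += sum(a_row[k] * b_tile[k][j] for k in range(len(b_tile)))


def tiled_matmul(a, b, tile=TILE):
    """Multiply square matrices whose size is a multiple of `tile`."""
    n = len(a)
    c = [[0] * n for _ in range(n)]
    # Each (ti, tj) pair is an independent unit of work. In a tile-based
    # GPU model, these output tiles are the natural candidates to run in parallel.
    for ti in range(0, n, tile):
        for tj in range(0, n, tile):
            acc = [[0] * tile for _ in range(tile)]
            for tk in range(0, n, tile):
                tile_matmul_accumulate(
                    acc, load_tile(a, ti, tk, tile), load_tile(b, tk, tj, tile)
                )
            # Store the finished output tile back into c.
            for i in range(tile):
                c[ti + i][tj:tj + tile] = acc[i]
    return c


def naive_matmul(a, b):
    """Element-at-a-time reference used to check the tiled version."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]
```

In a real tile programming stack, the per-tile load, multiply-accumulate, and store would be provided by the language and mapped to hardware by the compiler; this sketch only mirrors the structure.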
In this session, you'll:
- Learn best practices for writing tile-parallel applications for GPUs.
- Gain insight into the performance of tile code and how it actually gets executed.
- Discover how to reason about and debug tile applications.
- Understand the differences between tile and traditional parallel programming and when each paradigm should be used.
- See how tile programming makes your software portable in light of recent hardware trends.
By the end of the session, you'll understand how tile programming enables more intuitive, portable, and efficient development of high-performance, data-parallel applications for HPC, data science, and machine learning.