Bookmarks

Advanced Performance Optimizations for Models

TT-NN operator library and TT-Metalium low-level kernel programming model. - tenstorrent/tt-metal

Algorithms for Modern Hardware

Its intended audience is everyone from performance engineers and practical algorithm researchers to undergraduate computer science students who have just finished an advanced algorithms course and want to learn more practical ways to speed up a program than by going from O(n log n) to O(n log log n).

Optimizing subroutines in assembly language

Optimizing subroutines in assembly language involves techniques such as using inline assembly in a C++ compiler, separating code that uses MMX registers from code that uses ST registers, and understanding the different register sizes and memory operands. Other topics include instruction prefixes, intrinsic functions for vector operations, and efficient access to class and structure members. Preventing false dependences, aligning loop and subroutine entries, and optimizing instruction sizes can also improve performance. These optimizations are processor-specific, so their benefit varies with the target platform.
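
As a rough illustration of the "intrinsic functions for vector operations" item, here is a minimal C sketch that adds two float arrays four elements at a time with SSE intrinsics. The function name add_floats and the HAVE_SSE macro are hypothetical, chosen for this example and not taken from the bookmarked manual. The code falls back to a plain scalar loop when SSE is not available, which reflects the point that such optimizations are processor-specific.

```c
#include <assert.h>
#include <stddef.h>

/* Use SSE intrinsics only where the compiler reports support for them. */
#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h> /* __m128, _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps */
#define HAVE_SSE 1
#else
#define HAVE_SSE 0
#endif

/* Hypothetical helper: dst[i] = a[i] + b[i] for i in [0, n). */
void add_floats(float *dst, const float *a, const float *b, size_t n)
{
    assert(n == 0 || (dst && a && b));

    size_t i = 0;
#if HAVE_SSE
    /* Four floats per iteration. Unaligned loads and stores are used so the
       sketch is valid for any pointer alignment. */
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(dst + i, _mm_add_ps(va, vb));
    }
#endif
    /* Scalar tail, or the whole loop when SSE is unavailable. */
    for (; i < n; ++i)
        dst[i] = a[i] + b[i];
}
```

Calling add_floats(d, a, b, 6) handles the first four elements in one vector step (on an SSE-capable target) and the last two in the scalar tail. Whether this beats the compiler's own auto-vectorization depends on the compiler and the processor, so it should be measured and not assumed.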

Unknown

Hardware prefetching in multicore processors can be too aggressive, wasting resources and impacting performance for co-running threads. Combining hardware and software prefetching can optimize performance by efficiently handling irregular memory accesses. A method described in Paper II offers a low-overhead framework for accurate software prefetching in applications with irregular access patterns.
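
To make the idea of software prefetching for irregular accesses concrete, here is a minimal C sketch of an indirect (gather-style) sum that issues a prefetch a fixed number of iterations ahead. This is a generic illustration, not the framework from Paper II. The function name sum_indirect, the PREFETCH_READ macro, and the distance of 16 are assumptions made for the example, and the prefetch relies on the GCC/Clang builtin __builtin_prefetch, compiling to a no-op elsewhere.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed look-ahead distance; a real value would be tuned per machine. */
#define PREFETCH_DISTANCE 16

#if defined(__GNUC__) || defined(__clang__)
/* rw = 0 (read), locality = 1 (low temporal locality). */
#define PREFETCH_READ(p) __builtin_prefetch((p), 0, 1)
#else
#define PREFETCH_READ(p) ((void)0)
#endif

/* Hypothetical helper: returns the sum of table[idx[i]] for i in [0, n).
   The index array is read sequentially, which hardware prefetchers handle
   well; the table accesses are irregular, so they are prefetched in
   software a few iterations before they are needed. */
double sum_indirect(const double *table, const size_t *idx, size_t n)
{
    assert(n == 0 || (table && idx));

    double sum = 0.0;
    for (size_t i = 0; i < n; ++i) {
        if (i + PREFETCH_DISTANCE < n)
            PREFETCH_READ(&table[idx[i + PREFETCH_DISTANCE]]);
        sum += table[idx[i]];
    }
    return sum;
}
```

The prefetch is only a hint and never changes the result, so the function returns the same sum with or without it. A distance that is too short hides little latency, while one that is too long can evict data before it is used, which is one reason an accurate, low-overhead way of choosing where and how far to prefetch matters.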

Subcategories