Bookmarks
Optimize CPU performance with Instruments
Learn how to optimize your app for Apple silicon with two new hardware-assisted tools in Instruments. We'll start by covering how to...
My notes while reading about GPUs
I had a bunch of Notion pages with notes I'd written while reading and watching videos on GPUs for CUDA purposes, so I thought of doing some vibe blogging ...
TPU Deep Dive
Their origins go back to Google in 2006, when they were first evaluating whether they should implement either GPUs, FPGAs, or custom ASICs.
Domain specific architectures for AI inference
fleetwood.dev
AI Arrives In The Middle East: US Strikes A Deal with UAE and KSA – SemiAnalysis
The US has signed two landmark agreements with the United Arab Emirates and Kingdom of Saudi Arabia (KSA) that will noticeably shift the balance of power. The deals have economic, geopolitical…
A Guide on Semiconductor Development
How It's Made: Fancy Sand
ARM's Chernobyl Moment
The incredibly self-destructive cancellation of Qualcomm's v8 ALA.
Memory on Tenstorrent
When I first started programming Metalium, memory was a ...
Subnanosecond flash memory enabled by 2D-enhanced hot-carrier injection
A two-dimensional Dirac graphene-channel flash memory based on a two-dimensional-enhanced hot-carrier-injection mechanism that supports both electron and hole injection is used to make devices with a subnanosecond program speed.
Building an Open Future
We are building an open future for AI. Own your silicon future. Join us.
Accelerate
Accelerate is a language for array-based computations, designed to exploit massive parallelism.
Analyzing Modern NVIDIA GPU cores
GPUs are the most popular platform for accelerating HPC workloads, such as artificial intelligence and science simulations. However, most microarchitectural research in academia relies on GPU core pipeline designs based on architectures that are more than 15 years old.
This paper reverse engineers modern NVIDIA GPU cores, unveiling many key aspects of their design and explaining how GPUs leverage hardware-compiler techniques where the compiler guides hardware during execution. In particular, it reveals how the issue logic works, including the policy of the issue scheduler, the structure of the register file and its associated cache, and multiple features of the memory pipeline. Moreover, it analyzes how a simple instruction prefetcher based on a stream buffer fits well with modern NVIDIA GPUs and is likely to be used. Furthermore, we investigate the impact of the register file cache and the number of register file read ports on both simulation accuracy and performance.
By modeling all these newly discovered microarchitectural details, we achieve 18.24% lower mean absolute percentage error (MAPE) in execution cycles than previous state-of-the-art simulators, resulting in an average of 13.98% MAPE with respect to real hardware (NVIDIA RTX A6000). We also demonstrate that this new model holds for other NVIDIA architectures, such as Turing. Finally, we show that the software-based dependence management mechanism included in modern NVIDIA GPUs outperforms a hardware mechanism based on scoreboards in terms of both performance and area.
Advanced Performance Optimizations for Models
TT-NN operator library and TT-Metalium low-level kernel programming model. - tenstorrent/tt-metal
User Guide for NVPTX Back-end
To support GPU programming, the NVPTX back-end supports a subset of LLVM IR along with a defined set of conventions used to represent GPU programming concepts.
An AnandTech Interview with Jim Keller: 'The Laziest Person at Tesla'
I've spoken about Jim Keller many times on AnandTech.
Implementation of simple microprocessor using verilog
I am trying to make a simple microprocessor in Verilog as a way to understand Verilog and assembly at the same time.
I am not sure if I am implementing what I think of as a microprocessor well enough ...
learn-fpga/FemtoRV/TUTORIALS/FROM_BLINKER_TO_RISCV/README.md at master · BrunoLevy/learn-fpga · GitHub
Learning FPGA, yosys, nextpnr, and RISC-V . Contribute to BrunoLevy/learn-fpga development by creating an account on GitHub.
Scoping out the Tenstorrent Wormhole
The Tenstorrent Wormhole n300s PCIe accelerator board is available for purchase, featuring 672 RISC-V cores driving 466 TFLOP/s of FP8 matmul.
What’s the (floating) Point of all these data types? A (not so) brief overview of the history and usage of datatypes within the wide world of computation
This presentation delves into the fascinating and sometimes aggravating world of numerical data types, exploring the evolution, strengths, and weaknesses of decimal, fixed point, floating point, and shared exponent formats over the past 70 years.
Tenstorrent first thoughts
I've looked into alternative AI accelerators to continue my saga of running GGML on lower power-consumption hardware. The most promising - and the only one that ever replied to my emails - was Tenstorrent. This post is me thinking deeply about whether buying their hardware for development is a good investment ...
How to Think About TPUs
All about how TPUs work, how they're networked together to enable multi-chip training and inference, and how they limit the performance of our favorite algorithms. While this may seem a little dry, it's super important for actually making models efficient.
Tenstorrent Wormhole Series Part 1: Physicalities
A company called Tenstorrent design and sell PCIe cards for AI acceleration. At the time of writing, they've recently started shipping their Wormhole n150s and Wormhole n300s cards.
Community Highlight: Tenstorrent Wormhole Series Part 2: Which disabled rows?
An in depth look at Tenstorrent Wormhole, originally posted on corsix.org
Automating GPU Kernel Generation with DeepSeek-R1 and Inference Time Scaling
As AI models extend their capabilities to solve more sophisticated challenges, a new scaling law known as test-time scaling or inference-time scaling is emerging. Also known as AI reasoning or long…
FPGAs for Software Engineers 0: The Basics
A brief introduction to FPGAs, Verilog and simulation
How Many Computers Are In Your Computer?
Any ‘computer’ is made up of hundreds of separate computers plugged together, any of which can be hacked. I list some of these parts.
A Beginner's Guide to Vectorization By Hand: Part 3
We're continuing our expedition into the world of manual vectorization. In this part, we explain the most common technique for vectorizing conditional code, usually referred to as if-conversion.
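To make the idea concrete, here is a minimal sketch of if-conversion in NumPy (an illustration only; the series itself works with SIMD intrinsics, where `np.where` plays the role of a vector select/blend):

```python
import numpy as np

# Scalar version with a branch per element:
#   out[i] = x[i] * 2 if x[i] > 0 else x[i] + 1
x = np.array([-2.0, -1.0, 0.5, 3.0])

# If-conversion: evaluate BOTH branches for every lane,
# then blend the results with a mask (a vector "select").
mask = x > 0
out = np.where(mask, x * 2, x + 1)
print(out)  # [-1.  0.  1.  6.]
```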
Algorithms for Modern Hardware
Its intended audience is everyone from performance engineers and practical algorithm researchers to undergraduate computer science students who have just finished an advanced algorithms course and want to learn more practical ways to speed up a program than by going from O(n log n) to O(n log log n).
A Beginner's Guide to Vectorization By Hand: Part 1
CPU vendors have been trying for a long time to exploit as much parallelism as they can, and the introduction of vector instructions is one way to go.
Nine Rules for SIMD Acceleration of Your Rust Code (Part 1)
General Lessons from Boosting Data Ingestion in the range-set-blaze Crate by 7x
bytecode interpreters for tiny computers
I've previously come to the conclusion that there's little reason for using bytecode in the modern world, except in order to get more compact code, for which it can be very effective.
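As a toy illustration of that compactness argument, here is a minimal stack-machine bytecode interpreter in Python; the opcode set and encoding are invented for this sketch, not taken from the article:

```python
# A minimal stack-machine bytecode interpreter. Each instruction is a
# single byte, with PUSH taking a one-byte operand.
PUSH, ADD, MUL, PRINT, HALT = range(5)

def run(code):
    stack, pc = [], 0
    while True:
        op = code[pc]; pc += 1
        if op == PUSH:
            stack.append(code[pc]); pc += 1
        elif op == ADD:
            b, a = stack.pop(), stack.pop(); stack.append(a + b)
        elif op == MUL:
            b, a = stack.pop(), stack.pop(); stack.append(a * b)
        elif op == PRINT:
            print(stack.pop())
        elif op == HALT:
            return

# (2 + 3) * 4 -> 20, encoded in ten bytes
run(bytes([PUSH, 2, PUSH, 3, ADD, PUSH, 4, MUL, PRINT, HALT]))
```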
Fast Multidimensional Matrix Multiplication on CPU from Scratch
Numpy can multiply two 1024x1024 matrices on a 4-core Intel CPU in ~8ms. This is incredibly fast, considering this boils down to 18 FLOPs / core / cycle, with...
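The 18 FLOPs/core/cycle figure checks out with a quick back-of-the-envelope calculation; the ~3.7 GHz clock below is an assumed value for illustration, not taken from the post:

```python
# Sanity check of ~18 FLOPs/core/cycle for a 1024x1024 matmul in ~8 ms.
n = 1024
flops = 2 * n**3          # one multiply + one add per inner-product step
seconds = 8e-3            # ~8 ms
cores = 4
clock_hz = 3.7e9          # assumed clock frequency (not stated above)

per_core_per_cycle = flops / seconds / cores / clock_hz
print(f"{per_core_per_cycle:.1f} FLOPs/core/cycle")  # ~18.1
```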
Optimizing subroutines in assembly language
Optimizing subroutines in assembly language involves various techniques such as using inline assembly in a C++ compiler, separating code using MMX registers from code using ST registers, and understanding different register sizes and memory operands. It is important to consider the use of instruction prefixes, intrinsic functions for vector operations, and accessing class and structure members efficiently. Additionally, preventing false dependences, aligning loop and subroutine entries, and optimizing instruction sizes can improve performance. However, it is crucial to note that these optimizations are processor-specific and may vary depending on the target platform.
Tiled Matrix Multiplication
Tiled matrix multiplication is an efficient algorithm used on GPUs that reduces memory access by utilizing shared memory. By organizing threads into blocks, each thread can perform calculations more quickly and with fewer memory accesses. This method is important for improving performance in tasks like graphics rendering and machine learning.
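The shared-memory staging itself only exists on the GPU, but the tiling structure can be sketched in NumPy as cache blocking; this is an analogy under that caveat, not the GPU kernel itself:

```python
import numpy as np

def tiled_matmul(A, B, tile=32):
    """Blocked matrix multiply: each (tile x tile) block is reused many
    times once loaded, which on a GPU corresponds to staging tiles in
    shared memory (here it simply improves cache locality)."""
    n = A.shape[0]
    C = np.zeros((n, n))
    for i in range(0, n, tile):
        for j in range(0, n, tile):
            for k in range(0, n, tile):
                # Accumulate the contribution of one pair of tiles.
                C[i:i+tile, j:j+tile] += A[i:i+tile, k:k+tile] @ B[k:k+tile, j:j+tile]
    return C

A, B = np.random.rand(128, 128), np.random.rand(128, 128)
assert np.allclose(tiled_matmul(A, B), A @ B)
```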
Updating the Go Memory Model
The Go memory model needs updates to clarify how synchronization works and to endorse race detectors for safer concurrency. It suggests adding typed atomic operations and possibly unsynchronized atomics to improve program correctness and performance. The goal is to ensure that Go programs behave consistently and avoid data races, making them easier to debug.
Programming Language Memory Models (Memory Models, Part 2). Posted on Tuesday, July 6, 2021.
Modern programming languages use atomic variables and operations to help synchronize threads and prevent data races. This ensures that programs run correctly by allowing proper communication between threads without inconsistent memory access. All major languages, like C++, Java, and Rust, support sequentially consistent atomics to simplify the development of multithreaded programs.
Hardware Memory Models (Memory Models, Part 1). Posted on Tuesday, June 29, 2021.
This text discusses hardware memory models, focusing on how different processors handle memory operations and maintain order. It explains the concept of sequential consistency, where operations are executed in a predictable order, and contrasts it with more relaxed models like those used in ARM and POWER architectures. The author highlights the importance of synchronization to avoid data races in concurrent programming.
Tiny Tapeout
Tiny Tapeout is a project that helps people easily and affordably create their own chip designs. It offers resources for beginners and advanced users, along with a special price for submissions. Join the community to learn and share your designs before the deadline on September 6th.
What Every Computer Scientist Should Know About Floating-Point Arithmetic
The text discusses the challenges and considerations of floating-point arithmetic in computer science. It emphasizes the importance of rounding in floating-point calculations and the implications of different precision levels. Additionally, it highlights the need for careful implementation to ensure correctness and accuracy in programs that rely on floating-point arithmetic.
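The canonical demonstration of why this matters fits in a few lines of Python: 0.1 and 0.2 have no exact binary representation, so their sum carries a rounding error.

```python
print(0.1 + 0.2)         # 0.30000000000000004
print(0.1 + 0.2 == 0.3)  # False

# Comparisons should therefore use a tolerance, not exact equality.
import math
print(math.isclose(0.1 + 0.2, 0.3))  # True
```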
spikedoanz/from-bits-to-intelligence: machine learning stack in under 100,000 lines of code
The text discusses building a machine learning stack in under 100,000 lines of code with hardware, software, tensors, and machine learning components. It outlines the required components like a CPU, GPU, storage, C compiler, Python runtime, operating system, and more. The goal is to simplify the machine learning stack while providing detailed steps for implementation in different programming languages.
What every systems programmer should know about concurrency
The document delves into the complexities of concurrency for systems programmers, explaining the challenges of running multithreaded programs where code is optimized and executed in unexpected sequences. It covers fundamental concepts like atomicity, enforcing order in multithreaded programs, and memory orderings. The text emphasizes the importance of understanding how hardware, compilers, programming languages, and applications interact to create a sense of order in multithreaded programs. Key topics include atomic operations, read-modify-write operations, compare-and-swap mechanisms, and memory barriers in weakly-ordered hardware architectures.
Comparing SIMD on x86-64 and arm64
The text compares SIMD implementations using SSE on x86-64 and Neon on arm64 processors, including emulating SSE on arm64 with Neon. It explores vectorized code performance using intrinsics, auto-vectorization, and ISPC, highlighting the efficiency of SSE and Neon implementations. The study shows how optimizing for SIMD instructions significantly boosts performance over scalar implementations in ray-box intersection tests.
Unknown
Hardware prefetching in multicore processors can be too aggressive, wasting resources and impacting performance for co-running threads. Combining hardware and software prefetching can optimize performance by efficiently handling irregular memory accesses. A method described in Paper II offers a low-overhead framework for accurate software prefetching in applications with irregular access patterns.
Introduction 2016 NUMA Deep Dive Series
The 2016 NUMA Deep Dive Series by staroceans.org explores various aspects of computer architecture, focusing on NUMA systems and their optimization for performance. The series covers topics such as system architecture, cache coherency, memory optimization, and VMkernel constructs to help readers understand and improve their host design and management. The series aims to provide valuable insights for configuring and deploying dual socket systems using Intel Xeon processors, with a focus on enhancing overall platform performance.
Udacity CS344: Intro to Parallel Programming
Intro to Parallel Programming is a free online course by NVIDIA and Udacity teaching parallel computing with CUDA. It's for developers, scientists, engineers, and students looking to learn about GPU programming and optimization. The course is self-paced, requires C programming knowledge, and offers approximately 21 hours of content.
Using ASCII waveforms to test hardware designs
Using expect tests automates the validation of code output, detecting errors efficiently. Jane Street uses Hardcaml in OCaml for hardware development, simplifying testbench creation. Waveform expect tests help visualize hardware behavior, improving development workflows.
Writing CUDA Kernels for PyTorch
The text shows how threads are distributed across streaming multiprocessors (SMs) in CUDA. Within each SM, threads are grouped into warps, and each thread occupies a specific lane within its warp. This information is crucial for optimizing CUDA kernels in PyTorch.
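The warp/lane mapping is simple arithmetic; as a small illustration (warp size is fixed at 32 on current NVIDIA GPUs), a flat thread index within a block decomposes as:

```python
WARP_SIZE = 32  # fixed at 32 on current NVIDIA GPUs

def warp_and_lane(tid):
    """Map a flat thread index within a block to (warp, lane)."""
    return tid // WARP_SIZE, tid % WARP_SIZE

for tid in (0, 31, 32, 100):
    warp, lane = warp_and_lane(tid)
    print(f"thread {tid:3d} -> warp {warp}, lane {lane}")
# thread 100 -> warp 3, lane 4
```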
SRAM Architecture
The text discusses the architecture of Static Random Access Memory (SRAM) cells, focusing on their read and write operations, sizing considerations, and column circuitry. SRAM cells store data using cross-coupled inverters, with specific steps for reading and writing data. Column circuitry includes bitline conditioning, sense amplifiers, and multiplexing for efficient data access.
Chapter 2 Basics of SIMD Programming
The text explains how to organize data for SIMD operations and provides examples of SIMD-Ready Vectors. It also discusses the relationship between vectors and scalars in SIMD programming. Built-in functions for VMX instructions and SIMD operation principles are outlined in the text.
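A common way to picture "SIMD-ready" data organization is the array-of-structures versus structure-of-arrays distinction; here is a NumPy sketch of the idea (an analogy only; the chapter itself uses VMX built-ins):

```python
import numpy as np

# Array-of-structures: x, y, z interleaved per point. A vector load
# pulls in fields you don't need for a per-coordinate operation.
aos = np.random.rand(1000, 3)

# Structure-of-arrays: each field is contiguous, so an operation on
# all x values maps directly onto SIMD lanes ("SIMD-ready" layout).
x, y, z = (aos[:, i].copy() for i in range(3))

lengths = np.sqrt(x**2 + y**2 + z**2)  # vectorizes cleanly per field
```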
How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance: a Worklog
The text is a worklog by Simon Boehm about optimizing a CUDA Matmul Kernel for cuBLAS-like performance. It can be found on the domain siboehm.com.
How to round to 2 decimals with Python?
To round a number to 2 decimals in Python, the usual method is round(value, significantDigit), but it can behave unexpectedly when the digit being rounded away is a 5, because many decimal values have no exact binary representation. A workaround is to add a small value before rounding so that these halfway cases round up, giving the traditional rounding commonly used in statistics without needing to import additional libraries like Decimal. Wrapping this workaround in a function yields the desired results for numbers ending in 5.
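A sketch of that workaround, with an illustrative epsilon (the exact constant and function name here are assumptions, not from the original answer):

```python
def round_half_up(value, ndigits=2):
    """Round half-up by nudging the value past the binary
    representation error before calling round(). The epsilon choice
    is illustrative and assumes values of modest magnitude."""
    epsilon = 10 ** -(ndigits + 6)
    return round(value + epsilon, ndigits)

print(round(2.675, 2))          # 2.67 (float 2.675 is stored just below 2.675)
print(round_half_up(2.675, 2))  # 2.68
```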
Subcategories
- applications (9)
- compression (9)
- computer_vision (8)
- deep_learning (94)
- ethics (2)
- generative_models (25)
- interpretability (17)
- natural_language_processing (24)
- optimization (7)
- recommendation (2)
- reinforcement_learning (11)
- supervised_learning (1)