Bookmarks
Optimize CPU performance with Instruments
Learn how to optimize your app for Apple silicon with two new hardware-assisted tools in Instruments. We'll start by covering how to...
My notes while reading about GPUs
I had a bunch of Notion pages with notes I'd written while reading and watching videos on GPUs for CUDA purposes, so I thought of doing some vibe blogging ...
TPU Deep Dive
Their origins go back to Google in 2006, when they were first evaluating whether they should implement either GPUs, FPGAs, or custom ASICs.
Domain specific architectures for AI inference
fleetwood.dev
AI Arrives In The Middle East: US Strikes A Deal with UAE and KSA – SemiAnalysis
The US has signed two landmark agreements with the United Arab Emirates and Kingdom of Saudi Arabia (KSA) that will noticeably shift the balance of power. The deals have economic, geopolitical…
A Guide on Semiconductor Development
How It's Made: Fancy Sand
ARM's Chernobyl Moment
The incredibly self-destructive cancellation of Qualcomm's v8 ALA.
Memory on Tenstorrent
When I first started programming Metalium, memory was a ...
Subnanosecond flash memory enabled by 2D-enhanced hot-carrier injection
A two-dimensional Dirac graphene-channel flash memory based on a two-dimensional-enhanced hot-carrier-injection mechanism that supports both electron and hole injection is used to make devices with a subnanosecond program speed.
Building an Open Future
We are building an open future for AI. Own your silicon future. Join us.
Accelerate
Accelerate is a language for array-based computations, designed to exploit massive parallelism.
Analyzing Modern NVIDIA GPU cores
GPUs are the most popular platform for accelerating HPC workloads, such as artificial intelligence and science simulations. However, most microarchitectural research in academia relies on GPU core pipeline designs based on architectures that are more than 15 years old.
This paper reverse engineers modern NVIDIA GPU cores, unveiling many key aspects of their design and explaining how GPUs leverage hardware-compiler techniques where the compiler guides hardware during execution. In particular, it reveals how the issue logic works, including the policy of the issue scheduler, the structure of the register file and its associated cache, and multiple features of the memory pipeline. Moreover, it analyzes how a simple instruction prefetcher based on a stream buffer fits well with modern NVIDIA GPUs and is likely to be used. Furthermore, we investigate the impact of the register file cache and the number of register file read ports on both simulation accuracy and performance.
By modeling all these newly discovered microarchitectural details, we achieve 18.24% lower mean absolute percentage error (MAPE) in execution cycles than previous state-of-the-art simulators, resulting in an average of 13.98% MAPE with respect to real hardware (NVIDIA RTX A6000). We also demonstrate that this new model holds for other NVIDIA architectures, such as Turing. Finally, we show that the software-based dependence management mechanism included in modern NVIDIA GPUs outperforms a hardware mechanism based on scoreboards in terms of both performance and area.
Advanced Performance Optimizations for Models
TT-NN operator library and TT-Metalium low-level kernel programming model. - tenstorrent/tt-metal
User Guide for NVPTX Back-end
To support GPU programming, the NVPTX back-end supports a subset of LLVM IR along with a defined set of conventions used to represent GPU programming concepts.
An AnandTech Interview with Jim Keller: 'The Laziest Person at Tesla'
I've spoken about Jim Keller many times on AnandTech.
Implementation of simple microprocessor using verilog
I am trying to make a simple microprocessor in Verilog as a way to understand Verilog and assembly at the same time.
I am not sure if I am implementing what I think of as a microprocessor well enough ...
learn-fpga/FemtoRV/TUTORIALS/FROM_BLINKER_TO_RISCV/README.md at master · BrunoLevy/learn-fpga · GitHub
Learning FPGA, yosys, nextpnr, and RISC-V . Contribute to BrunoLevy/learn-fpga development by creating an account on GitHub.
Scoping out the Tenstorrent Wormhole
The Tenstorrent Wormhole n300s PCIe accelerator board is available for purchase, featuring 672 RISC-V cores driving 466 TFLOP/s of FP8 matmul.
What’s the (floating) Point of all these data types? A (not so) brief overview of the history and usage of datatypes within the wide world of computation
This presentation delves into the fascinating and sometimes aggravating world of numerical data types, exploring the evolution, strengths, and weaknesses of decimal, fixed point, floating point, and shared exponent formats over the past 70 years.
Tenstorrent first thoughts
I've looked into alternative AI accelerators to continue my saga of running GGML on lower power-consumption hardware. The most promising - and the only one that ever replied to my emails - was Tenstorrent. This post is me thinking deeply about whether buying their hardware for development is a good investment ...
How to Think About TPUs
All about how TPUs work, how they're networked together to enable multi-chip training and inference, and how they limit the performance of our favorite algorithms. While this may seem a little dry, it's super important for actually making models efficient.
Tenstorrent Wormhole Series Part 1: Physicalities
A company called Tenstorrent design and sell PCIe cards for AI acceleration. At the time of writing, they've recently started shipping their Wormhole n150s and Wormhole n300s cards.
Community Highlight: Tenstorrent Wormhole Series Part 2: Which disabled rows?
An in depth look at Tenstorrent Wormhole, originally posted on corsix.org
Automating GPU Kernel Generation with DeepSeek-R1 and Inference Time Scaling
As AI models extend their capabilities to solve more sophisticated challenges, a new scaling law known as test-time scaling or inference-time scaling is emerging. Also known as AI reasoning or long…
FPGAs for Software Engineers 0: The Basics
A brief introduction to FPGAs, Verilog and simulation
How Many Computers Are In Your Computer?
Any ‘computer’ is made up of hundreds of separate computers plugged together, any of which can be hacked. I list some of these parts.
A Beginner's Guide to Vectorization By Hand: Part 3
We're continuing our expedition into the world of manual vectorization. In this part, we explain the most common technique for vectorizing conditional code, usually referred to as if-conversion.
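To make the idea concrete, here is a minimal sketch of if-conversion in NumPy (an illustration only; the series itself works with SIMD intrinsics, where `np.where` plays the role of a vector select/blend):

```python
import numpy as np

# Scalar version with a branch per element:
#   out[i] = x[i] * 2 if x[i] > 0 else x[i] + 1
x = np.array([-2.0, -1.0, 0.5, 3.0])

# If-conversion: evaluate BOTH branches for every lane,
# then blend the results with a mask (a vector "select").
mask = x > 0
out = np.where(mask, x * 2, x + 1)
print(out)  # [-1.  0.  1.  6.]
```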
Algorithms for Modern Hardware
Its intended audience is everyone from performance engineers and practical algorithm researchers to undergraduate computer science students who have just finished an advanced algorithms course and want to learn more practical ways to speed up a program than by going from O(n log n) to O(n log log n).
A Beginner's Guide to Vectorization By Hand: Part 1
CPU vendors have been trying for a long time to exploit as much parallelism as they can, and the introduction of vector instructions is one way to go.
Nine Rules for SIMD Acceleration of Your Rust Code (Part 1)
General Lessons from Boosting Data Ingestion in the range-set-blaze Crate by 7x
bytecode interpreters for tiny computers
I've previously come to the conclusion that there's little reason for using bytecode in the modern world, except in order to get more compact code, for which it can be very effective.
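As a toy illustration of that compactness argument, here is a minimal stack-machine bytecode interpreter in Python; the opcode set and encoding are invented for this sketch, not taken from the article:

```python
# A minimal stack-machine bytecode interpreter. Each instruction is a
# single byte, with PUSH taking a one-byte operand.
PUSH, ADD, MUL, PRINT, HALT = range(5)

def run(code):
    stack, pc = [], 0
    while True:
        op = code[pc]; pc += 1
        if op == PUSH:
            stack.append(code[pc]); pc += 1
        elif op == ADD:
            b, a = stack.pop(), stack.pop(); stack.append(a + b)
        elif op == MUL:
            b, a = stack.pop(), stack.pop(); stack.append(a * b)
        elif op == PRINT:
            print(stack.pop())
        elif op == HALT:
            return

# (2 + 3) * 4 -> 20, encoded in ten bytes
run(bytes([PUSH, 2, PUSH, 3, ADD, PUSH, 4, MUL, PRINT, HALT]))
```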
Fast Multidimensional Matrix Multiplication on CPU from Scratch
Numpy can multiply two 1024x1024 matrices on a 4-core Intel CPU in ~8ms. This is incredibly fast, considering this boils down to 18 FLOPs / core / cycle, with...
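The 18 FLOPs/core/cycle figure checks out with a quick back-of-the-envelope calculation; the ~3.7 GHz clock below is an assumed value for illustration, not taken from the post:

```python
# Sanity check of ~18 FLOPs/core/cycle for a 1024x1024 matmul in ~8 ms.
n = 1024
flops = 2 * n**3          # one multiply + one add per inner-product step
seconds = 8e-3            # ~8 ms
cores = 4
clock_hz = 3.7e9          # assumed clock frequency (not stated above)

per_core_per_cycle = flops / seconds / cores / clock_hz
print(f"{per_core_per_cycle:.1f} FLOPs/core/cycle")  # ~18.1
```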
Optimizing subroutines in assembly language
Optimizing subroutines in assembly language involves various techniques such as using inline assembly in a C++ compiler, separating code using MMX registers from code using ST registers, and understanding different register sizes and memory operands. It is important to consider the use of instruction prefixes, intrinsic functions for vector operations, and accessing class and structure members efficiently. Additionally, preventing false dependences, aligning loop and subroutine entries, and optimizing instruction sizes can improve performance. However, it is crucial to note that these optimizations are processor-specific and may vary depending on the target platform.
Tiled Matrix Multiplication
Tiled matrix multiplication is an efficient algorithm used on GPUs that reduces memory access by utilizing shared memory. By organizing threads into blocks, each thread can perform calculations more quickly and with fewer memory accesses. This method is important for improving performance in tasks like graphics rendering and machine learning.
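The shared-memory staging itself only exists on the GPU, but the tiling structure can be sketched in NumPy as cache blocking; this is an analogy under that caveat, not the GPU kernel itself:

```python
import numpy as np

def tiled_matmul(A, B, tile=32):
    """Blocked matrix multiply: each (tile x tile) block is reused many
    times once loaded, which on a GPU corresponds to staging tiles in
    shared memory (here it simply improves cache locality)."""
    n = A.shape[0]
    C = np.zeros((n, n))
    for i in range(0, n, tile):
        for j in range(0, n, tile):
            for k in range(0, n, tile):
                # Accumulate the contribution of one pair of tiles.
                C[i:i+tile, j:j+tile] += A[i:i+tile, k:k+tile] @ B[k:k+tile, j:j+tile]
    return C

A, B = np.random.rand(128, 128), np.random.rand(128, 128)
assert np.allclose(tiled_matmul(A, B), A @ B)
```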
Updating the Go Memory Model
The Go memory model needs updates to clarify how synchronization works and to endorse race detectors for safer concurrency. It suggests adding typed atomic operations and possibly unsynchronized atomics to improve program correctness and performance. The goal is to ensure that Go programs behave consistently and avoid data races, making them easier to debug.
Programming Language Memory Models (Memory Models, Part 2). Posted on Tuesday, July 6, 2021.
Modern programming languages use atomic variables and operations to help synchronize threads and prevent data races. This ensures that programs run correctly by allowing proper communication between threads without inconsistent memory access. All major languages, like C++, Java, and Rust, support sequentially consistent atomics to simplify the development of multithreaded programs.
Hardware Memory Models (Memory Models, Part 1). Posted on Tuesday, June 29, 2021.
This text discusses hardware memory models, focusing on how different processors handle memory operations and maintain order. It explains the concept of sequential consistency, where operations are executed in a predictable order, and contrasts it with more relaxed models like those used in ARM and POWER architectures. The author highlights the importance of synchronization to avoid data races in concurrent programming.
Tiny Tapeout
Tiny Tapeout is a project that helps people easily and affordably create their own chip designs. It offers resources for beginners and advanced users, along with a special price for submissions. Join the community to learn and share your designs before the deadline on September 6th.
What Every Computer Scientist Should Know About Floating-Point Arithmetic
The text discusses the challenges and considerations of floating-point arithmetic in computer science. It emphasizes the importance of rounding in floating-point calculations and the implications of different precision levels. Additionally, it highlights the need for careful implementation to ensure correctness and accuracy in programs that rely on floating-point arithmetic.
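The canonical demonstration of why this matters fits in a few lines of Python: 0.1 and 0.2 have no exact binary representation, so their sum carries a rounding error.

```python
print(0.1 + 0.2)         # 0.30000000000000004
print(0.1 + 0.2 == 0.3)  # False

# Comparisons should therefore use a tolerance, not exact equality.
import math
print(math.isclose(0.1 + 0.2, 0.3))  # True
```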
spikedoanz/from-bits-to-intelligence: machine learning stack in under 100,000 lines of code
The text discusses building a machine learning stack in under 100,000 lines of code with hardware, software, tensors, and machine learning components. It outlines the required components like a CPU, GPU, storage, C compiler, Python runtime, operating system, and more. The goal is to simplify the machine learning stack while providing detailed steps for implementation in different programming languages.
What every systems programmer should know about concurrency
The document delves into the complexities of concurrency for systems programmers, explaining the challenges of running multithreaded programs where code is optimized and executed in unexpected sequences. It covers fundamental concepts like atomicity, enforcing order in multithreaded programs, and memory orderings. The text emphasizes the importance of understanding how hardware, compilers, programming languages, and applications interact to create a sense of order in multithreaded programs. Key topics include atomic operations, read-modify-write operations, compare-and-swap mechanisms, and memory barriers in weakly-ordered hardware architectures.
Comparing SIMD on x86-64 and arm64
The text compares SIMD implementations using SSE on x86-64 and Neon on arm64 processors, including emulating SSE on arm64 with Neon. It explores vectorized code performance using intrinsics, auto-vectorization, and ISPC, highlighting the efficiency of SSE and Neon implementations. The study shows how optimizing for SIMD instructions significantly boosts performance over scalar implementations in ray-box intersection tests.
Unknown
Hardware prefetching in multicore processors can be too aggressive, wasting resources and impacting performance for co-running threads. Combining hardware and software prefetching can optimize performance by efficiently handling irregular memory accesses. A method described in Paper II offers a low-overhead framework for accurate software prefetching in applications with irregular access patterns.
Introduction 2016 NUMA Deep Dive Series
The 2016 NUMA Deep Dive Series by staroceans.org explores various aspects of computer architecture, focusing on NUMA systems and their optimization for performance. The series covers topics such as system architecture, cache coherency, memory optimization, and VMkernel constructs to help readers understand and improve their host design and management. The series aims to provide valuable insights for configuring and deploying dual socket systems using Intel Xeon processors, with a focus on enhancing overall platform performance.
Udacity CS344: Intro to Parallel Programming
Intro to Parallel Programming is a free online course by NVIDIA and Udacity teaching parallel computing with CUDA. It's for developers, scientists, engineers, and students looking to learn about GPU programming and optimization. The course is self-paced, requires C programming knowledge, and offers approximately 21 hours of content.
Using ASCII waveforms to test hardware designs
Using expect tests automates the validation of code output, detecting errors efficiently. Jane Street uses Hardcaml in OCaml for hardware development, simplifying testbench creation. Waveform expect tests help visualize hardware behavior, improving development workflows.
Writing CUDA Kernels for PyTorch
The text shows how threads are distributed across streaming multiprocessors (SMs) in CUDA. Within each SM, threads are grouped into warps, and each thread occupies a specific lane within its warp. This information is crucial for optimizing CUDA kernels in PyTorch.
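The warp/lane mapping is simple arithmetic; as a small illustration (warp size is fixed at 32 on current NVIDIA GPUs), a flat thread index within a block decomposes as:

```python
WARP_SIZE = 32  # fixed at 32 on current NVIDIA GPUs

def warp_and_lane(tid):
    """Map a flat thread index within a block to (warp, lane)."""
    return tid // WARP_SIZE, tid % WARP_SIZE

for tid in (0, 31, 32, 100):
    warp, lane = warp_and_lane(tid)
    print(f"thread {tid:3d} -> warp {warp}, lane {lane}")
# thread 100 -> warp 3, lane 4
```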
SRAM Architecture
The text discusses the architecture of Static Random Access Memory (SRAM) cells, focusing on their read and write operations, sizing considerations, and column circuitry. SRAM cells store data using cross-coupled inverters, with specific steps for reading and writing data. Column circuitry includes bitline conditioning, sense amplifiers, and multiplexing for efficient data access.
Chapter 2 Basics of SIMD Programming
The text explains how to organize data for SIMD operations and provides examples of SIMD-Ready Vectors. It also discusses the relationship between vectors and scalars in SIMD programming. Built-in functions for VMX instructions and SIMD operation principles are outlined in the text.
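A common way to picture "SIMD-ready" data organization is the array-of-structures versus structure-of-arrays distinction; here is a NumPy sketch of the idea (an analogy only; the chapter itself uses VMX built-ins):

```python
import numpy as np

# Array-of-structures: x, y, z interleaved per point. A vector load
# pulls in fields you don't need for a per-coordinate operation.
aos = np.random.rand(1000, 3)

# Structure-of-arrays: each field is contiguous, so an operation on
# all x values maps directly onto SIMD lanes ("SIMD-ready" layout).
x, y, z = (aos[:, i].copy() for i in range(3))

lengths = np.sqrt(x**2 + y**2 + z**2)  # vectorizes cleanly per field
```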
How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance: a Worklog
The text is a worklog by Simon Boehm about optimizing a CUDA Matmul Kernel for cuBLAS-like performance. It can be found on the domain siboehm.com.
How to round to 2 decimals with Python?
To round a number to 2 decimals in Python, the usual method is round(value, significantDigit), but it can behave unexpectedly when the digit being rounded away is a 5, because many decimal values have no exact binary representation. A workaround is to add a small value before rounding so that these halfway cases round up, giving the traditional rounding commonly used in statistics without needing to import additional libraries like Decimal. Wrapping this workaround in a function yields the desired results for numbers ending in 5.
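A sketch of that workaround, with an illustrative epsilon (the exact constant and function name here are assumptions, not from the original answer):

```python
def round_half_up(value, ndigits=2):
    """Round half-up by nudging the value past the binary
    representation error before calling round(). The epsilon choice
    is illustrative and assumes values of modest magnitude."""
    epsilon = 10 ** -(ndigits + 6)
    return round(value + epsilon, ndigits)

print(round(2.675, 2))          # 2.67 (float 2.675 is stored just below 2.675)
print(round_half_up(2.675, 2))  # 2.68
```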
Subcategories
- applications (9)
- compression (9)
- computer_vision (8)
- deep_learning (94)
- ethics (2)
- generative_models (25)
- interpretability (17)
- natural_language_processing (24)
- optimization (7)
- recommendation (2)
- reinforcement_learning (11)
- supervised_learning (1)