Bookmarks
Optimize CPU performance with Instruments
Learn how to optimize your app for Apple silicon with two new hardware-assisted tools in Instruments. We'll start by covering how to...
On Technical Challenges: Lock Free Programming
The Oxide Computer Company job application process asks applicants to answer a set of personal questions about their career and experiences.
Optimizing Transformer-Based Diffusion Models for Video Generation with NVIDIA TensorRT
State-of-the-art image diffusion models take tens of seconds to process a single image. This makes video diffusion even more challenging, as it requires significant computational resources and incurs high costs.
On Bloat
A talk by Rob Pike, delivered at Brighter Tech, Commonwealth Bank, September 30, 2024.
Why is Yazi fast?
This article assumes that you have already used Yazi and are familiar with most of its features.
Data-Oriented Design
Haskell as fast as C: working at a high altitude for low level performance
After the last post about high performance, high level programming, Slava Pestov, of Factor fame, wondered whether it was generally true that “if you want good performance you have to write C…
On Competing with C Using Haskell
Mark Karpov wrote in his article Migrating text metrics to pure Haskell how he originally made foreign calls out to C for many of the functions in his text metrics package, but later ported them to Haskell after learning that Haskell can give you performance comparable to C.
Performance
It's often not clear whether two programs that supposedly have the same functionality really do the same thing.
Daniel Lemire's blog
I find that there can still be a significant benefit to using csFastFloat over the .NET library: it can be about 3 times faster.
A Beginner's Guide to Vectorization By Hand: Part 3
We're continuing our expedition into the world of manual vectorization. In this part, we explain the most common technique for vectorizing conditional code, usually referred to as if-conversion.
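As a taste of the technique, here is a minimal if-conversion sketch in C with SSE intrinsics (my own illustration, not the article's code; it assumes x86-64 with SSE4.1 for `_mm_blendv_ps`, compiled with `-msse4.1`). Both sides of the conditional are computed unconditionally, and a per-lane mask selects the result, so no branch remains in the loop body.

```c
#include <smmintrin.h>
#include <stdio.h>

void clamp_neg_to_zero(float *x, int n) {   /* x[i] = x[i] < 0 ? 0 : x[i] */
    __m128 zero = _mm_setzero_ps();
    for (int i = 0; i + 4 <= n; i += 4) {
        __m128 v = _mm_loadu_ps(x + i);
        __m128 mask = _mm_cmplt_ps(v, zero);   /* lanes where v < 0     */
        v = _mm_blendv_ps(v, zero, mask);      /* pick 0 in those lanes */
        _mm_storeu_ps(x + i, v);
    }
}

int main(void) {
    float x[8] = { -1, 2, -3, 4, -5, 6, -7, 8 };
    clamp_neg_to_zero(x, 8);
    for (int i = 0; i < 8; i++) printf("%g ", x[i]);
    printf("\n");   /* prints: 0 2 0 4 0 6 0 8 */
    return 0;
}
```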
Algorithms for Modern Hardware
Its intended audience is everyone from performance engineers and practical algorithm researchers to undergraduate computer science students who have just finished an advanced algorithms course and want to learn more practical ways to speed up a program than by going from O(n log n) to O(n log log n).
A Beginner's Guide to Vectorization By Hand: Part 1
CPU vendors have been trying for a long time to exploit as much parallelism as they can, and the introduction of vector instructions is one way to do so.
Nine Rules for SIMD Acceleration of Your Rust Code (Part 1)
General Lessons from Boosting Data Ingestion in the range-set-blaze Crate by 7x
Fast Multidimensional Matrix Multiplication on CPU from Scratch
Numpy can multiply two 1024x1024 matrices on a 4-core Intel CPU in ~8ms. This is incredibly fast, considering this boils down to 18 FLOPs / core / cycle, with...
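The headline number is easy to sanity-check: a 1024x1024 matmul needs about 2·1024³ FLOPs, and dividing by the runtime, the core count, and an assumed ~3.7 GHz clock (the clock speed is my assumption, not stated above) lands right at the quoted figure.

```c
#include <stdio.h>

int main(void) {
    double flops = 2.0 * 1024 * 1024 * 1024;   /* 2*n^3 FLOPs for n = 1024 */
    double seconds = 8e-3;                     /* ~8 ms runtime            */
    double cores = 4.0, hz = 3.7e9;            /* clock speed is assumed   */
    printf("%.1f FLOPs/core/cycle\n", flops / seconds / cores / hz);
    return 0;                                  /* prints ~18.1             */
}
```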
Efficient n-states on x86 systems
The text discusses how to efficiently handle control flow in x86 systems when a flag can have multiple states beyond true and false. It explains how to use condition codes, such as testing for zero and parity, to minimize the number of instructions needed for these tests. Additionally, it touches on the challenges and limitations of using inline assembly for optimization in C programming.
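A hedged sketch of the idea (GNU C `asm goto`, x86-64 only; the 0/1/3 state encoding is my illustration of the trick, not necessarily the article's): a single TEST classifies all three states, with ZF catching zero and PF splitting the remaining two values by the parity of their set bits.

```c
#include <stdio.h>

/* states encoded as 0, 1, 3 so one TEST classifies them:
   0 -> ZF=1; 1 -> one bit set, odd parity, PF=0; 3 -> even parity, PF=1 */
static const char *classify(unsigned char flag) {
    asm goto ("testb %0, %0\n\t"
              "jz %l[is_zero]\n\t"   /* ZF=1 -> flag == 0            */
              "jp %l[is_three]"      /* PF=1 -> even parity -> 3     */
              : /* asm goto: no outputs */
              : "q" (flag)
              : "cc"
              : is_zero, is_three);
    return "state 1";                /* fallthrough: odd parity -> 1 */
is_zero:
    return "state 0";
is_three:
    return "state 3";
}

int main(void) {
    printf("%s %s %s\n", classify(0), classify(1), classify(3));
    return 0;
}
```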
Program tuning as a resource allocation problem
Program tuning involves balancing simplicity and performance while sharing cache resources among various subsystems. Optimizing one function can impact others, making it a global resource allocation problem that requires careful consideration of algorithms and their resource footprints. Better tools and metrics are needed to manage and analyze cache resource consumption effectively.
How web bloat impacts users with slow connections
Web bloat makes many websites difficult to use for people with slow internet connections and devices. Sites like Discourse and Reddit perform poorly on low-end devices, even if they seem fast on high-end ones. Improving web performance for these users is crucial, as many people rely on older, slower devices.
applicative-mental-models
The text discusses the importance of understanding program performance for effective optimization. It emphasizes that while most optimizations may not be necessary, being aware of critical performance paths is essential. The author provides latency numbers to help programmers grasp the impact of different operations on performance.
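For reference, the widely circulated ballpark figures behind such lists (popularized by Jeff Dean and Peter Norvig; exact values shift with every hardware generation) run roughly:

- L1 cache reference: ~0.5 ns
- branch mispredict: ~5 ns
- L2 cache reference: ~7 ns
- mutex lock/unlock: ~25 ns
- main memory reference: ~100 ns
- send 1 KB over a 1 Gbps network: ~10 µs
- read 1 MB sequentially from memory: ~250 µs
- round trip within a datacenter: ~500 µs
- disk seek: ~10 ms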
Recent presentations and papers
Andi Kleen's work focuses on improving Linux performance through various techniques like hardware monitoring and profiling. He has presented on topics such as lock elision, multi-core scalability, and error handling in the Linux kernel. His contributions include discussions on modern CPU performance, tools for Linux development, and enhancements for energy efficiency.
How long does it take to make a context switch?
Context switching times vary significantly across different Intel CPU models, with more expensive CPUs generally performing better. The performance can be greatly affected by cache usage and thread migration between cores, leading to increased costs when tasks are switched. Optimizing the number of threads to match the number of hardware threads can improve CPU efficiency and reduce context switching overhead.
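A common way to get a rough number yourself is a pipe ping-pong between two processes; below is a minimal Linux sketch (my own, not the article's benchmark). On a multi-core machine you would pin both processes to one core, for example with taskset, to force actual switches rather than cross-core wakeups.

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

#define ROUNDS 100000

int main(void) {
    int ab[2], ba[2];
    char buf = 0;
    if (pipe(ab) || pipe(ba)) { perror("pipe"); return 1; }
    if (fork() == 0) {                       /* child: echo each byte back */
        for (int i = 0; i < ROUNDS; i++) {
            read(ab[0], &buf, 1);
            write(ba[1], &buf, 1);
        }
        exit(0);
    }
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < ROUNDS; i++) {       /* one round trip = 2 switches */
        write(ab[1], &buf, 1);
        read(ba[0], &buf, 1);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    wait(NULL);
    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("~%.0f ns per switch\n", ns / (2.0 * ROUNDS));
    return 0;
}
```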
Cache-Oblivious Algorithms
Cache-oblivious algorithms are designed to use processor caches efficiently without needing to know specific cache details. They work by dividing data into smaller parts, allowing more computations to happen in cache and reducing memory access. This leads to better performance, especially in parallel algorithms, by minimizing shared memory bottlenecks.
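A minimal illustration of the divide-and-conquer pattern (my own sketch, using an out-of-place matrix transpose): the recursion halves the longer axis until a block is small enough to fit in any reasonable cache, and no cache parameter appears anywhere in the code.

```c
#include <stdio.h>

#define N 512
static double a[N][N], b[N][N];

static void transpose(int r0, int r1, int c0, int c1) {
    if (r1 - r0 <= 16 && c1 - c0 <= 16) {    /* base case: tiny block */
        for (int i = r0; i < r1; i++)
            for (int j = c0; j < c1; j++)
                b[j][i] = a[i][j];
    } else if (r1 - r0 >= c1 - c0) {         /* split the longer axis */
        int rm = (r0 + r1) / 2;
        transpose(r0, rm, c0, c1);
        transpose(rm, r1, c0, c1);
    } else {
        int cm = (c0 + c1) / 2;
        transpose(r0, r1, c0, cm);
        transpose(r0, r1, cm, c1);
    }
}

int main(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = i * N + j;
    transpose(0, N, 0, N);
    printf("%f\n", b[3][2]);   /* prints a[2][3] = 1027 */
    return 0;
}
```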
Too Fast, Too Megamorphic: what influences method call performance in Java?
The performance of method calls in Java can be improved through techniques like inlining and using inline caches. Monomorphic calls, where only one method can be invoked, are the fastest, while bimorphic and megamorphic calls are slower due to increased lookup costs. The study highlights that simply adding the "final" keyword or overriding methods does not significantly enhance performance.
The Black Magic of (Java) Method Dispatch
The article walks through annotated assembly output with per-operation execution percentages: the code dispatches on a coder value via comparisons and jumps, with labeled sections for the main entry point, the epilogue, the handling of other coders, and specific coder cases like Coder1 and Coder2.
Using Uninitialized Memory for Fun and Profit (posted Friday, March 14, 2008)
A clever trick involves using uninitialized memory to improve performance in certain programming situations by representing sparse sets efficiently with two arrays that point at each other. This technique allows for fast constant-time operations for adding, checking, and clearing elements in the set, making it a valuable tool for optimizing algorithms and data structures. The sparse set representation is especially useful for scenarios where speed is critical, such as in compiler optimizations and graph traversal algorithms.
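The trick itself is compact. Here is a sketch of the two-array representation (Briggs and Torczon's sparse set); using statically zeroed arrays as below sacrifices only the skip-initialization property, not the O(1) operations:

```c
#include <stdio.h>

#define UNIVERSE 1024
static unsigned dense[UNIVERSE], sparse[UNIVERSE], n;

/* v is a member iff dense and sparse point at each other through it,
   so stale garbage in either array can never produce a false positive */
static int contains(unsigned v) {
    return sparse[v] < n && dense[sparse[v]] == v;
}
static void add(unsigned v) {
    if (!contains(v)) { dense[n] = v; sparse[v] = n++; }
}
static void clear(void) { n = 0; }   /* O(1): no array is touched */

int main(void) {
    add(5); add(9);
    printf("%d %d\n", contains(5), contains(7));  /* 1 0 */
    clear();
    printf("%d\n", contains(5));                  /* 0 */
    return 0;
}
```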
How To Build a User-Level CPU Profiler (posted Thursday, August 8, 2013)
The text discusses how the pprof tool simplifies CPU profiling for C++ and Go programs by utilizing hardware timers and the operating system. Profiling information is gathered through hardware interrupts, providing insights into a program's performance and resource usage. By moving profiling logic to user-level timers, programs can customize and enhance profiling capabilities without kernel changes.
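A stripped-down sketch of the mechanism (Linux/x86-64 specific, and much simpler than what pprof actually does): ITIMER_PROF delivers SIGPROF as CPU time is consumed, and the handler can read the interrupted program counter out of the ucontext to attribute the sample.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <sys/time.h>
#include <ucontext.h>

static volatile sig_atomic_t samples;

static void on_prof(int sig, siginfo_t *si, void *uc_) {
    (void)sig; (void)si;
    ucontext_t *uc = uc_;
    void *pc = (void *)uc->uc_mcontext.gregs[REG_RIP]; /* Linux x86-64 only */
    (void)pc;            /* a real profiler would bucket this address */
    samples++;
}

int main(void) {
    struct sigaction sa = {0};
    sa.sa_sigaction = on_prof;
    sa.sa_flags = SA_SIGINFO | SA_RESTART;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGPROF, &sa, NULL);

    struct itimerval it = { { 0, 10000 }, { 0, 10000 } }; /* 100 Hz of CPU time */
    setitimer(ITIMER_PROF, &it, NULL);

    volatile double x = 0;
    for (long i = 0; i < 200000000; i++) x += i;   /* something to sample */
    printf("%ld samples\n", (long)samples);
    return 0;
}
```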
Implementing a file pager in Zig
Implementing a file pager in Zig involves delaying disk writes until a threshold is reached. Two eviction strategies include least recently used and least frequently used models. Prioritizing pages based on usage can help optimize performance.
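A toy version of the least-recently-used policy (my own minimal illustration in C, not the article's Zig code): each cached slot remembers the tick of its last touch, and the eviction victim is the slot with the smallest tick.

```c
#include <stdio.h>

#define SLOTS 4
static int page_of[SLOTS] = { -1, -1, -1, -1 };  /* page held per slot */
static unsigned long touched[SLOTS], tick;

static int lru_victim(void) {        /* slot with the oldest touch */
    int v = 0;
    for (int i = 1; i < SLOTS; i++)
        if (touched[i] < touched[v]) v = i;
    return v;
}

static void access_page(int page) {
    for (int i = 0; i < SLOTS; i++)
        if (page_of[i] == page) { touched[i] = ++tick; return; }
    int v = lru_victim();            /* miss: evict the coldest slot */
    if (page_of[v] != -1)
        printf("evict page %d for page %d\n", page_of[v], page);
    page_of[v] = page;
    touched[v] = ++tick;
}

int main(void) {
    int seq[] = { 1, 2, 3, 4, 1, 5 };   /* 5 evicts 2, the coldest page */
    for (int i = 0; i < 6; i++) access_page(seq[i]);
    return 0;
}
```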
Comparing SIMD on x86-64 and arm64
The text compares SIMD implementations using SSE on x86-64 and Neon on arm64 processors, including emulating SSE on arm64 with Neon. It explores vectorized code performance using intrinsics, auto-vectorization, and ISPC, highlighting the efficiency of SSE and Neon implementations. The study shows how optimizing for SIMD instructions significantly boosts performance over scalar implementations in ray-box intersection tests.
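The emulation idea boils down to mapping each SSE intrinsic onto a Neon counterpart behind a common name, in the style of sse2neon. A one-operation sketch (the vec4 naming is my own, not from the article):

```c
#include <stdio.h>

#if defined(__aarch64__)
#include <arm_neon.h>
typedef float32x4_t vec4;
static inline vec4 vec4_set1(float x) { return vdupq_n_f32(x); }
static inline vec4 vec4_max(vec4 a, vec4 b) { return vmaxq_f32(a, b); }
static inline float vec4_lane0(vec4 v) { return vgetq_lane_f32(v, 0); }
#else
#include <xmmintrin.h>
typedef __m128 vec4;
static inline vec4 vec4_set1(float x) { return _mm_set1_ps(x); }
static inline vec4 vec4_max(vec4 a, vec4 b) { return _mm_max_ps(a, b); }
static inline float vec4_lane0(vec4 v) { return _mm_cvtss_f32(v); }
#endif

int main(void) {
    /* min/max are the workhorses of the slab-based ray-box test */
    vec4 m = vec4_max(vec4_set1(1.0f), vec4_set1(2.0f));
    printf("%g\n", vec4_lane0(m));   /* 2 on either architecture */
    return 0;
}
```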
Introduction 2016 NUMA Deep Dive Series
The 2016 NUMA Deep Dive Series by staroceans.org explores computer architecture with a focus on NUMA systems and their optimization for performance, covering system architecture, cache coherency, memory optimization, and VMkernel constructs. It aims to provide practical guidance for configuring and deploying dual-socket systems built on Intel Xeon processors to improve overall platform performance.
Compiling tree transforms to operate on packed representations
The article explains how tree traversals in programming can be optimized by compiling them to work on serialized tree structures without using pointers. This approach can make programs run significantly faster on current x86 architectures. The authors developed a prototype compiler for a functional language that generates efficient code for traversing trees using packed data representations.
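The flavor of the approach in miniature (my own illustration, not the paper's compiler output): a binary tree serialized in preorder as tags and values, summed by advancing a cursor through the buffer instead of chasing pointers.

```c
#include <stdio.h>

enum { LEAF = 0, NODE = 1 };

/* packed preorder form of Node(Leaf 1, Node(Leaf 2, Leaf 3)) */
static const int tree[] = { NODE, LEAF, 1, NODE, LEAF, 2, LEAF, 3 };

static int sum(const int **cur) {
    if (*(*cur)++ == LEAF)
        return *(*cur)++;
    int left = sum(cur);     /* children are adjacent in the buffer, */
    return left + sum(cur);  /* so no child pointers are needed      */
}

int main(void) {
    const int *cur = tree;
    printf("%d\n", sum(&cur));   /* prints 6 */
    return 0;
}
```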
Infographics: Operation Costs in CPU Clock Cycles
The text catalogs operation costs in CPU clock cycles for different types of operations, including simple integer operations, floating-point operations, and vector operations. Memory involvement dominates cost: register-to-register operations can take as little as one CPU cycle, while operations that touch memory cost far more. Costs also vary across CPU architectures, and some operations require specialized CPU support to run efficiently.
When Network is Faster than Cache
Firefox introduced a feature called RCWN to improve web performance by racing cached requests against the network. In some cases, the network can be faster than fetching data from the cache due to various factors like browser bugs and resource prioritization. Factors like device hardware and the total number of assets served from the cache impact cache retrieval performance significantly.
How to speed up the Rust compiler one last time
The author at Mozilla is concluding their work on speeding up the Rust compiler after several years of dedicated effort.
They wrote multiple blog posts detailing their performance optimizations and shared valuable lessons learned from the process.
The author expressed gratitude to those who supported their work and highlighted the importance of ongoing contributions to Rust's development.
How to speed up the Rust compiler in March 2024
In March 2024, updates on the Rust compiler's performance highlighted several key improvements. Changes like using a single codegen unit, marking Debug::fmt methods with #[inline], introducing a cache, and upgrading LLVM versions led to notable reductions in wall-time, binary size, and hash table lookups. Additionally, the availability of the Cranelift codegen backend for x86-64/Linux and ARM/Linux offers an alternative for faster compile times. While the author didn't contribute to speed improvements this time, overall performance from August 2023 to March 2024 showed reductions in wall-time, peak memory usage, and binary size, indicating steady progress in enhancing the Rust compiler's efficiency.
When FFI Function Calls Beat Native C
David Yu performed a benchmark comparing different Foreign Function Interfaces (FFI) for function calls. LuaJIT's FFI was found to be faster than native C function calls due to efficient dynamic function call handling. Direct function calls, like those used by LuaJIT, can outperform indirect calls routed through a Procedure Linkage Table (PLT).
Aggregating Millions of Groups Fast in Apache Arrow DataFusion 28.0.0
Apache Arrow DataFusion version 28.0.0 now offers faster parallel aggregation for queries with many groups. The improvements aim to enhance user experiences by generating insights more efficiently. These enhancements bring DataFusion closer to the grouping speed of DuckDB.
Columnar kernels in go?
Over the winter I'm going to be adding a columnar query engine to an existing system written in Go.
Text Buffer Reimplementation
The Visual Studio Code 1.21 release includes a new text buffer implementation that improves performance in terms of speed and memory usage. The previous implementation used an array of lines, but it had limitations such as high memory usage and slow file opening times. The new implementation uses a piece table data structure, which allows for better memory usage and faster line look-up. Additionally, the implementation uses techniques such as caching for faster line lookup and a balanced binary tree for efficient searching. Benchmarks showed that the new implementation outperformed the previous line array implementation in terms of memory usage, file opening times, and reading operations.
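A piece table in miniature (a hand-rolled sketch; VS Code's implementation adds a balanced tree and caching on top): the document is a sequence of spans into two buffers, the read-only original and an append-only add buffer, so an insert splits a piece rather than moving any text.

```c
#include <stdio.h>
#include <string.h>

struct piece { const char *buf; int off, len; };

int main(void) {
    const char *original = "the quick fox";
    char add[64];
    int add_len = 0;

    /* insert " brown" at offset 9: original[0..9) + added + original[9..) */
    memcpy(add + add_len, " brown", 6);
    struct piece doc[3] = {
        { original, 0, 9 },        /* "the quick"                 */
        { add, add_len, 6 },       /* " brown" from the add buffer */
        { original, 9, 4 },        /* " fox"                      */
    };
    add_len += 6;

    for (int i = 0; i < 3; i++)    /* render: walk the piece list */
        printf("%.*s", doc[i].len, doc[i].buf + doc[i].off);
    putchar('\n');                 /* "the quick brown fox"       */
    return 0;
}
```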
Your ABI is Probably Wrong
The text discusses how most ABIs have a design flaw that harms performance by passing large structures inefficiently. Different ABIs handle passing large structures differently, but they all repeat the same mistakes. A correctly-specified ABI should pass large structures by immutable reference to avoid unnecessary copies.
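The complaint is easy to see in C, where the choice is explicit (a minimal illustration of the general point, not code from the article): passing a large struct by value copies it at every call, while a pointer to const, the moral equivalent of an immutable reference, costs a single register.

```c
#include <stdio.h>

struct big { long words[64]; };            /* 512 bytes */

long sum_by_value(struct big b) {          /* caller copies all 512 bytes */
    long s = 0;
    for (int i = 0; i < 64; i++) s += b.words[i];
    return s;
}

long sum_by_ref(const struct big *b) {     /* caller passes one pointer */
    long s = 0;
    for (int i = 0; i < 64; i++) s += b->words[i];
    return s;
}

int main(void) {
    struct big b = { { 0 } };
    b.words[0] = 41; b.words[63] = 1;
    printf("%ld %ld\n", sum_by_value(b), sum_by_ref(&b));  /* 42 42 */
    return 0;
}
```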
GitHub - sirupsen/napkin-math: Techniques and numbers for estimating system's performance from first-principles
The project "Napkin Math" aims to provide resources and techniques to estimate system performance quickly and accurately. It includes examples like estimating memory reading speed and storage costs for applications. The best way to learn this skill is through practical application, with the option to subscribe for regular practice problems. Detailed numbers and cost estimates are provided, along with compression ratios and techniques to simplify calculations. The project encourages user participation to enhance and refine the provided data and tools for napkin math calculations.
Why is Python slow
Python's performance issues stem from spending most time in the C runtime, rather than the Python code itself. Pyston focuses on speeding up the C code to improve performance. Suggestions to improve Python's speed by using other JIT techniques overlook the fundamental issue of optimizing C code.
John Carmack on Inlined Code
Consider inlining functions that are only called in one place for efficiency. Simplify code structure to reduce bugs and improve performance. Emphasize consistent execution paths over avoiding minor optimizations.
zackoverflow
Zack, the author, enjoys building things and delving into the inner workings of systems and computers for dopamine. He works on the Bun JavaScript runtime and creates music when not coding. Zack invites anyone to chat through his open calendar link.
Matrix multiplication in Mojo
A walkthrough of implementing matrix multiplication in Mojo, by modular.com, available at docs.modular.com.
Matrix Multiplication on CPU
An article by Marek Kolodziej (marek.ai) on implementing fast matrix multiplication on a CPU.
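One of the first wins in any CPU matmul write-up is loop ordering; here is a minimal sketch (my own, not necessarily this article's code): switching from i-j-k to i-k-j order makes the innermost accesses to both C and B contiguous in memory.

```c
#include <stdio.h>

#define N 256
static float A[N][N], B[N][N], C[N][N];

int main(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) { A[i][j] = 1.0f; B[i][j] = 2.0f; }

    for (int i = 0; i < N; i++)          /* i-k-j order: C[i][j] and   */
        for (int k = 0; k < N; k++) {    /* B[k][j] are walked along j, */
            float a = A[i][k];           /* i.e. contiguously           */
            for (int j = 0; j < N; j++)
                C[i][j] += a * B[k][j];
        }

    printf("%f\n", C[0][0]);             /* N * 1 * 2 = 512 */
    return 0;
}
```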
How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance: a Worklog
A worklog by Simon Boehm (siboehm.com) on optimizing a CUDA matmul kernel step by step toward cuBLAS-like performance.
Subcategories
- applications (9)
- compression (9)
- computer_vision (8)
- deep_learning (94)
- ethics (2)
- generative_models (25)
- interpretability (17)
- natural_language_processing (24)
- optimization (7)
- recommendation (2)
- reinforcement_learning (11)
- supervised_learning (1)