Bookmarks
Optimize CPU performance with Instruments
Learn how to optimize your app for Apple silicon with two new hardware-assisted tools in Instruments. We'll start by covering how to...
On Technical Challenges: Lock Free Programming
The Oxide Computer Company job application process asks applicants to answer a set of personal questions about their career and experiences.
Optimizing Transformer-Based Diffusion Models for Video Generation with NVIDIA TensorRT
State-of-the-art image diffusion models take tens of seconds to process a single image. This makes video diffusion even more challenging, as it requires significant computational resources and incurs high costs.
On Bloat
A talk by Rob Pike, delivered at Brighter Tech, Commonwealth Bank, September 30, 2024.
Why is Yazi fast?
This article assumes that you have already used Yazi and are familiar with most of its features.
Data-Oriented Design
Haskell as fast as C: working at a high altitude for low level performance
After the last post about high performance, high level programming, Slava Pestov, of Factor fame, wondered whether it was generally true that “if you want good performance you have to write C…
On Competing with C Using Haskell
Mark Karpov wrote in his article Migrating text metrics to pure Haskell how he originally made foreign calls out to C for many of the functions in his text metrics package, but later ported them to Haskell after learning that Haskell can give you performance comparable to C.
Performance
It's often not clear whether two programs that supposedly have the same functionality really do the same thing.
Daniel Lemire's blog
I find that there can still be a significant benefit to using csFastFloat over the .NET library: it can be about 3 times faster.
A Beginner's Guide to Vectorization By Hand: Part 3
We're continuing our expedition into the world of manual vectorization. In this part, we explain the most common technique for vectorizing conditional code, usually referred to as if-conversion.
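As a taste of the technique, here is a minimal if-conversion sketch in C with SSE intrinsics (my own illustration, not the article's code; it assumes x86-64 with SSE4.1 for `_mm_blendv_ps`, compiled with `-msse4.1`). Both sides of the conditional are computed unconditionally, and a per-lane mask selects the result, so no branch remains in the loop body.

```c
#include <smmintrin.h>
#include <stdio.h>

void clamp_neg_to_zero(float *x, int n) {   /* x[i] = x[i] < 0 ? 0 : x[i] */
    __m128 zero = _mm_setzero_ps();
    for (int i = 0; i + 4 <= n; i += 4) {
        __m128 v = _mm_loadu_ps(x + i);
        __m128 mask = _mm_cmplt_ps(v, zero);   /* lanes where v < 0     */
        v = _mm_blendv_ps(v, zero, mask);      /* pick 0 in those lanes */
        _mm_storeu_ps(x + i, v);
    }
}

int main(void) {
    float x[8] = { -1, 2, -3, 4, -5, 6, -7, 8 };
    clamp_neg_to_zero(x, 8);
    for (int i = 0; i < 8; i++) printf("%g ", x[i]);
    printf("\n");   /* prints: 0 2 0 4 0 6 0 8 */
    return 0;
}
```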
Algorithms for Modern Hardware
Its intended audience is everyone from performance engineers and practical algorithm researchers to undergraduate computer science students who have just finished an advanced algorithms course and want to learn more practical ways to speed up a program than by going from O(n log n) to O(n log log n).
A Beginner's Guide to Vectorization By Hand: Part 1
CPU vendors have been trying for a long time to exploit as much parallelism as they can, and the introduction of vector instructions is one way to do so.
Nine Rules for SIMD Acceleration of Your Rust Code (Part 1)
General Lessons from Boosting Data Ingestion in the range-set-blaze Crate by 7x
Fast Multidimensional Matrix Multiplication on CPU from Scratch
Numpy can multiply two 1024x1024 matrices on a 4-core Intel CPU in ~8ms. This is incredibly fast, considering this boils down to 18 FLOPs / core / cycle, with...
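The headline number is easy to sanity-check: a 1024x1024 matmul needs about 2·1024³ FLOPs, and dividing by the runtime, the core count, and an assumed ~3.7 GHz clock (the clock speed is my assumption, not stated above) lands right at the quoted figure.

```c
#include <stdio.h>

int main(void) {
    double flops = 2.0 * 1024 * 1024 * 1024;   /* 2*n^3 FLOPs for n = 1024 */
    double seconds = 8e-3;                     /* ~8 ms runtime            */
    double cores = 4.0, hz = 3.7e9;            /* clock speed is assumed   */
    printf("%.1f FLOPs/core/cycle\n", flops / seconds / cores / hz);
    return 0;                                  /* prints ~18.1             */
}
```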
Efficient n-states on x86 systems
The text discusses how to efficiently handle control flow in x86 systems when a flag can have multiple states beyond true and false. It explains how to use condition codes, such as testing for zero and parity, to minimize the number of instructions needed for these tests. Additionally, it touches on the challenges and limitations of using inline assembly for optimization in C programming.
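A hedged sketch of the idea (GNU C `asm goto`, x86-64 only; the 0/1/3 state encoding is my illustration of the trick, not necessarily the article's): a single TEST classifies all three states, with ZF catching zero and PF splitting the remaining two values by the parity of their set bits.

```c
#include <stdio.h>

/* states encoded as 0, 1, 3 so one TEST classifies them:
   0 -> ZF=1; 1 -> one bit set, odd parity, PF=0; 3 -> even parity, PF=1 */
static const char *classify(unsigned char flag) {
    asm goto ("testb %0, %0\n\t"
              "jz %l[is_zero]\n\t"   /* ZF=1 -> flag == 0            */
              "jp %l[is_three]"      /* PF=1 -> even parity -> 3     */
              : /* asm goto: no outputs */
              : "q" (flag)
              : "cc"
              : is_zero, is_three);
    return "state 1";                /* fallthrough: odd parity -> 1 */
is_zero:
    return "state 0";
is_three:
    return "state 3";
}

int main(void) {
    printf("%s %s %s\n", classify(0), classify(1), classify(3));
    return 0;
}
```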
Program tuning as a resource allocation problem
Program tuning involves balancing simplicity and performance while sharing cache resources among various subsystems. Optimizing one function can impact others, making it a global resource allocation problem that requires careful consideration of algorithms and their resource footprints. Better tools and metrics are needed to manage and analyze cache resource consumption effectively.
How web bloat impacts users with slow connections
Web bloat makes many websites difficult to use for people with slow internet connections and devices. Sites like Discourse and Reddit perform poorly on low-end devices, even if they seem fast on high-end ones. Improving web performance for these users is crucial, as many people rely on older, slower devices.
applicative-mental-models
The text discusses the importance of understanding program performance for effective optimization. It emphasizes that while most optimizations may not be necessary, being aware of critical performance paths is essential. The author provides latency numbers to help programmers grasp the impact of different operations on performance.
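For reference, the widely circulated ballpark figures behind such lists (popularized by Jeff Dean and Peter Norvig; exact values shift with every hardware generation) run roughly:

- L1 cache reference: ~0.5 ns
- branch mispredict: ~5 ns
- L2 cache reference: ~7 ns
- mutex lock/unlock: ~25 ns
- main memory reference: ~100 ns
- send 1 KB over a 1 Gbps network: ~10 µs
- read 1 MB sequentially from memory: ~250 µs
- round trip within a datacenter: ~500 µs
- disk seek: ~10 ms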
Recent presentations and papers
Andi Kleen's work focuses on improving Linux performance through various techniques like hardware monitoring and profiling. He has presented on topics such as lock elision, multi-core scalability, and error handling in the Linux kernel. His contributions include discussions on modern CPU performance, tools for Linux development, and enhancements for energy efficiency.
How long does it take to make a context switch?
Context switching times vary significantly across different Intel CPU models, with more expensive CPUs generally performing better. The performance can be greatly affected by cache usage and thread migration between cores, leading to increased costs when tasks are switched. Optimizing the number of threads to match the number of hardware threads can improve CPU efficiency and reduce context switching overhead.
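A common way to get a rough number yourself is a pipe ping-pong between two processes; below is a minimal Linux sketch (my own, not the article's benchmark). On a multi-core machine you would pin both processes to one core, for example with taskset, to force actual switches rather than cross-core wakeups.

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

#define ROUNDS 100000

int main(void) {
    int ab[2], ba[2];
    char buf = 0;
    if (pipe(ab) || pipe(ba)) { perror("pipe"); return 1; }
    if (fork() == 0) {                       /* child: echo each byte back */
        for (int i = 0; i < ROUNDS; i++) {
            read(ab[0], &buf, 1);
            write(ba[1], &buf, 1);
        }
        exit(0);
    }
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < ROUNDS; i++) {       /* one round trip = 2 switches */
        write(ab[1], &buf, 1);
        read(ba[0], &buf, 1);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    wait(NULL);
    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("~%.0f ns per switch\n", ns / (2.0 * ROUNDS));
    return 0;
}
```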
Cache-Oblivious Algorithms
Cache-oblivious algorithms are designed to use processor caches efficiently without needing to know specific cache details. They work by dividing data into smaller parts, allowing more computations to happen in cache and reducing memory access. This leads to better performance, especially in parallel algorithms, by minimizing shared memory bottlenecks.
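A minimal illustration of the divide-and-conquer pattern (my own sketch, using an out-of-place matrix transpose): the recursion halves the longer axis until a block is small enough to fit in any reasonable cache, and no cache parameter appears anywhere in the code.

```c
#include <stdio.h>

#define N 512
static double a[N][N], b[N][N];

static void transpose(int r0, int r1, int c0, int c1) {
    if (r1 - r0 <= 16 && c1 - c0 <= 16) {    /* base case: tiny block */
        for (int i = r0; i < r1; i++)
            for (int j = c0; j < c1; j++)
                b[j][i] = a[i][j];
    } else if (r1 - r0 >= c1 - c0) {         /* split the longer axis */
        int rm = (r0 + r1) / 2;
        transpose(r0, rm, c0, c1);
        transpose(rm, r1, c0, c1);
    } else {
        int cm = (c0 + c1) / 2;
        transpose(r0, r1, c0, cm);
        transpose(r0, r1, cm, c1);
    }
}

int main(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = i * N + j;
    transpose(0, N, 0, N);
    printf("%f\n", b[3][2]);   /* prints a[2][3] = 1027 */
    return 0;
}
```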
Too Fast, Too Megamorphic: what influences method call performance in Java?
The performance of method calls in Java can be improved through techniques like inlining and using inline caches. Monomorphic calls, where only one method can be invoked, are the fastest, while bimorphic and megamorphic calls are slower due to increased lookup costs. The study highlights that simply adding the "final" keyword or overriding methods does not significantly enhance performance.
The Black Magic of (Java) Method Dispatch
The article walks through annotated assembly output with per-operation execution percentages: the code dispatches on a coder value via comparisons and jumps, with labeled sections for the main entry point, the epilogue, the handling of other coders, and specific coder cases like Coder1 and Coder2.
Using Uninitialized Memory for Fun and Profit (posted Friday, March 14, 2008)
A clever trick involves using uninitialized memory to improve performance in certain programming situations by representing sparse sets efficiently with two arrays that point at each other. This technique allows for fast constant-time operations for adding, checking, and clearing elements in the set, making it a valuable tool for optimizing algorithms and data structures. The sparse set representation is especially useful for scenarios where speed is critical, such as in compiler optimizations and graph traversal algorithms.
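The trick itself is compact. Here is a sketch of the two-array representation (Briggs and Torczon's sparse set); using statically zeroed arrays as below sacrifices only the skip-initialization property, not the O(1) operations:

```c
#include <stdio.h>

#define UNIVERSE 1024
static unsigned dense[UNIVERSE], sparse[UNIVERSE], n;

/* v is a member iff dense and sparse point at each other through it,
   so stale garbage in either array can never produce a false positive */
static int contains(unsigned v) {
    return sparse[v] < n && dense[sparse[v]] == v;
}
static void add(unsigned v) {
    if (!contains(v)) { dense[n] = v; sparse[v] = n++; }
}
static void clear(void) { n = 0; }   /* O(1): no array is touched */

int main(void) {
    add(5); add(9);
    printf("%d %d\n", contains(5), contains(7));  /* 1 0 */
    clear();
    printf("%d\n", contains(5));                  /* 0 */
    return 0;
}
```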
How To Build a User-Level CPU Profiler (posted Thursday, August 8, 2013)
The text discusses how the pprof tool simplifies CPU profiling for C++ and Go programs by utilizing hardware timers and the operating system. Profiling information is gathered through hardware interrupts, providing insights into a program's performance and resource usage. By moving profiling logic to user-level timers, programs can customize and enhance profiling capabilities without kernel changes.
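A stripped-down sketch of the mechanism (Linux/x86-64 specific, and much simpler than what pprof actually does): ITIMER_PROF delivers SIGPROF as CPU time is consumed, and the handler can read the interrupted program counter out of the ucontext to attribute the sample.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <sys/time.h>
#include <ucontext.h>

static volatile sig_atomic_t samples;

static void on_prof(int sig, siginfo_t *si, void *uc_) {
    (void)sig; (void)si;
    ucontext_t *uc = uc_;
    void *pc = (void *)uc->uc_mcontext.gregs[REG_RIP]; /* Linux x86-64 only */
    (void)pc;            /* a real profiler would bucket this address */
    samples++;
}

int main(void) {
    struct sigaction sa = {0};
    sa.sa_sigaction = on_prof;
    sa.sa_flags = SA_SIGINFO | SA_RESTART;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGPROF, &sa, NULL);

    struct itimerval it = { { 0, 10000 }, { 0, 10000 } }; /* 100 Hz of CPU time */
    setitimer(ITIMER_PROF, &it, NULL);

    volatile double x = 0;
    for (long i = 0; i < 200000000; i++) x += i;   /* something to sample */
    printf("%ld samples\n", (long)samples);
    return 0;
}
```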
Implementing a file pager in Zig
Implementing a file pager in Zig involves delaying disk writes until a threshold is reached. Two eviction strategies include least recently used and least frequently used models. Prioritizing pages based on usage can help optimize performance.
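A toy version of the least-recently-used policy (my own minimal illustration in C, not the article's Zig code): each cached slot remembers the tick of its last touch, and the eviction victim is the slot with the smallest tick.

```c
#include <stdio.h>

#define SLOTS 4
static int page_of[SLOTS] = { -1, -1, -1, -1 };  /* page held per slot */
static unsigned long touched[SLOTS], tick;

static int lru_victim(void) {        /* slot with the oldest touch */
    int v = 0;
    for (int i = 1; i < SLOTS; i++)
        if (touched[i] < touched[v]) v = i;
    return v;
}

static void access_page(int page) {
    for (int i = 0; i < SLOTS; i++)
        if (page_of[i] == page) { touched[i] = ++tick; return; }
    int v = lru_victim();            /* miss: evict the coldest slot */
    if (page_of[v] != -1)
        printf("evict page %d for page %d\n", page_of[v], page);
    page_of[v] = page;
    touched[v] = ++tick;
}

int main(void) {
    int seq[] = { 1, 2, 3, 4, 1, 5 };   /* 5 evicts 2, the coldest page */
    for (int i = 0; i < 6; i++) access_page(seq[i]);
    return 0;
}
```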
Comparing SIMD on x86-64 and arm64
The text compares SIMD implementations using SSE on x86-64 and Neon on arm64 processors, including emulating SSE on arm64 with Neon. It explores vectorized code performance using intrinsics, auto-vectorization, and ISPC, highlighting the efficiency of SSE and Neon implementations. The study shows how optimizing for SIMD instructions significantly boosts performance over scalar implementations in ray-box intersection tests.
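The emulation idea boils down to mapping each SSE intrinsic onto a Neon counterpart behind a common name, in the style of sse2neon. A one-operation sketch (the vec4 naming is my own, not from the article):

```c
#include <stdio.h>

#if defined(__aarch64__)
#include <arm_neon.h>
typedef float32x4_t vec4;
static inline vec4 vec4_set1(float x) { return vdupq_n_f32(x); }
static inline vec4 vec4_max(vec4 a, vec4 b) { return vmaxq_f32(a, b); }
static inline float vec4_lane0(vec4 v) { return vgetq_lane_f32(v, 0); }
#else
#include <xmmintrin.h>
typedef __m128 vec4;
static inline vec4 vec4_set1(float x) { return _mm_set1_ps(x); }
static inline vec4 vec4_max(vec4 a, vec4 b) { return _mm_max_ps(a, b); }
static inline float vec4_lane0(vec4 v) { return _mm_cvtss_f32(v); }
#endif

int main(void) {
    /* min/max are the workhorses of the slab-based ray-box test */
    vec4 m = vec4_max(vec4_set1(1.0f), vec4_set1(2.0f));
    printf("%g\n", vec4_lane0(m));   /* 2 on either architecture */
    return 0;
}
```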
Introduction 2016 NUMA Deep Dive Series
The 2016 NUMA Deep Dive Series by staroceans.org explores computer architecture with a focus on NUMA systems and their optimization for performance, covering system architecture, cache coherency, memory optimization, and VMkernel constructs. It aims to provide practical guidance for configuring and deploying dual-socket systems built on Intel Xeon processors to improve overall platform performance.
Compiling tree transforms to operate on packed representations
The article explains how tree traversals in programming can be optimized by compiling them to work on serialized tree structures without using pointers. This approach can make programs run significantly faster on current x86 architectures. The authors developed a prototype compiler for a functional language that generates efficient code for traversing trees using packed data representations.
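The flavor of the approach in miniature (my own illustration, not the paper's compiler output): a binary tree serialized in preorder as tags and values, summed by advancing a cursor through the buffer instead of chasing pointers.

```c
#include <stdio.h>

enum { LEAF = 0, NODE = 1 };

/* packed preorder form of Node(Leaf 1, Node(Leaf 2, Leaf 3)) */
static const int tree[] = { NODE, LEAF, 1, NODE, LEAF, 2, LEAF, 3 };

static int sum(const int **cur) {
    if (*(*cur)++ == LEAF)
        return *(*cur)++;
    int left = sum(cur);     /* children are adjacent in the buffer, */
    return left + sum(cur);  /* so no child pointers are needed      */
}

int main(void) {
    const int *cur = tree;
    printf("%d\n", sum(&cur));   /* prints 6 */
    return 0;
}
```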
Infographics: Operation Costs in CPU Clock Cycles
The text catalogs operation costs in CPU clock cycles for different types of operations, including simple integer operations, floating-point operations, and vector operations. Memory involvement dominates cost: register-to-register operations can take as little as one CPU cycle, while operations that touch memory cost far more. Costs also vary across CPU architectures, and some operations require specialized CPU support to run efficiently.
When Network is Faster than Cache
Firefox introduced a feature called RCWN to improve web performance by racing cached requests against the network. In some cases, the network can be faster than fetching data from the cache due to various factors like browser bugs and resource prioritization. Factors like device hardware and the total number of assets served from the cache impact cache retrieval performance significantly.
How to speed up the Rust compiler one last time
The author at Mozilla is concluding their work on speeding up the Rust compiler after several years of dedicated effort.
They wrote multiple blog posts detailing their performance optimizations and shared valuable lessons learned from the process.
The author expressed gratitude to those who supported their work and highlighted the importance of ongoing contributions to Rust's development.
How to speed up the Rust compiler in March 2024
In March 2024, updates on the Rust compiler's performance highlighted several key improvements. Changes like using a single codegen unit, marking Debug::fmt methods with #[inline], introducing a cache, and upgrading LLVM versions led to notable reductions in wall-time, binary size, and hash table lookups. Additionally, the availability of the Cranelift codegen backend for x86-64/Linux and ARM/Linux offers an alternative for faster compile times. While the author didn't contribute to speed improvements this time, overall performance from August 2023 to March 2024 showed reductions in wall-time, peak memory usage, and binary size, indicating steady progress in enhancing the Rust compiler's efficiency.
When FFI Function Calls Beat Native C
David Yu performed a benchmark comparing different Foreign Function Interfaces (FFI) for function calls. LuaJIT's FFI was found to be faster than native C function calls due to efficient dynamic function call handling. Direct function calls, like those used by LuaJIT, can outperform indirect calls routed through a Procedure Linkage Table (PLT).
Aggregating Millions of Groups Fast in Apache Arrow DataFusion 28.0.0
Apache Arrow DataFusion version 28.0.0 now offers faster parallel aggregation for queries with many groups. The improvements aim to enhance user experiences by generating insights more efficiently. These enhancements bring DataFusion closer to the grouping speed of DuckDB.
Columnar kernels in go?
Over the winter I'm going to be adding a columnar query engine to an existing system written in Go.
Text Buffer Reimplementation
The Visual Studio Code 1.21 release includes a new text buffer implementation that improves performance in terms of speed and memory usage. The previous implementation used an array of lines, but it had limitations such as high memory usage and slow file opening times. The new implementation uses a piece table data structure, which allows for better memory usage and faster line look-up. Additionally, the implementation uses techniques such as caching for faster line lookup and a balanced binary tree for efficient searching. Benchmarks showed that the new implementation outperformed the previous line array implementation in terms of memory usage, file opening times, and reading operations.
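A piece table in miniature (a hand-rolled sketch; VS Code's implementation adds a balanced tree and caching on top): the document is a sequence of spans into two buffers, the read-only original and an append-only add buffer, so an insert splits a piece rather than moving any text.

```c
#include <stdio.h>
#include <string.h>

struct piece { const char *buf; int off, len; };

int main(void) {
    const char *original = "the quick fox";
    char add[64];
    int add_len = 0;

    /* insert " brown" at offset 9: original[0..9) + added + original[9..) */
    memcpy(add + add_len, " brown", 6);
    struct piece doc[3] = {
        { original, 0, 9 },        /* "the quick"                 */
        { add, add_len, 6 },       /* " brown" from the add buffer */
        { original, 9, 4 },        /* " fox"                      */
    };
    add_len += 6;

    for (int i = 0; i < 3; i++)    /* render: walk the piece list */
        printf("%.*s", doc[i].len, doc[i].buf + doc[i].off);
    putchar('\n');                 /* "the quick brown fox"       */
    return 0;
}
```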
Your ABI is Probably Wrong
The text discusses how most ABIs have a design flaw that harms performance by passing large structures inefficiently. Different ABIs handle passing large structures differently, but they all repeat the same mistakes. A correctly-specified ABI should pass large structures by immutable reference to avoid unnecessary copies.
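The complaint is easy to see in C, where the choice is explicit (a minimal illustration of the general point, not code from the article): passing a large struct by value copies it at every call, while a pointer to const, the moral equivalent of an immutable reference, costs a single register.

```c
#include <stdio.h>

struct big { long words[64]; };            /* 512 bytes */

long sum_by_value(struct big b) {          /* caller copies all 512 bytes */
    long s = 0;
    for (int i = 0; i < 64; i++) s += b.words[i];
    return s;
}

long sum_by_ref(const struct big *b) {     /* caller passes one pointer */
    long s = 0;
    for (int i = 0; i < 64; i++) s += b->words[i];
    return s;
}

int main(void) {
    struct big b = { { 0 } };
    b.words[0] = 41; b.words[63] = 1;
    printf("%ld %ld\n", sum_by_value(b), sum_by_ref(&b));  /* 42 42 */
    return 0;
}
```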
GitHub - sirupsen/napkin-math: Techniques and numbers for estimating system's performance from first-principles
The project "Napkin Math" aims to provide resources and techniques to estimate system performance quickly and accurately. It includes examples like estimating memory reading speed and storage costs for applications. The best way to learn this skill is through practical application, with the option to subscribe for regular practice problems. Detailed numbers and cost estimates are provided, along with compression ratios and techniques to simplify calculations. The project encourages user participation to enhance and refine the provided data and tools for napkin math calculations.
Why is Python slow
Python's performance issues stem from spending most time in the C runtime, rather than the Python code itself. Pyston focuses on speeding up the C code to improve performance. Suggestions to improve Python's speed by using other JIT techniques overlook the fundamental issue of optimizing C code.
John Carmack on Inlined Code
Consider inlining functions that are only called in one place for efficiency. Simplify code structure to reduce bugs and improve performance. Emphasize consistent execution paths over avoiding minor optimizations.
zackoverflow
Zack, the author, enjoys building things and delving into the inner workings of systems and computers for dopamine. He works on the Bun JavaScript runtime and creates music when not coding. Zack invites anyone to chat through his open calendar link.
Matrix multiplication in Mojo
A walkthrough of implementing matrix multiplication in Mojo, by modular.com, available at docs.modular.com.
Matrix Multiplication on CPU
An article by Marek Kolodziej (marek.ai) on implementing fast matrix multiplication on a CPU.
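One of the first wins in any CPU matmul write-up is loop ordering; here is a minimal sketch (my own, not necessarily this article's code): switching from i-j-k to i-k-j order makes the innermost accesses to both C and B contiguous in memory.

```c
#include <stdio.h>

#define N 256
static float A[N][N], B[N][N], C[N][N];

int main(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) { A[i][j] = 1.0f; B[i][j] = 2.0f; }

    for (int i = 0; i < N; i++)          /* i-k-j order: C[i][j] and   */
        for (int k = 0; k < N; k++) {    /* B[k][j] are walked along j, */
            float a = A[i][k];           /* i.e. contiguously           */
            for (int j = 0; j < N; j++)
                C[i][j] += a * B[k][j];
        }

    printf("%f\n", C[0][0]);             /* N * 1 * 2 = 512 */
    return 0;
}
```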
How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance: a Worklog
A worklog by Simon Boehm (siboehm.com) on optimizing a CUDA matmul kernel step by step toward cuBLAS-like performance.
Subcategories
- applications (9)
- compression (9)
- computer_vision (8)
- deep_learning (94)
- ethics (2)
- generative_models (25)
- interpretability (17)
- natural_language_processing (24)
- optimization (7)
- recommendation (2)
- reinforcement_learning (11)
- supervised_learning (1)