Bookmarks

Implementation of a simple microprocessor using Verilog

I am trying to build a simple microprocessor in Verilog as a way to learn Verilog and assembly at the same time. I am not sure whether my implementation reflects how microprocessors actually work ...

learn-fpga/FemtoRV/TUTORIALS/FROM_BLINKER_TO_RISCV/README.md at master · BrunoLevy/learn-fpga · GitHub

Learning FPGA, yosys, nextpnr, and RISC-V.

What’s the (floating) Point of all these data types? A (not so) brief overview of the history and usage of datatypes within the wide world of computation

This presentation delves into the fascinating and sometimes aggravating world of numerical data types, exploring the evolution, strengths, and weaknesses of decimal, fixed point, floating point, and shared exponent formats over the past 70 years.

Tenstorrent first thoughts

I've looked into alternative AI accelerators to continue my saga of running GGML on lower power-consumption hardware. The most promising - and the only one that ever replied to my emails - was Tenstorrent. This post is me thinking hard about whether buying their hardware for development is a good investment.

How to Think About TPUs

All about how TPUs work, how they're networked together to enable multi-chip training and inference, and how they limit the performance of our favorite algorithms. While this may seem a little dry, it's super important for actually making models efficient.

Tenstorrent Wormhole Series Part 1: Physicalities

A company called Tenstorrent designs and sells PCIe cards for AI acceleration. At the time of writing, they've recently started shipping their Wormhole n150s and Wormhole n300s cards.

Automating GPU Kernel Generation with DeepSeek-R1 and Inference Time Scaling

As AI models extend their capabilities to solve more sophisticated challenges, a new scaling law known as test-time scaling or inference-time scaling is emerging. Also known as AI reasoning or long…

FPGAs for Software Engineers 0: The Basics

A brief introduction to FPGAs, Verilog and simulation

Richard Hamming - Wikipedia

Richard Wesley Hamming (February 11, 1915 – January 7, 1998) was an American mathematician whose work had many implications for computer engineering and telecommunications.

How Many Computers Are In Your Computer?

Any ‘computer’ is made up of hundreds of separate computers plugged together, any of which can be hacked. I list some of these parts.

Haskell as fast as C: working at a high altitude for low level performance

After the last post about high performance, high level programming, Slava Pestov, of Factor fame, wondered whether it was generally true that “if you want good performance you have to write C…

A Beginner's Guide to Vectorization By Hand: Part 3

We're continuing our expedition into the world of manual vectorization. In this part, we explain the most common technique for vectorizing conditional code, usually referred to as if-conversion.
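
A minimal sketch of if-conversion (my example, not the article's): evaluate both sides of the branch unconditionally, build a mask from the comparison, and blend the results per lane. Assumes SSE4.1 and n divisible by 4.

```c
#include <smmintrin.h>  /* SSE4.1: _mm_blendv_ps */

/* Branchy scalar original: if (a[i] > 0) out[i] = a[i] * 2; else out[i] = a[i]; */
void scale_positive(const float *a, float *out, int n)
{
    const __m128 two  = _mm_set1_ps(2.0f);
    const __m128 zero = _mm_setzero_ps();
    for (int i = 0; i < n; i += 4) {
        __m128 v       = _mm_loadu_ps(a + i);
        __m128 mask    = _mm_cmpgt_ps(v, zero);  /* all-ones lanes where a[i] > 0 */
        __m128 doubled = _mm_mul_ps(v, two);     /* compute the "then" side for every lane */
        _mm_storeu_ps(out + i, _mm_blendv_ps(v, doubled, mask));  /* select per lane */
    }
}
```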

Algorithms for Modern Hardware

Its intended audience is everyone from performance engineers and practical algorithm researchers to undergraduate computer science students who have just finished an advanced algorithms course and want to learn more practical ways to speed up a program than by going from O(n log n) to O(n log log n).

How LLVM Optimizes a Function

In some compilers the IR format remains fixed throughout the optimization pipeline; in others, the format or semantics change.

A Beginner's Guide to Vectorization By Hand: Part 1

The CPU vendors have been trying for a long time to exploit as much parallelism as they can, and the introduction of vector instructions is one way to go.

How Target-Independent is Your IR?

An esoteric exploration of the target independence of compiler IRs.

Nine Rules for SIMD Acceleration of Your Rust Code (Part 1)

General Lessons from Boosting Data Ingestion in the range-set-blaze Crate by 7x

Putting the “You” in CPU

Curious exactly what happens when you run a program on your computer? Learn how multiprocessing works, what system calls really are, how computers manage memory with hardware interrupts, and how Linux loads executables.

bytecode interpreters for tiny computers

I've previously come to the conclusion that there's little reason for using bytecode in the modern world, except in order to get more compact code, for which it can be very effective.

Fast Multidimensional Matrix Multiplication on CPU from Scratch

Numpy can multiply two 1024x1024 matrices on a 4-core Intel CPU in ~8 ms. This is incredibly fast, considering this boils down to 18 FLOPs / core / cycle, with...
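
The arithmetic behind that figure (the ~3.7 GHz clock is my assumption; the post has the exact numbers):

```latex
\frac{2 \cdot 1024^3\ \text{FLOPs}}{8\,\text{ms} \times 4\ \text{cores} \times 3.7\,\text{GHz}} \approx 18\ \text{FLOPs per core per cycle}
```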

Efficient n-states on x86 systems

The text discusses how to efficiently handle control flow in x86 systems when a flag can have multiple states beyond true and false. It explains how to use condition codes, such as testing for zero and parity, to minimize the number of instructions needed for these tests. Additionally, it touches on the challenges and limitations of using inline assembly for optimization in C programming.
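
A sketch of the trick (the state encoding is my illustration; the article works directly in assembly): choose the states' values so a single TEST exposes all three via the flags. x86 sets PF when the low byte has an even number of set bits, so 0, 1, and 3 are distinguishable from one test.

```c
#include <stdio.h>

/* Encode three states as 0, 1, 3 so one "test al, al" separates them:
 *   value 0 -> ZF=1           (jz)
 *   value 1 -> ZF=0, PF=0     (odd popcount:  jnp)
 *   value 3 -> ZF=0, PF=1     (even popcount: jp)
 */
static const char *classify(unsigned char v)
{
    if (v == 0)
        return "state A";         /* test v,v ; jz  */
    if (__builtin_parity(v))      /* odd number of set bits -> PF clear */
        return "state B";         /* test v,v ; jnp */
    return "state C";             /* test v,v ; jp  */
}

int main(void)
{
    printf("%s %s %s\n", classify(0), classify(1), classify(3));
    return 0;
}
```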

Program tuning as a resource allocation problem

Program tuning involves balancing simplicity and performance while sharing cache resources among various subsystems. Optimizing one function can impact others, making it a global resource allocation problem that requires careful consideration of algorithms and their resource footprints. Better tools and metrics are needed to manage and analyze cache resource consumption effectively.

Optimizing subroutines in assembly language

Optimizing subroutines in assembly language involves various techniques such as using inline assembly in a C++ compiler, separating code using MMX registers from code using ST registers, and understanding different register sizes and memory operands. It is important to consider the use of instruction prefixes, intrinsic functions for vector operations, and accessing class and structure members efficiently. Additionally, preventing false dependences, aligning loop and subroutine entries, and optimizing instruction sizes can improve performance. However, it is crucial to note that these optimizations are processor-specific and may vary depending on the target platform.

Brian Robert Callahan

This blog post starts a series on creating programs that demystify how programs work. The first program is a disassembler that reads bytecode and converts it into assembly language, while a future post will cover creating an assembler. The disassembler uses a table of mnemonics and instruction sizes to print out the corresponding assembly instructions from bytecode.
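
The core of such a disassembler is just the table walk. A sketch under an invented 3-instruction ISA (opcodes, mnemonics, and sizes below are hypothetical, not the post's):

```c
#include <stdio.h>

/* Each table entry: mnemonic plus total instruction size in bytes. */
struct insn { const char *mnemonic; int size; };

static const struct insn table[256] = {
    [0x01] = { "push", 2 },   /* push imm8  */
    [0x02] = { "add",  1 },   /* add        */
    [0x03] = { "jmp",  3 },   /* jmp imm16  */
};

void disassemble(const unsigned char *code, int len)
{
    for (int pc = 0; pc < len; ) {
        const struct insn *in = &table[code[pc]];
        if (!in->mnemonic) {                       /* unknown byte: dump it */
            printf("%04x: db 0x%02x\n", pc, code[pc]);
            pc++;
            continue;
        }
        printf("%04x: %s", pc, in->mnemonic);
        for (int i = 1; i < in->size; i++)         /* print operand bytes */
            printf(" 0x%02x", code[pc + i]);
        printf("\n");
        pc += in->size;                            /* advance by table size */
    }
}

int main(void)
{
    unsigned char prog[] = { 0x01, 0x2a, 0x02, 0x03, 0x00, 0x10 };
    disassemble(prog, sizeof prog);
    return 0;
}
```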

Recent presentations and papers

Andi Kleen's work focuses on improving Linux performance through various techniques like hardware monitoring and profiling. He has presented on topics such as lock elision, multi-core scalability, and error handling in the Linux kernel. His contributions include discussions on modern CPU performance, tools for Linux development, and enhancements for energy efficiency.

How long does it take to make a context switch?

Context switching times vary significantly across different Intel CPU models, with more expensive CPUs generally performing better. The performance can be greatly affected by cache usage and thread migration between cores, leading to increased costs when tasks are switched. Optimizing the number of threads to match the number of hardware threads can improve CPU efficiency and reduce context switching overhead.

Tiled Matrix Multiplication

Tiled matrix multiplication is an efficient algorithm used on GPUs that reduces memory access by utilizing shared memory. By organizing threads into blocks, each thread can perform calculations more quickly and with fewer memory accesses. This method is important for improving performance in tasks like graphics rendering and machine learning.
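
The same idea in CPU terms (a sketch of mine, not the article's GPU code): block the loops so a tile of each matrix stays resident in cache while it is reused, which is the cache analogue of staging tiles in GPU shared memory. Assumes C starts zeroed.

```c
#define N    512
#define TILE 32   /* three TILE x TILE float blocks = 12 KB, fits in L1 */

/* C += A * B, all N x N row-major. */
void matmul_tiled(const float A[N][N], const float B[N][N], float C[N][N])
{
    for (int ii = 0; ii < N; ii += TILE)
        for (int kk = 0; kk < N; kk += TILE)
            for (int jj = 0; jj < N; jj += TILE)
                /* work entirely inside the three cached tiles */
                for (int i = ii; i < ii + TILE; i++)
                    for (int k = kk; k < kk + TILE; k++) {
                        float a = A[i][k];          /* reused across the j loop */
                        for (int j = jj; j < jj + TILE; j++)
                            C[i][j] += a * B[k][j];
                    }
}
```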

Compiler Backend

The QBE compiler backend is designed to be a compact yet high-performance C embeddable backend that prioritizes correctness, simplicity, and user-friendliness. It compiles on various x64 operating systems and boasts features like IEEE floating point support, SSA-based intermediate language, and quick compilation times. While currently limited to x64 platforms, plans include ARM support and further enhancements. The backend has been successfully utilized in various projects, showcasing its adaptability and effectiveness in compiler development.

1024cores

Dmitry Vyukov shares information on synchronization algorithms, multicore design patterns, and high-performance computing on his website, 1024cores.net. He focuses on shared-memory systems and does not cover topics like clusters or GPUs. New content is added regularly, and readers can subscribe for updates.

Pointers Are Complicated, or: What's in a Byte?

The document explains the complexities of pointers in low-level programming languages like C++ and Rust, debunking the misconception that pointers are simple integers. It delves into examples showing how assumptions about pointers can lead to undefined behavior and how pointer arithmetic can be tricky. The text proposes a model where a pointer is a pair of an allocation ID and an offset, rather than just an integer. Additionally, it discusses the challenges of representing bytes in memory, especially when dealing with uninitialized memory and the need for a more nuanced byte representation to ensure program correctness.
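
The post's central example, transcribed to C (comments mine): forming a one-past-the-end pointer is legal, but writing through it is undefined behavior even if it happens to equal &y[0] as an integer, which is exactly what the allocation-ID-plus-offset model captures.

```c
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int *x = malloc(8 * sizeof(int));
    int *y = malloc(8 * sizeof(int));
    y[0] = 42;

    int i = 8;           /* imagine this arrives from an opaque computation */
    int *x_ptr = x + i;  /* one past the end of x: legal to *form* ...      */
    (void)x_ptr;
    /* *x_ptr = 23;         ... but NOT to write through, even if x_ptr
                            equals y numerically: the compiler may assume
                            it cannot alias y and still print 42. */

    printf("%d\n", y[0]);
    free(x);
    free(y);
    return 0;
}
```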

How To Build a User-Level CPU Profiler

The text discusses how the pprof tool simplifies CPU profiling for C++ and Go programs by utilizing hardware timers and the operating system. Profiling information is gathered through hardware interrupts, providing insights into a program's performance and resource usage. By moving profiling logic to user-level timers, programs can customize and enhance profiling capabilities without kernel changes.
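
A minimal sketch of the mechanism pprof builds on (not pprof's code): ask the kernel for a profiling timer and count SIGPROF deliveries. A real profiler would record the interrupted program counter from the signal context rather than a bare count.

```c
#include <signal.h>
#include <stdio.h>
#include <sys/time.h>

static volatile sig_atomic_t samples;

static void on_prof(int sig) { (void)sig; samples++; }

int main(void)
{
    struct sigaction sa = { .sa_handler = on_prof };
    sigaction(SIGPROF, &sa, NULL);

    struct itimerval it = {
        .it_interval = { .tv_sec = 0, .tv_usec = 10000 },  /* fire at 100 Hz */
        .it_value    = { .tv_sec = 0, .tv_usec = 10000 },  /* of CPU time    */
    };
    setitimer(ITIMER_PROF, &it, NULL);

    volatile double x = 0;                   /* busy work to profile */
    for (long i = 0; i < 200000000L; i++)
        x += i * 0.5;

    printf("collected %d samples\n", (int)samples);
    return 0;
}
```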

Programming Language Memory Models (Memory Models, Part 2)

Modern programming languages use atomic variables and operations to help synchronize threads and prevent data races. This ensures that programs run correctly by allowing proper communication between threads without inconsistent memory access. All major languages, like C++, Java, and Rust, support sequentially consistent atomics to simplify the development of multithreaded programs.
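
The classic message-passing litmus test from the post, written here in C11 (my transcription). Because done is atomic and sequentially consistent by default, a reader that observes done == 1 is guaranteed to also observe x == 1.

```c
#include <stdatomic.h>
#include <stdio.h>
#include <threads.h>   /* C11 threads; pthreads works the same way */

int x;                 /* plain data, published via the flag */
atomic_int done;

int writer(void *arg)
{
    (void)arg;
    x = 1;
    atomic_store(&done, 1);   /* publish the message */
    return 0;
}

int main(void)
{
    thrd_t t;
    thrd_create(&t, writer, NULL);
    while (atomic_load(&done) == 0)   /* wait for the flag */
        ;
    printf("x = %d\n", x);            /* prints 1, never 0 */
    thrd_join(t, NULL);
    return 0;
}
```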

Hardware Memory Models (Memory Models, Part 1)

This text discusses hardware memory models, focusing on how different processors handle memory operations and maintain order. It explains the concept of sequential consistency, where operations are executed in a predictable order, and contrasts it with more relaxed models like those used in ARM and POWER architectures. The author highlights the importance of synchronization to avoid data races in concurrent programming.

Tiny Tapeout

Tiny Tapeout is a project that helps people easily and affordably create their own chip designs. It offers resources for beginners and advanced users, along with a special price for submissions. Join the community to learn and share your designs before the deadline on September 6th.

What Every Computer Scientist Should Know About Floating-Point Arithmetic

The text discusses the challenges and considerations of floating-point arithmetic in computer science. It emphasizes the importance of rounding in floating-point calculations and the implications of different precision levels. Additionally, it highlights the need for careful implementation to ensure correctness and accuracy in programs that rely on floating-point arithmetic.
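
The rounding issue in one small example (mine, not the paper's): 0.1, 0.2, and 0.3 are not exactly representable in binary floating point, so each literal is rounded, and the sum rounds differently than the literal 0.3.

```c
#include <stdio.h>

int main(void)
{
    double a = 0.1 + 0.2;
    printf("%.17g\n", a);       /* 0.30000000000000004 */
    printf("%d\n", a == 0.3);   /* 0: compare with a tolerance instead */
    return 0;
}
```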

But how, exactly, do databases use mmap?

Databases use memory-mapped files, via mmap, to handle data on disk larger than available memory. Examples include SQLite, LevelDB, Lucene, LMDB, and MongoDB. By understanding how mmap is used, we can grasp how databases efficiently read and write data on disk.
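
The core pattern in miniature (a sketch; the file name is hypothetical): map the file and read it as ordinary memory, letting the kernel page data in on demand, so the file may exceed RAM.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("data.db", O_RDONLY);   /* hypothetical database file */
    if (fd < 0) return 1;
    struct stat st;
    fstat(fd, &st);

    const unsigned char *p =
        mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) return 1;

    unsigned long sum = 0;                /* touching pages faults them in */
    for (off_t i = 0; i < st.st_size; i++)
        sum += p[i];
    printf("checksum: %lu\n", sum);

    munmap((void *)p, st.st_size);
    close(fd);
    return 0;
}
```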

spikedoanz/from-bits-to-intelligence: machine learning stack in under 100,000 lines of code

The text discusses building a machine learning stack in under 100,000 lines of code with hardware, software, tensors, and machine learning components. It outlines the required components like a CPU, GPU, storage, C compiler, Python runtime, operating system, and more. The goal is to simplify the machine learning stack while providing detailed steps for implementation in different programming languages.

Comparing SIMD on x86-64 and arm64

The text compares SIMD implementations using SSE on x86-64 and Neon on arm64 processors, including emulating SSE on arm64 with Neon. It explores vectorized code performance using intrinsics, auto-vectorization, and ISPC, highlighting the efficiency of SSE and Neon implementations. The study shows how optimizing for SIMD instructions significantly boosts performance over scalar implementations in ray-box intersection tests.
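
A flavor of how the two instruction sets mirror each other (my example; the article benchmarks ray-box intersection): the same four-wide float add, once per ISA. Assumes n is a multiple of 4.

```c
#include <stddef.h>

#if defined(__SSE__)
#include <xmmintrin.h>
/* x86-64: four floats per iteration with SSE */
void add4(const float *a, const float *b, float *out, size_t n)
{
    for (size_t i = 0; i < n; i += 4)
        _mm_storeu_ps(out + i,
            _mm_add_ps(_mm_loadu_ps(a + i), _mm_loadu_ps(b + i)));
}
#elif defined(__ARM_NEON)
#include <arm_neon.h>
/* arm64: the same four-wide add with Neon */
void add4(const float *a, const float *b, float *out, size_t n)
{
    for (size_t i = 0; i < n; i += 4)
        vst1q_f32(out + i, vaddq_f32(vld1q_f32(a + i), vld1q_f32(b + i)));
}
#endif
```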

Compiler Optimizations Are Hard Because They Forget

Compiler optimizations involve breaking down complex changes into smaller, more manageable steps to improve code efficiency. However, as more optimizations are added, the potential for errors and missed opportunities increases, making it challenging to maintain optimal performance. Compilers struggle with balancing aggressive optimizations while preserving correct program behavior, highlighting the complexity and difficulties inherent in optimizing compilers.

A new JIT engine for PHP-8.4/9

A new JIT engine for PHP is being developed, improving performance and simplifying development. The engine will be included in the next major PHP version, potentially PHP 9.0. The new JIT engine generates a single Intermediate Representation (IR), eliminating the need to support assembler code for different CPUs.

Introduction 2016 NUMA Deep Dive Series

The 2016 NUMA Deep Dive Series by staroceans.org explores various aspects of computer architecture, focusing on NUMA systems and their optimization for performance. The series covers topics such as system architecture, cache coherency, memory optimization, and VMkernel constructs to help readers understand and improve their host design and management. The series aims to provide valuable insights for configuring and deploying dual socket systems using Intel Xeon processors, with a focus on enhancing overall platform performance.

von Neumann architecture - Wikipedia

The von Neumann architecture is a computer design with a processing unit, control unit, memory, and input/output mechanisms. It allows for instructions and data operations to be stored in memory, advancing computer technology from fixed-function machines like the ENIAC. This architecture was influenced by the work of Alan Turing and John von Neumann and has been widely used in the development of modern computers.

Compiling tree transforms to operate on packed representations

The article explains how tree traversals in programming can be optimized by compiling them to work on serialized tree structures without using pointers. This approach can make programs run significantly faster on current x86 architectures. The authors developed a prototype compiler for a functional language that generates efficient code for traversing trees using packed data representations.
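
The flavor of a packed representation (my sketch; the paper's compiler generates such code automatically): a binary tree serialized in pre-order into a flat buffer, traversed by a single left-to-right cursor scan instead of pointer chasing.

```c
#include <stdio.h>

/* Layout: LEAF tag followed by an int payload, or NODE tag followed by
 * the packed left subtree, then the packed right subtree. */
enum { LEAF = 0, NODE = 1 };

static int sum_packed(const unsigned char **cur)
{
    if (*(*cur)++ == LEAF) {
        int v = *(const int *)*cur;   /* payload (alignment ignored in sketch) */
        *cur += sizeof(int);
        return v;
    }
    int left = sum_packed(cur);       /* cursor now points at the right subtree */
    return left + sum_packed(cur);
}

int main(void)
{
    /* (Node (Leaf 1) (Node (Leaf 2) (Leaf 3))) */
    unsigned char buf[64], *w = buf;
    *w++ = NODE; *w++ = LEAF; *(int *)w = 1; w += sizeof(int);
    *w++ = NODE; *w++ = LEAF; *(int *)w = 2; w += sizeof(int);
    *w++ = LEAF; *(int *)w = 3; w += sizeof(int);

    const unsigned char *cur = buf;
    printf("%d\n", sum_packed(&cur));  /* prints 6 */
    return 0;
}
```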

Infographics: Operation Costs in CPU Clock Cycles

The text discusses the operation costs in CPU clock cycles for different types of operations, including simple operations, floating-point operations, and vector operations. It highlights that memory involvement can significantly impact operation costs, with some operations taking as little as 1 CPU cycle. Different CPU architectures and types of operations can result in varying costs, with some operations requiring specialized CPU support to work efficiently.

KHM+15

The text discusses a formal C memory model that supports integer-pointer casts, essential for low-level C programming. It proposes a quasi-concrete memory model that allows standard compiler optimizations while fully supporting integer-pointer casts. This model helps verify programs and optimizations that are challenging to validate with integer-pointer casts.

Xv6, a simple Unix-like teaching operating system

Xv6 is a teaching operating system developed by MIT for their operating systems course. It is based on Unix V6, written in ANSI C, and runs on Intel x86 machines. The xv6 source code is available on GitHub and is used in lectures to teach operating system concepts.

C Is Not a Low-level Language

C is often considered a low-level language, but this article argues that it is not. The author explains that vulnerabilities like Spectre and Meltdown occurred because processor architects were trying to build fast processors that exposed the same abstract machine as a PDP-11, which C programmers believe is close to the underlying hardware. However, the reality is that C code runs on a complex compiler that performs intricate transformations to achieve the desired performance. The article also discusses how C's memory model and optimizations make it difficult to understand and can lead to undefined behavior. The author suggests that instead of trying to make C code fast, it may be time to explore programming models on processors designed for speed.

In-depth analysis on Valorant’s Guarded Regions

The text discusses how Valorant's anti-cheat system, Vanguard, uses innovative techniques to protect against memory manipulation by whitelisting threads and creating shadow regions. These methods involve cloning and modifying the game's paging tables to allow access to hidden memory without affecting performance. By implementing these advanced security measures, Vanguard effectively prevents cheats from bypassing its guarded regions.

Exploit Development: No Code Execution? No Problem! Living The Age of VBS, HVCI, and Kernel CFG

The text discusses various techniques used in exploit development, particularly focusing on targeting the Windows kernel. It mentions concepts like Hypervisor-Protected Code Integrity (HVCI) and how exploits can manipulate memory to execute attacker-controlled code in kernel mode. The text also delves into details like leaking kernel-mode memory, constructing ROP chains on the kernel-mode stack, and utilizing functions like NtQuerySystemInformation to escalate privileges and perform malicious actions in the system.

Udacity CS344: Intro to Parallel Programming

Intro to Parallel Programming is a free online course by NVIDIA and Udacity teaching parallel computing with CUDA. It's for developers, scientists, engineers, and students looking to learn about GPU programming and optimization. The course is self-paced, requires C programming knowledge, and offers approximately 21 hours of content.

When FFI Function Calls Beat Native C

David Yu performed a benchmark comparing different Foreign Function Interfaces (FFI) for function calls. LuaJIT's FFI was found to be faster than native C function calls due to efficient dynamic function call handling. Direct function calls, like those used by LuaJIT, can outperform indirect calls routed through a Procedure Linkage Table (PLT).
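
A sketch of the distinction (my example, glibc-specific library name): a normal call into a shared library goes through the PLT, an extra indirect jump through a lazily-bound table, whereas calling through a pointer resolved once with dlsym skips that stub, much as LuaJIT's FFI calls a cached address directly. Link with -ldl.

```c
#include <dlfcn.h>
#include <stdio.h>

int main(void)
{
    void *libm = dlopen("libm.so.6", RTLD_NOW);   /* glibc's math library */
    if (!libm) return 1;

    /* resolve once, then call through the pointer: no PLT stub involved */
    double (*p_sin)(double) = (double (*)(double))dlsym(libm, "sin");
    printf("%f\n", p_sin(1.0));

    dlclose(libm);
    return 0;
}
```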

What Is The Minimal Set Of Optimizations Needed For Zero-Cost Abstraction?

Rust and C++ offer "zero-cost abstractions" where high-level code compiles to low-level code without added runtime overhead, but enabling necessary compiler optimizations can slow down compilation and impact debugging. The challenge is to find the minimal set of optimizations that maintain zero-cost abstractions while improving build speed and debug information quality. Balancing fast debuggable builds with zero-cost abstractions is crucial for performance and developer experience in languages like Rust and C++.

Using ASCII waveforms to test hardware designs

Using expect tests automates the validation of code output, detecting errors efficiently. Jane Street uses Hardcaml in OCaml for hardware development, simplifying testbench creation. Waveform expect tests help visualize hardware behavior, improving development workflows.

Your ABI is Probably Wrong

The text discusses how most ABIs have a design flaw that harms performance by passing large structures inefficiently. Different ABIs handle passing large structures differently, but they all repeat the same mistakes. A correctly-specified ABI should pass large structures by immutable reference to avoid unnecessary copies.
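
The issue in miniature (my example): under most C ABIs the by-value version forces the caller to copy all 128 bytes onto the stack at every call, while the by-reference version passes a single pointer in a register. C has no immutable references, so a const pointer is the closest spelling of what the article recommends.

```c
/* A 128-byte structure. */
struct big { long v[16]; };

long sum_by_value(struct big b)          /* caller copies 128 bytes */
{
    long s = 0;
    for (int i = 0; i < 16; i++) s += b.v[i];
    return s;
}

long sum_by_ref(const struct big *b)     /* caller passes one pointer */
{
    long s = 0;
    for (int i = 0; i < 16; i++) s += b->v[i];
    return s;
}
```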

BSTJ 57: 6. July-August 1978: The UNIX Time-Sharing System. (Ritchie, D.M.; Thompson, K.)

The UNIX Time-Sharing System is a versatile operating system with unique features. It runs on Digital Equipment Corporation computers and emphasizes simplicity and ease of use. UNIX has been widely adopted for research, education, and document preparation purposes.

Writing CUDA Kernels for PyTorch

The text shows the thread distribution on different streaming multiprocessors (SM) in CUDA. Threads are organized into warps, lanes, and specific thread numbers within each SM. This information is crucial for optimizing CUDA kernels in PyTorch.

An Introduction to Assembly Programming with RISC-V

A free book introducing assembly programming with RISC-V, published by riscv-programming.org (ISBN 978-65-00-15811-3).

Chapter 2 Basics of SIMD Programming

The text explains how to organize data for SIMD operations and provides examples of SIMD-Ready Vectors. It also discusses the relationship between vectors and scalars in SIMD programming. Built-in functions for VMX instructions and SIMD operation principles are outlined in the text.

Matrix Multiplication on CPU

Marek Kolodziej's article on matrix multiplication on CPU, published at marek.ai.

How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance: a Worklog

A worklog by Simon Boehm on optimizing a CUDA matmul kernel to approach cuBLAS performance, published at siboehm.com.

Where do LLMs spend their FLOPS?

LLMs (large language models) spend their FLOPS (floating point operations) on various tasks, including computing QKV (query, key, value) matrices, attention output matrices, and running the feed-forward network (FFN). The attention mechanism plays a crucial role in LLMs, even though the FLOPS required for attention calculations are relatively small. The KV cache, which stores information for each token, requires significant memory but is necessary for generating sequences. Different architectural choices, such as grouped query attention and sliding window attention, can affect the size and efficiency of the KV cache. Increasing the number of layers in an LLM linearly scales the FLOPS and parameters, while increasing the model width quadratically scales the model size. Wider models parallelize better, while deeper models increase inference time linearly.
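
Two standard rules of thumb behind this accounting (common approximations, not figures quoted from the post):

```latex
\text{forward FLOPs per token} \approx 2\,N_{\text{params}}, \qquad
\text{KV cache bytes per token} = 2 \cdot n_{\text{layers}} \cdot n_{\text{kv\_heads}} \cdot d_{\text{head}} \cdot b_{\text{bytes/param}}
```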

The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits

Recent research is leading to a new era of 1-bit Large Language Models (LLMs), such as BitNet, introducing a variant called BitNet b1.58 where every parameter is ternary {-1, 0, 1}. This model matches the performance of full-precision Transformer LLMs while being more cost-effective in terms of latency, memory, throughput, and energy consumption. The 1.58-bit LLM sets a new standard for training high-performance and cost-effective models, paving the way for new computation methods and specialized hardware designed for 1-bit LLMs.
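
Why ternary weights invite new computation methods, in one small sketch (my illustration, not BitNet's kernel): with weights restricted to {-1, 0, +1}, a dot product needs no multiplications at all; each weight either adds, subtracts, or skips an activation.

```c
/* Ternary-weight dot product: multiply-free. */
float ternary_dot(const signed char *w, const float *x, int n)
{
    float acc = 0.0f;
    for (int i = 0; i < n; i++) {
        if (w[i] == 1)       acc += x[i];
        else if (w[i] == -1) acc -= x[i];
        /* w[i] == 0: contributes nothing */
    }
    return acc;
}
```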
