Bookmarks
03 CUDA Fundamental Optimization Part 1
Detailed lecture on foundational CUDA performance techniques—memory coalescing, occupancy, and kernel launch parameters—illustrated through hands-on code profiling and optimization steps.
ARM Assembly: Lesson 1 (MOV, Exit Syscall)
Step-by-step lesson introducing ARM assembly programming—registers, MOV instruction, SWI syscall, compiling, and emulation—providing foundational skills for low-level ARM development.
CRAFTING A CPU TO RUN PROGRAMS
Step-by-step project that assembles fundamental digital components into a functioning minimalist CPU, explaining instruction decoding, control signals, and the integration of prior ALU and memory modules.
Refterm Lecture Part 5 - Parsing with SIMD
Technical lecture showing how to accelerate text parsing by leveraging SIMD instructions, delving into low-level CPU mechanics, data alignment, and practical code optimization strategies.
CppCon 2016: Timur Doumler “Want fast C++? Know your hardware!”
CppCon talk illustrating how cache hierarchies, branch prediction, alignment, and SIMD influence C++ performance and providing guidelines for writing hardware-conscious, high-speed code.
C++ cache locality and branch predictability
Practical C++ demonstration of how cache locality and branch prediction affect real-world runtime, showcasing code patterns and optimizations to exploit modern CPU behavior for faster programs.
The Tech Poutine #23: AMD's Moving to 2nm
Long-form industry analysis show covering semiconductor manufacturing roadmaps, AMD’s 2 nm “Venice” chiplets, yield calculations, HBM4, CHIPS Act developments, and organizational changes at Intel and Nvidia—providing practitioners with deep context on cutting-edge processor and foundry hardware.
HOW TRANSISTORS REMEMBER DATA
Clear explanation of how memory storage works at the transistor level, valuable for understanding computer architecture fundamentals.
Dylan Patel - Inference Math, Simulation, and AI Megaclusters - Stanford CS 229S - Autumn 2024
Stanford CS 229S lecture on large-scale inference math, simulation, and AI megaclusters—advanced technical content useful to ML researchers and engineers.
BLAZINGLY FAST C++ Optimizations
Walkthrough of techniques for writing high-performance C++ code, relevant to software optimization and best practices.
Fujitsu’s New ARM Chip: Focused, Fast, and Unlike Anything Else
Concise technical analysis of a forthcoming ARMv9 CPU, covering micro-architectural features, packaging, and compiler strategy—highly relevant to computer-architecture enthusiasts.
Memristors for Analog AI Chips
Technical overview of memristor technology and its role in power-efficient analog in-memory computing for AI accelerators.
1.2 - Racing Down the Slopes of Moore’s Law (Bram Nauta)
Keynote analyzing the limits of Moore’s Law scaling and advocating mixed-signal and ADC-centric approaches for power-efficient RF/digital design.
How does Groq LPU work? (w/ Head of Silicon Igor Arsovski!)
Deep technical interview on Groq’s Language Processing Unit architecture—single-cycle SIMD fabric, compiler stack, and network scaling versus GPUs.
How do CPUs read machine code? — 6502 part 2
Details how a 6502 CPU fetches and decodes machine instructions, covering control lines, micro-sequencing, and timing states at the transistor level.