Bookmarks

03 CUDA Fundamental Optimization Part 1

Detailed lecture on foundational CUDA performance techniques—memory coalescing, occupancy, and kernel launch parameters—illustrated through hands-on code profiling and optimization steps.

ARM Assembly: Lesson 1 (MOV, Exit Syscall)

Step-by-step lesson introducing ARM assembly programming—registers, MOV instruction, SWI syscall, compiling, and emulation—providing foundational skills for low-level ARM development.

CRAFTING A CPU TO RUN PROGRAMS

Step-by-step project that assembles fundamental digital components into a functioning minimalist CPU, explaining instruction decoding, control signals, and integration of prior ALU and memory modules.

Refterm Lecture Part 5 - Parsing with SIMD

Technical lecture showing how to accelerate text parsing by leveraging SIMD instructions, delving into low-level CPU mechanics, data alignment, and practical code optimization strategies.
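The core idea from the lecture — comparing many bytes per instruction instead of one at a time — can be sketched with SSE2 intrinsics. This is an illustrative example under my own assumptions (function name and structure are hypothetical), not refterm's actual code:

```cpp
#include <emmintrin.h>  // SSE2 intrinsics (baseline on x86-64)
#include <cstddef>

// Find the first occurrence of `target` in `data`, scanning 16 bytes per step.
// Hypothetical sketch of the SIMD-scan idea; not taken from the Refterm source.
std::size_t simd_find(const char* data, std::size_t len, char target) {
    const __m128i needle = _mm_set1_epi8(target);     // broadcast target into all 16 lanes
    std::size_t i = 0;
    for (; i + 16 <= len; i += 16) {
        __m128i chunk = _mm_loadu_si128(
            reinterpret_cast<const __m128i*>(data + i));            // unaligned 16-byte load
        int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(chunk, needle)); // one bit per matching byte
        if (mask)
            return i + __builtin_ctz(mask);           // lowest set bit = first match in chunk
    }
    for (; i < len; ++i)                              // scalar tail for the last <16 bytes
        if (data[i] == target) return i;
    return len;                                       // not found
}
```

The unaligned load keeps the sketch simple; the lecture's point about data alignment is that aligned loads (or aligning the buffer first) can be cheaper on some microarchitectures.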

CppCon 2016: Timur Doumler “Want fast C++? Know your hardware!”

CppCon talk illustrating how cache hierarchies, branch prediction, alignment, and SIMD influence C++ performance and providing guidelines for writing hardware-conscious, high-speed code.

C++ cache locality and branch predictability

Practical C++ demonstration of how cache locality and branch prediction affect real-world runtime, showcasing code patterns and optimizations to exploit modern CPU behaviour for faster programs.
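The cache-locality pattern these two talks center on can be shown with matrix traversal order. A minimal sketch (function names are my own, not from the video): both loops compute the same sum, but only the first touches memory at unit stride.

```cpp
#include <vector>
#include <cstddef>

// Sum a row-major matrix by rows: consecutive iterations read adjacent
// doubles, so every cache line fetched is fully consumed.
double sum_row_order(const std::vector<double>& m,
                     std::size_t rows, std::size_t cols) {
    double s = 0.0;
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            s += m[r * cols + c];   // unit stride: cache friendly
    return s;
}

// Same sum by columns: each step jumps `cols` doubles ahead, so for large
// matrices nearly every access misses the cache despite identical results.
double sum_col_order(const std::vector<double>& m,
                     std::size_t rows, std::size_t cols) {
    double s = 0.0;
    for (std::size_t c = 0; c < cols; ++c)
        for (std::size_t r = 0; r < rows; ++r)
            s += m[r * cols + c];   // stride of `cols`: cache hostile
    return s;
}
```

On matrices much larger than the last-level cache, the row-order version is typically several times faster even though both loops do the same arithmetic — the branch-prediction half of the demonstration works analogously (sorted vs. unsorted data fed to a conditional).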

The Tech Poutine #23: AMD's Moving to 2nm

Long-form industry analysis show covering semiconductor manufacturing roadmaps, AMD’s 2 nm “Venice” chiplets, yield calculations, HBM4, CHIPS Act developments, and organizational changes at Intel and Nvidia—providing practitioners with deep context on cutting-edge processor and foundry hardware.

HOW TRANSISTORS REMEMBER DATA

Clear explanation of how memory storage works at the transistor level, valuable for understanding computer architecture fundamentals.

Dylan Patel - Inference Math, Simulation, and AI Megaclusters - Stanford CS 229S - Autumn 2024

Stanford CS 229S lecture on large-scale inference math and AI megaclusters—advanced technical content useful to ML researchers and engineers.

BLAZINGLY FAST C++ Optimizations

Overview of techniques for writing high-performance C++ code, covering optimization strategies and performance best practices.

One System, Eight Tenstorrent Wormholes

NVIDIA Doesn't Care About GPUs

Fujitsu’s New ARM Chip: Focused, Fast, and Unlike Anything Else

Concise technical analysis of a forthcoming ARMv9 CPU, covering micro-architectural features, packaging, and compiler strategy—highly relevant to computer-architecture enthusiasts.

The Lab That Invented The 21st Century

Memristors for Analog AI Chips

Technical overview of memristor technology and its role in power-efficient analog in-memory computing for AI accelerators.

1.2 - Racing Down the Slopes of Moore’s Law (Bram Nauta)

Keynote analyzing the limits of Moore’s Law scaling and advocating mixed-signal and ADC-centric approaches for power-efficient RF/digital design.

How does Groq LPU work? (w/ Head of Silicon Igor Arsovski!)

Deep technical interview on Groq’s Language Processing Unit architecture—single-cycle SIMD fabric, compiler stack, and network scaling versus GPUs.

How do CPUs read machine code? — 6502 part 2

Details how a 6502 CPU fetches and decodes machine instructions, covering control lines, micro-sequencing, and timing states at the transistor level.
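The fetch-decode loop described in the video can be sketched as a tiny emulator. The two opcodes used, 0xA9 (LDA immediate) and 0x00 (BRK), match real 6502 encodings; everything else here is a deliberately simplified assumption, far from the transistor-level detail the video covers:

```cpp
#include <cstdint>
#include <array>

// Minimal fetch-decode-execute sketch in the spirit of the 6502 walkthrough:
// the program counter addresses memory, the fetched byte selects an
// operation, and operands are read as additional fetches.
struct Cpu6502Sketch {
    std::uint16_t pc = 0;                    // program counter
    std::uint8_t  a  = 0;                    // accumulator
    std::array<std::uint8_t, 256> mem{};     // tiny flat memory for the demo

    void run() {
        for (;;) {
            std::uint8_t opcode = mem[pc++];     // fetch
            switch (opcode) {                    // decode
                case 0xA9: a = mem[pc++]; break; // LDA #imm: load accumulator
                case 0x00: return;               // BRK: treated as "halt" here
                default:   return;               // unimplemented opcode: stop
            }
        }
    }
};
```

A real 6502 sequences each instruction over multiple clock cycles with internal control signals — the part of the video this single-switch sketch collapses into one step.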

Building an Open Future

Tenstorrent first thoughts

Tiny Tapeout
