Bookmarks

How to Think About GPUs

03 CUDA Fundamental Optimization Part 1

Detailed lecture on foundational CUDA performance techniques—memory coalescing, occupancy, and kernel launch parameters—illustrated through hands-on code profiling and optimization steps.
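
For reference, a minimal CUDA sketch of the coalescing idea covered here (not taken from the lecture; kernel names, block size, and the profiling hint are illustrative):

    // Coalesced: consecutive threads read consecutive floats, so each warp's
    // 32 loads combine into a few wide memory transactions.
    __global__ void copy_coalesced(const float* in, float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];
    }

    // Strided: consecutive threads touch addresses `stride` elements apart,
    // so each warp issues many separate transactions and wastes bandwidth.
    __global__ void copy_strided(const float* in, float* out, int n, int stride) {
        int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
        if (i < n) out[i] = in[i];
    }

    // Launch parameters: 256 threads per block is a common starting point when
    // tuning occupancy; profile (e.g. with Nsight Compute) before settling on it.
    // copy_coalesced<<<(n + 255) / 256, 256>>>(d_in, d_out, n);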

ARM Assembly: Lesson 1 (MOV, Exit Syscall)

Step-by-step lesson introducing ARM assembly programming—registers, MOV instruction, SWI syscall, compiling, and emulation—providing foundational skills for low-level ARM development.

Refterm Lecture Part 5 - Parsing with SIMD

Technical lecture showing how to accelerate text parsing by leveraging SIMD instructions, delving into low-level CPU mechanics, data alignment, and practical code optimization strategies.
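
As a rough illustration of the approach (a generic sketch using SSE2 intrinsics and a GCC/Clang builtin, not the refterm code itself), a scan that tests 16 bytes per step for a delimiter:

    #include <emmintrin.h>  // SSE2 intrinsics
    #include <cstddef>

    // Returns the index of the first '\n' in buf, or len if none is found.
    // Scans 16 bytes per iteration instead of one byte at a time.
    std::size_t find_newline_sse2(const char* buf, std::size_t len) {
        const __m128i newline = _mm_set1_epi8('\n');
        std::size_t i = 0;
        for (; i + 16 <= len; i += 16) {
            __m128i chunk = _mm_loadu_si128(reinterpret_cast<const __m128i*>(buf + i));
            int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(chunk, newline));
            if (mask != 0)
                return i + __builtin_ctz(mask);  // lowest set bit = first match
        }
        for (; i < len; ++i)                     // scalar tail
            if (buf[i] == '\n') return i;
        return len;
    }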

CppCon 2016: Timur Doumler “Want fast C++? Know your hardware!”

CppCon talk illustrating how cache hierarchies, branch prediction, alignment, and SIMD influence C++ performance and providing guidelines for writing hardware-conscious, high-speed code.

C++ cache locality and branch predictability

Practical C++ demonstration of how cache locality and branch prediction affect real-world runtime, showcasing code patterns and optimizations to exploit modern CPU behaviour for faster programs.
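
Both effects can be reproduced in a few lines of generic C++ (a sketch of the standard demonstrations, not the video's own code):

    #include <vector>
    #include <cstdint>

    // Cache locality: summing a row-major matrix row-by-row walks memory
    // sequentially; column-by-column jumps `cols` elements per access and
    // misses the cache far more often, despite identical arithmetic.
    int64_t sum_row_major(const std::vector<int>& m, int rows, int cols) {
        int64_t s = 0;
        for (int r = 0; r < rows; ++r)
            for (int c = 0; c < cols; ++c)
                s += m[r * cols + c];      // stride-1 access
        return s;
    }

    int64_t sum_col_major(const std::vector<int>& m, int rows, int cols) {
        int64_t s = 0;
        for (int c = 0; c < cols; ++c)
            for (int r = 0; r < rows; ++r)
                s += m[r * cols + c];      // stride-`cols` access
        return s;
    }

    // Branch predictability: the same filter loop runs much faster when `v`
    // is sorted, because the `x >= 128` branch then changes direction once
    // instead of unpredictably on every element.
    int64_t sum_large(const std::vector<int>& v) {
        int64_t s = 0;
        for (int x : v)
            if (x >= 128) s += x;
        return s;
    }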

The Tech Poutine #23: AMD's Moving to 2nm

Long-form industry analysis show covering semiconductor manufacturing roadmaps, AMD’s 2 nm “Venice” chiplets, yield calculations, HBM4, CHIPS Act developments, and organizational changes at Intel and Nvidia—providing practitioners with deep context on cutting-edge processor and foundry hardware.
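
For context on the yield calculations mentioned, the simplest widely used model is the Poisson die-yield estimate; the sketch below uses placeholder numbers and is not the episode's methodology:

    #include <cmath>
    #include <cstdio>

    int main() {
        const double pi           = 3.14159265358979323846;
        const double die_area_cm2 = 1.0;   // placeholder die size (~100 mm^2)
        const double d0           = 0.1;   // assumed defect density, defects/cm^2
        const double wafer_d_cm   = 30.0;  // 300 mm wafer

        // Poisson yield model: fraction of dies with zero killer defects.
        double yield = std::exp(-die_area_cm2 * d0);

        // Standard approximation for gross dies per wafer
        // (area term minus an edge-loss term).
        double dies = (pi * wafer_d_cm * wafer_d_cm / 4.0) / die_area_cm2
                    - (pi * wafer_d_cm) / std::sqrt(2.0 * die_area_cm2);

        std::printf("yield %.1f%%, ~%.0f gross dies, ~%.0f good dies\n",
                    100.0 * yield, dies, dies * yield);
    }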

HOW TRANSISTORS REMEMBER DATA

Clear explanation of how memory storage works at the transistor level, valuable for understanding computer architecture fundamentals.

Dylan Patel - Inference Math, Simulation, and AI Megaclusters - Stanford CS 229S - Autumn 2024

Stanford CS 229S lecture on large-scale inference math and AI megaclusters—direct, advanced technical content useful to ML researchers and engineers.
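
A back-of-the-envelope sample of this kind of inference math (illustrative, assumed numbers, not figures from the lecture): dense-transformer decode costs roughly 2 FLOPs per parameter per token, but at batch size 1 the limit is how fast the weights stream from HBM:

    #include <cstdio>

    int main() {
        const double params      = 70e9;    // assumed 70B-parameter dense model
        const double bytes_per_w = 2.0;     // FP16/BF16 weights
        const double hbm_bw      = 3.35e12; // assumed ~3.35 TB/s (H100-class)
        const double peak_flops  = 1e15;    // assumed ~1 PFLOP/s dense FP16/BF16

        double flops_per_token = 2.0 * params;           // ~2 FLOPs per weight
        double compute_time    = flops_per_token / peak_flops;
        double weight_bytes    = params * bytes_per_w;
        double memory_time     = weight_bytes / hbm_bw;  // one full weight read

        std::printf("compute-bound time/token:   %.2f ms\n", compute_time * 1e3);
        std::printf("bandwidth-bound time/token: %.1f ms (~%.0f tok/s at batch 1)\n",
                    memory_time * 1e3, 1.0 / memory_time);
    }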

Modern CPUs Assign Registers To Speed Up Your Code - Computerphile

BLAZINGLY FAST C++ Optimizations

Focuses on techniques for writing high-performance C++ code and practical optimization best practices.

Designing in 2023: 10 Problems to Solve w/ Jim Keller

One System, Eight Tenstorrent Wormholes

The Genius of RISC-V Microprocessors - Erik Engheim - ACCU 2022

Tenstorrent: Relegating the Important Stuff to the Compiler

Past, Present & Future of AI Compute (Panel) | Beyond CUDA Summit 2025

NVIDIA Doesn't Care About GPUs

Fujitsu’s New ARM Chip: Focused, Fast, and Unlike Anything Else

Concise technical analysis of a forthcoming ARMv9 CPU, covering micro-architectural features, packaging, and compiler strategy—highly relevant to computer-architecture enthusiasts.

RISC-V and the CPU Revolution, Yunsup Lee, Samsung Forum

Building precision machines is simple, until it isn't.

Gordon Moore: Behind the Ubiquitous Microchip

The Lab That Invented The 21st Century

imec — the most important company you've never heard of

Memristors for Analog AI Chips

Technical overview of memristor technology and its role in power-efficient analog in-memory computing for AI accelerators.
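
The principle in a few lines (a numerical sketch, not a device model): with weights stored as conductances G and voltages V applied to the rows, the column currents give I = G^T V by Ohm's and Kirchhoff's laws, i.e. a matrix-vector product in one analog step:

    #include <vector>
    #include <cstddef>

    // Ideal crossbar: column current = sum over rows of G[r][c] * V[r].
    // A real array adds wire resistance, device nonlinearity and noise,
    // which is why such chips pair the crossbar with ADCs and calibration.
    std::vector<double> crossbar_mvm(const std::vector<std::vector<double>>& G,
                                     const std::vector<double>& V) {
        std::vector<double> I(G[0].size(), 0.0);
        for (std::size_t r = 0; r < G.size(); ++r)
            for (std::size_t c = 0; c < G[r].size(); ++c)
                I[c] += G[r][c] * V[r];
        return I;
    }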

1.2 - Racing Down the Slopes of Moore’s Law (Bram Nauta)

Keynote analyzing the limits of Moore's Law scaling and advocating mixed-signal, ADC-centric approaches to power-efficient RF/digital design.

How does Groq LPU work? (w/ Head of Silicon Igor Arsovski!)

Deep technical interview on Groq’s Language Processing Unit architecture—single-cycle SIMD fabric, compiler stack, and network scaling versus GPUs.

Intel's Crazy Plan for AI Chips IS WORKING! (Supercut)

Hardware SPI Continued - IO from Scratch - Part 7

Learn how to design your own custom computer chips!

Building an Open Future

Implementation of simple microprocessor using verilog

Scoping out the Tenstorrent Wormhole

Tenstorrent first thoughts

Tiny Tapeout

Subcategories