Bookmarks
Taking a Look at Compression Algorithms
Dissecting various compression algorithms.
attention is logarithmic, actually
Time complexity is a very bad model when working with parallelism; makes the case for work-depth analysis instead.
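As a toy illustration of the distinction (mine, not the post's): a tree reduction over n elements performs O(n) total additions (work) but only O(log n) sequential rounds (depth).

```python
def parallel_sum(xs):
    """Tree reduction: counts total additions (work) and rounds (depth)."""
    work = depth = 0
    while len(xs) > 1:
        # In one round, all adjacent pairs could be added in parallel.
        pairs = [xs[i] + xs[i + 1] for i in range(0, len(xs) - 1, 2)]
        carry = [xs[-1]] if len(xs) % 2 else []
        xs = pairs + carry
        work += len(pairs)
        depth += 1
    return xs[0], work, depth

print(parallel_sum(list(range(1024))))  # sum, work = 1023, depth = 10
```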
Are Transformers universal approximators of sequence-to-sequence functions?
Despite the widespread adoption of Transformer models for NLP tasks, the expressive power of these models is not well-understood. In this paper, we establish that Transformer models are universal approximators of continuous permutation equivariant sequence-to-sequence functions with compact support, which is quite surprising given the amount of shared parameters in these models. Furthermore, using positional encodings, we circumvent the restriction of permutation equivariance, and show that Transformer models can universally approximate arbitrary continuous sequence-to-sequence functions on a compact domain. Interestingly, our proof techniques clearly highlight the different roles of the self-attention and the feed-forward layers in Transformers. In particular, we prove that fixed width self-attention layers can compute contextual mappings of the input sequences, playing a key role in the universal approximation property of Transformers. Based on this insight from our analysis, we consider other simpler alternatives to self-attention layers and empirically evaluate them.
Training Large Language Models to Reason in a Continuous Latent Space
Large language models (LLMs) are restricted to reason in the "language space", where they typically express the reasoning process with a chain-of-thought (CoT) to solve a complex reasoning problem. However, we argue that language space may not always be optimal for reasoning. For example, most word tokens are primarily for textual coherence and not essential for reasoning, while some critical tokens require complex planning and pose huge challenges to LLMs. To explore the potential of LLM reasoning in an unrestricted latent space instead of using natural language, we introduce a new paradigm Coconut (Chain of Continuous Thought). We utilize the last hidden state of the LLM as a representation of the reasoning state (termed "continuous thought"). Rather than decoding this into a word token, we feed it back to the LLM as the subsequent input embedding directly in the continuous space. Experiments show that Coconut can effectively augment the LLM on several reasoning tasks. This novel latent reasoning paradigm leads to emergent advanced reasoning patterns: the continuous thought can encode multiple alternative next reasoning steps, allowing the model to perform a breadth-first search (BFS) to solve the problem, rather than prematurely committing to a single deterministic path like CoT. Coconut outperforms CoT in certain logical reasoning tasks that require substantial backtracking during planning, with fewer thinking tokens during inference. These findings demonstrate the promise of latent reasoning and offer valuable insights for future research.
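A minimal sketch of the feedback loop as I read it; `model` is a hypothetical transformer callable mapping a sequence of input embeddings to hidden states, and the shapes are my assumptions, not the paper's code.

```python
import torch

def continuous_thoughts(model, input_embs, n_thoughts):
    # Instead of decoding each step into a word token, append the last
    # hidden state directly as the next input embedding, so reasoning
    # proceeds in continuous latent space rather than language space.
    embs = input_embs                         # (batch, seq, d_model)
    for _ in range(n_thoughts):
        hidden = model(embs)                  # (batch, seq, d_model)
        thought = hidden[:, -1:, :]           # last position's state
        embs = torch.cat([embs, thought], dim=1)
    return embs                               # inputs + continuous thoughts
```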
What Is ChatGPT Doing … and Why Does It Work?
Stephen Wolfram explores the broader picture of what's going on inside ChatGPT and why it produces meaningful text. Discusses models, training neural nets, embeddings, tokens, transformers, language syntax.
Programming Really Is Simple Mathematics
A re-construction of the fundamentals of programming as a small mathematical theory (PRISM) based on elementary set theory. Highlights:
- Zero axioms: no properties are assumed, all are proved (from standard set theory).
- A single concept covers both specifications and programs.
- Its definition involves only one relation and one set.
- Everything proceeds from three operations: choice, composition, and restriction (sketched below).
- These techniques suffice to derive the axioms of classic papers on the "laws of programming" as consequences and prove them mechanically.
- The ordinary subset operator suffices to define both the notion of program correctness and the concepts of specialization and refinement.
- From this basis, the theory deduces dozens of theorems characterizing important properties of programs and programming.
- All these theorems have been mechanically verified (using Isabelle/HOL); the proofs are available in a public repository.
This paper is a considerable extension and rewrite of an earlier contribution [arXiv:1507.00723].
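A hedged sketch of the three operations and subset-as-refinement, modeling a program or specification as a set of (input, output) pairs; these are the standard set-theoretic definitions one would expect, assumed rather than quoted from the paper.

```python
def choice(p, q):
    """Nondeterministic choice: either behaviour is allowed."""
    return p | q

def compose(p, q):
    """Sequential composition: run p, feed its output to q."""
    return {(a, c) for (a, b) in p for (b2, c) in q if b == b2}

def restrict(c, p):
    """Restriction of p to inputs drawn from the set c."""
    return {(a, b) for (a, b) in p if a in c}

def refines(p, q):
    """Refinement as the ordinary subset operator:
    p allows only behaviours that q allows."""
    return p <= q

double = {(n, 2 * n) for n in range(5)}
double_or_zero = double | {(n, 0) for n in range(5)}
print(refines(double, double_or_zero))  # True: double is more deterministic
```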
TLA+ is hard to learn
I'm a fan of the formal specification language TLA+. With TLA+, you can build models of programs or systems, which helps you reason about their behavior; it is particularly useful for reasoning about concurrent systems.
How hard is constraint programming?
Writing code with the Z3 SMT solver differs from typical programming because of its mixed programming model, not unlike CUDA for GPUs. Here's what to expect.
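For a taste of that model: you state constraints declaratively and ask the solver for a satisfying assignment rather than computing one yourself. A minimal z3py example (mine, not from the article):

```python
from z3 import Int, Solver, sat

x, y = Int('x'), Int('y')
s = Solver()
s.add(x + y == 10, x > y, y >= 0)   # constraints, not an algorithm

if s.check() == sat:                # the solver searches for a model
    m = s.model()
    print(m[x], m[y])               # any assignment satisfying all constraints
```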
Competitive Programming
This is the supporting web page for a book titled: "Competitive Programming 4: The Lower Bound of Programming Contests in the 2020s" written by Steven Halim, Felix Halim, and Suhendry Effendy.
Numerical Recipes
We are Numerical Recipes, one of the oldest continuously operating sites on the Internet.
Tiled Matrix Multiplication
Tiled matrix multiplication is a standard GPU optimization that cuts global-memory traffic by staging sub-blocks (tiles) of the input matrices in fast shared memory. Threads in a block cooperatively load each tile once and reuse it across many multiply-accumulate operations, so far fewer trips to global memory are needed. The technique is central to performance in graphics rendering and machine learning.
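A CPU analogue of the idea in NumPy; the tiles here play the role of shared-memory blocks, and the tile size is illustrative.

```python
import numpy as np

def tiled_matmul(A, B, tile=64):
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    C = np.zeros((n, m), dtype=A.dtype)
    # Each output tile accumulates products of small A and B tiles, so a
    # tile is loaded once and reused many times, the same reuse a GPU
    # kernel gets by staging tiles in shared memory.
    for i in range(0, n, tile):
        for j in range(0, m, tile):
            for p in range(0, k, tile):
                C[i:i+tile, j:j+tile] += A[i:i+tile, p:p+tile] @ B[p:p+tile, j:j+tile]
    return C

A, B = np.random.rand(256, 256), np.random.rand(256, 256)
assert np.allclose(tiled_matmul(A, B), A @ B)
```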
Cache-Oblivious Algorithms
Cache-oblivious algorithms use processor caches efficiently without knowing cache parameters such as line size or capacity. By dividing a problem recursively, subproblems eventually fit in cache at every level of the memory hierarchy, so more of the computation happens in cache and less memory traffic is generated. This is especially valuable in parallel algorithms, where memory bandwidth is often the bottleneck.
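A classic instance of the recipe is cache-oblivious matrix transpose: recurse on the larger dimension until blocks are tiny, and they end up fitting every cache level without the code ever naming a cache size. A sketch; the base-case cutoff is arbitrary.

```python
import numpy as np

def co_transpose(src, dst):
    """Write src's transpose into dst by recursive halving."""
    n, m = src.shape
    if n * m <= 64:          # small block: fits comfortably in any cache
        dst[:] = src.T
        return
    if n >= m:               # split the larger dimension in half
        h = n // 2
        co_transpose(src[:h, :], dst[:, :h])
        co_transpose(src[h:, :], dst[:, h:])
    else:
        h = m // 2
        co_transpose(src[:, :h], dst[:h, :])
        co_transpose(src[:, h:], dst[h:, :])

a = np.arange(300.0).reshape(15, 20)
out = np.empty((20, 15))
co_transpose(a, out)
assert (out == a.T).all()
```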
Minimal Boolean Formulas
The post discusses how to compute the minimum number of AND and OR operators needed for Boolean functions with five variables. It describes the author's program that efficiently calculates this minimum for various functions while also improving algorithms for speed. The findings contribute to understanding the complexity of Boolean functions and their representations.
What is an Invariant?
Invariants are properties that hold true during the evolution of a system, helping to ensure correct behavior in programming. They can simplify reasoning about code, whether it’s for small algorithms or larger systems. By clearly defining invariants, programmers can create robust code and manage complex systems effectively.
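The canonical small example is a loop invariant; in the binary search below (my example, not the post's), the stated invariant is what makes the index arithmetic obviously correct.

```python
def binary_search(xs, target):
    """Return an index of target in sorted xs, or -1 if absent."""
    lo, hi = 0, len(xs)
    # Invariant: if target is in xs, its index lies in [lo, hi).
    while lo < hi:
        mid = (lo + hi) // 2
        if xs[mid] < target:
            lo = mid + 1     # xs[..mid] are all < target
        elif xs[mid] > target:
            hi = mid         # xs[mid..] are all > target
        else:
            return mid
    return -1                # interval empty: invariant says not present

assert binary_search([1, 3, 5, 8, 13], 8) == 3
assert binary_search([1, 3, 5, 8, 13], 4) == -1
```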
Data Compression Explained
Data compression splits into modeling (predicting the data) and coding (turning those predictions into bits). Coding is essentially a solved problem: modern compressors use arithmetic coding, which gets within a fraction of a bit of the model's entropy. Simpler schemes such as Huffman coding and run-length encoding trade some compression ratio for speed and simplicity.
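Huffman coding is the simplest of the coding schemes mentioned and small enough to sketch: repeatedly merge the two least frequent subtrees, then read codes off the tree. A minimal sketch, ignoring serialization and decoding.

```python
import heapq
from collections import Counter

def huffman_code(data):
    """Build a prefix code mapping symbol -> bitstring."""
    # Heap entries: (frequency, tiebreak, tree); a tree is a symbol or a pair.
    heap = [(f, i, sym) for i, (sym, f) in enumerate(Counter(data).items())]
    heapq.heapify(heap)
    i = len(heap)
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)   # two least frequent subtrees
        f2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, i, (t1, t2)))
        i += 1
    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            codes[tree] = prefix or "0"
    walk(heap[0][2], "")
    return codes

print(huffman_code("abracadabra"))  # frequent symbols get shorter codes
```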
The next fifty years
The text discusses the future of computing science over the next fifty years, emphasizing the importance of simplicity and elegance in design to prevent complexity. It highlights the close connection between program design and proof design, suggesting that advancements in program design can impact general mathematics. The author encourages embracing the opportunity to simplify processes and design systems that rely on formal mathematics.
Leslie Lamport
Leslie Lamport wrote several papers on verifying and specifying concurrent systems using TLA. He discovered algorithms through formal derivation and emphasized mechanical verification of concurrent algorithms. His work influenced the development of the TLAPS proof system.
Prospecting for Hash Functions
The text discusses the process of designing non-cryptographic integer hash functions, exploring different operations and constraints to create effective hash functions. It also compares various 32-bit hash functions and their bias levels, highlighting the search for high-quality hash functions with minimal bias for both 32-bit and 64-bit integers.
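The family being searched is xorshift-multiply style mixers. MurmurHash3's well-known 32-bit finalizer, a member of that family shown here for illustration rather than taken from the article, looks like this:

```python
MASK = 0xFFFFFFFF

def fmix32(x):
    """MurmurHash3's 32-bit finalizer: xor-shifts spread high bits
    down, odd multiplies spread low bits up."""
    x ^= x >> 16
    x = (x * 0x85EBCA6B) & MASK
    x ^= x >> 13
    x = (x * 0xC2B2AE35) & MASK
    x ^= x >> 16
    return x

# Nearby inputs map to very different outputs (avalanche):
print(hex(fmix32(1)), hex(fmix32(2)))
```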
Nanosystems
This text is about a book called "Nanosystems" by K. Eric Drexler, which is considered groundbreaking in the field of molecular nanotechnology. The book explains how to create manufacturing systems at the molecular level and discusses the significant impact nanotechnology will have on various industries. Experts praise the book for providing a foundation for future research in molecular systems engineering and molecular manufacturing.
Understanding Machine Learning: From Theory to Algorithms
The textbook by Shai Shalev-Shwartz and Shai Ben-David, covering the theoretical foundations of machine learning (PAC learning, VC dimension, convexity) through to practical algorithms such as SVMs, boosting, and neural networks.
chrono-Compatible Low-Level Date Algorithms
The text explains algorithms for handling dates and determining leap years. It includes functions for calculating the last day of a month and converting dates between different calendar systems. The algorithms are designed to be efficient and accurate for various date calculations.
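The centerpiece is `days_from_civil`, which maps a proleptic Gregorian date to a day count in constant time; shifting the year so March comes first pushes the leap day to the end, which is what makes the day-of-year formula clean. A Python port of Hinnant's C++ original:

```python
def days_from_civil(y, m, d):
    """Days since 1970-01-01 from a proleptic Gregorian date.
    Python's // is floor division, so the era needs no sign fix-up."""
    y -= m <= 2                          # Jan/Feb count as the prior year
    era = y // 400                       # 400-year Gregorian cycles
    yoe = y - era * 400                  # year of era, in [0, 399]
    doy = (153 * (m + (-3 if m > 2 else 9)) + 2) // 5 + d - 1
    doe = yoe * 365 + yoe // 4 - yoe // 100 + doy
    return era * 146097 + doe - 719468  # 719468 days from 0000-03-01 to epoch

assert days_from_civil(1970, 1, 1) == 0
assert days_from_civil(2000, 3, 1) == 11017
```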
Matrix multiplication in Mojo
Modular's documentation on implementing matrix multiplication in Mojo (docs.modular.com).
Matrix Multiplication on CPU
Marek Kolodziej's write-up on matrix multiplication on the CPU (marek.ai).
Numerical Recipes: The Art of Scientific Computing, Third Edition
Table of contents for the third edition, covering linear algebra, interpolation and extrapolation, integration of functions, evaluation of functions, special functions, random numbers, sorting and selection, root finding and nonlinear sets of equations, minimization or maximization of functions, eigensystems, and more.
Competitive Programmer's Handbook
The article discusses various algorithms and data structures used in computer programming, such as Kadane's algorithm, binary indexed trees, segment trees, Dijkstra's algorithm, and Floyd's algorithm. The author also explains concepts like successor graphs, index compression, and minimum spanning trees. The time complexity of each algorithm is also discussed.
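As a sample of the handbook's material, Kadane's algorithm (first on the list above) finds the maximum subarray sum in O(n); a sketch:

```python
def kadane(xs):
    """Maximum subarray sum in O(n): at each element, either extend
    the best subarray ending just before it, or start fresh."""
    best = cur = xs[0]
    for x in xs[1:]:
        cur = max(x, cur + x)
        best = max(best, cur)
    return best

assert kadane([-1, 3, -2, 5, -6]) == 6   # subarray [3, -2, 5]
```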
Subcategories
- applications (9)
- compression (9)
- computer_vision (8)
- deep_learning (94)
- ethics (2)
- generative_models (25)
- interpretability (17)
- natural_language_processing (24)
- optimization (7)
- recommendation (2)
- reinforcement_learning (11)
- supervised_learning (1)