Bookmarks

neural video codecs: the future of video compression

how deep learning could rewrite the way we encode and decode video

How Good Are Low-bit Quantized LLaMA3 Models? An Empirical Study

Meta's LLaMA family has become one of the most powerful open-source Large Language Model (LLM) series. Notably, the LLaMA3 models were recently released and achieve impressive performance across a wide range of tasks, thanks to super-large-scale pre-training on over 15T tokens of data. Given the wide application of low-bit quantization for LLMs in resource-limited scenarios, we explore LLaMA3's capabilities when quantized to low bit-width. This exploration holds the potential to unveil new insights and challenges for low-bit quantization of LLaMA3 and other forthcoming LLMs, especially in addressing the performance degradation problems that arise in LLM compression. Specifically, we evaluate 10 existing post-training quantization and LoRA fine-tuning methods on LLaMA3 at 1-8 bits and on diverse datasets to comprehensively reveal LLaMA3's low-bit quantization performance. Our experimental results indicate that LLaMA3 still suffers non-negligible degradation in these scenarios, especially at ultra-low bit-widths. This highlights the signif...
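
For intuition about what "quantized to low bit-width" means in practice, here is a minimal round-to-nearest sketch of quantizing a weight tensor to k bits (an illustrative toy with a single per-tensor scale, not any of the specific post-training methods the paper evaluates):

```python
import numpy as np

def quantize_rtn(w: np.ndarray, bits: int):
    """Symmetric round-to-nearest quantization with one per-tensor scale.

    Real PTQ methods (GPTQ, AWQ, ...) add per-group scales, calibration
    data, and error compensation; this only shows the basic rounding step.
    """
    qmax = 2 ** (bits - 1) - 1              # e.g. 127 for 8-bit, 7 for 4-bit
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q, scale

def dequantize(q, scale):
    return q * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((1024, 1024)).astype(np.float32) * 0.02

for bits in (8, 4, 2):
    q, s = quantize_rtn(w, bits)
    mse = np.mean((w - dequantize(q, s)) ** 2)
    print(f"{bits}-bit RTN reconstruction MSE: {mse:.2e}")
```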

1-bit Model

Quantizing small models like Llama2-7B to 1 bit yields poor output quality, but fine-tuning with low-rank adapters significantly improves it. The HQQ+ approach shows the potential of extreme low-bit quantization for machine learning models, reducing memory and compute requirements while maintaining performance. Training larger models with extreme quantization can yield better performance than training smaller models from scratch.
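
A rough sketch of the recipe behind this idea (hypothetical shapes and scales, not the actual HQQ+ implementation): binarize the frozen weight matrix to ±1 with a per-row scale, then add a small trainable low-rank correction, LoRA-style, so fine-tuning can repair the quantization error.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank = 512, 512, 16

W = rng.standard_normal((d_out, d_in)).astype(np.float32) * 0.02

# 1-bit quantization: keep only the sign, plus one scale per output row.
scale = np.abs(W).mean(axis=1, keepdims=True)   # per-row absmean scale (assumption)
W_1bit = np.sign(W)                             # every entry is -1 or +1

# Low-rank adapter: the only part that would be trained during fine-tuning.
A = rng.standard_normal((d_out, rank)).astype(np.float32) * 0.01
B = np.zeros((rank, d_in), dtype=np.float32)    # zero-init so the adapter starts as a no-op

def forward(x):
    """y = (scale * sign(W)) x + A (B x): frozen 1-bit base plus trainable low-rank fix."""
    return (scale * W_1bit) @ x + A @ (B @ x)

x = rng.standard_normal((d_in,)).astype(np.float32)
y_full = W @ x
y_q = forward(x)
print("relative error of 1-bit base vs. full precision:",
      np.linalg.norm(y_full - y_q) / np.linalg.norm(y_full))
```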

Human Knowledge Compression Contest

The Human Knowledge Compression Contest measures intelligence through data compression ratios. Better compression leads to better prediction and understanding, showcasing a link between compression and artificial intelligence. The contest aims to raise awareness of the relationship between compression and intelligence, encouraging the development of improved compressors.
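
The contest's metric is simply the size of the compressed file (plus decompressor) relative to a fixed text corpus. As a minimal stand-in, here is that ratio computed with a general-purpose compressor (zlib) on a toy string rather than the contest's actual corpus:

```python
import zlib

# Any English text works as a stand-in for the contest's Wikipedia corpus.
text = ("The Human Knowledge Compression Contest rewards programs that "
        "compress a fixed Wikipedia excerpt as small as possible. ") * 200
data = text.encode("utf-8")

compressed = zlib.compress(data, level=9)
ratio = len(compressed) / len(data)
print(f"original: {len(data)} bytes, compressed: {len(compressed)} bytes, "
      f"ratio: {ratio:.3f} ({8 * ratio:.2f} bits per character)")
```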

The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits

The article introduces a new era of 1-bit Large Language Models (LLMs) that can significantly reduce the cost of LLMs while maintaining their performance. BitNet b1.58 is a 1.58-bit LLM variant in which every parameter is ternary, taking on values of {-1, 0, 1}. It retains all the benefits of the original 1-bit BitNet, including its new computation paradigm, which requires almost no multiplication operations for matrix multiplication and can be highly optimized. Moreover, BitNet b1.58 offers two additional advantages: its modeling capability is stronger due to its explicit support for feature filtering, and it can match full precision (i.e., FP16) baselines in terms of both perplexity and end-task performance at a 3B size.
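
A minimal NumPy sketch of the ternary idea, roughly following the absmean-style quantization described for BitNet b1.58 (a single per-tensor scale is assumed here for simplicity):

```python
import numpy as np

def ternarize_absmean(W: np.ndarray):
    """Quantize a weight matrix to {-1, 0, +1} with an absmean scale."""
    gamma = np.abs(W).mean() + 1e-8            # per-tensor scale
    W_t = np.clip(np.round(W / gamma), -1, 1)  # every entry becomes -1, 0, or +1
    return W_t, gamma

rng = np.random.default_rng(0)
W = rng.standard_normal((512, 512)).astype(np.float32) * 0.02
W_t, gamma = ternarize_absmean(W)

# With ternary weights the matrix product needs only additions and subtractions
# in principle (NumPy still multiplies here, but the values are just -1/0/+1).
x = rng.standard_normal((512,)).astype(np.float32)
y = gamma * (W_t @ x)
print("values used:", np.unique(W_t),
      "relative error:", np.linalg.norm(W @ x - y) / np.linalg.norm(W @ x))
```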

The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits

Recent research is leading to a new era of 1-bit Large Language Models (LLMs), such as BitNet, introducing a variant called BitNet b1.58 where every parameter is ternary {-1, 0, 1}. This model matches the performance of full-precision Transformer LLMs while being more cost-effective in terms of latency, memory, throughput, and energy consumption. The 1.58-bit LLM sets a new standard for training high-performance and cost-effective models, paving the way for new computation methods and specialized hardware designed for 1-bit LLMs.

Language Modeling Is Compression (arXiv:2309.10668)

This article discusses the relationship between language modeling and compression. The authors argue that large language models can be viewed as powerful compressors due to their impressive predictive capabilities. They demonstrate that these models can achieve state-of-the-art compression rates across different data modalities, such as images and audio. The authors also explore the connection between compression and prediction, showing that models that compress well also generalize well. They conclude by advocating for the use of compression as a framework for studying and evaluating language models.
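
The underlying link is Shannon's source-coding view: a model that assigns probability p to the next symbol can encode it in about -log2(p) bits with an arithmetic coder, so better prediction directly means shorter codes. A toy sketch with a bigram character model, fit and scored on the same string purely to show the bookkeeping (not the paper's LLM setup):

```python
import math
from collections import Counter, defaultdict

text = "the quick brown fox jumps over the lazy dog. " * 50

# Fit a bigram character model: P(next char | previous char), add-one smoothed.
alphabet = sorted(set(text))
counts = defaultdict(Counter)
for prev, nxt in zip(text, text[1:]):
    counts[prev][nxt] += 1

def prob(prev, nxt):
    c = counts[prev]
    return (c[nxt] + 1) / (sum(c.values()) + len(alphabet))

# Code length an arithmetic coder would achieve with this model: sum of -log2(p).
bits = sum(-math.log2(prob(prev, nxt)) for prev, nxt in zip(text, text[1:]))
print(f"model code length: {bits / (len(text) - 1):.2f} bits/char "
      f"vs. {math.log2(len(alphabet)):.2f} bits/char for a uniform code")
```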

Pruning vs Quantization: Which is Better?

Neural network pruning and quantization are both used to compress deep neural networks. This paper provides an analytical comparison of the expected error introduced by each technique. The results show that in most cases quantization outperforms pruning; only at very high compression ratios can pruning be beneficial. The paper also discusses the hardware implications of both techniques and compares pruning and quantization in the post-training and fine-tuning settings.
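
A toy version of that comparison on a single synthetic weight tensor: magnitude pruning versus round-to-nearest quantization at a roughly comparable storage budget (a sketch only; the paper's analysis covers per-layer weight distributions, hardware costs, and fine-tuning):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal(1_000_000).astype(np.float32) * 0.02

def prune_error(W, keep_fraction):
    """Magnitude pruning: zero out the smallest weights, keep `keep_fraction`."""
    k = int(keep_fraction * W.size)
    thresh = np.sort(np.abs(W))[-k]
    W_p = np.where(np.abs(W) >= thresh, W, 0.0)
    return np.mean((W - W_p) ** 2)

def quant_error(W, bits):
    """Symmetric round-to-nearest quantization with one per-tensor scale."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max() / qmax
    W_q = np.clip(np.round(W / scale), -qmax - 1, qmax) * scale
    return np.mean((W - W_q) ** 2)

# 4-bit quantization stores 4/16 of the bits of an fp16 tensor; keeping 25% of the
# weights is a similar budget if sparse-index overhead is ignored (which favors pruning).
print("pruning  (keep 25%):", prune_error(W, 0.25))
print("quantize (4-bit)   :", quant_error(W, 4))
```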

Subcategories