MIL-OSI Russia: Large language models no longer require powerful servers

Translation. Region: Russian Federation –

Source: State University Higher School of Economics –

Scientists from Yandex, HSE, MIT, KAUST and ISTA have made a breakthrough in LLM optimization. The Yandex Research artificial intelligence laboratory, together with leading scientific and technological universities, has developed a method for quickly compressing large language models (LLMs) without losing quality. Now a smartphone or laptop is enough to work with the models, with no need for expensive servers and powerful GPUs.

The method allows for quick testing and implementation of new solutions based on neural networks, saving time and money on development. This makes LLMs accessible not only to large companies, but also to small ones, non-profit laboratories and institutes, and individual developers and researchers.

Previously, to run a language model on a smartphone or laptop, it was necessary to quantize it on an expensive server, which took several weeks. Now, quantization can be done directly on a phone or laptop in a matter of minutes.

Difficulties in applying LLM

The difficulty with using large language models is that they require significant computing resources. This is also true for open-source models. For example, one of them, the popular DeepSeek-R1, does not fit even on expensive servers designed for working with artificial intelligence and machine learning. This means that only a limited number of companies can use large models, even if the model itself is openly available.

The new method allows you to reduce the size of the model while maintaining its quality and run it on more affordable devices. For example, this method can be used to compress even such large models as DeepSeek-R1 with 671 billion parameters and Llama 4 Maverick with 400 billion parameters, which until now could only be quantized using the simplest methods with a significant loss in quality.

The new quantization method opens up more opportunities for using LLM in various fields, especially where resources are limited, such as education or the social sphere. Startups and independent developers can now use compressed models to create innovative products and services without spending money on expensive equipment. Yandex itself is already using the new method for prototyping — creating working versions of products and quickly testing ideas: compressed models are tested faster than their original versions.

More about the new method

The new quantization method is called HIGGS (from Hadamard Incoherence with Gaussian MSE-optimal GridS). It allows neural networks to be compressed without using additional data and without computationally complex parameter optimization. This is especially useful in situations where there is not enough suitable data to further train the model. The method provides a balance between quality, model size, and quantization complexity, which allows models to be used on a wide range of devices.
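The article does not include implementation details, but the two ideas in the method's name can be illustrated: a Hadamard rotation makes weight groups look approximately Gaussian ("incoherent"), after which each value is rounded to a small grid tuned to minimize mean squared error for a standard normal distribution. The sketch below is a hypothetical illustration of that general recipe, not the authors' code; all function names, the group size, and the Lloyd-Max grid fitting are assumptions for demonstration only.

```python
import numpy as np

def hadamard(n):
    # Sylvester construction; n must be a power of two.
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)  # orthonormal, so H @ H.T == I

def fit_gaussian_grid(bits, rng, iters=50):
    # Lloyd-Max iterations: fit a 2**bits grid minimizing MSE for N(0, 1).
    grid = np.linspace(-2.0, 2.0, 2 ** bits)
    for _ in range(iters):
        samples = rng.standard_normal(100_000)
        idx = np.argmin(np.abs(samples[:, None] - grid[None, :]), axis=1)
        for k in range(len(grid)):
            sel = samples[idx == k]
            if sel.size:
                grid[k] = sel.mean()  # move each level to its cell centroid
    return grid

def higgs_like_quantize(weights, group_size=64, bits=3):
    # 1) Random sign flips + Hadamard rotation: groups become ~Gaussian.
    rng = np.random.default_rng(0)
    signs = rng.choice([-1.0, 1.0], size=weights.size)
    H = hadamard(group_size)
    w = (weights.flatten() * signs).reshape(-1, group_size) @ H.T
    # 2) Normalize each group to roughly unit variance.
    scales = w.std(axis=1, keepdims=True) + 1e-12
    w_norm = w / scales
    # 3) Round to the MSE-optimal Gaussian grid, then rescale.
    grid = fit_gaussian_grid(bits, rng)
    idx = np.argmin(np.abs(w_norm[..., None] - grid), axis=-1)
    w_hat = grid[idx] * scales
    # 4) Invert the rotation and the sign flips.
    return ((w_hat @ H).flatten() * signs).reshape(weights.shape)

w = np.random.default_rng(1).standard_normal((8, 64))
w_hat = higgs_like_quantize(w, group_size=64, bits=3)
err = np.mean((w - w_hat) ** 2) / np.mean(w ** 2)
print(f"relative MSE at 3 bits: {err:.3f}")
```

Because the rotation is orthonormal, quantization error introduced in the rotated space carries over unchanged to the original weights, which is why a grid optimized once for the standard normal distribution can be reused for every group without any calibration data.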

The method has already been tested on the popular Llama 3, Llama 4, and Qwen 2.5 models. Experiments have shown that HIGGS offers the best quality-to-size ratio among all existing data-free quantization methods, including GPTQ (GPT Quantization) and AWQ (Activation-Aware Quantization).

Scientists from the National Research University Higher School of Economics, the Massachusetts Institute of Technology (MIT), the Austrian Institute of Science and Technology (ISTA), and the King Abdullah University of Science and Technology (KAUST, Saudi Arabia) participated in the development of the method.

The HIGGS method is now available to developers and researchers on Hugging Face and GitHub, and a scientific article about it can be read on arXiv.

Reaction of the scientific community, other methods

A scientific article describing the new method has been accepted to one of the world’s largest conferences on artificial intelligence, NAACL (The North American Chapter of the Association for Computational Linguistics), which will be held from April 29 to May 4, 2025, in Albuquerque, New Mexico, USA. Along with Yandex, such companies and universities as Google, Microsoft Research, Harvard University, and others will participate. The article has already been cited by the American company Red Hat AI, Peking University, Hong Kong University of Science and Technology, Fudan University, and others.

Earlier, Yandex scientists presented 12 scientific studies in the field of quantization of large language models. In this way, the company aims to make the use of these models more efficient, less energy-consuming, and accessible to all developers and researchers. For example, the Yandex Research team previously developed compression methods for large language models that help reduce computing costs by almost eight times without significantly degrading the quality of the neural network's responses. The team also created a service that allows running a model with 8 billion parameters on a regular computer or smartphone through a browser interface, even without large computing power.

Please note: This information is raw content directly from the source of the information. It is exactly what the source states and does not reflect the position of MIL-OSI or its clients.

MIL OSI Russia News