Translation. Region: Russian Federation –
Source: State University Higher School of Economics –
In 2025, the Natural Language Laboratory of the National Research University Higher School of Economics — St. Petersburg, under the leadership of Dmitry Ryumin, a candidate of technical sciences, will develop technologies that allow AI not only to understand words, but also to recognize a person’s emotions, gestures, and personal characteristics. Initially, the department focused exclusively on the analysis of text data. However, according to Dmitry Ryumin, a single modality is now of little interest to anyone. “Look at the current developments — everyone wants to record something with their voice, and upload a picture, and analyze a video, and work with text,” the scientist comments.
Dmitry Ryumin came to HSE in St. Petersburg from the St. Petersburg Federal Research Center of the Russian Academy of Sciences, where he holds the position of senior research fellow at the Laboratory of Speech and Multimodal Interfaces. “I was invited for the SP4 project (strategic projects), and was then offered the opportunity to head the Natural Language Laboratory. Today, ten people work in the laboratory – from undergraduate students to candidates of science. I would like to expand the team to 20-30 people, so that the laboratory could be divided into related groups. For example, one group could deal with avatars, another with emotions, and then they could be combined to create emotional avatars,” the head shares his plans.
Why do neural networks need emotions?
Under the leadership of Dmitry Ryumin, the HSE-St. Petersburg laboratory will focus on several promising areas related to multimodal technologies.
“Imagine a system that simultaneously analyzes a person’s voice, facial expressions, and gestures. Assessing a person’s personal qualities and recognizing emotions can be useful, for example, when hiring,” the scientist explains. The technology makes it possible to determine how well a job seeker fits the position. “We record an interview with a candidate and analyze not only the content of the answers, but also how they speak, what emotions they show, how they gesture. This gives a more complete picture of a person. For example, openness, sociability, and resistance to stress are important for a manager. The system can analyze whether a candidate’s voice trembles, how clearly they express their thoughts, and provide a description to help HR in recruiting personnel,” comments Dmitry Ryumin.
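A system of this kind typically scores each modality separately and then combines the scores. The sketch below is a minimal illustration of such late fusion; the trait names, modality weights, and score values are assumptions for the example, not the laboratory’s actual pipeline.

```python
# Hypothetical late-fusion sketch: per-modality trait scores (voice,
# face, gesture) are combined by a weighted average into one profile.

def fuse_modalities(scores: dict[str, dict[str, float]],
                    weights: dict[str, float]) -> dict[str, float]:
    """Weighted average of per-modality trait scores; modalities that
    did not produce a score for a trait are simply skipped."""
    traits = {t for trait_scores in scores.values() for t in trait_scores}
    fused = {}
    for trait in traits:
        total, norm = 0.0, 0.0
        for modality, trait_scores in scores.items():
            if trait in trait_scores:
                w = weights.get(modality, 1.0)
                total += w * trait_scores[trait]
                norm += w
        fused[trait] = total / norm if norm else 0.0
    return fused

# Illustrative per-modality estimates of two traits mentioned above.
scores = {
    "voice":   {"openness": 0.7, "stress_resistance": 0.5},
    "face":    {"openness": 0.6, "stress_resistance": 0.8},
    "gesture": {"openness": 0.8},
}
weights = {"voice": 0.4, "face": 0.4, "gesture": 0.2}
profile = fuse_modalities(scores, weights)
```

In practice each per-modality score would itself come from a trained model; the fusion step is where the “more complete picture” of a candidate is assembled.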
Another promising area is personalized advertising. The neural network will be able to evaluate the user’s emotional state and tailor contextual ads accordingly: if the user is sad, it will show one type of content; if happy, another.
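The selection step described above reduces to a mapping from a detected emotion to a content category. The labels and categories in this sketch are illustrative assumptions; a real system would take the emotion label from an upstream recognition model.

```python
# Toy emotion-aware content selection: detected emotion -> content type.
# The emotion labels and content categories are assumptions for the sketch.

CONTENT_BY_EMOTION = {
    "sad":   "comforting",
    "happy": "upbeat",
    "angry": "calming",
}

def select_content(emotion: str, default: str = "neutral") -> str:
    # Fall back to neutral content when the detected emotion is unknown.
    return CONTENT_BY_EMOTION.get(emotion, default)
```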
Emotional avatar technologies will find application in virtual spaces and conferences. “Last year, large international conferences created virtual spaces where participants who could not come physically entered virtual rooms through their avatars. If these avatars are made more emotional, with realistic facial expressions and gestures, the interaction experience will be much better,” the scientist notes. There is also an entertainment direction – movement transfer. “Imagine: I upload a short video in which I am simply in a room and make ordinary movements. The system analyzes and creates a digital model of me. Then I upload another video, where, for example, a professional dancer performs a break dance. The technology replaces the dancer with me, and the result is a realistic video where I masterfully dance a break dance. Similar technologies are actively developing around the world. Large research centers and companies offer various approaches to solving this problem,” explains Dmitry Ryumin.
There is potential for using multimodal artificial intelligence in the field of psychological support. “We can try to recognize not only short-term emotions, but also long-term conditions, such as anxiety disorders, emotional burnout, or cognitive impairment. Of course, there are ethical issues and problems with obtaining data for training systems, but the direction is very promising,” says Dmitry Ryumin.
Another area of development is voice assistants for smart homes. According to the scientist, bimodal recognition is most relevant here, since many people prefer to maintain the privacy of their living space and would not want to connect cameras. “The analysis will be carried out mainly based on speech, which we can convert into text. This approach allows us to work with two modalities simultaneously. I have several voice assistants installed at home. And I regularly encounter a problem: the system does not always interpret speech commands correctly. Sometimes, within a single minute, the assistant can change its ‘mood’ or manner of response several times, which, frankly speaking, is irritating,” the head of the laboratory summarizes.
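The bimodal approach described here is a two-stage pipeline: speech is first transcribed, then the transcript is analyzed as text. Both stages below are stubs, a sketch of the data flow only; a real assistant would plug in a speech-recognition model and a proper language-understanding component.

```python
# Minimal sketch of a bimodal (speech -> text -> analysis) pipeline.
# transcribe() and interpret() are stand-ins; the hard-coded transcript
# and keyword-based intent parsing are assumptions for illustration.

def transcribe(audio: bytes) -> str:
    # Stub ASR stage: a real system would run a speech model on `audio`.
    return "turn on the kitchen lights"

def interpret(text: str) -> dict:
    # Stub text stage: keyword-based intent parsing over the transcript.
    words = text.lower().split()
    action = "on" if "on" in words else "off" if "off" in words else "unknown"
    return {"action": action, "text": text}

def handle_command(audio: bytes) -> dict:
    # Chain the two modalities: audio is analyzed only via its transcript,
    # so no camera or visual input is needed.
    return interpret(transcribe(audio))
```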
The task of researchers who train large language and generative models is to make the decision-making process of a neural network transparent. According to the head of the laboratory, explainable artificial intelligence is a direction that has been actively developing in recent years.
By receiving a decoding of the model’s “train of thought”, any professional can critically evaluate the result obtained: agree with something, question something. This creates an opportunity for feedback and objectivity in decision-making.
How to teach a neural network to recognize emotions?
Modern research into multimodal models requires powerful computing hardware, cross-disciplinary specialists, and large amounts of data.
Computing base. Dmitry Ryumin has been working with neural networks for more than eight years. According to him, the main emphasis used to be on RAM and the processor, but today the central role is played by graphics accelerators (GPUs). The power and number of available video cards directly determine the speed of training neural network models, the number of possible experiments, and the volume of processed data.
“Therefore, it is important not only to conduct research, but also to develop the computing base. For example, with the supercomputer of the Higher School of Economics, we see how these resources affect the quality of scientific experiments. It is especially valuable to involve students, starting from the undergraduate level, in working with such systems — to teach them how to interact with high-performance computing clusters, to give them the opportunity to train models of varying complexity. This creates a continuous educational chain: students who have mastered working with advanced equipment can subsequently be involved in research work in laboratories.”
Working with databases. Teaching large language models to recognize and reproduce emotions is a complex, multi-stage process. And neural networks themselves now take part in it. For example, openly available AI models help automate data collection and annotation: they quickly collect texts with a given emotional coloring. “This radically reduces labor costs compared to traditional manual tagging, when you had to hire people for painstaking work. A general trend is noticeable: many research teams are trying to adapt models to work with emotions. Although such attempts are not yet ideal and the models continue to make mistakes, the direction is actively developing,” says Dmitry Ryumin.
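Automated annotation means a model, rather than a human, assigns an emotion label to each text. In the sketch below a trivial keyword lexicon plays the role that a large language model would play in practice; the lexicon, labels, and sample corpus are all assumptions for illustration.

```python
# Illustrative stand-in for automated emotion annotation of a corpus.
# A real pipeline would query an LLM per text; here a tiny keyword
# lexicon (an assumption for the sketch) produces the labels instead.

EMOTION_LEXICON = {
    "joy":     {"happy", "glad", "delighted"},
    "anger":   {"furious", "annoyed", "angry"},
    "sadness": {"sad", "unhappy", "miserable"},
}

def annotate(text: str) -> str:
    # Label the text with the first emotion whose keywords it contains.
    tokens = set(text.lower().split())
    for label, keywords in EMOTION_LEXICON.items():
        if tokens & keywords:
            return label
    return "neutral"

corpus = ["I am so happy today", "This makes me furious", "Just a plain sentence"]
labels = [annotate(t) for t in corpus]
```

The point of the automation is the loop over `corpus`: labels for an arbitrarily large collection are produced without the painstaking manual tagging mentioned above, at the cost of some labeling mistakes.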
Cross-disciplinary research. Modern research in the field of multimodal models involves interdisciplinarity. Thus, Dmitry Ryumin is now launching a joint project within the framework of the “Fundamental Research Program” with the Laboratory of Social and Cognitive Informatics on modeling cognitive and affective processes and human states. “By combining our departments and laboratories, we are creating a strong interdisciplinary platform for the development of affective technologies. Such cooperation is extremely valuable: our fellow sociologists, although not specializing directly in training neural network models, including large language and generative models, bring deep theoretical expertise. Their knowledge becomes a fundamental basis for training our models,” says the head of the Natural Language Laboratory.
The Natural Language Laboratory welcomes undergraduate and graduate students who are knowledgeable in programming, linguistics, psychology, and sociology.
The Natural Language Laboratory is an interdisciplinary research group in machine learning and natural language processing, studying fundamental properties of language, computation, and learning that can contribute to a better understanding of language in general.