MIL-OSI Russia: “There is no goal to say what is right. We aim to explore variability.”

Translation. Region: Russian Federal

Source: State University Higher School of Economics

Photo: Maxim Melenchenko

HSE University is home to the International Laboratory of Language Convergence, which studies the interaction of languages spoken by different peoples in regions with mixed, multi-ethnic populations. Research by HSE scientists helps to better understand the history of language development and to study how languages are perceived and used in a multilingual environment. Georgy Moroz, head of the laboratory, spoke about this in an interview with HSE.Glavnoe.

— How did the laboratory start working?

— It opened in 2017. Nina Dobrushina became its head, and the academic supervisor was Johanna Nichols, a professor at the University of California, Berkeley, who now works remotely. Most of the research staff studied the languages of the Caucasus and their interaction: for example, Nina Dobrushina, Mikhail Daniel and Timur Maisak were mainly interested in Dagestan, while Yuri Lander and Anastasia Panova studied the Abkhaz-Adyghe languages.

One of the central areas of work is typology. Typological studies in linguistics involve classifying languages by various features (for example, by the number of vowels and consonants). For this purpose, samples are created that can include dozens of languages. Our laboratory is one of the few scientific centers in Russia where such studies are conducted, and perhaps the only one that focuses specifically on the processes of language interaction. The laboratory also continues to study the languages of the Caucasus and create linguistic resources for them.

In the Caucasus, the Russian language comes into contact with languages of different groups: in addition to the Nakh-Dagestani languages, these are the Turkic languages (which include many languages of the peoples of Dagestan, for example Kumyk and Azerbaijani), as well as the Abkhaz-Adyghe languages (Abkhaz, Abaza, Adyghe and Kabardian), Kartvelian (Georgian, Megrelian, Svan and Laz languages) and Indo-European (Armenian, Ossetian, Tat).

The main goal in creating the laboratory was to study the mutual influence of languages. A striking example of such influence is Ossetian, which is Indo-European but, unlike other Indo-European languages, has ejective consonants. These are sounds in which the vocal cords close and the larynx rises during pronunciation, creating a pressure difference, for example, кI, пI, тI, цI, чI. In addition, during an expedition to Azerbaijan, the laboratory staff studied the dialects of the territories bordering Dagestan, and Mikhail Daniel discovered a dialect of Azerbaijani that has ejective sounds (although earlier works had already reported it). Apparently, this can be explained by the fact that the ancestors of the inhabitants of the village of Ilisu spoke a Nakh-Dagestani language, presumably Tsakhur, and then switched to Azerbaijani, preserving the ejectives as a trace. Most likely, this happened through language contact.

Our academic supervisor, Johanna Nichols, put forward a similar hypothesis about the inhabitants of some villages in Dagestan. The Avar language is widespread in the north of Dagestan, mainly in the lowlands. However, one can also find Avar speakers in highland villages surrounded by non-Avar villages. This raises the hypothesis that they previously spoke languages other than Avar and then switched to Avar under the influence of its prestige.

The process by which such borrowings, and even shifts from one language to another, take place, and through which languages or dialects grow closer as a result, is called linguistic convergence. This process is easier to see in genetically unrelated languages, but a similar phenomenon can also occur with related languages or dialects.

— Is convergence of neighboring languages inevitable?

— It happens in most cases, but there are also opposite cases, in which languages and their speakers “try” to be different from each other. This process is called divergence. For example, last year we invited John Mansfield to speak at our seminar; together with his colleagues, he published a typological study of divergent processes based on 42 languages from around the world.

— You mentioned Dagestan, where many languages are spoken. Could you tell us more about this region and your research related to it?

— Dagestan is remarkable for its multilingualism and the mutual influence of its local languages; in addition, at some point they began to change as Russian actively spread into the local environment.

Recently, my research intern Victoria Zubkova, research assistant Chiara Naccarato and I submitted an article to one of the leading international linguistics journals on the adaptation of Russian borrowings in Andic languages. Earlier borrowings mainly came through Avar, which acted as an intermediary. Now words are borrowed directly, and we are trying to model in which languages the influence of Russian is greater and which factors determine its degree.

The study revealed that in Avar and Botlikh, recent borrowings from Russian show fewer phonetic changes than in other Andic languages (see, for example, Akhvakh кIебетIи, “kopeck”). The main reason is that these languages had already come under the strong influence of Russian. Avar used to play an important role in the north of Dagestan; it was and remains a kind of regional lingua franca. Our results show that the process of adaptation of Russian borrowings was slower in other Andic languages than in Avar, but it is clear that adaptation has been decreasing over time. Now, of course, any borrowing will most likely enter all of these languages without any phonetic adaptation.

— How do you obtain materials for research?

— We regularly go on expeditions to collect data; for us, this is the most important source of material. Our colleagues recently returned from Armenia, and another group from Adygea. Recently, we have begun to make more active use of data collected by scientists outside the lab.

For example, the laboratory has collected 10 speech corpora of bilinguals, that is, people for whom Russian is not a native language but who have learned it and use it regularly in everyday life. Their speech, in both pronunciation and grammar, differs from that of monolinguals.

Corpora of individual dialects of Russian are also being created. The main difficulty in collecting such material is that Russian dialectologists were previously reluctant to share their data. Thanks to Nina Dobrushina, this has changed, and depositing dialect corpora with us is now considered common practice. In total, 26 dialect corpora have been created in the laboratory.

We are also collecting corpora of the minority languages of Russia; there are currently 14 of them.

— Can you clarify what a “corpus” is for linguists? How and why do you create new corpora?

— Corpora originated as written records of various types of speech, or simply as annotated collections of texts. What distinguishes a corpus from a plain collection of texts is its morphological or other annotation. In particular, annotation lets you search by category: for example, you can find which nouns occur before infinitives. The Russian National Corpus, for instance, is a large collection of texts that can be searched morphologically. When we prepare oral corpora, both bilingual and dialectal, we use transcripts in standard literary Russian, which makes automatic morphological search possible. Corpora also contain audio recordings, thanks to which we can study the features of dialects. Sometimes you need to listen to the recordings again to determine more precisely whether certain sounds are used.
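
To illustrate the kind of category search described above, here is a minimal Python sketch of looking up nouns that occur immediately before infinitives in an annotated corpus. The toy sentences, the tag names `NOUN` and `INF`, and the function name are assumptions made for this example; they are not the tagset or the search interface of any real corpus.

```python
from collections import Counter

# Toy annotated corpus: each sentence is a list of (word form, lemma, tag).
# The data and the tag names are hypothetical, chosen only for illustration.
CORPUS = [
    [("есть", "быть", "VERB"), ("желание", "желание", "NOUN"),
     ("учиться", "учиться", "INF")],
    [("возможность", "возможность", "NOUN"), ("уехать", "уехать", "INF")],
    [("он", "он", "PRON"), ("хочет", "хотеть", "VERB"),
     ("спать", "спать", "INF")],
    [("желания", "желание", "NOUN"), ("работать", "работать", "INF")],
]


def nouns_before_infinitives(corpus):
    """Count lemmas of nouns that are directly followed by an infinitive."""
    counts = Counter()
    for sentence in corpus:
        # Walk over adjacent token pairs within one sentence.
        for (_, lemma, tag), (_, _, next_tag) in zip(sentence, sentence[1:]):
            if tag == "NOUN" and next_tag == "INF":
                counts[lemma] += 1
    return counts


if __name__ == "__main__":
    for lemma, freq in nouns_before_infinitives(CORPUS).most_common():
        print(lemma, freq)
```

Counting lemmas rather than word forms is what the morphological annotation buys: the two inflected forms of “желание” in the toy data are counted together.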

The corpus is one of the central tools of modern linguistics. It is by analyzing the frequency of use of different constructions in it that we make certain generalizations, on the basis of which we publish articles.

One use of corpora is to compare dialects or small languages with each other: using vector models, one can measure the overlap between the corpora of the corresponding languages and thus understand which dialects and languages are closer to each other and which are further apart.
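
The interview does not say which vector models the laboratory uses, so the following Python sketch substitutes the simplest possible one: each corpus is reduced to a word-frequency vector, and two corpora are compared by cosine similarity. The three tiny “corpora” and all names in the code are invented placeholders.

```python
import math
from collections import Counter


def frequency_vector(tokens):
    """Represent a corpus as a bag-of-words frequency vector."""
    return Counter(tokens)


def cosine_similarity(a, b):
    """Cosine similarity of two sparse frequency vectors (0.0 to 1.0)."""
    shared = set(a) & set(b)
    dot = sum(a[word] * b[word] for word in shared)
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty corpus is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Placeholder token lists standing in for three dialect corpora.
dialect_a = frequency_vector("w1 w2 w3 w1 w4".split())
dialect_b = frequency_vector("w1 w2 w3 w5".split())
dialect_c = frequency_vector("w6 w7 w8".split())

if __name__ == "__main__":
    print("a vs b:", round(cosine_similarity(dialect_a, dialect_b), 3))
    print("a vs c:", round(cosine_similarity(dialect_a, dialect_c), 3))
```

In this toy setup, dialects A and B share most of their vocabulary and score higher than A and C, which share none. Real comparisons would likely rely on richer representations than raw word counts.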

Thus, according to our observations of bilingual corpora, Karelians, unlike Dagestanis, speak a Russian that is closer to the standard literary language. In Dagestan, local languages are influenced by both standard literary Russian and the regional Dagestani Russian that emerged in the republic and is developing in its own way. For children, the amount of language use is important. If, for example, Lezgins speak Lezgin, and Adyghe speak Adyghe or Kabardian, and then switch to Russian, we can ask which Russian exactly: the standard one or a specific local variety with features shaped by the native languages. Such comparisons are possible precisely thanks to our corpora.

— What other resources do you create?

— As mentioned above, one of the laboratory’s important resources is its linguistic atlases of the small languages of Russia.

We also compile dictionaries of such languages. For example, we recently published a dictionary of the Kininsky dialect of the Rutul language, whose speakers live in Dagestan and Azerbaijan; it contains about 1,200 words. I analyzed the Zilov dialect, one of the dialects of the Andi language, which for a long time had no written form, and also posted a dictionary of about 1,500 words on the laboratory’s page. However, this is a significantly smaller volume than that of dictionaries published by linguists from the regions where the corresponding language is spoken. They have a better command of the languages and can usually devote more time to this task.

Dictionaries published in Dagestan include at least 5,000–6,000 units, and recently our colleague Majid Sharipovich Khalilov published a dictionary of the Tsez (Didoi) language containing 11,000 words. For an unwritten language, this is something phenomenal.

— What are the key areas of the laboratory’s current work?

— Our main focus is linguistic typology, within the framework of which research is conducted on a sample of unrelated languages from all over the world.

Another long-term project is the Typological Atlas of the Languages of Dagestan, which already has 58 chapters, each devoted to a separate linguistic phenomenon, such as the presence or absence of certain ejective sounds. Researchers from our laboratory, Samira Verhees and Chiara Naccarato, studied how speakers of different languages greet each other in the morning and wrote a chapter on the subject. It turned out that in 17 languages the greeting is “Good morning!”; the rhetorical questions “Are you awake?” and “Are you up?” are also common greetings, and in the Lak language, for example, you can find both of these options.

The electronic Dagestani dictionaries project now plays an important role. We are trying to create a unified database containing lexical material of the Nakh-Dagestani languages. The database was created thanks to a series of coursework projects by students of the educational program “Fundamental and Computer Linguistics”, who digitized and cleaned the data and created a transliterator. The materials contain phonetic and morphological annotation and mark borrowings from Russian, Arabic, Persian and Turkic languages. We now have unified materials on the Andic and Avar languages.

This greatly simplifies a number of studies that required looking at different dictionaries. The already mentioned article by Victoria Zubkova and Chiara Naccarato was made possible thanks to this database, which also opens up the field for new research. This is a project with great potential, which I hope will continue.

Another important area is the study of non-standard Russian, in which we study both dialects of Russian and the features of the Russian spoken by those for whom it is not a native language. We call our group DiaL2: “dia” stands for dialects, and “L2” is the standard designation for a second language. We are interested in any variants that differ from the standard literary language. We do not aim to say which is correct. We seek to study the variability that we observe. Our group includes laboratory researchers and students. For example, our research intern Anna Grishanova recently had an article accepted for publication on the loss of prepositions in the speech of bilinguals whose first language is Chuvash.

There is also a separate Rutul project. As part of the “Rediscovering Russia” grant, we visited 12 Rutul villages and released an atlas similar to the Typological Atlas of the Languages of Dagestan, which I mentioned earlier. The Rutul Atlas contains 425 separate chapters devoted to various topics of Rutul dialectology: phonetic, grammatical and lexical. For example, one of the chapters is dedicated to the lexeme “hedgehog”, which is expressed by different variants: both a borrowing from Russian and the language’s own words g’yllentsI, kirpik, zh’uzh’ya or k’yng’yr.

There are also two other small projects: one on the Aramaic languages used in Russia, for which a grant from the Russian Science Foundation (24-28-01009) was received – “Areal-typological description of the neo-Aramaic idioms of Armenia” under the direction of Yuri Koryakov – and the second on the Abkhaz-Adyghe languages.

In general, documenting languages is very important for the culture of the peoples we work with, because some unwritten languages can disappear, and if we manage to somehow record them, then people will be able to see how their grandparents spoke, even if they do not understand their native language.

— How is the laboratory’s work organized?

— One of the pillars of the laboratory, it seems to me, is our weekly seminar. It takes place every Tuesday at 16:00. Over the laboratory’s lifetime, more than 230 seminars have been held, with almost 300 papers presented. Almost all seminars are held in English, which allows us to involve foreign colleagues more actively in our work and maintain scientific contacts. We are visited by various well-known linguists, for example, Martin Haspelmath, one of the leading specialists in linguistic typology. During his trip to Moscow last December, he gave a lecture at HSE that attracted great interest. The seminars also show our interns how to give a talk, ask questions, and conduct themselves during a talk in English. In addition, when I became the head of the department, we began to use the seminars more actively as a platform for discussing new scientific articles. This comes from my deep conviction that it is easy to stop reading, or to limit your reading to your narrow specialization, and switch to churning out articles. It is reading and discussing articles, even those far removed from your research topic, that allows you to keep the general state of modern linguistics in focus rather than drowning in specifics, as in the parable of the elephant and the blind wise men.

— How actively do you collaborate with other universities and HSE campuses?

— As part of the “Mirror Laboratories” project, we collaborated with Southern Federal University in 2022–2024. The collaboration included three subprojects: a project to study Russian as a foreign language; a dialectological project, thanks to which we have a corpus of Don dialects that we maintain and can use to continue studying these dialects if necessary; and a Digital Humanities (DH) research project.

The current inter-campus project with the National Research University Higher School of Economics in St. Petersburg is focused on DH: my colleagues and I are engaged in applied computational linguistics. In particular, in St. Petersburg we created a corpus of Russian short stories from the 1930s to 2000s, a corpus of Soviet songs, and even developed a chatbot for the Hermitage.

— How does this chatbot work?

— For example, a visitor asks to see a painting of a woman with a head on a platter, meaning Judith with the head of Holofernes; the bot is supposed to return the desired painting. But hardly anyone will be surprised if it returns Herodias with the head of John the Baptist instead.

— What other applied work can you imagine?

— We have various applied research projects. For example, we have started developing transliterators for the Nakh-Dagestani languages. We dream of creating a hub that would bring together transliterators for texts in different languages, which would be very useful for linguists.
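
As a rough idea of what such a transliterator does, the Python sketch below converts Cyrillic spellings into an IPA-like notation using greedy longest-match replacement, so that the digraphs кI, пI, тI, цI, чI mentioned earlier in the interview (ejective consonants in standard phonetic terminology) are handled before their single-letter counterparts. The mapping table is a deliberately tiny, hypothetical subset invented for this example and covers only lowercase letters; it is not the laboratory’s actual tool or a complete orthography.

```python
EJECTIVE = "\u02bc"  # MODIFIER LETTER APOSTROPHE, the IPA ejective mark

# The palochka may be typed as a dedicated Cyrillic letter; normalize it
# to the Latin capital "I" that appears in the spellings quoted above.
PALOCHKA_VARIANTS = {"\u04c0": "I", "\u04cf": "I"}

# Hypothetical toy mapping: plain consonants, a few vowels, and ejectives.
PLAIN = {"к": "k", "п": "p", "т": "t", "ц": "ts", "ч": "tʃ",
         "б": "b", "е": "e", "и": "i"}
MAPPING = dict(PLAIN)
for letter in "кптцч":
    MAPPING[letter + "I"] = PLAIN[letter] + EJECTIVE

# Try longer keys first so that "кI" wins over "к".
KEYS = sorted(MAPPING, key=len, reverse=True)


def transliterate(text):
    """Greedy longest-match transliteration; unknown characters pass through."""
    for variant, plain in PALOCHKA_VARIANTS.items():
        text = text.replace(variant, plain)
    output = []
    position = 0
    while position < len(text):
        for key in KEYS:
            if text.startswith(key, position):
                output.append(MAPPING[key])
                position += len(key)
                break
        else:
            # No mapping found: keep the character unchanged.
            output.append(text[position])
            position += 1
    return "".join(output)


if __name__ == "__main__":
    print(transliterate("кIебетIи"))
```

Passing unknown characters through unchanged, rather than raising an error, is a design choice that makes gaps in the mapping visible in the output.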

In addition, we are developing morphological analyzers for small languages, collecting corpora and dictionaries. All this is ultimately rich material for verifying machine learning models of various modalities: both audio and text. Such models often suffer from a lack of expert data labeling.

