Brigita, July 27, 2023

DIGIRES scholars have developed datasets for disinformation language recognition

Among other activities related to disinformation and media literacy, DIGIRES also conducts experiments on identifying the language of disinformation. Project researchers Darius Amilevičius and Andrius Utka created an open-access text corpus and a specialized dataset for researching disinformation in the Lithuanian language. Such data had not been collected before.

“We aimed to create a dataset that would help answer these questions: is there anything special about texts that spread disinformation? Apart from false facts, do they have certain linguistic features, or does their linguistic expression not differ from that of conventional media articles? There was no such data in the Lithuanian language,” say researchers D. Amilevičius and A. Utka.

The topic of the DIGIRES project is disinformation, and a lot of disinformation has circulated on the topic of the COVID-19 pandemic, so it was decided to collect texts on this particular topic. “In order to remain objective, we decided to collect only those false articles in which the false information was checked and confirmed by professional fact checkers,” the researchers say.

Other members of the DIGIRES association helped to solve this problem: Aistė Meidutė, fact checker and editor of the “Lie detector” column on the DELFI news portal, and VMU researcher Jūratė Ruzaitė.

The researchers needed to accumulate enough articles with both false and true information. In total, 176 false and 175 true articles, amounting to 186,000 words, were collected.

“A text corpus collected in this way can be studied with traditional language technology tools – concordance programs, morphological annotators or frequency list generators. It should be mentioned that modern language research increasingly uses machine learning methods and artificial intelligence technologies, which require the data to be specially processed before neural networks can be trained on it. As a result, it was decided to create another dataset for machine learning tools,” the researchers say.

Both datasets are publicly available to the academic community in the repository of the CLARIN-LT scientific research infrastructure.