[Dissertation defense] 18/01/2024, Noé Cécillon: "Combining graphs and text to model conversations: Application to online abuse detection".
Title of thesis
Combining graphs and text for modelling conversations: Application to online abuse detection
Date and place
Thursday 18 January 2024 at 2pm
Centre d'Enseignement et de Recherche en Informatique, 339 Chemin des Meinajaries, 84000 Avignon, France
Room: Amphithéâtre ADA
Discipline
Computer Science
Laboratory
Avignon Computer Laboratory
Thesis supervision
- Mr Vincent LABATUT
- Mr Richard DUFOUR
Composition of the jury
- Mr Vincent LABATUT, Avignon University, Thesis supervisor
- Ms Irina ILLINA, University of Lorraine, Rapporteur
- Richard DUFOUR, Nantes University, Thesis co-supervisor
- Julien VELCIN, University Lyon 2, Rapporteur
- Ms Serena VILLATA, CNRS - Université Côte d'Azur, Examiner
- Mr Harold MOUCHERE, Nantes University, Examiner
Summary of the thesis
Abusive behaviour online can have dramatic consequences for users and communities. With the rise of the internet and social networks, no one is immune to this kind of behaviour. The main responsibility lies with the companies hosting these discussion platforms, which must monitor their users' behaviour to prevent the proliferation of abusive comments. Detecting and handling abusive cases quickly is an important factor in reducing their impact and number. As this moderation task involves significant human and financial costs, companies have a strong interest in automating it. Automatic detection of abusive content is, however, quite complex. For example, implicit statements and innuendo often go undetected by standard automatic methods. It has been shown that taking into account the context in which a message is posted can improve detection. Nevertheless, the most common method in the literature consists of processing messages taken out of context.
In this manuscript, we focus on combining content and structure to detect abusive content. Using the textual content of messages is the most common approach. It is easy to implement, but it is also highly vulnerable to text-based attacks, particularly obfuscation techniques. The conversation structure is much less studied because it is more complex to manipulate. However, it introduces a notion of context that makes it possible to detect abusive cases where text alone fails. This context can be modelled as a conversational graph representing the conversation that contains the message under study. By comparing two methods built on a feature extraction process, we show that a method using only conversational graphs and ignoring the textual content of messages achieves better performance than its text-based counterpart.
As suggested in the literature, we propose several strategies for combining conversational content and structure, and our experiments show that this combination is indeed beneficial for detection. A limitation of these metric-based methods is that they are quite expensive in terms of both computational resources and design time. Our study also shows that only some of the computed measures are really important for this task. Representation learning methods can address this problem by automatically learning the numerical representation of a message or conversational graph. For graphs, we show that taking edge attributes into account improves performance. Since the literature does not offer any method for embedding whole signed graphs, we fill this gap by developing two such methods. We evaluate them on a newly created benchmark consisting of three signed graph datasets, and show that they achieve better results than their counterparts that ignore signs. Finally, we conduct a comparative study of several word and graph embedding methods for the detection of abusive messages by applying them to a conversation dataset. Our results show that, compared with methods based on a set of measures, they are more effective for text and slightly less effective for graphs. However, these methods have many other advantages: they are completely task-independent, easier to adapt to other user environments, and much more time-efficient.
Keywords
Representation learning, Abuse detection, Conversations, Graphs
Updated on 15 January 2024