TY - GEN
T1 - Model Comparison for the Classification of Comments Containing Suicidal Traits from Reddit via NLP and Supervised Learning
AU - Mantilla-Saavedra, Camila
AU - Gutiérrez-Cárdenas, Juan
N1 - Publisher Copyright: © 2022, The Author(s), under exclusive license to Springer Nature Switzerland AG.
PY - 2022/4/20
Y1 - 2022/4/20
AB - In recent years, suicide has become one of the most critical public health issues among teenagers and adults. At the same time, the growth and widespread adoption of social networks and mobile devices have made it possible to compile relevant information from these platforms that helps us understand users' thoughts, feelings, and emotions. The detection of suicidal traits on social media has become a relevant research topic. It has permitted the identification of probable suicidal traits among social media users by examining their posts on well-known social networks such as Reddit. For that reason, the purpose of the present research is to compare different supervised classification models such as Logistic Regression, Support Vector Machines (SVM), Random Forest, AdaBoost, Gradient Boosting, and XGBoost, together with feature extraction techniques such as TF-IDF and GloVe. The results of our experiments show that the best model is SVM with TF-IDF, which obtains 91.50% Accuracy, 92.40% Precision, 90.30% Recall, and an F1-score of 91.50%. This study also shows that TF-IDF outperforms GloVe for feature extraction when applied to the different models tested.
UR - https://hdl.handle.net/20.500.12724/17555
UR - http://www.scopus.com/inward/record.url?scp=85128982461&partnerID=8YFLogxK
UR - https://www.mendeley.com/catalogue/8b62ab18-ff0d-31f2-a0b2-15739254695c/
U2 - 10.1007/978-3-031-04447-2_17
DO - 10.1007/978-3-031-04447-2_17
M3 - Article (Conference contribution)
AN - SCOPUS:85128982461
SN - 978-3-031-04446-5
T3 - Communications in Computer and Information Science
SP - 253
EP - 263
BT - Information Management and Big Data - 8th Annual International Conference, SIMBig 2021, Proceedings
A2 - Lossio-Ventura, Juan Antonio
A2 - Valverde-Rebaza, Jorge
A2 - Díaz, Eduardo
A2 - Muñante, Denisse
A2 - Gavidia-Calderon, Carlos
A2 - Valejo, Alan Demétrius
A2 - Alatrista-Salas, Hugo
PB - Springer Science and Business Media Deutschland GmbH
T2 - 8th Annual International Conference on Information Management and Big Data, SIMBig 2021
Y2 - 1 December 2021 through 3 December 2021
ER -