Question Similarity Detection in Indonesian Language Consumer Health Forums with Feature-based Binary Classification Approach

Authors

  • Eka Putri Irianti, Universitas Indonesia

DOI:

https://doi.org/10.33022/ijcs.v13i4.4264

Abstract

Two questions are considered similar if the same response can be given to both. As the number of users of consumer health forums grows, an increasing number of similar questions are not being adequately answered. Identifying duplicate questions in online medical Question Answering (QA) forums benefits both users and medical professionals, so it is crucial for such forums to detect similar questions in order to provide relevant and useful answers. This study examines a feature-based binary classification method for detecting similar questions in the Indonesian consumer health domain. The results indicate that the feature-based classification approach using the CatBoost model yields the best performance. The research also explores techniques to address class imbalance in the dataset, finding that imbalanced learning techniques such as ADASYN and SMOTE result in improved classification performance. Finally, the study analyzes which features are discriminative for identifying semantic similarity between question pairs, concluding that a combination of distance, medical, and encoding features produces the best results.
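
As a rough illustration of what "feature-based" means here, the sketch below turns a question pair into a small numeric feature vector that a binary classifier could consume. It is not the authors' feature set: the function names, the token-overlap and character-ratio measures, and the tiny `MEDICAL_TERMS` lexicon are all assumptions made for this example, and encoding (embedding) features are omitted.

```python
from difflib import SequenceMatcher

# Hypothetical mini-lexicon of Indonesian medical terms (illustrative only;
# a real system would presumably use a much larger curated resource).
MEDICAL_TERMS = {"demam", "batuk", "diabetes", "hipertensi", "obat"}


def tokenize(text):
    """Lowercase and split on whitespace (no punctuation handling)."""
    return text.lower().split()


def pair_features(q1, q2):
    """Return illustrative similarity features for one question pair."""
    t1, t2 = set(tokenize(q1)), set(tokenize(q2))

    # Distance-style feature: Jaccard overlap of the token sets.
    union = t1 | t2
    jaccard = len(t1 & t2) / len(union) if union else 0.0

    # Distance-style feature: character-level similarity ratio.
    char_ratio = SequenceMatcher(None, q1.lower(), q2.lower()).ratio()

    # Medical-style feature: overlap restricted to lexicon terms.
    m1, m2 = t1 & MEDICAL_TERMS, t2 & MEDICAL_TERMS
    m_union = m1 | m2
    medical_overlap = len(m1 & m2) / len(m_union) if m_union else 0.0

    return {
        "jaccard": jaccard,
        "char_ratio": char_ratio,
        "medical_overlap": medical_overlap,
    }
```

Feature dictionaries like these, computed for each labeled pair, would form the rows of the training matrix; the classifier and any resampling step (such as SMOTE or ADASYN on the minority class) would then operate on that matrix.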

Published

08-08-2024