Extracting Features from Online Forums to Meet Social Needs of Breast Cancer Patients


Maitreyi Mokashi (Indiana University)
Enming Zhang (Indiana University)
Josette Jones (Indiana University)
Sunandan Chakraborty (Indiana University)

DOI: https://doi.org/10.1145/3378393.3403652

Session: 2.4. Internet content and community

Abstract: Breast cancer patients go through many ordeals during treatment, and many of the resulting issues are personal, social, or professional. Because these issues are not directly medical in nature, they are not discussed with healthcare providers and hence are not included in the treatment plan. However, they are vital for the patients’ complete recovery. We present a novel approach that acts as a first step toward including such personal and social issues resulting from breast cancer treatment in a patient’s treatment plan. There are numerous online forums where patients share their experiences and post questions about their treatments and subsequent side effects. We collected data from one such forum, called “Online Breast Cancer Forum”, where users (patients) have created threads across many related topics to share their experiences and questions. We use these message threads to identify critical issues faced by patients and how those issues relate to their treatment. We convert the forum data into a bipartite network and map the network nodes into a high-dimensional feature space. In this feature space, we perform community detection to unearth latent connections between patients and topics. We claim that these latent connections, along with the known ones, will help create a new knowledge base that will eventually help physicians estimate the non-medical issues likely to arise from a prescribed treatment. This new knowledge will help physicians plan more adaptive and personalized treatment and be better prepared by anticipating potential problems beforehand. We evaluated our method against two baseline methods and show that it outperforms them by 25% on a manually labeled reference dataset.
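
The pipeline described in the abstract (forum data, then a bipartite network, then node features, then community detection) can be illustrated with a minimal sketch. The abstract does not name the feature-extraction or community-detection algorithms, so this sketch substitutes simple stand-ins: binary topic-membership vectors as features, and grouping of users whose cosine similarity meets a threshold. The toy data, the function names, and the `threshold` parameter are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt

# Hypothetical toy data: (user, topic) pairs standing in for forum posts.
POSTS = [
    ("user_a", "chemotherapy"), ("user_a", "fatigue"),
    ("user_b", "chemotherapy"), ("user_b", "fatigue"),
    ("user_c", "surgery"), ("user_c", "returning_to_work"),
    ("user_d", "surgery"), ("user_d", "returning_to_work"),
]

def build_bipartite(posts):
    """Bipartite network as a mapping from each user to the topics they posted in."""
    graph = defaultdict(set)
    for user, topic in posts:
        graph[user].add(topic)
    return dict(graph)

def feature_vectors(graph):
    """Stand-in features: one binary dimension per topic (not the paper's embedding)."""
    topics = sorted({t for ts in graph.values() for t in ts})
    return {u: [1.0 if t in ts else 0.0 for t in topics] for u, ts in graph.items()}

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def detect_communities(vectors, threshold=0.5):
    """Stand-in community detection: merge users whose similarity meets the threshold."""
    parent = {u: u for u in vectors}

    def find(u):
        # Union-find root lookup with path halving.
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    for u, v in combinations(sorted(vectors), 2):
        if cosine(vectors[u], vectors[v]) >= threshold:
            parent[find(u)] = find(v)

    groups = defaultdict(set)
    for u in vectors:
        groups[find(u)].add(u)
    return sorted(sorted(g) for g in groups.values())

communities = detect_communities(feature_vectors(build_bipartite(POSTS)))
print(communities)
```

On this toy input, users who post about the same topics end up in the same group. The sketch clusters only the patient side of the network; the approach in the abstract places both patients and topics in the feature space, which would call for a learned node embedding rather than raw membership vectors.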