Unveiling Insights: A Knowledge Discovery Approach to Comparing Topic Modeling Techniques in Digital Health Research
Abstract
This paper introduces a knowledge discovery approach for comparing topic modeling techniques in digital health research. Knowledge discovery has been applied to massive data repositories (databases) and in various field studies, which use its techniques to find patterns in data, determine which models and parameters might be suitable, and look for patterns of interest in a specific representational form. However, investigation of Latent Dirichlet Allocation (LDA) and Pachinko Allocation Models (PAM) as generative probabilistic models in knowledge discovery is still limited. The study's findings position PAM as the stronger technique, with the greatest number of distinctive tokens per topic and the fastest processing time. PAM identifies 87 unique tokens across 10 topics, whereas LDA Gensim identifies only 27. PAM also processes 404 documents in 0.000118970870 seconds, compared with 0.368770837783 seconds for LDA Gensim. On these two measures, PAM emerges as the optimum method for topic modeling in digital health research, offering efficient analysis of extensive digital health text data.
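The two comparison measures named in the abstract, unique tokens across topics and processing time, can be illustrated with a minimal sketch. It assumes that "unique tokens" means the number of distinct tokens across the top-word lists of all topics, which the abstract does not state explicitly. The function names `count_distinct_tokens` and `timed` and the toy word lists are hypothetical and are not taken from the paper. No topic model is trained here.

```python
import time
from typing import Callable, Sequence, Tuple


def count_distinct_tokens(topics: Sequence[Sequence[str]]) -> int:
    """Count distinct tokens across the top-word lists of all topics.

    Assumption: this is one plausible reading of "unique tokens";
    the paper may define the measure differently.
    """
    return len({token for topic in topics for token in topic})


def timed(fn: Callable[[], object]) -> Tuple[object, float]:
    """Run fn once and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


# Made-up top-word lists standing in for the output of two topic models.
model_a_topics = [["health", "digital", "patient"], ["health", "digital", "care"]]
model_b_topics = [["telemedicine", "app", "patient"], ["record", "privacy", "data"]]

distinct_a, seconds_a = timed(lambda: count_distinct_tokens(model_a_topics))
distinct_b, seconds_b = timed(lambda: count_distinct_tokens(model_b_topics))

print(distinct_a, distinct_b)  # prints "4 6"
```

In this toy data the first model repeats "health" and "digital" in both topics, so it yields fewer distinct tokens than the second. In a real comparison, `timed` would wrap the training or inference call of each model, and timings would normally be averaged over several runs.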
Copyright (c) 2024 Siti Rohajawati, Puji Rahayu, Afny Tazkiyatul Misky, Khansha Nafi Rasyidatus Sholehah, Normala Rahim, R.R. Hutanti Setyodewi
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.