Similarity Identification Based on Word Trigrams Using Exact String Matching Algorithms
Abstract
Several studies regarding excellent exact string matching algorithms can be used to identify similarity, including the Rabin-Karp, Winnowing, and Horspool Boyer-Moore algorithms. In determining similarities, the Rabin-Karp and Winnowing algorithms use fingerprints, while the Horspool Boyer-Moore algorithm uses a bad-character table. However, previous research focused on identifying similarities using these algorithms based on character n-gram. In contrast, identification based on the word n-gram to determine the similarity based on its linguistic meaning, especially for longer strings, had not been covered yet. Therefore, a word-level trigram was proposed to identify similarities based on the word trigrams using the three algorithms and compare each performance. Based on precision, recall, and running time comparison, the Rabin-Karp algorithm results were 100%, 100%, and 0.19 ms, respectively; the Winnowing algorithm results with the smallest window were 100%, 56%, and 0.18 ms, respectively; and the Horspool algorithm results were 100%, 100%, and 0.06 ms. From these results, it can be concluded that the performance of the Horspool Boyer-Moore algorithm is better in terms of precision, recall, and running time.
Downloads
References
I. Markić, M. Štula, M. Zorić, and D. Stipaničev, “Entropy-Based Approach in Selection Exact String-Matching Algorithms,” Entropy, vol. 23, no. 1, pp. 1–19, Jan. 2021, doi: 10.3390/e23010031.
R. K. Pandey and S. Taruna, “Prevalent exact string-matching algorithms in natural language processing: a review,” in Journal of Physics: Conference Series, May 2021, vol. 1854, no. 1. doi: 10.1088/1742-6596/1854/1/012042.
S. Hakak, A. Kamsin, P. Shivakumara, G. A. Gilkar, W. Z. Khan, and M. Imran, “Exact String Matching Algorithms: Survey, Issues, and Future Research Directions,” IEEE Access, pp. 2169–3536, 2018.
R. A. Baeza-Yates, “Algorithms for String Searching: A Survey,” ACM SIGIR Forum, vol. 23, no. 3–4, pp. 34–58, Apr. 1989, doi: https://doi.org/10.1145/74697.74700.
X. Duan, M. Wang, and J. Mu, “A plagiarism detection algorithm based on extended winnowing,” in MATEC Web of Conferences, Oct. 2017, vol. 128. doi: 10.1051/matecconf/201712802019.
S. D. Schleimer, “Winnowing: local algorithms for document fingerprinting,” in Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, Jun. 2003, pp. 76–85.
A. T. Wibowo, K. W. Sudarmadi, and A. M. Barmawi, “comparison between fingerprint and winnowing algorithm to detect plagiarism fraud on bahasa indonesia documents,” in International Conference of Information and Communication Technology, 2013, pp. 128–133. doi: https://doi.org/10.1109/ICoICT.2013.6574560.
B. Devore-Mcdonald and E. D. Berger, “Mossad: Defeating software plagiarism detection,” in Proceedings of the ACM on Programming Languages, Nov. 2020, vol. 4, no. OOPSLA. doi: 10.1145/3428206.
I. Widaningrum, D. Mustikasari, R. Arifin, and H. A. Pratiwi, “Evaluation of the accuracy of winnowing, rabin karp and knuth morris pratt algorithms in plagiarism detection applications,” in Journal of Physics: Conference Series, May 2020, vol. 1517, no. 1. doi: 10.1088/1742-6596/1517/1/012093.
M. Corman, “Why Should the Length of Your Hash Table Be a Prime Number?,” Aug. 08, 2020. https://medium.com/swlh/why-should-the-length-of-your-hash-table-be-a-prime-number-760ec65a75d1 (accessed Apr. 14, 2022).
E. Karimov, “Hash Table. In: Data Structures and Algorithms in Swift,” Apress, Berkeley, CA, pp. 55–60, Mar. 2020, doi: https://doi.org/10.1007/978-1-4842-5769-2_7.
X. Wang and L. Liu, “Image Encryption Based on Hash Table Scrambling and DNA Substitution,” IEEE Access, vol. 8, pp. 68533–68547, 2020, doi: 10.1109/ACCESS.2020.2986831.
M. Góngora-Blandón and M. Vargas-Lombardo, “State of the Art for String Analysis and Pattern Search Using CPU and GPU Based Programming,” Journal of Information Security, vol. 03, no. 04, pp. 314–318, 2012, doi: 10.4236/jis.2012.34038.
N. Horspool, “Practical fast searching in strings Cite this paper,” Software:PracticeandExperience, vol. 10, pp. 501–506, 1980.
R. Fitriyanto, A. Yudhana, and S. Sunardi, “Implementation SHA512 Hash Function And Boyer-Moore String Matching Algorithm For Jpeg/exif Message Digest Compilation,” Jurnal Online Informatika, vol. 4, no. 1, p. 16, Sep. 2019, doi: 10.15575/join.v4i1.304.
A. Thyab and A. Ajeeli, “Advanced Searching Algorithms and its Behavior on Text Structures,” 2016. [Online]. Available: www.iiste.org
M. Fisk, G. Varghese, and M. Gov, “Applying Fast String Matching to Intrusion Detection Applying Fast String Matching to Intrusion Detection,” LA-UR-01-5459, Sep. 2021, doi: DOI:10.21236/ada406266.
D. Hendrian, Y. Ueki, K. Narisawa, R. Yoshinaka, and A. Shinohara, “Permuted pattern matching algorithms on multi-track strings,” Algorithms, vol. 12, no. 4, 2019, doi: 10.3390/a12040073.
H. Min et al., “Pattern matching based sensor identification layer for an android platform,” Wireless Communications and Mobile Computing, vol. 2018, 2018, doi: 10.1155/2018/4734527.
E. Ahmed, A. Sorrour, M. Sobh, and A. Bahaa-Eldin, “A cloud-based malware detection framework,” International Journal of Interactive Mobile Technologies, vol. 11, no. 2, pp. 113–127, 2017, doi: 10.3991/ijim.v11i2.6577.
O. F. Rashid, Z. A. Othman, and S. Zainudin, “A Novel DNA Sequence Approach for Network Intrusion Detection System Based on Cryptography Encoding Method,” International Journal on Advanced Science, Engineering and Information Technology, vol. 7, no. 1, 2017.
Y. A. Gerhana, N. Lukman, A. F. Huda, C. N. Alam, U. Syaripudin, and D. Novitasari, “Comparison of search algorithms in Javanese-Indonesian dictionary application,” Telkomnika (Telecommunication Computing Electronics and Control), vol. 18, no. 5, pp. 2517–2524, Oct. 2020, doi: 10.12928/TELKOMNIKA.v18i5.14882.
A. P. U. Siahaan, R. Rahim, M. Aan, and D. Siregar, “K-Gram as a determinant of plagiarism level in rabin-karp algorithm,” International Journal of Scientific & Technology Research, vol. 6, no. 7, pp. 350–353, 2017, [Online]. Available: www.ijstr.org
A. D. Hartanto, A. Syaputra, and Y. Pristyanto, “Best parameter selection of rabin-Karp algorithm in detecting document similarity,” in 2019 International Conference on Information and Communications Technology, ICOIACT 2019, Jul. 2019, pp. 457–461. doi: 10.1109/ICOIACT46704.2019.8938458.
C. Supriyanto, S. Rakasiwi, and A. Syukur, “A Comparison of Rabin Karp and Semantic-Based Plagiarism Detection,” in 3rd International Conferences on Soft Computing, Intelligent System and Information Technology, 2012, pp. 29–31. [Online]. Available: http://ir.shef.ac.uk/cloughie/resources/plagiarism_corpus.html
Y. Nurdiansyah, F. Nur Muharrom, and F. Firdaus, “Implementation of winnowing algorithm based k-gram to identify plagiarism on file text-based document,” in MATEC Web of Conferences, Apr. 2018, vol. 164. doi: 10.1051/matecconf/201816401048.
S. Sunardi, A. Yudhana, and I. A. Mukaromah, “Indonesia Words Detection Using Fingerprint Winnowing Algorithm,” Jurnal Informatika, vol. 13, no. 1, Jan. 2019. doi: 10.26555/jifo.v13i1.a8452.
D. B. Rahmawati, M. Luthfi Irfani, R. B. Purba, and I. Ranggadara, “Text Mining To Detect Plagiarism In E-Learning System Using Rabin Karp Algorithm,” Iconic Research and Engineering Journals, vol. 3, no. 8, pp. 183-191, Feb. 2020.
N. Ulinnuha, M. Thohir, D. C. R. Novitasari, A. H. Asyhar, and A. Z. Arifin, “Implementation of winnowing algorithm for document plagiarism detection,” in Proceeding of EECSI , 2018, pp. 631–636.
A. Yudhana, Sunardi, and I. A. Mukaromah, “Implementation of Winnowing Algorithm with Dictionary English-Indonesia Technique to Detect Plagiarism,” International Journal of Advanced Computer Science and Applications, vol. 9, no. 5, pp. 183–189, 2018, [Online]. Available: www.ijacsa.thesai.org
A. Aljohani and M. Mohd, “Arabic-English Cross-language Plagiarism using winnowing,” Information Technology Journal, vol. 13, no. 14, pp. 2349–2355, 2014.
R. Sutoyo et al., “Detecting documents plagiarism using winnowing algorithm and k-gram method,” in 2017 IEEE International Conference on Cybernetics and Computational Intelligence, CyberneticsCOM 2017 - Proceedings, Mar. 2018, vol. 2017-November, pp. 67–72. doi: 10.1109/CYBERNETICSCOM.2017.8311686.
M. Schonlau and N. Guenther, “Text mining using n-grams variables,” The Stata Journal., vol. 17, no. 4, pp. 866–881, Dec. 2017.
D. Wright, “Using word n-grams to identify authors and idiolects A corpus approach to a forensic linguistic problem,” International Journal of Corpus Linguistics, vol. 22, no. 2, pp. 212–241, Sep. 2017, doi: https://doi.org/10.1075/ijcl.22.2.03wri.
J. Alcañiz and J. Andrés, “Profiling Hate Spreaders using word N-grams Notebook for PAN at CLEF 2021,” Sep. 2021. [Online]. Available: http://ceur-ws.org
D. Johnson, V. Malhotra, and P. Vamplew, “More Effective Web Search Using Bigrams and Trigrams,” Webology, vol. 3, no. 4, 2006.
A. Violentyev, “Rolling Hash Function Tutorial Used by Rabin-Karp String Searching Algorithm,” YouTube, Dec. 20, 2019.
H. Jiang and S. J. Lin, “A Rolling Hash Algorithm and the Implementation to LZ4 Data Compression,” IEEE Access, vol. 8, pp. 35529–35534, 2020, doi: 10.1109/ACCESS.2020.2974489.
M. Paul and S. Jamal, “An improved SRL based plagiarism detection technique using Sentence ranking,” in Procedia Computer Science, 2015, vol. 46, pp. 223–230. doi: 10.1016/j.procs.2015.02.015.
Copyright (c) 2022 Abdul Fadlil, Sunardi Sunardi, Rezki Ramdhani
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Authors who publish with this journal agree to the following terms:
1. Copyright on any article is retained by the author(s).
2. The author grants the journal, right of first publication with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgment of the work’s authorship and initial publication in this journal.
3. Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal’s published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgment of its initial publication in this journal.
4. Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work.
5. The article and any associated published material is distributed under the Creative Commons Attribution-ShareAlike 4.0 International License