Similarity Identification Based on Word Trigrams Using Exact String Matching Algorithms

Authors

DOI:

https://doi.org/10.29407/intensif.v6i2.18141

Keywords:

string-matching, algorithm, performance, n-gram, similarity

Abstract

Several studies regarding excellent exact string matching algorithms can be used to identify similarity, including the Rabin-Karp, Winnowing, and Horspool Boyer-Moore algorithms. In determining similarities, the Rabin-Karp and Winnowing algorithms use fingerprints, while the Horspool Boyer-Moore algorithm uses a bad-character table. However, previous research focused on identifying similarities using these algorithms based on character n-gram. In contrast, identification based on the word n-gram to determine the similarity based on its linguistic meaning, especially for longer strings, had not been covered yet. Therefore, a word-level trigram was proposed to identify similarities based on the word trigrams using the three algorithms and compare each performance. Based on precision, recall, and running time comparison, the Rabin-Karp algorithm results were 100%, 100%, and 0.19 ms, respectively; the Winnowing algorithm results with the smallest window were 100%, 56%, and 0.18 ms, respectively; and the Horspool algorithm results were 100%, 100%, and 0.06 ms. From these results, it can be concluded that the performance of the Horspool Boyer-Moore algorithm is better in terms of precision, recall, and running time.

Downloads

Download data is not yet available.

References

I. Markić, M. Štula, M. Zorić, and D. Stipaničev, “Entropy-Based Approach in Selection Exact String-Matching Algorithms,” Entropy, vol. 23, no. 1, pp. 1–19, Jan. 2021, doi: 10.3390/e23010031.

R. K. Pandey and S. Taruna, “Prevalent exact string-matching algorithms in natural language processing: a review,” in Journal of Physics: Conference Series, May 2021, vol. 1854, no. 1. doi: 10.1088/1742-6596/1854/1/012042.

S. Hakak, A. Kamsin, P. Shivakumara, G. A. Gilkar, W. Z. Khan, and M. Imran, “Exact String Matching Algorithms: Survey, Issues, and Future Research Directions,” IEEE Access, pp. 2169–3536, 2018.

R. A. Baeza-Yates, “Algorithms for String Searching: A Survey,” ACM SIGIR Forum, vol. 23, no. 3–4, pp. 34–58, Apr. 1989, doi: https://doi.org/10.1145/74697.74700.

X. Duan, M. Wang, and J. Mu, “A plagiarism detection algorithm based on extended winnowing,” in MATEC Web of Conferences, Oct. 2017, vol. 128. doi: 10.1051/matecconf/201712802019.

S. D. Schleimer, “Winnowing: local algorithms for document fingerprinting,” in Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, Jun. 2003, pp. 76–85.

A. T. Wibowo, K. W. Sudarmadi, and A. M. Barmawi, “comparison between fingerprint and winnowing algorithm to detect plagiarism fraud on bahasa indonesia documents,” in International Conference of Information and Communication Technology, 2013, pp. 128–133. doi: https://doi.org/10.1109/ICoICT.2013.6574560.

B. Devore-Mcdonald and E. D. Berger, “Mossad: Defeating software plagiarism detection,” in Proceedings of the ACM on Programming Languages, Nov. 2020, vol. 4, no. OOPSLA. doi: 10.1145/3428206.

I. Widaningrum, D. Mustikasari, R. Arifin, and H. A. Pratiwi, “Evaluation of the accuracy of winnowing, rabin karp and knuth morris pratt algorithms in plagiarism detection applications,” in Journal of Physics: Conference Series, May 2020, vol. 1517, no. 1. doi: 10.1088/1742-6596/1517/1/012093.

M. Corman, “Why Should the Length of Your Hash Table Be a Prime Number?,” Aug. 08, 2020. https://medium.com/swlh/why-should-the-length-of-your-hash-table-be-a-prime-number-760ec65a75d1 (accessed Apr. 14, 2022).

E. Karimov, “Hash Table. In: Data Structures and Algorithms in Swift,” Apress, Berkeley, CA, pp. 55–60, Mar. 2020, doi: https://doi.org/10.1007/978-1-4842-5769-2_7.

X. Wang and L. Liu, “Image Encryption Based on Hash Table Scrambling and DNA Substitution,” IEEE Access, vol. 8, pp. 68533–68547, 2020, doi: 10.1109/ACCESS.2020.2986831.

M. Góngora-Blandón and M. Vargas-Lombardo, “State of the Art for String Analysis and Pattern Search Using CPU and GPU Based Programming,” Journal of Information Security, vol. 03, no. 04, pp. 314–318, 2012, doi: 10.4236/jis.2012.34038.

N. Horspool, “Practical fast searching in strings Cite this paper,” Software:PracticeandExperience, vol. 10, pp. 501–506, 1980.

R. Fitriyanto, A. Yudhana, and S. Sunardi, “Implementation SHA512 Hash Function And Boyer-Moore String Matching Algorithm For Jpeg/exif Message Digest Compilation,” Jurnal Online Informatika, vol. 4, no. 1, p. 16, Sep. 2019, doi: 10.15575/join.v4i1.304.

A. Thyab and A. Ajeeli, “Advanced Searching Algorithms and its Behavior on Text Structures,” 2016. [Online]. Available: www.iiste.org

M. Fisk, G. Varghese, and M. Gov, “Applying Fast String Matching to Intrusion Detection Applying Fast String Matching to Intrusion Detection,” LA-UR-01-5459, Sep. 2021, doi: DOI:10.21236/ada406266.

D. Hendrian, Y. Ueki, K. Narisawa, R. Yoshinaka, and A. Shinohara, “Permuted pattern matching algorithms on multi-track strings,” Algorithms, vol. 12, no. 4, 2019, doi: 10.3390/a12040073.

H. Min et al., “Pattern matching based sensor identification layer for an android platform,” Wireless Communications and Mobile Computing, vol. 2018, 2018, doi: 10.1155/2018/4734527.

E. Ahmed, A. Sorrour, M. Sobh, and A. Bahaa-Eldin, “A cloud-based malware detection framework,” International Journal of Interactive Mobile Technologies, vol. 11, no. 2, pp. 113–127, 2017, doi: 10.3991/ijim.v11i2.6577.

O. F. Rashid, Z. A. Othman, and S. Zainudin, “A Novel DNA Sequence Approach for Network Intrusion Detection System Based on Cryptography Encoding Method,” International Journal on Advanced Science, Engineering and Information Technology, vol. 7, no. 1, 2017.

Y. A. Gerhana, N. Lukman, A. F. Huda, C. N. Alam, U. Syaripudin, and D. Novitasari, “Comparison of search algorithms in Javanese-Indonesian dictionary application,” Telkomnika (Telecommunication Computing Electronics and Control), vol. 18, no. 5, pp. 2517–2524, Oct. 2020, doi: 10.12928/TELKOMNIKA.v18i5.14882.

A. P. U. Siahaan, R. Rahim, M. Aan, and D. Siregar, “K-Gram as a determinant of plagiarism level in rabin-karp algorithm,” International Journal of Scientific & Technology Research, vol. 6, no. 7, pp. 350–353, 2017, [Online]. Available: www.ijstr.org

A. D. Hartanto, A. Syaputra, and Y. Pristyanto, “Best parameter selection of rabin-Karp algorithm in detecting document similarity,” in 2019 International Conference on Information and Communications Technology, ICOIACT 2019, Jul. 2019, pp. 457–461. doi: 10.1109/ICOIACT46704.2019.8938458.

C. Supriyanto, S. Rakasiwi, and A. Syukur, “A Comparison of Rabin Karp and Semantic-Based Plagiarism Detection,” in 3rd International Conferences on Soft Computing, Intelligent System and Information Technology, 2012, pp. 29–31. [Online]. Available: http://ir.shef.ac.uk/cloughie/resources/plagiarism_corpus.html

Y. Nurdiansyah, F. Nur Muharrom, and F. Firdaus, “Implementation of winnowing algorithm based k-gram to identify plagiarism on file text-based document,” in MATEC Web of Conferences, Apr. 2018, vol. 164. doi: 10.1051/matecconf/201816401048.

S. Sunardi, A. Yudhana, and I. A. Mukaromah, “Indonesia Words Detection Using Fingerprint Winnowing Algorithm,” Jurnal Informatika, vol. 13, no. 1, Jan. 2019. doi: 10.26555/jifo.v13i1.a8452.

D. B. Rahmawati, M. Luthfi Irfani, R. B. Purba, and I. Ranggadara, “Text Mining To Detect Plagiarism In E-Learning System Using Rabin Karp Algorithm,” Iconic Research and Engineering Journals, vol. 3, no. 8, pp. 183-191, Feb. 2020.

N. Ulinnuha, M. Thohir, D. C. R. Novitasari, A. H. Asyhar, and A. Z. Arifin, “Implementation of winnowing algorithm for document plagiarism detection,” in Proceeding of EECSI , 2018, pp. 631–636.

A. Yudhana, Sunardi, and I. A. Mukaromah, “Implementation of Winnowing Algorithm with Dictionary English-Indonesia Technique to Detect Plagiarism,” International Journal of Advanced Computer Science and Applications, vol. 9, no. 5, pp. 183–189, 2018, [Online]. Available: www.ijacsa.thesai.org

A. Aljohani and M. Mohd, “Arabic-English Cross-language Plagiarism using winnowing,” Information Technology Journal, vol. 13, no. 14, pp. 2349–2355, 2014.

R. Sutoyo et al., “Detecting documents plagiarism using winnowing algorithm and k-gram method,” in 2017 IEEE International Conference on Cybernetics and Computational Intelligence, CyberneticsCOM 2017 - Proceedings, Mar. 2018, vol. 2017-November, pp. 67–72. doi: 10.1109/CYBERNETICSCOM.2017.8311686.

M. Schonlau and N. Guenther, “Text mining using n-grams variables,” The Stata Journal., vol. 17, no. 4, pp. 866–881, Dec. 2017.

D. Wright, “Using word n-grams to identify authors and idiolects A corpus approach to a forensic linguistic problem,” International Journal of Corpus Linguistics, vol. 22, no. 2, pp. 212–241, Sep. 2017, doi: https://doi.org/10.1075/ijcl.22.2.03wri.

J. Alcañiz and J. Andrés, “Profiling Hate Spreaders using word N-grams Notebook for PAN at CLEF 2021,” Sep. 2021. [Online]. Available: http://ceur-ws.org

D. Johnson, V. Malhotra, and P. Vamplew, “More Effective Web Search Using Bigrams and Trigrams,” Webology, vol. 3, no. 4, 2006.

A. Violentyev, “Rolling Hash Function Tutorial Used by Rabin-Karp String Searching Algorithm,” YouTube, Dec. 20, 2019.

H. Jiang and S. J. Lin, “A Rolling Hash Algorithm and the Implementation to LZ4 Data Compression,” IEEE Access, vol. 8, pp. 35529–35534, 2020, doi: 10.1109/ACCESS.2020.2974489.

M. Paul and S. Jamal, “An improved SRL based plagiarism detection technique using Sentence ranking,” in Procedia Computer Science, 2015, vol. 46, pp. 223–230. doi: 10.1016/j.procs.2015.02.015.

Downloads

PlumX Metrics

Published

2022-08-13

How to Cite

[1]
A. Fadlil, S. Sunardi, and R. Ramdhani, “Similarity Identification Based on Word Trigrams Using Exact String Matching Algorithms”, INTENSIF: J. Ilm. Penelit. dan Penerap. Tek. Sist. Inf., vol. 6, no. 2, pp. 253–270, Aug. 2022.