Similarity Identification Based on Word Trigrams Using Exact String Matching Algorithms

Abstract: Several well-performing exact string matching algorithms can be used to identify similarity, including the Rabin-Karp, Winnowing, and Horspool Boyer-Moore algorithms. In determining similarities, the Rabin-Karp and Winnowing algorithms use fingerprints, while the Horspool Boyer-Moore algorithm uses a bad-character table. However, previous research focused on identifying similarities with these algorithms based on character n-grams. Identification based on word n-grams, which determines similarity according to linguistic meaning, especially for longer strings, had not yet been covered. Therefore, this study proposes identifying similarities based on word-level trigrams using the three algorithms and compares their performance. In terms of precision, recall, and running time, the Rabin-Karp algorithm achieved 100%, 100%, and 0.19 ms, respectively; the Winnowing algorithm with the smallest window achieved 100%, 56%, and 0.18 ms, respectively; and the Horspool algorithm achieved 100%, 100%, and 0.06 ms, respectively. From these results, it can be concluded that the Horspool Boyer-Moore algorithm performs best overall: it matches the Rabin-Karp algorithm in precision and recall and has the shortest running time.


I. INTRODUCTION
Identifying similarity using string matching algorithms has been widely applied as the first step in detecting plagiarism. These algorithms can be exact string matching algorithms or approximate string matching algorithms. An exact string matching algorithm requires a perfect match on each character being compared, while an approximate string matching algorithm can tolerate slight mismatches in the characters being compared [1], [2].
The Rabin-Karp algorithm is an exact string matching algorithm that uses a hash function to find similarities between text and pattern strings. The text strings are converted into hash values, and all the obtained hash values are selected as fingerprints. The pattern strings are likewise converted into hash values and selected as fingerprints. The fingerprints of the text and the pattern are then compared to find equal values. An equal value indicates a similarity between text and pattern [3], [4]. The Winnowing algorithm is an advanced version of the Rabin-Karp algorithm [5]. The two algorithms differ in how fingerprints are determined. In the Rabin-Karp algorithm, all the obtained hash values are considered fingerprints.
Meanwhile, in the Winnowing algorithm, fingerprints are determined by first grouping all hash values into windows of a certain width. The rightmost minimum hash value obtained from each window is used as a fingerprint [6], [7], [8], [9]. To reduce collisions in hash values, the Rolling Hash formula is used with a prime number as its base [4], [10], [11]. A collision occurs when different keys are mapped to the same hash value, for example because the keys are unevenly distributed in the hash table. A hash table is an array-like data structure for storing data in the form of keys and values [12]. The complexity of the hash-based algorithms is O(m) in preprocessing and O((n-m+1)*m) in matching, where m is the length of the pattern and n is the length of the text [2], [13].
The Horspool Boyer-Moore algorithm is a simplified version of the Boyer-Moore algorithm [4], [14]. The Boyer-Moore algorithm is a character-based algorithm that applies a heuristic approach in the string matching process [4], [13], [15]. Pattern matching in this algorithm proceeds from right to left with two heuristics: the bad-character heuristic and the good-suffix heuristic. The bad-character heuristic determines the skip value when a mismatch occurs during pattern matching. The good-suffix heuristic determines the skip value when some of the characters being compared have already matched before the mismatch occurs. The Boyer-Moore algorithm and its modified versions perform very well in pattern matching and have been widely applied in various technology fields, including the Search and Replace features in operating systems [13], [16]. The Horspool Boyer-Moore algorithm is one of the most effective of these algorithms, especially when the pattern string is longer [4], [16], [17]. The Horspool algorithm removes the good-suffix heuristic of the Boyer-Moore algorithm from the pattern matching process so that only the bad-character heuristic is used to compare characters [4], [14], [17], [18]. The Horspool Boyer-Moore algorithm was introduced by Nigel Horspool in 1980 [17]. This simplification allows a faster text-pattern matching process than the original algorithm [4], [14], [19], [20], [21]. The Horspool algorithm consists of two phases: the character matching phase and the sliding window shift phase [1]. The scanning process begins by aligning the sliding window of the pattern (needle) with the haystack string according to the length of the sliding window and then matching from right to left [2], [17]. Figure 1 illustrates the different workflows of the Rabin-Karp algorithm, the Winnowing algorithm, and the Horspool Boyer-Moore algorithm.
In Figure 1, it can be seen that the Rabin-Karp, Winnowing, and Horspool Boyer-Moore algorithms all have a preprocessing stage and a stage that splits the string into substrings (grams). The two hash-based algorithms split both inputs, called the haystack and the needle, into grams. In the Horspool algorithm, only the needle string is divided into substrings, while the haystack string is not split, because the Horspool algorithm is a string-based pattern matching algorithm in which each substring pattern is traced in the haystack string.
where the algorithm had high accuracy (85.3%) with a short execution time (39.9 ms) [22]. The algorithm was superior to the original Boyer-Moore algorithm for multi-track string matching [18]. Another previous study, which identified sensor devices using the Horspool algorithm, also showed that the algorithm was efficient, especially when the packet being matched was longer [19]. The Horspool algorithm has also been applied to detect malware on cloud networks; the results showed that, whether used alone or integrated with other algorithms, it performed well because fewer attempts were needed, which sped up the matching process [20]. The Horspool algorithm was also implemented in a Network Intrusion Detection System (NIDS), where it classified the types of attacks on the network well [21]. As for the Rabin-Karp and Winnowing algorithms as methods to identify word similarities between scientific documents, previous studies applying both algorithms showed excellent results in mono-language and multi-language settings, including texts in Chinese and Arabic script [5], [9], [23], [24], [25], [26], [27], [28], [29], [30], [31], [32].
Based on these results, it can be seen that the hash-based Rabin-Karp and Winnowing algorithms perform well in identifying similarity using character n-grams. Likewise, the Horspool Boyer-Moore algorithm performs well in identifying similarity on complex objects and longer patterns using character n-grams. However, the use of these algorithms was still limited to identifying similarities using n-grams of adjacent characters. The use of n-grams of adjacent words to determine similarity based on linguistic meaning, especially for longer strings, had not yet been covered. Therefore, this research was carried out to identify similarities based on linguistic meaning using word trigrams as the n-gram unit. The Rabin-Karp, Winnowing, and Horspool Boyer-Moore algorithms are used to determine the similarity so that the performance of the three algorithms can be compared in terms of precision, recall (sensitivity), and running time. The results of this research are expected to give a clear picture of which algorithm is the most effective and efficient, as a foundation for developing a plagiarism detection system in future research.

II. RESEARCH METHOD
In this pattern matching research based on word-level trigrams, the steps for the Rabin-Karp and Winnowing algorithms consist of preprocessing, word-trigram formation, hash value formation, and fingerprint comparison. In the Winnowing algorithm, an additional window formation step precedes the comparison. The stages of the Horspool Boyer-Moore algorithm were preprocessing, needle word-trigram formation, bad-character table formation, and haystack-needle matching. The different stages of the Rabin-Karp, Winnowing, and Horspool Boyer-Moore algorithms [4], [14], [24], [26], [29], [30], [31], [32] in identifying similarity based on word-level trigrams are illustrated in Figure 2.

Figure 2. Flowchart of Algorithms
The illustration in Figure 2 shows that the detection process begins with the input preparation stage. In this study, two input strings were used. One input functions as a comparison text which is then called a haystack, while the other input functions as a pattern which is then called a needle.
The performance measurement for the three algorithms uses parameters of precision and recall.

A. Preprocessing Stages
INTENSIF, Vol.6 No.2 August 2022 ISSN: 2580-409X (Print) / 2549-6824 (Online) DOI: https://doi.org/10.29407/intensif.v6i2.18141
In this study, the preprocessing stage was applied to the Rabin-Karp, Winnowing, and Horspool Boyer-Moore algorithms. Preprocessing in exact string matching algorithms generally consists of case folding, tokenizing, filtering, and stemming [26], [27], [32]. Case folding converts all the letters in the haystack and needle into lowercase, tokenizing slices a string into substrings, filtering removes meaningless symbols and letters from the two input texts, and stemming converts affixed words into their original form (root word). In this study, only the case folding, tokenizing, and filtering stages were carried out; stemming was omitted to shorten the running time. The number of alphabetic characters (c) used in this study is 27, consisting of the characters [a-z] and the space symbol [" "].
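The three preprocessing stages used here can be sketched in a few lines. This is a minimal illustration, not the study's implementation: the function name `preprocess` and the choice to replace every character outside [a-z] and space with a space (rather than deleting it) are assumptions.

```python
import re


def preprocess(text):
    """Case folding, filtering, and tokenizing (no stemming)."""
    folded = text.lower()                       # case folding
    # Filtering: keep only the 27 symbols [a-z] and space.
    # Replacing other characters with a space is an assumption.
    filtered = re.sub(r"[^a-z ]", " ", folded)
    return filtered.split()                     # tokenizing into words


tokens = preprocess("Searching for the same words in two or more documents.")
cleaned = " ".join(tokens)
```

Joining the tokens with single spaces gives a cleaned string in which the capital letter and the final period of the input are gone.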

B. Grams Formation
A gram is the result of slicing a string into substrings at either the character or the word level [3], [6], [33], [34], [35]. Unigrams, bigrams, and trigrams are n-gram units that indicate the length of a gram. A trigram is a gram of three adjacent characters at the character level or three adjacent words at the word level. Word-trigram n-grams are generally applied in linguistics to interpret sentences based on adjacent words [33], [34], [35].
For instance, slicing the string "searching for the same words" into substrings at the word level results in the unigrams ["searching", "for", "the", "same", "words"]; the bigrams ["searching for", "for the", "the same", "same words"]; and the trigrams ["searching for the", "for the same", "the same words"]. Using word-level bigrams and trigrams to interpret sentences is more appropriate than using unigrams [36].
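The slicing in this example can be reproduced with a short sketch. The helper name `word_ngrams` is hypothetical and the input is assumed to be already preprocessed.

```python
def word_ngrams(text, n):
    """Return the word-level n-grams of a whitespace-separated string."""
    words = text.split()
    # One gram starts at every position that still has n words left.
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


sentence = "searching for the same words"
unigrams = word_ngrams(sentence, 1)
bigrams = word_ngrams(sentence, 2)
trigrams = word_ngrams(sentence, 3)
```

A string of k words yields k - n + 1 grams, which is why the five-word example produces three trigrams.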

C. Hash Values Formation
A hash-based algorithm uses hash values to identify similarity. The same hash value represents the same word. In the formation of hash values, collisions can occur, one cause of which is the arrangement of characters. For instance, the substring "abcd" can result in the same hash value as the substring "bcda". Therefore, to avoid collisions, substrings should be converted into hash values using the Rolling Hash formula [37]. The Rolling Hash formula is written as follows [29], [38]:
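A minimal sketch of a position-sensitive polynomial hash, the form usually meant by the Rolling Hash formula, is shown next to a naive additive hash for contrast. The base 257 (a prime), the use of Python character codes via `ord`, and both function names are illustrative assumptions rather than the values used in the study.

```python
def additive_hash(gram):
    """Naive hash: sum of character codes, blind to character order."""
    return sum(ord(ch) for ch in gram)


def polynomial_hash(gram, base=257):
    """Polynomial hash: each character is weighted by a power of the base.

    Equivalent to c1*base^(k-1) + c2*base^(k-2) + ... + ck*base^0.
    The base and the character codes are assumptions for illustration.
    """
    value = 0
    for ch in gram:
        value = value * base + ord(ch)
    return value


same_sum = additive_hash("abcd") == additive_hash("bcda")
same_poly = polynomial_hash("abcd") == polynomial_hash("bcda")
```

Because each position carries a different weight, rearranging the characters changes the polynomial hash, whereas the additive hash of "abcd" and "bcda" is identical.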

D. Window Formation by Winnowing Algorithm
Window formation groups hash values into windows of a specific width in order to take the minimum hash value of each window as a fingerprint. The user can freely choose the window width (w), but the window width affects the algorithm's performance. The hash values of the haystack and of the needle were each grouped into windows containing as many members as the window width. Then, the smallest value was taken from each window of the haystack and the needle to determine the fingerprints. When a window of the haystack or the needle contained two or more equal minimum hash values, the value taken was the rightmost one in the window (the one with the largest index) [6].
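The selection rule can be sketched as follows. The sketch assumes overlapping windows that slide by one hash value at a time, as in the commonly described Winnowing scheme; the function name `winnow` and the (hash, index) output format are hypothetical.

```python
def winnow(hashes, w):
    """Pick the rightmost minimum of each sliding window of width w.

    Returns (hash_value, index) pairs. A position already selected by
    the previous window is not recorded again.
    """
    fingerprints = []
    last_index = -1
    for start in range(len(hashes) - w + 1):
        window = hashes[start:start + w]
        smallest = min(window)
        # Search the reversed window to find the rightmost occurrence.
        offset = w - 1 - window[::-1].index(smallest)
        index = start + offset
        if index != last_index:
            fingerprints.append((smallest, index))
            last_index = index
    return fingerprints


picked = winnow([5, 3, 3, 7, 1], 3)
```

In this toy input the first two windows both select the value 3 at index 2 (the rightmost of the two equal minima), so it is recorded once, and the third window selects the value 1 at index 4.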

E. Fingerprints Comparison
In the Rabin-Karp algorithm, all the hash values obtained were considered fingerprints. In contrast, in the Winnowing algorithm, fingerprints were selected based on the minimum hash value obtained from each window applied to the hash value. Equal hash values obtained from fingerprints comparison between the fingerprints of the haystack and the fingerprints of the needle represented similar trigrams.
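The Rabin-Karp style comparison, in which every hash value is a fingerprint, can be sketched on the two preprocessed strings used in this study. The hash base 257 and all function names are assumptions; only the input strings come from the dataset.

```python
def word_trigrams(text):
    words = text.split()
    return [" ".join(words[i:i + 3]) for i in range(len(words) - 2)]


def polynomial_hash(gram, base=257):
    # Illustrative polynomial hash; the base is an assumption.
    value = 0
    for ch in gram:
        value = value * base + ord(ch)
    return value


def common_fingerprints(haystack, needle):
    """Return the needle trigrams whose hash also occurs in the haystack."""
    haystack_hashes = {polynomial_hash(g) for g in word_trigrams(haystack)}
    return [g for g in word_trigrams(needle)
            if polynomial_hash(g) in haystack_hashes]


HAYSTACK = ("searching for the same words in two or more documents is the "
            "first step in the process of detecting plagiarism in "
            "scientific works")
NEEDLE = ("the first step in detecting plagiarism in scientific works is to "
          "look for the similar words in two or more documents")

matches = common_fingerprints(HAYSTACK, NEEDLE)
```

Storing the haystack fingerprints in a set keeps each lookup cheap; the returned list holds the needle trigrams that share a hash value with some haystack trigram.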

F. Bad-Character Table of Horspool Boyer-Moore Algorithm
In the Horspool algorithm, after the preprocessing stage, the next step was forming a bad-character table using the trigrams of the needle. The bad-character table is the central part of the Horspool algorithm in haystack-needle matching. It determines how far the sliding window must shift when a mismatch occurs between the haystack and the pattern being compared, so that the matching process becomes faster. Figure 3 illustrates the formation of a bad-character table on a trigram of the needle, which gives the skip value of each character for use in the next step. The pattern (needle) to be searched for is "words in two". It is to be matched against the haystack containing the sentence "searching for the same words in two or more documents". Therefore, the bad-character table is formed on the needle. Figure 4 shows that the characters in the bad-character table are "w", "o", "r", "d", "s", space, "i", "n", "t", and "?". This is because the characters that compose the needle are "w", "o", "r", "d", "s", " ", "i", "n", " ", "t", "w", and "o", where the letters "w" and "o" and the space symbol appear twice, which causes their skip values to be overwritten. Therefore, the value taken is the last skip value. The symbol "?" represents all the characters of the haystack that do not occur in the needle, such as the letters "e", "a", "c", "h", and so on. The skip value of "?" is 12 because the sliding window length is 12. The skip value of "?" is the same as the skip value of the letter "o" in the table because the letter "o" is the last character of the needle. Even though the letter "o" first appears near the beginning of the needle, its skip value has been overwritten by that of the last "o".

G. Haystack-Needle Matching by Horspool Boyer-Moore Algorithm
In Figure 4, it can be seen that finding a match between the haystack string and the needle substring (pattern) takes five attempts, as follows.
In the first attempt, after the haystack and the needle are aligned and pattern matching starts from right to left, the characters h0 and n0 match; therefore, matching continues with the next characters to the left of h0 and n0, namely h1 and n1. However, h1 and n1 are mismatched. Therefore, the character h1, namely the letter "f", is looked up in the bad-character table. Because h1 is not in the table, it is categorized under the symbol "?". After the sliding window is shifted by 12 positions, matching starts again from right to left at the first index (h0 and n0) of the newly aligned haystack and pattern.
In the second attempt, h0 and n0 are mismatched; therefore, the character h0, namely the letter "w", is looked up in the bad-character table. The skip value for the letter "w" is 1. After the sliding window is shifted by one position, matching starts again from right to left at the first index of the newly aligned haystack and pattern.
In the third attempt, h0 and n0 match, so matching continues with the next characters, h1 and n1, which also match. Matching continues with the next characters, h2 and n2, which are mismatched; therefore, the character h2, a space symbol, is looked up in the bad-character table. The skip value for the space symbol is 3. After the sliding window is shifted by that skip value, matching starts again from right to left at the first index of the newly aligned haystack and pattern.
In the fourth attempt, h0 and n0 are mismatched; therefore, the character h0, namely the letter "s", is looked up in the bad-character table. The skip value for the letter "s" is 7, so the sliding window is shifted by that value. In the fifth attempt, all characters match; thus, the first gram is found. The matching is finished because it has reached the last character of the haystack string.
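The table formation and the five attempts described above can be traced with a short sketch. The table rule is inferred from the skip values stated in the text (skip = needle length - 1 - index, later occurrences overwriting earlier ones, and a zero for the final character replaced by the needle length); this inference, the function names, and the choice to shift by the entry of the mismatched haystack character are assumptions made to mirror the walkthrough. Textbook Horspool instead always shifts by the entry of the window's last character, so this sketch should not be read as a general-purpose search routine.

```python
def build_skip_table(needle):
    """Bad-character table following the rule inferred from the text."""
    m = len(needle)
    table = {}
    for i, ch in enumerate(needle):
        skip = m - 1 - i
        # Later occurrences overwrite earlier ones; a zero skip for the
        # last character is replaced by the needle length.
        table[ch] = skip if skip > 0 else m
    return table


def walkthrough_search(haystack, needle):
    """Return (position, attempts, shifts); position is -1 if not found."""
    m, n = len(needle), len(haystack)
    table = build_skip_table(needle)
    pos, attempts, shifts = 0, 0, []
    while pos <= n - m:
        attempts += 1
        j = m - 1
        # Compare from right to left within the current window.
        while j >= 0 and haystack[pos + j] == needle[j]:
            j -= 1
        if j < 0:
            return pos, attempts, shifts
        # Characters absent from the needle play the role of "?".
        skip = table.get(haystack[pos + j], m)
        shifts.append(skip)
        pos += skip
    return -1, attempts, shifts


result = walkthrough_search(
    "searching for the same words in two or more documents", "words in two")
```

For this haystack and needle the sketch performs five attempts with shifts of 12, 1, 3, and 7 and reports the match at index 23, in line with the walkthrough. A needle that does not occur in the haystack yields -1.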

H. Performance Measurement
The parameters commonly used to measure algorithm performance in identifying similarity are precision and recall [39]. Precision measures the algorithm's accuracy, while recall measures the algorithm's sensitivity. The harmonic mean (f-measure) measures the balance between an algorithm's precision and recall. The formulas for calculating precision, recall, and f-measure are as follows:
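A minimal sketch of the three measures, assuming the standard definitions in terms of true positives (tp), false positives (fp), and false negatives (fn). The function name and the argument names are hypothetical, and the counts in the usage line are illustrative rather than taken from the study.

```python
def precision_recall_f(tp, fp, fn):
    """Standard precision, recall, and harmonic mean (f-measure)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return precision, recall, f_measure


# Illustrative counts: one correct match, no false matches, three misses.
p, r, f = precision_recall_f(tp=1, fp=0, fn=3)
```

With a precision of 100% and a recall of 25%, the harmonic mean is 40%, which is consistent with the window-size-3 figures reported in the performance results.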

B. Dataset and Preprocessing Results
The dataset used in this study was two plaintexts with different content of strings. One plaintext functions as the haystack and the other as the needle, as presented in Table 1.

String Haystack (original): "Searching for the same words in two or more documents is the first step in the process of detecting plagiarism in scientific works."
String Needle (original): "The first step in detecting plagiarism in scientific works is to look for the similar words in two or more documents."
String Haystack (preprocessed): "searching for the same words in two or more documents is the first step in the process of detecting plagiarism in scientific works"
String Needle (preprocessed): "the first step in detecting plagiarism in scientific works is to look for the similar words in two or more documents"

C. Grams results
In the hash-based algorithms (Rabin-Karp and Winnowing), the trigram formation was applied to both plaintexts using word-level trigrams.

Trigrams of Haystack:
[ 'searching for the', 'for the same', 'the same words', 'same words in', 'words in two', 'in two or', 'two or more', 'or more documents', 'more documents is', 'documents is the', 'is the first', 'the first step', 'first step in', 'step in the', 'in the process', 'the process of', 'process of detecting', 'of detecting plagiarism', 'detecting plagiarism in', 'plagiarism in scientific', 'in scientific works' ]
Trigrams of Needle:
[ 'the first step', 'first step in', 'step in detecting', 'in detecting plagiarism', 'detecting plagiarism in', 'plagiarism in scientific', 'in scientific works', 'scientific works is', 'works is to', 'is to look', 'to look for', 'look for the', 'for the similar', 'the similar words', 'similar words in', 'words in two', 'in two or', 'two or more', 'or more documents' ]
The data in Table 2 show that 21 trigrams were obtained from the haystack and 19 from the needle. The trigrams were formed by splitting each string into substrings. The trigrams formed on the needle were used not only to create fingerprints in the Rabin-Karp and Winnowing algorithms but also for the string matching process in the Horspool algorithm.
Table 3 shows that 21 hash values were obtained from the haystack and 19 from the needle.

E. Windows Results of Winnowing algorithm
In the Winnowing algorithm, the hash values of the haystack and of the needle were first grouped into windows of a certain width. The minimum hash values obtained from each haystack window and needle window were then used as fingerprints. The data presented in Table 4 show that the wider the window, the fewer windows are formed (see the Appendices section for more detail).

F. Fingerprints Comparison Results of Rabin-Karp Algorithm
In the Rabin-Karp algorithm, the selected fingerprints were all the haystack hash values and all the needle hash values. Table 6 presents the fingerprints used by the Rabin-Karp algorithm to identify the similarity between haystack and needle.
[ 'words in two', 'in two or', 'two or more', 'or more documents', 'the first step', 'first step in', 'detecting plagiarism in', 'plagiarism in scientific', 'in scientific works' ]
The data in Table 6 show that, in the comparison between the haystack fingerprints and the needle fingerprints, nine fingerprints are equal, so the similarity between the two is relatively high. Table 7 shows that the wider the window used in grouping hash values, the fewer fingerprints are identified as similar.

G. Bad-Character Results of Horspool Boyer-Moore Algorithm
In the Horspool Boyer-Moore algorithm, the matching process was carried out by comparing the needle's trigrams with the haystack's plaintext. Therefore, for each needle trigram, a bad-character table was formed to determine the skip value of each character in the trigram before starting the haystack-needle matching.
Table 8. Bad-Character Table for Each Trigram

Trigrams of Needle | Skip Values
'the first step', 'first step in', 'step in detecting', 'in detecting plagiarism', 'detecting plagiarism in', 'plagiarism in scientific', 'in scientific works', 'scientific works is', 'works is to', 'is to look', 'to look for', 'look for the', 'for the similar', 'the similar words', 'similar words in', 'words in two', 'in two or', 'two or more', 'or more documents'

H. Matching Results by Horspool Boyer-Moore Algorithm
The string matching results in Table 9 show similarities between the haystack and the needle, as nine trigrams were matched. if it was only one character, the algorithm returned the value "-1", meaning that no three adjacent words in the haystack exactly match the trigram.

I. Performance Results of Algorithms
To respectively; with a window size of 3 were 100%, 25%, and 40%, respectively; and with a window size of 4 were 100%, 11%, and 20%, respectively (see the Appendices section for more detailed performance results of the algorithms).
In terms of precision and recall, as presented in Figure 5, the performance of the Horspool Boyer-Moore algorithm was equal to that of the Rabin-Karp algorithm. In contrast, the Winnowing algorithm had lower sensitivity (recall). Figure 5 shows that the wider the window of the Winnowing algorithm, the lower the sensitivity value. This is because windows determine which hash values become fingerprints: the wider the window used in grouping hash values, the fewer minimum hash values are selected (for more detail, see the Appendices section).

Figure 6. Running Times of Algorithms
In identifying similarity using word-level trigrams, in terms of running time as presented in