Arab J Sci Eng DOI 10.1007/s13369-017-2855-x
RESEARCH ARTICLE - COMPUTER ENGINEERING AND COMPUTER SCIENCE
A Feature Selection Approach to Detect Spam in the Facebook Social Network
Mohammad Karim Sohrabi1 · Firoozeh Karimi1
Received: 21 February 2017 / Accepted: 4 October 2017 © King Fahd University of Petroleum & Minerals 2017
Abstract The widespread adoption of social networks, with their rich facilities and growing opportunities, has attracted a large number of users. Along with attractive and interesting messages and topics, however, inappropriate and sometimes criminal content, such as spam, is also released on these networks. Malicious spammers send inaccurate or irrelevant content to spread misleading information on online social networks. This paper addresses the detection of spam comments on the Facebook social network. By reviewing posts and comments and studying their features, an online spam filtering system has been designed. The proposed filtering system can exploit various search methods and optimization algorithms, such as simulated annealing, particle swarm optimization, ant colony optimization, and differential evolution, to detect and filter malicious content and to prevent spam comments from being published, providing a secure environment for users of this popular social network. Furthermore, supervised machine learning methods, clustering techniques, and decision trees are used to give the proposed filtering system accurate performance and appropriate speed.
Keywords Online social network · Feature selection approach · Spam detection · Machine learning
1 Department of Computer Engineering, Semnan Branch, Islamic Azad University, Semnan, Iran
1 Introduction
As social networks have grown in popularity, they have become a major vehicle for criminal and malicious activity in the form of spam. Spammers carry out many illegal activities, such as stealing important information, selling and distributing malware, spreading false propaganda, and numerous other malicious operations. Many spammers also redirect users to undesirable and malicious pages by posting URLs as comments on different pages of online social networks. Because of the massive amount of spam data, analyzing and detecting it manually is very difficult, if not impossible. At the same time, securing these networks and gaining the trust of their users requires an efficient solution to this problem [1]. Comparing spam on online social networks with spam emails shows that users are more prone to trust spam messages posted by their friends on online social networks [2]. Since spam emails are detected by SMTP servers, the methods widely used to detect them are not applicable to spam comments on social networks. Message size is another considerable issue: comments on posts in online social networks are usually short, while emails are often much longer. A further difference is that most spam emails are sent from fraudulent and fake accounts, so detecting and blocking these accounts prevents such emails from being received. This method is not appropriate for spam comments on social networks, however, because spamming on these platforms is not usually done by fraudulent accounts; many ordinary users act as spammers in these environments. Therefore, just identifying
fake accounts and filtering them is not an adequate solution to prevent the release of spam comments in online social networks. This paper's solution for spam detection in online social networks is based on machine learning methods. Several supervised and unsupervised machine learning methods [3] have been used in different artificial intelligence applications, such as data mining and knowledge discovery [4–9], data warehousing [10,11], big data processing and management [12], and computational biology [13]. Machine learning techniques have also been widely used to detect software sabotage and attacks [14]. In this paper, a machine learning-based solution for spam detection in the Facebook social network is presented. By reviewing 200,000 posts from Facebook, considering important features of the posts and their corresponding comments, and finally applying feature selection techniques, the proposed method chooses the most effective features for detecting spam with machine learning techniques. The main contributions of the paper can be summarized as follows:
Obtaining the highest rate of correct separation of spam and legitimate messages through three methods: clustering using PSO-based feature selection, the DB index, and the Differential Evolution (DE) algorithm; Support Vector Machines (SVM); and Decision Trees (DT).
Designing a hybrid algorithm that combines SVM and clustering methods to achieve the highest rate of correct separation of spam and legitimate messages.
Designing an online spam filtering system that uses feature selection, clustering, and decision tree techniques to improve the speed and accuracy of the system.
The remainder of the paper is organized as follows. Section 2 reviews the literature. Section 3 describes the scenario of the proposed system, evaluates the considered features and their impact, and explains the feature selection algorithm; it then explains the evolution-based clustering algorithm and employs two supervised machine learning methods to complete the detection phase. The design of the online spam detection system and the experimental results are presented in Sect. 4, and Sect. 5 concludes the work.
2 Related Works
Along with their many advantages and benefits, social networks expose users to various threats, such as identity theft, malware, malicious searches, and of course spam. Spammers can cause problems for users by putting spam on their profiles and posting junk comments on the
networks. The widespread use and popularity of Facebook have made this social network a likely target for most spam attacks. Huber et al. [15] demonstrated an attack on Facebook in which, by exploiting users' friend lists and choosing nodes, tens of thousands of spam messages could be sent to other users. They called such attacks “Friend-in-the-middle Attacks,” through which sensitive information about the users' friends can be obtained, and explained that such attacks need only cheap hardware and very little time. Abu-Nimeh et al. [16] studied the Defensio software on Facebook. This software scores the texts of comments by categorizing them with an SVM and also rates the senders of comments by examining their credibility. Posts are categorized using the average of these scores, and spam is separated from legitimate messages. The content of messages and the behavior of senders are also used by Yu et al. [17] to detect social media spammers. That work uses a semi-supervised approach based on constrained nonnegative matrix factorization, which detects spammers by combining the message content, the user's behavior, and a social relation information matrix.
Feature selection plays an important role in classification. Applying feature selection methods is very effective in shortening the training time and improving the performance of classifiers. Since there may be complex interrelations between the features, it is generally difficult to choose the best subset of features [18]. Different approaches have been proposed in the literature to solve this problem. In general, feature selection approaches can be divided into two main groups: the wrapper approach and the filter approach. The filter approach relies primarily on the general characteristics of a dataset to evaluate and select a subset of the features, without involving a specific learning algorithm [19]. The wrapper approach uses a classification technique to choose the optimal subset of features [20].
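To make the distinction between the two groups concrete, the following Python sketch contrasts them on a toy dataset. It is only an illustration under stated assumptions: the filter statistic (absolute difference of class means), the nearest-centroid classifier used by the wrapper, the exhaustive subset search, and all function names are hypothetical choices made here and are not taken from [19] or [20].

```python
from itertools import combinations
from statistics import mean


def filter_rank(X, y):
    """Filter approach: score each feature on its own, with no classifier involved."""
    scores = []
    for j in range(len(X[0])):
        spam = [row[j] for row, label in zip(X, y) if label == 1]
        ham = [row[j] for row, label in zip(X, y) if label == 0]
        scores.append(abs(mean(spam) - mean(ham)))
    # Feature indices ordered from most to least discriminative.
    return sorted(range(len(scores)), key=lambda j: -scores[j])


def centroid_accuracy(X, y, subset):
    """Training accuracy of a nearest-centroid classifier restricted to `subset`."""
    def centroid(label):
        rows = [row for row, l in zip(X, y) if l == label]
        return [mean(r[j] for r in rows) for j in subset]

    c0, c1 = centroid(0), centroid(1)
    hits = 0
    for row, label in zip(X, y):
        d0 = sum((row[j] - c) ** 2 for j, c in zip(subset, c0))
        d1 = sum((row[j] - c) ** 2 for j, c in zip(subset, c1))
        hits += (1 if d1 < d0 else 0) == label
    return hits / len(X)


def wrapper_select(X, y, size):
    """Wrapper approach: evaluate candidate subsets with the classifier itself."""
    candidates = combinations(range(len(X[0])), size)
    return max(candidates, key=lambda s: centroid_accuracy(X, y, s))


# Toy data (hypothetical): feature 0 separates the classes, 1 is noise, 2 is constant.
X = [[0.9, 0.5, 1.0], [0.8, 0.1, 1.0], [0.1, 0.4, 1.0], [0.2, 0.2, 1.0]]
y = [1, 1, 0, 0]
print(filter_rank(X, y)[0], wrapper_select(X, y, 1))
```

The filter ranking is cheap because it never trains a model, whereas the wrapper pays for one classifier evaluation per candidate subset; this cost is what motivates the meta-heuristic search methods discussed next.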
Each feature selection approach suffers from its own problems. Several meta-heuristic techniques, such as genetic algorithms [21], particle swarm optimization [20], and simulated annealing [22], have been exploited to tackle feature selection problems. Several feature selection methods have been proposed in the literature to enhance the performance of spam detection, some of them for spam emails. For example, Toolan and Carthy [23] investigated 40 different features for detecting unsolicited bulk emails and selected the best set of features for detecting spam emails. Feature representation and feature selection techniques have also been used to enhance an SVM-based spam detector [24]. Beyond spam emails, feature selection techniques play an important role in spam detection in online social networks. An optimal Random Forests (RF)-based spam detection model has been proposed in [25], which optimizes two parameters of RF and determines variable importance to select the most important features and to eliminate
irrelevant ones. The optimal number of selected features is calculated in two different ways in that work. A labeled dataset of users of the Sina Weibo social network was constructed in [2], and its spammer and nonspammer users were classified manually. The spammer detection algorithm of that work applies a feature selection phase to the features of message content and user behavior to improve the performance of its SVM-based method. A suitable set of features has also been selected from message content and user behavior and applied to the extreme learning machine-based spammer detector of [26]. A binary PSO with a mutation operator is used by the wrapper-based feature selection method of [27], which employs a decision tree. A taxonomy of the features used to detect malicious accounts in online social networks is presented in [28].
3 Feature Selection in the Proposed Spam Detection System
A large number of spam comments are published every day in social networking environments. This paper aims to provide a system that prevents these worthless and destructive messages from being released. A simplified view of the proposed system is shown in Fig. 1: the system detects and filters spam before it reaches the recipient. The dataset of the proposed system includes 200,000 wall posts with their comments from the Facebook social network. These data were gathered in two ways. One part of the dataset was collected from public Facebook pages using an agent. The second part was collected manually from the personal pages of users who are members of the friend networks of the information collectors.
Fig. 1 The proposed scenario for spam filtering system (components shown: Users, Interface, Online Spam Filtering System, Storage; messages are labeled Spam or Valid)
In this section, we first introduce the considered features and then explain the feature selection algorithm of the proposed system.
3.1 Evaluating the Considered Features
A user's wall on Facebook is a place for exchanging texts, videos, and other media, and for sending links within this social network. Users give their opinion about the raised issues by liking posts or writing comments. According to Facebook statistics released in September 2011, approximately two billion wall posts are noticed by users in 1 day, and a tremendous number of comments and likes are received on these wall posts [29]. Since the content of legitimate comments differs from that of spam comments, several message features can be considered that potentially differentiate spam from legitimate messages. They are introduced in the following [30].
Hash-Tag (Wall Posts, Comments). Hash-tags are used to facilitate searching for specific subjects or content in social networks and micro-blogging services. A hash-tag is created by placing a ‘#’ sign in front of a word or phrase. Putting hash-tags in wall posts and comments is very popular. Hash-tags are more attractive to spammers than to typical users, because this feature allows wall posts or comments to be seen by more users [31].
Reply. Spammers use many replies on their wall posts in order to attract more attention. This can be a good feature for identifying spam correctly.
Like (Wall Post, Comment). Users can use the “like” button, an important feature of social networks such as Facebook, to give a positive opinion about wall posts, comments, photos, or shared links. From the number of likes, the owners of content pages can find out how much attention their wall posts receive from their friends or other users. Since spam posts are usually not noticed by other users, the wall posts and comments of spammers receive far fewer likes than those of normal users.
Comments. Comments are similar to likes, with the difference that a comment contains an explanation for the reader. Comments let users write about the content they have liked. Furthermore, like likes, comments are not private and are visible to other users. The more comments a wall post has, the more it is attacked by spammers.
Spam Words. Spam accounts usually contain spam words in their wall posts. Hence, the frequency of spam words in the wall posts of an account can be a good feature for detecting spam accounts.
URLs. URLs are direct links to other pages; users are redirected to the linked page by clicking on them. The easiest way to bring the users of social networks to other pages is to send URLs in comments. Recent studies show that more than 80% of spam contains malicious links [31].
Share. Users of social networks can share photos, videos, blog entries, and news with their friends. The number of times content is shared shows how important it is, so a high number of shared wall posts can be an important factor for detecting spammers.
The Average Time Interval (Burst Time). Spammers usually send spam messages within short intervals [32]. Therefore, when these intervals are calculated during the clustering of messages, spam clusters can be expected to have shorter time intervals than clusters of legitimate messages [33].
Friends. Users who are in each other's lists are considered friends on social networks. Normal users have a greater number of friends than spammers.
Followers. On Facebook, the number of a user's friends cannot exceed 5000. “Following” is the appropriate way to communicate with a user whose number of friends has reached this limit. In addition, some pages, such as fan pages of famous people and politicians, are followed by many users, and there is no specific ceiling on the number of followers. Spammers do not usually have many followers.
Message Size. Spam comments have different sizes. Some include only URL(s). Some contain advertisements and are larger, because the advertising and description of the commodity appear in the comment itself. Others contain both text and URLs.
Table 1 The multi-objective PSO-based proposed algorithm for feature selection [30]
    data = LoadData();
    nx = data.nx;                 % number of candidate features
    BestSol = cell(nx,1);
    S = cell(nx,1);
    BestCost = zeros(nx,1);
    for nf = 1:nx                 % run PSO once for each subset size nf
        disp(['Selecting ' num2str(nf) ' feature(s) ...']);
        results = RunPSO(data,nf);
        disp(' ');
        BestSol{nf} = results.BestSol;
        S{nf} = BestSol{nf}.Out.S;
        BestCost(nf) = BestSol{nf}.Cost;
    end
3.2 Feature Selection Algorithm
Particle Swarm Optimization (PSO) is the feature selection algorithm of the proposed system. PSO is an algorithm inspired by the behavior of a flock of birds [18]. PSO-based feature selection has been used in [30], where its better performance compared with other feature selection methods has been shown. Therefore, considering its simplicity and rapid convergence, this algorithm is used in the feature selection problem
and many other complex issues. In this paper, PSO is first employed to select, among the 13 features mentioned above, those that are most effective in identifying spam in the dataset. The multi-objective PSO starts by selecting a single feature and, in each subsequent phase, increases the number of selected features and calculates and displays the resulting error rate. The multi-objective PSO-based algorithm for feature selection is given in Table 1. Figure 2 shows the performance of the PSO-based feature selection method of the proposed system. As shown in Fig. 2, the error rate with 4 selected attributes is lower than with 5 attributes; in fact, adding the fifth feature makes the result worse. The point with 5 attributes must therefore be removed from the final results. Figure 2 also shows that selecting 7 features gives a proper trade-off between the error rate and the time and memory required by the algorithm. Selecting more features can lead to a better error rate, but it can also reduce time and memory efficiency. After the number of selected features has been fixed, a permutation-based solution is used to determine which features are selected. The proposed system uses a random key and a meta-heuristic algorithm to obtain the set of selected features with the permutation approach. Figure 3 shows the algorithm performance with 7 features after 100 iterations. The top seven features used by the system's feature selection method are the features numbered 13, 6, 7, 3, 8, 4, and 9, which are “comment size,” “number of wall posts likes,” “comments likes,” “number of replies,” “link addresses,” “number of comments,” and “post sharing,” respectively. After the optimization and selection of the most effective features, the data are ready for clustering.
Fig. 2 The error rate of the feature selection method [30] (x-axis: number of selected features nf; y-axis: E)
Fig. 3 Best cost for PSO-based feature selection [30] (x-axis: iteration; y-axis: best cost)
3.3 Clustering
Clustering is one of the most important unsupervised learning techniques for dealing with massive amounts of heterogeneous information. The goal of clustering is to group objects into meaningful subsets called clusters [34]. Data clustering algorithms can be hierarchical or partitional. In each category, there are a large number of subtypes and algorithms for finding clusters [35]. To evaluate the results of clustering algorithms, a cluster validity index should be used [36]. The DB index is one of the well-known cluster validity indices. It is defined as the ratio of the sum of within-cluster scatter to between-cluster separation [36]. The implementation of the DB index, based on [37], is given in Table 2.
Table 2 The implementation of the DB index code [37]
    function [DB, out] = DBIndex(m, X)
        k = size(m,1);
        % Calculate Distance Matrix
        d = pdist2(X, m);
        % Assign Clusters and Find Closest Distances
        [dmin, ind] = min(d, [], 2);
        q = 2;
        S = zeros(k,1);
        for i = 1:k
            if sum(ind==i) > 0
                S(i) = (mean(dmin(ind==i).^q))^(1/q);
            else
                S(i) = 10*norm(max(X)-min(X));
            end
        end
        t = 2;
        D = pdist2(m, m, 'minkowski', t);
        r = zeros(k);
        for i = 1:k
            for j = i+1:k
                r(i,j) = (S(i)+S(j))/D(i,j);
                r(j,i) = r(i,j);
            end
        end
        R = max(r);
        DB = mean(R);
        out.d = d;
        out.dmin = dmin;
        out.ind = ind;
        out.DB = DB;
        out.S = S;
        out.D = D;
        out.r = r;
        out.R = R;
    end
Fig. 4 Best cost for DE-based clustering [30] (x-axis: iteration; y-axis: best cost)
The proposed system employs the Differential Evolution (DE) algorithm with the DB index to divide the dataset into two clusters of spam and legitimate (nonspam) messages, with a very good error rate of 0.02 [30]. Differential evolution is a powerful and fast method for optimization problems in continuous spaces. One of its benefits is that it has a memory that keeps information about good solutions in the current population. Another advantage concerns its selection operator: in this algorithm, all solutions have an equal chance of being chosen as one of the parents [37]. Using the DE-based clustering method with the DB index, along with PSO-based feature selection, improves the efficiency of the proposed system. This method detected 71.4% of spam and 96.3% of legitimate messages correctly. The performance of DE-based clustering and the detection rate of the proposed system are shown in Figs. 4 and 5, respectively. To quantify how precisely the proposed system detects spam, the detection rate $DR = \frac{TP}{TP + FN}$ is used, which considers true positives (TP) and false negatives (FN) [38].
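The interplay between the DB index and differential evolution described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `db_index` is a standard-library port of the logic in Table 2, while `de_cluster`, the DE/rand/1/bin variant, and the values of `pop_size`, `f`, and `cr` are hypothetical choices made here for the example.

```python
import math
import random


def db_index(centroids, points, q=2):
    """Davies-Bouldin index of a set of centroids (lower is better)."""
    k = len(centroids)
    # Assign each point to its closest centroid and keep that distance.
    members = [[] for _ in range(k)]
    for p in points:
        dists = [math.dist(p, c) for c in centroids]
        nearest = dists.index(min(dists))
        members[nearest].append(dists[nearest])
    # Within-cluster scatter S_i; an empty cluster gets a large penalty,
    # mirroring the 10*norm(max(X)-min(X)) fallback of Table 2.
    dims = range(len(points[0]))
    span = math.sqrt(sum((max(p[d] for p in points) - min(p[d] for p in points)) ** 2
                         for d in dims))
    scatter = [(sum(d ** q for d in m) / len(m)) ** (1 / q) if m else 10 * span
               for m in members]
    # R_i = max over j != i of (S_i + S_j) / D_ij; DB is the mean of the R_i.
    worst = []
    for i in range(k):
        ratios = [(scatter[i] + scatter[j])
                  / (math.dist(centroids[i], centroids[j]) or 1e-12)
                  for j in range(k) if j != i]
        worst.append(max(ratios))
    return sum(worst) / k


def de_cluster(points, k=2, pop_size=20, iters=100, f=0.5, cr=0.9, seed=0):
    """Search for k centroids that minimise the DB index with DE/rand/1/bin."""
    rng = random.Random(seed)
    dim = len(points[0])
    lo = [min(p[d] for p in points) for d in range(dim)]
    hi = [max(p[d] for p in points) for d in range(dim)]
    n = k * dim  # each candidate is k centroids flattened into one vector

    def cost(vec):
        return db_index([vec[i * dim:(i + 1) * dim] for i in range(k)], points)

    pop = [[rng.uniform(lo[j % dim], hi[j % dim]) for j in range(n)]
           for _ in range(pop_size)]
    costs = [cost(v) for v in pop]
    for _ in range(iters):
        for i in range(pop_size):
            # Every other member is equally likely to be drawn as a parent.
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            forced = rng.randrange(n)  # at least one gene comes from the mutant
            trial = []
            for j in range(n):
                if rng.random() < cr or j == forced:
                    v = pop[a][j] + f * (pop[b][j] - pop[c][j])
                    v = min(max(v, lo[j % dim]), hi[j % dim])
                else:
                    v = pop[i][j]
                trial.append(v)
            trial_cost = cost(trial)
            if trial_cost <= costs[i]:  # greedy replacement keeps good solutions
                pop[i], costs[i] = trial, trial_cost
    best = min(range(pop_size), key=costs.__getitem__)
    return [pop[best][i * dim:(i + 1) * dim] for i in range(k)], costs[best]
```

With `k=2`, each candidate encodes one centroid for the spam cluster and one for the legitimate cluster, and a message is assigned to whichever centroid is closer. The greedy replacement step is one way to read the "memory" property mentioned above, since a member is only overwritten by a trial vector that is at least as good.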
3.4 Supervised Machine Learning
Fig. 5 The percentage of legitimate messages and spam detected using the clustering method
During the clustering phase, the system must decide, for each incoming message, whether to place it in the spam cluster or in the cluster of legitimate messages. Supervised machine learning modules are trained classifiers that make this decision. There are two candidates for this classification: support vector machines and decision trees.
3.4.1 Support Vector Machine
Support vector machines try to find the best classification and separation of the data using support vectors. In this method, only the data points that act as support vectors are used to construct the model, and the algorithm is not sensitive to the other data. SVM aims to find the line that separates the data with the greatest possible distance from the nearest points of each category, i.e., the support vectors. When two categories of data overlap, they cannot be separated by a single line. In that case, the data can be mapped by a kernel function to another space in which they are separable by SVM. Support vector machines are very powerful for categorizing and dividing data. Since the data in the proposed system are not linearly separable, SVM is used in a nonlinear space with the Gaussian kernel.
The dataset is divided into two classes, A and B, as shown in Fig. 6. The support vectors are placed well, giving the best classification results. Using SVM with the Gaussian kernel improved the detection rate of the system to 89.8% for spam messages and 91.2% for legitimate messages, as depicted in Fig. 7.
Fig. 6 Data separation by SVM using the Gaussian kernel (axes: x1, x2; classes: A and B)
Fig. 7 The detection rate of the proposed system using SVM
3.4.2 Decision Tree
In a decision tree, a sample is classified by moving from the root to a leaf. Each internal (nonleaf) node is associated with an attribute and poses a question about the input sample. Each internal node has one branch for each possible answer to its question. Each leaf is labeled with a class. Since the tree shows the decision-making process that determines the class of an input sample, it is known as a decision tree. Decision tree learning is a method for approximating discrete-valued target functions. It is robust to noise and is able to learn disjunctions of conjunctive expressions. It is one of the best-known inductive learning algorithms and has been used successfully in various applications. The proposed system has also carried out the clustering and classification of the data using a decision tree. After classification, the detection rates of spam and nonspam messages were 70.8% and 92.5%, respectively. The decision tree is shown in Fig. 8, and the detection rate of the proposed system for spam and nonspam messages using the decision tree is shown in Fig. 9.
Fig. 8 The decision tree
Fig. 9 The detection rate of the proposed system using decision tree
Fig. 10 The design of the proposed system
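The root-to-leaf classification described for the decision tree can be sketched in Python as follows. This is an illustration only: the `Node` and `Leaf` types, the feature indices, and the thresholds are hypothetical and do not reconstruct the tree of Fig. 8; the fractional leaf value merely assumes that a leaf may store the share of spam among the training samples that reached it.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    spam_score: float  # assumed: fraction of training samples at this leaf that were spam


@dataclass
class Node:
    feature: int                      # index into the feature vector
    threshold: float                  # the question asked: is x[feature] < threshold?
    below: Union["Node", Leaf]        # branch taken when the answer is yes
    at_or_above: Union["Node", Leaf]  # branch taken when the answer is no


def classify(tree, x):
    """Walk from the root to a leaf, answering one question per internal node."""
    while isinstance(tree, Node):
        tree = tree.below if x[tree.feature] < tree.threshold else tree.at_or_above
    return tree.spam_score


# Hypothetical two-level tree over a normalised feature vector.
tree = Node(0, 0.199,
            Node(2, 0.042, Leaf(0.0), Leaf(1.0)),
            Leaf(0.75))

print(classify(tree, [0.1, 0.0, 0.05]))  # follows below, then at_or_above
print(classify(tree, [0.3, 0.0, 0.0]))   # follows at_or_above at the root
```

Classification costs one comparison per level of the tree, which is consistent with the better time complexity attributed to the decision tree in Sect. 4.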
4 Design of the Proposed System
Figure 10 shows the design of the proposed system. Over time, all comments of the dataset are put in their corresponding clusters. When the system receives new comments, the clusters are first updated with minimum computational overhead. Very small messages, whose size is less than 20 shingles, are forwarded in the first stage to the nonspam cluster and are displayed, because according to the evaluations such messages cannot be spam. For example, messages that contain only a few words or stickers are not considered spam. Comments that contain only a URL are checked for spam and, after the listed features have been checked, are directed to their cluster. So that the proposed system achieves the best efficiency in correctly identifying spam and legitimate messages, the evaluation is carried out with three methods: clustering-based detection, support vector machine, and decision tree. The results show that clustering with the DB index performed best in correctly identifying legitimate messages, while the support vector machine (SVM) performed best in correctly identifying spam messages. Therefore, to achieve high accuracy, the proposed system combines clustering using the DB index with support vector machines. Figure 11 shows the spam detection method of the proposed system. The decision tree, which has poorer detection accuracy than the other two methods, has the advantage of better time complexity. When the time complexity of the detection process is the evaluation criterion, the decision tree is the best option. If detection accuracy is more important, however, combining SVM and clustering is the best option.
Fig. 11 The spam detection method of the proposed system
Fig. 12 Distribution of the similarities between spam and legitimate messages for Facebook's data (x-axis: message resemblance; y-axis: message pairs)
Clustering in this system is governed by a cluster similarity threshold [33]. Hence, the similarity threshold used in the clustering is an influential parameter. The similarity of messages is calculated using predetermined shingles that represent the messages. To determine the threshold, we compared the similarity of legitimate comments and spam messages in the Facebook data. The result of this comparison is depicted in Fig. 12. As shown in Fig. 12, unless the threshold is too large or too small, changing it has little effect on the clustering; this is why the threshold value is set to 0.6 in the proposed system.
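The shingle-based similarity test can be sketched in Python as follows. The text does not specify how shingles are built or compared, so this example assumes word-level shingles of width 3 and the Jaccard resemblance of the two shingle sets; the function names and the `width` parameter are hypothetical, and only the 0.6 threshold comes from the text.

```python
def shingles(text, width=3):
    """Set of overlapping word n-grams (assumed shingle definition)."""
    words = text.lower().split()
    if len(words) < width:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + width]) for i in range(len(words) - width + 1)}


def resemblance(a, b, width=3):
    """Jaccard resemblance of the two shingle sets, a value in [0, 1]."""
    sa, sb = shingles(a, width), shingles(b, width)
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def same_cluster(a, b, threshold=0.6):
    """Two messages are grouped when their resemblance reaches the threshold."""
    return resemblance(a, b) >= threshold


first = "win a free phone now click here"
second = "win a free phone now click there"
print(resemblance(first, second), same_cluster(first, second))
```

In this example the two messages share four of six distinct shingles, so a near-duplicate spam template with one word changed still clears the 0.6 threshold, while unrelated messages share no shingles and score 0.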
5 Conclusions
Spammers' attacks and potential threats to the safety of online social networking environments are serious concerns for the users of these platforms. This paper proposed an efficient system for filtering spam messages on Facebook, one of the largest and most popular social networks, using a PSO-based feature selection method and a combination of supervised and unsupervised classification techniques. In the proposed system, an efficient clustering technique using the DB index, as an unsupervised machine learning method, is combined with two supervised classification techniques: SVM, to achieve more precise detection, and the decision tree, to attain better time complexity for the detection process. The experimental results show that the proposed method attains a very good detection rate.
References
1. Heydari, A.; Tavakoli, M.A.; Salim, N.; Heydari, Z.: Detection of review spam: a survey. Computer 42, 3634–3642 (2015)
2. Zheng, X.; Zeng, Z.; Chen, Z.; Yu, Y.; Rong, C.: Detecting spammers on social networks. Neurocomputing 159, 27–34 (2015)
3. Sohrabi, M.K.; Akbari, S.: A comprehensive study on the effects of using data mining techniques to predict tie strength. Comput. Hum. Behav. 60, 534–541 (2016)
4. Sohrabi, M.K.; Barforoush, A.A.: Efficient colossal pattern mining in high dimensional datasets. Knowl. Based Syst. 33, 41–52 (2012)
5. Sohrabi, M.K.; Barforoush, A.A.: Parallel frequent itemset mining using systolic arrays. Knowl. Based Syst. 37, 462–471 (2013)
6. Sohrabi, M.K.; Ghods, V.: Top-down vertical itemset mining. In: Sixth International Conference on Graphic and Image Processing (ICGIP 2014), pp. 94431V–94431V7 (2014)
7. Sohrabi, M.K.; Ghods, V.: CUSE: a novel cube-based approach for sequential pattern mining. In: 4th International Symposium on Computational and Business Intelligence (ISCBI), pp. 186–190 (2016)
8. Sohrabi, M.K.; Marzooni, H.H.: Association rule mining using new FP-linked list algorithm. J. Adv. Comput. Res. 7(1), 23–34 (2016)
9. Sohrabi, M.K.; Roshani, R.: Frequent itemset mining using cellular learning automata. Comput. Hum. Behav. 68, 244–253 (2017)
10. Sohrabi, M.K.; Ghods, V.: Materialized view selection for a data warehouse using frequent itemset mining. JCP 11(2), 140–148 (2016)
11. Sohrabi, M.K.; Azgomi, H.: TSGV: a table-like structure based greedy method for materialized view selection in data warehouse. Turk. J. Electr. Eng. Comput. Sci. 25(4), 3175–3187 (2017)
12. Sohrabi, M.K.; Azgomi, H.: Parallel set similarity join on big data based on locality-sensitive hashing. Sci. Comput. Program. 145, 1–12 (2017)
13. Sohrabi, M.K.; Tajik, A.: Multi-objective feature selection for warfarin dose prediction. Comput. Biol. Chem. 69, 126–133 (2017)
14. Arab, M.; Sohrabi, M.K.: Proposing a new clustering method to detect phishing websites. Turk. J. Electr. Eng. Comput. Sci. (2017). doi:10.3906/elk-1612-279
15. Huber, M.; Mulazzani, M.; Kitzler, G.; Goluch, S.; Weippl, E.: Friend-in-the-middle attacks. Exploiting social networking sites for spam. IEEE Internet Comput. 15(3), 28–34 (2011)
16. Abu-Nimeh, S.; Chen, T.M.; Alzubi, O.: Malicious and spam posts in online social networks. IEEE Comput. 44(9), 23–28 (2011)
17. Yu, D.; Chen, N.; Jiang, F.; Fu, B.; Qin, A.: Constrained NMF-based semi-supervised learning for social media spammer detection. Knowl. Based Syst. 125, 64–73 (2017)
18. Yong, Z.; Wei, G.; Wan-qiu, Z.: Feature selection of unreliable data using an improved multi-objective PSO algorithm. Neurocomputing 171, 1281–1290 (2016)
19. Roberto, H.W.; George, D.C.; Renato, F.C.: A global-ranking local feature selection method for text categorization. Expert Syst. Appl. 39(17), 12851–12857 (2012)
20. Esseghir, M.A.; Goncalves, G.; Slimani, Y.: Adaptive particle swarm optimizer for feature selection. In: Proceedings of the 11th International Conference on Intelligent Data Engineering and Automated Learning, LNCS 6283, pp. 226–233 (2011)
21. Oh, I.S.; Lee, J.S.; Moon, B.R.: Hybrid genetic algorithms for feature selection. IEEE Trans. Pattern Anal. Mach. Intell. 26(1), 1424–1437 (2004)
22. Lin, S.W.; Lee, Z.J.; Chen, S.C.; Tseng, T.Y.: Parameter determination of support vector machine and feature selection using simulated annealing approach. Appl. Soft Comput. 8(4), 1505–1512 (2008)
23. Toolan, F.; Carthy, J.: Feature selection for spam and phishing detection. In: eCrime Researchers Summit (eCrime). IEEE (2010)
24. Diale, M.; Walt, C.V.D.; Celik, T.; Modup, A.: Feature selection and support vector machine hyper-parameter optimization for spam detection. In: Pattern Recognition Association of South Africa and Robotics and Mechatronics International Conference. IEEE (2016)
25. Lee, S.M.; Kim, D.S.; Kim, J.H.; Park, J.S.: Spam detection using feature selection and parameters optimization. In: International IEEE Conference on Complex, Intelligent and Software Intensive Systems (CISIS), pp. 883–888 (2010)
26. Zheng, X.; Zeng, Z.; Yu, Y.; Kechadi, T.; Rong, C.: ELM-based spammer detection in social networks. Supercomputing 72(8), 2991–3005 (2016)
27. Zhang, Y.; Wang, S.; Phillips, P.; Ji, G.: Binary PSO with mutation operator for feature selection using decision tree applied to spam detection. Knowl. Based Syst. 64, 22–31 (2014)
28. Adewole, K.S.; Anuar, N.B.; Kamsin, A.; Varathan, K.D.; Razak, S.A.: Malicious accounts: dark of social networks. Netw. Comput. Appl. 79, 41–67 (2017)
29. Ahmad, F.; Abulaish, M.: A generic statistical approach for spam detection in online social networks. Comput. Commun. 36(10), 1120–1129 (2013)
30. Sohrabi, M.K.; Karimi, F.: A clustering based feature selection approach to detect spam in social networks. Int. J. Inf. Commun. Technol. Res. 7(4), 27–33 (2015)
31. Gupta, A.; Kaushal, R.: Improving Spam Detection in Online Social Networks. Indira Gandhi Delhi Technical University for Woman, Delhi (2015)
32. Yu, X.; Achan, F.; Panigrahy, K.; Hulten, R.; Andosipkov, G.: Spamming botnets: signatures and characteristics. In: Proceeding of SIGCOMM (2008)
33. Gao, H.; Chen, Y.; Lee, K.: Towards online spam filtering in social networks. In: 19th Annual Network & Distributed System Security Symposium (2012)
34. Forsati, R.; Keikha, A.; Shamsfard, M.: An improved bee colony optimization algorithm with an application to document clustering. Neurocomputing 159, 9–26 (2015)
35. Leung, Y.; Zhang, J.; Xu, Z.: Clustering by scale-space filtering. IEEE Trans. Pattern Anal. Mach. Intell. 22(12), 1396–1410 (2000)
36. Halkidi, M.; Vazirgiannis, M.: Clustering validity assessment: finding the optimal partitioning of a dataset. In: Proceedings of IEEE ICDM, San Jose, CA, pp. 187–194 (2001)
37. Das, S.; Abraham, A.; Konar, A.: Automatic clustering using an improved differential evolution algorithm. IEEE Trans. Syst. Man Cybern. Part Syst. Hum. 38(1), 218–237 (2008)
38. Liu, S.; Zhang, J.; Xiang, Y.: Statistical detection of online drifting twitter spam. In: Proceedings of the 11th ACM on Asia Conference on Computer and Communications Security (2016)