Essay / Review Paper on Spam Detection URL Filtering and Image Spam Filtering Using Machine Learning
Table of Contents
Introduction
Related Work
Proposed Idea
Idea 1: Pseudo-OCR for Image Spam Filtering
Idea 2: Keypoint-Based Character Feature
Idea 3: Filtering Image Spam
Idea 4: Detecting Spam URLs Using the SVM Algorithm
Conclusion

The growing volume of harmful content on social media requires automated methods to detect and remove it. This article describes a supervised machine learning classification model built to detect the distribution of malicious content in online social networks and media (OSN/OSM). Multi-source features were used to detect social media posts containing malicious Uniform Resource Locators (URLs). These URLs can direct users to websites hosting malicious content, drive-by download attacks, phishing, spam, and scams. For the data collection stage, the Twitter streaming application programming interface (API) was used, and VirusTotal was used to label the dataset. A random forest classification model was used with a combination of features derived from various sources.

Phishing is the fraudulent practice of sending emails in order to obtain a user's personal data, login credentials, and other confidential information. It targets private information such as passwords, bank account details, credit card numbers, and financial usernames, which can then be abused by an attacker. We aim to use the fundamental visual characteristics of a web page's appearance as the basis for page similarity detection, and we propose a new solution to detect phishing web pages effectively. Layout and content are fundamental characteristics of the appearance of a web page. Since the standard way of specifying layout is the style sheet (CSS), we develop an algorithm to detect similarities in key CSS-related elements.
In this paper, we propose a system that combines the SVM technique with image spam filtering and MapReduce-based spam filter training to achieve higher accuracy in detecting spam URLs and image spam. After further investigation and the application of parameter tuning and feature selection methods, we were able to improve the performance of the classifier.

Introduction

The main challenge for social media security administrators is not only to protect the management system and database of the social networks, but also to protect OSN users against exposure to malicious content distributed on these networks. 60% of social media users have received or been exposed to malicious content such as spam, scams, and drive-by downloads. A number of OSNs are currently developing malicious content detection systems for such attacks. For example, Facebook's immune system detects suspicious activities such as like-jacking, social bots, and fake content.

Stealing a user's personal data through deception is called phishing. Such theft aims to obtain sensitive information such as passwords, bank account details, phone numbers, or credit card numbers. Phishing uses spoofed emails that look exactly like genuine emails. These emails are sent to a large number of users and appear to come from legitimate sources such as banks, e-commerce sites, and payment gateways. The creators of the illegitimate websites make them look exactly like the legitimate sites so that users cannot easily tell the difference. Phishing attackers use different social engineering tactics to lure users, for example attractive offers for simply visiting the site.

A malicious URL is a URL created for malicious purposes. These purposes include downloading any type of malware to the affected computer, delivering spam or phishing messages that contain the URL, and even improving the URL's position in search engines using black-hat SEO techniques.
The intelligent malicious URL detection system is an anti-phishing technique that protects the user's web experience. Our approach uses lexical features, host-based features, and site popularity features of a website to detect suspicious or phishing websites. These features are obtained from the source code by taking the URL as input, and they are then passed to the classifier algorithm. The results of our experiment show that the proposed methodology is very effective in preventing such attacks; performance was measured using a confusion matrix for all classifiers.

Related Work

The majority of studies in this area aim to find the most predictive features that can be acquired and the best algorithm for building a classifier model. Researchers in this field mainly focus on finding new features with high discriminatory power, in addition to identifying the most accurate machine learning model. Finding highly discriminating features in Internet and social media security is a challenge because of the diversity of attacks and techniques used by spammers. Because spammers are inventive, detection systems are bypassed after a while, and the set of features used for spam detection must be revised regularly. Just as security researchers study attacks, spammers and hackers investigate detection systems; they may therefore modify user properties, content, or the distribution mechanism to circumvent particular restrictions or detection rules.

For example, a study on spam detection on Twitter found that the number of followers is one of the features with the highest discriminating power. The discriminatory power of this feature has been increasingly weakened by spammers who have made their accounts more popular. They do this by running spam campaigns in which their fake accounts connect to other fake accounts, thereby increasing the number of followers and follows. Burnap et al.
used a completely different method to detect malicious URLs. They deployed a high-interaction honeynet to collect system state changes, such as packets sent and received and CPU usage. The training dataset contained 2,000 examples with a 1:1 spam/non-spam ratio. Ten attributes were used to create a classifier reflecting system state changes after the tweet URL was opened. Burnap et al. studied the shortest time needed to give a preliminary warning of the existence of malicious content at a particular URL. The best result was obtained for the Multi-Layer Perceptron (MLP) using the features acquired after 210 seconds (an F-measure of 0.723). The features used by Burnap et al. require complex data analysis; however, they prevent spammer sites from hiding their true nature.

Although recent literature has compared several algorithms, there is a lack of information on the important steps in building a machine learning model. In particular, little information is provided on how feature selection methods are handled and how parameter tuning is performed. We address this issue in Section IV.

Related work also presents a method that combines a fingerprinting technique with big data processing to detect spam emails. Support Vector Machine (SVM) is the machine learning technique used for spam filtering. SVM training is computationally intensive, so the MapReduce platform is used for spam filter training. That work uses content-based spam filtering: the classification of emails as spam or ham is based on the data present in the email content, so the header section is ignored. It specifically includes a comparison between implementations of the Fisher-Robinson inverse chi-square function, the AdaBoost classifier, and the KNN classifier.
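For readers unfamiliar with the Fisher-Robinson inverse chi-square function named above, the following is a minimal sketch of how it is commonly described for content-based filters: per-token spam probabilities are combined into one score in [0, 1]. The function names, the clamping constant, and the example probabilities are assumptions for illustration; the text does not specify the implementation used in the cited comparison.

```python
import math
from typing import Sequence


def chi2_survival(x2: float, dof: int) -> float:
    """P(X >= x2) for a chi-square variable with an even number of degrees of freedom."""
    if dof <= 0 or dof % 2:
        raise ValueError("dof must be a positive even integer")
    m = x2 / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def fisher_robinson_score(token_probs: Sequence[float]) -> float:
    """Combine per-token spam probabilities; values near 1 suggest spam, near 0 ham."""
    n = len(token_probs)
    if n == 0:
        return 0.5  # no evidence either way (assumed convention)
    eps = 1e-9  # assumed clamp to keep log() finite
    probs = [min(max(p, eps), 1.0 - eps) for p in token_probs]
    ln_p = sum(math.log(p) for p in probs)
    ln_q = sum(math.log(1.0 - p) for p in probs)
    h = chi2_survival(-2.0 * ln_p, 2 * n)  # close to 1 when tokens look spammy
    s = chi2_survival(-2.0 * ln_q, 2 * n)  # close to 1 when tokens look hammy
    return (1.0 + h - s) / 2.0


if __name__ == "__main__":
    print(fisher_robinson_score([0.99, 0.95, 0.9]))  # high score, spam-like
    print(fisher_robinson_score([0.05, 0.1, 0.02]))  # low score, ham-like
```

A score near 0.5 means the evidence is balanced or weak, which is why filters of this kind often use an "unsure" band around the midpoint.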
Proposed Idea

This section describes in detail the main steps of this study, starting with data collection and labeling of the dataset, followed by a brief comparison of the most common techniques used in related studies. The main objective of the system is not only to protect the management system and database of social networks, but also to protect OSN users from exposure to malicious content distributed on these networks, because many social media users have received or been exposed to such content.

Idea 1: Pseudo-OCR for Image Spam Filtering

Techniques for producing image spam make spam images more similar to legitimate (ham) images, and thus more difficult to identify directly from image characteristics without any content information. Even worse, for some advanced applications, the process of filtering spam images requires more contextual information than just a filter result. We therefore believe that it is essential for an anti-spam system to obtain information about the content of the current image, which apparently could only be obtained by traditional OCR-based methods. However, given the drawbacks discussed above, traditional OCR is not the best choice. The idea of pseudo-OCR is therefore proposed to avoid these defects while still extracting sufficient information about the content.

Compared with traditional OCR, the proposed pseudo-OCR offers the following improvements for Chinese image spam filtering. First, pseudo-OCR places lighter requirements on the character recognizer: it is enough to determine whether or not a given character feature belongs to a spam image, rather than to recognize the character. Second, pseudo-OCR can efficiently process a very wide range of images, even those with complex backgrounds and human-made interference that are typically difficult to handle with traditional OCR-based methods.
Finally, for Chinese character recognition, the proposed pseudo-OCR generates model features from a set of training images instead of a set of standard Chinese characters.

Feedback gives the system the learning ability to maintain high performance over a long time. It is well known within anti-spam communities that spammers tend to change their spam image templates over time, which leads to inevitable performance degradation in methods based on near-duplicate detection. Although the proposed methods are not strictly based on near-duplicate detection, they adopt a similar methodology to extract template character features from known spam images. To handle this predictable degradation, a feedback mechanism is introduced into our system. By using detected spam as an additional source of template characters, outdated template character features can be replaced with new ones, thereby maintaining better performance.

Idea 2: Keypoint-Based Character Feature

To meet the requirements of pseudo-OCR, the extracted Chinese character feature must also be adapted. Considering only certain key points of a character, we designed a new character feature that is probably unsuitable for traditional character recognition but nevertheless preserves enough content information for pseudo-OCR. Extraction of this feature is essentially a two-phase procedure. In the first phase, the key points and their connectivity information are extracted and stored as an adjacency matrix using a DFS-based algorithm. In the second phase, the actual feature is calculated from this adjacency matrix. To identify image spam using this feature, each character feature extracted from a given image is compared to those in the model to determine its category, and the distribution of the categories of all these characters is then used for the final judgment.
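The first phase described above can be sketched as follows. This is only an illustration under stated assumptions, not the authors' algorithm: the character is assumed to be already thinned to one-pixel-wide strokes on a small grid, key points are assumed to be stroke pixels whose number of 4-connected stroke neighbours is not two (endpoints and junctions), and the helper names are hypothetical. The second phase, which turns the adjacency matrix into the final feature, is not specified in the text and is left out.

```python
from typing import Dict, List, Tuple

Point = Tuple[int, int]
# Assumed 4-connectivity: up, down, left, right.
STEPS = [(-1, 0), (1, 0), (0, -1), (0, 1)]


def stroke_neighbors(grid: List[str], r: int, c: int) -> List[Point]:
    """Stroke pixels ('#') directly adjacent to (r, c)."""
    out = []
    for dr, dc in STEPS:
        rr, cc = r + dr, c + dc
        if 0 <= rr < len(grid) and 0 <= cc < len(grid[rr]) and grid[rr][cc] == "#":
            out.append((rr, cc))
    return out


def find_keypoints(grid: List[str]) -> List[Point]:
    """Assumed definition: stroke pixels that are not plain pass-through pixels."""
    return [
        (r, c)
        for r, row in enumerate(grid)
        for c, ch in enumerate(row)
        if ch == "#" and len(stroke_neighbors(grid, r, c)) != 2
    ]


def keypoint_adjacency(grid: List[str]) -> Tuple[List[Point], List[List[int]]]:
    """DFS along strokes from each key point; link it to the first key points reached."""
    keypoints = find_keypoints(grid)
    index: Dict[Point, int] = {p: i for i, p in enumerate(keypoints)}
    n = len(keypoints)
    adj = [[0] * n for _ in range(n)]
    for start in keypoints:
        stack, seen = [start], {start}
        while stack:
            cur = stack.pop()
            for nxt in stroke_neighbors(grid, *cur):
                if nxt in seen:
                    continue
                seen.add(nxt)
                if nxt in index:  # reached another key point: record edge, stop here
                    adj[index[start]][index[nxt]] = 1
                    adj[index[nxt]][index[start]] = 1
                else:  # ordinary stroke pixel: keep walking
                    stack.append(nxt)
    return keypoints, adj


if __name__ == "__main__":
    plus = ["..#..", "..#..", "#####", "..#..", "..#.."]
    points, matrix = keypoint_adjacency(plus)
    print(points)
    for row in matrix:
        print(row)
```

For the plus-shaped example, the four stroke ends and the central crossing are the key points, and each end is linked only to the centre. A closed loop without junctions yields no key points under this definition, which is one reason a real system would need a richer key point rule.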
Idea 3: Image Spam Filtering

Using the feature extraction described above, any input image is converted into a set of 20-dimensional keypoint-based character features. To use these features for filtering image spam, their category information must be obtained first. For a given character feature, the minimum L1 distance between it and all the features in the model is calculated and compared to a threshold to determine its category. This threshold is called the category threshold, to distinguish it from the predefined threshold introduced next.

Given the category information of all of an image's features, their distribution is used to make the final judgment. All model character features in our implemented system fall into two categories, spam or ham. By comparing the image's spam feature ratio with a predefined threshold, we can determine whether or not it is a spam image. This predefined threshold is calculated during training as the minimum spam feature ratio over all training spam images. In our system, a minimum threshold of 0.25 is selected from a total of 82 training spam images. The experiment results show that our proposed Chinese image spam filtering system using pseudo-OCR generally achieves better performance compared to the.
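The filtering rule of this section can be sketched in a few lines. This is a sketch under assumptions, not the implemented system: a feature is assumed to take the label of its nearest model feature when the L1 distance is within the category threshold, and to stay unmatched otherwise, with unmatched features still counted in the denominator of the spam ratio. The function names, the toy 3-dimensional vectors (the text uses 20 dimensions), and the category threshold value are hypothetical; only the 0.25 ratio threshold comes from the text.

```python
from typing import List, Optional, Sequence, Tuple

Feature = Sequence[float]
Template = Tuple[Feature, str]  # (feature vector, "spam" or "ham")


def l1_distance(a: Feature, b: Feature) -> float:
    return sum(abs(x - y) for x, y in zip(a, b))


def categorize(feature: Feature, model: List[Template],
               category_threshold: float) -> Optional[str]:
    """Label of the nearest model feature, or None if it is farther than the threshold."""
    best_dist, best_label = min(
        ((l1_distance(feature, tmpl), label) for tmpl, label in model),
        key=lambda pair: pair[0],
    )
    return best_label if best_dist <= category_threshold else None


def spam_ratio(features: List[Feature], model: List[Template],
               category_threshold: float) -> float:
    """Share of an image's character features categorized as spam."""
    if not features:
        return 0.0
    labels = [categorize(f, model, category_threshold) for f in features]
    return labels.count("spam") / len(features)


def learn_ratio_threshold(spam_images: List[List[Feature]], model: List[Template],
                          category_threshold: float) -> float:
    """Minimum spam feature ratio over the training spam images."""
    return min(spam_ratio(img, model, category_threshold) for img in spam_images)


def is_spam_image(features: List[Feature], model: List[Template],
                  category_threshold: float, ratio_threshold: float = 0.25) -> bool:
    return spam_ratio(features, model, category_threshold) >= ratio_threshold


if __name__ == "__main__":
    model = [((0, 0, 0), "spam"), ((10, 10, 10), "ham")]
    image = [(0, 1, 0), (1, 0, 0), (10, 9, 10), (50, 50, 50)]
    print(spam_ratio(image, model, category_threshold=3))     # 0.5
    print(is_spam_image(image, model, category_threshold=3))  # True
```

Taking the minimum ratio over training spam images makes every training spam image pass the test by construction, at the cost of a threshold that a single atypical training image can pull down.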