Improving Service Availability in Cloud Systems

Table of Contents

    A Survey on Improving Service Availability of Cloud Systems
        Review
        Personal Insights
    A Survey of Drizzle: Fast and Adaptable Stream Processing at Scale
        Description
        Review
        Personal Insights
    A Survey of Soteria: Automated IoT Security and Safety Analysis
        Description
        Review
        Personal Insights
    Conclusion
    References

A Survey on Improving Service Availability of Cloud Systems

Cloud computing is a model of computing that uses a network of remote servers hosted on the Internet to deliver shared services. It has gained popularity because it reduces complexity and cost: there is no need to purchase the IT infrastructure, hardware, or licenses required to run a physical computer network [9]. Instead, cloud systems use large numbers of physical disk drives as their primary storage component.

Many technology companies such as Amazon, Google, and Microsoft have built their online services on cloud computing, and these services are used by millions of people at any given time. For this reason, service availability is of the utmost importance. Yet despite high expectations for availability, these services are still prone to outages, leading to customer dissatisfaction and loss of revenue [9]. Such failures can result from a variety of hardware issues, but the most significant is disk failure. Large-scale cloud systems typically use several hundred million disk drives, and 20-57% of disks have experienced at least one error over a 4-6 year span [9]. These figures underline how common disk errors are and why predicting disk failures is central to improving the service availability of cloud services.

To achieve this, many solutions have been proposed that use historical data from disk-level sensors (SMART data) to predict disk failures and take preventative measures, such as replacing failure-prone disks. These proposed approaches focus on predicting complete disk failures [9]. Unfortunately, before failure-prone disks can be proactively replaced, a number of disk errors may already occur and negatively affect upper-layer services. These errors, called "gray failures," typically go unnoticed while degrading the quality of cloud software. The paper introduces CDEF (Cloud Disk Error Forecasting), an innovative approach to proactive disk error prediction that uses both SMART data and system-level signals in order to better detect these gray failures [9]. The approach was evaluated on data from Microsoft production cloud systems and was found to improve on baseline methods, reducing Microsoft Azure virtual machine downtime by sixty-three thousand minutes per month.

Review

The authors faced two major challenges when designing the CDEF prediction model for a large-scale cloud computing service. Industrial cloud systems such as Microsoft Azure require an enormous number of disk drives, which leads to the first challenge: only about three disks out of ten thousand become faulty on any given day [9]. With such a low failure rate, a prediction model could simply classify all disks as healthy and still appear to minimize its error. Other approaches have used rebalancing techniques to address this issue and improve their results, but rebalancing introduces false positives, which ultimately reduces the usefulness of a prediction model [9].
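To see why plain classification accuracy is misleading here, consider a minimal sketch (illustrative Python, not code from the CDEF paper; the 3-in-10,000 rate is the figure cited above): a model that labels every disk as healthy scores nearly perfect accuracy while catching no failures at all.

```python
# Illustrative sketch (not from the CDEF paper): with roughly 3 faulty
# disks out of every 10,000, a model that predicts "healthy" for every
# disk still scores ~99.97% accuracy while catching zero failures.

n_disks = 10_000
n_faulty = 3  # approximate daily rate cited in the survey

true_labels = [1] * n_faulty + [0] * (n_disks - n_faulty)   # 1 = faulty
predictions = [0] * n_disks                                  # trivial model: all healthy

accuracy = sum(p == t for p, t in zip(predictions, true_labels)) / n_disks
recall = sum(p == 1 and t == 1 for p, t in zip(predictions, true_labels)) / n_faulty

print(f"accuracy = {accuracy:.4%}")              # ~99.97% despite missing every faulty disk
print(f"recall on faulty disks = {recall:.0%}")  # 0%
```

This is exactly the trap that rebalancing tries to patch and that CDEF instead sidesteps by ranking disks rather than classifying them.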
Another challenge comes from using historical data to make predictions. Some of the information used (especially system-level signals) is both time- and environment-sensitive, meaning that the data describing a specific drive changes constantly over its lifetime in a cloud environment. Prediction models can look accurate on held-out test datasets, yet perform far worse in practice when predicting over future data [9].

The authors achieve the goal presented in the paper by overcoming these difficulties with two new components: an error-proneness ranking system for disk drives and a feature selection tool that determines which SMART and system-level features best distinguish a healthy disk from an error-prone one [9]. With this feature selection step, CDEF can filter a multitude of SMART and system-level signals and identify which ones are most useful for separating healthy disks from unhealthy ones [9]. Supplying the prediction model with a filtered set of historical data containing only relevant features lets it focus on the important characteristics of a drive so that gray failures do not go unnoticed. Rather than taking the simple approach used by existing systems and classifying each disk as faulty or not, CDEF ranks disks by their error-proneness [9]. The previously mentioned problem of unbalanced datasets is greatly alleviated because this ranking perspective does not hinge on the class imbalance: since most disks are healthy, ranking lets the system identify which healthy disks are the best candidates rather than merely labeling them.

The real novelty of this work lies in these two solutions and the way they build on each other. Each alone is an improvement over other approaches, but the combination of the feature selection method and the ranking model yields more accurate and cost-effective results than existing methods. Although cross-validation in other methods reports better numbers than the CDEF evaluation, the CDEF results better reflect how the prediction model behaves in actual deployment, because cross-validation does not account for the temporal sensitivity of the disk data. Moreover, the CDEF approach has already been applied to the Microsoft Azure cloud service [9] and has proven effective at selecting healthy disks for the service. Since many issues affect the maintainability of cloud systems, the authors' work is significant both for highlighting the existing problems and for implementing a solution to one of the most serious ones.

Personal Insights

The authors of the CDEF approach do achieve the goal stated at the beginning of the paper: to develop online prediction software capable of distinguishing between healthy and unhealthy disk drives in a cloud system in order to improve service availability. Building this software required adopting machine learning techniques, such as the FastTree algorithm [5] used in CDEF's ranking function. The algorithm is particularly interesting because it is available through a Microsoft-maintained Python library, and this prediction model was tested on a dataset provided by Microsoft Azure systems [6].
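As a rough illustration of the ranking idea described above (not the paper's actual pipeline), the sketch below uses scikit-learn's GradientBoostingClassifier as a stand-in for the FastTree ranker; the feature values, labels, and replacement budget are all synthetic and hypothetical.

```python
# Hypothetical sketch of CDEF-style ranking: score every disk by its
# error-proneness and rank the fleet, instead of labeling each disk
# faulty/healthy. GradientBoostingClassifier stands in for FastTree;
# the feature values and faulty-disk labels below are synthetic.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)

# Toy historical data: a few selected SMART / system-level features per disk
# (e.g. reallocated sectors, seek errors, I/O latency) after feature selection.
n_train = 5_000
X_train = rng.normal(size=(n_train, 4))
y_train = np.zeros(n_train, dtype=int)
y_train[:50] = 1                      # a handful of disks known to have erred
X_train[:50] += 2.0                   # toy signal: error-prone disks drift upward

model = GradientBoostingClassifier().fit(X_train, y_train)

# Score the current fleet and rank it from most to least error-prone.
X_fleet = rng.normal(size=(1_000, 4))
error_scores = model.predict_proba(X_fleet)[:, 1]
ranked = np.argsort(error_scores)[::-1]          # highest risk first

budget = 10                                      # proactive-replacement budget
print("candidates for proactive replacement:", ranked[:budget])
print("healthiest disks for new allocations:", ranked[-budget:][::-1])
```

Ranking against a fixed budget matches the operational setting described in the review: operators replace the top-ranked disks and place new virtual machines on the lowest-ranked ones.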
This reliance on Microsoft tooling presents some problems for a cloud system such as Apple's iCloud, which cannot easily adopt Microsoft-owned libraries because iCloud is built on the Swift programming language that Apple developed for most of its services [2]. This could be problematic for Apple's cloud services, as their service availability might fall behind that of Microsoft Azure, Google Cloud, and Amazon AWS if the CDEF approach becomes more widely adopted. The authors mention in the conclusion that there are many ways to extend this work; something to consider in the future would be implementing the approach in Apple's cloud computing service.

A Survey of Drizzle: Fast and Adaptable Stream Processing at Scale

Alexander Monaco, Florida International University (FIU), Miami, Florida

Description

Stream processing is a type of "Big Data" technology used to process data as it "flows" from the producing side to the receiving side. It is used for data such as stock transactions, traffic monitoring, and smart devices, or any information that must be detected and queried within a short period of time. Since data arrives quickly and in varying volumes, stream processing systems must adapt to these changes while maintaining high performance. In particular, they must sustain both high throughput (the rate at which work is completed) and low latency (the time it takes to move data between nodes) [7]. Existing approaches largely treat these requirements as mutually exclusive, producing either highly adaptable but high-latency systems, or systems with low latency during normal operation but costly adaptation. The article presents Drizzle [7], a stream processing system built on the insight that the two existing designs have features that can be combined to improve adaptability and reduce latency at the same time.

Review

The authors use the article not only to introduce Drizzle but also to analyze the two main approaches in existing solutions: continuous operator streaming (e.g., Naiad and Apache Flink) and bulk synchronous processing (e.g., Spark Streaming and FlumeJava) [7]. The paper discusses their strengths and weaknesses and the features borrowed from each to create a new approach to stream processing that is both fast and adaptable.

The first approach analyzed, bulk synchronous processing, is a popular processing framework in which parallel nodes perform local computation and then synchronize at a barrier. In stream processing, this method is adapted to create groups of processes whose processing interval is set on the order of seconds. As in the basic bulk synchronous model, the processes in a group collect data, analyze it, and then terminate at a barrier where the output of all processes in the group is emitted. This approach is beneficial because the barriers allow the streaming system to take "snapshots," recording the physical or logical state of each process, which yields high adaptability and fault tolerance [7]. However, while the approach is scalable and resilient, the time allocated to each batch cannot be driven low enough to achieve low latency; shrinking it further would leave processes spending more time communicating results with the driver than actually processing data.
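To make the micro-batch-and-barrier pattern concrete, here is a minimal, hypothetical sketch (plain Python threads, not Drizzle's or Spark Streaming's actual API): workers process their partitions of one micro-batch, meet at a barrier, and only then does the driver combine results and move on to the next batch. That per-batch round trip through the driver is the coordination overhead that puts a floor on how small the batch interval can be.

```python
# Hypothetical sketch of bulk-synchronous micro-batch processing (not the
# actual Drizzle / Spark Streaming API). Each worker processes its partition
# of a micro-batch, then all workers meet at a barrier; the driver gathers
# the partial results (and could snapshot state) before the next batch.
import threading
import queue

NUM_WORKERS = 4
results = queue.Queue()
barrier = threading.Barrier(NUM_WORKERS)

def worker(worker_id, partition):
    partial = sum(partition)           # local computation on this worker's partition
    results.put((worker_id, partial))
    barrier.wait()                     # all workers synchronize before output is emitted

def run_micro_batch(batch):
    partitions = [batch[i::NUM_WORKERS] for i in range(NUM_WORKERS)]
    threads = [threading.Thread(target=worker, args=(i, p))
               for i, p in enumerate(partitions)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Driver-side step after the barrier: combine partial results, optionally
    # take a snapshot, and schedule the next micro-batch -- the coordination
    # that bulk synchronous systems pay for on every batch.
    return sum(partial for _, partial in (results.get() for _ in range(NUM_WORKERS)))

print(run_micro_batch(list(range(100))))   # aggregated output of one micro-batch: 4950
```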
The second approach, continuous operator streaming, removes scheduling and communication with a driver and uses barriers only when necessary. When data enters the system, it is handled by operators that run as long-lived tasks. Unlike bulk synchronous processing, continuous operator streaming relies on checkpoints rather than barrier snapshots to recover from failures [7]. Overall, this approach favors speed and flexibility over fault tolerance and low-cost adaptation: if any node in the system fails, all nodes must roll back to a checkpoint and replay their work.

My fascination with Drizzle lies in its novelty and in how the features that make both approaches effective are combined. The bulk synchronous processing model is used for task scheduling and fault tolerance, while high throughput and low latency are achieved through continuous operator techniques.

Personal Insights

Of the two approaches combined in Drizzle, the one that required the most rework during implementation was the bulk synchronous processing method. Bulk synchronous processing uses barriers to simplify fault tolerance and increase adaptability. However, when trying to reduce latency, frequent barriers mean that more time is spent coordinating with a centralized driver, and that coordination becomes a bottleneck. Drizzle therefore makes creative design decisions to reduce its reliance on barriers [7]. Another work, titled "Breaking the MapReduce stage barrier," also discusses how barriers reduce performance and introduces barrier-free techniques and algorithms to maximize performance [8]. The authors plan to explore additional techniques to improve Drizzle's performance; a good start might be finding a way to implement barrier-free execution while maintaining Drizzle's level of adaptability and fault tolerance.

A Survey of Soteria: Automated IoT Security and Safety Analysis

Alexander Monaco, Florida International University (FIU), Miami, Florida

Description

The Internet of Things (IoT) is a concept that has become more important to individuals as the technologies classified under it have become more advanced. IoT broadly refers to everyday devices connected to one another digitally, such as smartphones, computers, smart cars, and smart TVs. Unfortunately, the added convenience of connected devices raises many security concerns, even though IoT technologies have advanced considerably since the concept's inception. Many technology companies have established guidelines describing how device security should be regulated [3], but there are few tools and algorithms for assessing IoT safety and security. This article presents Soteria, a static analysis system for evaluating the security and safety of IoT applications.