  • Essay / Fault Tolerance

Today there is strong demand for a highly secure virtual grid in which any resource of any cluster can be shared even when there is a fault in the system. Grid computing is a distributed computing paradigm that differs from traditional distributed computing in that it targets large-scale systems that may span organizational boundaries. In addition to the challenges of managing and scheduling applications on such systems, reliability issues arise from the unreliable nature of the network infrastructure. A failure may be caused by a link failure, a resource failure, or some other fault, and it must be tolerated for the system to operate smoothly and accurately. Such faults can be detected and recovered from using a variety of techniques, chosen to suit the situation. A proper fault detector can avoid losses due to system failure, and a reliable fault tolerance technique can keep the system from failing in the first place. Fault tolerance is therefore an important property for ensuring reliability, availability, and quality of service (QoS).

The fault tolerance mechanism used here sets job checkpoints based on the failure rate of the resource. If a resource fails, the job is restarted on another grid resource from its last successful state, using the saved checkpoint file. Selecting optimal checkpoint intervals is important for minimizing application execution time in the presence of failures. When a resource fails, a rescheduling algorithm based on the fault index moves the job from the failed resource to the available resource with the lowest fault index value and resumes it from the last saved checkpoint. This helps the job complete on time, increases throughput, and makes the grid environment more reliable.

Grid computing is a term for combining computing resources from multiple administrative domains to achieve a common goal.
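
The rescheduling rule described above (move the job to the available resource with the lowest fault index and resume from the last saved checkpoint) can be sketched roughly as follows. This is a minimal illustration, not the essay's actual algorithm: the `Resource` and `Job` classes, their fields, and the idea of measuring progress in abstract work units are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Resource:
    name: str
    fault_index: int        # hypothetical field: higher means more past failures
    available: bool = True


@dataclass
class Job:
    job_id: str
    total_work: int             # hypothetical: abstract work units needed to finish
    checkpointed_work: int = 0  # work units covered by the last saved checkpoint


def pick_replacement(resources: list[Resource], failed: Resource) -> Optional[Resource]:
    """Return the available resource with the lowest fault index, excluding the failed one."""
    candidates = [r for r in resources if r.available and r is not failed]
    return min(candidates, key=lambda r: r.fault_index, default=None)


def reschedule(job: Job, failed: Resource,
               resources: list[Resource]) -> tuple[Optional[Resource], int]:
    """Move a job off a failed resource.

    Returns the replacement resource (or None if nothing is available) and the
    amount of work still to do when resuming from the last checkpoint.
    """
    failed.available = False
    target = pick_replacement(resources, failed)
    remaining = job.total_work - job.checkpointed_work
    return target, remaining
```

For example, if a job with 100 work units and a checkpoint at 60 fails on a resource, `reschedule` returns the least fault-prone remaining resource together with the 40 units still to be executed, rather than the full 100.
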
The Grid can be viewed as a distributed system with non-interactive workloads that involve large numbers of files. Although a grid may be dedicated to a specialized application, it is more common for a single grid to be used for a variety of purposes. Grids are often built using general-purpose grid software libraries called middleware. A grid enables the sharing, selection, and aggregation of a wide variety of geographically distributed resources, including supercomputers, storage systems, data sources, and specialized devices owned by different organizations. Managing these resources is an important part of the infrastructure of a grid computing environment.

To exploit the promising potential of computing grids, fault tolerance is of fundamental importance, since resources are geographically distributed. Moreover, the probability of failure is much greater than in traditional parallel computing, and a resource failure can be fatal to job execution. Fault tolerance is the ability of a system to perform its function correctly even in the presence of faults, which makes the system more reliable. Fault tolerance services are essential for meeting QoS requirements in grid computing and must address various types of resource failures, including process failures and network failures.

One of the important parameters of a checkpointing system that provides fault tolerance is the checkpointing interval, i.e. the period at which the application is checkpointed. Smaller checkpoint intervals increase application execution overhead due to checkpointing, while longer checkpoint intervals increase failure recovery time. Therefore, the optimal checkpoint interval, the one that leads to minimum application execution time in the presence of failures, needs to be determined.

PROBLEMS:
1. If a failure occurs on one grid resource, the job is rescheduled on another resource, which can end up violating the user's QoS requirement, for example the deadline. The reason is simple: because the job is rerun, it takes longer.
2. In grid computing environments, some resources satisfy the deadline constraint but are prone to faults. In such a scenario, the grid scheduler still selects such a resource, simply because it promises to meet the requirements of the user's grid jobs. This ultimately compromises the user's QoS parameters in order to complete the job.
3. A running task must be completed within the stipulated time even if there is a fault in the system. The deadline is the major concern in a real-time system, because a task is meaningless if it does not finish before its deadline.
4. A real-time distributed system must keep end-to-end services available and must be able to withstand outages or systematic attacks without impacting customers or operations.
5. The system must scale, that is, it must handle an increasing amount of work and increase total throughput under increased load as resources are added.

An adaptive checkpointing fault tolerance approach is used in this scenario to overcome the drawbacks mentioned above. In this approach, information about fault occurrences is maintained for each resource. When a failure occurs, the fault occurrence information for that resource is updated. This information is then used when deciding which resources to allocate to a job.

Checkpointing is one of the most popular techniques for providing fault tolerance on unreliable systems. It records a snapshot of the complete system state so that the application can be restarted after a crash. The checkpoint can be stored on temporary or stable storage. However, the effectiveness of the mechanism depends strongly on the length of the checkpoint interval. Frequent checkpointing increases overhead, while lazy checkpointing can lead to the loss of a significant amount of computation.
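
The trade-off between checkpoint overhead and lost computation can be made concrete with a small calculation. The sketch below uses Young's first-order approximation, a well-known rule of thumb that is swapped in here for illustration and is not the interval-selection method of this essay. It assumes that the cost of one checkpoint is much smaller than the mean time between failures (MTBF), and the function and parameter names are hypothetical.

```python
import math


def young_interval(checkpoint_cost: float, mtbf: float) -> float:
    """Young's approximation of the checkpoint interval: sqrt(2 * C * MTBF).

    checkpoint_cost: time needed to write one checkpoint (seconds).
    mtbf: mean time between failures of the resource (seconds).
    """
    if checkpoint_cost <= 0 or mtbf <= 0:
        raise ValueError("checkpoint_cost and mtbf must be positive")
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def overhead_fraction(interval: float, checkpoint_cost: float, mtbf: float) -> float:
    """First-order estimate of time lost per unit of useful work.

    The first term is the cost of writing checkpoints (grows as the interval
    shrinks); the second is the expected rework after a failure, on average
    half an interval (grows as the interval lengthens).
    """
    return checkpoint_cost / interval + interval / (2.0 * mtbf)
```

With a checkpoint cost of 50 s and an MTBF of 10,000 s, `young_interval` gives an interval of 1000 s, where the estimated overhead is 0.10. At either 250 s or 4000 s the same estimate rises to about 0.21, which shows both failure modes: checkpointing too often and checkpointing too rarely.
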
Therefore, deciding on the checkpoint interval size and the checkpointing technique is a complicated task and should be based on knowledge of the application as well as the system. Checkpoint recovery depends on the MTTR of the system. It periodically saves the application state to stable storage, usually a hard drive. After a crash, the application is restarted from the last checkpoint rather than from the beginning.

There are three checkpointing strategies: coordinated checkpointing, uncoordinated checkpointing, and communication-induced checkpointing.
1. In coordinated checkpointing, processes synchronize their checkpoints to ensure that their saved states are consistent with each other, so that the overall combined saved state is also consistent.
2. In uncoordinated checkpointing, by contrast, processes schedule their checkpoints independently at different times and do not take messages into account.
3. Communication-induced checkpointing attempts to coordinate only selected critical checkpoints.

Benchmarking Existing Techniques:
A grid resource is a member of a grid and offers computing services to grid users. Grid users register with the grid's Grid Information Server (GIS), specifying QoS requirements such as the deadline for completing execution, the number of processors, the type of operating system, and so on. The components used in the architecture are described below.

Scheduler - The scheduler is an important entity of a grid. It receives jobs from grid users, selects feasible resources for them based on information acquired from the GIS, and then generates job-to-resource mappings. When the planning manager receives a grid job from a user, it obtains the details of the available grid resources from the GIS. It then passes the list of available resources to the entities in the MTTR planning strategy. The Matchmaker entity matches resources against job requirements.
The ResponseTime Estimator entity estimates the response time of the job on each matching resource based on transfer time, queue waiting time, and job service time. The resource selector selects the resource with the minimum response time. A job dispatcher then dispatches the jobs one by one to the checkpoint manager.

GIS - The GIS contains information about all available grid resources. It maintains resource details such as CPU speed, available memory, and load. All grid resources that join or leave the grid are monitored by the GIS. Whenever a scheduler has jobs to execute, it consults the GIS for information about available grid resources.

Checkpoint Manager - The checkpoint manager receives the scheduled job from the scheduler and sets checkpoints based on the failure rate of the resource on which the job is scheduled. It then submits the job to the resource. The checkpoint manager receives either a job completion message or a job failure message from the grid resource and responds accordingly. If the job fails during execution, it is rescheduled from the last checkpoint instead of being run from scratch. The checkpoint manager implements the algorithm that sets job checkpoints.

Checkpoint Server - At each checkpoint set by the checkpoint manager, the job status is reported to the checkpoint server. The checkpoint server saves the job status and returns it on demand, i.e. when a job or resource fails. For a particular job, the checkpoint server discards the previous checkpoint result when a new one is received.

Fault Index Manager - The Fault Index Manager maintains the fault index value of each resource, which indicates the failure rate of the resource. The fault index of a grid resource is incremented whenever the resource fails to complete the assigned job within the given deadline, and also whenever the resource itself fails. The fault index of a resource is.
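
The bookkeeping done by the Checkpoint Server and the Fault Index Manager, as described above, might look roughly like the following sketch. The class and method names are hypothetical, and only the rules stated in the text are modeled: the latest checkpoint replaces the previous one, and the fault index is incremented on a missed deadline or a resource failure. No other update rule for the fault index is assumed.

```python
from collections import defaultdict
from typing import Any, Optional


class CheckpointServer:
    """Keeps only the most recent checkpoint per job and returns it on demand."""

    def __init__(self) -> None:
        self._latest: dict[str, Any] = {}

    def record(self, job_id: str, state: Any) -> None:
        # A new checkpoint supersedes the previous one for the same job.
        self._latest[job_id] = state

    def latest(self, job_id: str) -> Optional[Any]:
        # None means no checkpoint exists, so the job would start from scratch.
        return self._latest.get(job_id)


class FaultIndexManager:
    """Tracks a per-resource fault index; higher values mean more failures."""

    def __init__(self) -> None:
        self._index: dict[str, int] = defaultdict(int)

    def report_resource_failure(self, resource: str) -> None:
        # The resource itself failed while running a job.
        self._index[resource] += 1

    def report_missed_deadline(self, resource: str) -> None:
        # The resource did not complete the assigned job within the deadline.
        self._index[resource] += 1

    def fault_index(self, resource: str) -> int:
        # Resources with no recorded faults have an index of 0.
        return self._index.get(resource, 0)
```

A scheduler could query `fault_index` when choosing where to place or reschedule a job, and call `latest` on the checkpoint server to obtain the state from which a failed job should resume.
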