Essay / Case Study: Google Search Engine

Table of Contents
- Introduction
- Design Architecture
- Scalability, Availability, and Security
- Google's Distributed File System
- Communication Protocols

Introduction

Google is recognized as the largest search engine company in the world, serving users everywhere. It operates more than a million servers in data centers around the globe, integrates global information, processes hundreds of millions of search queries every day, and automatically "crawls" and scores each web page. A user only needs to enter keywords on the search homepage, and in under a second the engine finds the highest-scoring pages among those it has visited and displays them, so that everyone can access the information they want.

Google has managed to win a dominant share of the Internet search market thanks to the effectiveness of the ranking algorithms at the core of its search engine. The underlying search system handles over 88 billion searches per month. Meanwhile, the main search engine has never experienced an outage, and users can expect query results in around 0.2 seconds.

Design Architecture

Google's search engine is implemented in C and C++, which is efficient and runs on either Solaris or Linux. This section gives a high-level overview of how the whole system is designed. At Google, web crawling is done by several distributed crawlers. A URL server sends lists of URLs to the crawlers, and the crawlers send every web page they fetch to the store server, where the repository compresses the pages and stores them in a database. Every web page is associated with an identification number called a docID, which is assigned whenever a new URL is parsed out of a crawled page. The indexer reads the repository, decompresses the documents, and parses them. Each document is converted into a set of word occurrences called hits. A hit records a word, its position in the document, an approximation of its font size, and its capitalization. The indexer distributes these hits into a set of "barrels", creating a partially sorted forward index. The indexer also performs another important function: it parses out all the links in every web page and stores essential information about them in an anchors file, which contains enough information to determine where each link points from and to, together with the text of the link. The URL resolver reads the anchors file, converts relative URLs into absolute URLs and then into docIDs, and places the anchor text in the forward index, associated with the docID the anchor points to. It also generates a database of links, each a pair of docIDs; this links database is used to compute the PageRanks of all documents. The sorter takes the barrels, which are sorted by docID, and resorts them by wordID to generate the inverted index; this operation requires some temporary space. The sorter also produces a list of wordIDs and offsets into the inverted index. A program called DumpLexicon takes this list together with the lexicon produced by the indexer and generates a new lexicon to be used by the searcher. The searcher is run by a web server and uses the lexicon built by DumpLexicon, together with the inverted index and the PageRanks, to answer queries.
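To make the indexing data flow concrete, here is a minimal sketch of the forward-to-inverted inversion the sorter performs. All the types (Hit, ForwardIndex, and so on) are illustrative assumptions: the real barrels store hits in a compact bit-packed encoding, and the inversion runs out of core across many machines rather than in memory.

```cpp
// Minimal sketch of the sorter's forward-to-inverted index inversion.
#include <cstdint>
#include <map>
#include <vector>

using DocID  = std::uint32_t;
using WordID = std::uint32_t;

// A hit records one occurrence of a word in a document: its position,
// plus the font-size and capitalization information used for ranking.
struct Hit {
    std::uint16_t position;
    std::uint8_t  fontSize;
    bool          capitalized;
};

// Forward index ("barrel"): keyed by docID, each document maps to the
// words it contains and their hit lists.
using ForwardIndex  = std::map<DocID, std::map<WordID, std::vector<Hit>>>;
// Inverted index: keyed by wordID, each word maps to the documents
// containing it and the hits within each document.
using InvertedIndex = std::map<WordID, std::map<DocID, std::vector<Hit>>>;

InvertedIndex Invert(const ForwardIndex& forward) {
    InvertedIndex inverted;
    for (const auto& [doc, words] : forward)
        for (const auto& [word, hits] : words)
            inverted[word][doc] = hits;  // regroup the same hits by wordID
    return inverted;
}
```

Note that std::map keeps its entries ordered by key, which mirrors the described layouts: barrels sorted by docID on the way in, postings sorted by wordID on the way out.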
Scalability, Availability, and Security

From a distributed-systems perspective, Google's search engine is a fascinating case study, capable of handling extremely demanding workloads, particularly in terms of scalability, reliability, availability, and security.

Scalability refers to a distributed system operating effectively and efficiently at widely different scales, from a small business intranet to the Internet: as the number of resources and users increases, the system still maintains its efficiency. There are three challenges to achieving scalability.

(1) Controlling the cost of physical resources. When demand for resources increases, it should be possible to extend the system at reasonable cost to meet that demand. For example, if a single search server cannot handle all access requests, the number of servers must be increased to avoid a performance bottleneck. In this regard, Google views scalability along three dimensions: being able to process more data (x), being able to handle more queries (y), and seeking better results (z). Judging by the figures in the introduction, Google's search engine undoubtedly performs very well on all three. However, the functions involved, including indexing, ranking, and search, require highly distributed solutions in order to scale.

(2) Controlling performance loss. When a distributed system manages large numbers of users or resources, it produces many data sets, and managing them places great demands on the system's performance. Here, hierarchical algorithms clearly scale better than linear ones, although some performance loss can never be avoided entirely. Since Google's search engine is highly interactive, it must keep latency as low as possible: the better the performance, the more reliably a web search completes within 0.2 s, and only then can Google profit from advertising sales. Its annual advertising revenue of 32 billion US dollars suggests that Google is better than other search engines at managing the performance of the underlying resources involved, including network, storage, and computing resources.

(3) Preventing software resources from running out. The Internet uses 32-bit network addresses, and if hosts keep multiplying, those addresses will be exhausted. Google has no good solution for this at present, because switching to 128-bit addresses would undoubtedly require many software components to change.

The availability of a distributed system depends mainly on the extent to which new resources can be shared: shared services can be added and then used by many clients. Since Google's search engine must satisfy the highest demands in the shortest possible time when crawling, indexing, and sorting the web, availability requirements are equally high. To meet them, Google developed a physical architecture whose middle layer defines a common distributed-system infrastructure, one that both lets new applications and services reuse the underlying system services and safeguards the integrity of Google's enormous code base.

Distributed systems also hold many information resources of high value to users, so protecting those resources is very important. Information-resource security has three parts: confidentiality (preventing disclosure to unauthorized persons), integrity (preventing modification or corruption), and availability (preventing interference with the means of accessing the resources). Investigating the security of Google's search engine, we found that Google has not had much success here and has even publicly admitted to leaking user information for profit; as long as users are required to use Google's software, their information security cannot be guaranteed.

Google's Distributed File System

The Google File System (GFS) was implemented to meet Google's rapidly growing needs for processing and managing big data. Alongside that demand, GFS faces the challenge of distributed management and an increased risk of hardware failure. Keeping data safe while scaling up to thousands of machines that manage many terabytes of data can therefore be considered the main challenge facing GFS. Google accordingly made the important decision not to use any existing distributed file system and instead developed a new one. The biggest difference from other file systems is that GFS is optimized for large files (gigabytes to several terabytes), with the result that most files are treated as effectively immutable, written once and read many times.

A GFS cluster consists of a single master and multiple chunkservers, and it is accessed by multiple clients. These are ordinary commodity Linux machines running user-level server processes; as long as a machine's resources allow, a chunkserver and a client can run on it at the same time. Stored files are divided into fixed-size chunks, each identified by a globally unique 64-bit chunk handle. Chunkservers store chunks as Linux files on their local disks and read or write chunk data identified by a chunk handle and a byte range. To improve reliability, each chunk is replicated on at least three chunkservers. The master maintains the metadata for the entire file system, and at regular intervals it asks each chunkserver to report its state via HeartBeat messages. Data-carrying communication, which does not need to pass through the Linux vnode layer, goes directly to the chunkservers. Neither the client nor the chunkserver caches file data. Doing without a data cache not only avoids the problem of working sets too large to cache but also keeps the clients, and the system as a whole, consistent. Linux's own buffer cache already keeps frequently accessed data in memory, so chunkservers need no separate file-data cache, which contributes significantly to GFS's performance and speed.
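As a rough illustration of this division of labor, here is a sketch, under simplifying assumptions, of the in-memory metadata a GFS-style master might keep and of the HeartBeat bookkeeping described above. All the names (Master, ChunkInfo, Locate, and so on) are hypothetical; the real master also persists an operation log, manages chunk leases, and actively rebalances replicas.

```cpp
// Hypothetical sketch of a GFS-style master's in-memory metadata.
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

using ChunkHandle = std::uint64_t;  // globally unique 64-bit chunk handle

constexpr int kMinReplicas = 3;     // each chunk lives on at least three chunkservers

struct ChunkInfo {
    std::vector<std::string> replicas;  // addresses of chunkservers holding this chunk
};

struct FileInfo {
    std::vector<ChunkHandle> chunks;    // a file is an ordered list of fixed-size chunks
};

struct Master {
    std::map<std::string, FileInfo>  namespace_;  // path -> chunk handles
    std::map<ChunkHandle, ChunkInfo> chunks_;     // handle -> replica locations

    // A client asks the master only for metadata: which chunkservers hold
    // chunk number `index` of `path`. It then exchanges file data directly
    // with those chunkservers; data never flows through the master.
    const ChunkInfo* Locate(const std::string& path, std::size_t index) const {
        auto it = namespace_.find(path);
        if (it == namespace_.end() || index >= it->second.chunks.size()) return nullptr;
        return &chunks_.at(it->second.chunks[index]);
    }

    // HeartBeat: a chunkserver periodically reports which chunks it holds,
    // letting the master refresh its view of replica locations.
    void HeartBeat(const std::string& server, const std::vector<ChunkHandle>& held) {
        for (ChunkHandle h : held) {
            auto& reps = chunks_[h].replicas;
            if (std::find(reps.begin(), reps.end(), server) == reps.end())
                reps.push_back(server);
        }
    }

    // Chunks reported by fewer than kMinReplicas servers are candidates
    // for re-replication.
    bool UnderReplicated(ChunkHandle h) const {
        auto it = chunks_.find(h);
        return it == chunks_.end() ||
               static_cast<int>(it->second.replicas.size()) < kMinReplicas;
    }
};
```

Keeping only this small metadata on the master, while clients stream chunk data straight from chunkservers, is what lets a single master coordinate thousands of machines without becoming a data-path bottleneck.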
Communication Protocols

The configuration and choice of communication protocols are very important to the overall design of a system. Google adopts a simple, minimal, and efficient remote invocation protocol. Remote invocation requires a serialization component to marshal procedure-call data, so Google developed protocol buffers, a simplified, high-performance serialization format. Google also uses a separate protocol for publish-subscribe communication.
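As a small, concrete taste of why protocol buffers are compact, here is a sketch of base-128 varint coding, the primitive underlying the protocol buffers wire format: each byte carries 7 payload bits, and the high bit signals that more bytes follow, so small values serialize to a single byte. The surrounding message layout (field tags, wire types) is omitted, and the helper names are our own.

```cpp
// Sketch of base-128 varint encoding as used by the protobuf wire format.
#include <cstdint>
#include <vector>

std::vector<std::uint8_t> EncodeVarint(std::uint64_t value) {
    std::vector<std::uint8_t> out;
    while (value >= 0x80) {
        // Emit 7 low bits with the continuation (high) bit set.
        out.push_back(static_cast<std::uint8_t>(value) | 0x80);
        value >>= 7;
    }
    out.push_back(static_cast<std::uint8_t>(value));  // final byte, high bit clear
    return out;
}

std::uint64_t DecodeVarint(const std::vector<std::uint8_t>& bytes) {
    std::uint64_t value = 0;
    int shift = 0;
    for (std::uint8_t b : bytes) {
        value |= static_cast<std::uint64_t>(b & 0x7F) << shift;
        shift += 7;
        if (!(b & 0x80)) break;  // high bit clear: last byte of this value
    }
    return value;
}
```

For example, 300 encodes to the two bytes 0xAC 0x02, while values below 128 need only one byte; since most integers in practice are small, messages stay far more compact than with fixed-width fields.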