  • Essay / Case study on the Big Data ecosystem at LinkedIn

Table of contents

  • The construction of a production workflow
  • The three main output mechanisms
  • Applications
      • Key-value: People You May Know, Collaborative Filtering, Skill Endorsements, Jobs that might interest you, Related searches
      • Streams: News feed updates, Email, Relationship strength
      • OLAP: Who viewed my profile?, Who viewed this job posting?
  • Conclusion

Workflows are complex and are built using a variety of tools from the Hadoop ecosystem: Hive, Pig and MapReduce are the three main processing interfaces, and Avro is used as the serialization format. Data stored on HDFS is processed by chained MapReduce tasks that form a workflow, a directed acyclic graph of dependencies. If today's data is incomplete, the system automatically falls back to earlier data.

To manage these workflows, LinkedIn uses Azkaban, an open source workflow scheduler. Azkaban supports a set of tasks that form a workflow. Each task runs as an individual process, and tasks can be chained together to create dependency graphs. Task configuration and dependencies are kept as files of simple key-value pairs (a sketch of such a job file is shown after the list of output mechanisms below). Azkaban can schedule these graphs while retaining logs and statistics from previous executions, and it provides monitoring and restart capabilities: if a task fails, alerts are triggered and, once the problem is resolved, only the failed subgraph needs to be restarted. Azkaban also provides resource locking, although this feature is often unused because most of the data is append-only.

Building a production workflow proceeds roughly as follows:

  1. A researcher begins experimenting with the data through a script, trying to convert it into the form they need. For a machine learning application, data cleaning and feature engineering take most of the time.
  2. Each feature then becomes an individual Azkaban job, followed by a job that joins the results of those jobs. The joined features can be fed into a model-training step, for which packaged tasks exist.
  3. When the researcher adds or modifies features, they configure Azkaban to execute only the modified tasks and their dependencies. As workflows are iterated on and mature, they naturally become more complex and Azkaban becomes more useful.

LinkedIn manages three Azkaban instances, one per Hadoop environment. Azkaban on the ETL grid manages the Hadoop load and is completely hidden from users of the Hadoop ecosystem. For the development and production environments, a researcher first deploys their workflow to the development Azkaban instance to test the output of their algorithms. Once tested, the workflow goes through a production review during which basic integration tests are performed and any necessary adjustments are made. After the review, the workflow is deployed to the production Azkaban instance. Datasets and tool suites are synchronized across environments to enable easy reproduction of results.

Output: the result of a workflow is transferred to other systems, either to be served online or as a derived dataset for later consumption. This deployment consists of a simple one-line command in the workflow. There are three main mechanisms available, depending on the needs of the application:

  • Key-value: the derived data is accessible as an associative array or collection.
  • Stream: the data is published as a change log of data tuples.
  • OLAP: the data is transformed into multidimensional cubes for online analytical queries.
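
As a concrete illustration of the Azkaban job files mentioned above (simple key-value pairs with a dependencies entry), here is a minimal sketch of what two chained job definitions might look like. The file names, jar, paths and commands are hypothetical; only the key=value layout and the type/command/dependencies keys follow Azkaban's classic job-file convention.

```
# feature-extraction.job -- hypothetical Hadoop step of a workflow
type=command
command=hadoop jar workflow.jar com.example.FeatureExtractionJob /data/input /derived/features

# push-to-voldemort.job -- hypothetical deployment step; Azkaban runs it
# only after feature-extraction has completed successfully
type=command
command=sh deploy_to_voldemort.sh /derived/features
dependencies=feature-extraction
```

In practice each definition lives in its own .job file, and Azkaban derives the dependency graph from the dependencies entries.
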
Key-value: this is the transport mechanism most frequently used for Hadoop output at LinkedIn and is made possible by Voldemort. Voldemort is a distributed key-value store similar to Amazon's Dynamo, with a simple get(key) and put(key, value) interface (a short client sketch is shown after the stream mechanism below). Tuples are grouped into logical stores that correspond to database tables. Each key is replicated to multiple nodes according to the replication factor of its corresponding store, and each node is further split into logical partitions, with every key in a store mapped by consistent hashing to multiple partitions across nodes. Voldemort was designed for read-write workloads, with storage engines backed by MySQL or Berkeley DB, but its pluggable architecture made it possible to introduce a read-only storage engine whose data is built on HDFS.

This read-only engine consists of chunk sets of index and data files. The index file is a compact structure containing the hash of each key followed by the offset of the corresponding value in the data file, with the entire file sorted by hashed key. The chunk sets are built by a MapReduce job: identity mappers emit the hash of each key "replication factor" number of times, a custom partitioner routes each key to the reducer responsible for its chunk set, and the reducers receive the data sorted by key hash and append it to the corresponding chunk set. The job supports producing multiple chunk sets per partition, which allows the number of reducers to be increased for greater parallelism.

On the Voldemort side, a configurable number of versioned directories is maintained per store, with only one serving live queries while the others act as backups. After new chunk sets are generated on Hadoop, the Voldemort nodes pull their corresponding chunk sets into new directories in parallel. By adopting a pull model instead of a push, Voldemort can throttle the data being fetched, and pre-generated checksums are used to verify the integrity of the fetched data. After a successful fetch, the files of the currently active chunk sets are closed and the indexes of the new chunk sets are mapped into memory, relying on the operating system's page cache. This "swap" operation runs in parallel on all machines and takes less than a millisecond. The last step, after the swap, is an asynchronous cleanup of old versions in order to maintain the configured number of backups. Keeping multiple backup versions of the data per store allows a rapid rollback to a good state in case of problems with the data or the underlying algorithm. The complete operation of generating, fetching and swapping chunk sets is wrapped in a single Pig StoreFunc line that the user adds to the last task of their workflow.

Key-value access is the most common form of output from the Hadoop system at LinkedIn. We have been running these clusters successfully for three years, with over 200 stores in production, and the 99th percentile latency on most stores is consistently below 10 ms.

Stream: the second mechanism for derived data generated in Hadoop is a stream published back into Kafka. This is useful for applications that need a change log of the underlying data. The ability to publish data to Kafka is implemented as a Hadoop OutputFormat: each MapReduce slot acts as a Kafka producer that emits messages, throttling as necessary to avoid overwhelming the Kafka brokers, and the driver checks the schema to ensure backwards compatibility.
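
The get(key)/put(key, value) interface described above is small enough to show directly. The sketch below uses Voldemort's standard Java client as a minimal illustration; the bootstrap URL, store name and keys are placeholders, and it assumes a running cluster with a store named "test".

```java
import voldemort.client.ClientConfig;
import voldemort.client.SocketStoreClientFactory;
import voldemort.client.StoreClient;
import voldemort.client.StoreClientFactory;
import voldemort.versioning.Versioned;

public class VoldemortExample {
    public static void main(String[] args) {
        // Bootstrap against one node of the (hypothetical) cluster.
        StoreClientFactory factory = new SocketStoreClientFactory(
                new ClientConfig().setBootstrapUrls("tcp://localhost:6666"));

        // A store is the logical equivalent of a database table.
        StoreClient<String, String> client = factory.getStoreClient("test");

        // put(key, value): write a tuple.
        client.put("member:42", "derived value for member 42");

        // get(key): read it back; values carry a vector-clock version.
        Versioned<String> value = client.get("member:42");
        System.out.println(value.getValue());

        factory.close();
    }
}
```

For the read-only stores built on Hadoop, the same get() path is answered from the memory-mapped index and data files rather than from a MySQL or Berkeley DB engine.
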
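The stream mechanism can be sketched in the same spirit. The Hadoop OutputFormat itself is not reproduced here, so the snippet below simply shows, using the standard Kafka producer client (the modern Java client, not necessarily the one LinkedIn embedded in its OutputFormat), what it means for a task to act as a producer emitting change-log messages; the broker address and topic name are placeholders.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ChangeLogPublisher {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Placeholder broker; in the real pipeline each MapReduce slot would be
        // configured with the production brokers and would throttle its sends.
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            // One record per changed tuple of the derived dataset.
            producer.send(new ProducerRecord<>("derived-data-changelog",
                    "member:42", "updated derived value"));
        }
    }
}
```
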
OLAP: the last mechanism offered is multidimensional data (OLAP) for online analytical queries. Such queries allow the end user to examine the data by slicing it along multiple dimensions. OLAP solutions couple the two required subsystems: cube generation and dynamic query serving. At LinkedIn, we have developed a system called Avatara that addresses this by moving cube generation to a high-throughput offline system and query serving to a low-latency system. The resulting data cubes are served within a website's request/response loop. The use cases at LinkedIn have a natural shard key, allowing the final large cube to be divided into many "small cubes". To generate the cubes, Avatara uses Hadoop as an offline engine to perform user-defined joins and pre-aggregation steps on the data. To ease development, we provide a simple configuration-file-based API that automatically generates the corresponding Azkaban task pipelines. The underlying format of these small cubes is multidimensional arrays in which each tuple is a combination of dimension and measure pairs. The output cubes are then bulk-loaded into Voldemort as a read-only store for fast online serving. Thanks to this architecture, Avatara is not tied to Voldemort and can support other online serving systems. Since Avatara can materialize on both the offline and the online engine, the developer can choose between faster response times and more flexible query combinations.

Applications depend on the data pipeline either explicitly, where the derived data is itself the product, or implicitly, where derived data is infused into existing features. LinkedIn's applications can be grouped into three types according to the output mechanism they use: key-value, stream and OLAP.

Key-value: key-value access through Voldemort is the output mechanism most commonly used by Hadoop at LinkedIn. Some of the applications built on it are described below.

People You May Know: this feature finds other members a user may know on the social network. It is a link prediction problem in which node and edge features of the social graph are used to predict whether an invitation will occur between two unconnected members, for instance via a member known to both. Hadoop runs the feature-extraction tasks, such as computing common connections, company and school overlap, geographic distance, similar ages and many more (a small sketch of the common-connections computation is given at the end of this section). The system consists of around 100 Azkaban tasks and several MapReduce jobs that rebuild these recommendations every day.

Collaborative Filtering: this technique surfaces relationships between pairs of items of the form "people who did this also did that". The computation was initially performed only for member-to-member co-occurrence, but it quickly expanded to other entity types, including cross-type recommendations. The co-occurrences are derived from the activity-tracking events emitted by LinkedIn's front-end framework.

Skill Endorsements: endorsements are a lightweight mechanism through which one member can affirm the skills of another member in their network.
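
To make the "common connections" feature mentioned under People You May Know concrete, here is a minimal in-memory sketch of the triangle-closing idea: for every pair of members who are not yet connected, count the members connected to both. It is illustrative Java with a made-up graph, not LinkedIn's actual MapReduce implementation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class CommonConnections {
    public static void main(String[] args) {
        // Hypothetical adjacency list: member id -> set of connected member ids.
        Map<Integer, Set<Integer>> graph = new HashMap<>();
        addEdge(graph, 1, 2);
        addEdge(graph, 1, 3);
        addEdge(graph, 2, 3);
        addEdge(graph, 2, 4);
        addEdge(graph, 3, 4);

        // For every unconnected pair (a, b), count members connected to both.
        Map<String, Integer> commonCounts = new HashMap<>();
        for (Map.Entry<Integer, Set<Integer>> entry : graph.entrySet()) {
            int middle = entry.getKey();
            List<Integer> neighbors = new ArrayList<>(entry.getValue());
            // Each pair of neighbors of "middle" shares "middle" as a common connection.
            for (int i = 0; i < neighbors.size(); i++) {
                for (int j = i + 1; j < neighbors.size(); j++) {
                    int a = neighbors.get(i), b = neighbors.get(j);
                    if (!graph.get(a).contains(b)) {        // only unconnected pairs
                        String pair = Math.min(a, b) + "-" + Math.max(a, b);
                        commonCounts.merge(pair, 1, Integer::sum);
                    }
                }
            }
        }
        // Pairs with many common connections are candidate recommendations.
        commonCounts.forEach((pair, count) ->
                System.out.println(pair + " -> " + count + " common connection(s)"));
    }

    private static void addEdge(Map<Integer, Set<Integer>> g, int a, int b) {
        g.computeIfAbsent(a, k -> new HashSet<>()).add(b);
        g.computeIfAbsent(b, k -> new HashSet<>()).add(a);
    }
}
```

In the pipeline described above, the equivalent computation would run as chained MapReduce jobs over the full connection graph, and its output would be combined with the other extracted features before model scoring.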