BDAS Tutorial

[embed]https://twitter.com/bigdata/status/306810999380516864[/embed]

This tutorial, the first of a two-part series, introduces BDAS, the Berkeley Data Analytics Stack. BDAS is an open source, next-generation data analytics stack under development at the UC Berkeley AMPLab; its current components include Spark, Shark, and Mesos. We will start by covering Spark, a high-speed cluster computing system compatible with Hadoop that can outperform it by up to 100x thanks to its ability to perform computations in memory. Spark provides concise, high-level APIs in both Scala and Java, and is in use at Foursquare, Conviva, Klout, Quantifind, and other companies. We will provide an overview of the Spark architecture, typical data analytics workflows (e.g., loading data from HDFS into memory and interactively querying it), and how users are applying Spark. We will also introduce Shark, a port of Apache Hive onto Spark that is compatible with existing Hive warehouses and queries. Shark can answer HiveQL queries up to 100x faster than Hive without modification to the data or queries, and is also open source as part of BDAS.

Tutorial Part 1 (with PowerPoint slides) and Part 2.

© 2008-2024 C. Ross Jam. Built using Pelican. Theme based upon Giulio Fidente’s original svbhack, and slightly modified by crossjam.