Databricks is now available on both AWS and Azure, so it’s getting a lot of buzz! Let’s discuss 5 things you should know about Databricks before diving in.
1. Databricks is a managed Spark-based service for working with data in a cluster
Databricks is an enhanced version of Spark and is touted by the Databricks company as being faster, sometimes significantly faster, than open-source Spark.
At a high level, Databricks advertises the following improvements over open-source Spark:
- Higher throughput, specifically optimized for the cloud
- Data skipping, which uses statistics on data files to prune files more efficiently during query processing, aided by a technique called Z-ordering (see the sketch after this list)
- Transparent caching, which accelerates reads by automatically caching data to a node’s local storage
- More efficient decoding, which boosts CPU efficiency when decoding common formats, e.g. when importing data from common sources such as database tables and flat files
- Transactional (atomic) writes to cloud storage such as Azure Data Lake or Azure Blob Storage, for both appends and new writes
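To make the data skipping and Z-ordering point concrete, here is a minimal sketch, assuming a Databricks notebook (where `spark` is predefined) and a hypothetical Databricks Delta table named `events` with a `date` column; both names are made up for illustration.

```python
# Assumes a Databricks notebook, where `spark` is predefined, and a
# hypothetical Databricks Delta table `events` with a `date` column.

# Z-order the table so that file-level min/max statistics line up with the
# column you usually filter on, allowing more files to be skipped.
spark.sql("OPTIMIZE events ZORDER BY (date)")

# Queries that filter on `date` can now prune files via data skipping.
spark.sql("SELECT COUNT(*) FROM events WHERE date = '2019-01-01'").show()
```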
In addition, Databricks provides an interface for spinning up an Azure cluster and interacting with it. In just a few clicks, the Databricks interface lets you:
- Spin up an Azure cluster
- Create notebooks for ETL, analytics, graph processing, and machine learning
- Share notebooks with coworkers and comment on individual cells for collaboration
- Save notebooks as scheduled jobs
Databricks will even shut down your cluster after a specified amount of idle time.
2. Databricks supports multiple languages but you’ll always get the best performance with JVM-based languages
Databricks has a few nice features that make it ideal for parallelizing data science, unlike leading ETL tools. The Databricks notebook interface allows you to use “magic commands” to code in multiple languages in the same notebook. In addition to Spark SQL, the supported notebook languages are Scala, Python, R, and standard SQL (Spark itself also offers a high-level Java API). This functionality is possible because Spark has high-level APIs for each of these languages.
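As a rough sketch of how this looks in practice (the view name and data are made up), a Python cell can register a temporary view that `%sql` or `%scala` cells in the same notebook can then query, since all language cells share the same Spark session:

```python
# Cell 1 -- the notebook's default language is Python in this sketch
df = spark.range(0, 1000).withColumnRenamed("id", "order_id")
df.createOrReplaceTempView("orders")  # visible to the other language cells

# Cell 2 -- switch languages with a magic command (shown here as comments):
# %sql
# SELECT COUNT(*) AS n FROM orders

# Cell 3 -- and again in Scala:
# %scala
# val n = spark.table("orders").count()
```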
When a command is issued in Python, for example, a few things happen. Each of the worker nodes has a JVM process and a Python process alongside it. When the worker node needs to perform a Python operation on the data because the user has issued a Python command, the worker node takes the data out of JVM memory, puts it into Python memory, does the transformation, and then returns it to JVM memory. This means there is an efficiency cost to using a non-JVM language such as Python or R: Spark SQL and Java/Scala (built on the JVM) will consistently outperform Python and R in the Spark environment.
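A minimal sketch of that round trip (the data is made up): the built-in `upper` function below executes entirely inside the JVM, while the Python UDF forces every row to be serialized out to the Python worker and back.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, upper, udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()  # already defined in a Databricks notebook
df = spark.createDataFrame([("alice",), ("bob",)], ["name"])

# Built-in Spark SQL function: stays entirely in JVM memory
df.select(upper(col("name")).alias("name_upper")).show()

# Python UDF: each row is shipped from JVM memory to the Python process and
# back again -- this is the efficiency cost described above
to_upper = udf(lambda s: s.upper(), StringType())
df.select(to_upper(col("name")).alias("name_upper")).show()
```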
However, in situations in which performance is not a key concern, the ability to code in multiple languages within a single notebook is huge because it introduces a high degree of flexibility to working in Spark – you can use the strengths of each language from a functional perspective and work in the language you are most comfortable with. For example, in a sample notebook worked through by Datalere, we were able to perform data transformation in Spark SQL, standard SQL, and Scala, predict using Spark’s MLlib, evaluate the performance of the model in Python, and visualize the results in R.
3. Databricks has 3 In-Memory Data Object APIs
Spark has three types of built-in data object APIs: RDDs, Dataframes, and Datasets.
RDD stands for Resilient Distributed Dataset, and it’s the original data object of Spark. An RDD is a set of Java objects representing data. RDDs are a resilient, distributed collection of records spread over one or many partitions. RDDs have three main characteristics: they are typed, they are lazy, and they are based on the Scala API. The biggest disadvantage to working with RDDs is that they are slow with non-JVM languages such as Python or R.
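A minimal RDD sketch in PySpark (the numbers are arbitrary) shows the lazy behavior: the `map` transformation does nothing until the `collect` action forces execution.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Build an RDD from a local collection, spread across partitions
rdd = sc.parallelize([1, 2, 3, 4, 5])

# `map` is lazy -- no work happens yet
squared = rdd.map(lambda x: x * x)

# `collect` is an action: it triggers execution and returns results to the driver
print(squared.collect())  # [1, 4, 9, 16, 25]
```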
Spark Dataframes are a higher-level abstraction that allows you to use a query language to transform the data stored in a Spark Dataframe. The higher-level abstraction is a logical plan that represents the schema and the data. The logical plan is converted to a physical plan for execution, which means you let Spark figure out the most efficient way to do what you want behind the scenes. Spark Dataframes are still built on top of RDDs, but they are faster than RDDs because Spark optimizes them. Specifically, Spark Dataframes use custom memory management (the Tungsten project) and optimized execution plans (the Catalyst optimizer). The Tungsten project works to make sure your Spark jobs execute faster given CPU constraints, and the Catalyst optimizer optimizes the logical plan of the Spark Dataframe.
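As a small sketch of the query-language style (the data is made up), you describe what you want and let Catalyst work out how to execute it; `explain(True)` prints the logical and physical plans it generates.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("books", 12.0), ("books", 5.0), ("toys", 7.5)],
    ["category", "amount"],
)

# Describe *what* you want; Catalyst decides *how* to execute it
result = (df.filter(F.col("amount") > 6)
            .groupBy("category")
            .agg(F.sum("amount").alias("total")))

result.explain(True)  # parsed/analyzed/optimized logical plans + physical plan
result.show()
```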
Dataframes, however, do not have compile-time type safety, meaning that they are not strongly typed. This is a drawback in production code because type errors are only caught at runtime. Spark introduced the Dataset API to correct for this limitation in Dataframes. Spark Datasets combine the best of both worlds: the type safety of RDDs along with the optimizations of Dataframes. Since Python and R have no compile-time type safety, typed Datasets are not available in these languages.
It’s easy to switch between data abstractions within Databricks, but note that you should refrain from passing your data between RDDs, Dataframes, and Datasets unnecessarily. Data must be serialized and deserialized when transitioning back and forth, and serialization is an expensive operation.
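For example (continuing with made-up data), hopping from a Dataframe to an RDD and back is a one-liner in each direction, but every hop pays the serialization cost just described:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "letter"])

# Dataframe -> RDD: rows are deserialized out of Spark's optimized binary format
rdd = df.rdd

# RDD -> Dataframe: rows are serialized back in, and Catalyst takes over again
df2 = spark.createDataFrame(rdd, df.schema)
df2.show()

# Each hop serializes/deserializes the data, so avoid bouncing back and forth
# unless you genuinely need the other API.
```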
4. A Spark Dataframe is not the same as a Pandas/R Dataframe
Spark Dataframes are specifically designed to use distributed memory to perform operations across a cluster whereas Pandas/R Dataframes can only run on one computer. This means that you need to use a Spark Dataframe to realize the benefits of the cluster when coding in Python or R within Databricks.
It is important to note, however, that you can use Spark Dataframes and Pandas/R Dataframes together by easily converting between the two. For example, you can code your data transformations using the Spark Dataframe and then convert to a Pandas/R Dataframe to make use of the wealth of libraries in Python or R that specifically accept a Pandas/R Dataframe as input, such as data visualization libraries. These libraries will not run in parallel across the cluster because they operate on a single, local Dataframe.
Please note that converting a Spark Dataframe into a Pandas/R Dataframe is only an option if your data is small, because Databricks will attempt to load the entire dataset into the driver’s memory when converting from a Spark Dataframe to a Pandas/R Dataframe.
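A minimal sketch of that workflow (column names and data are made up): do the heavy lifting on the Spark Dataframe in parallel, then convert only the small aggregated result with `toPandas()` so it fits comfortably in the driver’s memory.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame(
    [("2019-01-01", 10.0), ("2019-01-01", 5.0), ("2019-01-02", 7.0)],
    ["day", "sales"],
)

# Heavy lifting runs in parallel on the Spark Dataframe
daily = sdf.groupBy("day").agg(F.sum("sales").alias("total_sales"))

# Only the small aggregated result is pulled into the driver as a Pandas
# Dataframe -- this is where single-machine plotting libraries take over
pdf = daily.toPandas()
print(pdf)

# Going the other way: a Pandas Dataframe can be distributed back out
sdf2 = spark.createDataFrame(pdf)
```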
5. Spark has its own machine learning library called MLlib
Spark has a built-in data science library called “MLlib” that allows developers to run select data science algorithms in parallel, across the cluster. Spark’s MLlib documentation lists all of the supported algorithms.
Not all data science algorithms can be parallelized, and some algorithms are modified slightly so that they can be optimized over a cluster. For example, the linear regression available in Spark’s MLlib uses Stochastic Gradient Descent (SGD) to calculate coefficients for the regression. Every data science or statistics method has an underlying formula that the algorithm is trying to optimize. Linear regression using SGD changes how that underlying formula is optimized so that the calculation can be distributed more efficiently across a cluster. Other methodologies, such as Naïve Bayes, depend on calculating frequencies, which naturally lends itself to being distributed over a cluster.
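As a minimal sketch of fitting a parallel model (the tiny training set is made up), the example below uses the DataFrame-based `pyspark.ml` API; note that its `LinearRegression` uses other solvers by default, while the SGD-based linear regression referred to above lives in the older RDD-based `pyspark.mllib` API.

```python
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.getOrCreate()

# Tiny, made-up training set: label is roughly 2 * the single feature
train = spark.createDataFrame(
    [(Vectors.dense([1.0]), 2.1),
     (Vectors.dense([2.0]), 3.9),
     (Vectors.dense([3.0]), 6.2)],
    ["features", "label"],
)

# fit() distributes the training work across the cluster's executors
lr = LinearRegression(maxIter=10)
model = lr.fit(train)

print(model.coefficients, model.intercept)
```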
If you have questions about Databricks or are looking for assistance, feel free to reach out to Datalere for a free consultation. We specialize in understanding how these services work under the hood so that we can put together the best product, service, and architectural recommendations for our clients.