Simplifying Genomics Pipelines at Scale with Databricks Delta

Posted in Apache Spark, data pipeline, Engineering Blog, Genomics, HLS, Machine Learning, Streaming

Try this notebook in Databricks. This is the first post in our “Genomics Analysis at Scale” series, in which we will demonstrate how Databricks UAP4Genomics enables customers to analyze population-scale genomic data. Starting from the output of our genomics pipeline, this series will provide a tutorial on using Databricks to run sample […]
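As a taste of what the series covers, here is a minimal PySpark sketch of reading pipeline output stored as a Databricks Delta table, both as a batch and as a stream. The mount path is a hypothetical placeholder, and `spark` is the SparkSession that Databricks notebooks provide.

```python
# Batch read of genomics pipeline output stored in Delta format.
# The path below is a hypothetical placeholder.
variants = (spark.read
    .format("delta")
    .load("/mnt/genomics/pipeline-output/variants"))

# The same Delta table can be consumed as a stream, so downstream
# analyses pick up new samples as the pipeline writes them.
variant_stream = (spark.readStream
    .format("delta")
    .load("/mnt/genomics/pipeline-output/variants"))
```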

Speedy Scala Builds with Bazel at Databricks

Posted in Apache Spark, Bazel, Company Blog, Databricks, Engineering Blog, SBT, Scala

Databricks migrated from the standard Scala Build Tool (SBT) to Bazel to build, test, and deploy our Scala code. Through improvements in our build infrastructure, Scala compilation workflows that previously took minutes to tens of minutes now complete in seconds. This post will walk you through the improvements we made to achieve […]

A Guide to Data Engineering Talks at Spark + AI Summit 2019

Posted in Announcements, Apache Spark, Company Blog, Events, Spark, Spark + AI Summit

Selected highlights from the new track. Big data practitioners grapple with data quality issues and data pipeline complexities; they are the bane of their existence. Whether you are charged with advanced analytics, developing new machine learning models, providing operational reporting, or managing the data infrastructure, the concern with data quality is a common theme. Data engineers, in […]

A Guide to Developer, Deep Dive, and Continuous Streaming Applications Talks at Spark + AI Summit

Posted in Apache Spark, Company Blog, Databricks, Databricks Delta, Education, Events, Spark + AI Summit, Spark SQL, Spark Training, Structured Streaming

In January 2013, when Stephen O’Grady, an analyst at RedMonk, published “The New Kingmakers: How Developers Conquered the World,” the book’s central argument (then and still now) resonated universally with an emerging open-source community. He convincingly charts developers’ movement “out of the shadows and into the light as new influencers on society’s [technical landscape].” Using […]

How to Work with Avro, Kafka, and Schema Registry in Databricks

Posted in Apache Avro, Apache Kafka, Apache Spark, Company Blog, DBR 5.2, Ecosystem, Engineering Blog, Product, Streaming, Structured Streaming

In the previous blog post, we introduced the new built-in Apache Avro data source in Apache Spark and explained how you can use it to build streaming data pipelines with the from_avro and to_avro functions. Apache Kafka and Apache Avro are commonly used together to build scalable, near-real-time data pipelines. In this blog post, […]
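As a rough sketch of that pattern, the snippet below decodes Avro-encoded Kafka records with from_avro and re-encodes them with to_avro. The original post uses the Scala API introduced in Spark 2.4; the PySpark equivalents shown here arrived in later Spark releases. The broker address, topic name, and schema are placeholders.

```python
from pyspark.sql.avro.functions import from_avro, to_avro
from pyspark.sql.functions import col

# Placeholder Avro schema in JSON format; substitute your own.
json_format_schema = """
{"type": "record", "name": "Event",
 "fields": [{"name": "id", "type": "long"},
            {"name": "body", "type": "string"}]}
"""

# Read Avro-encoded records from a Kafka topic.
events = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "events")
    .load()
    .select(from_avro(col("value"), json_format_schema).alias("event")))

# Re-encode the struct as Avro before writing back to Kafka,
# which expects the payload in a `value` column.
output = events.select(to_avro(col("event")).alias("value"))
```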

Databricks Runtime 5.2 ML Features Multi-GPU Workflow, Pregel API, and Performant GraphFrames

Posted in Apache Spark, Databricks Runtime 5.2 ML, Deep Learning, Engineering Blog, GraphFrames, HorovodRunner, Machine Learning, Platform, PyTorch, TensorFlow

We are excited to announce the release of Databricks Runtime 5.2 for Machine Learning. This release includes several new features and performance improvements to help developers easily use machine learning on the Databricks Unified Analytics Platform. Continuing our efforts to make it easy for developers to build deep learning applications, this release includes the following features […]
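For the multi-GPU workflow specifically, HorovodRunner distributes a Python training function across worker processes. A minimal sketch, with the training body elided and the process count chosen arbitrarily:

```python
from sparkdl import HorovodRunner

def train(learning_rate=0.1):
    # Runs on each Horovod worker; keep heavy imports inside the
    # function so they execute on the workers, not the driver.
    import horovod.torch as hvd
    hvd.init()
    print("worker %d of %d, lr=%s" % (hvd.rank(), hvd.size(), learning_rate))
    # ... build a PyTorch model, wrap its optimizer in
    # hvd.DistributedOptimizer, and run the training loop here ...

# np=2 requests two worker processes; adjust to your cluster.
hr = HorovodRunner(np=2)
hr.run(train, learning_rate=0.1)
```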

5 Reasons to Become an Apache Spark Expert

Posted in Apache Spark, Company Blog, Education, Engineering Blog, Spark, Spark + AI Summit, training

Apache Spark™ has fast become the most popular unified analytics engine for big data and machine learning. It was originally developed at UC Berkeley in 2009 by the team that later founded Databricks. Since its release, Apache Spark has seen rapid adoption, and today cutting-edge companies such as Apple, Netflix, Facebook, and Uber have deployed Spark […]

Apparate: Managing Libraries in Databricks with CI/CD

Posted in Apache Spark, apparate, CI/CD, continuous delivery, continuous integration, Continuous Processing, Customers, Education, Partners, Product

This is a guest blog from Hanna Torrence, Data Scientist at ShopRunner. Introduction: As leveraging data becomes a more vital component of organizations’ tech stacks, it becomes increasingly important for data teams to adopt software engineering best practices. The Databricks platform provides excellent tools for exploratory Apache Spark workflows in notebooks, as well as […]

Kicking Off 2019 with an MLflow User Survey

Posted in Apache Spark, Ecosystem, Engineering Blog, Machine Learning, MLflow

It’s been six months since we launched MLflow, an open source platform to manage the machine learning (ML) lifecycle, and the project has been moving quickly since then. MLflow fills a role that hasn’t been served well in the open source community so far: managing the development lifecycle for ML, including tracking experiments and metrics, […]
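For readers new to the project, experiment tracking is the piece most people start with. A minimal sketch, with arbitrary parameter and metric names:

```python
import mlflow

# Track one training run: log a hyperparameter and a metric series.
with mlflow.start_run():
    mlflow.log_param("alpha", 0.5)
    for epoch in range(3):
        mlflow.log_metric("loss", 1.0 / (epoch + 1))
```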

Introducing Databricks Runtime 5.1 for Machine Learning

Posted in Announcements, Apache Spark, Company Blog, Databricks Runtime 5.1 ML, Deep Learning, Engineering Blog, Machine Learning, PyTorch, TensorFlow

Last week, we released Databricks Runtime 5.1 Beta for Machine Learning. As part of our commitment to providing developers with the latest deep learning frameworks, this release includes the best of these libraries. In particular, the PyTorch addition makes it simple for developers to import the appropriate Python torch modules and start coding, without […]
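Because torch ships preinstalled on the ML runtime, a notebook can import it and train immediately. A minimal sketch with a toy model and random data, to show there is no setup step:

```python
import torch
import torch.nn as nn

# Toy regression model and data; shapes are arbitrary examples.
model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

x, y = torch.randn(64, 10), torch.randn(64, 1)
for _ in range(5):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
```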