Introducing Apache Spark 2.4

Posted in Apache Spark, Apache Spark 2.4, Databricks Runtime 5.0, Ecosystem, Engineering Blog, Machine Learning, Pandas UDF, Platform, SparkSQL, Streaming, Structured Streaming, Unified Analytics Platform

We are excited to announce the availability of Apache Spark 2.4 on Databricks as part of Databricks Runtime 5.0. We want to thank the Apache Spark community for all their valuable contributions to the Spark 2.4 release. Continuing with the objectives to make Spark faster, easier, and smarter, Spark 2.4 extends its scheduler to […]

SQL Pivot: Converting Rows to Columns

Posted in Apache Spark, DataFrames, Engineering Blog, Spark SQL, sql, Unified Analytics Platform

Pivot was first introduced in Apache Spark 1.6 as a new DataFrame feature that allows users to rotate a table-valued expression by turning the unique values from one column into individual columns. The upcoming Apache Spark 2.4 release extends this powerful functionality of pivoting data to our SQL users as […]

The three biggest security challenges facing AI and data initiatives

Posted in Company Blog, Databricks Platform, Enterprise Security, platform security, Security, Unified Analytics Platform

In today’s business climate, the ability to anticipate and meet customer needs is central to success. Forward-looking business leaders want to unleash the power of Artificial Intelligence (AI) to drive innovation, but this requires bringing together diverse teams and large volumes of data. With attackers getting more sophisticated, securing these complex data workflows […]

Democratizing Cloud Infrastructure with Terraform and Jenkins

Posted in Ecosystem, Engineering Blog, Infrastructure, Monitoring, Platform, Provisioning, Unified Analytics Platform

This blog post is part of our series of internal engineering blogs on the Databricks platform, infrastructure management, integration, tooling, monitoring, and provisioning. This summer at Databricks I designed and implemented a service for coordinating and deploying cloud provider infrastructure resources that significantly improved the velocity of operations on our self-managed cloud platform. The service […]

How to Use MLflow To Reproduce Results and Retrain Saved Keras ML Models

Posted in Apache Spark, Engineering Blog, Keras, Machine Learning, MLflow, Model Management, Platform, TensorFlow, Unified Analytics Platform

In part 2 of our series of MLflow blogs, we demonstrated how to use MLflow to track experiment results for a Keras network model using binary classification. We classified reviews from an IMDB dataset as positive or negative, and we created one baseline model and two experiments. For each model, we tracked its respective training […]

MLflow On-Demand Webinar and FAQ Now Available!

Posted in Data Science, Deep Learning, Ecosystem, Engineering Blog, Machine Learning, MLflow, Model Management, Platform, Product, Unified Analytics Platform

On August 30th, our team hosted a live webinar—Introducing MLflow: Infrastructure for a complete Machine Learning lifecycle—with Matei Zaharia, Co-Founder and Chief Technologist at Databricks. In this webinar, we walked you through MLflow, a new open source project from Databricks that aims to design an open ML platform where organizations can use any ML library […]

Building the Fastest DNASeq Pipeline at Scale

Posted in Benchmark, DNA, DNASeq, Engineering Blog, GATK, Genomics, Platform, Unified Analytics Platform

In June, we announced the Unified Analytics Platform for Genomics with a simple goal: accelerate discovery with a collaborative platform for interactive genomic data processing, analytics and AI at massive scale. In this post, we’ll go into more detail about one component of the platform: a scalable DNASeq pipeline that is concordant with GATK4 at […]

A Guide to Apache Spark Use Cases, Streaming, and Research Talks at Spark + AI Summit Europe

Posted in Apache Spark, Company Blog, Data Science, Events, Genomics, Machine Learning, PySpark, Spark + AI Summit Europe, Structured Streaming, Unified Analytics Platform

For much of Apache Spark’s history, its capacity to process data at scale and its capability to unify disparate workloads have led Spark developers to tackle new use cases. Through innovation and extension of its ecosystem, developers combine data and AI to develop new applications. So it befits developers to come to this summit not just […]

Introducing Cluster-scoped init scripts

Posted in Cluster Management, DBFS, Engineering Blog, Platform, Product, Summer Internship, Unified Analytics Platform

This summer, I worked at Databricks as a software engineering intern on the Clusters team. As part of my internship project, I designed and implemented Cluster-scoped init scripts, improving scalability and ease of use. In this blog, I will discuss various benefits of Cluster-scoped init scripts, followed by my internship experience at Databricks, and […]

How to Use MLflow to Experiment a Keras Network Model: Binary Classification for Movie Reviews

Posted in Apache Spark, Data Science, Engineering Blog, Machine Learning, MLflow, Model Management, Platform, python, Unified Analytics Platform

In the last blog post, we demonstrated the ease with which you can get started with MLflow, an open-source platform to manage the machine learning lifecycle. In particular, we illustrated a simple Keras/TensorFlow model using MLflow and PyCharm. This time we explore a binary classification Keras network model. Using MLflow’s Tracking APIs, we will track metrics—accuracy […]