Kicking Off 2019 with an MLflow User Survey

Posted in Apache Spark, Ecosystem, Engineering Blog, Machine Learning, MLflow

It’s been six months since we launched MLflow, an open source platform to manage the machine learning (ML) lifecycle, and the project has been moving quickly since then. MLflow fills a role that hasn’t been served well in the open source community so far: managing the development lifecycle for ML, including tracking experiments and metrics, […]

Introducing Databricks Library Utilities for Notebooks

Posted in Announcements, Engineering Blog

Databricks has introduced a new feature, Library Utilities for Notebooks, as part of Databricks Runtime 5.1. It allows you to install and manage Python dependencies from within a notebook. This provides several important benefits: install libraries when and where they’re needed, from within a notebook. This eliminates the need to globally install libraries on a […]

Introducing Databricks Runtime 5.1 for Machine Learning

Posted in Announcements, Apache Spark, Company Blog, Databricks Runtime 5.1 ML, Deep Learning, Engineering Blog, Machine Learning, PyTorch, TensorFlow

Last week, we released Databricks Runtime 5.1 Beta for Machine Learning. As part of our commitment to provide developers with the latest deep learning frameworks, this release includes the best of these libraries. In particular, our PyTorch addition makes it simple for developers to import the appropriate Python torch modules and start coding, without […]

MLflow v0.8.1 Features Faster Experiment UI and Enhanced Python Model

Posted in Apache Spark, Data Science, Ecosystem, Engineering Blog, Machine Learning, Machine Learning Life Cycle, MLflow, Model Management, Platform, Spark UDF

Try this notebook in Databricks. MLflow v0.8.1 was released this week. It introduces several UI enhancements, including faster load times for thousands of runs and improved responsiveness when navigating runs with many metrics and parameters. Additionally, it expands support for evaluating Python models as Apache Spark UDFs and automatically captures model dependencies as Conda environments. […]

Introducing Built-in Image Data Source in Apache Spark 2.4

Posted in Apache Spark, Data Source, Databricks Runtime, DataFrames, Deep Learning Pipelines, Ecosystem, Engineering Blog, Machine Learning

With recent advances in deep learning frameworks for image classification and object detection, the demand for standard image processing in Apache Spark has never been greater. Image handling and preprocessing come with their own specific challenges – for example, images arrive in different formats (e.g., JPEG, PNG), sizes, and color schemes, and there is no […]
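The built-in image data source described above can be used like any other Spark format. A minimal sketch, assuming a Spark 2.4+ session and a hypothetical image directory path (the snippet requires a running Spark environment, so it is illustrative rather than a definitive recipe):

```python
from pyspark.sql import SparkSession

# Requires an Apache Spark 2.4+ environment; the image path is hypothetical.
spark = SparkSession.builder.master("local[1]").appName("image-source").getOrCreate()

# Read a directory of images; files that fail to decode can be dropped
# rather than raising an error.
df = (spark.read.format("image")
      .option("dropInvalid", True)
      .load("/path/to/images"))

# Each row carries a single "image" struct column with fields such as
# origin, height, width, nChannels, mode, and the raw bytes in "data".
df.select("image.origin", "image.width", "image.height").show(truncate=False)
```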

Apache Avro as a Built-in Data Source in Apache Spark 2.4

Posted in Apache Avro, Apache Spark, Apache Spark 2.4, Data Source, Ecosystem, Engineering Blog, Spark SQL, Streaming, Structured Streaming

Try this notebook in Databricks. Apache Avro is a popular data serialization format. It is widely used in the Apache Spark and Apache Hadoop ecosystems, especially for Kafka-based data pipelines. Starting with the Apache Spark 2.4 release, Spark provides built-in support for reading and writing Avro data. The new built-in spark-avro module originally comes from Databricks’ […]
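With the built-in module, Avro is addressed by the short format name "avro". A minimal sketch, assuming a Spark 2.4+ application launched with the spark-avro module on the classpath (e.g. via --packages) and a hypothetical output path:

```python
from pyspark.sql import SparkSession

# Requires Spark 2.4+ started with the spark-avro module available, e.g.:
#   spark-submit --packages org.apache.spark:spark-avro_2.11:2.4.0 ...
spark = SparkSession.builder.appName("avro-demo").getOrCreate()

df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])

# Write and read Avro data via the short "avro" format name added in 2.4.
df.write.format("avro").mode("overwrite").save("/tmp/users.avro")
spark.read.format("avro").load("/tmp/users.avro").show()
```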

Introducing Databricks Runtime 5.0 for Machine Learning

Posted in Announcements, Company Blog, Databricks Runtime 5.0 ML, Deep Learning, Ecosystem, Engineering Blog, Machine Learning, Platform

Six months ago we introduced the Databricks Runtime for Machine Learning with the goal of making machine learning performant and easy on the Databricks Unified Analytics Platform. The Databricks Runtime for ML comes pre-packaged with many ML frameworks and enables distributed training and inference. Today we are excited to release the second iteration including Conda […]

MLflow v0.8.0 Features Improved Experiment UI and Deployment Tools

Posted in Engineering Blog, Machine Learning, Machine Learning Life Cycle, MLflow, Model Management

Last week we released MLflow v0.8.0 with multiple new features, including an improved UI experience and support for deploying models directly via Docker containers to the Azure Machine Learning Service Workspace. Now available on PyPI with docs online, the new release can be installed with pip install mlflow, as described in the MLflow quickstart guide. In […]

Introducing HorovodRunner for Distributed Deep Learning Training

Posted in Apache Spark, Deep Learning, Distributed Learning, Engineering Blog, Keras, Project Hydrogen, TensorFlow

Today, we are excited to introduce HorovodRunner in our Databricks Runtime 5.0 ML! HorovodRunner provides a simple way to scale up your deep learning training workloads from a single machine to large clusters, reducing overall training time. Motivated by the needs of many of our users who want to train deep learning models on datasets […]

Introducing New Built-in and Higher-Order Functions for Complex Data Types in Apache Spark 2.4

Posted in Apache Spark, Apache Spark 2.4, Engineering Blog, SparkSQL, sql

Try this notebook in Databricks. Apache Spark 2.4 introduces 29 new built-in functions for manipulating complex types (e.g., array type), including higher-order functions. Before Spark 2.4, there were two typical solutions for manipulating complex types directly: 1) exploding the nested structure into individual rows, applying a function, and then creating the structure again […]
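The higher-order functions mentioned above (such as transform, filter, and aggregate) take a lambda expression and operate on array columns directly, avoiding the explode-and-rebuild pattern. A minimal sketch, assuming a Spark 2.4+ session (the snippet requires a running Spark environment; the view and column names are illustrative):

```python
from pyspark.sql import SparkSession

# Requires Spark 2.4+; transform/filter/aggregate work on array columns
# in place, without exploding them into individual rows.
spark = SparkSession.builder.master("local[1]").appName("hof-demo").getOrCreate()

df = spark.createDataFrame([([1, 2, 3],), ([4, 5],)], ["values"])
df.createOrReplaceTempView("nested")

spark.sql("""
    SELECT values,
           transform(values, x -> x + 1)             AS incremented,
           filter(values, x -> x % 2 = 0)            AS evens,
           aggregate(values, 0, (acc, x) -> acc + x) AS total
    FROM nested
""").show()
```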