Scaling Genomic Workflows with Spark SQL BGEN and VCF Readers

Posted in Announcements, Apache Spark, BGEN, Ecosystem, Engineering Blog, Genomics, HLS, Spark SQL, VCF

In the past decade, the amount of available genomic data has exploded as the price of genome sequencing has dropped. Researchers are now able to scan for associations between genetic variation and diseases across cohorts of hundreds of thousands of individuals from projects such as the UK Biobank. These analyses will lead to a deeper […]
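
As a rough illustration of the readers named in the title, the sketch below loads VCF and BGEN files into Spark DataFrames and queries them with Spark SQL. It assumes a Databricks genomics runtime (or the Glow library) that registers the "vcf" and "bgen" data sources; the file paths are placeholders and column names such as contigName follow the reader's schema.

```python
# Minimal sketch: reading variant data with the Spark SQL VCF and BGEN readers.
# Assumes the genomics data sources are on the classpath; paths are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("genomics-readers").getOrCreate()

# Read a block-gzipped VCF into a DataFrame of variants.
vcf_df = spark.read.format("vcf").load("/data/genomics/cohort.vcf.gz")

# Read UK Biobank-style imputed genotypes stored as BGEN.
bgen_df = spark.read.format("bgen").load("/data/genomics/chr22.bgen")

# Once loaded, variants can be filtered and aggregated with ordinary Spark SQL.
vcf_df.createOrReplaceTempView("variants")
spark.sql("SELECT contigName, COUNT(*) AS n FROM variants GROUP BY contigName").show()
```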

Accurately Building Genomic Cohorts at Scale with Delta Lake and Spark SQL

Posted in Apache Spark, Delta, Delta Lake, Ecosystem, Engineering Blog, Genomics, HLS, Joint Genotyping, SparkSQL

This is the second post in our “Genomic Analysis at Scale” series. In our first post, we explored a simple problem: how to provide real-time aggregates when sequencing large volumes of genomes. We solved this problem with Delta Lake and a streaming pipeline built on Spark SQL. In this blog, we focus on the more advanced […]
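
As a rough illustration of the streaming-aggregation pattern mentioned above, the sketch below keeps a running per-sample variant count in a Delta Lake table. The table paths and column names are illustrative placeholders, not the exact pipeline from the post.

```python
# Minimal sketch: continuously aggregate newly arriving variant records and
# persist the running totals to Delta Lake. Paths/columns are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("genomic-aggregates").getOrCreate()

# Stream newly arriving variant records from a Delta table.
variants = spark.readStream.format("delta").load("/delta/variants")

# Maintain a per-sample variant count as a real-time aggregate.
counts = variants.groupBy("sampleId").agg(F.count("*").alias("num_variants"))

# Write the aggregates back to Delta, refreshing results on each trigger.
(counts.writeStream
    .format("delta")
    .outputMode("complete")
    .option("checkpointLocation", "/delta/_checkpoints/variant_counts")
    .start("/delta/variant_counts"))
```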

Hyperparameter Tuning with MLflow, Apache Spark MLlib and Hyperopt

Posted in Apache Spark, AutoML, Data Science, Databricks Runtime 5.4 ML, Deep Learning, Ecosystem, Engineering Blog, Hyperopt, Hyperparameter Tuning, Machine Learning, MLflow, MLlib

Hyperparameter tuning is a common technique for optimizing machine learning models by adjusting hyperparameters: configurations that are not learned during model training. Tuning these configurations can dramatically improve model performance. However, hyperparameter tuning can be computationally expensive, slow, and unintuitive even for experts. Databricks Runtime 5.4 and 5.4 ML (Azure | AWS) introduce new […]
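
As a rough illustration of distributed tuning with Hyperopt, the sketch below minimizes a toy objective with SparkTrials, the trial runner introduced alongside Databricks Runtime 5.4 ML. The objective function and search space are placeholders standing in for real model training.

```python
# Minimal sketch: distributed hyperparameter search with Hyperopt + SparkTrials.
from hyperopt import fmin, tpe, hp, SparkTrials

def objective(params):
    # In practice, train and evaluate a model here; return the loss to minimize.
    x = params["x"]
    return (x - 3) ** 2

search_space = {"x": hp.uniform("x", -10, 10)}

# SparkTrials fans individual trials out across the Spark cluster.
trials = SparkTrials(parallelism=4)

best = fmin(
    fn=objective,
    space=search_space,
    algo=tpe.suggest,
    max_evals=32,
    trials=trials,
)
print(best)  # best hyperparameter values found, e.g. {'x': 2.97...}
```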

Announcing the MLflow 1.0 Release

Posted in Announcements, Company Blog, Data Science, Ecosystem, Engineering Blog, Lifecycle, Machine Learning, MLflow, Model Management, Product

MLflow is an open source platform to help manage the complete machine learning lifecycle. With MLflow, data scientists can track and share experiments locally (on a laptop) or remotely (in the cloud), package and share models across frameworks, and deploy models virtually anywhere. Today we are excited to announce the release of MLflow 1.0. Since […]
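
As a rough illustration of the experiment tracking described above, the sketch below logs parameters, a metric, and an artifact for a single MLflow run. The parameter names and values are placeholders.

```python
# Minimal sketch: tracking one run with the MLflow tracking API.
import mlflow

with mlflow.start_run(run_name="example-run"):
    mlflow.log_param("learning_rate", 0.01)
    mlflow.log_param("num_trees", 100)
    mlflow.log_metric("rmse", 0.72)

    # Artifacts (plots, model files, notes, etc.) can be attached to the run too.
    with open("notes.txt", "w") as f:
        f.write("baseline random forest")
    mlflow.log_artifact("notes.txt")
```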

Enhanced Hyperparameter Tuning and Optimized AWS Storage with Databricks Runtime 5.4 ML

Posted in Announcements, AutoML, Company Blog, Data Science, Databricks Runtime 5.4 ML, Deep Learning, Ecosystem, Engineering Blog, Hyperopt, Hyperparameter Tuning, Machine Learning, MLflow, MLlib, Platform, Product

We are excited to announce the release of Databricks Runtime 5.4 ML (Azure | AWS). This release includes two Public Preview features to improve data science productivity, optimized storage in AWS for developing distributed applications, and a number of Python library upgrades. To get started, simply select Databricks Runtime 5.4 ML from the […]

Introducing Databricks Runtime 5.4 with Conda (Beta)

Posted in Announcements, Company Blog, Data Science, Databricks Runtime, Deep Learning, Ecosystem, Engineering Blog, Machine Learning, Product

We are excited to introduce a new runtime: Databricks Runtime 5.4 with Conda (Beta). This runtime uses Conda to manage Python libraries and environments. Many of our Python users prefer to manage their Python environments and libraries with Conda, which is quickly emerging as a standard. Conda takes a holistic approach to package management by […]

Efficient Databricks Deployment Automation with Terraform

Posted in CI/CD, cloud automation, Company Blog, Customers, Ecosystem, Education, Engineering Blog, Platform

Managing cloud infrastructure and provisioning resources can be a headache that DevOps engineers are all too familiar with. Even the most capable cloud admins can get bogged down with managing a bewildering number of interconnected cloud resources – including data streams, storage, compute power, and analytics tools. Take, for example, the following scenario: a customer […]

Announcing General Availability of Managed MLflow on Databricks

Posted in Announcements, Company Blog, Ecosystem, Engineering Blog, Machine Learning, Managed MLflow, MLflow, Platform, Product

MLflow is an open source platform to help manage the complete machine learning lifecycle. With MLflow, data scientists can track and share experiments locally or in the cloud, package and share models across frameworks, and deploy models virtually anywhere. Today at the Spark + AI Summit, we announced the General […]
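
As a rough illustration of the model packaging and deployment workflow mentioned above, the sketch below logs a scikit-learn model with MLflow and loads it back for scoring. The model, dataset, and artifact path are placeholders.

```python
# Minimal sketch: package a model with MLflow so it can be shared and re-loaded.
import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=200, n_features=5, random_state=0)
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

with mlflow.start_run() as run:
    # Log the fitted model in MLflow's format so it can be re-loaded or served.
    mlflow.sklearn.log_model(model, artifact_path="model")

# Later (or elsewhere), load the model back by its run URI for scoring.
loaded = mlflow.sklearn.load_model(f"runs:/{run.info.run_id}/model")
print(loaded.predict(X[:3]))
```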

Koalas: Easy Transition from pandas to Apache Spark

Posted in Announcements, Apache Spark, Company Blog, Data Science, Ecosystem, Education, Engineering Blog, Machine Learning, Open Source, Pandas, python

Today at Spark + AI Summit, we announced Koalas, a new open source project that augments PySpark’s DataFrame API to make it compatible with pandas. Python data science has exploded over the past few years and pandas has emerged as the lynchpin of the ecosystem. When data scientists get their hands on a data set, […]
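
As a rough illustration of the pandas-compatible API that Koalas layers on top of Spark, the sketch below builds a small DataFrame and runs familiar pandas operations that execute as distributed Spark jobs. It assumes the `koalas` package is installed; the data is a toy example.

```python
# Minimal sketch: pandas-style operations on Spark via Koalas.
import databricks.koalas as ks

# Create a Koalas DataFrame exactly as you would a pandas DataFrame.
kdf = ks.DataFrame({
    "species": ["setosa", "versicolor", "virginica", "setosa"],
    "petal_length": [1.4, 4.5, 5.9, 1.3],
})

# Familiar pandas operations, executed by Spark under the hood.
print(kdf["petal_length"].mean())
print(kdf.groupby("species")["petal_length"].max().sort_index())
```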