Accurately Building Genomic Cohorts at Scale with Delta Lake and Spark SQL

Posted in Apache Spark, Delta, Delta Lake, Ecosystem, Engineering Blog, Genomics, HLS, Joint Genotyping, SparkSQL

This is the second post in our “Genomic Analysis at Scale” series. In our first post, we explored a simple problem: how to provide real-time aggregates when sequencing large volumes of genomes. We solved this problem by using Delta Lake and a streaming pipeline built with Spark SQL. In this blog, we focus on the more advanced […]

Introducing New Built-in and Higher-Order Functions for Complex Data Types in Apache Spark 2.4

Posted in Apache Spark, Apache Spark 2.4, Engineering Blog, SparkSQL, sql

Apache Spark 2.4 introduces 29 new built-in functions for manipulating complex types (e.g., array type), including higher-order functions. Before Spark 2.4, there were two typical solutions for manipulating complex types directly: 1) exploding the nested structure into individual rows, applying some functions, and then creating the structure again […]

Introducing Apache Spark 2.4 – The Databricks Blog

Posted in Apache Spark, Apache Spark 2.4, Databricks Runtime 5.0, Ecosystem, Engineering Blog, Machine Learning, Pandas UDF, Platform, SparkSQL, Streaming, Structured Streaming, Unified Analytics Platform

We are excited to announce the availability of Apache Spark 2.4 on Databricks as part of Databricks Runtime 5.0. We want to thank the Apache Spark community for all their valuable contributions to the Spark 2.4 release. Continuing with the objective of making Spark faster, easier, and smarter, Spark 2.4 extends its scheduler to […]