Scaling Genomic Workflows with Spark SQL BGEN and VCF Readers

Posted in Announcements, Apache Spark, BGEN, Ecosystem, Engineering Blog, Genomics, HLS, Spark SQL, VCF

In the past decade, the amount of available genomic data has exploded as the price of genome sequencing has dropped. Researchers are now able to scan for associations between genetic variation and diseases across cohorts of hundreds of thousands of individuals from projects such as the UK Biobank. These analyses will lead to a deeper […]

Accurately Building Genomic Cohorts at Scale with Delta Lake and Spark SQL

Posted in Apache Spark, Delta, Delta Lake, Ecosystem, Engineering Blog, Genomics, HLS, Joint Genotyping, SparkSQL

This is the second post in our “Genomic Analysis at Scale” series. In our first post, we explored a simple problem: how to provide real-time aggregates when sequencing large volumes of genomes. We solved this problem by using Delta Lake and a streaming pipeline built with Spark SQL. In this post, we focus on the more advanced […]

A Guide to Healthcare and Life Sciences Talks at Spark + AI Summit 2019

Posted in Apache Spark, bioinformatics, Company Blog, Events, Genomics, healthcare, life sciences, Spark + AI Summit, Summit

Data and AI are ushering in a new era of precision medicine. The scale of the cloud, combined with advancements in machine learning, is enabling healthcare and life sciences organizations to use their mountains of data (such as electronic health records, genomics, real-world evidence, claims, and more) to drive innovation across the entire ecosystem, from accelerating drug […]

Simplifying Genomics Pipelines at Scale with Databricks Delta

Posted in Apache Spark, data pipeline, Engineering Blog, Genomics, HLS, Machine Learning, Streaming

This post is the first in our “Genomics Analysis at Scale” series. In this series, we will demonstrate how the Databricks UAP4Genomics enables customers to analyze population-scale genomic data. Starting from the output of our genomics pipeline, this series will provide a tutorial on using Databricks to run sample […]

Building the Fastest DNASeq Pipeline at Scale

Posted in Benchmark, DNA, DNASeq, Engineering Blog, GATK, Genomics, Platform, Unified Analytics Platform

In June, we announced the Unified Analytics Platform for Genomics with a simple goal: accelerate discovery with a collaborative platform for interactive genomic data processing, analytics, and AI at massive scale. In this post, we’ll go into more detail about one component of the platform: a scalable DNASeq pipeline that is concordant with GATK4 at […]

A Guide to Apache Spark Use Cases, Streaming, and Research Talks at Spark + AI Summit Europe

Posted in Apache Spark, Company Blog, Data Science, Events, Genomics, Machine Learning, PySpark, Spark + AI Summit Europe, Structured Streaming, Unified Analytics Platform

For much of Apache Spark’s history, its capacity to process data at scale and its capability to unify disparate workloads have led Spark developers to tackle new use cases. Through innovation and extension of its ecosystem, developers combine data and AI to develop new applications. So it befits developers to come to this summit not just […]