Scaling Genomic Workflows with Spark SQL BGEN and VCF Readers

Posted in Announcements, Apache Spark, BGEN, Ecosystem, Engineering Blog, Genomics, HLS, Spark SQL, VCF

In the past decade, the amount of available genomic data has exploded as the price of genome sequencing has dropped. Researchers are now able to scan for associations between genetic variation and diseases across cohorts of hundreds of thousands of individuals from projects such as the UK Biobank. These analyses will lead to a deeper […]

A Guide to Developer, Deep Dive, and Continuous Streaming Applications Talks at Spark + AI Summit

Posted in Apache Spark, Company Blog, Databricks, Databricks Delta, Education, Events, Spark + AI Summit, Spark SQL, Spark Training, Structured Streaming

In January 2013, when Stephen O’Grady, an analyst at RedMonk, published “The New Kingmakers: How Developers Conquered the World,” the book’s central argument (then and still now) resonated universally with an emerging open-source community. He convincingly charts developers’ movement “out of the shadows and into the light as new influencers on society’s [technical landscape].” Using […]

Apache Avro as a Built-in Data Source in Apache Spark 2.4

Posted in Apache Avro, Apache Spark, Apache Spark 2.4, Data Source, Ecosystem, Engineering Blog, Spark SQL, Streaming, Structured Streaming

Try this notebook in Databricks. Apache Avro is a popular data serialization format. It is widely used in the Apache Spark and Apache Hadoop ecosystem, especially for Kafka-based data pipelines. Starting with the Apache Spark 2.4 release, Spark provides built-in support for reading and writing Avro data. The new built-in spark-avro module is originally from Databricks’ […]

SQL Pivot: Converting Rows to Columns

Posted in Apache Spark, DataFrames, Engineering Blog, Spark SQL, sql, Unified Analytics Platform

Try this notebook in Databricks. Pivot was first introduced in Apache Spark 1.6 as a new DataFrame feature that allows users to rotate a table-valued expression by turning the unique values from one column into individual columns. The upcoming Apache Spark 2.4 release extends this powerful functionality of pivoting data to our SQL users as […]
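
To make the idea of "turning the unique values from one column into individual columns" concrete, here is a minimal pure-Python sketch of a pivot. It is only an illustration of the concept, not Spark's implementation or API; the `pivot` helper, its parameter names, and the sample `sales` rows are all hypothetical.

```python
from collections import defaultdict


def pivot(rows, index, columns, values, agg=sum):
    """Rotate rows so each distinct value of `columns` becomes its own column.

    Hypothetical helper for illustration only; not part of Spark.
    """
    # Collect the values for every (index value, pivot-column value) pair.
    cells = defaultdict(list)
    for row in rows:
        cells[(row[index], row[columns])].append(row[values])

    # Distinct pivot values become the new column names, in sorted order.
    new_columns = sorted({key[1] for key in cells})
    index_values = sorted({key[0] for key in cells})

    result = []
    for idx in index_values:
        out = {index: idx}
        for col in new_columns:
            group = cells.get((idx, col))
            # Combinations with no rows are left as None, much like a SQL NULL.
            out[col] = agg(group) if group else None
        result.append(out)
    return result


# Made-up sample data: one row per sale.
sales = [
    {"year": 2017, "quarter": "Q1", "amount": 100},
    {"year": 2017, "quarter": "Q2", "amount": 150},
    {"year": 2018, "quarter": "Q1", "amount": 120},
    {"year": 2017, "quarter": "Q1", "amount": 30},
]

pivoted = pivot(sales, index="year", columns="quarter", values="amount")
for row in pivoted:
    print(row)
```

Each output row corresponds to one year, with one column per distinct quarter holding the aggregated amount. In Spark itself, the equivalent operation would go through the DataFrame or SQL pivot support that the post describes.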

Introducing Flint: A time-series library for Apache Spark

Posted in Apache Spark, Company Blog, Customers, Education, Engineering Blog, Flint, python, Spark SQL, Time Series

This is a joint guest community blog by Li Jin at Two Sigma and Kevin Rasmussen at Databricks; they share how to use Flint with Apache Spark. Introduction: The volume of data that data scientists face these days increases relentlessly, and we now find that a traditional, single-machine solution is no longer adequate to the demands […]

A Guide to Data Science, Developer, and Deep Dive Talks at Spark + AI Summit Europe

Posted in Apache Spark, Company Blog, Data Science, Events, Machine Learning, Spark SQL, Structured Streaming

In October 2012, Harvard Business Review put a spotlight on the data science career with a dedicated issue and a catchy claim: “Data Scientist: The Sexiest Job of the 21st Century.” Last year in October, five years on, Forbes recast an answer on Quora, “Why Data Science Is Such A Hot Career Right Now?” Recent technical […]

Processing Petabytes of Data in Seconds with Databricks Delta

Posted in Apache Spark, Databricks Delta, Engineering Blog, Machine Learning, Spark SQL, Streaming, Structured Streaming, Unified Analytics Platform

Introduction: Databricks Delta is a unified data management system that brings data reliability and fast analytics to cloud data lakes. In this blog post, we take a peek under the hood to examine what makes Databricks Delta capable of sifting through petabytes of data within seconds. In particular, we discuss Data Skipping and ZORDER Clustering. […]

rquery: Practical Big Data Transforms for R-Spark Users

Posted in Apache Spark, Big Data, Data Science, Engineering Blog, Machine Learning, R, Spark SQL, SparkR

This is a guest community blog from Nina Zumel and John Mount, data scientists and consultants at Win-Vector. They share how to use rquery with Apache Spark on Databricks. Try this notebook in Databricks. Introduction: In this blog, we will introduce rquery, a powerful query tool that allows R users to implement powerful data transformations […]

Simplify Advertising Analytics Click Prediction with Databricks Unified Analytics Platform

Posted in Advertising Analytics, Apache Spark, Data Visualization, Ecosystem, Education, ETL, Machine Learning, Platform, Product, Spark SQL, Streaming

Try this notebook series in Databricks. Advertising teams want to analyze their immense stores and varieties of data, which requires a scalable, extensible, and elastic platform. Advanced analytics, including but not limited to classification, clustering, recognition, prediction, and recommendations, allow these organizations to gain deeper insights from their data and drive business outcomes. As data of […]