A Guide to Developer, Deep Dive, and Continuous Streaming Applications Talks at Spark + AI Summit

Posted in Apache Spark, Company Blog, Databricks, Databricks Delta, Education, Events, Spark + AI Summit, Spark SQL, Spark Training, Structured Streaming

In January 2013, when Stephen O’Grady, an analyst at RedMonk, published “The New Kingmakers: How Developers Conquered the World,” the book’s central argument (then and still now) universally resonated with an emerging open-source community. He convincingly charts developers’ movement “out of the shadows and into the light as new influencers on society’s [technical landscape].” Using […]

How to Work with Avro, Kafka, and Schema Registry in Databricks

Posted in Apache Avro, Apache Kafka, Apache Spark, Company Blog, DBR 5.2, Ecosystem, Engineering Blog, Product, Streaming, Structured Streaming

In the previous blog post, we introduced the new built-in Apache Avro data source in Apache Spark and explained how you can use it to build streaming data pipelines with the from_avro and to_avro functions. Apache Kafka and Apache Avro are commonly used together to build scalable, near-real-time data pipelines. In this blog post, […]

Apache Avro as a Built-in Data Source in Apache Spark 2.4

Posted in Apache Avro, Apache Spark, Apache Spark 2.4, Data Source, Ecosystem, Engineering Blog, Spark SQL, Streaming, Structured Streaming

Apache Avro is a popular data serialization format. It is widely used in the Apache Spark and Apache Hadoop ecosystem, especially for Kafka-based data pipelines. Starting with the Apache Spark 2.4 release, Spark provides built-in support for reading and writing Avro data. The new built-in spark-avro module is originally from Databricks’ […]

Introducing Apache Spark 2.4 – The Databricks Blog

Posted in Apache Spark, Apache Spark 2.4, Databricks Runtime 5.0, Ecosystem, Engineering Blog, Machine Learning, Pandas UDF, Platform, SparkSQL, Streaming, Structured Streaming, Unified Analytics Platform

We are excited to announce the availability of Apache Spark 2.4 on Databricks as part of Databricks Runtime 5.0. We want to thank the Apache Spark community for all their valuable contributions to the Spark 2.4 release. Continuing with the objectives to make Spark faster, easier, and smarter, Spark 2.4 extends its scheduler to […]

A Guide to Apache Spark Use Cases, Streaming, and Research Talks at Spark + AI Summit Europe

Posted in Apache Spark, Company Blog, Data Science, Events, Genomics, Machine Learning, PySpark, Spark + AI Summit Europe, Structured Streaming, Unified Analytics Platform

For much of Apache Spark’s history, its capacity to process data at scale and its capability to unify disparate workloads have led Spark developers to tackle new use cases. Through innovation and extension of its ecosystem, developers combine data and AI to develop new applications. So it befits developers to come to this summit not just […]

Building a Real-Time Attribution Pipeline with Databricks Delta

Posted in Adhoc Analysis, Advertising Analytics, Apache Spark, bi, Company Blog, Databricks Delta, Ecosystem, Education, Engineering Blog, Kinesis, Machine Learning, Platform, Product, Spark Streaming, Streaming, Structured Streaming, Tableau

In digital advertising, one of the most important things to be able to deliver to clients is information about how their advertising spend drove results. The more quickly we can provide this, the better. To tie conversions or engagements to the impressions served in an advertising campaign, companies must perform […]

A Guide to Data Science, Developer, and Deep Dive Talks at Spark + AI Summit Europe

Posted in Apache Spark, Company Blog, Data Science, Events, Machine Learning, Spark SQL, Structured Streaming

In October 2012, Harvard Business Review put a spotlight on the data science career with a dedicated issue and a catchy claim: Data Scientist: The Sexiest Job of the 21st Century. Last year in October, five years on, Forbes recast an answer on Quora, Why Data Science Is Such A Hot Career Right Now? Recent technical […]

Processing Petabytes of Data in Seconds with Databricks Delta

Posted in Apache Spark, Databricks Delta, Engineering Blog, Machine Learning, Spark SQL, Streaming, Structured Streaming, Unified Analytics Platform

Databricks Delta is a unified data management system that brings data reliability and fast analytics to cloud data lakes. In this blog post, we take a peek under the hood to examine what makes Databricks Delta capable of sifting through petabytes of data within seconds. In particular, we discuss Data Skipping and ZORDER Clustering. […]

A Guide to Developer, Apache Spark Use Cases, and Deep Dives Talks at Spark + AI Summit

Posted in Apache Spark, Company Blog, Events, Kubernetes, Machine Learning, PySpark, Spark + AI Summit, Structured Streaming

Apache Spark is tackling new frontiers through innovations by unifying new workloads. This enables developers to combine data and AI to develop intelligent applications. Developers come to this summit not just to hear about innovations from contributors. They come to share their use cases and experiences and to absorb knowledge. @matei_zaharia just announced support of #kubernetes in […]

Introducing Low-latency Continuous Processing Mode in Structured Streaming in Apache Spark 2.3

Posted in Apache Spark, Apache Spark 2.0, Continuous Processing, Databricks Runtime 4.0, Streaming, Structured Streaming

Structured Streaming in Apache Spark 2.0 decoupled micro-batch processing from its high-level APIs for a couple of reasons. First, it made the developer’s experience with the APIs simpler: the APIs did not have to account for micro-batches. Second, it allowed developers to treat a stream as an infinite table to which […]