Introducing Apache Spark 2.4 – The Databricks Blog

Posted in Apache Spark, Apache Spark 2.4, Databricks Runtime 5.0, Ecosystem, Engineering Blog, Machine Learning, Pandas UDF, Platform, SparkSQL, Streaming, Structured Streaming, Unified Analytics Platform

We are excited to announce the availability of Apache Spark 2.4 on Databricks as part of the Databricks Runtime 5.0. We want to thank the Apache Spark community for all their valuable contributions to the Spark 2.4 release. Continuing with the objectives to make Spark faster, easier, and smarter, Spark 2.4 extends its scheduler to […]

A Guide to Apache Spark Use Cases, Streaming, and Research Talks at Spark + AI Summit Europe

Posted in Apache Spark, Company Blog, Data Science, Events, Genomics, Machine Learning, PySpark, Spark + AI Summit Europe, Structured Streaming, Unified Analytics Platform

For much of Apache Spark’s history, its capacity to process data at scale and capability to unify disparate workloads has led Spark developers to tackle new use cases. Through innovation and extension of its ecosystem, developers combine data and AI to develop new applications. So it befits developers to come to this summit not just […]

Building a Real-Time Attribution Pipeline with Databricks Delta

Posted in Adhoc Analysis, Advertising Analytics, Apache Spark, bi, Company Blog, Databricks Delta, Ecosystem, Education, Engineering Blog, Kinesis, Machine Learning, Platform, Product, Spark Streaming, Streaming, Structured Streaming, Tableau

Try this notebook in Databricks. In digital advertising, one of the most important things to be able to deliver to clients is information about how their advertising spend drove results. The more quickly we can provide this, the better. To tie conversions or engagements to the impressions served in an advertising campaign, companies must perform […]

A Guide to Data Science, Developer, and Deep Dive Talks at Spark + AI Summit Europe

Posted in Apache Spark, Company Blog, Data Science, Events, Machine Learning, Spark SQL, Structured Streaming

In October 2012, Harvard Business Review put a spotlight on the data science career with a dedicated issue and a catchy claim: Data Scientist: The Sexiest Job of the 21st Century. Last year in October, five years on, Forbes recast an answer on Quora, Why Data Science Is Such A Hot Career Right Now? Recent technical […]

Processing Petabytes of Data in Seconds with Databricks Delta

Posted in Apache Spark, Databricks Delta, Engineering Blog, Machine Learning, Spark SQL, Streaming, Structured Streaming, Unified Analytics Platform

Introduction: Databricks Delta is a unified data management system that brings data reliability and fast analytics to cloud data lakes. In this blog post, we take a peek under the hood to examine what makes Databricks Delta capable of sifting through petabytes of data within seconds. In particular, we discuss Data Skipping and ZORDER Clustering. […]

A Guide to Developer, Apache Spark Use Cases, and Deep Dives Talks at Spark + AI Summit

Posted in Apache Spark, Company Blog, Events, Kubernetes, Machine Learning, PySpark, Spark + AI Summit, Structured Streaming

Apache Spark is tackling new frontiers through innovations by unifying new workloads. This enables developers to combine data and AI to develop intelligent applications. Developers come to this summit not just to hear about innovations from contributors. They come to share their use cases and experiences and to absorb knowledge. @matei_zaharia just announced support of #kubernetes in […]

Introducing Low-latency Continuous Processing Mode in Structured Streaming in Apache Spark 2.3

Posted in Apache Spark, Apache Spark 2.0, Continuous Processing, Databricks Runtime 4.0, Streaming, Structured Streaming

Import this notebook on Databricks. Structured Streaming in Apache Spark 2.0 decoupled micro-batch processing from its high-level APIs for a couple of reasons. First, it made the developer’s experience with the APIs simpler: the APIs did not have to account for micro-batches. Second, it allowed developers to treat a stream as an infinite table to which […]

Introducing Stream-Stream Joins in Apache Spark 2.3

Posted in Apache Spark, Databricks Runtime, Engineering Blog, Streaming, Structured Streaming

Since we introduced Structured Streaming in Apache Spark 2.0, it has supported joins (inner joins and some types of outer joins) between a streaming and a static DataFrame/Dataset. With the release of Apache Spark 2.3.0, available in Databricks Runtime 4.0 as part of the Databricks Unified Analytics Platform, we now support stream-stream joins. In this […]

Apache Spark 2.3 with Native Kubernetes Support

Posted in Apache Spark, Ecosystem, Engineering Blog, Kubernetes, Machine Learning, Structured Streaming

This is a community blog from Anirudh Ramanathan and Palak Bhatia, software engineer and product manager respectively at Google, working in the Kubernetes team. They are part of the group of companies that contributed to native Kubernetes support for Apache Spark 2.3. This post is cross-posted on blog.kubernetes.io. Kubernetes and Big Data: The open […]