Introducing Apache Spark 2.4 – The Databricks Blog

Posted in Apache Spark, Apache Spark 2.4, Databricks Runtime 5.0, Ecosystem, Engineering Blog, Machine Learning, Pandas UDF, Platform, SparkSQL, Streaming, Structured Streaming, Unified Analytics Platform

We are excited to announce the availability of Apache Spark 2.4 on Databricks as part of the Databricks Runtime 5.0. We want to thank the Apache Spark community for all their valuable contributions to the Spark 2.4 release. Continuing with the objectives to make Spark faster, easier, and smarter, Spark 2.4 extends its scheduler to […]

Building a Real-Time Attribution Pipeline with Databricks Delta

Posted in Adhoc Analysis, Advertising Analytics, Apache Spark, bi, Company Blog, Databricks Delta, Ecosystem, Education, Engineering Blog, Kinesis, Machine Learning, Platform, Product, Spark Streaming, Streaming, Structured Streaming, Tableau

Try this notebook in Databricks. In digital advertising, one of the most important things to be able to deliver to clients is information about how their advertising spend drove results. The more quickly we can provide this, the better. To tie conversions or engagements to the impressions served in an advertising campaign, companies must perform […]

Processing Petabytes of Data in Seconds with Databricks Delta

Posted in Apache Spark, Databricks Delta, Engineering Blog, Machine Learning, Spark SQL, Streaming, Structured Streaming, Unified Analytics Platform

Introduction: Databricks Delta is a unified data management system that brings data reliability and fast analytics to cloud data lakes. In this blog post, we take a peek under the hood to examine what makes Databricks Delta capable of sifting through petabytes of data within seconds. In particular, we discuss Data Skipping and ZORDER Clustering. […]

Simplify Advertising Analytics Click Prediction with Databricks Unified Analytics Platform

Posted in Advertising Analytics, Apache Spark, Data Visualization, Ecosystem, Education, ETL, Machine Learning, Platform, Product, Spark SQL, Streaming

Try this notebook series in Databricks. Advertising teams want to analyze their immense stores and varieties of data, which requires a scalable, extensible, and elastic platform. Advanced analytics, including but not limited to classification, clustering, recognition, prediction, and recommendations, allow these organizations to gain deeper insights from their data and drive business outcomes. As data of […]

Simplify Streaming Stock Data Analysis Using Databricks Delta

Posted in Apache Spark, Data Lakes, Data Warehousing, Databricks Delta, Ecosystem, Education, financial, Machine Learning, Platform, Product, Stock Prices, Streaming

Traditionally, real-time analysis of stock data was a complicated endeavor due to the complexities of maintaining a streaming system and ensuring transactional consistency of legacy and streaming data concurrently. Databricks Delta helps solve many of the pain points of building a streaming system to analyze stock data in real time. In the following diagram, we provide […]

Make Your Oil and Gas Assets Smarter by Implementing Predictive Maintenance with Databricks

Posted in Apache Spark, Ecosystem, Education, Engineering Blog, Machine Learning, Platform, Product, Streaming

How to build an end-to-end predictive data pipeline with Databricks Delta and Spark Streaming. Try this notebook in Databricks. Maintaining assets such as compressors is an extremely complex endeavor: they are used in everything from small drilling rigs to deep-water platforms, the assets are located across the globe, and they generate terabytes of data daily. A […]

Announcing Databricks Runtime 4.2!

Posted in Announcements, Apache Spark, Company Blog, Customers, Databricks, Delta, Engineering Blog, Platform, Product, Runtime, Streaming

We’re excited to announce Databricks Runtime 4.2, powered by Apache Spark™.  Version 4.2 includes updated Spark internals, new features, and major performance upgrades to Databricks Delta, as well as general quality improvements to the platform.  We are moving quickly toward the Databricks Delta general availability (GA) release and we recommend you upgrade to Databricks Runtime […]

Build a Mobile Gaming Events Data Pipeline with Databricks Delta

Posted in Apache Spark, Education, Machine Learning, Streaming

How to build an end-to-end data pipeline with Structured Streaming. Try this notebook in Databricks. The world of mobile gaming is fast-paced and requires the ability to scale quickly. With millions of users around the world generating millions of events per second by means of game play, you will need to calculate key metrics (score […]

Introducing Low-latency Continuous Processing Mode in Structured Streaming in Apache Spark 2.3

Posted in Apache Spark, Apache Spark 2.0, Continuous Processing, Databricks Runtime 4.0, Streaming, Structured Streaming

Import this notebook on Databricks. Structured Streaming in Apache Spark 2.0 decoupled micro-batch processing from its high-level APIs for a couple of reasons. First, it made the developer’s experience with the APIs simpler: the APIs did not have to account for micro-batches. Second, it allowed developers to treat a stream as an infinite table to which […]

Introducing Stream-Stream Joins in Apache Spark 2.3

Posted in Apache Spark, Databricks Runtime, Engineering Blog, Streaming, Structured Streaming

Since we introduced Structured Streaming in Apache Spark 2.0, it has supported joins (inner join and some types of outer joins) between a streaming and a static DataFrame/Dataset. With the release of Apache Spark 2.3.0, available in Databricks Runtime 4.0 as part of the Databricks Unified Analytics Platform, we now support stream-stream joins. In this […]