Apache Avro as a Built-in Data Source in Apache Spark 2.4

Posted in Apache Avro, Apache Spark, Apache Spark 2.4, Data Source, Ecosystem, Engineering Blog, Spark SQL, Streaming, Structured Streaming

Apache Avro is a popular data serialization format. It is widely used in the Apache Spark and Apache Hadoop ecosystems, especially in Kafka-based data pipelines. Starting with the Apache Spark 2.4 release, Spark provides built-in support for reading and writing Avro data. The new built-in spark-avro module is originally from Databricks’ […]

Introducing New Built-in and Higher-Order Functions for Complex Data Types in Apache Spark 2.4

Posted in Apache Spark, Apache Spark 2.4, Engineering Blog, SparkSQL, sql

Apache Spark 2.4 introduces 29 new built-in functions for manipulating complex types (e.g., array type), including higher-order functions. Before Spark 2.4, there were two typical solutions for manipulating complex types directly: 1) exploding the nested structure into individual rows, applying some functions, and then creating the structure again […]

Introducing Apache Spark 2.4 – The Databricks Blog

Posted in Apache Spark, Apache Spark 2.4, Databricks Runtime 5.0, Ecosystem, Engineering Blog, Machine Learning, Pandas UDF, Platform, SparkSQL, Streaming, Structured Streaming, Unified Analytics Platform

We are excited to announce the availability of Apache Spark 2.4 on Databricks as part of Databricks Runtime 5.0. We want to thank the Apache Spark community for all their valuable contributions to the Spark 2.4 release. Continuing with the objective of making Spark faster, easier, and smarter, Spark 2.4 extends its scheduler to […]