Introducing Built-in Image Data Source in Apache Spark 2.4

Posted in Apache Spark, Data Source, Databricks Runtime, DataFrames, Deep Learning Pipelines, Ecosystem, Engineering Blog, Machine Learning

Introduction: With recent advances in deep learning frameworks for image classification and object detection, the demand for standard image processing in Apache Spark has never been greater. Image handling and preprocessing have their own challenges: for example, images come in different formats (e.g., JPEG, PNG), sizes, and color schemes, and there is no […]

Apache Avro as a Built-in Data Source in Apache Spark 2.4

Posted in Apache Avro, Apache Spark, Apache Spark 2.4, Data Source, Ecosystem, Engineering Blog, Spark SQL, Streaming, Structured Streaming

Apache Avro is a popular data serialization format. It is widely used in the Apache Spark and Apache Hadoop ecosystems, especially for Kafka-based data pipelines. Starting with the Apache Spark 2.4 release, Spark provides built-in support for reading and writing Avro data. The new built-in spark-avro module is originally from Databricks’ […]