Introducing Apache Spark 2.4 – The Databricks Blog

Posted in Apache Spark, Apache Spark 2.4, Databricks Runtime 5.0, Ecosystem, Engineering Blog, Machine Learning, Pandas UDF, Platform, SparkSQL, Streaming, Structured Streaming, Unified Analytics Platform

We are excited to announce the availability of Apache Spark 2.4 on Databricks as part of the Databricks Runtime 5.0. We want to thank the Apache Spark community for all their valuable contributions to the Spark 2.4 release. Continuing with the objectives to make Spark faster, easier, and smarter, Spark 2.4 extends its scheduler to […]

SQL Pivot: Converting Rows to Columns

Posted in Apache Spark, DataFrames, Engineering Blog, Spark SQL, sql, Unified Analytics Platform

Pivot was first introduced in Apache Spark 1.6 as a new DataFrame feature that allows users to rotate a table-valued expression by turning the unique values from one column into individual columns. The upcoming Apache Spark 2.4 release extends this powerful functionality of pivoting data to our SQL users as […]
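As a rough illustration of the rotation described above, here is a minimal pure-Python sketch. It is not Spark's implementation or API: the function name `pivot`, the sample rows, and the choice of summing as the aggregate are all assumptions made for this example.

```python
from collections import defaultdict


def pivot(rows, index, column, value):
    """Turn the unique values of `column` into output columns.

    Values that land in the same (index, column) cell are summed.
    Cells with no input rows are reported as None.
    """
    table = defaultdict(lambda: defaultdict(float))
    for row in rows:
        table[row[index]][row[column]] += row[value]

    # One output column per distinct value seen in the pivot column.
    columns = sorted({row[column] for row in rows})
    return [
        {index: key, **{c: cells.get(c) for c in columns}}
        for key, cells in sorted(table.items())
    ]


# Hypothetical sample data: one row per sale.
rows = [
    {"store": "A", "quarter": "Q1", "sales": 4.0},
    {"store": "A", "quarter": "Q1", "sales": 6.0},
    {"store": "A", "quarter": "Q2", "sales": 5.0},
    {"store": "B", "quarter": "Q1", "sales": 7.0},
]

pivoted = pivot(rows, index="store", column="quarter", value="sales")
for line in pivoted:
    print(line)
```

Each distinct `quarter` value becomes its own column, and store B gets `None` for `Q2` because it has no rows for that quarter.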

Simplifying Change Data Capture with Databricks Delta

Posted in Apache Spark, CDC, Change Data Capture, Company Blog, Databricks Delta, Education, Engineering Blog, Product

A common use case that we run into at Databricks is customers looking to perform change data capture (CDC) from one or many sources into a set of Databricks Delta tables. These sources may be on-premises or in the cloud, operational transactional stores, or data warehouses. The common glue that binds them all is […]

MLflow v0.7.0 Features New R API by RStudio

Posted in Announcements, Apache Spark, Company Blog, Deep Learning, Ecosystem, Education, Engineering Blog, GPyOpt, Hyperopt, Java, Keras, Machine Learning, MLflow, multistep workflow, Partners, python, R, RStudio

Today, we’re excited to announce MLflow v0.7.0, released with new features, including a new MLflow R client API contributed by RStudio. A testament to MLflow’s design goal of an open platform with adoption in the community, RStudio’s contribution extends the MLflow platform to a larger R community of data scientists who use RStudio and R […]

What’s New for Apache Spark on Kubernetes in the Upcoming Apache Spark 2.4 Release

Posted in Apache Spark, Ecosystem, Engineering Blog, Kubernetes

This is a community blog from Yinan Li, a software engineer at Google, working in the Kubernetes Engine team. He is part of the group of companies that have contributed to Kubernetes support in the upcoming Apache Spark 2.4. Since the Kubernetes cluster scheduler backend was initially introduced in Apache Spark 2.3, the community has […]

How to Use MLflow To Reproduce Results and Retrain Saved Keras ML Models

Posted in Apache Spark, Engineering Blog, Keras, Machine Learning, MLflow, Model Management, Platform, TensorFlow, Unified Analytics Platform

In part 2 of our series on MLflow blogs, we demonstrated how to use MLflow to track experiment results for a Keras network model using binary classification. We classified reviews from an IMDB dataset as positive or negative. And we created one baseline model and two experiments. For each model, we tracked its respective training […]

Simplify Market Basket Analysis using FP-growth on Databricks

Posted in Apache Spark, Company Blog, Education, Engineering Blog, FP-Growth, Machine Learning, Market Basket Analysis, Product

When providing recommendations to shoppers on what to purchase, you are often looking for items that are frequently purchased together (e.g. peanut butter and jelly). A key technique to uncover associations between different items is known as market basket analysis. In your recommendation engine toolbox, the association rules generated by […]
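To make "frequently purchased together" concrete, here is a small sketch that counts how often each pair of items shares a basket. This is naive pair counting in plain Python, not FP-growth and not any Spark API. The function name `pair_support` and the sample baskets are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def pair_support(baskets):
    """Return the fraction of baskets that contain each item pair."""
    counts = Counter()
    for basket in baskets:
        # Deduplicate and sort so each pair has one canonical ordering.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    total = len(baskets)
    return {pair: n / total for pair, n in counts.items()}


# Hypothetical shopping baskets.
baskets = [
    ["peanut butter", "jelly", "bread"],
    ["peanut butter", "jelly"],
    ["bread", "milk"],
    ["peanut butter", "bread"],
]

support = pair_support(baskets)
for pair, value in sorted(support.items(), key=lambda kv: -kv[1]):
    print(pair, value)
```

Enumerating every pair grows quickly with basket size, which is one reason dedicated frequent-pattern algorithms exist for larger datasets.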

Identify Suspicious Behavior in Video with Databricks Runtime for Machine Learning

Posted in Apache Spark, Company Blog, Deep Learning, Deep Learning Pipelines, Education, Engineering Blog, Machine Learning, OpenCV, Platform, Product, TensorFlow, Video Analytics

With the exponential growth of cameras and visual recordings, it is becoming increasingly important to operationalize and automate the process of video identification and categorization. Applications ranging from identifying the correct cat video to visually categorizing objects are becoming more prevalent. With millions of users around the world generating […]

Introducing Flint: A time-series library for Apache Spark

Posted in Apache Spark, Company Blog, Customers, Education, Engineering Blog, Flint, python, Spark SQL, Time Series

This is a joint guest community blog by Li Jin at Two Sigma and Kevin Rasmussen at Databricks; they share how to use Flint with Apache Spark. Introduction: The volume of data that data scientists face these days increases relentlessly, and we now find that a traditional, single-machine solution is no longer adequate to the demands […]

An Introduction To Machine Learning Using Spark Language

Posted in Apache Spark, Machine Learning, News, SmartData Collective Exclusive

Machine learning is an emerging field that lets you create algorithms that learn from data and make predictions based on collected data. Machine learning work is possible in various languages, such as Python, Java, C++, and R. Apache Spark is considered to […]