Databricks Engineering Interns & Impact in Summer 2018

Posted in Announcements, Company Blog, Education, Intern

Thanks to our awesome interns! This summer, our Engineering interns at Databricks did amazing work. Our interns, working on teams from Developer Tools to Machine Learning, built features and improvements that are already impacting our customers and the Apache Spark and AI communities. Spending a summer at Databricks: Databricks Engineering internships are a mix of […]

MLflow v0.7.0 Features New R API by RStudio

Posted in Announcements, Apache Spark, Company Blog, Deep Learning, Ecosystem, Education, Engineering Blog, GPyOpt, Hyperopt, Java, Keras, Machine Learning, MLflow, multistep workflow, Partners, python, R, RStudio

Today, we’re excited to announce MLflow v0.7.0, released with new features, including a new MLflow R client API contributed by RStudio. A testament to MLflow’s design goal of an open platform with adoption in the community, RStudio’s contribution extends the MLflow platform to a larger R community of data scientists who use RStudio and R […]

Simplify Market Basket Analysis using FP-growth on Databricks

Posted in Apache Spark, Company Blog, Education, Engineering Blog, FP-Growth, Machine Learning, Market Basket Analysis, Product

Try this notebook in Databricks. When providing recommendations to shoppers on what to purchase, you are often looking for items that are frequently purchased together (e.g., peanut butter and jelly). A key technique to uncover associations between different items is known as market basket analysis. In your recommendation engine toolbox, the association rules generated by […]

Identify Suspicious Behavior in Video with Databricks Runtime for Machine Learning

Posted in Apache Spark, Company Blog, Deep Learning, Deep Learning Pipelines, Education, Engineering Blog, Machine Learning, OpenCV, Platform, Product, TensorFlow, Video Analytics

Try this notebook series in Databricks. With the exponential growth of cameras and visual recordings, it is becoming increasingly important to operationalize and automate the process of video identification and categorization. Applications ranging from identifying the correct cat video to visually categorizing objects are becoming more prevalent. With millions of users around the world generating […]

Introducing Flint: A time-series library for Apache Spark

Posted in Apache Spark, Company Blog, Customers, Education, Engineering Blog, Flint, python, Spark SQL, Time Series

This is a joint guest community blog by Li Jin at Two Sigma and Kevin Rasmussen at Databricks; they share how to use Flint with Apache Spark. Introduction: The volume of data that data scientists face these days increases relentlessly, and we now find that a traditional, single-machine solution is no longer adequate to the demands […]

A Guide to Apache Spark Use Cases, Streaming, and Research Talks at Spark + AI Summit Europe

Posted in Apache Spark, Company Blog, Data Science, Events, Genomics, Machine Learning, PySpark, Spark + AI Summit Europe, Structured Streaming, Unified Analytics Platform

For much of Apache Spark’s history, its capacity to process data at scale and its capability to unify disparate workloads have led Spark developers to tackle new use cases. Through innovation and extension of its ecosystem, developers combine data and AI to develop new applications. So it befits developers to come to this summit not just […]

Announcing Databricks Runtime 4.3

Posted in Announcements, Company Blog, Engineering Blog, Platform, Product, Runtime

I’m pleased to announce the release of Databricks Runtime 4.3, powered by Apache Spark. We’ve packed this release with an assortment of new features, performance improvements, and quality improvements to the platform. We recommend moving to Databricks Runtime 4.3 in order to take advantage of these improvements. In our obsession to continually improve our platform’s […]

Building a Real-Time Attribution Pipeline with Databricks Delta

Posted in Adhoc Analysis, Advertising Analytics, Apache Spark, bi, Company Blog, Databricks Delta, Ecosystem, Education, Engineering Blog, Kinesis, Machine Learning, Platform, Product, Spark Streaming, Streaming, Structured Streaming, Tableau

Try this notebook in Databricks. In digital advertising, one of the most important things to be able to deliver to clients is information about how their advertising spend drove results. The more quickly we can provide this, the better. To tie conversions or engagements to the impressions served in an advertising campaign, companies must perform […]

Loan Risk Analysis with XGBoost and Databricks Runtime for Machine Learning

Posted in Apache Spark, Company Blog, data pipeline, Data Visualization, Ecosystem, Education, Engineering Blog, financial, Machine Learning, MLlib, Platform, Product, XGBoost

Try this notebook series in Databricks. For companies that make money from interest on loans held by their customers, it’s always about increasing the bottom line. Being able to assess the risk of loan applications can save a lender the cost of holding too many risky assets. It is the data scientist’s job to […]

MLflow 0.4.2 Released

Posted in Announcements, Apache Spark, Company Blog, Engineering Blog, Machine Learning, MLflow, Model Management

Today, we’re excited to announce MLflow v0.4.0, v0.4.1, and v0.4.2, which we released within the last week with some recently requested features. MLflow 0.4.2 is already available on PyPI, and the docs are updated. If you run pip install mlflow as described in the MLflow quickstart guide, you will get the latest release. In this […]