The Future of Cloud Data: Is a Modern Cloud Data Platform Right for You?

May 1, 2022

Why should you consider a Modern Cloud Data Platform?

For decades, we have relied on popular on-premises Database Management Systems (DBMS) to meet our business needs. Some popular examples:

  • SQL Server
  • Oracle
  • MySQL
  • PostgreSQL
  • IBM DB2
Since most companies are heavily invested in their current data systems, why do we recommend you consider migrating to a Modern Cloud Data Platform? To answer this question, we need to go through a brief history of data platform technologies.

Traditional DBMS Solutions

First, we will look at traditional Database Management Systems. Conventional data systems became the mainstay of business operations in the 1980s, and they are still being deployed today. Traditional databases store structured data in a way optimized for quick retrieval and manipulation. For reporting, data is extracted from one or more upstream data stores, transformed, and then loaded into downstream data stores that the business consumes through pre-created reports and query tools.
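
The extract, transform, and load flow described above can be sketched in a few lines. This is a minimal illustration only: the two in-memory SQLite databases stand in for an upstream operational store and a downstream reporting store, and the table names, columns, and values are hypothetical.

```python
import sqlite3

# Hypothetical upstream (operational) store and downstream (reporting) store.
source = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")

source.execute("CREATE TABLE orders (id INTEGER, region TEXT, amount_cents INTEGER)")
source.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "east", 1250), (2, "west", 4000), (3, "east", 750)],
)

# Extract: pull raw rows from the upstream store.
rows = source.execute("SELECT region, amount_cents FROM orders").fetchall()

# Transform: total the cents per region and convert to dollars.
totals = {}
for region, cents in rows:
    totals[region] = totals.get(region, 0) + cents
report_rows = sorted((region, cents / 100) for region, cents in totals.items())

# Load: write the reshaped data into the downstream reporting table.
target.execute("CREATE TABLE sales_by_region (region TEXT, total_dollars REAL)")
target.executemany("INSERT INTO sales_by_region VALUES (?, ?)", report_rows)

result = target.execute(
    "SELECT region, total_dollars FROM sales_by_region ORDER BY region"
).fetchall()
print(result)
```

Every new report or data source in this model means another pipeline like this one to build and maintain.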

Traditional Architecture

Traditional DBMS Challenges

Unfortunately, there are many difficulties with traditional DBMS solutions. Specifically, they:

  • Require extensive upfront setup and investment (hardware and software licensing).
  • Require extensive ongoing maintenance and many diverse skill sets (infrastructure, backup and disaster recovery, database maintenance, security & compliance, etc.).
  • Are difficult, costly, and slow to scale as your business grows, and require manual scaling (i.e., they lack auto-scaling features). Scaling also requires knowledge of setting up and scaling a data center.
  • Suffer from slow query performance. Most traditional DBMS systems do not support Massively Parallel Processing (MPP) and the query performance this technology makes possible.
  • Are located in "your cloud" (your own data center), which makes it difficult to integrate and connect with the latest cloud features: compute, storage, chatbots/ChatGPT, Internet of Things (IoT), DevOps, media services, blockchain, mobile and low-code apps, analytics, and machine learning.
  • Don't support Big Data scenarios (i.e., very large amounts of data).
  • Are not agile. Adding new data and reports is very slow and requires extensive support from a very scarce resource - technology developers.
  • Are often a security and compliance risk because they require specialized skill sets and constant research of compliance requirements.
  • Don't allow the platform and data to be accessed from any location and at any time.
  • Don't optimize for costs. They often require more capacity than necessary in order to support bursting scenarios. You don't pay for the processing power you are using; you pay for the processing power needed at peak times in your business. The rest of the time that extra processing power sits idle while the server's useful lifetime diminishes.
  • Make it difficult to share data.
  • Don't support Artificial Intelligence and Machine Learning models, nor the semi-structured and unstructured data that feeds many of those models.
  • Don't work well with the variety of data created in today's business scenarios – for example, video, images, sensor data, sound, etc.
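
To make the cost point about peak capacity concrete, here is a small worked example. All of the numbers are hypothetical, and the sketch assumes the same price per unit of compute under both models, which real pricing will not match exactly.

```python
# All numbers below are hypothetical, chosen only to illustrate the arithmetic.
hourly_demand = [2] * 20 + [10] * 4   # compute units needed in each hour of one day
price_per_unit_hour = 1.0             # assumed identical unit price for both models

peak = max(hourly_demand)

# Fixed provisioning: pay for peak capacity during every hour.
fixed_cost = peak * len(hourly_demand) * price_per_unit_hour

# Consumption pricing: pay only for the units actually used.
usage_cost = sum(hourly_demand) * price_per_unit_hour

# Share of the provisioned capacity that is actually doing work.
utilization = sum(hourly_demand) / (peak * len(hourly_demand))

print(f"fixed: {fixed_cost}, usage-based: {usage_cost}, utilization: {utilization:.0%}")
```

In this made-up day, capacity sized for the four busy hours sits two-thirds idle overall.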

Document Databases

Document Databases, for example, MongoDB and CouchDB, were created to overcome some of the challenges that are inherent in a traditional DBMS for some specific niche scenarios. To do so, the designers abandoned many tenets of relational databases, along with some of their inherent strengths and weaknesses. As NoSQL platforms, they are document-oriented rather than table-oriented: they store data in collections of documents with many different data types. They are also geared towards developers and remove the need for knowledge of the underlying structure of the database (e.g., relational design tenets and the SQL language). These attributes give Document Databases a high degree of agility, scalability, and flexibility. Document databases are best suited for applications that require fast access to pre-processed or static data that doesn't need to be joined (i.e., cross-referenced). Examples include this blog, pricing information, product catalogs, media, content management, social sites, customer profiles, and maps.
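
As a rough illustration of the document model, the sketch below uses plain Python dictionaries rather than any particular database product. The collection, keys, and fields are hypothetical.

```python
import json

# A hypothetical "products" collection: each document is self-contained,
# and documents need not share the same fields.
products = {
    "sku-1": {"name": "Trail Shoe", "price": 89.0, "sizes": [8, 9, 10]},
    "sku-2": {"name": "Gift Card", "price": 25.0, "delivery": {"method": "email"}},
}

def get_product(sku):
    """Fetch a whole document by key; no joins are needed."""
    return products.get(sku)

doc = get_product("sku-2")
print(json.dumps(doc, indent=2))

# Cross-referencing is where this model gets awkward: there is no join,
# so the application scans the documents and filters them itself.
with_sizes = [sku for sku, d in products.items() if "sizes" in d]
print(with_sizes)
```

Lookups by key are simple and fast, which is why catalogs and profiles fit this model well.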

Document Database Challenges

While they certainly have their sweet spot, there are some drawbacks to Document Databases. Specifically, they:

  • Can be difficult to query. Because of the diversity of data types and the loss of relational structure (e.g., joining data), it can be difficult and slow to retrieve information.
  • Don't handle transactional data well. Many support ACID guarantees only at the document level, and those guarantees can be limited. This makes it difficult to build, for example, a banking ledger application or an e-commerce platform.
  • Don't support traditional SQL skill sets. They are NoSQL platforms, as SQL doesn't make sense in a document-first world.
  • Often don't have a built-in query language, which means you must use 3rd party tools.
  • Lack the traditional query optimizations found in relational systems (e.g., advanced indexing), leading to slower performance.

Data Lakes

Data Lakes were also created to overcome some of the traditional DBMS challenges, but they solve the challenge with a different approach. Data Lakes collect large volumes of data from multiple sources and allow it to be stored in its native format, making it easily accessible and customizable to accommodate changing requirements. Some of the benefits of a Data Lake are:

  • Great flexibility – with a well-structured Data Lake, the data can be accessed and manipulated in whatever manner the user desires without requiring any pre-existing schemas.
  • Security - Data Lakes also enable security measures to be taken on an individual or group basis to protect sensitive data.
  • Cost control - using the cloud, Data Lakes are also highly cost-effective compared to traditional database solutions.

 

Data Lake + Data Warehouse Architecture

Data Lake Challenges

However, Data Lakes suffer from several limitations:

  • Data lakes don't enforce schemas, so data quality degrades quickly over time.
  • The data lakes themselves do not have transactional support. They require ETL processes to load the data into databases or a data warehouse (DW) before it is useful in BI reporting or application scenarios. This adds another layer of complexity, effort, and cost.
  • Since you are still stuck building a traditional database model from the data lake, agility is also reduced. The DW needs multiple processes, steps, and tools that must be updated as business requirements change (similar to the traditional model).
  • Depending on the database used, the solution may also suffer from the same scale issues as the traditional model.
  • They lack the functionality to curate and govern the data in the data lake, so data quality suffers. For this reason, some have even re-titled the data lake a data swamp.
  • As they grow, performance rapidly decreases due to an inability to handle metadata in a scalable fashion.
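
The schema and data quality problem can be shown with a small sketch. It assumes raw events landed as JSON lines and a hypothetical expected schema; the field names and the `conforms` helper are illustrative and not part of any data lake product.

```python
import json

# Hypothetical raw events landed in a lake as JSON lines, in their native format.
raw_lines = [
    '{"device_id": "a1", "temp_c": 21.5}',
    '{"device_id": "a2", "temp_c": "22.1"}',    # number stored as text
    '{"deviceId": "a3", "temperature": 20.9}',  # fields renamed upstream
]

EXPECTED = {"device_id": str, "temp_c": float}  # assumed schema for this sketch

def conforms(record):
    """True when the record has exactly the expected fields and types."""
    return set(record) == set(EXPECTED) and all(
        isinstance(record[name], kind) for name, kind in EXPECTED.items()
    )

records = [json.loads(line) for line in raw_lines]
valid = [r for r in records if conforms(r)]
rejected = [r for r in records if not conforms(r)]

print(f"{len(valid)} valid, {len(rejected)} rejected")
```

Nothing in the lake itself stops the second and third records from being written, so a check like this has to live in every downstream pipeline.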

Modern Cloud Data Platform

This brings us to Modern Cloud Data Platforms, which effectively address the drawbacks of both traditional databases and data lakes. They can also function as an unstructured, document, or relational data store. The modern cloud data architecture offers numerous advantages over conventional models. Some of the key benefits include:

  • Simplicity - modern platforms provide a simple user interface that allows users to access and manage their data with less technical proficiency. This can be especially beneficial for companies whose employees may only have traditional database skills, such as SQL.
  • Scalability - by leveraging the power of the cloud, businesses can create a scalable and highly configurable environment. Modern platforms can handle processing scenarios that range from small reporting services to Big Data volumes on the same platform.
  • Speed - modern platforms are built on Massively Parallel Processing (MPP) architectures, which distribute work across many compute servers. This provides elastic compute scale and performance that can be orders of magnitude faster than traditional technologies. MPP enables companies to quickly access and analyze their data and reach conclusions in minutes.
  • Cost savings - modern cloud platforms require no upfront infrastructure investments and scale on demand while still providing powerful analytics capabilities.
  • Real-time insights - cloud computing provides real-time insights, allowing for more informed decision-making.
  • Analytics capabilities - cloud-based systems provide powerful self-service analytics capabilities, driving better decision-making.
  • Flexibility - can act as a document-database (along with all the benefits mentioned above) or as a relational database.
Ultimately, these advantages allow businesses to remain competitive by using their data assets more efficiently and driving innovation.
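
The MPP idea mentioned under Speed follows a split, process, and combine pattern, sketched below. This is only an illustration of the shape of the technique: Python threads stand in for separate compute servers, the data is hypothetical, and the example makes no claim about real engine internals or performance.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sales amounts, split into partitions the way an MPP engine
# spreads table data across compute nodes.
amounts = list(range(1, 101))
partitions = [amounts[i::4] for i in range(4)]

def partial_sum(partition):
    """Work done independently on one 'node'."""
    return sum(partition)

# Each partition is aggregated in parallel, then the partial results are combined.
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(partial_sum, partitions))

total = sum(partials)
print(partials, total)
```

Because each partition is processed independently, adding more nodes lets the same query cover more data in roughly the same time.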

While there are several Modern Cloud Data Platforms to choose from, there are two market leaders we will discuss - Databricks & Snowflake. Each has taken a different path with its platform, and we will briefly take a look at the strengths of each.

Databricks

Databricks is a modern cloud-based data platform that makes data management, security, and governance efficient and easy. It combines the most desirable aspects of data lakes and data warehouses to give you reliability, governance, and performance while maintaining data lakes' openness and machine learning support. This unified approach eliminates silos between data engineering, analytics, BI, data science, and machine learning, ultimately speeding innovation and streamlining operations. The platform is based on open source and open standards for maximum flexibility. Some of the highlights of Databricks:

  • Platform as a Service (PaaS) data platform with some administration required. Recently released serverless Software as a Service (SaaS) option.
  • Independent scaling of compute and storage. This is important because storage is very cheap, and traditionally compute had to be installed on the storage server, which got expensive fast.
  • Delta Lake, an open-source storage framework meant to sit on top of all the major cloud data lakes, brings several data warehouse capabilities to data lakes, including ACID transactions, audit history, data versioning, and schema enforcement.
  • Databricks Massively Parallel Processing Photon query engine that supports SQL (and other languages) with industry-leading performance.
  • Pay for what you consume. Traditionally you were required to purchase capacity for your peaks to ensure you had it on hand when needed. During non-peak time the capacity was unused. Databricks only charges you for what you use.
  • Massively Parallel Processing Spark query engine that supports SQL and unstructured data queries.
  • Data pipelines that support batch and streaming scenarios at scale.
  • World-class performance and efficiency reduce the Total Cost of Ownership (TCO) over all other major platforms.
  • Available on all major clouds.
  • Native support for structured, semi-structured, AND unstructured data.
  • Unified ML platform with built-in support for the most important data science/machine learning tools: Hyperopt, Horovod, AutoML, scikit-learn, TensorFlow, and the end-to-end ML lifecycle management tool MLflow.
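
To give a feel for what data versioning and schema enforcement mean in practice, here is a toy append-only commit log. It is a conceptual sketch only and is not how Delta Lake is implemented; the `VersionedTable` class, its methods, and the sample rows are all hypothetical.

```python
class VersionedTable:
    """Toy append-only commit log; NOT how Delta Lake is implemented."""

    def __init__(self, fields):
        self._fields = set(fields)
        self._commits = []  # each commit is a batch of rows added all-or-nothing

    def commit(self, rows):
        # Schema enforcement: reject the whole batch if any row is malformed.
        for row in rows:
            if set(row) != self._fields:
                raise ValueError(f"row does not match schema: {row}")
        self._commits.append(list(rows))
        return len(self._commits)  # the new version number

    def read(self, version=None):
        # Data versioning: read the table as it looked at a given version.
        upto = len(self._commits) if version is None else version
        return [row for batch in self._commits[:upto] for row in batch]


table = VersionedTable(fields=("id", "amount"))
v1 = table.commit([{"id": 1, "amount": 10}])
v2 = table.commit([{"id": 2, "amount": 5}])

try:
    table.commit([{"id": 3}])  # missing "amount", so the batch is rejected
    rejected = False
except ValueError:
    rejected = True

latest = table.read()
as_of_v1 = table.read(version=1)
print(len(latest), len(as_of_v1), rejected)
```

The list of commits doubles as an audit history: every version of the table can be reconstructed from it.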

 

Snowflake

Snowflake is also a modern cloud-based data platform that offers users an affordable and flexible solution for managing and analyzing their data. Rather than paying high up-front costs, Snowflake allows customers to pay for only what they need and scales elastically with use. Additionally, Snowflake takes care of the complex infrastructure so users can focus on generating business value. This makes it ideal for companies who want to quickly get started using the cloud for data storage and analysis. Some of the highlights of Snowflake:

  • Software as a Service (SaaS) solution with near-zero administration.
  • Independent scaling of compute and storage. This is important because storage is very cheap, and traditionally compute had to be installed on the storage server, which got expensive fast.
  • Pay for what you consume. Traditionally you were required to purchase capacity for your peaks to ensure you had it on hand when needed. During non-peak time the capacity was unused. Snowflake only charges you for what you use.
  • Proprietary elastic storage platform that enables data warehousing, ACID transactions, audit history, data versioning, and schema enforcement.
  • Massively Parallel Processing virtual warehouse query engine that supports SQL queries on large volumes of data.
  • Best-in-class support for simple SQL-based warehousing and reporting. Effortless to set up and get building solutions for the business.
  • Available on all major clouds.
  • Native support for structured and semi-structured data.
  • Robust 3rd party marketplace with support for data science, business intelligence, streaming analytics, data acquisition, data sharing, etc.

 

Snowflake Architecture

Which is right for you?

You won't make a wrong decision by building on any of these market-leading platforms! But which should you choose?

 

Databricks

Choose Databricks if:

  1. Advanced Analytics and Machine Learning are critical to your business strategy now or in the future. Databricks offers powerful technology and superior support for advanced analytics workloads. It is the only platform leader in the Gartner Magic Quadrant for BOTH Cloud DBMS AND Data Science and Machine Learning categories.
  2. You need native support for structured, semi-structured, AND unstructured data.
  3. You want the fastest and most cost-effective data platform for big data scenarios.
  4. You want to retain ownership of your data. Databricks facilitates de-coupling your data processing from your data stores using open-source data formats. Snowflake requires you to convert your data into its proprietary format and pay to access your data in the system.
  5. You have complex data with robust ETL (Extract Transform Load) needs.

Snowflake

Choose Snowflake if:

  1. You want a straightforward SQL-based business intelligence platform.
  2. You want to optimize for data warehouse/database query performance.
  3. You don't want the hassle of administration or the cost of setup.
  4. You are just beginning your data journey or need a straightforward solution.
  5. You want to leverage the power of the market-leading Snowflake Marketplace. The Snowflake Marketplace offers a wide range of data, services, and applications from some of the world's top companies.

Other Considerations

This post is already too long, but we need to mention some additional considerations.

Data Streaming

The world is moving towards processing large amounts of data in real-time. We have written an overview of the importance of real-time streaming, and how adding Confluent to your architecture enables this capability.

Scalable Real-Time Integration

Analysis & Visualizations

While automated decision-making is the vision, humans are still very much involved in decisions at most companies today. So, without a good data discovery and visualization strategy, you are missing the last mile of your data journey. Learn why we believe ThoughtSpot is a powerful platform for analyzing and visualizing your data.

Making Decisions Fast with ThoughtSpot

Artificial Intelligence & Machine Learning

Arguably, AI/ML is the next great advancement in data. Organizations that get it right have a considerable advantage over their competitors. If you are new to AI/ML, or looking for examples, we have a few blog posts to help you along in your journey.

Artificial Intelligence (AI) & Machine Learning (ML) Introduction

Top Artificial Intelligence (AI) & Machine Learning (ML) Examples

5 Advantages of Databricks Over Snowflake as an Advanced Analytics Platform

We would like to help

Do you need help choosing and establishing your Modern Cloud Data Platform? Contact us below to learn more about how we can help.
Our Blaze IP allows us to get your data platform up and operating in weeks.

To explore how our solutions can be tailored to meet your unique requirements, please click the 'Connect' button at the top of this page to schedule a meeting with our team of experts.

See how others are winning...
Learn why organizations are getting better outcomes using data and how Macula can help.