Building A Modern Cloud Analytics Stack: A Guide for Data-driven Companies
The saying goes, “if you can’t measure it, you can’t improve it,” and to that end, most companies believe they can measure their way to success. But while many businesses understand the intrinsic value of the information contained within their data, generating holistic insights from it to drive the business is often easier said than done.
The average startup uses 20-50 paid SaaS applications to run its business, and connecting the dots between them is difficult because of the time, effort, and technical expertise it requires. As a result, many startups can’t easily leverage the data from their SaaS applications, or don’t use it at all. According to one report, 56% of startups “rarely or infrequently” check their data, and 33% of those say the reason is that they have too many other responsibilities.
Fortunately, there are now technologies that provide the robust infrastructure required to make data available easily and quickly for organizations of any size. But with many options available to choose from, it can be challenging to determine the best solution for your needs.
Here, we discuss how to build a modern cloud analytics stack: the integrated set of tools and services that allows companies to unlock the full value of their data and enable the faster, smarter decisions necessary to compete and grow their business.
The Problem With Traditional Analytics Stacks
For years, business intelligence (BI) was a field that only enterprise companies could access, because building and maintaining the infrastructure required to power BI and analytics was resource-intensive and expensive. But even large, analytically mature organizations still encounter barriers to organization-wide data-driven decision-making due to:
- INFRASTRUCTURE. Most of today’s analytic systems and tools were designed for on-premises warehouses and retrofitted for the cloud. They often require data to be extracted for preparation and heavily modeled by the BI team before it can be used by the business, and they may still require some components to be run on-premises or to be manually set up and managed.
- ACCESS. Most analytic solutions are focused on reporting dashboards and require SQL or proprietary code to drill into data, which prevents non-technical business users from getting their hands on the data they need to make timely decisions. As a result, many organizations have to employ valuable engineering resources to help pull and combine datasets for analysis, time that would be better spent developing products. And when engineers can’t get to the request in time, business users are forced to access the data the only way they know how: by extracting it to spreadsheets. This creates its own set of issues, including stale data, data silos, scale limitations, and, worst of all, governance and security risks.
- DASHBOARDS. Domain experts are often limited to view-only metrics in surface-level, static dashboards, which prevent them from performing more in-depth analyses. If they have follow-up questions about the data, they must go back to their data or BI team — a cycle that can take days, if not weeks, to finally obtain useful insights. Growing businesses must remain agile to compete and cannot afford to wait that long.
But the rise of cloud data exploration — and the underlying tools and technology that support it — enables organizations of any size to take advantage of the cloud’s speed, accessibility, and near-infinite scale. All in an easy-to-implement, low-cost, hosting-free environment that saves time and resources, allowing growing businesses to more readily compete with the large enterprise corporations they are trying to disrupt.
The cloud analytics stack: A SaaS solution for fast-moving organizations of any size
The cloud data analytics stack refers to the layered ‘stack’ of technologies, cloud-based services, and data management systems that collect, store, and analyze data. It provides organizations with a steady stream of real-time data that powers decision-making throughout the business.
The analytics stack is an effective end-to-end solution that manages where data comes from, how it moves around, and how it is prepared for analysis and consumption by end users. And when properly implemented, the modern cloud analytics stack delivers continuous data integration and organization-wide accessibility, with minimal manual intervention and bespoke code.
LAYER 1
The Cloud Data Pipeline
Part of what makes data so valuable is that it can provide a glimpse into business operations in real-time. But to benefit from that view, you need a pipeline that ingests and transforms data across applications, databases, files, and more into a centralized repository called the cloud data warehouse (also referred to as a cloud data platform), where it can then be housed, modeled, and holistically analyzed.
How the cloud data pipeline works
Manually extracting and integrating data from your systems and applications to the warehouse is a headache and something no fast-growing company has time for. Large enterprise organizations have teams of dedicated data engineers that typically spend hours a week building and maintaining such pipelines within a business. They are tasked with handling data normalization, source changes, schema updates, and more.
Fortunately, there are now SaaS solutions that offer out-of-the-box connectivity to popular data sources, SaaS applications, and more. They also normalize and transform disparate sources of data and move that data around without requiring you to write code. When selecting a cloud data pipeline for your business, ensure that the solution, at a minimum, does these three things:
- DOES THE HEAVY-LIFTING WITH INTEGRATIONS AND PRE-BUILT MODELING TOOLS
Make sure your solution builds and continuously maintains its integrations to a vast array of sources, so you don’t have to. You should be able to connect all your data sources (including structured and unstructured data) to your pipeline in a few clicks. A good data pipeline should also deliver near zero-maintenance, ready-to-query schemas, flexible transformation, and basic modeling capabilities.
- KEEPS DATA FRESH IN REAL-TIME
Many pipeline tools copy records from the database, which causes version control issues when records get deleted. A good data pipeline should automatically check for and update your data sources to ensure that any changes made within a platform or to your data are available in real-time. This helps ensure that all necessary data is brought into the cloud data warehouse, insights are always fresh, and engineers can focus on more meaningful work instead of managing data.
- OFFERS A FULLY MANAGED SOLUTION
Fully managed integration solutions allow organizations to outsource and automate the entire process of building and maintaining a data pipeline. This helps ensure that your data is pumped reliably and cleanly into your cloud data warehouse while freeing up valuable engineering resources.
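The point above about copied records and deletes can be made concrete with a minimal Python sketch. It is an illustration under stated assumptions, not any vendor's implementation: the change-log shape (operation, primary key, row) and the function names `append_only_copy` and `apply_change_log` are hypothetical. The sketch contrasts a naive copy, which never learns that a record was deleted, with a sync that replays every logged operation in order.

```python
from typing import Dict, List, Optional, Tuple

# A change is (operation, primary_key, row). This shape is an assumption made
# for illustration; it is not any particular pipeline vendor's format.
Change = Tuple[str, int, Optional[dict]]


def append_only_copy(replica: Dict[int, dict], changes: List[Change]) -> Dict[int, dict]:
    """Naive copy: writes inserted and updated rows but ignores deletes."""
    result = dict(replica)
    for op, key, row in changes:
        if op in ("insert", "update") and row is not None:
            result[key] = row
    return result


def apply_change_log(replica: Dict[int, dict], changes: List[Change]) -> Dict[int, dict]:
    """Log-based sync: replays every operation, including deletes, in order."""
    result = dict(replica)
    for op, key, row in changes:
        if op == "delete":
            result.pop(key, None)
        elif op in ("insert", "update"):
            if row is None:
                raise ValueError(f"{op} for key {key} has no row")
            result[key] = row
        else:
            raise ValueError(f"unknown operation: {op!r}")
    return result


changes: List[Change] = [
    ("insert", 1, {"customer": "Acme", "plan": "basic"}),
    ("insert", 2, {"customer": "Globex", "plan": "pro"}),
    ("update", 1, {"customer": "Acme", "plan": "pro"}),
    ("delete", 2, None),
]

naive = append_only_copy({}, changes)
synced = apply_change_log({}, changes)
print(sorted(naive))   # keys held by the naive copy
print(sorted(synced))  # keys held after replaying the full log
```

In this toy run the naive copy still holds the deleted record, while the replayed log matches the source. A managed pipeline would also need to handle ordering, retries, and schema changes, which the sketch leaves out.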
Here are some popular data pipelines
FIVETRAN
Fivetran’s fully automated connectors sync data from cloud applications, databases, event logs, and more into your data warehouse. Their integrations are built for analysts who need data centralized but don’t want to spend time maintaining their own pipelines or ETL systems, allowing data teams to focus on what really matters: driving analytics for their business.
VISIT WWW.FIVETRAN.COM TO LEARN MORE
MATILLION
Matillion provides data transformation purpose-built for the Snowflake Data Cloud, Amazon Redshift, and Google BigQuery cloud data warehouses, enabling businesses to achieve new levels of simplicity, speed, scale, and savings. Trusted by companies of all sizes to meet their data integration and transformation needs, Matillion products are highly rated across the AWS, GCP, and Azure Marketplaces.
VISIT WWW.MATILLION.COM TO LEARN MORE
LAYER 2
The Cloud Data Platform
Data may be the world’s most valuable asset, but siloed, stale, and constrained data will never provide the business value startups need to compete in today’s highly fragmented market.
Traditional enterprise data warehouses (on-premises or hybrid) were not built for the needs of modern business and are associated with high costs of storing, accessing, and analyzing data, as well as scalability issues, manual upkeep, and compliance risks. As noted earlier, startups use 20-50 paid SaaS applications on average to run their business. The costs and maintenance of using traditional warehouses to collect and store data across them all would be unsustainable. Startups need a cost-effective solution that allows them to readily mobilize data with near-unlimited scale and performance.
The second layer of the cloud analytics stack, the cloud data warehouse (also referred to as the cloud data platform), provides elastic infrastructure, unlimited scale, risk mitigation, and security management. Cloud data warehouses eliminate many of the upfront costs of traditional on-premises warehouses and support many analytic workloads for faster insights without sacrificing security, governance, or data compliance.
How the cloud data warehouse works
The cloud data warehouse (CDW) is the hub at the center of your analytics stack, acting as the centralized, fully governed repository for all the data in a company. Once data is transferred to the CDW by data pipelines, data teams can transform, model, and combine it for various use cases, including business intelligence.
As you explore different vendors, here are the key things to keep in mind when comparing cloud data warehouses.
- WHAT TYPE OF DATA YOU WANT TO STORE
The proliferation of new data sources, velocity of business, decrease in storage costs, and the rise of cloud computing have changed the game. While storing data is easier and less expensive than ever before, the explosion of new semi-structured and unstructured data types (e.g., JSON and logs) pouring in from applications, APIs, websites, smart devices, etc., can be overwhelming.
When selecting a cloud data platform, ensure that it is one capable of storing all types of data including structured, semi-structured, and unstructured data.
- THE AMOUNT OF DATA YOU WANT TO STORE AND THE COST TO STORE IT
A good data warehouse leverages the cloud to seamlessly scale to support thousands of users and hundreds of billions of rows of data, facilitating collaboration and expansion across the enterprise. Together with the right user tool, it’s an ideal solution for powering self-service analytics, including operational reporting, ad hoc querying, and real-time decision-making at any level or volume.
- HOW MANY ENGINEERING RESOURCES YOU WANT TO DEDICATE TO MAINTENANCE
With traditional enterprise data warehouses (EDWs), data is created and sent to the EDW, captured, and stored. The raw data is then cleaned, prepared, and modeled by data and BI teams before it’s handed off to business domain experts as a report or dashboard. At this point, domain experts can finally use the data to make clear and informed decisions to drive the business forward.
This process is repeated for every new data source and takes a considerable amount of time, effort, and money to maintain. But with the rise of cloud-based tools, much of the maintenance cost shifts to data pipeline, storage, and BI vendors. This new system takes away the burdens and costs associated with data warehousing and integration and, when paired with a cloud-native analytics tool, helps reduce the time to value for analyses.
Here are some popular cloud data platform options
SNOWFLAKE
Snowflake delivers the Data Cloud — a global network where thousands of organizations mobilize data with near-unlimited scale, concurrency, and performance. Inside the Data Cloud, organizations unite their siloed data, easily discover and securely share governed data, and execute diverse analytic workloads. Snowflake’s platform is the engine that powers and provides access to the Data Cloud, creating a solution for data warehousing, data lakes, data engineering, data science, data application development, and data sharing.
VISIT WWW.SNOWFLAKE.COM TO LEARN MORE.
BIGQUERY
BigQuery is Google Cloud’s fully managed, serverless data warehouse. It lets teams run SQL queries over large datasets without provisioning or managing the underlying infrastructure.
VISIT WWW.CLOUD.GOOGLE.COM/BIGQUERY TO LEARN MORE.
REDSHIFT
Amazon Redshift is a data warehouse product which forms part of the larger cloud-computing platform Amazon Web Services. No other data warehouse makes it as easy to gain new insights from all your data. With Redshift, you can query and combine exabytes of structured and semi-structured data across your data warehouse, operational database, and data lake using standard SQL. Redshift lets you easily save the results of your queries back to your S3 data lake using open formats, like Apache Parquet, so that you can do additional analytics from other analytics services like Amazon EMR, Amazon Athena, and Amazon SageMaker.
VISIT WWW.AWS.AMAZON.COM/REDSHIFT/ TO LEARN MORE.
LAYER 3
The Cloud-native Analytics Solution
To maximize the value of the cloud data warehouse and fuel faster data-driven decision-making, startups need a solution that empowers their employees to freely and securely interact with data in real-time. But most analytics tools fail to extend the ability to access, explore, and analyze data to the full organization for three reasons:
- BI analysis tools offer ‘self-service’ interfaces for business users, but these interfaces are minimal. They do not allow business experts to ask “What if?” questions of the data or to follow their curiosity and explore it in creative ways. And worse, the dashboards created from these tools are limited to surface-level reporting based on a limited, predefined set of business metrics, preventing the ad hoc analyses that lead to novel insights. This leaves business users without the ability to ask follow-up questions or dig deeper into the data without going back to the BI team for help.
- These tools require the use of proprietary coding languages or SQL to extract, parse, and combine data, as well as heavy modeling by the BI team before it can be consumed by domain experts. As a result, non-coding business users have to wait on busy BI and data teams for help or to get their questions answered. The data team, on the other hand, must anticipate the questions business users will ask, defeating the purpose of “self-service” analytics as a whole.
- Because BI teams don’t have business-level domain expertise, and because business users don’t always know the right questions to ask when they make the initial request, the reports don’t contain the novel insights needed to drive business decisions. This results in a cycle of back-and-forth exchanges between BI and business teams, which ultimately leads business users to turn to an extract to get the answers they need independently. Data extracts open businesses up to regulatory compliance, security, and governance risks that could cost organizations millions.
In today’s fast-paced competitive marketplace, companies can’t afford to wait on the data needed to drive critical business decisions. They need a solution that allows everyone to explore data on-demand and use it to adjust their strategies and quickly pivot to take advantage of new opportunities as they present themselves.
Sigma: A plug-and-play cloud analytics and data exploration platform
Sigma was designed for detailed data exploration and analysis, empowering teams to securely inspect data at scale in a UI they know and love: the spreadsheet. As the final piece of the cloud data stack, Sigma enables startups to fully harness the power of their data and maximize the value of their investment through limitless cloud data exploration.
True self-service analytics, delivered.
Sigma minimizes ad hoc reporting requests by empowering non-technical users and teams to explore data and find answers to complex data questions independently. Because of Sigma’s spreadsheet-like interface, users can join, slice, dice, filter, and calculate data using the same format, functions, and formulas as traditional spreadsheets.
This means that every user, regardless of technical ability, can use Sigma to quickly answer new, unanticipated questions by querying live data directly from the cloud data warehouse in real-time down to row-level detail — without writing SQL or extracting data to a traditional spreadsheet.
Live, accurate insights in seconds at cloud scale.
Today’s startups have more data sources and volumes than ever before. But traditional analytics solutions weren’t designed to leverage the sheer scale and processing power of the cloud. Many require data aggregations or subsets of data for analysis, rather than working directly on millions or billions of data rows in your cloud data platform. As a result, loading, joining, modeling, and analyzing raw data at cloud scale is easier said than done and frequently causes the software to slow down or crash completely.
As a cloud-native BI solution, Sigma operates on top of the CDW, allowing anyone to explore and query live data directly from the CDW in real-time down to row-level detail — no copies or extracts required. Sigma can even automatically parse semi-structured data (like JSON) on the fly, making it easier to join additional data sources like SaaS applications, websites, IoT, and more.
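To show what parsing semi-structured data into joinable columns means in practice, here is a minimal Python sketch. It is a generic illustration and makes no claim about how Sigma or any warehouse does this internally. The `flatten` helper, the dot-separated column naming, and the sample event are all assumptions chosen for the example.

```python
import json
from typing import Any, Dict


def flatten(record: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Flatten nested objects into one level with dot-separated column names.

    Lists are kept as-is; a real system would need a policy for arrays.
    """
    flat: Dict[str, Any] = {}
    for key, value in record.items():
        column = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{column}."))
        else:
            flat[column] = value
    return flat


# Hypothetical IoT event, as it might arrive from an application or device.
raw_event = (
    '{"id": 7, "device": {"type": "sensor", '
    '"location": {"lat": 51.5, "lon": -0.1}}, "tags": ["iot", "test"]}'
)

row = flatten(json.loads(raw_event))
for column, value in row.items():
    print(column, "=", value)
```

Once nested fields such as `device.type` are exposed as ordinary columns, they can be filtered on or joined against other tables like any other field.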
Sigma was built to harness the scale, speed, and sheer power of the cloud and can process data from dozens of sources in seconds, allowing users to easily crunch through billions of rows of data, and quickly drill into as much detail as needed without the risk of accidentally corrupting or deleting data in the warehouse. Because the warehouse is automatically updated and Sigma takes this into account, you’re always working with accurate, up-to-date data to make timely decisions.
A modern, more inclusive approach to data governance
Rock-solid governance is a non-negotiable piece of every good data strategy. But it shouldn’t be a stumbling block for a startup. The CDW makes it easy to aggregate data across hundreds of sources and house it all in a centralized, secure, and fully-governed repository. It’s a scalable, single source of truth that supports concurrent workloads at speed — so why remove data for analysis?
With Sigma, teams can take full advantage of the cloud’s speed, scale, and compute power while keeping data current and complete. Because Sigma operates on top of the CDW as a cloud-native analytics solution, data stays safe, secure, and in one location. This helps startups avoid the risk of a security breach or compliance violations.
Sigma also follows modern security frameworks for authentication and access controls which allow for a complete and centralized way to govern users and ensure proper access to data.
Reuse and repurpose analyses for easier collaboration
Extracting data to spreadsheets causes a significant problem: data is outdated the moment it leaves the BI team’s data warehouse. This means that line-of-business teams across marketing, finance, and sales, for example, must manually recreate or update any recurring or frequently requested reports. To make matters worse, sharing these data extracts not only creates version control issues but also introduces the security and governance risks described above.
But because Sigma pulls live data directly from a company’s cloud data warehouse, reports can be built just once and automatically stay updated, saving teams time and unnecessary headaches. Dashboards and reports can also be shared with colleagues who can duplicate and repurpose analyses without fear of causing version control confusion or disturbing others’ work. Collaborating across teams and departments is a breeze when everyone works off the same live, accurate datasets!
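The difference between a one-time extract and a report that reads live data can be sketched in a few lines of Python. This is only an illustration: it uses an in-memory SQLite database as a stand-in for a cloud data warehouse, and the `orders` table and `revenue_by_region` view are hypothetical names.

```python
import sqlite3

# In-memory SQLite stands in for the warehouse (an assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders (region, amount) VALUES (?, ?)",
    [("east", 100.0), ("west", 250.0)],
)

# "Live" report: a view that is re-evaluated against the table on every read.
conn.execute(
    "CREATE VIEW revenue_by_region AS "
    "SELECT region, SUM(amount) AS revenue FROM orders GROUP BY region"
)

# "Extract": a one-time copy of the results, like a spreadsheet export.
extract = dict(conn.execute("SELECT region, revenue FROM revenue_by_region").fetchall())

# New data arrives in the warehouse after the extract was taken.
conn.execute("INSERT INTO orders (region, amount) VALUES (?, ?)", ("east", 50.0))

# Reading the view again reflects the new row; the extract does not.
live = dict(conn.execute("SELECT region, revenue FROM revenue_by_region").fetchall())
print("extract:", extract)
print("live:   ", live)
```

The extract keeps the totals from the moment it was taken, while the view picks up the new order without being rebuilt.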
SUCCESS STORY
Payload Accelerates Business Velocity After Implementing a Modern Cloud Data Analytics Stack
PAYLOAD is an easy-to-use cloud application for logistics and supply chain management, delivering simple, accountable logistics tracking and reporting. The application collects vast amounts of varied data, including real-time coordinate capture, field ticket data for pickup and delivery, events that occur along a route, and much more.
Customers and internal decision-makers alike were hungry to get their hands on this data to integrate it with other systems and generate key business insights. However, due to the limitations of their cloud infrastructure, PAYLOAD had difficulties performing the most basic data imports, exports, integrations, and reporting.
The infrastructure costs and time spent performing these tasks began to add up. Development was spending hours each week updating database schemas for new reports, indexing the database to improve performance, creating fresh views for customers to export, and fixing any application errors these activities caused.
At the same time, customers were looking for direct access to their data while maintaining PAYLOAD’s high security standards. With no way to integrate PAYLOAD application data directly into their systems, customers were left with extracts of their data that had to be completely restructured before they could even use them. PAYLOAD needed to modernize their data stack to a cloud-based solution that had the scale and flexibility to handle their analytics workloads.
The first step was separating development and BI workloads, enabling the BI team to structure and integrate application data without negatively impacting development infrastructure. After exploring traditional enterprise data warehouses (EDWs) and several cloud data warehouse providers, and even considering building their own solution, they decided that Snowflake’s Data Cloud was the best choice.
Once Snowflake was up and running, a process that took only two weeks, PAYLOAD needed a way to stitch together data from across sources and port it into Snowflake. After evaluating several data integration tools, the team chose Fivetran because it uses database logs to synchronize data, so the warehouse always accurately reflects what’s in the database.
The last piece of the puzzle was finding an analytics tool that would allow employees to self-serve and get answers to ad hoc questions without waiting on the BI team to model data and build reports in an overly-complex, proprietary language. They chose Sigma because its spreadsheet-like interface gives everyone the power of SQL without having to know code.
Adopting a modern cloud data analytics stack was a game-changer for PAYLOAD, allowing them to harness the full, unbridled power of their data. After implementing their modern analytics stack, PAYLOAD recorded:
- 7x faster report delivery times
- $8,000+ cost savings per report
- 50% BI resource savings
- Report load times down from 1 hour to 10 seconds
Conclusion
With data-driven goals paramount, BI and analytics tools, applications, and practices—as well as the data management and integration systems that support them—are in the spotlight. This “stack” of layered technologies, cloud-based services, and practices is critical to providing an increasingly varied spectrum of decision-makers with a steady stream of trusted, quality data and analytics insights.
The robust new ecosystem of cloud-based solutions is easy to set up, and its tools integrate seamlessly with one another. Its cost-effective, elastic architecture scales with an organization, allowing organizations of any size to quickly maximize the value of their data.