This is the second in a series by Snowflake examining the concept of net zero data and how advances in technology can help the world’s largest organisations, especially those in emissions-intensive sectors such as oil and gas, reduce the carbon emissions footprint of their data. For part one, see here.
Fully exploiting the flexibility of cloud computing enables organisations to gain meaningful energy and emissions efficiencies. Unfortunately, the realisation of these benefits is often constrained by data platform architectures designed for fixed-capacity environments. We have observed this constraint in nearly all of the analytical database services available in the market today.
These design choices were revolutionary at the time they were made, enabling the massively parallel processing (MPP) techniques required to handle the proliferation of ‘Big Data’ data sets. In today’s context, however, those choices prevent analytics services from using computing resources efficiently, and less efficient use of CPU in turn increases the energy impact, emissions footprint, and cost of data operations.
How Net Zero Data Works
The increased server utilisation that comes from running on co-located resources in the public cloud has a positive impact on emissions, but employing a modern, multi-cluster, shared-data architecture built for the cloud provides additional benefits, specifically:
- Eliminating the need to transform and process large data sets. Some analytics databases native to the large Cloud Service Provider (CSP) suites require transforming raw semi-structured data files into a traditional columnar structure before they are ready for analytical workloads, and that transformation consumes a great deal of compute power. New methods for partitioning and indexing highly compressed, cost-efficient data formats such as Parquet, JSON, or XML now allow organisations to use the full range of traditional enterprise-grade SQL to query that data directly, with the data structure, or schema, determined on read. This entirely eliminates the energy, emissions, and cost associated with transforming the semi-structured, machine-generated data sets that are so common in the energy industry: seismic data in subsurface exploration; IoT data in upstream fossil fuel production, refinery processing, and renewable electricity generation; time-series and other market data in energy trading; and transaction data in forecourt and convenience retail. (A schema-on-read sketch follows this list.)
- Eliminating the need to store multiple forms and copies of the same data. By eliminating data transformations, we simultaneously eliminate the need to store pre- and post-processed forms of those data sets. Indeed, the concept of pre- and post-processing is replaced entirely by a single, unified data environment spanning semi-structured and structured data. By creating software-defined, functional views on top of that single data environment, we can additionally eliminate the need to make copies of data for different use cases, for example data engineering, data science, business intelligence reporting, financial and regulatory compliance reporting, or ad hoc analysis. Fewer copies mean less disk space is required to store the data, reducing the energy, emissions, and cost requirements of data operations. (A sketch of functional views over a single data set follows this list.)
- Reducing the CPU capacity required to run a global, enterprise-grade analytics platform. The shared-nothing architecture underpinning some of the CSPs’ native analytics database offerings requires a one-to-one scaling of compute clusters and database instances, where the number of nodes in any single cluster has a fixed upper bound. Delivering the 24×7 speed and performance required by a large, complex global organisation therefore means keeping peak-capacity CPU resources available at all times, to avoid, for example, the latency associated with cluster boot-up or with concurrent users. With a shared-data architecture, however, storage and compute can be fully decoupled logically, so resources are provisioned per second of actual CPU usage and machines no longer idle on stand-by. No machines on stand-by means radically less energy usage, fewer emissions, and lower cost attributed to data. (A back-of-the-envelope comparison follows this list.)
- Compounding the effects of more efficient data centre design and management. Other, non-software improvements in the design and management of data centres bring additional large energy and emissions gains. AWS is contracting more and more power from renewable energy producers, including from BP’s solar energy plants [1]. Microsoft is experimenting with sinking data centres into the ocean [2] to reduce the energy associated with cooling servers. In contrast to many on-premises corporate data centres, CSPs run their cloud operations as profit centres, and the profit motive has naturally accelerated the development of structural data centre designs that yield a significant improvement in energy and cost efficiency. Optimisations made at the database architecture level are likely to compound the effects of these more efficient data centre designs, although further research is required to understand to what extent this holds true. (A toy compounding calculation follows this list.)
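To make the schema-on-read idea in the first bullet concrete, here is a minimal Python sketch. The well IDs, field names, and readings are hypothetical, and the snippet only stands in for what a cloud data platform does at far greater scale: raw, semi-structured records are queried as they arrive, with the structure discovered at read time rather than imposed by an upfront transformation job.

```python
import json

# Raw, machine-generated records as they might land from an IoT feed.
# Note that the records do not share a fixed schema.
raw_lines = [
    '{"well_id": "W-17", "ts": "2023-04-01T00:00:00Z", "pressure_kpa": 4180}',
    '{"well_id": "W-17", "ts": "2023-04-01T00:01:00Z", "pressure_kpa": 4172, "temp_c": 61.4}',
    '{"well_id": "W-09", "ts": "2023-04-01T00:00:00Z", "flow_m3h": 112.5}',
]

# The schema is discovered per query, at read time; no separate
# transformation job runs before the data can be analysed.
records = [json.loads(line) for line in raw_lines]

# Example query: average pressure per well, using only the fields
# each record actually carries.
pressure_by_well = {}
for rec in records:
    if "pressure_kpa" in rec:
        pressure_by_well.setdefault(rec["well_id"], []).append(rec["pressure_kpa"])

for well, readings in sorted(pressure_by_well.items()):
    print(well, sum(readings) / len(readings))
```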
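The same style of sketch illustrates the second bullet: each downstream use case is a software-defined view, a function over the single stored data set, so no duplicate copy is materialised. The forecourt transactions and view definitions below are invented for illustration.

```python
# One canonical copy of the data, shared by every consumer.
transactions = [
    {"site": "A12", "product": "diesel",   "litres": 40.0, "revenue": 62.0},
    {"site": "A12", "product": "unleaded", "litres": 35.0, "revenue": 54.3},
    {"site": "B07", "product": "diesel",   "litres": 55.0, "revenue": 85.2},
]

def finance_view(rows):
    """Revenue per site, for financial reporting."""
    totals = {}
    for r in rows:
        totals[r["site"]] = totals.get(r["site"], 0.0) + r["revenue"]
    return totals

def data_science_view(rows):
    """Per-litre price features, for modelling."""
    return [{"site": r["site"], "price_per_litre": r["revenue"] / r["litres"]}
            for r in rows]

# Both use cases read the same rows; neither materialises a second copy.
print(finance_view(transactions))
print(data_science_view(transactions))
```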
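For the third bullet, a back-of-the-envelope comparison shows why eliminating stand-by capacity matters. The node count and busy hours are illustrative assumptions, not measurements from any real deployment.

```python
# Hypothetical workload: a cluster sized for peak concurrency that is
# actually busy for only a few hours a day.
peak_nodes = 32          # nodes required to meet peak demand
busy_hours_per_day = 4   # hours per day the cluster is genuinely working

# Fixed-capacity model: peak-sized cluster kept on 24x7.
always_on_node_hours = peak_nodes * 24

# Per-second (simplified here to per-hour) provisioning: compute exists
# only while queries are running.
on_demand_node_hours = peak_nodes * busy_hours_per_day

print(f"always-on : {always_on_node_hours} node-hours/day")
print(f"on-demand : {on_demand_node_hours} node-hours/day")
print(f"reduction : {1 - on_demand_node_hours / always_on_node_hours:.0%}")
```

Under these assumptions the on-demand model uses roughly 83% fewer node-hours; real workloads will differ, but the direction of the effect is the point.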
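Finally, the compounding effect in the last bullet is multiplicative, which a toy calculation makes clear. The node-hours, power draw, and PUE (power usage effectiveness) figures below are placeholders chosen only to show how an architecture-level gain and a facility-level gain combine.

```python
power_per_node_kw = 0.4   # assumed average draw per node

def facility_energy_kwh(node_hours, pue):
    """Total facility energy, including cooling and other overhead."""
    return node_hours * power_per_node_kw * pue

# Baseline: on-premises workload in a less efficient facility.
before = facility_energy_kwh(node_hours=1000.0, pue=1.8)

# Improved: the architecture halves the compute consumed (fewer copies,
# no idle capacity) AND the cloud facility runs at a lower PUE.
after = facility_energy_kwh(node_hours=500.0, pue=1.2)

print(f"combined reduction: {1 - after / before:.0%}")
```

With these placeholder figures, a 50% compute reduction and a move from a PUE of 1.8 to 1.2 combine into a roughly 67% reduction in facility energy, which is the sense in which the two kinds of improvement compound.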
In our third and final post covering net zero data, we’ll explore how one of the largest energy companies in the world can leverage better, faster data to decarbonise their operations.