Introduction to Data Lakes
Let’s start with a discussion about what data lakes are, and then where they fit in as a critical component of your overall data engineering ecosystem. So what is a data lake? It’s a fairly broad term, but it generally describes a place where you can securely store various types of data of all scales for processing and analytics.
Data lakes are typically used to drive data analytics, data science and ML workloads, or batch and streaming pipelines. Data lakes will accept all types of data. Finally, data lakes are portable: they can live on-premises or in the cloud.
Components of Data Engineering Ecosystem
- Data Sources and Data Sinks
Here is where data lakes fit into the overall data engineering ecosystem for your team. You start with the originating systems that are the source of all of your data; those are your data sources. Then, as a data engineer, you need to build reliable ways of retrieving and storing that data; those are your data sinks. The first line of defense in an enterprise data environment is your data lake.
Again, it’s the central “give me whatever data you have” store: whatever the format, volume, and velocity, it takes the data in and takes care of it.
- Data Transformations and Data Pipelines
Once your data is off the source systems and inside your environment, a ton of cleanup and processing is generally required to massage that data into a useful format for your business. What actually performs the cleanup and processing of data? Those are your data pipelines. They’re responsible for doing the transformations and processing on your data at scale, and they bring your entire system to life with fresh, newly processed data available for analysis.
Let’s look at which cloud products fit into each of these roles.
Solution Architecture Diagram on GCP Cloud
In the center of that diagram, the data lake is Google Cloud Storage buckets. It’s your consolidated location for raw data, and it’s durable and highly available. Note that Google Cloud Storage is not your only option for data lakes on GCP; it is one of a few good options to serve as a data lake.
This is why it’s so important to first understand what you want to do, and then find which of the solutions best meets your needs.
Your data lake generally serves as that single consolidated place for all of your raw data. I like to think of it as a durable staging area: everything gets collected here and then sent out elsewhere.
Now, this data may end up in many different places, such as a transformation pipeline that cleans it up and moves it to the data warehouse, where it’s then read by a machine learning model. But it all starts with getting that data into your data lake first.
Data Lakes with GCP Storage Solutions
Here is a list of GCP storage services suitable for a data lake:
- Cloud Storage
- Cloud SQL
- Cloud Spanner
- Cloud Datastore
- Cloud Bigtable
In this data lake article, we’ll focus on the Cloud Storage product which makes up your data lake.
Building a Data Lake using Cloud Storage
Google Cloud Storage is the essential storage service for working with data, especially unstructured data. Let’s do a deep dive into why Google Cloud Storage is a popular choice to serve as a Data Lake.
- Persistence: Data in Cloud Storage persists beyond the lifetime of virtual machines or clusters. It’s persistent and it’s also relatively inexpensive compared to the cost of compute. So, for example, you might find it advantageous to cache the results of previous computations in Cloud Storage for archiving. Or, if you don’t need an application running all the time, you might find it helpful to save the state of your application in Cloud Storage and then shut down the machine it’s running on when you don’t need it.
- Object store: Google Cloud Storage is an object store, so it just stores and retrieves binary objects without regard to what data is contained in the objects. However, to some extent, it also provides file system compatibility and can make objects look and work as if they were files, so you can copy files in and out of it.
- Durability: Data stored in Cloud Storage will basically stay there forever, meaning that it’s durable, yet it’s available instantly and it’s strongly consistent.
- Availability: You can share data globally, but it’s encrypted and completely controlled and private if you want it to be. It’s a global service, so you can reach the data from anywhere, which means it offers global availability. But the data can also be kept in a single geographic location if you need that too.
How Does Cloud Storage Work
As a data engineer, you need to understand how cloud storage accomplishes these apparently contradictory qualities, and when to employ them in your solutions.
Buckets and Objects:
A lot of Cloud Storage’s amazing properties have to do with the fact that ultimately it’s an object store, and that all the other features are built on top of that base. The two main entities in Cloud Storage are buckets and objects.
- Buckets are containers which hold objects
- Objects exist inside of those buckets and not apart from them.
For our purposes, buckets are containers for data. Buckets are identified in a single globally unique namespace, which means that once a name is given to a bucket, it can’t be used by anyone else until that bucket is deleted and the name is released. Having a global namespace for buckets greatly simplifies locating any particular bucket. When a bucket is created, it’s associated with a particular region or with multiple regions; choosing a region close to where the data will be processed will reduce latency.
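To make the bucket and object relationship concrete, here is a minimal in-memory sketch in Python. It is only a toy model of the ideas in this section, not the Cloud Storage API: the class `ToyObjectStore`, its methods, the bucket name `example-raw-data`, and the object name are all hypothetical and chosen for illustration.

```python
class BucketNameTaken(Exception):
    """Raised when a bucket name is already in use in the namespace."""


class ToyObjectStore:
    """Hypothetical in-memory stand-in for an object store (not the real API)."""

    def __init__(self):
        # One flat dictionary models the single global bucket namespace.
        self._buckets = {}

    def create_bucket(self, name, location):
        # A name that is already taken cannot be reused by anyone else.
        if name in self._buckets:
            raise BucketNameTaken(name)
        self._buckets[name] = {"location": location, "objects": {}}

    def delete_bucket(self, name):
        # Deleting the bucket releases its name for reuse.
        del self._buckets[name]

    def put_object(self, bucket, object_name, data):
        # Objects only exist inside a bucket; an unknown bucket raises KeyError.
        self._buckets[bucket]["objects"][object_name] = data

    def get_object(self, bucket, object_name):
        return self._buckets[bucket]["objects"][object_name]


store = ToyObjectStore()
store.create_bucket("example-raw-data", location="us-central1")
store.put_object("example-raw-data", "logs/day1.csv", b"a,b\n1,2\n")

# A second attempt to claim the same name fails until the bucket is deleted.
try:
    store.create_bucket("example-raw-data", location="europe-west1")
    name_was_free = True
except BucketNameTaken:
    name_was_free = False
```

In this sketch the location is simply recorded alongside the bucket; the real service uses it to decide where the data is physically stored.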
Data Replication:
When an object is stored, Cloud Storage replicates it. It then monitors the replicas, and if one of them is lost or corrupted, it replaces it automatically with a fresh copy.
For a single region bucket, as you might expect, the objects are replicated across zones within that one region.
Metadata:
Objects are stored with metadata, which is information about the object. Additional Cloud Storage features use the metadata for purposes such as access control, compression, encryption, and lifecycle management of those objects and buckets. Lifecycle management, for example, uses the object metadata to determine when to delete an object. When you create a bucket, you need to make several decisions. The first is the location of that bucket: the location is set when the bucket is created and can never be changed.
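The idea of using metadata to decide when an object should be deleted can be sketched in a few lines of Python. This is an illustration only, assuming a simple age-based rule; the metadata key `time_created`, the function `should_delete`, the 30-day threshold, and the object names are hypothetical and are not the service’s actual configuration format.

```python
from datetime import datetime, timedelta, timezone


def should_delete(metadata, max_age_days, now):
    """Return True when the object is older than max_age_days."""
    age = now - metadata["time_created"]
    return age > timedelta(days=max_age_days)


# A fixed "current time" keeps the example deterministic.
now = datetime(2024, 1, 31, tzinfo=timezone.utc)

# Hypothetical objects, each carrying only a creation timestamp as metadata.
objects = {
    "tmp/old-export.csv": {"time_created": datetime(2023, 12, 1, tzinfo=timezone.utc)},
    "tmp/new-export.csv": {"time_created": datetime(2024, 1, 25, tzinfo=timezone.utc)},
}

# Names of the objects that an assumed 30-day age rule would delete.
expired = sorted(
    name for name, meta in objects.items() if should_delete(meta, 30, now)
)
```

Note that the decision is made from the metadata alone; the contents of the object are never read.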
Cloud Storage Simulates a File System
Cloud Storage uses the bucket name and the object name to simulate a file system. This is how it works: the bucket name is the first term in the URI, a forward slash is appended to it, and then it’s concatenated with the object name. The object name allows the forward slash as a valid character, so a very long object name with forward slashes in it looks like a file system path even though it’s just a single name. In the example shown, the bucket name is “declass” and the object name is “de/modules/O2/script.sh”; the forward slashes are just characters in the name.
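The naming scheme described above can be sketched in Python. The `gs://` prefix is the URI scheme that gsutil uses for Cloud Storage locations; the helper functions and the second and third object names are hypothetical and exist only for this illustration.

```python
def object_uri(bucket_name, object_name):
    """Join the bucket name and the object name with a single forward slash."""
    return f"gs://{bucket_name}/{object_name}"


def split_uri(uri):
    """Split at the first slash after the scheme; later slashes stay in the name."""
    without_scheme = uri[len("gs://"):]
    bucket_name, _, object_name = without_scheme.partition("/")
    return bucket_name, object_name


def list_prefix(object_names, prefix):
    """A "directory listing" is only a filter on names that share a prefix."""
    return [name for name in object_names if name.startswith(prefix)]


uri = object_uri("declass", "de/modules/O2/script.sh")
bucket, name = split_uri(uri)

# Flat object names; no real directories exist anywhere in this list.
names = ["de/modules/O2/script.sh", "de/modules/O2/notes.txt", "de/readme.md"]
module_files = list_prefix(names, "de/modules/O2/")
```

Because the slashes are ordinary characters, a “folder” appears as soon as one object name contains its prefix and disappears when no object name does.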
Google Cloud Storage can be accessed using a file access method, which allows you, for example, to use a copy command from a local file directory to Google Cloud Storage. You can use the tool gsutil (Google storage utility) to do this. Cloud Storage can also be accessed over the web. The site for it is storage.cloud.google.com, and it uses TLS (HTTPS) to transport your data, which protects the credentials as well as the data in transit.
Summary
So, as we can see, Cloud Storage is object storage for companies of all sizes. You can store any amount of data and retrieve it as often as you’d like. Cloud Storage has an ever-growing list of bucket locations where you can store your data, with multiple automatic redundancy options. In addition, Cloud Storage has many other object management features.