Complete Guide and Hands-on of AWS Data Lake


If you are planning to set up an AWS data lake in your organisation, this complete guide and hands-on walkthrough is for you. A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale. You can store your data as-is, without having to structure it first, and run different types of analytics on it, from dashboards and visualizations to big data processing, real-time analytics, and machine learning, to guide better decisions.

The amount of data in this universe is increasing drastically, and it is estimated that the global datasphere will grow to 175 zettabytes by 2025. Around 90% of that data is unstructured or semi-structured. There are many solutions for storing and processing structured data, but when it comes to data in any form, structured, semi-structured, or unstructured, the data lake comes into the picture.

Data lake architecture from AWS (image source: https://www.i-scoop.eu/)

A data lake maintains data in its native formats and handles the three Vs of big data (volume, velocity, and variety) while providing tools for analyzing, querying, and processing it. Data lakes remove the restrictions of a typical data warehouse system by providing unlimited space, unrestricted file size, schema-on-read, and various ways to access data, including programmatic access, SQL-like queries, and REST calls.

Need for a Data Lake

If you are facing the issues below, then you definitely need an AWS data lake.

  • Your business has too many data stores and no single source of data, making it difficult to fetch data from multiple sources.
  • Data is increasing day by day, and you are spending too much on data storage.
  • The structure of the data varies a lot; for example, a business may have user audit data, IoT device data, log data, and an image gallery.
  • Data analytics on big data is slow.

By now, it should be clear whether a data lake is something your organisation needs or not!

Now we have to choose the right Data lake Architecture!

Data Lake Architecture

Many companies use cloud storage services such as Google Cloud Storage and Amazon S3, or a distributed file system such as Apache Hadoop’s HDFS. For this blog, I am going to use an AWS data lake as the example, as it comes under the Free Tier plan.

Data Lake Layers

(Image source: https://www.virtasant.com/blog/data-lake-architecture)

A data lake can have any number of layers, as it is not a product or a tool but a process. For our use case, we will discuss the following layers:

  1. Ingestion
  2. Distillation
  3. Processing
  4. Insights

Let’s go through each of them one by one. I know this will be the exciting part, as no one really wants to sit through the theory.

AWS Data lake Hands-on

AWS Data Lake architecture with all the services involved

Ingestion

First, we will ingest our raw data as-is into the data lake. For this use case, we will use PostgreSQL table records and ingest all of them into the data lake. Let’s create some big data in a Postgres table.

We will create a town table with random data in columns like id, code, article, and name, and, using the generate_series function, we will create 100k records.
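Here is a minimal sketch of how the town table and its 100k rows could be generated; the connection details and column definitions are assumptions, so adjust them to your own setup.

    import psycopg2  # pip install psycopg2-binary

    # Connection details are placeholders for your own Postgres instance
    conn = psycopg2.connect(host="localhost", dbname="city", user="postgres", password="postgres")
    with conn, conn.cursor() as cur:
        # Create the town table and fill it with 100k rows of random data
        cur.execute("""
            CREATE TABLE IF NOT EXISTS town (
                id      serial PRIMARY KEY,
                code    varchar(10),
                article text,
                name    text
            );
            INSERT INTO town (code, article, name)
            SELECT left(md5(random()::text), 10),
                   md5(random()::text),
                   md5(random()::text)
            FROM generate_series(1, 100000);
        """)
    conn.close()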

The data looks something like this:

Bulk data creation using PostgreSQL

Now our data is ready, and all we need to do is ingest it into the data lake. First, we need to create a database in the AWS data lake to store all the unstructured and structured data.

Data Source Creation

Database

To create a new database, follow the steps below (a boto3 sketch of the same operation follows them):

  • Open AWS Glue by typing AWS Glue in the search bar of the AWS Console
AWS Glue
  • Click on Databases on the left side, then click on Add Database
Add database in AWS Glue
  • Enter the name of the database; for this use case, we will use city as the database name
Create database using AWS Glue
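The same database can also be created programmatically with boto3; a minimal sketch, assuming the client is configured for the region where your data lake lives:

    import boto3

    glue = boto3.client("glue", region_name="us-east-1")  # region is a placeholder

    # Create the 'city' database in the Glue Data Catalog
    glue.create_database(DatabaseInput={"Name": "city"})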

AWS Glue Connection

To fetch data from AWS RDS or any other RDBMS, we have to create a connection for it. To create a new connection, follow the steps below (a boto3 sketch follows them):

  • Click on Connections on the left side, then click on Add Connection
Add connection using AWS Glue
  • Type the name of the connection and choose the connection type; for this example we are choosing JDBC, but you can also choose Amazon RDS
Use JDBC or Amazon RDS connection type in AWS Glue
  • Enter the connection details: the JDBC string, username, database name, and password, plus the Amazon VPC, the subnet of that database, and the security group for that database
AWS Glue Connection
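For reference, roughly the same connection can be defined with boto3; the connection name, JDBC URL, credentials, subnet, and security group below are placeholders you would replace with your own values.

    import boto3

    glue = boto3.client("glue", region_name="us-east-1")  # region is a placeholder

    # JDBC connection pointing at the Postgres instance that holds the town table
    glue.create_connection(ConnectionInput={
        "Name": "postgres-town-connection",  # placeholder name
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": "jdbc:postgresql://<rds-endpoint>:5432/city",
            "USERNAME": "postgres",
            "PASSWORD": "postgres",
        },
        "PhysicalConnectionRequirements": {
            "SubnetId": "subnet-0123456789abcdef0",           # placeholder
            "SecurityGroupIdList": ["sg-0123456789abcdef0"],  # placeholder
        },
    })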

AWS Glue Crawler

Create a crawler that will build a data source table schema, which will then be used to fetch data from the actual DB at regular intervals.

To create a crawler, follow the steps below (a boto3 sketch follows them).

  • Click on Crawlers and click Add Crawler
Add crawler in AWS Glue
  • Enter the name of the crawler and click Next
AWS Glue crawler
  • Keep the default settings
AWS Glue crawler for AWS data lake
  • Choose JDBC as the connection type, choose the JDBC connection created previously, and specify the path of the table whose schema we have to fetch
AWS Data lake crawler
  • Choose an IAM role for Glue (I have used the Glue full-access policy). Set the crawler frequency to Run on Demand, as we are going to run it whenever we want to create or update the table schema.
AWS data lake crawler scheduler
  • Choose the database created earlier
AWS Crawler data output
  • Now just run the crawler we have created. It will create a table schema that you can view in the Tables section.
AWS Glue dashboard
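If you prefer scripting it, a rough boto3 equivalent of this crawler looks like the sketch below; the crawler name, role ARN, and table path are assumptions.

    import boto3

    glue = boto3.client("glue", region_name="us-east-1")  # region is a placeholder

    # Crawl the Postgres table through the JDBC connection created above
    glue.create_crawler(
        Name="town-jdbc-crawler",                               # placeholder name
        Role="arn:aws:iam::123456789012:role/GlueServiceRole",  # placeholder role ARN
        DatabaseName="city",
        Targets={"JdbcTargets": [{
            "ConnectionName": "postgres-town-connection",
            "Path": "city/public/town",  # database/schema/table path, adjust to yours
        }]},
    )

    # Run on demand, as configured in the console steps above
    glue.start_crawler(Name="town-jdbc-crawler")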

AWS Glue Job

Then create a job to fetch data from the JDBC connection at a particular interval and store it in Amazon S3 in the Parquet file format.

To do so, follow the steps below (a sketch of the kind of script Glue generates for the job follows them):

  • Click on Jobs on the left side of the AWS Glue console, then click Add Job
Add ETL Job in AWS Glue
  • Enter the name of the job, assign the role, and change the bookmark setting from Disable to Enable so that the job remembers the last state of the data it fetched and data duplication is avoided
AWS Glue job
  • Choose the data source from which we have to fetch the data; we will choose the table schema created in the steps above
Choose data source for ETL Job in AWS Glue
  • For the data target, click on Create a table in data target, then choose Amazon S3 as it is the cheapest storage option available. For the format, choose Parquet and specify the data target path.
Amazon S3 as a data target in AWS Glue for Data lake
  • We can also change the target schema, and save the job when finished.
  • Now click on the Save script dropdown and add a trigger for it.
  • For this use case, we will fetch the data from the data source daily and put it in the data target, i.e. S3.
  • Don’t forget to enable the bookmark.
  • Once the trigger is created, just run the job manually.
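For reference, the script Glue generates for a job like this looks roughly like the sketch below; the database, table, and bucket names are assumptions carried over from this walkthrough.

    import sys
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.job import Job

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glueContext = GlueContext(SparkContext())
    job = Job(glueContext)
    job.init(args["JOB_NAME"], args)  # job bookmarks are tracked against this job name

    # Read the catalog table produced by the JDBC crawler (names are placeholders)
    source = glueContext.create_dynamic_frame.from_catalog(
        database="city",
        table_name="public_town",
        transformation_ctx="source",  # transformation_ctx is what the bookmark keys on
    )

    # Write the records to S3 in Parquet format (bucket path is a placeholder)
    glueContext.write_dynamic_frame.from_options(
        frame=source,
        connection_type="s3",
        connection_options={"path": "s3://my-data-lake-bucket/raw/town/"},
        format="parquet",
        transformation_ctx="sink",
    )

    job.commit()  # commits the bookmark state so the next run picks up only new data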

Now the data is available in Amazon S3 in Parquet format and the ingestion process is over. Next, we will start the distillation process.

Distillation

Fetch Data from AWS S3

  1. Create another crawler that fetches the data from S3 in Parquet file format and creates a table out of it in AWS Glue (a boto3 sketch follows these steps)
  • Choose S3 as the data source and specify the exact path of the folder that holds all the data in Parquet format.
  • Once the crawler is created, just run it. It will create a table containing all the data from the Parquet files in Amazon S3.
  • You can also check out the created table from the Tables tab.
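The S3-backed crawler differs from the JDBC one only in its target; a minimal boto3 sketch, with the names and bucket path again being assumptions:

    import boto3

    glue = boto3.client("glue", region_name="us-east-1")  # region is a placeholder

    # Crawl the Parquet files written by the Glue job and catalog them as a table
    glue.create_crawler(
        Name="town-parquet-crawler",                            # placeholder name
        Role="arn:aws:iam::123456789012:role/GlueServiceRole",  # placeholder role ARN
        DatabaseName="city",
        Targets={"S3Targets": [{"Path": "s3://my-data-lake-bucket/raw/town/"}]},
    )
    glue.start_crawler(Name="town-parquet-crawler")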

Insights with AWS Athena

Now comes the most exciting part of the whole data lake. Insights are the stage where you query the data from the store using a query engine and build some beautiful dashboards on top of it.

  • Open AWS Athena by typing Athena in the search bar at the top
  • Provide an output location for the results of the queries we will be firing: click on the settings at the top right
  • Specify an output S3 bucket path for the output data

Now just fire a query, and it will show you the whole result in the Athena query engine. You can also export the result or connect Athena to other analytics tools like Holistics, Tableau, or Domo.
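As an illustration, here is one way such a query could be fired programmatically; the table, database, and result bucket names are assumptions carried over from the steps above.

    import boto3

    athena = boto3.client("athena", region_name="us-east-1")  # region is a placeholder

    # Run a simple aggregate over the crawled Parquet table
    response = athena.start_query_execution(
        QueryString="SELECT code, COUNT(*) AS towns FROM town GROUP BY code ORDER BY towns DESC LIMIT 10",
        QueryExecutionContext={"Database": "city"},
        ResultConfiguration={"OutputLocation": "s3://my-data-lake-bucket/athena-results/"},
    )
    print(response["QueryExecutionId"])  # use this id to poll for and fetch the results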

Conclusion

You can choose any other data lake provider like Azure or GCP; it will not make much difference. The only thing that matters is which storage engine you are using. In my use case, I used Amazon S3 because it is a hell of a lot cheaper than the other options out there.

I hope this has helped you or will help you. If you want to discuss this or anything related to tech, you can contact me here or on the Contact Page. If you are interested in becoming a part of the Progress story, please reach out to me or check out the Create Blog page.

See you next time! Peace Out ✌️
