This guide and hands-on walkthrough will help you set up an AWS Data lake in your organisation. A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale. You can store your data as-is, without having to first structure it, and run different types of analytics on it, from dashboards and visualizations to big data processing, real-time analytics, and machine learning, to guide better decisions.
The amount of data in the world is increasing drastically, and it's estimated that the global datasphere will grow to 175 zettabytes by 2025. Around 90% of that data is unstructured or semi-structured. There are many solutions for storing and processing structured data, but when you need to handle data in every form, whether structured, semi-structured, or unstructured, a data lake comes into the picture.
A data lake maintains data in its native formats and handles the three Vs of big data (volume, velocity, and variety) while providing tools for analyzing, querying, and processing it. Data lakes remove the restrictions of a typical data warehouse system by providing unlimited space, unrestricted file sizes, schema-on-read, and various ways to access data, including programmatic access, SQL-like queries, and REST calls.
Need for a Data lake
If you are facing the issues below, then you definitely need to create an AWS Data lake.
- Your business has too many data stores and no single source of truth, making it difficult to fetch data from multiple sources
- Data is increasing day by day, and you are spending too much on storage
- The structure of your data varies a lot: for example, user audit data, IoT device data, logs, and an image gallery
- Analytics on big data are slow
By now, it should be clear whether a data lake is a need for your organisation or not!
Now we have to choose the right data lake architecture.
Data Lake Architecture
Many companies use cloud storage services such as Google Cloud Storage and Amazon S3, or a distributed file system such as Apache Hadoop. For this blog, I'm going to build the data lake on AWS as the example, since the services involved fit within the Free Tier plan.
Data Lake Layers
A data lake can have any number of layers, as it's not a product or a tool but a process. For our use case, we will discuss the following layers.
- Ingestion
- Distillation
- Processing
- Insights
Let's go through each of them one by one. I know this will be the exciting part, as no one really wants to sit through the theory.
AWS Data lake Hands-on
Ingestion
First, we will ingest our raw data as-is into the data lake; for this use case, we will be using PostgreSQL table records. We will ingest all the records into the Data lake. Let's create some big data in the Postgres table.
We will create a town table with columns like id, code, article, and name, and use the generate_series function to generate 100k records, as sketched below.
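A minimal sketch of that setup, assuming a reachable PostgreSQL instance and the psycopg2 driver; the connection string and the random-value expressions are placeholders, while the table layout follows the description above:

```python
import psycopg2

# Placeholder connection string: point this at your own Postgres instance
conn = psycopg2.connect("dbname=city user=postgres password=secret host=localhost")
cur = conn.cursor()

# Table layout matching the walkthrough: id, code, article, name
cur.execute("""
    CREATE TABLE town (
        id      serial PRIMARY KEY,
        code    varchar(10),
        article text,
        name    text
    )
""")

# generate_series(1, 100000) yields one row per value, so this inserts 100k records
cur.execute("""
    INSERT INTO town (code, article, name)
    SELECT
        left(md5(random()::text), 10),
        md5(random()::text),
        md5(random()::text)
    FROM generate_series(1, 100000)
""")

conn.commit()
cur.close()
conn.close()
```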
The data ends up looking like rows of random id, code, article, and name values.
Now our data is ready, and all we need to do is ingest it into the data lake. So first we need to create a database in AWS Glue to catalogue all the unstructured and structured data.
Data Source Creation
Database
To create a new database, follow the steps below (a boto3 equivalent is sketched after the list):
- Open AWS Glue by typing AWS Glue in the search bar of the AWS Console
- Click on Databases on the left side, then click on Add Database
- Enter the name of the database; for this use case we will use city as the database name
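The same database can be created programmatically. A short sketch using boto3, assuming AWS credentials and a region are already configured in your environment:

```python
import boto3

glue = boto3.client("glue")

# Create the Glue database that will hold our table definitions
glue.create_database(DatabaseInput={"Name": "city"})
```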
AWS Glue Connection
To fetch data from AWS RDS or any other RDBMS, we have to create a connection for it. To create a new connection, follow the steps below (a boto3 equivalent is sketched after the list):
- Click on connections on the left side then click on Add Connections
- Type the name of the connection and choose the connection type; for this example we are choosing JDBC, but you can also choose Amazon RDS
- Enter the connection details: JDBC URL, username, password, and database name, along with the Amazon VPC, subnet, and security group of that database
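Here is a hedged boto3 equivalent of the same connection; every value below (connection name, JDBC URL, credentials, subnet, security group, availability zone) is a placeholder for your own setup:

```python
import boto3

glue = boto3.client("glue")

# Register a JDBC connection pointing at the Postgres database
glue.create_connection(
    ConnectionInput={
        "Name": "town-postgres-jdbc",                     # placeholder name
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": "jdbc:postgresql://my-rds-endpoint:5432/city",
            "USERNAME": "postgres",
            "PASSWORD": "secret",
        },
        "PhysicalConnectionRequirements": {
            "SubnetId": "subnet-0123456789abcdef0",       # subnet of the database
            "SecurityGroupIdList": ["sg-0123456789abcdef0"],
            "AvailabilityZone": "us-east-1a",
        },
    }
)
```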
AWS Glue Crawler
Create a crawler that builds the data source table schema, which will be used to fetch data from the actual database at regular intervals.
To create a crawler, follow the steps below (a boto3 version is sketched after the list):
- Click on Crawlers and click Add Crawler
- Enter the name of the crawler and click Next
- Keep the default settings
- Choose JDBC as the connection type, pick the JDBC connection created previously, and specify the include path of the table whose schema we want to fetch
- Choose an IAM role for Glue (I have used the Glue full-access policy). Set the crawler frequency to Run on Demand, as we will run it only when we want to create or update the table schema.
- Choose the already created database
- Now just run the crawler that we have just created. It will create a table schema that you can view in the tables section.
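For anyone who prefers automating this step, here is a minimal boto3 sketch of the same crawler. The crawler name, role ARN, connection name, and include path are placeholder assumptions; substitute your own:

```python
import boto3

glue = boto3.client("glue")

# Create a crawler that reads the table schema through the JDBC connection.
# Omitting Schedule means the crawler only runs on demand.
glue.create_crawler(
    Name="town-jdbc-crawler",                                  # placeholder name
    Role="arn:aws:iam::123456789012:role/MyGlueServiceRole",   # placeholder role ARN
    DatabaseName="city",
    Targets={
        "JdbcTargets": [
            {"ConnectionName": "town-postgres-jdbc", "Path": "city/public/town"}
        ]
    },
)

# Run it once to create or update the table schema in the Data Catalog
glue.start_crawler(Name="town-jdbc-crawler")
```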
AWS Glue Job
Next, create a job that fetches data through the JDBC connection at a regular interval and stores it in Amazon S3 in Parquet format.
To do so, follow the steps below:
- Click on Jobs on the left side of the AWS Glue console, then click Add Job
- Enter the name of the job, assign the IAM role, and change the job bookmark setting from Disable to Enable so that the job remembers the last state of the data it fetched and avoids duplicating records
- Choose the data source to fetch from; we will use the table schema created in the steps above
- For the data target, click on Create tables in your data target, then choose Amazon S3 as it's the cheapest storage option available. For the format, choose Parquet and specify the data target path.
- We can also change the target schema and save the job when finished.
- Now click on the save script dropdown and add a trigger for it.
- For this use case, we will fetch data from the data source daily and put it in the data target, i.e. S3
- Don’t forget to enable the bookmark
- Once the trigger is created, just run the job manually. A sketch of the kind of ETL script Glue generates for this job is shown below.
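For reference, this is roughly what the generated job script looks like: read the catalogued JDBC table as a DynamicFrame and write it to S3 as Parquet, with job bookmarks enabled. The database, table, and bucket names below are assumptions based on this walkthrough, not the exact names Glue will generate.

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)  # required for job bookmarks to be tracked

# Read from the table the JDBC crawler registered in the Glue Data Catalog
source = glue_context.create_dynamic_frame.from_catalog(
    database="city",
    table_name="city_public_town",   # placeholder: use the crawler-generated name
    transformation_ctx="source",     # ties this read to the bookmark state
)

# Write the records to S3 in Parquet format
glue_context.write_dynamic_frame.from_options(
    frame=source,
    connection_type="s3",
    connection_options={"path": "s3://my-datalake-bucket/town/"},  # placeholder path
    format="parquet",
    transformation_ctx="sink",
)

job.commit()  # persists the bookmark so the next run only picks up new data
```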
Now the data is available in Amazon S3 in Parquet format, and the ingestion process is complete. Next, we will start the distillation process.
Distillation
Fetch Data from AWS S3
- Create another crawler that reads the Parquet data from S3 and creates a table out of it in AWS Glue
- Choose S3 as the data source and specify the exact path of the folder containing all the Parquet data
- Once the crawler is created, just run it. It will create a table containing all the data from the Parquet files in Amazon S3
- You can also check out the created table from the Tables tab (a boto3 sketch of this crawler follows the list)
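As before, the same crawler can be created programmatically. A minimal sketch, assuming the bucket path and role ARN from the earlier steps (both placeholders):

```python
import boto3

glue = boto3.client("glue")

# Crawl the Parquet files written by the Glue job and register them as a table
glue.create_crawler(
    Name="town-parquet-crawler",                               # placeholder name
    Role="arn:aws:iam::123456789012:role/MyGlueServiceRole",   # placeholder role ARN
    DatabaseName="city",
    Targets={"S3Targets": [{"Path": "s3://my-datalake-bucket/town/"}]},
)
glue.start_crawler(Name="town-parquet-crawler")
```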
Insights by AWS Athena
Now comes the most exciting part of the whole data lake. Insights is the stage where you query the data from the store using a query engine and build some beautiful dashboards on top of it.
- Open AWS Athena by typing its name in the search bar at the top
- Provide an output location for the results of the queries we will run: click on Settings at the top right
- Specify an output S3 bucket path for the output data
Now just fire the query, and Athena will show you the full result in the query editor. You can also export the result or connect Athena to other analytics tools like Holistics, Tableau, or Domo.
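You can also fire queries from code. Here is a hedged sketch using boto3's Athena client; the query, database, and output bucket are placeholders for this walkthrough:

```python
import boto3

athena = boto3.client("athena")

# Submit a query against the crawled table; results land in the output bucket
response = athena.start_query_execution(
    QueryString="SELECT code, count(*) AS towns FROM town GROUP BY code LIMIT 10",
    QueryExecutionContext={"Database": "city"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results-bucket/"},  # placeholder
)

# Poll get_query_execution / get_query_results with this id once the query finishes
print(response["QueryExecutionId"])
```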
Conclusion
You can build your data lake on any other provider like Azure or GCP; it will not make much difference. The only thing that matters is which storage engine you are using. In my use case, I used Amazon S3 because it is a hell of a lot cheaper than the other options out there.
I hope this has helped you or will help you. If you want to discuss this or anything else related to tech, you can contact me here or on the Contact Page. If you are interested in becoming a part of the Progress story, please reach out to me or check out the Create Blog page.
See you next time! Peace Out ✌️