The differences between data engineers, data scientists, and data analysts are not very clear. The requirements and job descriptions for these three categories overlap, and so do the online training resources. I once saw a Udemy course with a title like “Python for Data Engineer”, but its content was mostly aimed at entry-level data scientists rather than data engineers. Because of this confusion, I was very happy when I read this post (https://www.dataengineering.academy/pipeline-data-engineering-academy-blog/learn-data-engineering-on-a-shoestring-free-courses), and I agree with most of its opinions about the data engineer skill set: “using Python, SQL and the command line are essential for data engineers”.

For learning those skills, this is the best data engineering training course I have found online: https://github.com/DataTalksClub/data-engineering-zoomcamp

While other courses only scratch the surface, this course is truly about the core of data engineering. I am planning to translate part of the course from GCP to AWS. The motivation for my translation is simple: the company I work for has decided to go with AWS (rather than Azure), so the more I know about AWS, the better. Self-teaching needs a good scaffold, and translation is easy to start and easy to stick with.

I will probably write up all the details here. I also hope that ChatGPT can help me somewhere along the way.

from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession
