What does a Data Engineer do?

October 2, 2024
10:21 pm

Introduction

Data engineering is the practice of designing, building and maintaining systems that collect, transform and store large volumes of data at scale, making that data usable for data science and analytics. As organisations develop the ability to collect massive amounts of raw data, they need skilled professionals to make this data usable and valuable.

In this blog, Mercia Malan takes us through the foundations of data engineering and what the role entails. With over 12 years of experience in software and data engineering, she brings a wealth of knowledge to the field. She worked on the data platform team for Thames Water before stepping up as the Head of Data Engineering, and she is also the driving force behind DataConf, South Africa’s community-driven data conference. Mercia draws heavily on her recent webinar, “The Fundamentals of Data Engineering”, to help you understand this exciting and rapidly evolving field. Let’s get into it!

What is data engineering?

Data engineering is the foundation that enables data scientists and analysts to derive valuable insights from data. Data engineers build and maintain the infrastructure necessary for high-performance data generation and processing.

The field of data engineering has evolved significantly over the past few decades:

1. Data Warehousing (1980s-1990s): Focus on relational databases and structured data, predominantly using SQL. Storing data centrally for analytics.

2. Big Data and Hadoop (2000s): Introduction of distributed storage and processing for handling large volumes of data (rise of Big Data).

3. Cloud-based Solutions and data lakes (2010s): Emergence of data-centric applications and cloud platforms like Amazon Redshift, along with the move away from data warehouses to more flexible data lakes.

4. Modern Data Stack and focus on data science and AI (Present): Cloud-native solutions, lakehouses, decentralised data platforms, handling both structured and unstructured data at scale. Preparing data for data science modelling rather than building data warehouses for analytics.

This evolution has led to data engineering becoming a distinct and vital role in the data ecosystem. It is separate from traditional software engineering and data science roles, but works closely with them to create end-to-end data solutions.

So what does a data engineer actually do?

Data engineers work in a variety of settings to build systems that collect, manage, and transform raw data into usable information for data scientists and data analysts to interpret. Their ultimate goal is to make data accessible so that organisations can use it to evaluate and optimise their performance.

Andries van der Walt, another Lead Data Engineer at Sand, puts it in simpler terms: “Much like plumbing with water flow, data engineers ensure data zips efficiently through an organisation’s systems. They’re the ones building and maintaining the digital pipelines that let data be collected, stored, and analysed without a hitch.”

At Sand Technologies, we use the SIESE framework to capture the value stream when enriching data for clients in a technology- and industry-agnostic way:

Source: Discover and catalogue data sources, understand source systems from a data engineering perspective, and perform exploratory data analysis from a data science perspective.
Ingest: Integrate data from various sources as identified in the Source phase, implement extract, transform, and load (ETL) systems and persist raw data (typically into a data lake).
Enrich: Transform and model ingested data into a data twin or data warehouse and enrich data through data science and machine learning.
Serve: Serve business insights derived from the enriched stage through reports and dashboards, self-serve platforms or custom software applications.
Engage: Gather use cases and collaborate with stakeholders and users throughout the whole development lifecycle.
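
To make the Ingest, Enrich and Serve steps a little more concrete, here is a minimal Python sketch that uses only the standard library. The sample readings, the flow_readings table and all function names are hypothetical and chosen purely for illustration. A real pipeline would read from actual source systems and would typically persist raw data in a data lake rather than in an in-memory SQLite database.

```python
import csv
import io
import sqlite3

# Hypothetical raw data; in practice this would come from a source system.
RAW_CSV = """site,reading_time,flow_lps
plant_a,2024-10-01T08:00,12.5
plant_a,2024-10-01T09:00,
plant_b,2024-10-01T08:00,7.25
"""


def extract(raw_text):
    """Ingest: parse raw CSV text into a list of dict records."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def transform(records):
    """Enrich: skip rows with a missing reading and cast the flow to float."""
    cleaned = []
    for row in records:
        if row["flow_lps"].strip() == "":
            continue  # incomplete reading, not usable downstream
        cleaned.append((row["site"], row["reading_time"], float(row["flow_lps"])))
    return cleaned


def load(rows, conn):
    """Persist the transformed rows into a modelled table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS flow_readings "
        "(site TEXT, reading_time TEXT, flow_lps REAL)"
    )
    conn.executemany("INSERT INTO flow_readings VALUES (?, ?, ?)", rows)
    conn.commit()


def average_flow_by_site(conn):
    """Serve: a simple aggregate that a report or dashboard could display."""
    cursor = conn.execute(
        "SELECT site, AVG(flow_lps) FROM flow_readings GROUP BY site ORDER BY site"
    )
    return dict(cursor.fetchall())


def run_pipeline(raw_text):
    """Run extract, transform and load, then return the served aggregate."""
    conn = sqlite3.connect(":memory:")
    load(transform(extract(raw_text)), conn)
    return average_flow_by_site(conn)
```

Calling run_pipeline(RAW_CSV) on the sample data returns the average flow per site, with the incomplete plant_a row skipped.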

In order to SIES rapidly and with high quality, we need to have three supporting planes in place:

Observation and Control: Ensuring compliance with data governance and security policies.
Governance and Risk: Implementing safeguards for data integrity and ethical use.
Provisioning and Environment: Automating infrastructure setup for efficient data handling.

Data engineers are primarily involved in the Source and Ingest phases, and partly in the data modelling (from a database schema modelling perspective) portion of the Enrich phase. They are also responsible for certain aspects within the supporting context such as ensuring data quality and handling of sensitive data.

Data engineering use cases

Data engineering has a wide range of applications in today’s data-driven world:

Data collection, storage, and management: Streamlining data intake and storage across an organisation for convenient access and analysis.
Real-time data analysis: Automating processes of collecting, cleaning, and formatting data for use in data analytics, enabling real-time learning and decision-making.
Machine learning: Supporting machine learning engineers by creating data pipelines that transport data from collection points to models for training.
Business Intelligence: Providing a robust infrastructure for BI analysts to generate reports and dashboards.
Data-driven decision making: Empowering executives and managers with access to accurate, up-to-date data for strategic decisions.
Smart infrastructure management: Enabling the creation and operation of digital twins for complex systems.

To illustrate the real-world impact of data engineering, let’s look at a case study from our work at Sand Technologies:

Case study: Smart Solutions for Wastewater Management

The wastewater industry has long faced challenges in effectively managing operations and combating pollution. Traditional methods lacked the precision and adaptability needed for modern wastewater treatment. To address this, we introduced cloud-based digital twins for sewage treatment plants.

These digital replicas, underpinned by pneumatic, hydraulic, and process engineering models, allow operators to digitally mirror their day-to-day operations, adjusting various plant factors in real time. The solutions include comprehensive data models driven by live plant data.

As data engineers, our role in this project involved:

Creating bespoke data models to accurately represent complex wastewater systems
Designing digital replicas that could emulate real-world processes
Deploying end-to-end cloud infrastructure to support the digital twin system
Developing user-friendly applications for real-time parameter adjustments

By integrating AI-driven analytics, these models provide actionable insights, guiding users on optimising their sites for both operational efficiency and strategic capital expenditure.

This case study demonstrates how data engineering can transform an entire industry, moving from traditional, reactive approaches to proactive, data-driven solutions. It showcases the power of combining data engineering with domain expertise to solve real-world problems.

Data engineer roles and responsibilities

Whether you are just stepping into a career in data engineering, thinking of a career change (the line between data engineers, data scientists and analytics engineers has never been thinner) or you’re a seasoned professional, there are various job roles and paths to consider that may help you advance your engineering career.

MJ, a Senior Data Engineer working as a tech lead in one of the data platform teams of Sand’s wastewater treatment clients, offers insight into his life as a data engineer at Sand: “Data Engineers are rare; only 18% of companies have a data team, and less than 6% have a dedicated data engineer.”

At Sand, we structure our data engineering roles differently from some other companies in the industry. Our career progression is based on three main levels:

Junior Data Engineers:

Ingest data into a “landing zone”
Write code to process and clean data
Develop and maintain ETL processes
Write tests for data pipelines
Transform data according to designed data models
Observe data pipelines in production to identify issues
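
As an illustration of the “process and clean data” and “write tests” tasks above, here is a minimal Python sketch. The record fields (site, value) and the function names are assumptions made for this example, not a description of Sand’s actual pipelines.

```python
def clean_record(record):
    """Trim whitespace, lowercase the site name and parse the numeric value.

    Returns None when the record cannot be used, so the caller can filter it out.
    """
    site = record.get("site", "").strip().lower()
    raw_value = str(record.get("value", "")).strip()
    if not site or not raw_value:
        return None
    try:
        value = float(raw_value)
    except ValueError:
        return None  # non-numeric reading such as "n/a"
    return {"site": site, "value": value}


def clean_batch(records):
    """Clean every record and keep only the usable ones."""
    cleaned = [clean_record(record) for record in records]
    return [record for record in cleaned if record is not None]


def test_clean_batch_drops_bad_rows():
    """A small pipeline test: bad rows are dropped, good rows are normalised."""
    raw = [
        {"site": " Plant_A ", "value": "3.5"},
        {"site": "plant_b", "value": "n/a"},
        {"site": "", "value": "1.0"},
    ]
    assert clean_batch(raw) == [{"site": "plant_a", "value": 3.5}]
```

A test like this can be run with a test runner such as pytest, or simply by calling the test function directly.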

Mid-level Data Engineers (in addition to Junior tasks):

Understand how to connect to various data sources
Apply optimisations to data pipelines and storage
Promote code and pipelines to testing and production environments
Iterate and improve pipeline performance and data models

Senior Data Engineers (in addition to Mid-level tasks):

Design data solutions and architectures
Find and evaluate new data sources
Model data for different use cases (Analytics, Transactional, or Data Science)
Work with data scientists to iterate on their required data models
Ensure scalability and efficiency of data systems

We also have a unique role called Keystone Data Engineer, which is similar to but distinct from the Lead Data Engineer role you might find at other companies.

At Sand, these roles could have you working in various industries across the globe.

This diverse exposure allows our data engineers to gain a wide range of experience and apply their skills to different domains.

What skills are needed to be a Data Engineer?

Becoming a successful data engineer requires a blend of technical prowess and interpersonal abilities. At the core, data engineers need a strong foundation in programming languages, database systems, and cloud computing platforms. These technical skills allow them to build robust data pipelines, design efficient storage solutions, and create scalable architectures that can handle the volume and variety of modern data.

However, technical skills alone are not sufficient. Data engineers must also possess analytical thinking, problem-solving abilities, and excellent communication skills. They need to translate complex business requirements into effective data solutions, collaborate with various stakeholders, and explain technical concepts to non-technical team members. This combination of hard and soft skills enables data engineers to not only build powerful data systems but also to ensure these systems truly serve the needs of the organisation.

Programming: Proficiency in Python, SQL, and Java
Database Systems: Understanding of Relational (MySQL, PostgreSQL) and NoSQL (MongoDB, Cassandra) databases
Big Data Technologies: Familiarity with Hadoop, Spark, and Kafka
Cloud Platforms: Knowledge of AWS, Google Cloud Platform, or Azure
Data Modelling: Ability to design efficient and scalable data models
ETL Processes: Experience with extract, transform, and load systems
Problem-solving: Ability to tackle complex data challenges
Communication: Skill in explaining technical concepts to non-technical stakeholders

To validate your skills and boost your career prospects, consider pursuing data engineering certifications from the major cloud platforms.

These certifications demonstrate your expertise in building and maintaining data solutions on specific cloud platforms.

Grow your career in Data Engineering at Sand

If you’re passionate about working with data, enjoy solving complex problems, and want to be part of shaping the future of how businesses use information, then a career in data engineering at Sand might be perfect for you. The journey to becoming a proficient data engineer is ongoing – embrace continuous learning, stay curious, and never stop exploring new technologies and techniques. It’s what working at Sand is all about!

If you’re interested in pursuing a career in data engineering, a great place to start is by following a structured learning path. Check out the Data Engineer Roadmap on GitHub for a comprehensive overview of the skills and technologies you’ll need to master.

If you already have experience and want to showcase your skills to potential employers, consider creating a portfolio of data engineering projects. Some ideas include:

Building an end-to-end data pipeline using open-source tools
Creating a data lake solution for analysing large datasets
Developing a real-time data streaming application
Implementing a data warehouse for business intelligence reporting

These projects will help you apply your skills in real-world scenarios and demonstrate your ability to solve complex data challenges. If you found this article helpful and want to dive deeper into the world of data engineering, watch the full webinar on “The Fundamentals of Data Engineering.” It provides more detailed explanations, real-world examples, and Q&As with the audience.

Ready to join?

Considering a career at Sand for your Data Engineering ambitions? Visit our Careers Website or join our Talent Community. Signing up will keep you informed about available jobs, upcoming events and learning resources.