Data Engineering Internship, KBC Bank & Insurance - LeCa Team

Improving Traceability in Marketing Data Pipelines

Marketing campaigns need a reliable list of clients to contact. I investigated a pipeline where every client should either end up in the final campaign target list or be logged with a clear reason for being filtered out.

Context

Enterprise data quality, with real business impact.

Marketing campaigns start with a large list of potential clients. Before a campaign can be sent, this list passes through a data pipeline that combines information from multiple sources and applies a series of business rules. At each stage, some clients continue toward the final campaign while others are filtered out for specific reasons, such as eligibility rules, exclusions, or communication constraints.

To keep the process explainable, every client should have a traceable outcome: either they appear in the final campaign target list or they are recorded in a traceability dataset explaining why they were removed. The problem I investigated was that the total number of final targets plus logged dropouts did not always match the number of clients that entered the pipeline. My work focused on finding where these discrepancies originated and making the investigation process repeatable.

Problem

The number of clients entering the campaign pipeline did not always match the combined number of final campaign targets and traceability records. Each mismatch meant that missing, duplicated, or unexplained records had to be investigated.

My Role

I traced records across stages, compared counts, analyzed joins and filters, validated findings with dashboards, and documented a repeatable investigation workflow.

Confidentiality

Internal datasets, repositories, and implementation details are generalized here while preserving the project logic and learning outcomes.

Objective

Every client entering the pipeline should either reach the final campaign output or have a documented reason explaining why they were filtered out.

Pipeline Flow

How campaign targets are created.

Each campaign begins with a population of potential clients. The pipeline progressively filters, enriches, and validates these records until the final target list is produced. Clients that do not continue should be captured in traceability reporting with an explanation.

01. FSR: Initial client selection
02. CE: Configurable exclusions
03. OS: Offer selection baseline
04. OE: Offer enrichment logic
05. BU: Buffer update
06. BS: Buffer selection
07. LV + DT: Output or traceability

Validation equation

The investigation checked whether the clients that entered the later pipeline stage could be fully accounted for as either final output or traceability records.

Input = Final targets + Logged dropouts

Methodology

A reusable debugging workflow.

Instead of chasing isolated symptoms, I used a five-step reconciliation process that could be repeated across marketing solutions and pipeline runs.

Step 1: Count Comparison

Measured baseline row counts at each layer to locate where discrepancies started.
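
The count check can be sketched in plain Python. This is an illustration of the idea, not the pipeline's PySpark code: the function names are hypothetical, the counts are made up, and only the stage abbreviations come from the pipeline flow described earlier. Each stage is balanced when rows in equal rows continuing plus rows logged as dropouts.

```python
def reconcile(rows_in, rows_out, rows_logged):
    """Return the unexplained gap for one stage.

    0 means every input row is accounted for. A positive gap means rows
    went missing; a negative gap means rows were added (e.g. duplicated).
    """
    return rows_in - (rows_out + rows_logged)


def first_unbalanced_stage(stages):
    """Return (stage_name, gap) for the first stage that does not balance."""
    for name, rows_in, rows_out, rows_logged in stages:
        gap = reconcile(rows_in, rows_out, rows_logged)
        if gap != 0:
            return name, gap
    return None


# Hypothetical counts: (stage, rows in, rows continuing, rows logged as dropouts)
stage_counts = [
    ("CE", 1000, 900, 100),
    ("OE", 900, 905, 0),
    ("BS", 905, 700, 200),
]

print(first_unbalanced_stage(stage_counts))  # ('OE', -5)
```

Walking the stages in order matters: a gap introduced early carries into every later count, so the first unbalanced stage is the place to start looking.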

Step 2: Data Splitting

Separated continuing leads from filtered leads using indicators, reasons, and business keys.
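
A minimal sketch of the split, assuming each row carries a boolean filter indicator and a reason. The column names (client_id, filtered_out, reason) and the sample rows are hypothetical; the real datasets use their own indicators and business keys.

```python
from collections import Counter


def split_leads(rows, indicator="filtered_out"):
    """Split rows into (continuing, filtered) based on a boolean indicator."""
    continuing = [row for row in rows if not row[indicator]]
    filtered = [row for row in rows if row[indicator]]
    return continuing, filtered


# Hypothetical stage output
rows = [
    {"client_id": "C1", "filtered_out": False, "reason": None},
    {"client_id": "C2", "filtered_out": True, "reason": "exclusion"},
    {"client_id": "C3", "filtered_out": True, "reason": "communication constraint"},
    {"client_id": "C4", "filtered_out": True, "reason": "exclusion"},
]

continuing, filtered = split_leads(rows)

# Count filtered leads per reason so each dropout category can be checked.
reason_counts = Counter(row["reason"] for row in filtered)
print(len(continuing), dict(reason_counts))
```

The two parts should always add up to the input; if they do not, the indicator itself is unreliable (for example, null values) and needs checking before any later step.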

Step 3: Difference Analysis

Joined stage outputs to identify missing records, duplicate records, and unexpected increases.
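
The comparison can be sketched as a key-level diff between two stage outputs. This plain-Python version stands in for the joins used in the investigation; the function name and the client keys are hypothetical.

```python
from collections import Counter


def diff_stages(before_keys, after_keys):
    """Compare business keys before and after a stage.

    missing:    keys present before but absent after
    duplicated: keys that occur more often after than before
    unexpected: keys that appear after without existing before
    """
    before = Counter(before_keys)
    after = Counter(after_keys)
    return {
        "missing": sorted(k for k in before if k not in after),
        "duplicated": sorted(k for k in after if k in before and after[k] > before[k]),
        "unexpected": sorted(k for k in after if k not in before),
    }


# Hypothetical keys
result = diff_stages(["C1", "C2", "C3"], ["C1", "C1", "C3", "C9"])
print(result)
```

A key listed as missing is not necessarily a defect: it still has to be checked against the logged dropouts before it counts as unexplained.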

Step 4: Root Cause Review

Inspected source logic, joins, selected columns, Airflow logs, and local pipeline runs.

Step 5: Validation

Re-ran affected stages and compared results across multiple marketing solutions.

Findings

Four causes behind the inconsistencies.

The discrepancies were not caused by one bug. They came from several subtle issues that affected different parts of the pipeline.

Offer Enrichment Duplication

A one-to-many join in enrichment could multiply lead records when reference data contained parent-child relationships.

Response: corrected the join/deduplication logic so each input lead produced at most one enriched output.
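
The effect can be reproduced on toy data. The sketch below is an assumption-heavy illustration, not the production logic: the column names (lead_id, offer_id, level, segment) and the rule "prefer the parent row" are hypothetical, chosen only to show how a one-to-many join multiplies leads and how reducing the reference data to one row per key prevents it.

```python
def enrich_naive(leads, reference):
    """Join that emits one output row per matching reference row."""
    return [
        {**lead, **ref}
        for lead in leads
        for ref in reference
        if ref["offer_id"] == lead["offer_id"]
    ]


def enrich_one_per_lead(leads, reference):
    """Reduce the reference data to one row per offer_id, then join."""
    chosen = {}
    for ref in reference:
        current = chosen.get(ref["offer_id"])
        # Assumed rule: keep the first row seen, but let a parent row win.
        if current is None or ref["level"] == "parent":
            chosen[ref["offer_id"]] = ref
    return [
        {**lead, **chosen[lead["offer_id"]]}
        for lead in leads
        if lead["offer_id"] in chosen
    ]


# Hypothetical data: offer O1 has both a parent and a child reference row.
leads = [
    {"lead_id": "L1", "offer_id": "O1"},
    {"lead_id": "L2", "offer_id": "O2"},
]
reference = [
    {"offer_id": "O1", "level": "parent", "segment": "A"},
    {"offer_id": "O1", "level": "child", "segment": "A1"},
    {"offer_id": "O2", "level": "parent", "segment": "B"},
]

naive = enrich_naive(leads, reference)
fixed = enrich_one_per_lead(leads, reference)
print(len(leads), len(naive), len(fixed))  # 2 3 2
```

Which reference row should survive is a business decision; the sketch only shows that the choice has to be made before the join, not after it.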

Missing Buffer Traceability

Some buffer-stage filter-outs existed operationally, but were not visible in the main traceability reporting dataset.

Response: identified the reporting gap and proposed including the buffer traceability source in Qlik reporting.
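
The difference between a real loss and a reporting gap can be shown with sets of client keys. The function and the keys below are hypothetical; the point is only that a client can look unexplained until every traceability source is included in the check.

```python
def unexplained(input_keys, final_keys, *traceability_sources):
    """Return input keys found in neither the final output nor any traceability source."""
    explained = set(final_keys)
    for source in traceability_sources:
        explained |= set(source)
    return sorted(set(input_keys) - explained)


# Hypothetical keys
input_keys = ["C1", "C2", "C3"]
final_keys = ["C1"]
main_traceability = ["C2"]
buffer_traceability = ["C3"]

# With the main traceability dataset only, C3 looks lost.
main_only = unexplained(input_keys, final_keys, main_traceability)

# Including the buffer traceability source accounts for every client.
with_buffer = unexplained(input_keys, final_keys, main_traceability, buffer_traceability)
print(main_only, with_buffer)  # ['C3'] []
```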

Blocking Table Windowing

Windowing logic could select an outdated blocking record instead of the most current eligibility record, causing small but consistent count differences.

Response: changed the window selection to use the latest relevant blocking end date.
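
A minimal sketch of the selection rule, assuming one blocking record per row with a client key and an end date. The field names (client_id, blocking_end) and the dates are hypothetical, and the loop stands in for the window logic used in the pipeline.

```python
from datetime import date


def latest_blocking(records):
    """Per client, keep the record with the latest blocking end date."""
    latest = {}
    for record in records:
        current = latest.get(record["client_id"])
        if current is None or record["blocking_end"] > current["blocking_end"]:
            latest[record["client_id"]] = record
    return latest


# Hypothetical blocking records: C1 has an outdated and a current entry.
records = [
    {"client_id": "C1", "blocking_end": date(2023, 1, 31)},
    {"client_id": "C1", "blocking_end": date(2024, 6, 30)},
    {"client_id": "C2", "blocking_end": date(2024, 3, 15)},
]

selected = latest_blocking(records)
print(selected["C1"]["blocking_end"])  # 2024-06-30
```

In PySpark the same idea is usually written as a row_number over a window partitioned by the client key and ordered by the end date descending, keeping row 1. The ordering column is the part that decides whether the current or an outdated record wins.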

Buffer Update Uniqueness Loss

A transformation dropped identifier columns, making distinct rows appear identical later in the pipeline.

Response: proposed preserving distinguishing columns as the safer fix, with dropDuplicates only if business rules confirm one record should remain.
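
The uniqueness loss is easy to reproduce on toy data. The column names and rows below are hypothetical; the sketch only shows how two distinct rows become indistinguishable once an identifying column is dropped.

```python
def project(rows, columns):
    """Keep only the given columns, returning one tuple per row."""
    return [tuple(row[column] for column in columns) for row in rows]


# Hypothetical rows: same client and channel, different offers.
rows = [
    {"client_id": "C1", "offer_id": "O1", "channel": "email"},
    {"client_id": "C1", "offer_id": "O2", "channel": "email"},
]

with_identifier = project(rows, ["client_id", "offer_id", "channel"])
without_identifier = project(rows, ["client_id", "channel"])

distinct_with = len(set(with_identifier))
distinct_without = len(set(without_identifier))
print(distinct_with, distinct_without)  # 2 1
```

After the projection the two rows look like exact duplicates, so removing duplicates at that point would silently discard a valid record. Keeping offer_id (or whichever column distinguishes the rows) avoids having to make that call.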

Dashboard Analysis

Making traceability easier to investigate.

The Qlik dashboard was useful for identifying problematic marketing solutions and validating corrected counts, but it did not yet expose the complete pipeline flow or buffer traceability data.

Current Limitations

  • Final counts were visible, but stage-by-stage losses were difficult to locate.
  • Buffer traceability datasets were not directly integrated into reporting.
  • Offer Selection was used as the expected-lead baseline even though Offer Enrichment applies additional logic.

Proposed Enhancements

  • Pipeline-layer reconciliation views
  • Buffer traceability visibility
  • Standardized traceability reason grouping
  • Expected-lead calculations based on Offer Enrichment
  • Lead-flow waterfall visualizations
  • Root-cause analysis dashboards

Tools

Stack and practices.

The work combined notebook-based data analysis, reading production pipeline code, dashboard validation, and team review, all in the context of AWS Glue pipelines.

  • PySpark
  • Apache Airflow
  • Qlik
  • Bitbucket
  • AWS Glue
  • Adobe Campaign
  • Parquet
  • Data Reconciliation

Technical Skills

  • PySpark data analysis
  • Distributed pipeline debugging
  • Join and window logic review
  • Dashboard validation and improvement proposals

Professional Skills

  • Hypothesis-driven investigation
  • Stakeholder communication
  • Documentation discipline
  • Peer validation

Outcome

What changed.

The internship produced both immediate technical improvements and a more durable debugging method that the team could reuse for future traceability investigations.

Documented a repeatable five-step method for pipeline reconciliation.

Resolved major duplication issues in offer enrichment and buffer update logic.

Fixed a blocking table selection issue that caused small but consistent count differences.

Identified a reporting-layer traceability gap and documented Qlik dashboard enhancements for future development.

Reflection

What I learned.

Data loss vs reporting loss

A key lesson was separating records that truly disappear from records that exist but are not visible in the reporting layer.

Keys matter

Counts only make sense when the business key is understood. Dropping one identifier can change the whole interpretation.

Debugging is communication

Finding a root cause was only useful when I could explain it clearly to technical peers and connect it to business reporting impact.

A bridge between academic learning and production data engineering.

This internship strengthened my understanding of enterprise-scale Spark pipelines, traceability design, dashboard-driven validation, and the discipline needed to make data quality problems explainable.