Problem
The number of clients entering the campaign pipeline did not always match the combined number of final campaign targets and traceability records. A mismatch meant that some records were missing, duplicated, or unexplained and had to be investigated.
Marketing campaigns need a reliable list of clients to contact. I investigated a pipeline where every client should either end up in the final campaign target list or be logged with a clear reason for being filtered out.
Marketing campaigns start with a large list of potential clients. Before a campaign can be sent, this list passes through a data pipeline that combines information from multiple sources and applies a series of business rules. At each stage, some clients continue toward the final campaign while others are filtered out for specific reasons, such as eligibility rules, exclusions, or communication constraints.
To keep the process explainable, every client should have a traceable outcome: either they appear in the final campaign target list or they are recorded in a traceability dataset explaining why they were removed. The problem I investigated was that the total number of final targets plus logged dropouts did not always match the number of clients that entered the pipeline. My work focused on finding where these discrepancies originated and making the investigation process repeatable.
I traced records across stages, compared counts, analyzed joins and filters, validated findings with dashboards, and documented a repeatable investigation workflow.
Internal datasets, repositories, and implementation details are generalized here while preserving the project logic and learning outcomes.
Every client entering the pipeline should either reach the final campaign output or have a documented reason explaining why they were filtered out.
Each campaign begins with a population of potential clients. The pipeline progressively filters, enriches, and validates these records until the final target list is produced. Clients that do not continue should be captured in traceability reporting with an explanation.
01 Initial client selection
02 Configurable exclusions
03 Offer selection baseline
04 Offer enrichment logic
05 Buffer update
06 Buffer selection
07 Output or traceability
The investigation checked whether the clients that entered the later pipeline stage could be fully accounted for as either final output or traceability records.
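The accounting rule behind this check can be sketched in a few lines. This is a minimal illustration in plain Python, not the production Spark code; the function name `reconcile` and the client IDs are hypothetical.

```python
from collections import Counter


def reconcile(input_ids, final_ids, traced_ids):
    """Compare the client IDs that entered a stage with the IDs that left it."""
    entered = Counter(input_ids)
    accounted = Counter(final_ids) + Counter(traced_ids)
    return {
        # Entered the stage but appear in neither the output nor traceability.
        "unexplained": sorted(set(entered) - set(accounted)),
        # Leave the stage more often than they entered it.
        "duplicated": sorted(
            k for k in accounted if k in entered and accounted[k] > entered[k]
        ),
        # Leave the stage without ever having entered it.
        "unexpected": sorted(set(accounted) - set(entered)),
        # True only when every client is accounted for exactly once.
        "balanced": entered == accounted,
    }


result = reconcile(
    input_ids=["c1", "c2", "c3", "c4"],
    final_ids=["c1", "c2", "c2"],  # c2 was duplicated on the way out
    traced_ids=["c3"],             # c4 has no recorded outcome
)
print(result)
```

Comparing multisets rather than totals matters here: one missing client and one duplicated client cancel out in a plain row count but still show up in this breakdown.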
Instead of chasing isolated symptoms, I used a five-step reconciliation process that could be repeated across marketing solutions and pipeline runs.
Step 1: Measured baseline row counts at each layer to locate where discrepancies started.
Step 2: Separated continuing leads from filtered leads using indicators, reasons, and business keys.
Step 3: Joined stage outputs to identify missing records, duplicate records, and unexpected increases.
Step 4: Inspected source logic, joins, selected columns, Airflow logs, and local pipeline runs.
Step 5: Re-ran affected stages and compared results across multiple marketing solutions.
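The first step, locating the stage where counts stop balancing, can be sketched as follows. The stage names follow the pipeline outline, but the counts and the function name `imbalances` are made up for illustration; the real counts came from the pipeline's own datasets.

```python
def imbalances(stages):
    """Return (stage name, delta) for every stage whose counts do not balance.

    Each stage is (name, rows_in, rows_kept, rows_filtered).
    A positive delta means rows appeared; a negative delta means rows vanished.
    """
    found = []
    for name, rows_in, kept, filtered in stages:
        delta = kept + filtered - rows_in
        if delta != 0:
            found.append((name, delta))
    return found


# Hypothetical counts for one marketing solution, in pipeline order.
stages = [
    ("initial client selection", 1000, 900, 100),
    ("configurable exclusions", 900, 850, 50),
    ("offer enrichment", 850, 870, 10),   # 30 rows more than entered
    ("buffer selection", 870, 800, 60),   # 10 rows unaccounted for
]

print(imbalances(stages))
```

The first entry in the result marks where the discrepancy starts, which narrows down the stage outputs worth joining and inspecting in the later steps.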
The discrepancies were not caused by one bug. They came from several subtle issues that affected different parts of the pipeline.
A one-to-many join in enrichment could multiply lead records when reference data contained parent-child relationships.
Response: corrected the join/deduplication logic so each input lead produced at most one enriched output.
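A small plain-Python sketch shows how this kind of fan-out happens and one way to prevent it. The column names and the rule "prefer the parent row" are assumptions made for the example, not the actual business rule used in the pipeline.

```python
leads = [
    {"lead_id": "L1", "offer_code": "A"},
    {"lead_id": "L2", "offer_code": "B"},
]

# Reference data where offer A has both a parent row and a child row.
reference = [
    {"offer_code": "A", "level": "parent", "label": "Offer A"},
    {"offer_code": "A", "level": "child", "label": "Offer A (variant)"},
    {"offer_code": "B", "level": "parent", "label": "Offer B"},
]


def join_on_offer(leads, reference):
    """Inner join on offer_code; emits one row per matching reference row."""
    return [
        {**lead, **ref}
        for lead in leads
        for ref in reference
        if lead["offer_code"] == ref["offer_code"]
    ]


def one_row_per_offer(reference):
    """Keep a single reference row per offer_code (assumed rule: prefer parent)."""
    chosen = {}
    for ref in reference:
        key = ref["offer_code"]
        current = chosen.get(key)
        if current is None or (
            ref["level"] == "parent" and current["level"] != "parent"
        ):
            chosen[key] = ref
    return list(chosen.values())


fanned_out = join_on_offer(leads, reference)                # L1 appears twice
fixed = join_on_offer(leads, one_row_per_offer(reference))  # one row per lead

print(len(leads), len(fanned_out), len(fixed))
```

Reducing the reference side to one row per join key before joining keeps the lead count stable, whereas deduplicating after the join depends on the duplicated rows still being distinguishable.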
Some buffer-stage filter-outs existed operationally, but were not visible in the main traceability reporting dataset.
Response: identified the reporting gap and proposed including the buffer traceability source in Qlik reporting.
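The distinction between records that are truly missing and records that are only hidden from reporting can be expressed as a set comparison. This is an illustrative sketch; the dataset names and client IDs are placeholders.

```python
def classify_unaccounted(entered, final, main_trace, buffer_trace):
    """Split clients missing from reporting into hidden versus truly missing."""
    unaccounted = set(entered) - set(final) - set(main_trace)
    return {
        # Filtered out at the buffer stage, but absent from the main report.
        "hidden_in_buffer": sorted(unaccounted & set(buffer_trace)),
        # No outcome recorded anywhere; these need a root-cause investigation.
        "truly_missing": sorted(unaccounted - set(buffer_trace)),
    }


gaps = classify_unaccounted(
    entered=["c1", "c2", "c3", "c4", "c5"],
    final=["c1"],
    main_trace=["c2"],
    buffer_trace=["c3", "c4"],
)
print(gaps)
```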
Windowing logic could select an outdated blocking record instead of the most current eligibility record, causing small but consistent count differences.
Response: changed the window selection to use the latest relevant blocking end date.
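The effect of picking the wrong record per client can be shown with a plain-Python sketch. The field names, the dates, and the rule that a client stays blocked up to and including the end date are assumptions for illustration.

```python
from datetime import date

# Hypothetical blocking history: client c1 has an old and a current record.
blocking = [
    {"client_id": "c1", "blocking_end": date(2023, 1, 31)},
    {"client_id": "c1", "blocking_end": date(2024, 6, 30)},
    {"client_id": "c2", "blocking_end": date(2023, 12, 31)},
]


def latest_blocking(records):
    """Keep, per client, the record with the latest blocking end date."""
    latest = {}
    for rec in records:
        current = latest.get(rec["client_id"])
        if current is None or rec["blocking_end"] > current["blocking_end"]:
            latest[rec["client_id"]] = rec
    return latest


def is_blocked(client_id, on_date, latest):
    """Assumed rule: a client is blocked up to and including blocking_end."""
    rec = latest.get(client_id)
    return rec is not None and rec["blocking_end"] >= on_date


latest = latest_blocking(blocking)
campaign_date = date(2024, 3, 1)

print(is_blocked("c1", campaign_date, latest))  # uses the 2024 record
print(is_blocked("c2", campaign_date, latest))
```

If the 2023 record had been selected for c1 instead, the client would have looked eligible on the campaign date, which is the kind of small, consistent difference described above.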
A transformation dropped identifier columns, making distinct rows appear identical later in the pipeline.
Response: proposed preserving distinguishing columns as the safer fix, with dropDuplicates only if business rules confirm one record should remain.
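The following sketch shows why dropping an identifier changes the counts. The column names are invented for the example; the point is only that distinct rows collapse once the distinguishing column is no longer selected.

```python
rows = [
    {"client_id": "c1", "offer_id": "o1", "channel": "email"},
    {"client_id": "c1", "offer_id": "o2", "channel": "email"},
    {"client_id": "c2", "offer_id": "o1", "channel": "sms"},
]


def distinct_count(rows, columns):
    """Count distinct rows when only the given columns are kept."""
    return len({tuple(row[col] for col in columns) for row in rows})


with_identifier = distinct_count(rows, ["client_id", "offer_id", "channel"])
without_identifier = distinct_count(rows, ["client_id", "channel"])

# Three distinct rows become two once offer_id is dropped.
print(with_identifier, without_identifier)
```

Deduplicating after the column is gone would silently remove one of the c1 rows, which is why preserving the identifier is the safer default unless the business rules say only one record should remain.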
The Qlik dashboard was useful for identifying problematic marketing solutions and validating corrected counts. However, as the new report highlights, it did not yet expose the complete pipeline flow or the buffer traceability data.
The work combined notebook-based data analysis, production pipeline reading, dashboard validation, AWS Glue pipeline context, and team review.
The internship produced both immediate technical improvements and a more durable debugging method that the team could reuse for future traceability investigations.
Documented a repeatable five-step method for pipeline reconciliation.
Resolved major duplication issues in offer enrichment and buffer update logic.
Fixed a blocking table selection issue that caused smaller but consistent inconsistencies.
Identified a reporting-layer traceability gap and documented Qlik dashboard enhancements for future development.
A key lesson was separating records that truly disappear from records that exist but are not visible in the reporting layer.
Counts only make sense when the business key is understood. Dropping one identifier can change the whole interpretation.
Finding a root cause was only useful when I could explain it clearly to technical peers and connect it to business reporting impact.
This internship strengthened my understanding of enterprise-scale Spark pipelines, traceability design, dashboard-driven validation, and the discipline needed to make data quality problems explainable.