Data-intensive applications (DIAs) increasingly rely on trustworthy, high-quality data to support reliable analytics, machine learning, and automated decision-making. Ensuring data quality at scale, however, remains a major challenge due to schema evolution, ingestion inconsistencies, and the manual effort required to define validation rules. This paper presents DQGen, a framework that automates the generation of data quality validation scripts tailored to complex, evolving datasets. By leveraging dataset metadata and systematically mapping data quality dimensions (e.g., completeness, uniqueness, validity) to Great Expectations (GE) rules, DQGen produces executable validation code that adapts to any schema.
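To illustrate the dimension-to-rule mapping, the sketch below shows the style of check such a generator could emit. It is a minimal example written against the legacy Pandas API of Great Expectations, not DQGen's actual output; the table and column names (customers.csv, customer_id, email) are hypothetical and not taken from the evaluated dataset.

\begin{verbatim}
# Minimal sketch of dimension-to-rule mapping (hypothetical
# table/column names), using the legacy Pandas API of
# Great Expectations.
import great_expectations as ge
import pandas as pd

df = pd.read_csv("customers.csv")   # hypothetical input table
gdf = ge.from_pandas(df)

# Completeness: no missing customer identifiers.
gdf.expect_column_values_to_not_be_null("customer_id")
# Uniqueness: identifiers must not repeat across rows.
gdf.expect_column_values_to_be_unique("customer_id")
# Validity: e-mail addresses must match a simple pattern.
gdf.expect_column_values_to_match_regex("email",
                                        r"[^@]+@[^@]+\.[^@]+")

# Run all registered expectations and report the outcome.
results = gdf.validate()
print(results["success"])
\end{verbatim}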
We evaluate DQGen on a real-world dataset from a large-scale Internet Service Provider (ISP) comprising over 26 million records across multiple relational tables. Results show that DQGen reduces validation setup time by over 90\%, improves rule coverage and consistency, and allows data quality checks to run continuously as part of batch or CI/CD workflows. The proposed framework contributes to the reliability and governance of modern DIAs by providing scalable, transparent, and automated validation.