Why data wrangling still matters in the age of AI
Julian Thomas, Principal Consultant at PBT Group
Artificial intelligence (AI) has captured the imagination of many businesses. But while the promise of advanced models and intelligent automation is exciting, we cannot forget that an AI system can only perform as well as the data it is built on. If you are serious about building AI solutions that deliver meaningful, trustworthy outcomes, then foundational data practices like data wrangling and data lake architecture are the bedrock.
Turning raw data into AI-ready assets
In most organisations, data is generated constantly across multiple systems. That does not mean it is immediately usable. You might have terabytes of information sitting in your ecosystem, but much of it is noisy, duplicated, inconsistent, incomplete, insufficient, or unlabelled. This is where data wrangling comes in.
Wrangling is the hands-on process of getting raw, messy data into a structured, usable format. It includes everything from deduplication and transformation to labelling, enrichment, and quality assurance. While it might not be glamorous work, it is essential if you want your models to perform reliably.
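To make this concrete, below is a minimal wrangling sketch in Python with pandas. The dataset, column names, and cleaning rules are all hypothetical, invented for illustration; real wrangling is rarely this tidy.

```python
import pandas as pd  # assumes pandas 2.x

# Hypothetical raw extract: duplicated, inconsistent, and incomplete records.
raw = pd.DataFrame({
    "customer_id": [101, 101, 102, 103, None],
    "email": ["a@x.com", "a@x.com", "B@X.COM ", None, "c@x.com"],
    "signup_date": ["2024-01-05", "2024-01-05", "11 Feb 2024", "2024-03-02", "2024-04-01"],
})

wrangled = (
    raw
    .dropna(subset=["customer_id"])   # drop records missing the key
    .drop_duplicates()                # remove exact repeats
    .assign(
        # standardise formatting and types
        email=lambda d: d["email"].str.strip().str.lower(),
        signup_date=lambda d: pd.to_datetime(d["signup_date"], format="mixed"),
    )
)

# A simple quality-assurance check before the data moves downstream.
assert wrangled["customer_id"].is_unique, "duplicate customer IDs remain"
print(wrangled)
```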
The Medallion architecture explained
At PBT Group, one of the frameworks we use to guide this process is the Medallion architecture. It breaks data into three layers:
- Bronze: raw, unfiltered data collected from various sources in its original format. It represents the single source of truth.
- Silver: cleaned and structured data that is more analytics friendly.
- Gold: curated datasets that are fully governed, enriched, and ready for use in Business Intelligence (BI) or machine learning.
This layered approach lets us progressively improve data quality and accessibility, ensuring that every team, from engineers to analysts to data scientists, can work confidently from the layer whose quality and governance best suits its requirements, as sketched below.
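As a rough illustration of the progression (a sketch, not PBT Group's actual implementation), the example below walks a hypothetical orders dataset through the three layers in pandas; the columns and business rules are invented for the example.

```python
import pandas as pd

# Bronze: raw events landed as-is from a hypothetical source system.
bronze = pd.DataFrame({
    "order_id": ["A1", "A1", "A2", "A3"],
    "amount": ["100.0", "100.0", "not_available", "250.5"],
    "country": ["za", "za", "ZA", "uk"],
})

# Silver: deduplicated, typed, and standardised for analytics.
silver = (
    bronze
    .drop_duplicates(subset=["order_id"])
    .assign(
        amount=lambda d: pd.to_numeric(d["amount"], errors="coerce"),
        country=lambda d: d["country"].str.upper(),
    )
    .dropna(subset=["amount"])
)

# Gold: a curated, business-level aggregate ready for BI or machine learning.
gold = silver.groupby("country", as_index=False)["amount"].sum()
print(gold)
```

In practice each layer lives as governed tables in the lake rather than in-memory frames, but the direction of travel is the same: every step adds structure, trust, and usability.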
AI needs structure
It is a common misconception that AI just needs ‘lots’ of data. What it really needs is the right data: data that is well-prepared, relevant, diverse, and unbiased. If you skip the wrangling phase or treat your data lake as a dumping ground, you are setting your models up for failure. Poor-quality inputs lead to poor-quality outputs, no matter how sophisticated your algorithm might be.
A well-managed data lake, paired with extensive wrangling practices, gives organisations the ability to:
- Consolidate fragmented data sources.
- Maintain historical records for model training.
- Ensure traceability and governance.
- Create reusable, trusted datasets for different AI use cases.
Know when to transition from data wrangling to Extract, Transform, and Load (ETL)
It is important to note the main difference between data wrangling and ETL: the former is essentially “informal ETL”, done in the context of machine learning for a given initiative. It is often once-off, limited in duration to a particular project. ETL is effectively the same activity, but automated for long-term use and done according to a common, central standard, with additional requirements for auditability, restartability, and reusability.
Data scientists should avoid permanent data wrangling solutions. These tend to be manual, human-intensive, and not optimised for production stability. Once the data scientist has completed the wrangling, and the resulting model is approved for production implementation, the wrangling solution should be handed over to a formal engineering team to be converted into formal ETL according to the organisation’s standards.
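As a hedged sketch of what that conversion might produce, the example below recasts a once-off wrangling step as a restartable, logged load job; the file paths, schema, and idempotency rule are assumptions for illustration, not a prescribed standard.

```python
import logging
from pathlib import Path

import pandas as pd

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("orders_etl")

def run_orders_etl(source: Path, target: Path) -> None:
    """Idempotent load step: safe to re-run after a failure (restartability)."""
    if target.exists():
        log.info("Target %s already exists; skipping re-run.", target)
        return
    df = pd.read_csv(source).drop_duplicates(subset=["order_id"])
    tmp = target.with_suffix(".tmp")  # write-then-rename keeps half-written files out of the lake
    df.to_csv(tmp, index=False)
    tmp.rename(target)
    log.info("Loaded %d rows from %s into %s.", len(df), source, target)  # audit trail
```

The structured logging gives auditability, the existence check gives restartability, and the parameterised function gives reusability, which are exactly the qualities an ad-hoc wrangling script typically lacks.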
Make data quality a daily discipline
Another point I always highlight is that data quality is not a once-off task. It needs to be baked into the way your teams work. Governance, standardisation, and validation processes must be part of the day-to-day lifecycle of your data assets. And the earlier in the pipeline this is applied, the more value it delivers downstream.
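One lightweight way to bake validation into the daily lifecycle is to fail fast at the top of every pipeline run. The sketch below is a minimal example; the rules and column names are hypothetical and would come from your own governance standards.

```python
import pandas as pd

def validate(df: pd.DataFrame) -> pd.DataFrame:
    """Fail fast on basic quality rules before data moves downstream."""
    problems = []
    if df["order_id"].isna().any():
        problems.append("missing order IDs")
    if df["order_id"].duplicated().any():
        problems.append("duplicate order IDs")
    if df["amount"].lt(0).any():
        problems.append("negative amounts")
    if problems:
        raise ValueError("Quality checks failed: " + "; ".join(problems))
    return df

# Applied as the first step of a pipeline, so bad data fails loudly and early.
orders = pd.DataFrame({"order_id": [1, 2, 3], "amount": [10.0, 5.5, 99.0]})
orders = validate(orders)
```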
Rethinking the process
If your data lake is just a landing zone, and if wrangling is viewed as a side task, your AI projects will suffer. But if you start treating data as a product, and structure it with the right frameworks, you will quickly see the shift. The data becomes more accessible, more accurate, and far more valuable.