Machine learning is best known for building prediction models from training data, e.g., estimating a pipeline's susceptibility to internal corrosion failure. However, a more practical application of machine learning is data engineering, i.e., data wrangling. One area where it delivers clear value is helping pipeline operators better manage poor or missing data.
From experience, 90% of the effort on every machine learning project is data wrangling. - AI, Machine Learning, and Data Science in Pipeline Integrity
Decades of inconsistent record-keeping, acquisitions, and fragmented data systems have left most operators with significant gaps in their pipeline records. Those gaps directly impact how risk is quantified and how integrity decisions are made.
Today, operators are expected to calculate risk with incomplete inputs or poorly documented (and potentially incorrect) data, often without clear guidance on which records can be trusted.
At the same time, the consequences of failure, or even a near miss, continue to grow. Regulatory requirements for gas pipeline operators now mandate that records that are not traceable, verifiable, and complete be validated within defined timelines.
The challenge is no longer identifying missing data. It's deciding where to focus validation efforts.
When records are missing, the default is to assume the worst case or use the most conservative value possible. This approach is simple and defensible, but it introduces a new problem. When overly conservative assumptions are applied broadly, large portions of the pipeline system begin to look the same from a risk perspective. It becomes difficult to distinguish segments that represent real risk from those that are simply missing documentation.
Over time, the scope of validation programs expands without clear prioritization.
The impact is not just analytical (risk scores lose their ability to differentiate segments); it's operational and financial. Excavation and validation activities are resource-intensive, often costing $20,000 to over $100,000 per dig depending on conditions. Applying conservative assumptions at scale can significantly increase the number of digs, translating into millions of dollars in additional work.
At the same time, physically verifying every pipeline segment where there's a data gap isn't feasible for large systems. This creates ongoing tension between regulatory requirements and operational constraints.
Machine learning provides a more structured way to work with incomplete or low-trust data. Instead of treating all unknowns the same, machine learning (ML) models can use existing records to identify patterns, a task at which they excel, and impute the missing data points. This allows operators to move from uniform assumptions to differentiated risk and prioritized resources. This approach doesn't replace records or engineering judgment; it simply provides a way to prioritize validation efforts more effectively while data is still being remediated.
Does your data look like this?
Not only can machine learning models identify data points that don't align with pre-determined rules, but they can also capture relationships that are not always explicitly defined. For instance, over time, the model can learn that above-ground pipe is typically associated with above-ground coating types like paint, while below-ground pipe with a coating designation of “paint” may indicate that the documented coating type is wrong. These types of learned relationships can help the models identify records that are more likely to be incorrect, not just incomplete.
Or this? Repairs dated with the original install year: probably incorrect data.
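As a simplified illustration of that kind of consistency check, the paint/below-ground relationship can be expressed as a rule over record attributes. The field names and the rule itself are hypothetical here; in practice, such relationships are learned by the model from complete records rather than hand-written.

```python
# Hypothetical consistency check mirroring the paint/below-ground example.
# Field names and values are illustrative assumptions, not a real schema.

def flag_suspect_coating(records):
    """Return records whose documented coating conflicts with installation type."""
    suspects = []
    for rec in records:
        # Paint is typically an above-ground coating; seeing it on
        # below-ground pipe suggests the documented coating may be wrong.
        if rec["installation"] == "below-ground" and rec["coating"] == "paint":
            suspects.append(rec)
    return suspects

records = [
    {"id": 1, "installation": "above-ground", "coating": "paint"},
    {"id": 2, "installation": "below-ground", "coating": "FBE"},
    {"id": 3, "installation": "below-ground", "coating": "paint"},  # suspect
]

print([r["id"] for r in flag_suspect_coating(records)])  # [3]
```

A learned model generalizes this idea: instead of one hard-coded rule, it scores every record against all the attribute relationships it has observed in trustworthy data.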
In practice, this approach starts with the data that is already available. Most pipeline systems contain enough complete records to establish relationships between key attributes such as install year, diameter, wall thickness, seam type, and operating conditions. These relationships can be learned using boosted decision tree models (such as XGBoost) and then applied to segments with missing data.
In a recent project for a gas pipeline operator, the pipeline properties analyzed for imputation included the following:
The output is not a definitive answer for missing records. Instead, the model identifies inconsistencies in the data (which may indicate incorrect records) and imputes plausible values for missing data, each with an assigned probability. The goal is not to eliminate uncertainty. It's to make the uncertainty manageable and actionable while also highlighting data points that don't "follow the rules."
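Those probabilities are what make the output actionable: low-confidence imputations can be routed to field validation, while high-confidence ones carry forward into risk models. A minimal sketch, with made-up segment IDs, probabilities, and an assumed review threshold:

```python
# Sketch: triage imputed values by model confidence. The probabilities are
# illustrative stand-ins for a classifier's predict_proba output, and the
# threshold is a hypothetical choice set by the validation program.

REVIEW_THRESHOLD = 0.80  # assumed cutoff, not a regulatory value

imputations = {
    "seg-001": {"field": "seam_type", "value": "ERW", "probability": 0.96},
    "seg-002": {"field": "seam_type", "value": "seamless", "probability": 0.58},
    "seg-003": {"field": "install_year", "value": 1972, "probability": 0.88},
}

# Low-confidence imputations are candidates for physical validation.
needs_validation = [
    seg for seg, imp in imputations.items()
    if imp["probability"] < REVIEW_THRESHOLD
]
print(needs_validation)  # ['seg-002']
```

The threshold itself becomes a documented, auditable decision rather than an implicit assumption buried in the risk model.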
An important nuance is that not all errors carry the same weight. To that end, the model is trained to prioritize avoiding high-risk mistakes over maximizing raw accuracy. Using risk models to inform misclassification costs aligns predictions with real-world consequences and supports more defensible, risk-based decisions.
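One standard way to implement this is expected-cost minimization: rather than taking the most probable class, choose the class that minimizes expected misclassification cost. The cost matrix below is a hypothetical example; in practice it would be derived from the operator's risk models.

```python
# Sketch of risk-informed, cost-sensitive classification.
# cost[true_class][predicted_class]: calling a segment "low-risk" when it is
# actually high-risk is assumed far more costly than the reverse error.
COST = {
    "high-risk": {"high-risk": 0.0, "low-risk": 100.0},
    "low-risk":  {"high-risk": 5.0, "low-risk": 0.0},
}

def cost_sensitive_decision(probs):
    """probs: dict of class -> model probability. Returns the least-cost class."""
    expected = {
        pred: sum(probs[true] * COST[true][pred] for true in probs)
        for pred in COST
    }
    return min(expected, key=expected.get)

# Only a 20% chance of high-risk, yet the asymmetric costs still favor
# treating the segment as high-risk (expected cost 4 vs. 20).
print(cost_sensitive_decision({"high-risk": 0.2, "low-risk": 0.8}))  # high-risk
```

A plain argmax over probabilities would call this segment low-risk; the cost-weighted decision reverses that, which is exactly the "avoid high-risk mistakes" behavior described above.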
Results from one ML model used to predict install year.
Regulatory expectations around traceable, verifiable, and complete records are not new. What's changing is how those expectations are enforced.
Operators are no longer being asked simply to produce records. They're expected to show how conclusions were reached, what data was used, and why those inputs are considered reliable. Internal assumptions aren't enough.
Machine learning supports this shift by making the prioritization process explicit. Instead of treating validation as a broad requirement, it provides a structured and repeatable way to identify where effort should be focused. More importantly, it creates a defensible record of why certain segments are validated first, and others are not.
At its core, this approach shows that machine learning can be a practical tool for navigating uncertainty, not by replacing missing data, but by helping operators make better decisions in its absence. By learning patterns from complete records and grounding predictions in explainable, risk-informed logic, the ML data imputation model produces outputs that can be validated, audited, and trusted. It addresses key challenges like data gaps, overfitting, and explainability while aligning with the realities of a regulated industry. Most importantly, it provides defensible, risk-prioritized guidance for dig programs, flagging potential issues and supporting action until records can be fully remediated.
Once the foundation is in place, the conversation shifts from cleaning data to confidently extracting insights, unlocking more advanced analytics, and truly predictive integrity strategies.
See this work presented at the upcoming API conference.