Even the best and most expensive technology can’t produce good results from bad data. This is true for AI models, too, say Liam Cotter and Niall Duggan
Nothing has changed, yet everything has changed with the advent of generative artificial intelligence (AI) and large language models. Their ability to surface unstructured data and use it to generate insights is at once enormously powerful and potentially dangerous.
The problems lie not only in the quality of the unstructured data but also in much of the structured data to which it may have access.
This situation has arisen mainly because organisations have made little use of this data until now.
Refining data for best results
If organisations want to take advantage of the potential benefits of generative artificial intelligence (GenAI), they must give GenAI access to their full treasure trove of data, or as much of it as is legally permitted. If data is the new oil, however, much of it needs refining to unlock its value.
Unfortunately, too many organisations are turning AI loose on the data they have now without first addressing the quality and governance issues associated with it. AI and data analytics need good, trusted, consistent and well-curated data to work correctly and deliver value, and such data can be rare.
Organisations tend to have very fragmented enterprise data environments. Data can be stored on-premises, in the cloud or externally with third parties – and it can be both structured and unstructured.
Typically, there are lots of silos and duplication. This results in separate parts of the same organisation interpreting the same data differently.
Finding the best storage solution
Data storage is a complex problem to solve. First, there is the sheer volume of data, much of it historical, held by organisations.
Then, there is the way the data is passed around. It is often stored in multiple locations, amended and altered in different ways in different places, and subject to misinterpretation, so the same data ends up existing in several conflicting versions.
This is not a new problem; it has already been addressed for business intelligence systems. The standard solution has been the creation of data warehouses or farms, which attempt to offer a single source of truth for the entire organisation's data.
With the enormous volume of data required for GenAI to deliver on its promise, however, the cost of maintaining and resourcing a single data source would quickly become prohibitive.
Furthermore, the effectiveness of storing data in a single location is now questionable.
As a result, we are now seeing a move towards data mesh infrastructure. This sees data stored in multiple interconnected, decentralised domains that are all equally accessible.
They are organised by business function, so the people most familiar with the data – those best qualified to assess and assure its quality – are in control of it.
This helps to ensure the consistency and good governance of the data – the foundation required for adopting AI and GenAI in organisations.
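As a rough illustration of the idea (the domain names, owners, locations and fields below are hypothetical), a data mesh can be thought of as a catalogue of domain-owned "data products", each published and quality-assured by the business function that knows the data best:

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a data mesh catalogue: each business function
# publishes and owns its data products, rather than handing everything
# to a central IT-managed warehouse.

@dataclass
class DataProduct:
    name: str       # e.g. "meter_readings"
    owner: str      # the business function accountable for quality
    location: str   # where the data actually lives (cloud, on-premises, third party)
    schema: dict    # agreed field names and types for this product

@dataclass
class DataMesh:
    domains: dict = field(default_factory=dict)  # domain name -> list of products

    def publish(self, domain: str, product: DataProduct) -> None:
        """A domain team registers a data product it owns and curates."""
        self.domains.setdefault(domain, []).append(product)

    def discover(self, name: str):
        """Any team, or an AI pipeline, can locate a product across domains."""
        for domain, products in self.domains.items():
            for product in products:
                if product.name == name:
                    return domain, product
        return None

mesh = DataMesh()
mesh.publish("billing", DataProduct(
    name="meter_readings",
    owner="billing_team",
    location="s3://billing/meters/",
    schema={"meter_id": "str", "reading_kwh": "float", "read_at": "date"},
))
print(mesh.discover("meter_readings"))
```

The point of the sketch is the ownership model rather than the code itself: IT can still run the underlying platform, but the billing domain decides what "meter_readings" means and vouches for its quality.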
The need for team collaboration
The data mesh infrastructure has other advantages. It allows different parts of the business to collaborate on the data, for example.
In the warehouse model, all data was in the hands of the IT department, and that approach had severe limitations.
IT professionals may be experts on secure data storage, but they are typically not familiar with the nature of the data itself and can’t be expected to vouch either for its quality or the accuracy of an interpretation.
On the other hand, when different parts of the business are responsible for managing and curating their own data, they can make greater use of it and work together to develop new applications.
AI can be deployed while the mesh is under development and can be given access to the data in each domain as it becomes available.
However, the development of a data mesh is not simply a technology exercise. It is also a data cleansing and quality assurance process. All data in the mesh should be verified for quality and consistency.
This is vitally important for organisations in which the lineage of data can be doubtful. An energy utility’s meter data sits in multiple areas of the business, for example, including the billing and asset functions.
This data needs to be brought together into one coherent object: disparate systems must be joined up and a common taxonomy used to describe the data. This will enable AI systems to learn from the data in a consistent and more reliable way.
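A minimal sketch of what this can look like in practice, assuming hypothetical column names and a pandas-style workflow: the billing and asset systems describe the same meters differently, so their local field names are mapped onto a shared taxonomy, the records are joined into a single view of each meter, and basic consistency checks are run before the data enters the mesh.

```python
import pandas as pd

# Hypothetical example: the same meters described differently by the billing
# and asset systems, reconciled under a shared taxonomy before any AI system
# learns from the data.

billing = pd.DataFrame({
    "MeterRef": ["MTR-001", "MTR-002"],
    "kwh": [1200.5, 980.0],
    "read_dt": ["2024-03-01", "2024-03-01"],
})

assets = pd.DataFrame({
    "meter_id": ["MTR-001", "MTR-002"],
    "install_date": ["2019-06-12", "2021-01-30"],
    "location_code": ["DUB-14", "CRK-03"],
})

# Step 1: map each system's local names onto the common taxonomy.
taxonomy = {"MeterRef": "meter_id", "kwh": "reading_kwh", "read_dt": "reading_date"}
billing = billing.rename(columns=taxonomy)

# Step 2: join the disparate systems into a single coherent view of each meter.
meters = billing.merge(assets, on="meter_id", how="outer", validate="one_to_one")

# Step 3: basic quality and consistency checks before the data enters the mesh.
assert meters["meter_id"].is_unique, "duplicate meters after the join"
assert meters["reading_kwh"].ge(0).all(), "negative consumption values"

print(meters)
```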
This cleansing and verification exercise also offers significant benefits for compliance with new reporting requirements such as the Corporate Sustainability Reporting Directive (CSRD). Readily accessible, quality-assured data will make the reporting process much less onerous.
Governance and control
Once the quality and accuracy issues have been addressed, the correct governance and controls must be put in place for privacy, data protection and security.
Organisations must ensure that AI systems do not use their data inappropriately. This requires constant monitoring of how data is managed and governed.
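As a hedged illustration of what such monitoring might involve (the data products, approved purposes and helper function below are hypothetical), access by AI pipelines can be checked against the uses each domain has approved, with every decision logged for review:

```python
import logging

# Hypothetical sketch of a simple governance check: before an AI pipeline is
# given a data product, its intended purpose is compared against the uses the
# owning domain has approved, and the decision is logged for monitoring.

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("data_governance")

# Purposes each (hypothetical) data product may be used for.
APPROVED_USES = {
    "meter_readings": {"billing_analytics", "demand_forecasting"},
    "customer_contacts": {"billing_analytics"},  # e.g. not approved for model training
}

def grant_access(product: str, purpose: str) -> bool:
    """Allow access only for approved purposes, and record the decision."""
    allowed = purpose in APPROVED_USES.get(product, set())
    log.info("product=%s purpose=%s allowed=%s", product, purpose, allowed)
    return allowed

# An AI training job asking for customer contact data would be refused here.
grant_access("meter_readings", "demand_forecasting")   # True
grant_access("customer_contacts", "model_training")    # False
```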
Other key aspects to be addressed are the organisation's culture and the skills of its workforce. Organisations need to become data-centric, and their people must adopt a data mindset if they are to take full advantage of the value of their data.
They must also look at the skills within the workforce and ensure that everyone has basic data skills, so that the organisation is not dependent on the IT function for business insights from its data.
Liam Cotter is a Partner at KPMG and Niall Duggan is a Director at KPMG