Why am I reading this?
This post is addressed to executives and decision makers – whatever their technical background – who may be entering their first AI-transition project, and it warns of the almost certain horrors ahead for those who rely on data that has not been cleansed for the purpose.
Poorly prepared data, if relied upon, threatens to send both a business’s competitiveness and its leaders’ reputations into a nosedive from which neither is likely to recover quickly.
Warnings aside, what follows outlines a path to building data-quality management processes that can deliver the peace of mind earned from a clean, smoothly functioning AI engine – one that just works, and learns, from day one.
This emerging best practice borrows from the expert human quality-control processes already common in everyday journalistic and editorial production.
AI’s greatest challenge lies in its reliance on clean data
The quality of the data used to train an AI engine plays a crucial role in determining its performance and capabilities, and both ChatGPT and Google’s Bard affirm that an AI engine is only as good as the data it is trained on.
Overall, while the architecture and algorithms of an AI engine are essential components, the data used for its training remains key to determining its performance and effectiveness in real-world applications.
In the context of AI, “hallucination” refers to the phenomenon in which a model generates content that is not based on actual data or is significantly different from reality.
A well-trained AI model relies on high-quality, diverse, and relevant data to achieve accurate and reliable results.
To reduce the risk of hallucination, the data used to train an AI engine must be selected and prepared carefully, to ensure it is as accurate and representative as possible.
And whatever data an AI engine might subsequently learn from – ChatGPT itself says “garbage in, garbage out” aptly applies to AI models – getting its initial data platform clean enough to be reliable is a task to which the skills of expert human editors are extremely well suited.
The processes required to manage data quality for effective AI system design correspond in many ways with those used in professional publishing.
Both the cultivation of high-quality AI data and the production of a professional magazine or newspaper involve managing and making sense of large volumes of information – often brought together for the first time – to ensure its legibility, accuracy, completeness and reliability.
Both also involve identifying the target audience, building a clear understanding of its needs, and standardising the output delivered to it, in a planned format against a known schedule.
Remedies to prevent or correct the generation of bad AI data may lie at many levels of thinking and planning across a project.
By comparison, a professional editor’s first job is to ensure that wherever information is presented, its content suits its audience’s needs, makes sense, is factually accurate and is structured logically to aid comprehension.
They ensure consistency of language and correct errors, such as poor grammar or incorrect spelling and punctuation.
They remove ambiguity to clarify the author’s meaning, and they simplify obscure language and bureaucratic, technical or specialist jargon.
They check for any potential legal problems, such as plagiarism, ethical or moral problems, copyright infringements and defamation risks.
And in conducting this work, they must become the very pickiest of proofreaders.
In either AI system design or publishing, there must be detailed communication between stakeholders at every step across its schedule to plan and synchronise the human effort needed to make this happen reliably.
Contributors and data sources must be identified, processes put in place to clean and validate data, and rigorous quality control measures implemented.
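As an illustrative sketch only – the field names and rejection rules here are hypothetical, and a real cleaning pipeline would be far richer – a first-pass validation step of the kind described above might look like:

```python
# A minimal sketch of a data-cleaning pass, assuming records arrive as
# dictionaries with hypothetical "title" and "body" fields.
def clean_records(records):
    """Drop duplicates and records missing required fields; trim whitespace."""
    seen = set()
    cleaned = []
    for record in records:
        title = (record.get("title") or "").strip()
        body = (record.get("body") or "").strip()
        if not title or not body:
            continue  # incomplete record: reject it rather than guess
        key = (title.lower(), body.lower())
        if key in seen:
            continue  # verbatim duplicate of an earlier record
        seen.add(key)
        cleaned.append({"title": title, "body": body})
    return cleaned
```

The design choice mirrors editorial practice: a record that fails a check is set aside for a human to inspect, never silently “repaired”.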
Beyond understanding the audience and its expectations, the metadata used to organise and classify information plays a vital role in both domains.
As in database design, publishing professionals use metadata to select, categorise and describe content within appropriate topics or genres, and to enable discovery and search.
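To make the parallel concrete, a sketch of what such metadata might record for a single training item follows; every field name here is an assumption for illustration, not a standard schema:

```python
# Hypothetical sketch: attaching publishing-style metadata to a training
# record so it can be selected, filtered and audited later.
from dataclasses import dataclass, field
from datetime import date

@dataclass
class CatalogueEntry:
    source: str               # where the record came from
    topic: str                # editorial category, e.g. "finance"
    reviewed_by: str          # the human editor who signed it off
    review_date: date         # when that sign-off happened
    tags: list = field(default_factory=list)  # keywords enabling search

entry = CatalogueEntry(
    source="quarterly-report.pdf",
    topic="finance",
    reviewed_by="j.smith",
    review_date=date(2023, 9, 1),
    tags=["revenue", "audited"],
)
```

Recording who reviewed each item, and when, gives an AI project the same audit trail a publisher keeps for every printed page.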
Alongside this, content must be commissioned and selected, layouts designed, and text edited, fitted and proofread according to strict rules of editorial style.
Throughout, progress must be tracked and documented, areas for improvement identified, and new team members found and trained.
Because they have been so extensively tested and proven, the operating practices refined over centuries of professional publishing can help ensure the quality, accuracy and ethical rigour needed to manage the risks involved in developing AI systems and models.
Publishing discipline typically involves a thorough end-to-end review: a chain of human editors and proofreaders, each checking the inputs of those upstream, to ensure the timely delivery of the published product to the expected quality.
By the time a final editor has pressed the “publish” button, a piece of work will have passed before, and been signed off by, several sets of eyes.
Just as publishers are transparent about their publishing practices, the same process of peer review can be applied to AI. Encouraging experts from various domains to review the models, and documenting the entire pipeline – data collection, preprocessing, model architecture, hyperparameters and training procedures – helps ensure AI systems are developed and deployed safely and responsibly, mitigating potential risks.
And these same working practices of professional publishing can likewise be applied beyond implementation to monitor the performance and test the output of AI models on a systematic, continuing basis.
There may be no single silver bullet for mitigating the risks associated with AI, but constant vigilance means such strategies must repeatedly be re-examined, documented and updated as needed.
Iterate to protect the future
In both publishing and AI system design, user feedback is invaluable for refining the end product and keeping it up to date.
And AI designers must gather enough user sentiment to identify areas for improvement, address biases, and continually find opportunities to enhance the performance of their systems through iterative cycles of training and upgrade.
There are many parallels between the planning processes required to manage data quality for effective AI system design and those used in professional publishing.
And the practices of the latter can be readily overlaid on the former to meet the overall goal: capturing and cleaning accurate, reliable, high-quality data in pursuit of effective AI system design.