Planning Your Content Audit: A Technical Quick Guide
Note
Before performing a content audit, you will need to undertake a content inventory. The inventory should supply a comprehensive catalog of your content, enriched with whatever metadata can be identified explicitly or inferred programmatically. The content audit itself then builds on the results of that inventory.
Overview
A content audit is an archival, editorial, taxonomic or legal examination undertaken to make key identifications and determinations about the meaning and fate of an organization’s content, or a subset of it, and in doing so to apply a level of metadata and a sense of arrangement to the content in service of business goals. Often, the shorthand for a content audit is “tagging”.
While valuable, simple tagging alone provides only a flattened picture of content. A good content audit has many more dimensions. Its scope should be broad and deep, addressing semantic representations, intricate relationships and shifting states within content ecosystems, from an individual piece of content up to the larger set in which it exists.
At its essence, a sound content audit is a holistic exercise: it should both reflect existing understanding and yield new realizations about individual content items, the overall content set and the organization itself.
Audit Purpose
There are several reasons why an organization would undertake the time and expense of a content audit:
loading content into a CMS or DMS for the first time
migrating content among legacy systems using ETL operations
preparing an ontology, taxonomy, content model or knowledgebase
redesigning and relaunching a website
optimizing content for search engines (SEO)
transitioning to a new brand, new ownership, new product line, etc.
fulfilling obligations for legal discovery
meeting GDPR or other compliance requirements
preserving archives and maintaining historical records
preparing training data for supervised machine learning
Audit Outcomes
There are several outcomes that a successful content audit can determine, many of which are not mutually exclusive:
characterize or recharacterize content through metadata and taxonomy
relate existing content through structural information
retain existing content intact
restore previously archived content
transform content – rewrite it, recreate it, combine it, separate it
translate content
relocate content through migration
retire existing content to an archive or otherwise lock or protect it
delete content
identify areas that require creation of new content
Audit Context
While an audit naturally focuses on content, it does not do so in isolation. There are organizational, procedural and technical contexts in which a content audit takes place.
Content auditors are typically subject matter experts on loan to a project from internal departments for a limited time, or specialists brought in from an external vendor for a set duration. Auditors inevitably have varying degrees of familiarity with an organization’s existing content, along with the business processes and supporting systems that handle that content.
Organizationally, content often exists in an ecosystem where different teams author and maintain that content across separate business units and distributed systems. This poses challenges of aligning ownership, subject matter expertise and appropriate permissions.
Procedurally, audits may be performed entirely by humans or might involve machine assistance – artificial intelligence and predictive analysis – to lend speed and thoroughness. Audits may be performed in a condensed timebox or staggered across lengthy project phases.
Technically, there are key systemic elements to a sound content audit. By preparing the right technical foundation for the content audit, you can ensure both that auditors use their time effectively and that the audit produces quality information.
The systems underlying the content audit must reflect and support the many formats in which inventoried content is maintained and surfaced by the organization, the dynamic means by which audit identifications and determinations need to be applied and recorded, and the different ways audit data might be leveraged and repurposed in the future.
Before beginning a content audit, careful discovery and planning are necessary to design and build:
the audit model itself
the accompanying audit schema
a usable and effective audit interface
a process for making updates to the audit model and the audit interface
a storage mechanism for audit identifications and determinations
a strategy for making audit data available to other processes and systems
Audit Model
As a starting point for the content audit, establish whether any organizational ontologies, metadata schemas, taxonomies or data structures exist that can be used as an initial guide for the audit, to give it general points of reference.
Likewise, confirm which properties within these mappings are already in use across existing systems. Commonly these include date, title, author, department, topic and subtopic.
These mappings serve to orient auditors before they begin and to build and reshape the audit model that will inform identifications and determinations about content. The audit model should be expected to transform and grow as the audit itself proceeds. This is natural as auditors discover previously undiscerned categorizations and incorporate new classifications into the larger audit model.
Auditors will need to identify and record three principal types of information about any piece of content: its semantic representations, its state properties and its structural relationships. Effectively, this is what the content means and represents, its temporal evolution and condition, and its place vis-à-vis other items in the larger content set. These categories are outlined below; a sketch of how they might be recorded follows the outline.
Semantic Representations
Metadata values
Taxonomy values
State Properties
Origin, for example: migrated, authored, dynamically generated
System of Record, for example: native system, originating system, target system
Workflow Status, for example: draft, submitted, in review, rejected, accepted, withdrawn, final, superseded
Publication Disposition, for example: unpublished, draft, published, archived, restored, deleted
Audit Determination: relate, retain, restore, transform, translate, relocate, archive, delete, etc.
Structural Relationships
Parents and Children, for example: hierarchical content items that form compound objects
Siblings, for example: items assembled in a node queue or other discretionary ranking
Associations, for example: content resources related through Web hyperlinks
Correlations, for example: translations assembled in a multilingual family
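As a minimal illustration of how these three categories might be recorded together for a single inventoried item, the sketch below uses hypothetical field names that would be replaced by whatever your own audit model defines:

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class AuditRecord:
    """One content item's audit entry; every field name here is illustrative."""
    source_url: str

    # Semantic representations: what the content means and represents
    metadata: Dict[str, str] = field(default_factory=dict)    # e.g. {"Topic": "Benefits"}
    taxonomy_terms: List[str] = field(default_factory=list)   # controlled-vocabulary terms
    free_tags: List[str] = field(default_factory=list)        # auditor free tagging

    # State properties: where the item sits in its lifecycle
    origin: str = "authored"           # migrated | authored | dynamically generated
    system_of_record: str = ""         # native, originating or target system
    workflow_status: str = "draft"     # draft, submitted, in review, accepted, ...
    publication_disposition: str = ""  # unpublished, draft, published, archived, ...
    audit_determination: str = ""      # relate, retain, restore, transform, ...

    # Structural relationships: the item's place among other items
    parent_ids: List[str] = field(default_factory=list)
    child_ids: List[str] = field(default_factory=list)
    sibling_ids: List[str] = field(default_factory=list)
    related_ids: List[str] = field(default_factory=list)      # associations and correlations
```

Keeping the three categories distinct in storage also makes it easier later to query by state, for example every item marked for relocation, without disturbing semantic or structural data.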
Semantic Representations
Semantic representations of content are captured when auditors apply values from a formal, bounded taxonomy. In practice those values come from several sources: controlled vocabularies; free tagging by auditors; tagging assisted by artificial intelligence or predictive analysis; and programmatic inference of values based on URL patterns, content types, formats or other identifiers.
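As a sketch of the programmatic-inference path, the snippet below guesses Topic and Format values from URL patterns and file extensions. The patterns and vocabulary are assumptions standing in for your own controlled values, and auditors would confirm or correct whatever is inferred:

```python
import re
from typing import Dict

# Hypothetical URL patterns mapped to controlled-vocabulary topics
TOPIC_PATTERNS = {
    r"/news/|/press-releases/": "News",
    r"/careers/|/jobs/": "Careers",
    r"/policies/|/legal/": "Policies",
}

FORMAT_BY_EXTENSION = {".pdf": "PDF", ".docx": "Word document", ".html": "Web page", ".mp4": "Video"}

def infer_values(url: str) -> Dict[str, str]:
    """Infer taxonomy values from a URL; auditors confirm or correct the result."""
    inferred: Dict[str, str] = {}
    for pattern, topic in TOPIC_PATTERNS.items():
        if re.search(pattern, url, re.IGNORECASE):
            inferred["Topic"] = topic
            break
    for extension, fmt in FORMAT_BY_EXTENSION.items():
        if url.lower().endswith(extension):
            inferred["Format"] = fmt
            break
    return inferred

print(infer_values("https://example.org/news/2021-annual-report.pdf"))
# {'Topic': 'News', 'Format': 'PDF'}
```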
State Properties
State properties indicate where a piece of content is in its lifecycle and can help inform its fate. For example: Did a given page on a legacy website originate as a dynamically-generated aggregation? If so, that suggests a user story is required either for the development team to recreate the page through a dynamic view on the new website, or for an editorial team to manually author the same content more simply as a new, static page.
A particular challenge of identifying the current state properties of a given content item is that a content inventory typically supplies only a snapshot of the content set.
In large organizations, the systems that support enterprise content are themselves in constant states of upgrade, transition, replacement and retirement, which in turn affects the state of content. Apart from underlying systems, content itself can undergo frequent editorial and manual changes and eventually deviate from what was captured in the original inventory.
The fickle nature of content systems and of content itself suggests two things: once delivered, a content inventory needs to be updated and refreshed regularly; and a content audit should be completed as quickly as possible, in alignment with project phases and delivery timelines.
Structural Relationships
The identification of structural relationships within content inventories is one of the most complex and challenging yet enriching aspects of an audit. For example, this could mean weaving together years’ worth of legal cases, parties, filings and documents into a larger, multidimensional mosaic, yielding fresh insights into and fuller understanding of the content.
Uncovering structural relationships by visual inspection and connecting them manually through an interface can prove challenging for an auditor to perform accurately and consistently. Likewise, capturing rich, structural relationships – such as parent containers and their children, or referring links – in an otherwise flat taxonomy can present UX challenges for the actual interface used in the audit.
Documenting structural relationships could meaningfully involve machine-assisted detection and capture of such relationships, as will be discussed toward the end of this post.
A Basic Audit Schema
Source URL:
Source URL Alias:
Migration URL:
Migration URL Alias:
Current Supporting System of Record:
Medium:
Format:
MIME Type:
Migration Format:
DMS Title:
Display Title (contextual):
Display SubTitle:
Displayed Date:
Official / Legal Date:
Published Date:
Last Updated:
Authored versus Dynamic:
Aggregation:
Author:
Content Type:
Content SubType:
Display Type:
Status:
Translated:
Language Version:
Technical / Operational Owner:
Business Department / SME Owner:
Business Team Owner:
Topic:
SubTopic:
Proposed Topic:
Tags:
Metadata Description:
Audience:
Business Location:
Geographic Region:
Access Control Level:
Contains Possible Personal Data:
Evergreen:
Requires Periodic Review:
Content Quality Check:
Content Quality Issues:
Disposition: unpublished, draft, published, archived etc.
Audit Escalation Status:
Audit Determination:
Audit Comment:
Audit Interface
Auditors need a friendly interface within which they can review inventoried content and access the model they will use to apply values to that content. The audit interface is the window through which auditors will identify and record semantic representations, state properties and structural relationships in the content set. It is worthy of significant thinking, UX design and development before the audit begins.
Typically the audit interface reflects mainly controlled vocabularies, perhaps tied to an external, dynamic data source or populated in the audit application itself. However, the interface should also enable free tagging of content items and the tracking of structural relationships among them. Ideally, this interface could be updated in real-time as the audit model evolves and auditors request changes and additions to the model.
At a component level, the audit interface will likely display controls for key-value pairs, tree dependencies and cascades, free-form tagging and object-to-object relationships. Note that if you intend to use a spreadsheet to record audit identifications and determinations, its tabular layout limits how easily such controls can be displayed and applied by an auditor.
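One way to reason about these controls before UX design begins is as a declarative field configuration that the interface renders. The sketch below is purely illustrative; the widget names and vocabularies are assumptions, not a reference to any particular audit tool:

```python
# Hypothetical declarative configuration an audit interface might render.
AUDIT_FIELDS = [
    {"name": "Workflow Status", "widget": "select",        # key-value pair from a controlled vocabulary
     "options": ["draft", "submitted", "in review", "accepted", "final"]},
    {"name": "Topic / SubTopic", "widget": "cascade",      # tree dependency: subtopics depend on the topic
     "tree": {"Benefits": ["Medical", "Dental"], "Careers": ["Openings", "Internships"]}},
    {"name": "Tags", "widget": "free_tagging"},            # open-ended auditor tagging
    {"name": "Parent Item", "widget": "entity_reference",  # object-to-object structural relationship
     "target": "content_item"},
]
```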
Evolving the Audit Model
As the audit proceeds and auditors learn more about the content inventory, they will inevitably request additions, removals and changes to the audit model. It is important to have the ability to make these changes quickly and retroactively to avoid creating a bottleneck. Changes made to the audit model by one auditor need to be visible, accessible and assignable through the audit interface immediately by any other auditors on the team.
Audit Storage
It is vital to determine where and how content inventory data will be stored and where content audit identifications and determinations will be recorded. These are typically the same location, though they need not be. The default, quick solution for audit teams is often an Excel spreadsheet on a shared drive, or a Google Sheet with mutual access.
Cloud database services such as Airtable are also an option. Be mindful that whatever records the audit identifications and determinations will eventually need to serve as a data source. A storage solution that enables rapid export to CSV or conversion to XML is strongly preferred. Rapid access to subsets of audit data will also be necessary for content migrations and ETL operations.
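As a sketch of why rapid export matters, the snippet below assumes audit records sit in a hypothetical SQLite table and pulls the subset marked for relocation out to CSV for a downstream migration or ETL step:

```python
import csv
import sqlite3

# Assumes a hypothetical audit.db whose content_audit table was populated during the audit.
connection = sqlite3.connect("audit.db")
cursor = connection.execute(
    "SELECT source_url, content_type, audit_determination "
    "FROM content_audit WHERE audit_determination = ?",
    ("relocate",),
)

with open("migration_batch.csv", "w", newline="", encoding="utf-8") as handle:
    writer = csv.writer(handle)
    writer.writerow(["source_url", "content_type", "audit_determination"])
    writer.writerows(cursor)  # stream the query results straight into the CSV

connection.close()
```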
Refreshing the Inventory
While the content audit is under way, life goes on in the rest of the website as content is added and changed. Auditors will need a way to keep the content inventory already under audit updated to reflect both newly created content added to the site since the original inventory was delivered and changes to content that was included in the original inventory.
Manually detecting, tracking and ingesting new content and changes to previously known content presents a coordination and documentation headache for most teams auditing anything but the simplest content inventories. Consider building a dedicated service to syndicate new content items and sync changes to known content items directly into the inventory on a pre-agreed basis: real-time, daily, weekly etc.
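A minimal sketch of such a sync service, assuming the site exposes a standard XML sitemap and the inventory keeps a content hash per URL; the sitemap address and record fields are hypothetical:

```python
import hashlib
import urllib.request
import xml.etree.ElementTree as ET

SITEMAP_URL = "https://www.example.org/sitemap.xml"  # hypothetical site
NAMESPACE = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def fetch(url: str) -> bytes:
    with urllib.request.urlopen(url) as response:
        return response.read()

def refresh_inventory(inventory: dict) -> None:
    """Add URLs new since the inventory was delivered; flag URLs whose content changed."""
    sitemap = ET.fromstring(fetch(SITEMAP_URL))
    for node in sitemap.findall("sm:url/sm:loc", NAMESPACE):
        url = node.text.strip()
        digest = hashlib.sha256(fetch(url)).hexdigest()
        if url not in inventory:
            inventory[url] = {"hash": digest, "status": "new"}
        elif inventory[url]["hash"] != digest:
            inventory[url].update(hash=digest, status="changed")
```

The same routine could run on whatever cadence the team agrees to, whether on a scheduler for daily or weekly refreshes or behind a webhook for something closer to real time.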
You certainly can opt to keep the content inventory fresh by manual means. However, that entails tedious Web scraping and database extraction procedures, metadata deduplication, and careful packaging of results into the target environment. It will also likely mean offline spreadsheets that are delicate to maintain and prone to error.
Machine Assistance
Auditors are human. They sometimes lose their place, have no idea what they are looking at, make well-intentioned guesses that miss the mark, or just get burned out. Varying levels of machine assistance can enhance the work of auditors by automatically cross-checking work for variations and discrepancies in the interest of consistency and quality.
In two particular content audit use cases, however, the situation may be reversed: scenarios involving optical character recognition and artificial intelligence to process information locked in specific file formats, and electronic discovery to reconstruct opaque or obscured structural relationships in a content set that are not immediately apparent to the human eye.
In these cases it is the machine that may be relied upon principally or exclusively to perform the audit, with the human auditor intervening to check for quality and reliability.
OCR and AI for Specific Formats
When an audit must cover large collections of images and graphics, image-based PDFs, and lengthy audio and video files, the semantic nature of these content items can often be discerned more rapidly by machine means than by painstaking human inspection.
Optical Character Recognition (OCR) and Artificial Intelligence (AI) technologies can facilitate and enrich an audit of files that would otherwise demand a high level of human effort. Auditors can still supervise the machine and review its findings across repeated learning cycles, as sketched after the examples below.
Optical Character Recognition (OCR) – Capture the full text of image-based PDFs and images. Example solutions include Amazon Textract and Google Cloud Vision.
Artificial Intelligence (AI) – Example solutions include Amazon Rekognition for images and videos and Amazon Transcribe and Google Speech-to-Text for audio.
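As an illustration of the OCR path, the sketch below sends one scanned page to Amazon Textract through boto3 and collects the recognized lines of text for later review and tagging. AWS credentials, the file name and error handling are assumed:

```python
import boto3  # assumes AWS credentials and region are configured in the environment

def extract_text(image_path: str) -> list:
    """Run OCR on a single scanned page and return its recognized lines of text."""
    textract = boto3.client("textract")
    with open(image_path, "rb") as handle:
        response = textract.detect_document_text(Document={"Bytes": handle.read()})
    # LINE blocks carry the recognized text in reading order
    return [block["Text"] for block in response["Blocks"] if block["BlockType"] == "LINE"]

lines = extract_text("scanned-page.png")  # hypothetical file
print("\n".join(lines[:10]))
```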
eDiscovery of Structured Relationships
Likewise, when it is necessary to reveal and record intricate, structured relationships within a large content inventory, machines can identify and connect related content items, and characterize the relationships among them, far more quickly than hours of human inspection can.
Electronic Discovery (eDiscovery) technologies used in the legal profession offer possibilities for such scenarios. For example, Relativity can ingest large, batched document sets and then report on and support queries into relationships within the documents, such as keyword proximity and density.
As impressive as their capabilities are, eDiscovery technologies have yet to find widespread application in content audits outside the legal profession, largely due to their high cost of entry, narrow audience and specialized production output.
Unless you are a large legal firm, it could prove challenging to gain access to expensive eDiscovery tools for your audit. As an alternative, you can approximate eDiscovery capabilities by developing applications in Python with the Natural Language Toolkit (NLTK) to serve the same purpose.
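As a rough sketch of that approximation, the snippet below uses NLTK to compute keyword density and the closest proximity between two terms in a document, two of the measures an eDiscovery tool might report. The sample document and keywords are placeholders:

```python
import nltk
from nltk.tokenize import word_tokenize

# Tokenizer models used by word_tokenize; newer NLTK releases may also want "punkt_tab"
nltk.download("punkt", quiet=True)
nltk.download("punkt_tab", quiet=True)

def keyword_density(text: str, keyword: str) -> float:
    """Share of tokens in the document that match the keyword."""
    tokens = [token.lower() for token in word_tokenize(text)]
    return tokens.count(keyword.lower()) / len(tokens) if tokens else 0.0

def closest_proximity(text: str, term_a: str, term_b: str) -> int:
    """Smallest token distance between two terms, or -1 if either is absent."""
    tokens = [token.lower() for token in word_tokenize(text)]
    positions_a = [i for i, token in enumerate(tokens) if token == term_a.lower()]
    positions_b = [i for i, token in enumerate(tokens) if token == term_b.lower()]
    if not positions_a or not positions_b:
        return -1
    return min(abs(a - b) for a in positions_a for b in positions_b)

document = "The plaintiff filed the motion. The court denied the motion filed by the plaintiff."
print(keyword_density(document, "motion"))
print(closest_proximity(document, "plaintiff", "motion"))
```

Scaling this up across thousands of documents, with scoring and reporting on top, is where the real engineering effort lies, which is exactly the cost-benefit question addressed next.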
Cost-Benefit of Incorporating Machine Assistance
Some software procurement, licensing, development or integration would be required to enable machines to assist with peering inside problematic file formats and untangling complex structured relationships. The audit team must weigh the cost of developing machine-assistance capabilities against the human time required to achieve the same audit goals.
Backups
Content inventory data and content audit identifications and determinations would require significant time and expense to recreate if lost or corrupted. Such destruction would set back any project by weeks or months at minimum. Accordingly, ensure that both the content inventory and the audit storage mechanism(s) are backed up regularly, ideally in real time. Full nightly backups are a common configuration and would ensure that at most a day’s work is lost in case of sudden data loss.