Planning Your Content Inventory: A Technical Quick Guide
Your content inventory is typically a precursor to a content audit, migration or site redesign. A well-scoped and finely tuned content inventory lays the foundation for a more meaningful audit, a smoother migration and a successful site relaunch.
I take a consistent, exhaustive and forensic approach to any content inventory, with the aim of making it efficient, comprehensive and repeatable. While delivering the catalog of known content, the inventory can also reveal previously unknown and dissociated content.
This technical quick guide is offered for use in planning your own content inventory. It consists of five phases: Familiarization, Preparation, Determination, Execution and Delivery. Where relevant, additional explanation is provided for given steps.
Familiarization
You want to establish the context for the content inventory in terms of the organization, the project and its business goals. A baseline level of domain knowledge is necessary before cataloguing the content holdings of a website or other web property.
For example: What problems will the inventory solve? Is this inventory for specific site sections, an entire site, or a collection of sites? Is this a one-off content inventory or a recurring inventory? Will the results be used for a wholesale redesign, a lift-and-shift, or migration into another system?
With the proper context established, you can align the remaining phases accordingly and perform an inventory that supports the project’s goals.
The Steps
Assemble an overview of:
the organization
its mission
existing taxonomies
existing ontologies
the project
its scope
charter
roadmap
epic and story backlog if available
inventory deliverables expected
intended use for the results
timeframe for delivery
past work, if applicable
Preparation
This is your technical discovery phase. Its purpose and benefit are to identify all pertinent details of an organization’s technical ecosystem that will yield value in the content inventory and that will determine how you conduct the inventory itself.
By proceeding methodically, you can reduce the need for rerunning the inventory and the risk of missing important details.
The mechanism here is a checklist to interrogate every aspect of the technical ecosystem for its relevance to and influence upon the content inventory. These questions must establish what content exists, in which content types or other forms it is constituted, how it is supported and maintained, what specific content among it is necessary to inventory, what systemic factors might affect or detract from the inventory, and the best technical approach to achieve a unified inventory.
This phase will simultaneously expose and exclude requirements for the inventory. As important as learning what you need to do, you will emerge knowing what work you will not need to do, or at least what elements you need to steer around.
Importantly, you will also arrive at a sense of who you ultimately need to involve in the inventory from across various teams, e.g. database admins, taxonomists, subject matter experts, project leads and system engineers.
The Steps
Systematically define the scope of the inventory and incorporate any technical considerations, including:
sites
domains
subdomains
external resources
SlideShare presentations
SoundCloud audio
YouTube video
Content Delivery Network (CDN) resources
behind a controlled Canonical Name (CNAME)
on an external domain such as Amazon CloudFront, Box.com, Google Cloud
underlying components – a modern site can have one front-end and several back-ends
applications and platforms, such as Drupal, Alfresco, dotCMS or other legacy CMS or DMS
services, such as Varnish caching, Okta authentication, or database replication
systems, such as the Kafka messaging queue or an F5 load balancer
databases
relational
NoSQL
Universally Unique Identifiers (UUIDs) in use – vital for tracing content back to the system in which it resides
internal, such as Drupal node IDs
external, typically exposed in page markup or URLs
managed resources – administered through a database-driven system
published and linked, e.g. Drupal nodes
published but unlinked
draft, unpublished or never published
unmanaged resources – accessed directly through an interface, via Command Line Interface (CLI) or via an SSH File Transfer Protocol (SFTP) client
published and linked, e.g. bare files on a server
published but unlinked
draft, unpublished, or never published
aggregations
item listings, e.g. Announcements
category groupings
paginated displays
dark content – mystery resources that were forgotten or remained unexposed in
an administration environment
file structures
databases
structured, unstructured and loosely structured content
existing content types and templates
standalone files and pages
flat file stores
wikis
known or inferred content types based on
documented definitions
menu placement
titles
URL patterns
associations
recurring indicators such as markup or keywords
dynamic content that has a changing or mixed state depending on context
personalization
location
language
access level
access-protection rules
profile pages
secure or private content
subscription-based content
workflows – states through which the content you are inventorying may concurrently be transitioning
draft
awaiting approval
published
unpublished
archived
deleted
retention policies
previous versions
unpublished items
data in content
tabular data
data attributes
sensitive information within content
Personally Identifiable Information (PII)
information covered under the EU’s General Data Protection Regulation (GDPR)
mutability
what content is considered locked, archival or evergreen
what content remains subject to edits and updates
HTML rendering
old-school
Single Page Application
file serving
inline, displayed in browser, if applicable
always download
file system case sensitivity
case-sensitive
case-insensitive
URL and naming conventions
directories and subdirectories
page slugs
file names
redirection patterns
pages
files
date conventions
single or multiple
ISO conventions in use
use of official or legal dates
use of publication dates
use of last modified dates
multilingual conventions
is there a primary, official, or authoritative language
what other languages are supported
how comprehensive is language support
what prefixes, suffixes, or language-specific domains are used to denote language
are translations correlated via a menu scheme, URL naming convention, etc.
or are translations standalone and disparate, unlinked among each other
exclusions
JavaScript files
CSS files
extensions or MIME types
paths
protocols
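The exclusion rules at the end of this checklist can be captured as a small filter that every discovered URL passes through. The following is a minimal sketch in Python, not a definitive implementation: the extension set, the path prefixes and the allowed schemes are all assumptions chosen for illustration, and the function name is_excluded is hypothetical.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit

# Hypothetical exclusion rules; replace them with the scope you defined.
EXCLUDED_EXTENSIONS = {".js", ".css"}
EXCLUDED_PATH_PREFIXES = ("/admin/", "/search/")
ALLOWED_SCHEMES = {"http", "https"}


def is_excluded(url: str) -> bool:
    """Return True if the URL matches any exclusion rule."""
    parts = urlsplit(url)
    # Protocol rule: anything other than http(s), such as mailto:, is out of scope.
    if parts.scheme.lower() not in ALLOWED_SCHEMES:
        return True
    path = parts.path or "/"
    # Path rule: skip whole sections of the site by prefix.
    if path.startswith(EXCLUDED_PATH_PREFIXES):
        return True
    # Extension rule: compare case-insensitively so "MAIN.CSS" is also caught.
    return PurePosixPath(path).suffix.lower() in EXCLUDED_EXTENSIONS
```

If you decide later to gather excluded resources and merely mark them, the same function can populate a flag column rather than drop the row.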
Determination
Before executing the actual inventory, a few key determinations remain. These judgments drive the implementation details of the inventory and its output.
Upon completion of this phase, you will be better able to gauge a final level of effort and to get a general idea of how much time and money the inventory will require.
The Steps
Make the final necessary determinations:
target environment(s)
use the production site
or a cloned instance
where an audit will take place
within current or legacy system
in a new, different system
format
what is required for an Extract, Transform, Load (ETL) operation into an audit environment
what is required for a system-to-system integration, e.g. endpoints
content capture
full content capture
via a referring URL
metadata schema – all columns which will contain information delivered in the inventory
Title
Subtitle
Official Date
Published Date
Last Updated
Language
Source URL
Migration URL
URL Alias(es)
Format
Content Type – used in back-end management
Presentation Type – used in front-end display
Migration Type – used in extraction and migration
exclusions
gather excluded resources but mark them as excluded
do not gather excluded resources
redirections
don’t capture redirections and use only the final destination
capture redirection chains
automation
run the inventory using manual steps
script the inventory and use an orchestrated process
syncing
institute a service to reflect new content and updates to content already captured by the inventory
enforce a content freeze and perform subsequent intermittent catchup re-inventories
titling for binaries
extract from a wrapper page
extract from internal metadata
extract from file content
extract from referring link titles or link text
enhancements – are any enhancements desired to metadata as content is inventoried
clean up titles by removing delimited portions, e.g. before or after “-” or “|”
normalize dates
add metadata that can be inferred from patterns in titles, URLs, associations or other recurring indicators such as markup structure or SEO keywords
timestamps
do individual inventoried items need to reflect the specific date and time captured
or can the overall inventory be given a timestamp
or is content considered evergreen
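Once the metadata schema is settled, it helps to pin it down in one place so that every capture method emits the same columns. Below is a minimal sketch in Python that mirrors the schema columns listed above. The snake_case field names, the use of ISO 8601 strings for dates, the "|" separator for multiple aliases and the extra excluded flag are assumptions made for illustration.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields
from typing import Iterable


@dataclass
class InventoryRecord:
    """One inventory row; fields mirror the schema columns listed above."""
    title: str = ""
    subtitle: str = ""
    official_date: str = ""      # dates held as ISO 8601 strings (assumption)
    published_date: str = ""
    last_updated: str = ""
    language: str = ""
    source_url: str = ""
    migration_url: str = ""
    url_aliases: str = ""        # multiple aliases joined with "|" (assumption)
    format: str = ""
    content_type: str = ""       # used in back-end management
    presentation_type: str = ""  # used in front-end display
    migration_type: str = ""     # used in extraction and migration
    excluded: bool = False       # hypothetical flag, for "gather but mark" exclusions


def to_csv(records: Iterable[InventoryRecord]) -> str:
    """Serialize records to CSV text with a header row derived from the schema."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[f.name for f in fields(InventoryRecord)])
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))
    return buffer.getvalue()
```

Deriving the header from the dataclass means a column added to the schema later appears in the output without any change to the writer.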
Execution
Having done your research, analysis and consideration in the preceding phases, you are now ready to push the button to execute your inventory. Depending on your requirements, though, you will likely be pushing several buttons at least a few times.
You may be performing a one-time set of manual steps with basic tools to carry out a simpler inventory. Alternatively, if you have a complex content set and sufficient software engineering talent available, you may be orchestrating a recurring, automated inventory process.
The Steps
Execute the inventory:
front-end crawl with Screaming Frog, Wget or another crawler
note that authenticating into a production system is inadvisable, so consider guardrails
query admin interface content listings – managed objects like Drupal nodes
UNIX “find” through file system directory structures
merge results
ensure proper encoding in recorded values
multilingual
special characters
remove trailing slashes
check for 404s
resolve case-sensitivity issues
de-duplicate
force redirections
repeat
consolidate results into CSV or other interim format
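The merge, trailing-slash, case-sensitivity and de-duplication steps above can be combined into a single normalization pass. Here is a minimal sketch in Python, assuming each source (crawl, admin listing, file system scan) has already been reduced to a list of absolute URLs. The function names and the case_sensitive_paths switch are hypothetical, and checking for 404s or following redirections is deliberately left out because it requires network access.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str, case_sensitive_paths: bool = True) -> str:
    """Return a canonical form of the URL for comparison and de-duplication."""
    parts = urlsplit(url.strip())
    # Remove trailing slashes, but keep the bare root path as "/".
    path = parts.path.rstrip("/") or "/"
    if not case_sensitive_paths:
        # Only appropriate when the server's file system ignores case.
        path = path.lower()
    # Scheme and host are always case-insensitive; the fragment is dropped.
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def merge_sources(*sources, case_sensitive_paths: bool = True) -> list:
    """Merge URL lists from several sources, de-duplicated in first-seen order."""
    merged = {}
    for source in sources:
        for url in source:
            merged.setdefault(normalize_url(url, case_sensitive_paths), None)
    return list(merged)
```

Whether to fold path case depends on the file system determination made during Preparation: on a case-sensitive server, Report.pdf and report.pdf may be two different files and should remain separate rows.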
Delivery
Packaging and delivering the inventory offers a final opportunity to polish its results. These are largely presentational steps that make for easier interpretation of the results and loading into an audit system.
If you are running recurring inventories, you should establish a naming convention for the output files that makes it easy to differentiate among accumulating result sets.
Here is a sample format:
[name of the content set]-through-[captured-through date YYYY-MM-DD]-as-of-[date delivered YYYY-MM-DD-HHMM-timezone].[file extension]
For example:
minutes-of-proceedings-through-2021-05-14-as-of-2021-05-31-1848-pdt.xlsx
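For recurring inventories, generating the file name in code keeps the convention consistent from run to run. The sketch below is one possible helper in Python. The function name and parameters are hypothetical, and the literal words "through" and "as-of" are taken from the example above.

```python
from datetime import date, datetime


def inventory_filename(content_set: str, captured_through: date,
                       delivered: datetime, tz_label: str, extension: str) -> str:
    """Build an output file name following the sample convention above."""
    # Lowercase the content set name and join its words with hyphens.
    slug = "-".join(content_set.lower().split())
    return (
        f"{slug}-through-{captured_through:%Y-%m-%d}"
        f"-as-of-{delivered:%Y-%m-%d-%H%M}-{tz_label.lower()}"
        f".{extension.lstrip('.')}"
    )


name = inventory_filename(
    "Minutes of Proceedings",
    date(2021, 5, 14),
    datetime(2021, 5, 31, 18, 48),
    "PDT",
    "xlsx",
)
# name == "minutes-of-proceedings-through-2021-05-14-as-of-2021-05-31-1848-pdt.xlsx"
```

The timezone label is passed in explicitly rather than derived from the datetime, which keeps the sketch independent of the machine's locale settings.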
The Steps
Package and deliver the inventoried content:
normalize data
reconcile competing date formats
enforce title conventions, e.g. remove suffixed “ – Organization Name” etc.
remove initial and trailing whitespace in values for title, subtitle and other text fields
tokenize special characters, depending on how the inventory results will be processed
commas
quotation marks
indicate aliases where multiple URLs refer to the same content
set default sort order and subsort orders
package results into XLSX, XML, JSON, or other final format
store final inventory package in a redundant location
load inventory results into audit environment
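The normalization steps above lend themselves to small, testable functions. The following is a minimal sketch in Python covering date reconciliation, whitespace cleanup and removal of a title suffix. The list of known date formats and the default suffix are assumptions for illustration. Note in particular that a value such as 05/06/2021 is ambiguous between day-first and month-first order, so the list must reflect what your sources actually use.

```python
import re
from datetime import datetime

# Date formats assumed to occur in the raw inventory; extend to match your sources.
KNOWN_DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%B %d, %Y")


def normalize_date(value: str) -> str:
    """Return the date as YYYY-MM-DD, or the trimmed input if it is not recognized."""
    text = value.strip()
    for fmt in KNOWN_DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    # Leave unrecognized values untouched so they can be reviewed by hand.
    return text


def normalize_title(value: str, suffix: str = "Organization Name") -> str:
    """Trim and collapse whitespace, then drop a trailing ' - suffix' or ' – suffix'."""
    text = " ".join(value.split())
    pattern = r"\s*[-\u2013]\s*" + re.escape(suffix) + r"$"
    return re.sub(pattern, "", text)
```

Commas and quotation marks usually need no manual tokenizing if the final package is written with a CSV, XLSX or JSON library, since those handle quoting and escaping themselves.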