Planning Your Content Inventory: A Technical Quick Guide

June 1, 2021 Marc Salvatierra

Your content inventory is typically a precursor to a content audit, migration or site redesign. A well-scoped and finely-tuned content inventory lays the foundation for a more meaningful audit, a smoother migration and a successful site relaunch.

Image: Ethan Hunt goes into the vault in Mission: Impossible

I take a consistent, exhaustive and forensic approach to any content inventory, with the aim of making it efficient, comprehensive and repeatable. While delivering the catalog of known content, the inventory can also reveal previously unknown and dissociated content.

This technical quick guide is offered for use in planning your own content inventory. It consists of five phases: Familiarization, Preparation, Determination, Execution and Delivery. Where relevant, additional explanation is provided for given steps.

Familiarization

You want to establish the context for the content inventory in terms of the organization, the project and its business goals. A baseline level of domain knowledge is necessary before cataloguing the content holdings of a website or other web property.

For example: What problems will the inventory solve? Is this inventory for specific site sections, an entire site, or a collection of sites? Is this a one-off content inventory or a recurring inventory? Will the results be used for a wholesale redesign, a lift-and-shift, or migration into another system?

With the proper context established, you can align the remaining phases accordingly and perform an inventory that supports the project’s goals.

The Steps

Assemble an overview of:

the organization
its mission
existing taxonomies
existing ontologies
the project
- its scope
- charter
- roadmap
- epic and story backlog if available
inventory deliverables expected
- intended use for the results
- timeframe for delivery
- past work, if applicable

Preparation

This is your technical discovery phase. Its purpose and benefit are to identify all pertinent details of an organization’s technical ecosystem that will yield value in the content inventory and that will determine how you conduct the inventory itself.

By proceeding methodically, you can reduce the need for rerunning the inventory and the risk of missing important details.

The mechanism here is a checklist to interrogate every aspect of the technical ecosystem for its relevance to and influence upon the content inventory. These questions must establish what content exists, in which content types or other forms it is constituted, how it is supported and maintained, what specific content among it is necessary to inventory, what systemic factors might affect or detract from the inventory, and the best technical approach to achieve a unified inventory.

This phase will simultaneously expose and exclude requirements for the inventory. Learning what you will not need to do, or at least which elements you need to steer around, is as important as learning what you need to do.

Importantly, you will also arrive at a sense of who you ultimately need to involve in the inventory from across various teams, e.g. database admins, taxonomists, subject matter experts, project leads and system engineers.

The Steps

Systematically define the scope of the inventory and incorporate any technical considerations, including:

sites
- domains
- subdomains
external resources
- SlideShare presentations
- SoundCloud audio
- YouTube video
Content Delivery Network (CDN) resources
- behind a controlled Canonical Name (CNAME)
- on an external domain such as Amazon CloudFront, Box.com, Google Cloud
underlying components – a modern site can have one front-end and several back-ends
- applications and platforms, such as Drupal, Alfresco, dotCMS or other legacy CMS or DMS
- services, such as Varnish caching, Okta authentication, or database replication
- systems, such as the Kafka messaging queue or an F5 load balancer
databases
- relational
- NoSQL
Universally Unique Identifiers (UUIDs) in use – vital for tracing content back to the system in which it resides
- internal, such as Drupal node IDs
- external, typically exposed in page markup or URLs
managed resources – administered through a database-driven system
- published and linked, e.g. Drupal nodes
- published but unlinked
- draft, unpublished or never published
unmanaged resources – accessed directly through an interface, the Command Line Interface (CLI) or a Secure File Transfer Protocol (SFTP) client
- published and linked, e.g. bare files on a server
- published but unlinked
- draft, unpublished, or never published
aggregations
- item listings, e.g. Announcements
- category groupings
- paginated displays
dark content – mystery resources that were forgotten or remained unexposed in
- an administration environment
- file structures
- databases
structured, unstructured and loosely structured content
- existing content types and templates
- standalone files and pages
- flat file stores
- wikis
known or inferred content types based on
- documented definitions
- menu placement
- titles
- URL patterns
- associations
- recurring indicators such as markup or keywords
dynamic content that has a changing or mixed state depending on context
- personalization
- location
- language
- access level
access-protection rules
- profile pages
- secure or private content
- subscription-based content
workflows – states through which the content you are inventorying may concurrently be transitioning
- draft
- awaiting approval
- published
- unpublished
- archived
- deleted
retention policies
- previous versions
- unpublished items
data in content
- tabular data
- data attributes
sensitive information within content
- Personally Identifiable Information (PII)
- information covered under the EU’s General Data Protection Regulation (GDPR)
mutability
- what content is considered locked, archival or evergreen
- what content remains subject to edits and updates
HTML rendering
- old-school, server-rendered pages
- Single Page Application
file serving
- inline, displayed in browser, if applicable
- always download
file system case sensitivity
- case-sensitive
- case-insensitive
URL and naming conventions
- directories and subdirectories
- page slugs
- file names
redirection patterns
- pages
- files
date conventions
- single or multiple
- ISO conventions in use
- use of official or legal dates
- use of publication dates
- use of last modified dates
multilingual conventions
- is there a primary, official, or authoritative language
- what other languages are supported
- how comprehensive is language support
- what prefixes, suffixes, or language-specific domains are used to denote language
- are translations correlated via a menu scheme, URL naming convention, etc.
- or are translations standalone and disparate, unlinked among each other
exclusions
- JavaScript files
- CSS files
- extensions or MIME types
- paths
- protocols
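
The exclusions at the end of this checklist can be captured as explicit rules, so they are applied the same way on every run. The following is a minimal sketch in Python, not a definitive implementation. The rule values (`EXCLUDED_EXTENSIONS`, `EXCLUDED_PATH_PREFIXES`, `ALLOWED_SCHEMES`) and the function name `exclusion_reason` are hypothetical placeholders for whatever your own scoping produces.

```python
from pathlib import PurePosixPath
from typing import Optional
from urllib.parse import urlsplit

# Hypothetical rules for illustration only; replace with your own scope.
EXCLUDED_EXTENSIONS = {".js", ".css"}
EXCLUDED_PATH_PREFIXES = ("/admin/", "/search/")
ALLOWED_SCHEMES = {"http", "https"}


def exclusion_reason(url: str) -> Optional[str]:
    """Return why a URL is out of scope, or None if it should be inventoried."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    if scheme not in ALLOWED_SCHEMES:
        return f"protocol:{scheme}"
    # Compare extensions case-insensitively so "site.CSS" is caught too.
    extension = PurePosixPath(parts.path).suffix.lower()
    if extension in EXCLUDED_EXTENSIONS:
        return f"extension:{extension}"
    for prefix in EXCLUDED_PATH_PREFIXES:
        if parts.path.startswith(prefix):
            return f"path:{prefix}"
    return None


if __name__ == "__main__":
    for candidate in (
        "https://example.org/about",
        "https://example.org/assets/site.CSS",
        "mailto:info@example.org",
    ):
        print(candidate, "->", exclusion_reason(candidate))
```

Returning a reason rather than a plain yes or no keeps both options open later: excluded resources can be dropped, or gathered and marked as excluded along with the rule that matched.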

Determination

Before executing the actual inventory, a few key determinations remain. These judgments drive the implementation details of the inventory and its output.

Upon completion of this phase, you will be better able to gauge the final level of effort and to estimate how much time and money the inventory will require.

The Steps

Make the final necessary determinations:

target environment(s)
- use the production site
- or a cloned instance
where an audit will take place
- within current or legacy system
- in a new, different system
format
- what is required for an Extract, Transform, Load (ETL) operation into an audit environment
- what is required for a system-to-system integration, e.g. endpoints
content capture
- full content capture
- via a referring URL
metadata schema – all columns which will contain information delivered in the inventory
- Title
- Subtitle
- Official Date
- Published Date
- Last Updated
- Language
- Source URL
- Migration URL
- URL Alias(es)
- Format
- Content Type – used in back-end management
- Presentation Type – used in front-end display
- Migration Type – used in extraction and migration
exclusions
- gather excluded resources but mark them as excluded
- do not gather excluded resources
redirections
- don’t capture redirections and use only the final destination
- capture redirection chains
automation
- run the inventory using manual steps
- script the inventory and use an orchestrated process
syncing
- institute a service to reflect new content and updates to content already captured by the inventory
- enforce a content freeze and perform subsequent intermittent catchup re-inventories
titling for binaries
- extract from a wrapper page
- extract from internal metadata
- extract from file content
- extract from referring link titles or link text
enhancements – are any enhancements desired to metadata as content is inventoried
- clean up titles by removing delimited portions, e.g. before or after “-” or “|”
- normalize dates
- add metadata that can be inferred from patterns in titles, URLs, associations and other recurring indicators such as markup structure or SEO keywords
timestamps
- do individual inventoried items need to reflect the specific date and time captured
- or can the overall inventory be given a timestamp
- or is content considered evergreen
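
One way to pin down the metadata schema before execution is to write it out as a record type. The following is a sketch in Python under stated assumptions: the class name `InventoryRecord`, the helper names, and the two extra columns `excluded` and `redirect_chain` are illustrative choices reflecting the exclusion and redirection options above, not a prescribed format.

```python
import csv
import io
from dataclasses import asdict, dataclass, field, fields
from typing import Dict, Iterable, List, TextIO


@dataclass
class InventoryRecord:
    # Columns taken from the metadata schema listed above.
    title: str = ""
    subtitle: str = ""
    official_date: str = ""
    published_date: str = ""
    last_updated: str = ""
    language: str = ""
    source_url: str = ""
    migration_url: str = ""
    url_aliases: List[str] = field(default_factory=list)
    format: str = ""
    content_type: str = ""       # used in back-end management
    presentation_type: str = ""  # used in front-end display
    migration_type: str = ""     # used in extraction and migration
    # Assumed extras, only needed if you mark exclusions or keep chains.
    excluded: bool = False
    redirect_chain: List[str] = field(default_factory=list)


def to_row(record: InventoryRecord) -> Dict[str, str]:
    """Flatten list and boolean fields so each record fits one CSV row."""
    row = asdict(record)
    row["url_aliases"] = " | ".join(record.url_aliases)
    row["redirect_chain"] = " > ".join(record.redirect_chain)
    row["excluded"] = "yes" if record.excluded else "no"
    return row


def write_csv(records: Iterable[InventoryRecord], handle: TextIO) -> None:
    """Write a header row followed by one row per record."""
    names = [f.name for f in fields(InventoryRecord)]
    writer = csv.DictWriter(handle, fieldnames=names)
    writer.writeheader()
    for record in records:
        writer.writerow(to_row(record))


if __name__ == "__main__":
    buffer = io.StringIO()
    sample = InventoryRecord(title="Annual Report",
                             source_url="https://example.org/report",
                             url_aliases=["/report", "/ar"])
    write_csv([sample], buffer)
    print(buffer.getvalue())
```

Fixing the column list in one place means that manual and scripted runs produce the same header, which simplifies a later load into the audit environment.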

Execution

Having done your research, analysis and consideration in the preceding phases, you are now ready to push the button to execute your inventory, though depending on your requirements you will likely be pushing several buttons at least a few times.

You may be performing a one-time set of manual steps with basic tools to carry out a simpler inventory. Alternatively, if you have a complex content set and sufficient software engineering talent available, you may be orchestrating a recurring, automated inventory process.

The Steps

Execute the inventory:

front-end crawl with Screaming Frog, Wget or another crawler
- note that authenticating into a production system is inadvisable, so consider guardrails
query admin interface content listings – managed objects like Drupal nodes
UNIX “find” through file system directory structures
merge results
ensure proper encoding in recorded values
- multilingual
- special characters
remove trailing slashes
check for 404s
- resolve case-sensitivity issues
de-duplicate
force redirections
- repeat
consolidate results into CSV or other interim format
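
The merge, trailing-slash and de-duplication steps above could be scripted along the following lines. This is a minimal Python sketch, assuming each source (crawl, admin listing, file system) has already been reduced to a list of absolute URLs; the function names and source labels are hypothetical. Path case is deliberately left alone, because whether two paths differing only by case are the same resource depends on the server's file system.

```python
from typing import Dict, Iterable, Set
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Lowercase scheme and host, drop the fragment, remove trailing slashes."""
    parts = urlsplit(url.strip())
    # Keep a single "/" for the site root so it never becomes an empty path.
    path = parts.path.rstrip("/") or "/"
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


def merge_sources(sources: Dict[str, Iterable[str]]) -> Dict[str, Set[str]]:
    """Map each normalized URL to the set of sources in which it was found."""
    merged: Dict[str, Set[str]] = {}
    for source_name, urls in sources.items():
        for url in urls:
            merged.setdefault(normalize_url(url), set()).add(source_name)
    return merged


if __name__ == "__main__":
    merged = merge_sources({
        "crawl": ["https://Example.org/about/", "https://example.org/news"],
        "filesystem": ["https://example.org/about",
                       "https://example.org/files/old.pdf"],
    })
    for url, found_in in sorted(merged.items()):
        print(url, sorted(found_in))
```

Keeping the set of sources per URL, rather than only a de-duplicated list, has a side benefit: a URL found on the file system but not in the crawl is a candidate for published-but-unlinked content.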

Delivery

Packaging and delivering the inventory offers a final opportunity to polish its results. These are largely presentational steps that make for easier interpretation of the results and loading into an audit system.

If you are running recurring inventories, you should establish a naming convention for the output files that gives you a clear picture when you need to differentiate among accumulating result sets.

Here is a sample format:

[name of the content set]-through-[captured through date YYYY-MM-DD]-as-of-[date delivered YYYY-MM-DD-HHMM-timezone].[file extension]

For example:

minutes-of-proceedings-through-2021-05-14-as-of-2021-05-31-1848-pdt.xlsx
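
For recurring inventories, the file name can be generated rather than typed. The sketch below is one possible Python helper that reproduces the example file name above; the function name `inventory_filename` and its parameters are hypothetical. The timezone label is passed in explicitly, since a naive `datetime` carries no timezone name of its own.

```python
from datetime import date, datetime


def inventory_filename(content_set: str, captured_through: date,
                       delivered: datetime, tz_label: str,
                       extension: str) -> str:
    """Build an output file name following the sample convention."""
    # Turn "Minutes of Proceedings" into "minutes-of-proceedings".
    slug = "-".join(content_set.lower().split())
    return (
        f"{slug}-through-{captured_through:%Y-%m-%d}"
        f"-as-of-{delivered:%Y-%m-%d-%H%M}-{tz_label.lower()}"
        f".{extension.lstrip('.')}"
    )


if __name__ == "__main__":
    print(inventory_filename("Minutes of Proceedings",
                             date(2021, 5, 14),
                             datetime(2021, 5, 31, 18, 48),
                             "PDT", "xlsx"))
```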

The Steps

Package and deliver the inventoried content:

normalize data
- reconcile competing date formats
- enforce title conventions, e.g. remove suffixed “ – Organization Name” etc.
- remove initial and trailing whitespace in values for title, subtitle and other text fields
tokenize special characters, depending on how the inventory results will be processed
- commas
- quotation marks
indicate aliases where multiple URLs refer to the same content
set default sort order and subsort orders
package results into XLSX, XML, JSON, or other final format
store final inventory package in a redundant location
load inventory results into audit environment
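
The normalization steps above lend themselves to small, testable functions. The following Python sketch shows one possible approach under stated assumptions: the list of accepted date formats (`DATE_FORMATS`) is hypothetical and treats slash-separated dates as month first, and standard CSV quoting stands in for whatever tokenization your downstream process actually requires.

```python
import csv
import io
import re
from datetime import datetime
from typing import Iterable

# Assumed input formats; extend or reorder to match what the site uses.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%B %d, %Y", "%d %B %Y")


def normalize_date(value: str) -> str:
    """Return an ISO 8601 date, or the trimmed input if no format matches."""
    text = value.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return text  # leave unrecognized values for manual review


def clean_title(title: str, org_name: str) -> str:
    """Trim whitespace and remove a trailing "- Org", "– Org" or "| Org"."""
    pattern = r"\s*[-–|]\s*" + re.escape(org_name) + r"\s*$"
    return re.sub(pattern, "", title.strip()).strip()


def to_csv_line(values: Iterable[str]) -> str:
    """Quote commas and quotation marks using standard CSV rules."""
    buffer = io.StringIO()
    csv.writer(buffer).writerow(values)
    return buffer.getvalue().rstrip("\r\n")


if __name__ == "__main__":
    print(clean_title("  Annual Report 2020 – Example Org ", "Example Org"))
    print(normalize_date("May 14, 2021"))
    print(to_csv_line(['Budget, "Final" Draft', "2021-05-14"]))
```

Returning unrecognized dates unchanged, rather than guessing, keeps ambiguous values visible so they can be reconciled by hand before the results are loaded into the audit environment.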