Content Aspects

Planning Your Content Inventory: A Technical Quick Guide

Your content inventory is typically a precursor to a content audit, migration or site redesign. A well-scoped and finely tuned content inventory lays the foundation for a more meaningful audit, a smoother migration and a successful site relaunch.

I take a consistent, exhaustive and forensic approach to any content inventory, with the aim of making it efficient, comprehensive and repeatable. While delivering the catalog of known content, the inventory can also reveal previously unknown and dissociated content.

This technical quick guide is offered for use in planning your own content inventory. It consists of five phases: Familiarization, Preparation, Determination, Execution and Delivery. Where relevant, additional explanation is provided for given steps.

Familiarization

You want to establish the context for the content inventory in terms of the organization, the project and its business goals. A baseline level of domain knowledge is necessary before cataloguing the content holdings of a website or other web property.

For example: What problems will the inventory solve? Is this inventory for specific site sections, an entire site, or a collection of sites? Is this a one-off content inventory or a recurring inventory? Will the results be used for a wholesale redesign, a lift-and-shift, or migration into another system?

With the proper context established, you can align the remaining phases accordingly and perform an inventory that supports the project’s goals.

The Steps

Assemble an overview of:

  • the organization

  • its mission

  • existing taxonomies

  • existing ontologies

  • the project

    • its scope

    • charter

    • roadmap

    • epic and story backlog if available

  • inventory deliverables expected

    • intended use for the results

    • timeframe for delivery

    • past work, if applicable 

Preparation 

This is your technical discovery phase. Its purpose and benefit are to identify all pertinent details of an organization’s technical ecosystem that will yield value in the content inventory and that will determine how you conduct the inventory itself.

By proceeding methodically, you can reduce the need for rerunning the inventory and the risk of missing important details.

The mechanism here is a checklist to interrogate every aspect of the technical ecosystem for its relevance to and influence upon the content inventory. These questions must establish what content exists, in which content types or other forms it is constituted, how it is supported and maintained, what specific content among it is necessary to inventory, what systemic factors might affect or detract from the inventory, and the best technical approach to achieve a unified inventory.

This phase will simultaneously expose and exclude requirements for the inventory. As important as learning what you need to do, you will emerge knowing what work you will not need to do – or at least what elements you need to steer around.

Importantly, you will also arrive at a sense of who you ultimately need to involve in the inventory from across various teams, e.g. database admins, taxonomists, subject matter experts, project leads and system engineers.

The Steps

Systematically define the scope of the inventory and incorporate any technical considerations, including:

  • sites

    • domains

    • subdomains

  • external resources

    • SlideShare presentations

    • SoundCloud audio

    • YouTube video

  • Content Delivery Network (CDN) resources

    • behind a controlled Canonical Name (CNAME)

    • on an external domain such as Amazon CloudFront, Box.com, Google Cloud

  • underlying components – a modern site can have one front-end and several back-ends

    • applications and platforms, such as Drupal, Alfresco, dotCMS or other legacy CMS or DMS

    • services, such as Varnish caching, Okta authentication, or database replication

    • systems, such as the Kafka messaging queue or an F5 load balancer

  • databases

    • relational

    • NoSQL

  • Universally Unique Identifiers (UUIDs) in use – vital for tracing content back to the system in which it resides

    • internal, such as Drupal node IDs

    • external, typically exposed in page markup or URLs

  • managed resources – administered through a database-driven system

    • published and linked, e.g. Drupal nodes

    • published but unlinked

    • draft, unpublished or never published

  • unmanaged resources – direct interface access or via Command Line Interface (CLI) or SSH File Transfer Protocol (SFTP) client

    • published and linked, e.g. bare files on the server

    • published but unlinked

    • draft, unpublished, or never published

  • aggregations

    • item listings, e.g. Announcements

    • category groupings

    • paginated displays

  • dark content – mystery resources that were forgotten or remained unexposed in

    • an administration environment

    • file structures

    • databases

  • structured, unstructured and loosely structured content

    • existing content types and templates

    • standalone files and pages

    • flat file stores

    • wikis

  • known or inferred content types based on

    • documented definitions

    • menu placement

    • titles

    • URL patterns

    • associations

    • recurring indicators such as markup or keywords

  • dynamic content that has a changing or mixed state depending on context

    • personalization

    • location

    • language

    • access level

  • access-protection rules

    • profile pages

    • secure or private content

    • subscription-based content

  • workflows – states through which the content you are inventorying may concurrently be transitioning

    • draft

    • awaiting approval

    • published

    • unpublished

    • archived

    • deleted

  • retention policies

    • previous versions

    • unpublished items

  • data in content

    • tabular data

    • data attributes

  • sensitive information within content

    • Personally Identifiable Information (PII)

    • information covered under the EU’s General Data Protection Regulation (GDPR)

  • mutability

    • what content is considered locked, archival or evergreen

    • what content remains subject to edits and updates

  • HTML rendering

    • traditional server-rendered pages

    • Single Page Application

  • file serving

    • inline, displayed in browser, if applicable

    • always download

  • file system case sensitivity

    • case-sensitive

    • case-insensitive

  • URL and naming conventions

    • directories and subdirectories

    • page slugs

    • file names

  • redirection patterns

    • pages

    • files

  • date conventions

    • single or multiple

    • ISO conventions in use

    • use of official or legal dates

    • use of publication dates

    • use of last modified dates

  • multilingual conventions

    • is there a primary, official, or authoritative language

    • what other languages are supported

    • how comprehensive is language support

    • what prefixes, suffixes, or language-specific domains are used to denote language

    • are translations correlated via a menu scheme, URL naming convention, etc.

    • or are translations standalone and disparate, unlinked among each other

  • exclusions

    • JavaScript files

    • CSS files

    • extensions or MIME types

    • paths

    • protocols
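The exclusion rules at the end of this checklist lend themselves to a small, testable function. The following is a minimal sketch in Python, not a definitive implementation: the rule values (`.js`, `.css`, the `/assets/` and `/admin/` prefixes, HTTP and HTTPS as the only allowed protocols) and the function name `exclusion_reason` are illustrative assumptions that you would replace with your own scope decisions.

```python
import posixpath
from typing import Optional
from urllib.parse import urlsplit

# Hypothetical exclusion rules; replace with the scope agreed for your inventory.
EXCLUDED_EXTENSIONS = {".js", ".css"}
EXCLUDED_PATH_PREFIXES = ("/assets/", "/admin/")
ALLOWED_SCHEMES = {"http", "https"}


def exclusion_reason(url: str) -> Optional[str]:
    """Return a short reason if the URL is out of scope, or None if it is in scope."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    if scheme not in ALLOWED_SCHEMES:
        return f"protocol:{scheme}"
    path = parts.path or "/"
    if path.startswith(EXCLUDED_PATH_PREFIXES):
        return "path"
    # Compare extensions case-insensitively so SITE.CSS and site.css match.
    extension = posixpath.splitext(path)[1].lower()
    if extension in EXCLUDED_EXTENSIONS:
        return f"extension:{extension}"
    return None


if __name__ == "__main__":
    for candidate in (
        "https://example.org/about/",
        "https://example.org/theme/site.CSS",
        "mailto:info@example.org",
    ):
        print(candidate, "->", exclusion_reason(candidate))
```

Returning a reason rather than a plain yes or no keeps both options open later: you can drop excluded resources entirely, or gather them and record why they were excluded.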

Determination

Before executing the actual inventory, a few key determinations remain. These judgments drive the implementation details of the inventory and its output.

Upon completion of this phase, you will be better able to gauge a final level of effort and to get a general idea of how much time and money the inventory will require.

The Steps

Make the final necessary determinations:

  • target environment(s)

    • use the production site

    • or a cloned instance

  • where an audit will take place

    • within current or legacy system

    • in a new, different system

  • format

    • what is required for an Extract Transform Load (ETL) operation into an audit environment

    • what is required for a system-to-system integration, e.g. endpoints

  • content capture

    • full content capture

    • via a referring URL

  • metadata schema – all columns which will contain information delivered in the inventory

    • Title

    • Subtitle

    • Official Date

    • Published Date

    • Last Updated

    • Language

    • Source URL

    • Migration URL

    • URL Alias(es)

    • Format

    • Content Type – used in back-end management

    • Presentation Type – used in front-end display

    • Migration Type – used in extraction and migration

  • exclusions

    • gather excluded resources but mark them as excluded

    • do not gather excluded resources

  • redirections

    • don’t capture redirections and use only the final destination

    • capture redirection chains

  • automation

    • run the inventory using manual steps

    • script the inventory and use an orchestrated process

  • syncing

    • institute a service to reflect new content and updates to content already captured by the inventory

    • enforce a content freeze and perform subsequent intermittent catchup re-inventories

  • titling for binaries

    • extract from a wrapper page

    • extract from internal metadata

    • extract from file content

    • extract from referring link titles or link text

  • enhancements – are any enhancements desired to metadata as content is inventoried

    • clean up titles by removing delimited portions, e.g. before or after “-” or “|”

    • normalize dates

    • add metadata that can be inferred based upon patterns in titles, URLs, associations, other recurring indicators such as markup structure or SEO keywords

  • timestamps

    • do individual inventoried items need to reflect the specific date and time captured

    • or can the overall inventory be given a timestamp

    • or is content considered evergreen 
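To make the metadata schema concrete, it can help to write it down as a typed record before the inventory runs. The sketch below is one possible Python rendering of the columns listed above, in the same order. The field names, the extra `excluded` flag (for the "gather but mark as excluded" option) and the `clean_title` helper are assumptions made for illustration, not a prescribed format.

```python
import re
from dataclasses import dataclass, field, fields
from typing import List, Optional


@dataclass
class InventoryRecord:
    """One row of the inventory; field names are illustrative."""
    title: str = ""
    subtitle: Optional[str] = None
    official_date: Optional[str] = None
    published_date: Optional[str] = None
    last_updated: Optional[str] = None
    language: Optional[str] = None
    source_url: str = ""
    migration_url: Optional[str] = None
    url_aliases: List[str] = field(default_factory=list)
    format: Optional[str] = None
    content_type: Optional[str] = None       # used in back-end management
    presentation_type: Optional[str] = None  # used in front-end display
    migration_type: Optional[str] = None     # used in extraction and migration
    excluded: bool = False                   # assumption: gather, but mark as excluded


def column_names() -> List[str]:
    """Column headers for the inventory output, in schema order."""
    return [f.name for f in fields(InventoryRecord)]


# A delimiter only counts when surrounded by whitespace, so hyphenated
# words such as "lift-and-shift" are left intact.
_DELIMITERS = re.compile(r"\s+[-|–]\s+")


def clean_title(raw: str, keep: str = "first") -> str:
    """Keep the portion before (keep="first") or after (keep="last") the delimiters."""
    parts = [p.strip() for p in _DELIMITERS.split(raw.strip()) if p.strip()]
    if not parts:
        return ""
    return parts[0] if keep == "first" else parts[-1]


if __name__ == "__main__":
    record = InventoryRecord(
        title=clean_title("Annual Report 2020 | Example Org"),
        source_url="https://example.org/reports/2020",
    )
    print(column_names())
    print(record.title)
```

Whether the meaningful part of a title sits before or after the delimiter varies by site, which is why the helper takes a `keep` argument instead of guessing.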

Execution

Having done your research, analysis and consideration in the preceding phases, you are now ready to push the button to execute your inventory, though depending on your requirements you will likely be pushing several buttons more than once.

You may be performing a one-time set of manual steps with basic tools to carry out a simpler inventory. Alternatively, if you have a complex content set and sufficient software engineering talent available, you may be orchestrating a recurring, automated inventory process.

The Steps

Execute the inventory:

  • front-end crawl with Screaming Frog, Wget or another crawler

    • note that authenticating into a production system is inadvisable, so consider guardrails

  • query admin interface content listings – managed objects like Drupal nodes

  • UNIX “find” through file system directory structures

  • merge results

  • ensure proper encoding in recorded values

    • multilingual

    • special characters

  • remove trailing slashes

  • check for 404s

    • resolve case-sensitivity issues

  • de-duplicate

  • force redirections

    • repeat

  • consolidate results into CSV or other interim format
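The merge, trailing-slash, de-duplication and consolidation steps above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: each source (crawler, admin listing, file system) has already been reduced to a list of URLs, fragments are dropped, host names are lowercased while paths keep their case, and the column names `source_url` and `found_by` are hypothetical. Checking for 404s and following redirections require network access and are left out.

```python
import csv
import io
from typing import Dict, List
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Lowercase scheme and host, drop the fragment, remove trailing slashes.

    The path keeps its case because the server may be case-sensitive.
    The site root keeps a single "/" so the URL remains valid.
    """
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def merge_sources(sources: Dict[str, List[str]]) -> List[Dict[str, str]]:
    """Merge URL lists from several sources and de-duplicate on the normalized URL."""
    merged: Dict[str, set] = {}
    for source_name, urls in sources.items():
        for url in urls:
            merged.setdefault(normalize_url(url), set()).add(source_name)
    return [
        {"source_url": url, "found_by": ";".join(sorted(found))}
        for url, found in sorted(merged.items())
    ]


def to_csv(rows: List[Dict[str, str]]) -> str:
    """Serialize merged rows to CSV text as an interim format."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["source_url", "found_by"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    rows = merge_sources({
        "crawl": ["https://Example.org/news/", "https://example.org/about"],
        "filesystem": ["https://example.org/news", "https://example.org/files/Report.PDF"],
    })
    print(to_csv(rows))
```

Recording which sources found each URL has a side benefit: a resource found on the file system but not by the crawler is a candidate for published-but-unlinked or dark content.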

Delivery

Packaging and delivering the inventory offers a final opportunity to polish its results. These are largely presentational steps that make for easier interpretation of the results and loading into an audit system.

If you are running recurring inventories, you should establish a naming convention for the output files that will give you a clear picture when you need to differentiate among accumulating result sets.

Here is a sample format:

[name of the content set]-through-[captured through date YYYY-MM-DD]-as-of-[date delivered YYYY-MM-DD-HHMM-timezone].[file extension]

For example:

 minutes-of-proceedings-through-2021-05-14-as-of-2021-05-31-1848-pdt.xlsx
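If the inventory is scripted, the file name can be generated rather than typed, which keeps recurring deliveries consistent. The following Python sketch reproduces the example name above; the function names and the choice to pass the timezone as a plain label are assumptions made for illustration.

```python
import re
from datetime import date, datetime


def slugify(name: str) -> str:
    """Lowercase a name and replace runs of other characters with single hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")


def inventory_filename(
    content_set: str,
    captured_through: date,
    delivered_at: datetime,
    timezone_label: str,
    extension: str,
) -> str:
    """Build an output file name following the sample format."""
    return (
        f"{slugify(content_set)}"
        f"-through-{captured_through:%Y-%m-%d}"
        f"-as-of-{delivered_at:%Y-%m-%d-%H%M}"
        f"-{timezone_label.lower()}"
        f".{extension.lstrip('.').lower()}"
    )


if __name__ == "__main__":
    print(inventory_filename(
        "Minutes of Proceedings",
        date(2021, 5, 14),
        datetime(2021, 5, 31, 18, 48),
        "PDT",
        "xlsx",
    ))
    # prints: minutes-of-proceedings-through-2021-05-14-as-of-2021-05-31-1848-pdt.xlsx
```

Because the dates are written year first, the resulting names sort chronologically within each content set in any file listing.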

The Steps

Package and deliver the inventoried content:

  • normalize data

    • reconcile competing date formats

    • enforce title conventions, e.g. remove suffixed “ – Organization Name” etc.

    • remove initial and trailing whitespace in values for title, subtitle and other text fields

  • tokenize special characters, depending on how the inventory results will be processed

    • commas

    • quotation marks

  • indicate aliases where multiple URLs refer to the same content

  • set default sort order and subsort orders

  • package results into XLSX, XML, JSON, or other final format

  • store final inventory package in a redundant location

  • load inventory results into audit environment
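Several of the normalization steps above are mechanical enough to script. The sketch below shows one way to reconcile competing date formats, trim titles and indicate aliases in Python. The list of accepted date formats (including the day-first reading of dates like 14/05/2021), the `content_id` key used to recognize the same content under several URLs, and the rule that the shortest URL becomes the primary one are all assumptions for illustration. Month names are parsed according to the current locale, so this assumes an English locale.

```python
from datetime import datetime
from typing import Dict, List, Optional

# Hypothetical set of formats found in the source data, tried in this order.
KNOWN_DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%B %d, %Y", "%d %B %Y")


def normalize_date(raw: str) -> Optional[str]:
    """Return the date as ISO 8601 (YYYY-MM-DD), or None if no known format matches."""
    value = raw.strip()
    for fmt in KNOWN_DATE_FORMATS:
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            continue
    return None


def normalize_title(raw: str, org_suffix: str = "") -> str:
    """Strip leading and trailing whitespace and an optional organization suffix."""
    title = raw.strip()
    if org_suffix and title.endswith(org_suffix):
        title = title[: -len(org_suffix)].rstrip()
    return title


def group_aliases(rows: List[Dict[str, str]], key: str = "content_id") -> List[Dict[str, str]]:
    """Collapse rows that share a content key into one row with an alias column."""
    grouped: Dict[str, List[str]] = {}
    for row in rows:
        grouped.setdefault(row[key], []).append(row["source_url"])
    result = []
    for content_id, urls in grouped.items():
        # Assumption: treat the shortest URL as primary and the rest as aliases.
        primary, *aliases = sorted(urls, key=len)
        result.append({key: content_id, "source_url": primary, "url_aliases": ";".join(aliases)})
    return result


if __name__ == "__main__":
    print(normalize_date("May 14, 2021"))
    print(normalize_title("  Annual Report – Example Org ", "– Example Org"))
```

Returning `None` for an unrecognized date, rather than guessing, leaves a visible gap that can be reviewed by hand before the results are loaded into the audit environment.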