Dark Content, Dark Structure
The concept of dark data is discussed widely in technical literature, along with the complexity, uncertainty and possibilities it entails.
Now, after decades of proliferation of managed and unmanaged content within countless companies, government entities and other organizations – and throughout the Web itself – two corollaries to dark data present themselves as similarly worthy of examination. These can be understood as dark content and dark structure.
Gartner defines dark data as:
the information assets organizations collect, process and store during regular business activities, but generally fail to use for other purposes (for example, analytics, business relationships and direct monetizing). Similar to dark matter in physics, dark data often comprises most organizations’ universe of information assets.
Comparably, dark content consists of content that remains secluded and unused in the shadowy internals of information systems, trapped in tangled digital cobwebs, or hidden in scattered corners of a website. Likewise, dark structure consists of interconnections that lie dormant and uncharted among content, spanning fragmented taxonomies and metadata, siloed repositories and otherwise unmapped connections.
As with the many considerations involved in evaluating dark data, dark content and dark structure introduce their own complications, risks and opportunities.
By their nature, dark content and dark structure do not lend themselves to easy observation or elucidation. Each has characteristics that pose challenges to routine detection and straightforward documentation. The keys to unlocking these murky content aspects are a solid content inventory and a sound content audit.
Arguably, the principal aim and foremost advantage of an inventory and audit should be specifically the recognition of dark content and dark structure in an organization’s content set. For technical specialists, knowledge workers and project managers alike, adopting a suitable mindset and approach is crucial for the success of these activities.
When the content inventory is properly designed as a forensic and investigative undertaking, it will necessarily surface dark content; and when the content audit is properly prepared as an exploration and documentation exercise, it will necessarily illuminate dark structure.
Recognizing a Cross-Disciplinary Challenge
Outside the context of content management, experts across many disciplines – including archaeology, astronomy, civil litigation, counterterrorism, espionage, genealogy, history, law enforcement, politics and warfare – face the challenge of unveiling submerged information and of eliciting beneficial structure from within it.
As a way of understanding how dark content and dark structure relate to the practice of content management, consider how professionals in other fields address information gaps and form impactful conclusions in their own work.
Archaeological Excavation: Archaeologists carefully excavate historical sites and document found artifacts. Based on their findings, they attempt to fill in the gaps of unwritten history. The more puzzle pieces they can locate and assemble, the clearer the picture that emerges.
Criminal Investigation: A detective gathers evidence at a crime scene, conducts interviews among victims and witnesses, and employs forensic analysis. The investigation will seek to yield clues and leads and link statements and evidence into a plausible theory of the case.
Intelligence Gathering and Analysis: Intelligence agencies gather raw intelligence about a given target through clandestine methods, electronic surveillance and open sources. They analyze a mosaic of information to formulate a final assessment or forecast for policymakers.
Interestingly, Ancestry has built a business model around uncovering dark content and dark structure. People want to learn about their ancestors and chronicle the familial bonds among them – and are willing to pay money for a powerful platform. Ancestry seeks out and captures dark content within historical archives held by governments and religious bodies, then mines these records to establish genealogical relationships. Users can then identify further connections.
Similarly, Relativity has built its business model around enabling legal eDiscovery of organizational documents and records and analyzing those materials to discern semantics and structure among files, people and time points. Users can scrutinize document batches for keyword occurrences or proximity of key terms that suggest commonality between documents or connections among individuals in a legal matter. Relativity also reconstructs and visualizes complex email threads.
The Significance of Dark Content and Dark Structure
An organization that neglects content blind spots in its information landscape may incur surprise liabilities or miss significant opportunities.
Content inventories and content audits remain expensive, laborious and time-consuming and are often performed only intermittently by an organization, if at all. Typically, several years elapse between an organization’s successive formal efforts to inspect and appraise its content holdings and to build and expand its ontology and taxonomies.
Meanwhile, dark content can involve knowledge retention dangers. Dark content could be permitted to exist when it should instead be removed for compliance, governance or legal reasons. Conversely, dark content could be financially, historically or technically advantageous, but due to lack of detection and inventorying might be inadvertently allowed to die.
Simultaneously, dark structure represents an evolving set of unexplored possibilities. Unclear or unknown structure among content means a perpetually incomplete information picture for an organization and a weaker data model underlying its business processes and information systems.
The Business Value of Detecting Dark Content and Documenting Dark Structure
While a content inventory should already identify and catalog known content, the potential increased value of an inventory lies in detecting and registering dark content; and while a content audit should already capture and confirm known structure, the potential increased value of an audit lies in delineating and documenting dark structure.
In a mature organization, processes and systems should already be well established to support content identification and cataloging as persistent, ongoing activities, along with management and evolution of an organizational ontology, taxonomies and metadata.
Otherwise, whenever they can be ascertained and incorporated into an organization’s information operations, dark content and dark structure offer value by contributing to a more holistic domain understanding and prompting improved business processes, software development and communications functions.
Conceptually, the pursuits of detecting dark content and mapping dark structure provide value in that they can facilitate building, exploiting and monetizing a fuller domain model for the organization. By scrutinizing the opaque aspects of its content, an organization can arrive at clearer and more comprehensive insights and make better-informed business decisions.
Functionally, the identification of dark content and documentation of dark structure benefit core areas such as content object modeling, publishing workflows, persona mapping, user experience, marketing strategy and other business processes. Once expounded, dark content and dark structure can inform the creation or expansion of ontologies, taxonomies and metadata schemas.
Finding Value in Darkness – Use Cases
Teams may find it painstaking and costly to expose dark content and formalize dark structure. Moreover, the business value that emanates from their attempts may ultimately emerge as meaningful, mundane, or questionable. Until an inventory and audit are completed, however, whatever new worth might be derived from content remains latent.
Typically, an organization’s investment in surfacing dark content and recording dark structure brings maximum value when it consistently:
Expands the understanding of the organizational domain model and ontology.
Clarifies or reinforces a design concept, as formulated in The Essence of Software: “a theory of software design that attributes clarity, simplicity and fitness for purpose in an application to its underlying semantic concepts.” – see https://www.cs.cornell.edu/content/concepts-essence-software-design and https://www.youtube.com/watch?v=wFk0pxuOW-Q
Enhances user experience through discovery of an unanticipated consumer segment, persona, use case, business requirement or feature request.
Improves search results, views and other content aggregations, making them more accurate and effective through application of expanded taxonomy and metadata values.
Graphs knowledge within the content set, whether it be other newly discovered content or people, events, places and their connections to content.
Enriches content with new historical and visual context such as a layered map, interactive timeline or suggestions for related content.
Streamlines publishing workflows, calibrates a user interface, informs marketing strategies and improves other business processes.
Liberates content confined within otherwise locked or unwieldy storage formats.
Disposes of obsolete content that could incur liability.
Saves valuable content from inadvertent destruction.
Recognizes the need to extend the lifespan of a vital system.
Identifies an archaic system as a candidate for retirement.
Brings renewed attention to a neglected content set.
Offers a fun factor: “Look at what we found buried in our archives.” “We thought you might enjoy this blast from the past.”
Preserves organizational content of significant historical value for posterity.
The Challenges of Dark Content
Dark content lies inert and isolated on the edges of an information ecosystem, having been overlooked within an organization’s legacy publishing platforms, buried in aging content repositories and even cloaked beneath undisclosed URLs on its public websites.
Dark content is similar – but not equivalent – to the Deep Web, in which content can be a step removed from public visibility behind a paywall or login.
Organizations typically lack even an initial, direct awareness of their own dark content or its potential dark structure. Whatever the reason for its obscurity, dark content has been sidelined or ignored – and an organization cannot yet leverage it for broader business purposes.
The operational challenges dark content presents are:
Revelation: Just knowing that it exists – it is, after all, “dark” content.
Governance and Retention: Once they become apparent, the novel variations of dark content can prove thorny to normalize quickly within existing data models and information structures for long-term storage and management.
Urgency: The systems that contain dark content could themselves be concealed, neglected or decrepit, and therefore more vulnerable to disruption and displacement. The content inventory may be the last occasion to detect and preserve dark content before legacy systems are retired and dark content is deleted along with them.
Where To Search for Dark Content
Dark content can exist in many different locations, forms and scenarios, such as in:
An entire system or repository that is siloed but active, such as legacy systems that are not yet retired and are still serving public but unlinked content en masse.
An individual resource that is published but unlinked, such as a webpage or binary with no incoming links or that is bypassed by redirections. A site map can remediate this.
A published document that is insufficiently referenced with taxonomy, metadata, state properties or other structural information to surface it in views and aggregations.
SEO-deficient conditions where poor search engine optimization causes content to fall out of search results or become de-ranked into effective invisibility.
Poorly indexed databases or file stores where availability of their holdings is constrained for applications and users.
An active content management system where content remains in draft, unpublished, archived or pending deletion workflow states.
Prior content and document versions, such as Drupal node revisions or wiki page versions, that are stored but are not published.
A decommissioned system, offline storage area or deprecated feature set such as abandoned wikis, private file stores or discontinued content types.
Files and media that survive in formats no longer supported by an organization’s current systems or commercially available software.
Images or image-based PDFs that have not yet undergone optical character recognition to extract text, keywords and semantics.
Images or video that have not yet undergone processing to discern their semantics based on people, objects, landscapes or other patterns.
Audio files that have not yet been transcribed.
Video files that have not been captioned.
Images that lack alternative text.
Untranslated materials for which there is no source document and that may contain unique and authoritative content.
A third-party system such as a cloud storage provider. These tend to be marketing solutions where campaign assets are parked.
Inactive forums and message boards.
Completed surveys and form submissions.
Webpages and PDFs that do not meet accessibility standards for screen readers and assistive technologies.
Any mystery resource awaiting processing, analysis and classification by humans or artificial intelligence to assess any content it might retain.
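The "published but unlinked" case in the list above lends itself to a simple automated check. The following is a minimal sketch in Python, assuming the inventory can already export a list of published URLs and a map of each page's outgoing links. The function name `find_orphans`, its parameters and the sample URLs are all hypothetical and are not taken from any particular tool.

```python
def find_orphans(published, outgoing_links, entry_points=()):
    """Return published URLs that no other published page links to.

    published: iterable of URLs known to be live (hypothetical inventory export)
    outgoing_links: dict mapping a source URL to the URLs it links to
    entry_points: URLs, such as the home page, that need no incoming link
    """
    published = set(published)
    linked = set()
    for source, targets in outgoing_links.items():
        # Only links from live pages count, and a self-link
        # does not make a page reachable from elsewhere.
        if source in published:
            linked.update(t for t in targets if t != source)
    return sorted(published - linked - set(entry_points))


# Illustrative data only.
inventory = ["/", "/about", "/reports/2009-annual.pdf", "/news"]
links = {
    "/": ["/about", "/news"],
    "/news": ["/", "/news"],
    "/old-landing": ["/reports/2009-annual.pdf"],  # source is no longer published
}
orphans = find_orphans(inventory, links, entry_points=["/"])
print(orphans)
```

Here the annual report is flagged because its only incoming link comes from a page that is no longer published. A real inventory would also need to normalize URLs and follow redirections before comparing them.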
The Challenges of Dark Structure
Dark structure can be thought of as existing – but as yet unrecognized – connections among content. Dark structure consists of established but unrealized and untapped associations or relationships among known content, among dark content, or among both known content and dark content.
An intriguing aspect of dark structure is that it can exist even among known content items, as relationships that have not yet been revealed or modeled within an already cataloged content set – even though the individual content items connected through dark structure are already well understood.
By extension, dark structure can exist wherever there is dark content itself – because where any content remains unknown, any structure shared among the unknown content is necessarily also unknown.
In another permutation, dark structure can exist among both dark content and known content. Again, the presence of dark content signifies the possibility of dark structure through a veiled connection to some other known content.
The operational challenges that dark structure presents are:
Level of Effort (LOE) and Return on Investment: An ongoing cost-benefit calculation is vital in determining the level of effort and return on investment of documenting dark structure as the audit proceeds, as complexity deepens and as costs are sunk.
Audit Support: Flexible and fast UX and data model design are critical in extending an audit interface to align to an expanding ontology or capture a growing taxonomy as new connections are realized within dark structure.
Again … Urgency: The content audit may be the last window to map dark structure before legacy systems are retired and dark structure dissipates.
Structural Possibilities Among Content
In summary, structure can exist in the following states among content:
Known Structure among Known Content and Other Known Content: This is the most common scenario, as this arrangement forms the essential foundation for a navigable website.
Dark Structure among Known Content and Other Known Content: This is a case of otherwise perceptible relationships hiding in plain sight, awaiting discovery.
Dark Structure among Dark Content and Other Dark Content: Necessarily, this is the only conceivable state of structure among dark content for as long as it remains unknown.
Dark Structure among Dark Content and Known Content: Again, this is the exclusive state of structure until dark content is surfaced in relation to known content.
None: Structure need not exist among content. For example, while there could be dark structure within a dark content set, there might be no structure within a known content set.
How To Reveal Dark Structure
Dark structure is not immediately intelligible but can be revealed on the basis of:
Common Taxonomy Values or Shared Metadata
The presence or density of a specified keyword in content can mark that content as a candidate for a given content set. Relevant values include those for:
Author
Team
Authoritative, official or legal date
Topic
Language
Geographical location
Title
Metadata description
Indexed full text
Free tags
Slug
Content type
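One way to act on shared taxonomy and metadata values is to index every value and report which items have values in common. The sketch below is illustrative only. It assumes the inventory is available as a dictionary of item identifiers to metadata dictionaries, and the name `shared_metadata_pairs`, the field names and the sample records are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def shared_metadata_pairs(items, fields):
    """Map each pair of item ids to the (field, value) pairs they share.

    items: dict of item id -> dict of metadata (values may be scalars or lists)
    fields: metadata field names to compare, e.g. "author" or "topic"
    """
    index = defaultdict(set)  # (field, normalized value) -> ids carrying it
    for item_id, metadata in items.items():
        for field in fields:
            value = metadata.get(field)
            if value is None:
                continue
            values = value if isinstance(value, (list, tuple, set)) else [value]
            for v in values:
                normalized = str(v).strip().lower()
                if normalized:  # ignore empty values
                    index[(field, normalized)].add(item_id)

    pairs = defaultdict(set)
    for key, ids in index.items():
        for a, b in combinations(sorted(ids), 2):
            pairs[(a, b)].add(key)
    return dict(pairs)


# Illustrative data only.
inventory = {
    "doc-1": {"author": "J. Rivera", "topic": ["Budget", "Audit"], "language": "en"},
    "doc-2": {"author": "j. rivera", "topic": ["Audit"], "language": "es"},
    "doc-3": {"author": "A. Chen", "topic": ["Hiring"], "language": "en"},
}
candidates = shared_metadata_pairs(inventory, ["author", "topic"])
for pair, shared in candidates.items():
    print(pair, sorted(shared))
```

Values are lowercased and trimmed before comparison, which is a deliberate simplification. Each reported pair is only a candidate relationship for an auditor to confirm or reject.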
Identical or Similar State Properties
This determination requires extensive forensics, often examining nuances at the file system and database level, including properties for:
Creation date
Modification date
Publication date
Workflow status
Publication disposition
MIME type
Size
Encoding
Origin system or software
Current system of record
Storage location
Internal system identifiers
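Once state properties have been collected, comparing them can be partly mechanized. The following sketch assumes each inventory record is a dictionary with hypothetical keys `id`, `mime_type`, `origin_system` and `modified`. The helper names `group_by_state` and `modified_together` and the five-minute window are illustrative choices, not an established method.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def group_by_state(records, keys=("mime_type", "origin_system")):
    """Group record ids that have identical values for the given properties."""
    groups = defaultdict(list)
    for rec in records:
        groups[tuple(rec.get(k) for k in keys)].append(rec["id"])
    # Keep only groups that actually connect two or more items.
    return {k: v for k, v in groups.items() if len(v) > 1}


def modified_together(records, window=timedelta(minutes=5)):
    """Cluster records whose modification time is within `window` of the previous one."""
    ordered = sorted(records, key=lambda r: r["modified"])
    clusters, current = [], []
    for rec in ordered:
        if current and rec["modified"] - current[-1]["modified"] > window:
            clusters.append(current)
            current = []
        current.append(rec)
    if current:
        clusters.append(current)
    return [[r["id"] for r in c] for c in clusters if len(c) > 1]


# Illustrative data only.
records = [
    {"id": "a.pdf", "mime_type": "application/pdf", "origin_system": "legacy-cms",
     "modified": datetime(2011, 3, 1, 9, 0)},
    {"id": "b.pdf", "mime_type": "application/pdf", "origin_system": "legacy-cms",
     "modified": datetime(2011, 3, 1, 9, 2)},
    {"id": "c.html", "mime_type": "text/html", "origin_system": "wiki",
     "modified": datetime(2011, 3, 1, 9, 4)},
    {"id": "d.html", "mime_type": "text/html", "origin_system": "legacy-cms",
     "modified": datetime(2015, 6, 10, 14, 30)},
]
same_state = group_by_state(records)
batches = modified_together(records)
print(same_state)
print(batches)
```

Items modified within minutes of one another may have been part of the same bulk import or publishing batch, which is the kind of hint an auditor would then verify at the file system or database level.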
Other Structural Relationships
An artful and meticulous audit will cross-index within the content inventory to fill in blanks and connect dots, making explicit and inferred references to:
Parent-child hierarchies
Sibling relationships such as node queues
Associations through incoming and outgoing hyperlinks
Correlations such as translations assembled in a multilingual family
URL patterns
Directory and file naming conventions
Proximity within a storage location
Redirection history
Tracking available in Web or log file analytics
Unusual class or style references present in the content inventory
Similar or identical counts of markup elements in the content inventory
Data attributes
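Two of the signals listed above, URL patterns and counts of markup elements, can be combined into a rough fingerprint for spotting pages that probably came from the same template or content type. This is a sketch using only the Python standard library. The names `TagCounter`, `markup_signature` and `url_pattern`, the `{n}` placeholder and the sample pages are assumptions made for illustration.

```python
import re
from collections import Counter, defaultdict
from html.parser import HTMLParser
from urllib.parse import urlparse


class TagCounter(HTMLParser):
    """Count how many times each start tag appears in a document."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1


def markup_signature(html):
    """Return a sorted tuple of (tag, count) pairs for the given markup."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return tuple(sorted(parser.counts.items()))


def url_pattern(url):
    """Replace numeric path segments with a placeholder so sibling URLs collapse."""
    path = urlparse(url).path
    return re.sub(r"/\d+(?=/|$)", "/{n}", path)


# Illustrative data only.
pages = {
    "https://example.org/news/2014/117": "<div class='story'><h1>A</h1><p>x</p><p>y</p></div>",
    "https://example.org/news/2015/204": "<div class='story'><h1>B</h1><p>q</p><p>r</p></div>",
    "https://example.org/about": "<main><h1>About</h1><p>z</p></main>",
}
families = defaultdict(list)
for url, html in pages.items():
    families[(url_pattern(url), markup_signature(html))].append(url)
related = [urls for urls in families.values() if len(urls) > 1]
print(related)
```

Requiring an exact match on both the URL pattern and the element counts is strict. In practice an audit might relax this to "similar" counts, as the list above suggests, at the cost of more false positives to review.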
Approaching Dark Content and Dark Structure in Your Organization
A typical organization is strewn with content assets amid all phases of the content lifecycle and across a scattering of systems and silos. This raises important questions about the effort, cost and returns involved in assessing that content.
Confronting and mastering dark content and dark structure in your organization’s content should be considered the primary goal of your content inventory and content audit.
Your organization could run unnecessary risks or miss worthwhile opportunities because it lacks a commanding view of its content universe and the rich structure within it. By unlocking these enigmatic content aspects, an organization can expand its awareness and refine its domain model, ontology, taxonomies and metadata schemas into an improved foundation for business, technical and communications activities.
The challenges posed by dark content and dark structure require discipline and precision to address. Accordingly, conducting a valid content inventory and an authoritative content audit entails adopting a systematized approach and exacting techniques.
A thorough and systematic content inventory should identify and catalog dark content. Refer to this content inventory technical quick guide to ensure the widest inventory coverage of systems and materials that may contain dark content.
An exhaustive and methodical content audit should reveal and record dark structure. Refer to this content audit technical quick guide to ensure the deepest audit penetration of content collections and arrangements that may hold dark structure.