Schema-Validated Structured Content

Summary

Design the content as typed, schema-defined data that is validated against that schema and kept independent of how it is presented, so the same content can be authored with confidence, reused in more than one place, and trusted by anything that reads it. This pattern is for HASS research software engineers, researcher-coders, and content authors who maintain material for research sites and data packages, and who want the content itself, rather than its rendered appearance, to be the durable asset.

Recommendation	Why?
Define an explicit schema for each kind of content, validate every item against it, and keep the content separate from its presentation.	Authors get immediate, specific feedback instead of silently broken pages, consumers such as templates, search indexes, sibling sites, and machines can rely on a stable shape, and the content outlives any single rendering of it.
Make the schema only as rich as the content’s real uses require, and grow it deliberately.	An over-specified schema burdens every author and freezes early guesses into obligations, while an under-specified one lets inconsistency creep back in. Matching the schema to actual reuse, rather than to imagined completeness, is the design judgement at the centre of this pattern.

Context

This pattern suits content with recurring structure: catalogue entries, patterns, people, datasets, events, or publications that feed a detail page, a listing, a search index, a feed, a sibling site, or an external machine consumer. It fits research settings especially well, where content must stay reusable and citable long after the site that first presented it has gone, which is the Reusable principle of FAIR (Findable, Accessible, Interoperable, Reusable).

Several of the conditions for this pattern to work are social as much as technical, because the authors of research content range from software engineers to researchers who do not code. Comprehensible feedback rather than stack traces is what lets a non-coding author see what is wrong and fix it; agreeing what each field means is a shared-vocabulary problem before it is a software one; and the downstream consumer that justifies the structure might be a discovery service, an aggregator, or an archive.

It does not apply to one-off prose pages with no recurring structure and no downstream consumer, where a schema is pure overhead, nor to content whose shape genuinely cannot be anticipated. Enforcing the schema at publication time is a separate concern, handled by the sibling process pattern on build-time validation; this pattern is about designing the shape and requiring conformance to it, not about the mechanics of the gate.

Usage

Model each content type as a named collection with an explicit schema. Decide the fields, their types, which are required, and what values each may take, and write that down as a schema rather than leaving it to convention and memory.
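
A minimal sketch of what "writing it down" can look like, in dependency-free TypeScript. The `dataset` collection, its field names, and the `conforms` helper are all hypothetical, chosen only to illustrate the idea; a real project would more likely use a schema library than hand-rolled rules.

```typescript
// Hypothetical "dataset" collection; field names are illustrative only.
type FieldRule = {
  name: string;
  required: boolean;
  // Allowed values, if the field is a closed vocabulary.
  oneOf?: readonly string[];
  // Pattern the value must match, if any.
  pattern?: RegExp;
};

// The schema is explicit data, not convention or memory.
const datasetSchema: readonly FieldRule[] = [
  { name: "title", required: true },
  { name: "licence", required: true, oneOf: ["CC-BY-4.0", "CC0-1.0"] },
  { name: "published", required: true, pattern: /^\d{4}-\d{2}-\d{2}$/ },
  { name: "summary", required: false },
];

type Item = { [field: string]: unknown };

// True only if the item has every required field, no unknown fields,
// and every present value is a string of the permitted form.
function conforms(item: Item, schema: readonly FieldRule[]): boolean {
  const known = schema.map((rule) => rule.name);
  for (const key of Object.keys(item)) {
    if (known.indexOf(key) === -1) return false;
  }
  for (const rule of schema) {
    const value = item[rule.name];
    if (value === undefined) {
      if (rule.required) return false;
      continue;
    }
    if (typeof value !== "string") return false;
    if (rule.oneOf && rule.oneOf.indexOf(value) === -1) return false;
    if (rule.pattern && !rule.pattern.test(value)) return false;
  }
  return true;
}
```

In this sketch the optional `summary` field may be omitted, but an unknown field or a licence outside the listed vocabulary makes the item non-conforming.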

Separate the content from its presentation. Hold the content as data that carries meaning, e.g. structured text files or records, and let templates decide appearance, so the same item can drive a detail page, a listing, a feed, and an external consumer without being rewritten for each.
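
The separation can be sketched as one array of content items and several independent functions that each decide a presentation. The `EventItem` type, the sample events, and the three functions below are hypothetical, and the HTML is deliberately naive (no escaping), so treat this as an illustration rather than a template engine.

```typescript
// Content: data that carries meaning and contains no markup.
type EventItem = { slug: string; title: string; date: string; venue: string };

const events: readonly EventItem[] = [
  { slug: "intro-workshop", title: "Introductory workshop", date: "2025-03-14", venue: "Room 1" },
  { slug: "data-clinic", title: "Data clinic", date: "2025-04-02", venue: "Online" },
];

// Presentation: each function is one consumer of the same content.
// A real template would also escape the values it interpolates.
function detailPage(item: EventItem): string {
  return `<article><h1>${item.title}</h1><p>${item.date}, ${item.venue}</p></article>`;
}

function listing(items: readonly EventItem[]): string {
  const rows = items.map((item) => `<li><a href="/events/${item.slug}/">${item.title}</a></li>`);
  return `<ul>${rows.join("")}</ul>`;
}

// A machine-readable feed is just another consumer, not a special case.
function feed(items: readonly EventItem[]): string {
  return JSON.stringify(items.map((item) => ({ id: item.slug, title: item.title, date: item.date })));
}
```

Changing how a listing looks touches only `listing`; the events themselves are never rewritten.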

Give authors type-safe access and legible errors. The content’s shape should be visible to the tools authors already use, through autocompletion and inline validation, and a failure should name which field in which item is wrong and why, so the feedback teaches rather than merely blocks.
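
To make "legible" concrete, the sketch below reports every problem with the item, the field, and the reason, instead of stopping at the first failure. The `Issue` shape, the `checkPublication` rules, and the file path used as an item identifier are assumptions made for illustration.

```typescript
// One reported problem: which item, which field, and why.
type Issue = { item: string; field: string; problem: string };

// Hypothetical rules for a "publication" collection.
function checkPublication(itemId: string, data: { [field: string]: unknown }): Issue[] {
  const issues: Issue[] = [];

  const title = data["title"];
  if (typeof title !== "string" || title.trim() === "") {
    issues.push({ item: itemId, field: "title", problem: "is required and must be non-empty text" });
  }

  const year = data["year"];
  if (typeof year !== "number" || Math.floor(year) !== year) {
    issues.push({
      item: itemId,
      field: "year",
      problem: "must be a whole number such as 2024, but got " + JSON.stringify(year),
    });
  }

  // Collect everything, so the author can fix all problems in one pass.
  return issues;
}

// Renders an issue as a sentence an author can act on.
function formatIssue(issue: Issue): string {
  return `${issue.item}: field "${issue.field}" ${issue.problem}`;
}
```

An author who typed the year in quotes would see a message naming the file and the `year` field and showing the offending value, rather than a stack trace.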

Model relationships explicitly. Where one item refers to another, e.g. a pattern to its siblings or a dataset to its authors, express that as a typed reference between collections rather than as a loose string, so the link can be validated and followed rather than silently rotting.
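
One way to picture a typed reference, as a sketch: the reference records which collection it points into, so it can be checked and followed. The `Ref` type, the two collections, and the helper names below are hypothetical.

```typescript
// A reference names its target collection, so it cannot be confused
// with free text or with an id from some other collection.
type Ref<C extends string> = { collection: C; id: string };

type Author = { id: string; name: string };
type Dataset = { id: string; title: string; authors: Ref<"authors">[] };

const authors: readonly Author[] = [
  { id: "author-a", name: "Author A" },
  { id: "author-b", name: "Author B" },
];

const datasets: readonly Dataset[] = [
  { id: "survey-2020", title: "Survey 2020", authors: [{ collection: "authors", id: "author-a" }] },
  // This reference is deliberately broken to show what validation catches.
  { id: "letters", title: "Letters", authors: [{ collection: "authors", id: "author-z" }] },
];

// Lists every reference whose target does not exist.
function danglingReferences(items: readonly Dataset[], targets: readonly Author[]): string[] {
  const ids = targets.map((t) => t.id);
  const problems: string[] = [];
  for (const item of items) {
    for (const ref of item.authors) {
      if (ids.indexOf(ref.id) === -1) {
        problems.push(`datasets/${item.id}: authors refers to missing ${ref.collection}/${ref.id}`);
      }
    }
  }
  return problems;
}

// Follows the valid references to the records they point at.
function resolveAuthors(item: Dataset, targets: readonly Author[]): Author[] {
  return targets.filter((t) => item.authors.some((ref) => ref.id === t.id));
}
```

With a loose string in place of `Ref`, the misspelt `author-z` would simply render as text; here it is reported before anything is published.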

Emit machine-readable structure for the consumers that need it. Where discovery or interchange matters, project the content into a recognised vocabulary, e.g. schema.org expressed as JSON-LD, so that search services, aggregators, and archives read it reliably. This reinforces FAIR findability as well as reuse.
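
As an illustration, a validated item can be projected into a schema.org `Dataset` description. The `DatasetItem` type, its field names, and the sample values are hypothetical; the property names on the output (`name`, `description`, `license`, `datePublished`, `creator`) are schema.org terms.

```typescript
// Hypothetical internal shape of a validated content item.
type DatasetItem = {
  title: string;
  summary: string;
  licenceUrl: string;
  published: string; // ISO 8601 date
  creators: string[];
};

// Projects the item into schema.org terms, serialised as JSON-LD.
function toJsonLd(item: DatasetItem): string {
  const doc = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    name: item.title,
    description: item.summary,
    license: item.licenceUrl,
    datePublished: item.published,
    creator: item.creators.map((personName) => ({ "@type": "Person", name: personName })),
  };
  return JSON.stringify(doc, null, 2);
}

const sample: DatasetItem = {
  title: "Example letters collection",
  summary: "Placeholder description used only for this sketch.",
  licenceUrl: "https://creativecommons.org/licenses/by/4.0/",
  published: "2024-05-01",
  creators: ["Author A", "Author B"],
};

console.log(toJsonLd(sample));
```

On a web page the resulting string is typically embedded in a `<script type="application/ld+json">` element, which makes it one more consumer of the same content rather than a separate copy to maintain.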

Treat the schema as a versioned specification. Change it through review, and migrate existing content when it changes, so the contract between authors and consumers stays honest rather than drifting. Validate every item against the current schema as part of producing the site, which the sibling process pattern on build-time validation puts into effect.
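
A sketch of what a reviewed schema change with a migration might look like, assuming each item carries a hypothetical `schemaVersion` field. Here version 1 held one free-text `author` and version 2 holds a list of `authors`; both shapes and the semicolon convention are invented for illustration.

```typescript
// Version 1 stored a single free-text author field.
type PatternV1 = { schemaVersion: 1; title: string; author: string };
// Version 2 stores a list, so each author can be handled separately.
type PatternV2 = { schemaVersion: 2; title: string; authors: string[] };

// Upgrades any stored item to the current version. Items already at
// version 2 pass through unchanged, so the migration is safe to re-run.
function migrate(item: PatternV1 | PatternV2): PatternV2 {
  if (item.schemaVersion === 2) return item;
  // Assumption: version 1 authors were separated by semicolons.
  const authors = item.author
    .split(";")
    .map((part) => part.trim())
    .filter((part) => part !== "");
  return { schemaVersion: 2, title: item.title, authors };
}
```

Keeping the old type in the code documents what existing content may still look like, and makes the migration itself something reviewers can read alongside the schema change.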

One content source, validated once against an explicit schema, feeds many peer consumers; the rendered page is just one of them.

Implementations

Astro’s content collections are an exemplar static-site implementation. A collection declares its schema as a Zod object, every item is validated at build time, and templates receive typed content with editor autocompletion. The Content Layer adds typed references between collections and loading from files or external sources, so one collection can drive a detail page, a listing, an RSS feed, a search index, and a JSON-LD block for crawlers: the same content, validated once, feeding many peer consumers. Astro is worth singling out because it makes the right design the path of least resistance, which is what a good tool should do for a good pattern. See the content collections reference for the current API.

The RSE-CEP pattern site is this pattern running live: each pattern is an item in a typed content collection whose frontmatter is defined by a schema and validated at build time, with typed references between related patterns. It is a working, current implementation you can inspect rather than take on trust.

The shape is not web-specific. RO-Crate applies the same idea to research-data packaging, i.e. schema-validated metadata as JSON-LD, held apart from an optional human-readable rendering. RO-Crate is already used in Australian research infrastructure such as the Language Data Commons of Australia.

References

Standards

JSON Schema: a widely used, language-independent standard for describing and validating the structure of JSON documents.
schema.org: the common vocabulary for machine-readable structured content, typically expressed as JSON-LD for discovery.

Libraries

Zod: the TypeScript-first schema and validation library used by Astro content collections.

Acknowledgments

This pattern responds to concerns raised by participants in the HASS and Indigenous RDC Community Data Lab co-design workshop, in particular that software development too often prioritises features over long-term access to well-described data, leaving outputs unsustainable. It draws on the structured-metadata work of the RO-Crate and schema.org communities, whose conventions it points to rather than reinvents.