An Introduction to the Resource Description Framework
An overview of RDF, what it is, how it works, and why.
This is a growing project.
Context.
Right at the end of 2021 I squeaked through a prodev expense to hire Dorian Taylor for a set of tutoring sessions around the basics of linked data, the semantic web, and RDF. Taylor is a consulting designer who has worked in the information architecture and linked data space for many years, and I had previously engaged him to consult on localization and content strategy approaches at SmugMug and Flickr.
This was prompted by adding some How-To JSON-LD to a number of Verge articles with the goal of improving search engine optimization, and by the development of The Schemanator.
Working on this project made it clear that this was just the tip of the iceberg of a complex and interesting system, and that SEO was just one of the potential benefits. Since no one was quite sure how any of this worked – even in the SEO use case – I decided that it would be worthwhile to understand this.
Below are the results of this learning adventure!
Introduction: Semantic Web / Linked Data / Semantic Data / Resource Description Framework.
There are a number of challenges when it comes to learning about this space – the two leading ones are:
- All the interesting stuff is done by European academic researchers
- There’s like a dozen terms that carry different subtle distinctions in meaning.
The European academics I’ve met are all really friendly, and once you get used to the white papers it sort of works itself out. So I’ll start with the terminology here:
- Semantic
- In plain use, semantic means carrying meaning. In this context, it means carrying information that can be operated on in some capacity by computers.
- Linked
- This operates the same way as we use the term "links" for the internet, meaning that there are connections between different places defined by URLs.
- Data
- Points of information, usually collected into what are termed "Resources" or "Entities". Data here operates the same way that we would expect it to as web developers.
- Web
- In its original sense, a Web is a collection of Resources that are Linked to each other. Each resource is identified, like a web page, by a URL, and each link is a reference to another resource's URL.
The compound terms grow from the combination of these basic ideas:
- Semantic Data
- Data resources that are defined by machine-readable, computable structures and content.
- Linked Data
- Data resources that are identified by URLs and reference other data points via their URLs.
- Semantic Web of Linked Data
- A collection of data resources identified by URLs that contain machine-readable, computable structures and content, and that reference other resources via their URLs.
Simple, right?
The last concept to introduce is the Resource Description Framework, known as RDF. RDF is a W3C standard for defining and working with these collections of resources, or data. RDF has been around since the early 2000s, is a web standard, and ironically it’s been mostly embraced outside of the web development industry in areas like machine learning, bioinformatics, bibliographies, cultural archive institutions and trusts, and in the case of the French, agricultural studies.
Web developers, as a rule, hate RDF and refuse to use it. My current theory is that this is because the first implementations of the spec were in XML, and XML completely sucks and is inimical to human joy and flourishing.
Why is this useful and/or interesting?
Lots of the literature and documentation around the RDF ecosystem is focused on “how’s” and not really on “why’s”. My own experience was one of learning lots of “how’s” and “what’s” before I even started to get a glimpse of the “why’s”. So before we go any further, I’ll just drop some of the most interesting (to me anyway) use cases for these approaches:
- Search engine optimization.
- Content searching, indexing, and cataloging.
- Recommendation engines and audience analysis.
- Organizational modeling.
- Machine learning with like … algebra rather than neural nets.
- Data transparency and interoperability.
- Schema-less databases and flexible data structures.
- Analytics and metrics.
- Relational conceptual models.
- Prediction of consequences from altering complex systems.
- Logical reasoning to derive quite a lot of additional data.
That’s … a lot of interesting and useful and very different use cases! Fundamentally, RDF allows for the creation of complex data models, and direct computation and analysis of those models. All of the above use cases are examples of that process for different end goals.
RDF Fundamentals.
There are two basic ideas in RDF from which quite a lot of complexity is built. At its core tho, RDF is not terribly complicated.
Basic Idea One.
Subject Predicate Object.
The basic format at the core of RDF is called the “Triple”, and it’s very simple. Every single piece of data in an RDF system is a single triple, structured as “Subject, Predicate, Object”. This is a very natural way to describe a thing or a system, which is the goal of the resource description framework. Here are some examples of this format from sitting at my desk, structured as follows:
“(Subject) (predicate) (object).”
So:
- “(Nikolas) (is drinking) (coffee).”
- “(That plant) (is in) (that pot).”
- “(The radio) (is playing) (KMHD).”
- “(Slack) (is distracting) (Nikolas).”
That’s it! An RDF dataset is just a huge number of these simple, one-line, three-part structures. Some systems can get very, very large, and the cutting edge of storing and querying these systems is measured in billions of triples, with a few companies pushing into one trillion triples, which is bonkers.
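The desk examples above can be sketched as data in plain Python. This is just an illustration of the shape of a triple store, not any real RDF library; the strings are the informal examples from above:

```python
# Each triple is a single (subject, predicate, object) tuple.
triples = [
    ("Nikolas", "is drinking", "coffee"),
    ("That plant", "is in", "that pot"),
    ("The radio", "is playing", "KMHD"),
    ("Slack", "is distracting", "Nikolas"),
]

# A dataset is nothing more than a collection of these statements.
for subject, predicate, obj in triples:
    print(f"{subject} {predicate} {obj}.")
```

Every statement has exactly the same three-part shape, no matter what it describes; that uniformity is what lets datasets scale into the billions of rows.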
Basic Idea Two.
Everything is a URL.
Or, more accurately, a URI.* This is a big part of what makes this data both Linked and Semantic. There are quite a few benefits to this which unfold into the potential complexity and power of an RDF system.
Because every subject, object, and predicate is a URL, every subject, object, and predicate is unique. So rather than talk about “Nikolas”, which is not a unique way to refer to someone (after all, there are lots of Nikolases in the world; the latent context of any conversation is what makes it clear who we’re talking about), we refer to https://nikolas.ws/is. The URL is a unique string, and is an absolute way to indicate who we mean when we say “Nikolas”.
Subjects and objects are straightforward in this manner, but predicates are a little more interesting, I think. Let’s say we want to create an RDF system for Punch, thereby turning its huge number of excellent articles and recipes into a world-class archive on par with the Getty. At the most basic level, any given cocktail will have a name and ingredients. These then must be URLs, perhaps rdf.punchdrink.com/name and rdf.punchdrink.com/ingredient. This removes ambiguity around what we mean when we say “name” – for instance, we can be clear that a cocktail’s name is a different sort of thing than a human’s name.
This brings us to a big caveat in the “everything is a URL” idea: literals are also okay! Sometimes you just need a string, date, or integer. That’s okay, they don’t need to be URLs. nikolas.ws/is can have a name of just “Nikolas Wise”.
* Tangent: URL, URI, WTF;
More acronyms with subtle differences, so let’s quickly define some more terms.
- URL
- Uniform Resource Locator: for our purposes this means an address that resolves over HTTP and returns some content. We use URLs every day, for example [https://nikolas.ws](https://nikolas.ws).
- URI
- Uniform Resource Identifier: same idea, but isn't exclusively used to resolve resources over HTTP. For example, `0399149864` is the ISBN for the first edition hardcover of Gibson's Pattern Recognition. `9780425192931` is the ISBN for the same book's trade paperback. These are both URIs within the ISBN system.
- URN
- Uniform Resource Name: a URI that names a resource without saying where to find it, for example `urn:isbn:0399149864`. Every URN is a kind of URI.
- Bonus round: UUID
- Universally Unique Identifier: A number that when generated to [the spec](https://datatracker.ietf.org/doc/html/rfc4122) will be completely, singly, absolutely unique due to Math Reasons™. UUIDs are a good way to create URLs and URIs that _will_ be unique without worrying about the hard problem of naming things.
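For example, Python's standard library can generate RFC 4122 UUIDs directly, and they drop straight into a `urn:` style identifier:

```python
import uuid

# uuid4() produces a random UUID; collisions are astronomically
# unlikely, so no coordination with anyone else is needed.
identifier = uuid.uuid4()
print(identifier)

# A UUID can be embedded in a URI without any naming debate.
uri = f"urn:uuid:{identifier}"
print(uri)
```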
Therefore:
Every data point is structured as “Subject predicate object” where each of these three is a URL.
Putting it together, we can write some simple RDF triples about the Manhattan: https://punchdrink.com/recipes/manhattan/.
For the sake of my typing and your reading, we’ll be truncating these URLs for Punch to simple [punch.com/manhattan]. While this isn’t quite correct, it will make things easier to read. One of the core problems with RDF, that we’ll be discussing later, is this exact situation of long URLs being unfriendly to humans.
Without further ado, some RDF triples about the Manhattan:
- [punch.com/manhattan][rdf.punch.com/name] “Manhattan”
- [punch.com/manhattan][rdf.punch.com/ingredient] [punch.com/bourbon]
- [punch.com/manhattan][rdf.punch.com/ingredient] [punch.com/sweet-vermouth]
- [punch.com/manhattan][rdf.punch.com/ingredient] [punch.com/angostura-bitters]
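Represented as plain Python data (using the same truncated URLs as simple strings), this little dataset can already be queried. This is a toy sketch, not a real RDF store:

```python
# The Manhattan triples from above, truncated URLs as plain strings.
triples = [
    ("punch.com/manhattan", "rdf.punch.com/name", "Manhattan"),
    ("punch.com/manhattan", "rdf.punch.com/ingredient", "punch.com/bourbon"),
    ("punch.com/manhattan", "rdf.punch.com/ingredient", "punch.com/sweet-vermouth"),
    ("punch.com/manhattan", "rdf.punch.com/ingredient", "punch.com/angostura-bitters"),
]

def objects(subject, predicate):
    """All objects matching a given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]

print(objects("punch.com/manhattan", "rdf.punch.com/ingredient"))
# ['punch.com/bourbon', 'punch.com/sweet-vermouth', 'punch.com/angostura-bitters']
```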
And that’s it! We have some RDF triples and have started our journey into creating semantic, linked data. A close reader may at this point be able to spot a number of implications that grow from these two basic ideas.
Implications of the RDF triple.
From what we’ve laid so far, we can see a number of implications, problems, and questions that are raised by the two fundamentals above. In no particular order, these include:
- This sucks to read and write.
- What’s on the other end of these URLs?
- What URLs can I use?
- How much Bourbon goes in the Manhattan?
- Where does all this data live?
Reading and writing triples of full URLs is bad and makes me feel bad.
It’s true! Triples suck! I would argue they are better than XML but that’s a pretty low bar. The RDF ecosystem has a couple solutions to this problem, which get used in tandem to make a better experience – both for reading and writing and also for interacting with this data inside a program.
Namespaces
This sounds like it’s going to be a very serious computer science thing, but it’s not really. It’s really just assigning a variable to a full URL path to make writing this stuff easier.
Instead of writing our full URL https://punchdrink.com/recipes/manhattan/ we can assign a namespace or prefix:
# Turtle syntax
@prefix punch: <https://punchdrink.com/recipes/> .
Note: this is a different syntax than the triples we’ve been using – this one is called Turtle which is fun.
These namespaces allow us to write things like punch:manhattan or punch:ingredient and process that into the full URL with simple string replacement.
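That string replacement is genuinely all there is to it. A minimal sketch in Python, using the punch: prefix from above:

```python
# Prefixes map short names to full URL bases, like Turtle's @prefix.
prefixes = {"punch": "https://punchdrink.com/recipes/"}

def expand(term):
    """Expand 'punch:manhattan' into its full URL by string replacement."""
    prefix, _, local = term.partition(":")
    return prefixes[prefix] + local

print(expand("punch:manhattan"))
# https://punchdrink.com/recipes/manhattan
```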
Syntaxes
Triples are the most verbose and granular syntax possible; there is no room for misunderstanding or data loss. This makes them a good format for transmission, storage, or making sure that your data is expressing the concepts and structure you intend it to. It also makes them hard to write without errors, and not super fun to read or work with programmatically.
In response to this, RDF has a number of official syntaxes that can be used to express (or serialize) the underlying data.
Since RDF is a framework and a specification for how data should be related, it doesn’t actually have many opinions on how that gets done, and these syntaxes step up to the plate. These include:
- Turtle: Not too bad to write by hand, and a common format for delivering data via the .ttl file type. Much easier to read than the others.
- JSON-LD: My favorite – a way to structure RDF as regular ol’ JSON objects.
- RDFa: Used to add additional markup to .html files that allows the semantic data to be extracted.
- RDF/XML: The original sin of RDF; I recommend pretending this doesn’t exist and transforming it into Turtle or JSON-LD if you ever need to look at it.
A fun implication of this is that since each syntax is a formally defined spec of the same data, they can be freely and rapidly transformed from one to another. It’s a really simple matter to write your data in Turtle, transform it to Triples to send over the wire, then transform again to JSON-LD to operate on in a web app.
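To give a feel for that transformation, here is a loose sketch of reshaping a list of triples into the grouped-by-subject shape JSON-LD takes. This is an illustration of the idea only, not a spec-compliant JSON-LD serializer, and the punch: terms are the invented examples from earlier:

```python
import json

triples = [
    ("punch:manhattan", "punch:name", "Manhattan"),
    ("punch:manhattan", "punch:ingredient", "punch:bourbon"),
    ("punch:manhattan", "punch:ingredient", "punch:sweet-vermouth"),
]

# Group triples by subject, collecting repeated predicates into lists.
doc = {}
for s, p, o in triples:
    node = doc.setdefault(s, {"@id": s})
    if p in node:
        existing = node[p]
        node[p] = existing + [o] if isinstance(existing, list) else [existing, o]
    else:
        node[p] = o

print(json.dumps(list(doc.values()), indent=2))
```

The reverse direction, flattening a JSON object back into one row per statement, is just as mechanical, which is why round-tripping between the syntaxes is lossless.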
What’s on the other end of these URLs?
This is where RDF starts to get really interesting. These URLs resolve to additional RDF data, but with the current URL as the subject.
Let’s say we have a single statement from above:
[punch:manhattan] [punch:ingredient] [punch:bourbon]
With our namespace of punch: resolving to a full URL, we can visit punch:bourbon at https://punchdrink.com/bourbon. When we do, we’ll get additional statements about Bourbon:
[punch:bourbon] [punch:type] [punch:whiskey]
[punch:bourbon] [punch:basegrain] [punch:corn]
[punch:bourbon] [punch:agedIn] [punch:charred-oak]
[punch:bourbon] [punch:origin] [punch:kentucky]
“Visit” and “get” are both hand-wavy statements doing a lot of heavy lifting here. What do we mean by that?
At the most basic, this means visiting the URL in a browser. At the URL, we will find an HTML page that is just a regular ol’ web page about bourbon with some extra data added to it that we can extract. This could be RDFa properties on the HTML nodes, or it could be a blob of JSON-LD somewhere in the page.
It could also mean that we send a GET request via cURL or fetch to the URL, and specify what format we want with an Accept header. We could ask for HTML and get the page described above, or we could ask for JSON-LD and just get the JSON blob. We could ask for triples, or we could ask for Turtle. Any way we do it, we just get more semantic data about whatever is the subject of that URL.
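In code, that content negotiation is just a header on the request. A sketch with Python's standard library, building (but not sending) a request that asks for Turtle; the URL is the illustrative one from above, and a real endpoint would need to actually support this negotiation:

```python
import urllib.request

# Build a GET request that asks the server for Turtle instead of HTML.
# Swapping the Accept value for "application/ld+json" or "text/html"
# would ask for JSON-LD or the regular web page instead.
req = urllib.request.Request(
    "https://punchdrink.com/bourbon",
    headers={"Accept": "text/turtle"},
)

print(req.get_method())          # GET
print(req.get_header("Accept"))  # text/turtle
```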
The Recursive Nature of Relations
This is the neat bit. When I was working at Esri, I would go looking for public data sets to explore and work with, and would have a frustrating time dealing with metadata. I would find the metadata for what I wanted, but it was never clear where the data data was that I could draw on a map or operate on. When I started trying to work with RDF predicates to build web apps, my first instinct was to follow the URLs all the way down to find some data data that would be useful to use. This … didn’t really work the way I was imagining it would. All I found was more RDF statements that pointed deeper and deeper until at the end of the day all I had were some statements from some fundamental libraries of terms.
What I realized after grasping at straws and feeling frustrated that there’s no there there is that the thing that actually matters is the relationship. One relationship is useful, and that relationship is defined by other relationships. Which are also just collected relationships. The only things present in this system are a few types of literals, and a deeply compounded nest of relationships and associations.
This is a good thing! This is cool actually! This fundamentally means that RDF is an open-ended system that can be used to express anything you want, and can be consumed and interpreted any way you want. There are shared conventions and concepts like inverse and disjoint that programs can use to make inferences about the data, but at the end of the day there is no limit to what can be expressed. RDF is less of a language or vocabulary and more of a tool for creating languages — which can either be common or vernacular.
For an example of why this is politically interesting, we can take the example of regional Chinese dialects. Many languages in China share the same system of glyphs and characters, but with striking regional variations in pronunciation. The same written sentence can be read many different ways in many different languages, with a degree of shared understanding. The emergence of an official latinization (the correct way to translate characters into phonetic representations with the alphabet) threatens to disrupt these regional dialects and supplant them with a singular, centrally controlled system of language.
RDF and its nested relational structure of creating meaning through association rather than absolute definitions operates in the same way as many regional dialects sharing a single set of characters. The infrastructure of language is shared, but what you can say and how you say it is limitless. The creation of meaning is a distributed affair.