
2022.11.10

Tables, Documents, and Graphs

An in-progress examination of strengths and weaknesses of rectangular databases, NoSQL document stores, and graph data structures.

This is a growing project

How can we understand when to use what sort of data store for any given software system? I’ve spent the first decade-plus of my career working hard to not care about it, staying focused on the front end, stateless systems, design principles, and software engineering management. But with my crippling fascination with RDF, the time has come. I think that triplestore graphs are good as hell and very useful. But do I know that? I think using ElasticSearch as a canonical document store is a really bad idea, but do I know that? And what about the venerable table? Is a set of SQL tables really bad at the things a graph is good at?

Let’s find out!

Methodology

I’m going to set up 5 (five) data stores on my local system:

Postgres will be my baseline table database. I could explore the differences between Postgres and MySQL, but I really don’t care; I’ve heard from enough people who I think know enough about this to have a working axiom: “they are kinda the same, but Postgres is better for reasons.” Whatever, it’s rectangular, and that’s what I care about.

MongoDB and ElasticSearch will represent the NoSQL document-store systems: ElasticSearch because we use that at work, and MongoDB because it’s an actual document store rather than a search engine, and I want to be clear about whether there are distinctions between “Document Store Problems” and “ElasticSearch Problems”.

Representing the graph stores, we’ll look at Neo4j and GraphDB. Neo4j is a labeled property graph; GraphDB is a native RDF triplestore. Both are “graphs,” but with slightly different formal structures and implications, and slightly different best practices in the patterns and shape of their data. I’m curious how those differences shake out in the wash.

That Data Tho

I’m going to need two different data sets: one that is very single-rectangular-table and one that is very lots-of-kinds-of-things-with-complex-relationships. The former should favor Postgres, the latter should favor the graphs. The document stores should, in theory, handle both datasets similarly. These two datasets will be very large, but I don’t think they need to be terribly wide. Ultra-wide resources are interesting tho, and there may be room to expand into a third data set if I’m having a good old time.

These data sets will be synthetically generated, with a script for generating each one that I will share here. The synthetic data should cover a range of data types: strings, booleans, integers, floats, coordinates, rich text, and of course relationships.
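For flavor, here’s roughly the shape I have in mind for a single synthetic record. This is just a sketch; the field names and value ranges are placeholders, not the final design of either data set:

```python
import random
import string
import uuid

# Hypothetical record shape -- field names and ranges are placeholders,
# not the final shape of either data set.
def random_string(length=12):
    return "".join(random.choices(string.ascii_lowercase, k=length))

def synthetic_record():
    return {
        "id": str(uuid.uuid4()),
        "name": random_string(),             # string
        "active": random.random() < 0.5,     # boolean
        "count": random.randint(0, 10_000),  # integer
        "score": random.uniform(0.0, 1.0),   # float
        "location": [                        # coordinate pair (lat, lon)
            random.uniform(-90.0, 90.0),
            random.uniform(-180.0, 180.0),
        ],
    }

print(synthetic_record())
```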

This will probably change

The current scripts can write JSON blobs to disk, about 10 million key:val pairs in a few seconds. This generates decently sized data sets: 350,000 nodes in the graph and 1,666,000 rows in the table. The graph is a foaf network of people, organizations, projects, and groups. People have past projects and current projects, know other people, and are part of a group or an org. Groups can be associated with other groups or orgs, and orgs can be related to each other.
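In miniature, the graph generator looks something like this. The counts and ratios are made up for the sketch, and the non-foaf predicate names (member_of, associated_with, related_to) are stand-ins; foaf:knows, foaf:currentProject, and foaf:pastProject are real FOAF properties:

```python
import random

# Scaled-down counts for the sketch; the real run is much bigger.
NUM_PEOPLE, NUM_ORGS, NUM_GROUPS, NUM_PROJECTS = 1_000, 50, 100, 300

people = [f"person/{i}" for i in range(NUM_PEOPLE)]
orgs = [f"org/{i}" for i in range(NUM_ORGS)]
groups = [f"group/{i}" for i in range(NUM_GROUPS)]
projects = [f"project/{i}" for i in range(NUM_PROJECTS)]

triples = []
for person in people:
    # people know other people...
    for friend in random.sample(people, k=random.randint(1, 10)):
        if friend != person:
            triples.append((person, "foaf:knows", friend))
    # ...are part of a group or an org...
    triples.append((person, "member_of", random.choice(groups + orgs)))
    # ...and have current and past projects.
    triples.append((person, "foaf:currentProject", random.choice(projects)))
    triples.append((person, "foaf:pastProject", random.choice(projects)))

# groups associate with other groups or orgs; orgs relate to each other
for group in groups:
    triples.append((group, "associated_with", random.choice(groups + orgs)))
for org in orgs:
    triples.append((org, "related_to", random.choice(orgs)))
```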

The table is a big old bibo blob of books, with titles, authors, and publishers. There are about 1/3 as many authors as books, and 1,000 publishers.
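The table generator is even simpler. Again a sketch, with the counts scaled down and the field values as placeholders; the only real constraint is the rough ratio of authors and publishers to books:

```python
import random

# Sketch of the tabular generator: one denormalized row per book, drawing
# from author and publisher pools sized roughly as described above.
NUM_BOOKS = 100_000  # placeholder; the real run is much bigger
authors = [f"Author {i}" for i in range(NUM_BOOKS // 3)]
publishers = [f"Publisher {i}" for i in range(1_000)]

def book_row(i):
    return {
        "title": f"Book {i}",
        "author": random.choice(authors),
        "publisher": random.choice(publishers),
    }

rows = [book_row(i) for i in range(NUM_BOOKS)]
```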

Both scripts can be adjusted for how many records to generate, and I’m sure there is a hard limit on memory somewhere. I also broke the generator script apart from the JSON-writing script, since I’m anticipating that each datastore will probably need its own little script for ingressing this data, and I don’t see why I’d want to work with huge JSON blobs for that.
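The shape of that split looks roughly like this. generate_rows here is a trivial stand-in for the real generator; the point is that generation yields plain dicts, writing is a separate step, and newline-delimited JSON keeps memory flat no matter how many rows get made:

```python
import json

# Sketch of the split: the generator yields plain dicts, and writing is a
# separate step, so each backend can get its own small ingest function later.
def generate_rows(n):
    for i in range(n):
        yield {"title": f"Book {i}"}  # stand-in for the real row generator

def write_jsonl(rows, path):
    # one JSON object per line; nothing is held in memory beyond one row
    with open(path, "w") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")

write_jsonl(generate_rows(100_000), "books.jsonl")
```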

OR

I might also just use this benchmark dataset.

Queries and Such

Once our data sets are in our databases, we’ll define some queries to run against each of them and see what happens. From there we’ll perform some standard, basic maintenance and development tasks, like adding new structures, changing schemas, and the like. These will all be grounded in real-world use cases.
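Whatever the queries end up being, I’ll want to time every store the same way. Here’s a minimal harness sketch; the warmup and trial counts are arbitrary, and the commented examples just assume the usual Python drivers:

```python
import statistics
import time

# Each store's query gets wrapped in a zero-argument callable, so the
# harness doesn't care which driver is underneath.
def time_query(run_query, warmups=3, trials=10):
    for _ in range(warmups):  # let caches and connections warm up
        run_query()
    samples = []
    for _ in range(trials):
        start = time.perf_counter()
        run_query()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)

# e.g. time_query(lambda: cursor.execute("SELECT ...")) for Postgres,
# or time_query(lambda: session.run("MATCH ...")) for Neo4j.
```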

Once the data sets have been defined, I’ll think through these actions and put them here.