January 25, 2025

Data Precedence

One theme I'll be returning to is a principle which I'm calling "Data Precedence" (assuming the specific goal doesn't already have a name). This specifically refers to the precedence of data over logic.

This is closely related to data-centrism and data as a product (1), and it also aligns closely with much of the software tools ideology built on top of what I think Fielding calls the uniform pipe and filter style, as well as with linked data/semantic web efforts.

The main focus of the principle is that one of the primary goals of logic is to expose model data for portable external use. This is in contrast to approaches such as plugins, or a preference for libraries, which try to keep most of the processing inside a given process or platform.

As mentioned, this is particularly in line with established concepts such as shell pipelines, and it may be the merging of that idea with data as a product that leads to the goal that intermediate data, and not only finished data, should be similarly exposed (albeit likely with different levels of controls and contracts depending on the firmness of the boundaries). This also seems a plausible long-term evolution of how some current directions play out, given that we seemingly tend to develop tools which allow us to work with finer-grained pieces over time - certainly this could be viewed as isomorphic to a function-as-a-service data mesh.

In practice this translates to what has already been mentioned - a primary goal should be to enable interoperable access to data, and therefore to adopt tools and practices which expose data rather than processing it directly. There are certainly potential drawbacks; one that springs to mind is the overhead of marshaling and unmarshaling the data. On modern systems this is less of a concern, and there are plenty of tools floating around that have focused on optimizing encoding performance and enabling use across languages. Another concern may be security, though by itself this is not a real one: data security should be pervasive/deperimeterized and appropriately enforced, so if changing where or when data is exposed changes who can access it, that is itself the problem. A related and more real problem is that such transfer may introduce additional costs for encrypting and decrypting sensitive data, which could undermine the previous sentiment around low IPC costs - this is legitimate, and if necessary such work should be done within some form of trust boundary/enclave. While either of these could introduce some form of cost issue, the standard guidance of avoiding premature optimization applies.

Pandoc Exporting

One tool which I've repeatedly ended up using over the years for various purposes is Pandoc. It also provides a natural way for me to export this site in conformance with the goal of Data Precedence (and generally enabling that pursuit by providing a more semantic model for many document types).

I've used it in past incarnations with markdown, and most recently with LogSeq and org mode (in an orphaned project) where I have some existing logic that I'll pull in here.

I'll start with seeing what gaps may exist just invoking Pandoc. I did a quick skim of the model from yesterday's journal

pandoc 20250124.org -t native

and it seemed more promising than what was in place for LogSeq, so I’ll configure that for HTML output and see how it looks. At this point I should also look at splitting out the source and output files.

This has largely been moved to generate.mk.
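
For illustration, the kind of rule such a makefile might contain is roughly the following - the src/ and out/ directory names (and the target name) are placeholders on my part rather than the actual layout:

# Sketch only - src/, out/, and the html target are placeholder names.
SOURCES := $(wildcard src/*.org)
PAGES   := $(patsubst src/%.org,out/%.html,$(SOURCES))

html: $(PAGES)

out/%.html: src/%.org
	mkdir -p $(dir $@)
	pandoc $< -s -o $@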

Test

Given that yesterday’s file has a citation, it will again act as the test subject, using the tangled contents of this file (moved to an appropriate place).
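
A sketch of the sort of invocation used for the test - references.bib here is a placeholder rather than the actual bibliography file:

# Sketch - references.bib stands in for the actual bibliography file.
pandoc 20250124.org -s --citeproc --bibliography=references.bib -o 20250124.html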

Test Result

The result is overall promising - one piece I didn't carry over was how the references were styled since the default isn't particularly Web friendly. I'll pull in some Pandoc defaults to help with that.

Citation Styling

I default to using the iso690-numeric-en.csl styling, which I grabbed from somewhere (to be referenced later) and which is included locally.

Pandoc Defaults

Here I’ll specify that I want newer HTML in case that’s not the default, activate citeproc (which I’ll then attempt to remove from the previous command), indicate the desired citation formatting, and activate citation linking.

to: html5
citeproc: true
metadata:
  citation-style: iso690-numeric-en.csl
  link-citations: true
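
Assuming this is saved as something like defaults.yaml (again a placeholder name), it gets passed alongside the existing options - note that citeproc is now specified in both places:

# Sketch - defaults.yaml and references.bib are placeholder names.
pandoc 20250124.org --defaults defaults.yaml --citeproc --bibliography=references.bib -o 20250124.html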

My previous file also had some settings that I don't need yet, so I'll wait on those. I could presumably also add the bibliography file here but that seems to smell a bit given that it crosses out of the relatively hermetic inputs.

Result

The citations are now formatted as desired - but they are also repeated. This may be due to citeproc currently being specified redundantly, so I’ll remove the one from the command (disabling the previous block).
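
With citeproc left to the defaults file, the command reduces to something like (again with placeholder names):

# Sketch - citeproc now comes only from the defaults file.
pandoc 20250124.org --defaults defaults.yaml --bibliography=references.bib -o 20250124.html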

Quick Smell Check

One concern I’d have is the split between specifying citeproc and the bibliography file - if there’s a hard dependency between them, then it would be undesirable behavior/connascence for things to break when the scattered pieces aren’t aligned. Running Pandoc manually without specifying the bibliography produces a warning about the missing citation, which seems like appropriate behavior.

Another concern is how internal links will work - the previous page didn’t allow for testing that, given that it has no such links. The clearest candidate is the home page, as it links to all of the journal pages, so I’ll try that one for testing. It’s also about time for me to wind down this session, so if it works I’ll flip my site over to Pandoc, and otherwise I’ll come back to this later.

Unfortunately, but not at all surprisingly, internal links do not work, given that they use the .org extension. This can be fixed with a relatively trivial Pandoc filter - I’d prefer to just remove the extension and leave the resolution to content negotiation and Web server settings, but at a glance that does not seem to be supported on sourcehut pages (which is unfortunately in line with other static page hosts), so I’ll likely start by replacing it with .html and chase down the options for avoiding extensions later on. I know of the common pattern of making each page its own directory to take advantage of the Web server’s handling of index pages - that seems like a practical compromise where necessary, but also a bit of the tail wagging the dog, which I’d rather avoid. Something like dropping the extension from the output files may be a better option IMO, so long as the servers and clients can work out the content type (I know IE used to have issues with this ages ago, so while I’d hope no modern browsers have problems, the server may need to be configured properly).
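
As a starting point, a minimal sketch of such a filter in Lua - the org-links.lua name is a placeholder, and it would be passed via --lua-filter - could rewrite bare .org link targets to .html:

-- org-links.lua (placeholder name): rewrite internal .org link targets to .html.
-- Sketch only - targets with fragments (page.org#section) would need a
-- slightly smarter match.
function Link(el)
  -- Leave external URLs (anything containing "://") alone.
  if not el.target:find("://", 1, true) and el.target:match("%.org$") then
    el.target = el.target:gsub("%.org$", ".html")
  end
  return el
end

It could then be hooked in either with --lua-filter on the command line or through the filters list in the defaults file.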

1.
STRENGHOLT, Piethein. Data management at scale: modern data architecture with data mesh and data fabric. Second edition. Sebastopol, CA : O’Reilly Media, Inc, 2023. ISBN 978-1-09-813886-8.