Describing my gameplan to self-publishing and self-generating my future websites
Published on
It's been a hot minute since I spoke about any projects. Truth be told, I haven't worked on much, and I beat myself up over that every day. Work has been pretty stressful all things considered, and the layoffs across the industry have not done much for my own mental wellbeing.
Nevertheless, I think this year will be the year I attempt to take my website into my own hands. I've spoken about it far too much in the past, and frankly, I think it's time I make it my February project. Here is my gameplan.
Right now, my website is published with Zola, and while I think it's a good baseline for creating and publishing websites, I feel like I'm kind of over it. There are certain things I want to do, and to get features like that into Zola, I'd have to fork or contribute. Neither is of much interest to me, as Zola is written in Rust, which is not a time-friendly language or one I want to explore currently.
In the past, I created my website fully with Racket. I've written before about how I sort of don't want to continue using it, but I think I am inevitably drawn to it for many of its good things. Let's just say last year I sort of flip-flopped between many languages, and Racket is my comfort food. It sucks, then it doesn't. I'm just not drawn to one language in particular, but I think Racket is frankly the easiest to settle down with.
I have considered writing things in Guile Scheme, but I sort of don't like the overall feel of Guile just yet. The docs for Guile aren't quite as fun to read as Racket's, and that's one thing I like about Racket a lot: the documentation is superb. Like, chef's kiss good. There are links to functions, types, classes, pages and pages of different sets of information. I could really go on, but it's the best documentation I've ever had to use. I think Rust's might be somewhere close. Zig's documentation is... Meh.
I have had some complaints about Racket, namely with its strange, seemingly isolated community, people coming and going, and its inability to escape academia, but I feel like I have to accept that it is what it is, and I can simply continue using it for as long as I see fit. Now, let's start by understanding the complexities that go into making a website generator.
A website is (typically) made up of content that I write, namely my pages written in Markdown, and then it's up to the website generator to collect these content pages, parse them in some manner, then join them with some kind of HTML templating engine to create a full-blown website. I'd say this process is done in these steps:
Each of these five things has a sizable job to work out. We need to integrate with a file system, walk through directories recursively, collect content files, read template files and parse them into template functions, then begin the incredibly tedious task of parsing Markdown. I've struggled with Markdown parsing a lot in the past, but I think this time I need to try it for real.
The Collector's job is to collect things it finds along its way. The collector takes on the big role of finding specific files and relating their existence to some purpose. Content files would be for publishing, templates would be for rendering, and static would be for files that live "globally" on the site.
$ tree new-site
new-site
├── content
│ ├── about
│ └── blog
├── functions
├── static
└── templates
This mirrors Zola's structure. The `templates` and `functions` folders are separate, because if you were to, say, copy someone's template folder, they might not share the same functions. A function for me is something to use inside of content publishing, so I don't think functions should live inside a templates folder. They have a similar purpose, but not really. (Zola calls my idea of functions `shortcodes` instead.)
The Collector will have to:
The `static` folder can be left alone since that will be a one-for-one copy to the public folder. Ultimately, the collector will have to distinguish parsable content from non-parsable content, which can then be fed into a parsing pipeline to convert written goods into publishable goods. I'm really running out of terms here.
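Here's a rough sketch of how that sorting could work, assuming the folder layout above. The names `classify-path` and `group-paths` are made up for illustration, and treating only `.md` files under `content` as parsable is my own assumption.

```racket
#lang racket

;; Hypothetical helper: decide what a site-relative path is for,
;; based on its top-level folder and its file extension.
(define (classify-path p)
  (define top (car (explode-path p)))
  (define folder (and (path? top) (path->string top)))
  (cond
    [(equal? folder "static") 'static]
    [(equal? folder "templates") 'template]
    [(equal? folder "functions") 'function]
    ;; Assumption: only Markdown files under content/ get parsed.
    [(and (equal? folder "content")
          (equal? (path-get-extension p) #".md"))
     'content]
    ;; Anything else under content/ is copied along with its page.
    [(equal? folder "content") 'asset]
    [else 'ignored]))

;; Group a list of paths into a hash keyed by classification,
;; keeping the original order inside each group.
(define (group-paths paths)
  (for/fold ([groups (hash)]) ([p (in-list paths)])
    (hash-update groups
                 (classify-path p)
                 (λ (ps) (append ps (list p)))
                 '())))
```

The list of paths would come from walking the site folder (for example with `in-directory`), but keeping the classification separate from the file system makes it easier to poke at in a REPL.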
I probably wrote about this a bit ago in some other post, but the goal would be to use regular expressions to convert blocks of text into a kind of parsable, state-machine-like list for a rendering engine to step through. Converting a template like this:
<html>
<body>
{{ insert_text }}
</body>
</html>
After conversion, it would become a list of elements like the following:
'((raw "<html><body>")
(sub "insert_text")
(raw "</body></html>"))
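A minimal sketch of that conversion, assuming substitution keys are made of letters, digits, underscores and dots. The name `tokenize` is hypothetical, and unlike the hand-written list above, this version keeps whitespace and newlines in the raw text exactly as written.

```racket
#lang racket

;; Matches {{ key }} and captures the key.
(define tag-rx #px"\\{\\{\\s*([A-Za-z0-9_.]+)\\s*\\}\\}")

;; Hypothetical tokenizer: split a template string into
;; (raw "...") and (sub "key") elements, in order.
(define (tokenize template)
  (let loop ([start 0])
    (define m (regexp-match-positions tag-rx template start))
    (cond
      ;; No more tags: whatever is left is raw text.
      [(not m)
       (if (= start (string-length template))
           '()
           (list (list 'raw (substring template start))))]
      [else
       (define whole (car m))
       (define key (cadr m))
       (define before (substring template start (car whole)))
       (define here (list 'sub (substring template (car key) (cdr key))))
       (define more (loop (cdr whole)))
       (if (string=? before "")
           (cons here more)
           (list* (list 'raw before) here more))])))
```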
As we step through each element, we can perform actions based on the tag given to each element. For a `raw` tag, we write it to wherever it's supposed to go. For a `sub` tag, we substitute the text to flush with a key/value replacement from an exterior data source, like a dictionary or a hash map. If my `insert_text` key were pointing to a string of text like `Hello World!`, then we can convert substitution elements into raw tags pretty easily.
; after replacement
'((raw "<html><body>")
(raw "Hello World!")
(raw "</body></html>"))
This substitution pass would be one phase of rendering, where we substitute data supplied from a pretty simple source. This system serves as a baseline for simple substitution, but obviously for full-blown websites, we might need some actual logic.
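That pass could look something like this, assuming the data source is an immutable hash with string keys. Both `substitute` and `flush` are names I'm making up here, and raising an error on a missing key is just one possible choice.

```racket
#lang racket

;; Hypothetical substitution pass: turn every (sub key) into (raw value)
;; by looking the key up in a hash; raw elements pass through untouched.
(define (substitute elements data)
  (for/list ([el (in-list elements)])
    (match el
      [(list 'sub key)
       (list 'raw
             (hash-ref data key
                       (λ () (error 'substitute "no value for key ~a" key))))]
      [_ el])))

;; Once only raw elements remain, writing them out is a string join.
(define (flush elements)
  (string-append* (map cadr elements)))
```

Keeping the substitution separate from the flush means later phases (logic, loops, blocks) can each reduce the list a bit further before anything is written.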
Now, I'm not someone who's good at designing emulators or virtual machines or whatever, but I think it follows some similar principles. Here are some keywords describing my intent.
- `if` - the building block of all logic, executes if condition is met
- `else` - the opposite end of the `if`
- `for` - to iterate over a span of data, like a list of posts
- `block` - a reserved section of elements to be easily replaceable
- `extends` - a keyword indicating we're inheriting from another template
- `comment` - something to help us comment our templates so they make sense

An `if` with no `else` should realistically be turned into a `when`, but that's all about semantics at that point. I would still like to have a `when` keyword for better clarity in templating, and maybe even an `unless`. A template should realistically only be extending one template, so future `extends` are to be ignored.
Ideally, a proper template might look something like this:
{% extends "base.html" %}
{% block title %}
{{ page.title }}
{% endblock %}
{% block content %}
<div id="content">
{{ page.content }}
</div>
{% endblock %}
This can render one blog post, which is doable by rendering a page to a variable `page` and treating it like a dictionary with sub-keys accessed through a dot reference. The `block` terms would override blocks in an inherited template, and then it's a bunch of basic substitutions. Not so terrible a goal.
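The dot reference part is small enough to sketch. This assumes pages are stored as nested immutable hashes with string keys; `lookup` is a hypothetical helper name.

```racket
#lang racket

;; Hypothetical helper: resolve a dotted key such as "page.title"
;; by walking nested hashes one segment at a time.
(define (lookup data dotted)
  (for/fold ([current data])
            ([key (in-list (string-split dotted "."))])
    (hash-ref current key)))
```

With something like `(hash "page" (hash "title" "Hello"))` as the data, `(lookup data "page.title")` walks down to the title string.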
The harder part will, no doubt, be looping, as that's a bit of a mess. A block iterating over a list of elements will be tricky, but hopefully not terribly difficult. Transforming this:
; assume 3 pages
{% for page in pages %}
{{ page.title }}
{% endfor %}
Into this:
'((raw "Page title 1")
(raw "Page title 2")
(raw "Page title 3"))
This should not be an impossible task. This functionality is probably the most important of all. Being able to iterate over data like this in a simple manner is crucial, and it will come up a bit, especially when defining an RSS feed to output.
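Here's a sketch of how the loop could be expanded, assuming the `for` tag has already been parsed into an element shaped like `(for "page" "pages" (body ...))`. That element shape, and the names `expand` and `lookup`, are my own guesses at a design rather than anything settled.

```racket
#lang racket

;; Hypothetical helper: resolve "page.title" against nested hashes.
(define (lookup data dotted)
  (for/fold ([current data])
            ([key (in-list (string-split dotted "."))])
    (hash-ref current key)))

;; Expand a list of elements into raw elements only.
;; A (for var source body) element renders its body once per item,
;; with var bound to that item in the data hash.
(define (expand elements data)
  (append*
   (for/list ([el (in-list elements)])
     (match el
       [(list 'raw _) (list el)]
       [(list 'sub key) (list (list 'raw (lookup data key)))]
       [(list 'for var source body)
        (append*
         (for/list ([item (in-list (lookup data source))])
           (expand body (hash-set data var item))))]))))
```

Because `expand` calls itself on the loop body, nested loops and substitutions inside loops come along for free.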
Blocks are a little trickier, but substituting them requires joining one template to another and substituting blocks that get overwritten. Block tags should be kept on a separate side so we can easily figure out what blocks are part of a template. If we're inheriting from a template, and a block isn't in the parent template, then the block probably shouldn't be used at all, and an error will be thrown. Blocks are by design meant to be overwritten by a child, and starting new blocks would be problematic in a child template. This is subject to change, however.
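A rough sketch of that joining step, assuming a parent template is a list of elements where a block looks like `(block "name" (default-elements ...))` and the child's blocks have been collected into a hash from name to elements. The names `block-names` and `apply-blocks` are hypothetical.

```racket
#lang racket

;; Names of all blocks declared in a (parent) template.
(define (block-names elements)
  (for/list ([el (in-list elements)]
             #:when (eq? (car el) 'block))
    (cadr el)))

;; Hypothetical merge: replace each parent block with the child's
;; version when there is one, otherwise keep the parent's default.
;; A child block the parent doesn't know about is an error.
(define (apply-blocks parent child-blocks)
  (define known (block-names parent))
  (for ([name (in-hash-keys child-blocks)])
    (unless (member name known)
      (error 'apply-blocks "block ~a is not defined in the parent template" name)))
  (append*
   (for/list ([el (in-list parent)])
     (match el
       [(list 'block name default) (hash-ref child-blocks name default)]
       [_ (list el)]))))
```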
That's it for templates right now, now let's look at a more difficult aspect.
Markdown parsing is problematic, for a lot of reasons. There are a lot of rules, and it's not an easy project. New lines being paragraphs is one thing, but the other rules that come into play are not as simple, like:
However, it's not out of the realm of possibility. Like the template engine, using some clever regular expressions, we can separate text by line, look at the characters at the beginning of each line to create a contextual tag for interpreting, and do our best to collect bits and pieces line by line, until we reach a final assembly point.
As an example, here's some Markdown.
# Hello world
I am a paragraph
This should be converted into a structure like:
'((h1 "Hello world")
(p "I am a paragraph"))
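A tiny sketch of that line-by-line tagging, handling only headings and paragraphs. As a simplification that is my own assumption, every non-blank line becomes its own paragraph; `parse-blocks` is a hypothetical name.

```racket
#lang racket

;; Hypothetical block parser: tag each non-blank line by how it starts.
;; "# Title" through "###### Title" become h1..h6, the rest become p.
(define (parse-blocks text)
  (for/list ([line (in-list (string-split text "\n"))]
             #:unless (string=? (string-trim line) ""))
    (match (regexp-match #px"^(#{1,6})\\s+(.*)$" line)
      [(list _ hashes title)
       (list (string->symbol (format "h~a" (string-length hashes))) title)]
      [#f (list 'p line)])))
```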
Contextually, we have all we need to render these to HTML immediately. When we introduce intermixed elements like anchor links, it gets a little tricky, but not impossible.
I am a paragraph with a [URL](https://ste5e.site/).
Should turn into:
'((p "I am a paragraph with a "
     (a ([href "https://ste5e.site/"]
         [target "_blank"])
        "URL")
     "."))
The URL element should contain the URL we want, its inner text (with decorations, if any), and some properties that we store as a list at the start of the element, which we need when we're at the rendering phase.
Now, when it comes to text decoration, my understanding is that the text decoration has to actually make sense. Asterisks cannot be contextually interwoven between things like anchor links; here's what I mean.
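Splitting a line around its links could be sketched like this. It assumes link labels don't contain `](` and URLs don't contain `)`, and `parse-inline` is a made-up name; adding `target "_blank"` to every link just follows the example above.

```racket
#lang racket

;; Matches [label](url) and captures both parts (non-greedy).
(define link-rx #px"\\[(.*?)\\]\\((.*?)\\)")

;; Hypothetical inline pass: return the children of a paragraph as a mix
;; of plain strings and (a (attributes ...) label) elements.
(define (parse-inline text)
  (let loop ([start 0])
    (match (regexp-match-positions link-rx text start)
      [#f (if (= start (string-length text))
              '()
              (list (substring text start)))]
      [(list (cons from to) (cons l0 l1) (cons u0 u1))
       (define before (substring text start from))
       (define link
         `(a ([href ,(substring text u0 u1)] [target "_blank"])
             ,(substring text l0 l1)))
       (define more (loop to))
       (if (string=? before "")
           (cons link more)
           (list* before link more))])))
```

Wrapping the result as `(cons 'p (parse-inline line))` gives a paragraph element with the link nested inside it.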
[*Good anchor link italics*](url.com)
[*Bad anchor link italics](url.com)*
Inside square brackets, the text needs to make sense, so the first one is a good example of proper decorating, while the second one is not.
'((p (a ([href "url.com"] [target "_blank"])
(i "Good anchor link italics")))
(p (a ([href "url.com"] [target "_blank"])
"*Bad anchor link italics")
"*"))
The anchor is a rule that must be matched as a whole, and italics is another rule that captures text between asterisks. These cannot be so easily intermixed, so I think it's good design to make this a valid distinction. We should not be creating crazy rules to match poor Markdown writing.
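One way to get that behaviour is to run the italics rule only on text that has already been isolated, such as a link's label. A small sketch, with `parse-emphasis` as a hypothetical name: a pair of asterisks becomes an `i` element, and a lone asterisk is left as literal text.

```racket
#lang racket

;; Matches *text* and captures the text (non-greedy).
(define em-rx #px"\\*(.+?)\\*")

;; Hypothetical italics pass over one isolated span of text.
;; Unpaired asterisks never match, so they stay in the output as-is.
(define (parse-emphasis text)
  (let loop ([start 0])
    (match (regexp-match-positions em-rx text start)
      [#f (if (= start (string-length text))
              '()
              (list (substring text start)))]
      [(list (cons from to) (cons t0 t1))
       (define before (substring text start from))
       (define em (list 'i (substring text t0 t1)))
       (define more (loop to))
       (if (string=? before "")
           (cons em more)
           (list* before em more))])))
```

Applied to the two labels above, the first yields an `i` element and the second comes back unchanged, asterisk and all.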
For numbered lists, which are sort of tricky, the goal is to convert a list of either bullet points or numbers into a list of `<li>` elements, except that the numbered list elements will store the number with their `<li>` tag to note the distinction between `<ul>` and `<ol>`.
1. Numbered 1
2. Numbered 2
* Not numbered
* Also not numbered
Should look something like this after a parse.
'((ol (li ([value "1"]) "Numbered 1")
(li ([value "2"]) "Numbered 2"))
(ul (li "Not numbered")
(li "Also not numbered")))
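A sketch of how that grouping could be done, assuming one list item per line and no nesting. `parse-list-line` and `parse-lists` are hypothetical names, and the number is kept exactly as written.

```racket
#lang racket

;; Hypothetical: classify one line as a numbered or bulleted item.
;; Returns (list kind li-element), or #f for anything else.
(define (parse-list-line line)
  (match line
    [(pregexp #px"^(\\d+)\\.\\s+(.*)$" (list _ n text))
     (list 'ol `(li ([value ,n]) ,text))]
    [(pregexp #px"^[*-]\\s+(.*)$" (list _ text))
     (list 'ul `(li ,text))]
    [_ #f]))

;; Group consecutive items of the same kind under one ol/ul element.
(define (parse-lists lines)
  (define groups ; newest group first, newest item first inside each group
    (for/fold ([groups '()]) ([line (in-list lines)])
      (match (parse-list-line line)
        [#f groups]
        [(list kind item)
         (match groups
           [(cons (cons (== kind) items) more)
            (cons (list* kind item items) more)]
           [_ (cons (list kind item) groups)])])))
  (reverse
   (for/list ([g (in-list groups)])
     (cons (car g) (reverse (cdr g))))))
```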
I don't know how good a design it is to keep the `value` property for each element, because an `<ol>` should correctly count the elements with only one starting number. However, a user can create numbered lists that might not necessarily make sense, and the automatic numbering of `<ol>` lists might not support that. I can go crazy and write a list like this:
It shows a one-two-three list, but I actually wrote that as:
1. First
3. Second
2. Third
Zola's Markdown only seems to provide a `value` property if the first element has a value other than one. Which is fine, but I think sometimes you might want to go out of order for certain lists in writing, or at least supply different values inside seemingly numbered lists; maybe you're writing numbers backwards. I think this is where Markdown and I might disagree.
I don't plan on doing much more fancy stuff than that; I don't really like tables because I don't have much use for them, but maybe I'll incorporate that at a later time. Once these elements are collected and properly grouped, they can be fully rendered into the above template engine format by converting the information into `raw` tags, which will then be written to file once it's at the write phase.
All my posts incorporate metadata to produce information like titles or dates, and these are important to have. For sanity, I think a section at the top of each file should be sufficient. Just for consistency, I can use the same format as Zola, as I don't think it's a terrible method of storing information.
+++
title = Hello world
description = A post about worlds being hello'd at
date = 1/1/2023 1:00PM
+++
I don't normally supply the time, but for my posts to better be sorted by RSS readers, I think a timestamp is necessary to include the information of when a file is published live. Normally Zola requires you to mark things in quotes for strings, but if I split by new lines, I don't feel that's necessary here. I'm okay keeping things in one line, mostly because I don't deem it necessary to write huge paragraphs in either title or description.
At the file reading process, these values will mutate a metadata structure that will live with the page in order to give the build system more context about posts, and how we can go about sorting them (date, name, tags, etc).
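Reading that header could be sketched like this, assuming the file starts with a `+++` line, each metadata line is a single unquoted `key = value` pair, and everything after the closing `+++` is the body. The names `split-front-matter` and `parse-meta` are hypothetical, and values are kept as plain strings (no date parsing yet).

```racket
#lang racket

;; Hypothetical: turn "key = value" lines into a hash of strings.
;; Only the first "=" on a line separates the key from the value.
(define (parse-meta meta)
  (for/hash ([line (in-list (string-split meta "\n"))]
             #:when (string-contains? line "="))
    (match-define (list _ key value) (regexp-match #px"^(.*?)=(.*)$" line))
    (values (string-trim key) (string-trim value))))

;; Split a file's text into (values metadata-hash body-text).
;; Files without a +++ header get an empty hash and the full text.
(define (split-front-matter text)
  (match (regexp-match #px"^\\+\\+\\+\n(.*?)\n\\+\\+\\+\n?(.*)$" text)
    [(list _ meta body) (values (parse-meta meta) body)]
    [#f (values (hash) text)]))
```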
Collecting tags from posts would also enable me to create something like a tag cloud, which is such an old thing I haven't seen used in forever, but I think it's important to demonstrate at a glance what a person really writes about a lot. This is not something I think Zola has enabled me to create easily, as I don't see a way of enumerating all posts and collecting info like this very easily, if at all.
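Once every post carries its metadata, the counting side of a tag cloud is short. This sketch assumes each post is a hash whose `"tags"` entry is already a list of strings; that key and the names `tag-counts` and `tag-cloud` are assumptions on my part.

```racket
#lang racket

;; Hypothetical: count how many posts use each tag.
;; Posts without a "tags" entry simply contribute nothing.
(define (tag-counts posts)
  (for*/fold ([counts (hash)])
             ([post (in-list posts)]
              [tag (in-list (hash-ref post "tags" '()))])
    (hash-update counts tag add1 0)))

;; Most-used tags first, as (tag . count) pairs.
(define (tag-cloud posts)
  (sort (hash->list (tag-counts posts)) > #:key cdr))
```

The counts could then drive font sizes in a template, which is all a tag cloud really is.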
Something that might be annoying and awkward is the idea of changing my URL slugs. Currently, they point to all the appropriate places, but should I change the format, past bookmarks will break, RSS readers with old data will be outdated, and (not that much worse) search engine results will be invalidated due to broken URLs.
Visibility will be impacted, but frankly, I'm not terribly happy with my current slugs. Right now they are a basic format, `/blog/post-title`. This format is not very appealing for me, personally, because the time it's posted is not included in the URL, which is sort of a hard hit for page rank engines. It should be more clear, so I think `/blog/2024/01/01/post-title` is more appropriate, even if it takes up more URL space.
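Building the new slug from a post's date and title might look like this. It assumes the date has already been split into numbers, keeps leading zeroes, and uses hypothetical names (`slugify`, `pad2`, `post-url`); the slug rule (lowercase, runs of non-alphanumerics become one dash) is my own guess.

```racket
#lang racket

;; Hypothetical: lowercase the title and collapse anything that isn't
;; a-z or 0-9 into single dashes, trimming a dash from either end.
(define (slugify title)
  (string-trim (regexp-replace* #px"[^a-z0-9]+" (string-downcase title) "-")
               "-"))

;; Pad a month or day to two digits.
(define (pad2 n)
  (~r n #:min-width 2 #:pad-string "0"))

;; Assemble /section/year/month/day/slug.
(define (post-url section year month day title)
  (format "/~a/~a/~a/~a/~a" section year (pad2 month) (pad2 day) (slugify title)))
```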
Since I mentioned changing my slugs, it's important to mention that with this new system comes a new feature: being able to view posts by a specific time-frame. If a year is now included in the URL, it should be possible to visit `/blog/2024` to see all posts for the year. I didn't have this set up with Zola, so I'll be looking to add this going forward. Moving down a timeframe, like say going to `/blog/2024/01`, will move you to all posts from January.
(This is subject to change, however, I don't know if I like having leading zeroes for months/days still.)
So, generally how I think it goes, would be something like:
public directory/
index.md
section/
year/
index.md
month/
index.md
date/
index.md
post-title/
index.md
picture.jpg
...
At the root of the folder would be where visitors land upon hitting your root domain, and each subfolder would be considered its own section. Subfolders under that would be considered pages, each with their own assets and files.
The big picture here would be organizing all posts, then categorizing them into smaller groups for proper year/month/day publishing. Day is a little extreme; assuming I don't post more than once a day, it's probably not needed, but while I'm here, it might as well be done.
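The categorizing step could be sketched as one pass that files each post under its year, its year and month, and its full date. This assumes each post is a hash with a `"title"` string and a `"date"` stored as a `(list year month day)`; that representation and the name `archive-index` are assumptions.

```racket
#lang racket

;; Hypothetical: build a hash from time-frame keys to post titles.
;; A post dated 2024-01-01 is listed under '(2024), '(2024 1) and '(2024 1 1).
(define (archive-index posts)
  (for/fold ([index (hash)]) ([post (in-list posts)])
    (match-define (list year month day) (hash-ref post "date"))
    (for/fold ([index index])
              ([key (in-list (list (list year)
                                   (list year month)
                                   (list year month day)))])
      (hash-update index
                   key
                   (λ (titles) (append titles (list (hash-ref post "title"))))
                   '()))))
```

Each key in the result would become one generated index page, so `'(2024)` maps to `/blog/2024` and `'(2024 1)` to `/blog/2024/01`.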
Copying assets from each page folder is not a hard task, but just generally speaking, it might be nice to know if optimizations can be made on certain files to make them smaller and more bandwidth-friendly. I wouldn't be opposed to using ditherpunk to make images smaller if needed. Visually, I like ditherpunk a lot.
JPEG images can be large, and PNG images can be better, but is it a small feat to be able to tell the user (me) that? I feel like it's a pretty extensive task to scan through all images and perform compression tests to see which format comes out smaller, and it's a lot of expensive CPU time to do so. I would like a system to aid me in this, and I figure I can staple it onto here using some of Racket's base image libraries.
If not Racket, then I can use external system calls to `imagemagick` and do `convert` calls on files to see which format would, in theory, be better. And, looking to the future, maybe even try JPEG XL.
This isn't really something that necessarily impacts me, but because of how I host my files, I would like my files to be smaller if need be, and not taking up large amounts of space if avoidable. I think it'd also be interesting to have a system that can help manage your space in this fashion.
Linktree is a common service people use to host quick links of where to catch you elsewhere on the internet. Because many content creators and individuals have many different outlets, it's nice to have a place to centralize your information so people know where you can be found in an official context, and fight against impostors who may try to phish or use links to pose as you.
Not that I have a security threat such as impostors, nor do I ask people for money in any capacity, but a Linktree is an interesting thing for me to think about, because it's a quick glance at where to find people. It's not hard to build, and might be one of the first things I build out when designing my new system.
The next part, not necessarily a crazy one, would be to help me craft and create links to send out on social media. I tend to craft links for Mastodon manually, so if I had something that could read my metadata and help me publish a new page, then that'd be a neat little timesaver. Hashtags could be inferred from my posts, and it would be so simple to implement and fire off immediately into the void.
The first step, amongst all this, is a successful template engine. Without that, I cannot really do anything I would like. The template engine itself is about 50% of the work, while Markdown parsing is probably 30%, and the other 20% is putting it all together into a single directory.
By next week I would like to have a rough sketch of the template engine in use, and hopefully a small Markdown parsing engine to use it. Once I get it to a point where I feel comfortable with it, I will build it in a `beta/` directory, then slowly phase out the old Zola files with the new one, which will probably break things, but in a good way.
The goal of course, will be to build iteratively upon my current existing files, see how much I have to "back-support", and then gradually move away. I don't want to be a one-for-one with Zola in the slightest, and I know using Racket will lend itself to being much slower, but for the goal of creating something I fully own top-to-bottom, I feel like it's a choice I'm willing to make. Plus, it's easier to share with friends who are just getting started into their web goals if we have something easy and simple that we can collaborate on top of.
Until next time, happy weekend!