Systems and Identifiers
A few, let's say, statements about system design decisions, followed by ponderings. Efficiency is actively sidelined.
URLs are extremely messy
SCHEME://USER@HOST:PORT/PATH?QUERY#FRAGMENT
what in the name of data representation is this???
I don't want to call things a mess, there's a reason for every decision, however it does sort of just represent something in many different ways.
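To make the pieces concrete, here's what Python's standard library recovers from a well-formed URL (the example address is made up):

```python
from urllib.parse import urlsplit

# Pull apart a URL into the pieces all that punctuation disambiguates.
url = "https://user@example.com:8080/path/to/page?key=value#section"
parts = urlsplit(url)
print(parts.scheme)    # https
print(parts.username)  # user
print(parts.hostname)  # example.com
print(parts.port)      # 8080
print(parts.path)      # /path/to/page
print(parts.query)     # key=value
print(parts.fragment)  # section
```

Seven differently-delimited pieces in one string, each with its own separator character.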
If we tried to make it as close to a regular path as possible,
it could look something like this: /HOST/SCHEME/PORT/PATH/.../FRAGMENT?
And here lies the reason for the different pieces of syntax: you can't really tell what is a fragment and what is still part of the path, and don't even get me started on the query parameters.
Leaving aside that the URL is supposed to locate a resource, and query parameters sort of add inputs to it, why is the fragment separate?
Because the web isn't made up of well-defined documents as once described; rather, a host serves a completely arbitrary HTML file in response to a given request. And we just hope that the PATH here corresponds to something, even though we have absolutely no guarantee we won't be served something completely different the next time we ask.
These URLs are hierarchical to a degree and then cease to be: I cannot ask for just a fragment of a document, I have to load the entire thing and then search it for an identifier I can hopefully scroll to.
All this also means that cloning a website is very difficult and fully relies on the server to behave in a way that implies it's just serving files under the hood.
It also completely throws away any possibility of extending the files with extra metadata, for example following "versioned files" and asking for the file as it was on a given date.
Wait, what even is //?
Citing RFC 1738:
The scheme specific data start with a double slash "//" to indicate that it complies with the common Internet scheme syntax.
Interestingly enough, Firefox seems to be able to open a link in the form http:blog.michal-atlas.cz, but cURL rejects it.
Some protocols do allow leaving it out: mailto:, if memory serves, doesn't require it, and tel: doesn't accept it at all.
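The difference is visible if you let a URL parser loose on both forms; without the slashes, everything after the colon is treated as path rather than host, which is presumably why tools disagree on such links (a small Python sketch):

```python
from urllib.parse import urlsplit

# With "//" the parser sees an authority (host); without it,
# everything after the colon lands in the path component.
with_slashes = urlsplit("http://blog.michal-atlas.cz")
without = urlsplit("http:blog.michal-atlas.cz")
print(with_slashes.netloc)  # blog.michal-atlas.cz
print(without.netloc)       # (empty string)
print(without.path)         # blog.michal-atlas.cz
```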
IPv6 Parsing Confusion
Another, more historical, issue came up back when IPv6 was less popular and you'd commonly get [::1] resolved as if it were a search query or domain name.
To this day Fing cannot ping IPv6 addresses.
The issue is relying purely on guesswork to disambiguate the different options. I mean, 123.456 could in theory be referencing the TLD 456, even though that isn't valid, and you can take a guess why that's the case.
An excellent StackOverflow post goes into detail about what exactly is and isn't valid.
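The guessing is forced precisely because address-ness can only be checked after the fact, never declared up front; a quick Python sketch of the check software ends up doing:

```python
import ipaddress

# "123.456" parses as neither IPv4 nor IPv6, so a resolver is left
# guessing whether it's meant as a (bogus) domain ending in the TLD "456".
def looks_like_ip(text):
    try:
        ipaddress.ip_address(text)
        return True
    except ValueError:
        return False

print(looks_like_ip("127.0.0.1"))  # True
print(looks_like_ip("::1"))        # True
print(looks_like_ip("123.456"))    # False
```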
Multiaddrs
In any case I'd like to mention multiaddrs since they do address this in a way. They are part of the larger multiformats project which tries to design encodings that are future-proof through self-description.
Since we have multiple options we must first disambiguate and then write the given option so the addresses look like this:
/dns/blog.michal-atlas.cz
/ip4/127.0.0.1
/ip6/::1
You can plainly see that there is no ambiguity,
and it is now trivial to add for example /ip9/...
without having to change much else,
and software encountering it will not mistake it for
something that it isn't.
Going one step further, they also allow stuff like /dns6/blog.michal-atlas.cz/udp/9090/quic, specifying the protocol and port for connections, and they include a binary-packed form as well.
This also allows arbitrary encapsulation of protocols, so doing minecraft over ssh over http would just be something like /ip6/::1/tcp/8080/http/tcp/22/ssh/tcp/25565/minecraft. This, though, leads to another thought.
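The format is mechanical enough that a toy parser fits in a few lines. Which protocols carry a value is an assumption on my part here; the real multiformats protocol table is far larger and also drives the binary form:

```python
# Toy multiaddr parser. The set of value-carrying protocols below is an
# illustrative assumption, not the real multiformats protocol table.
VALUED = {"ip4", "ip6", "dns", "dns6", "tcp", "udp"}

def parse(multiaddr):
    parts = multiaddr.strip("/").split("/")
    pairs, i = [], 0
    while i < len(parts):
        proto = parts[i]
        if proto in VALUED:
            pairs.append((proto, parts[i + 1]))  # protocol with its value
            i += 2
        else:
            pairs.append((proto, None))          # bare protocol, e.g. http
            i += 1
    return pairs

print(parse("/ip6/::1/tcp/8080/http"))
# [('ip6', '::1'), ('tcp', '8080'), ('http', None)]
```

No guesswork anywhere: each component announces what the next token means.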
Much of our hierarchy isn't that hierarchical
Given the previous example, we'd imply that the protocol is some sort of "descendant" of the port? That doesn't exactly make much sense. Hopefully the structure means something like this:
[ip]
version = 6
address = "::1"

[[layers]]
transport = "tcp"
application = "http"
port = 8080

[[layers]]
transport = "tcp"
application = "ssh"
port = 22

[[layers]]
transport = "tcp"
application = "minecraft"
port = 25565
What a mouthful, but the transport protocol isn't necessarily any lower or higher than the port or the application. And it is leaps and bounds more explicit and understandable.
It does also lead to one other strange idea.
Identifying machines and services
In the configuration files of some programs you can encounter an address field, and more often than not, that field takes a list of strings rather than a string. For example Syncthing or IPFS, when specifying peers, allow any peer to be listed as having many addresses, and conversely IPFS also allows the node itself to listen on any number of addresses.
These addresses are always paired with a public key as well, since what does routability have to do with who you are? Sometimes the two may coincide, but in many, many cases they don't; as one giant counterexample, take DHCP.
And why does the public key need to identify a single machine anyway? With multiple addresses and a verifiable identity, a service can be provided by multiple machines; the fact that we currently sort of "confuse" clients into load-balancing traffic feels natural but is kind of a hack.
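A sketch of the shape this implies, with field names made up for illustration: identity is a key, reachability is a separate, plural thing:

```python
from dataclasses import dataclass

# Sketch only: field names and the key string are invented, but the shape
# mirrors the configs above; identity and reachability are kept separate.
@dataclass
class Peer:
    public_key: str        # who the peer is
    addresses: list[str]   # any number of ways to reach it

peer = Peer(
    public_key="ed25519:EXAMPLEKEY",
    addresses=["/ip6/::1/tcp/4001", "/dns/blog.michal-atlas.cz/tcp/4001"],
)
print(peer.addresses)
```

Nothing in this shape says the addresses must all point at one physical machine.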
Files as connections
What I loved when reading about and testing out Hurd and parts of Styx were the file translators.
These act as all sorts of file and network processing utilities; the idea is that "everything is a file", including services, processors, abstractions, whatever.
So DNS resolution would be done through writing to /dev/dns and reading back the address; or, if you want to read the file https://blog.michal-atlas.cz/favicon.ico, you'd just open /http/blog.michal-atlas.cz/favicon.ico on your machine and there it would be.
Some could even be combined: given a .tar file that contains a file, you could directly access that file using /tar/file.tar/file.txt without extracting it or having to use a tar library in your language.
I would love to be able to open /tar/(/http/blog.michal-atlas.cz/file.tar)/file.txt
or some equivalent.
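The tar half of that already exists as a library everywhere; the point of a translator is hiding it behind a path lookup. A Python sketch of what such a lookup would do under the hood:

```python
import io
import tarfile

# Sketch of what a /tar translator does behind a path lookup: find one
# member of an archive and read it without unpacking anything to disk.
# Build a tiny in-memory archive first so the example is self-contained.
archive = io.BytesIO()
with tarfile.open(fileobj=archive, mode="w") as tar:
    data = b"hello from inside the tar"
    info = tarfile.TarInfo("file.txt")
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))

# Now "open /tar/file.tar/file.txt": seek to the member and read just it.
archive.seek(0)
with tarfile.open(fileobj=archive, mode="r") as tar:
    contents = tar.extractfile("file.txt").read()
print(contents)  # b'hello from inside the tar'
```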
Imagine all that complexity of encryption, parsing TAR files, DNS resolution, hidden away into a trivial opening of a file.
What would be interesting is if accessing a range in the .tar file would only fetch that range through the http interface.
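HTTP can in fact serve byte ranges, so a lazy translator could turn a seek inside the archive into a Range header. A sketch of just the request construction; nothing is fetched here, and the byte offsets are illustrative:

```python
from urllib.request import Request

# Sketch: a lazy /tar-over-/http lookup would translate a seek in the
# archive into an HTTP Range request instead of downloading the whole file.
# Only the request object is built; no network traffic happens here.
req = Request(
    "https://blog.michal-atlas.cz/file.tar",
    headers={"Range": "bytes=512-1023"},  # just the member's blocks, made-up offsets
)
print(req.get_header("Range"))  # bytes=512-1023
```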
Relying on state yields broken or unreproducible results
As we've seen through the recent triumph of NixOS, shipping something that works no matter the current state of a system is incredibly valuable, as the system cannot necessarily be trusted to be in a sane state; or, more likely, the application creator, the user, and the creators of other libraries might never agree on what exactly sane means.
Installed programs or libraries, of certain versions and with certain features or configuration, are just something that must be eliminated for a system to be reliable, especially in a distributed or remotely-accessible setting.
I'd layer another claim on top: calling a remote program should be possible with local files.
What I mean is that given a local file foo.txt, it should be fine to just call ssh hydra foo.txt without having to care that the given file is not on the remote machine.
This should be abstractable.