Parallelizable tests (`ginkgo -r -p .`) were failing on my 20-core
(`-nodes=20`) Mac Studio. We narrowed this down to two causes:
1. The servers sometimes took longer than the hard-coded 3-second delay
to become ready to answer queries.
2. The blocklist was downloaded asynchronously, and sometimes wasn't
ready by the time the queries were run.
To address these, we did the following:
1. Rather than hard-code a 3-second delay, we modified the server to
signal that it's ready to answer queries (by printing "Ready to
answer queries" to the log). We now wait for that string to appear
before we begin testing the server. IMHO, this is a much better
solution than a hard-coded delay.
2. The initial download of the blocklist occurs synchronously, and
subsequent downloads, asynchronously.
Drive-bys:
- If the server can't bind to even one address, it exits.
- Refactored the blocklist code; the nested if-then-else statements were
  too deep.
Fixes:
```
Expected
<string>: 43.134.66.67
to match regular expression
<string>: \A52.0.56.137\n\z
In [It] at: /Users/cunnie/workspace/sslip.io/src/sslip.io-dns-server/integration_test.go:421
```
We'd like to parallelize the tests to lay the foundation for the
upcoming expansion of flags passed to the executable (e.g.
`-nameservers`). Testing those flags will spawn a series of executables,
each of which takes 3 seconds to spin up; running them sequentially
would make testing tiresome.
- We've migrated away from `serverSession.Err).Should(Say())`
to `serverSession.Err.Contents())).Should(MatchRegexp())`. `Say()`
depends on ordering, `MatchRegexp()` doesn't.
- We introduce a short, 50-millisecond `Sleep()` in `isPortFree()` to
eliminate a race condition introduced by parallelization where the
same port is returned twice.
- Some of our `DescribeTable` tests were order-dependent; we moved them
outside the table.
- We parallelize our pipeline's unit tests.
- For the `k-v.io` tests, we used different keys for each `It()` block
to avoid pollution. We are also more careful about waiting for the
setup to complete before running the actual test.
As a side-effect of parallelizing the tests, we no longer require `sudo`
on Linux to run the tests, for we no longer attempt to bind to port 53;
instead, we bind to a series of available unprivileged ports.
Previously our integration tests bound to port 53, and, if that failed,
fell back to binding to port 3553.
This commit introduces code to scan for an open port and uses that,
which lays the foundation for potentially parallelizing the integration
tests.
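
A rough sketch of what scanning for an open port might look like. The name `isPortFree()` appears in these commits, but the body below is an assumption, as are the `findFreePort` helper, the loopback address, and the choice to probe both UDP and TCP.

```go
package main

import (
	"fmt"
	"net"
)

// isPortFree reports whether we can bind both UDP and TCP on the loopback
// address at the given port. This body is an assumed implementation.
func isPortFree(port int) bool {
	addr := fmt.Sprintf("127.0.0.1:%d", port)
	udp, err := net.ListenPacket("udp", addr)
	if err != nil {
		return false
	}
	defer udp.Close()
	tcp, err := net.Listen("tcp", addr)
	if err != nil {
		return false
	}
	defer tcp.Close()
	return true
}

// findFreePort (hypothetical helper) returns the first free port in
// [start, end]. The port can still be taken by another process between this
// check and the moment the server binds to it.
func findFreePort(start, end int) (int, error) {
	for port := start; port <= end; port++ {
		if isPortFree(port) {
			return port, nil
		}
	}
	return 0, fmt.Errorf("no free port between %d and %d", start, end)
}

func main() {
	port, err := findFreePort(1024, 65535)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("found an unprivileged port:", port)
}
```

Because the probe releases the port before the server binds it, two callers running at the same time can be handed the same port; any check-then-use scheme like this has that window.
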
The massive 80+ line `Customizations` variable is a hard-coded
monstrosity, and I've fallen out of love with it.
I'd like the customizations to be passed in from the caller, in this
case, `main.go`.
To that end, I've created a `default.json`, which should contain all the
customizations with the exception of the key-value functionality, which
I don't have a good way to deal with just yet.
`[0-9]` → `\d`, `[0-9a-f]` → `[[:xdigit:]]`
A follow-on to the previous commit, which did the same for Golang.
Ruby supports the above matchers like Golang does:
<https://ruby-doc.org/core-3.1.2/Regexp.html>
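
For the Golang side, here is a small sketch comparing the old and new character classes. The anchored patterns and sample strings are mine, for illustration only. One thing worth noticing: `[[:xdigit:]]` also accepts uppercase `A-F`, so it is slightly broader than `[0-9a-f]`.

```go
package main

import (
	"fmt"
	"regexp"
)

// Illustrative patterns only; they are not copied from the codebase.
var (
	oldDigits = regexp.MustCompile(`\A[0-9]+\z`)
	newDigits = regexp.MustCompile(`\A\d+\z`)
	oldHex    = regexp.MustCompile(`\A[0-9a-f]+\z`)
	newHex    = regexp.MustCompile(`\A[[:xdigit:]]+\z`)
)

func main() {
	for _, s := range []string{"137", "bba2", "BBA2", "xyz"} {
		fmt.Printf("%-5s digits old=%-5t new=%-5t  hex old=%-5t new=%-5t\n",
			s,
			oldDigits.MatchString(s), newDigits.MatchString(s),
			oldHex.MatchString(s), newHex.MatchString(s))
	}
}
```

In Go's `regexp` package `\d` matches only ASCII digits, so the digit rewrite is an exact equivalent; the hex rewrite differs only in also accepting uppercase.
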
Some of them are simple, e.g. `[0-9]` → `\d`, `[0-9a-f]` →
`[[:xdigit:]]`
Others I deliberately chose to ignore, such as `defer x.Close()` not
handling the error.
There are dogmatic users on the internet, such as [Joe
Shaw](https://www.joeshaw.org/dont-defer-close-on-writable-files/) in his
screed, who insist that all errors should be handled, and provide
contorted & unnatural solutions that detract from the readability of the
program. I think they're wrong, at least for my purposes: I don't care
if the `Close()` errors.
The TXT response to the query `metrics.status.sslip.io` was doomed to
exceed the UDP 512-byte limit, which would have forced the client to
re-attempt via TCP, and our server doesn't yet bind to TCP.
This commit fixes that by squeezing the packet. We haven't dropped any
information, but we made it more succinct.
Per [Infoblox](https://www.infoblox.com/dns-security-resource-center/dns-security-faq/is-dns-tcp-or-udp-port-53/):
> when the message size exceeds 512 bytes, it will trigger the ‘TC’ bit
(Truncation) in DNS to be set, informing the client that the message
length has exceeded the allowed size. In these situations, the client
needs to re-transmit over TCP
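
To get a feel for the 512-byte budget, here is a back-of-the-envelope estimate of a TXT response's size. It is a sketch under several assumptions: each TXT string is sent as its own resource record, each is at most 255 bytes, answer names are compressed to 2-byte pointers, and there are no authority or additional records. The function names and sample strings are made up.

```go
package main

import (
	"fmt"
	"strings"
)

// maxUDPPayload is the classic DNS-over-UDP message size limit.
const maxUDPPayload = 512

// encodedNameLen returns the wire length of an uncompressed domain name:
// one length byte per label, the label bytes, and a final root byte.
func encodedNameLen(name string) int {
	name = strings.TrimSuffix(name, ".")
	if name == "" {
		return 1 // just the root label
	}
	n := 1 // terminating root label
	for _, label := range strings.Split(name, ".") {
		n += 1 + len(label)
	}
	return n
}

// estimateTXTResponseSize (hypothetical helper) approximates the size of a
// response carrying one TXT record per string in txts.
func estimateTXTResponseSize(qname string, txts []string) int {
	size := 12                        // fixed DNS header
	size += encodedNameLen(qname) + 4 // question: name + QTYPE + QCLASS
	for _, t := range txts {
		// 2-byte name pointer, 10 bytes of TYPE/CLASS/TTL/RDLENGTH,
		// then a 1-byte length prefix and the text itself.
		size += 2 + 10 + 1 + len(t)
	}
	return size
}

func main() {
	// Made-up strings, not the real metrics output.
	txts := []string{"Uptime: 100", "Queries: 1000 (10.0/s)"}
	size := estimateTXTResponseSize("metrics.status.sslip.io.", txts)
	fmt.Printf("estimated %d bytes, fits in UDP: %t\n", size, size <= maxUDPPayload)
}
```

Every record costs 13 bytes of overhead before any text, so trimming the number of records helps as much as shortening the strings.
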
We implement PTR records for IPv6, for example:
2.a.b.b.4.0.2.9.a.e.e.6.e.c.4.1.0.f.9.6.0.0.1.0.6.4.6.0.1.0.6.2.ip6.arpa →
2601-646-100-69f0-14ce-6eea-9204-bba2.sslip.io.
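
The mapping above can be sketched like this. The `ip6ArpaToSslip` function is a hypothetical illustration, not the server's actual code: it reverses the 32 nibbles, regroups them into an IPv6 address, and swaps the colons for dashes.

```go
package main

import (
	"fmt"
	"net/netip"
	"strings"
)

// ip6ArpaToSslip (hypothetical helper) converts an ip6.arpa PTR name into
// the matching sslip.io hostname.
func ip6ArpaToSslip(name string) (string, error) {
	name = strings.TrimSuffix(strings.ToLower(name), ".")
	const suffix = ".ip6.arpa"
	if !strings.HasSuffix(name, suffix) {
		return "", fmt.Errorf("%q is not an ip6.arpa name", name)
	}
	nibbles := strings.Split(strings.TrimSuffix(name, suffix), ".")
	if len(nibbles) != 32 {
		return "", fmt.Errorf("expected 32 nibbles, got %d", len(nibbles))
	}
	// The nibbles are least-significant first, so walk them backwards and
	// insert a colon after every fourth one.
	var sb strings.Builder
	for i := len(nibbles) - 1; i >= 0; i-- {
		if len(nibbles[i]) != 1 {
			return "", fmt.Errorf("bad nibble %q", nibbles[i])
		}
		sb.WriteString(nibbles[i])
		written := len(nibbles) - i
		if written%4 == 0 && i > 0 {
			sb.WriteByte(':')
		}
	}
	// Parsing validates the hex digits and drops leading zeros per group.
	addr, err := netip.ParseAddr(sb.String())
	if err != nil {
		return "", err
	}
	return strings.ReplaceAll(addr.String(), ":", "-") + ".sslip.io.", nil
}

func main() {
	ptr := "2.a.b.b.4.0.2.9.a.e.e.6.e.c.4.1.0.f.9.6.0.0.1.0.6.4.6.0.1.0.6.2.ip6.arpa"
	host, err := ip6ArpaToSslip(ptr)
	fmt.Println(host, err)
}
```
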
We implement PTR records for IPv4.
When a PTR record is not found (e.g. "127.in-addr.arpa"), the server
returns the SOA record, but, unlike other record lookups (e.g. "MX"), the
SOA's mname is locked to "sslip.io" because setting the mname to
"127.in-addr.arpa" doesn't make sense.
To be done:
- Implement IPv6
- Implement Metrics
- Update README
- Deploy new version
Note: the two biggest users are Cypriot IP addresses:
```
2 106.52.50.235 <- Tencent
1 223.71.46.114 <- China Mobile
157 31.153.14.207 <- Cypriot
110 62.228.164.123 <- Cypriot
4 73.189.219.4 <- My home IP
```
Prohibit setting DNS-01 challenge TXT record `_acme-challenge.k-v.io`
Although it may appear the TXT record can be set or deleted, it's
hardcoded to the string, "Please don't try to procure a k-v.io cert via
DNS-01 challenge". Setting a custom value was easier than writing a
special code path.
Special thanks to [Alan Liang](http://symb.olic.link/):
> ... one could easily add (and modify) a TXT record at
_acme-challenge.k-v.io, which I believe is used for verifying domain
ownership at various cert providers, so anyone could in theory obtain
valid SSL certs for k-v.io and *.k-v.io
I've chosen to add the website to GKE, not Hetzner, because I get fewer
strident abuse messages from GKE.
I'm dismayed that when I make a small change to the DNS, I need to go
through the laborious release process for it to take effect. Sigh. Maybe
that's something I'll fix another day.
We now have a Dockerfile to serve the upcoming https://k-v.io.
The Dockerfile is patterned after the sslip.io nginx Dockerfile.
Note: the content isn't ready; the HTML needs fleshing out.
Also includes a gratuitous change to the HTML in order to trigger a
build.
Fixes <https://ci.nono.io/teams/main/pipelines/dockerfiles/jobs/build-and-push-sslip.io-nginx/builds/33>:
```
error: failed to solve: rpc error: code = Unknown desc = executor failed running [/bin/sh -c dnf install -y bind-utils iproute less lsof neovim net-tools nginx nmap-ncat procps-ng RUN mv /usr/share/nginx/html /usr/share/nginx/html-orig]: exit code: 1
```
We make sure that each of the three nameservers
(ns-{aws,azure,gce}.sslip.io) can set a key-value, that the value
propagates to the remaining nameservers, that a nameserver can delete a
key, and that the deletion propagates to the remaining nameservers.
ns-gce is unable to join the cluster because its source IP address is
that of the node on which it's running, 34.72.45.206, and that's not
included in the SANs.
This commit updates the etcd certificate to one which includes the three
GKE nodes' IP addresses in its SANs.
This commit also includes instructions to update the certificates in the
event of an IP address change.
Fixes:
```
Apr 16 14:15:34 ns-aws etcd[500]: rejected connection from "34.72.45.206:43080" (error "tls: \"34.72.45.206\" does not match any of DNSNames [\"ns-aws.sslip.io\" \"ns-azure.sslip.io\" \"ns-gce.sslip.io\" \"ns-aws\" \"ns-azure\" \"ns-gce\"] (lookup ns-gce: Temporary failure in name resolution)", ServerName "ns-aws.sslip.io", IPAddresses ["127.0.0.1" "52.0.56.137" "52.187.42.158" "104.155.144.4" "::1" "2600:1f18:aaf:6900::a"], DNSNames ["ns-aws.sslip.io" "ns-azure.sslip.io" "ns-gce.sslip.io" "ns-aws" "ns-azure" "ns-gce"])
```
The original behavior was to return the deleted record, which
inadvertently prolonged the lifetime (in DNS cache) of the record which
was meant to expire as soon as possible.
- Removed the instructions to create a BOSH release. We are no longer
creating a BOSH release because we needed to colocate an etcd release
alongside the BOSH release, and we couldn't find an etcd BOSH release.
- Updated the instructions to run a quick test against the sslip.io DNS
server locally (sanity check) instead of deploying a VM with the BOSH
release & testing against that.
- Updated the instructions for updating ns-azure's DNS server. ns-azure
is no longer a BOSH-deployed VM.
When we check the production servers, we now expect, when we delete a
key, to NOT receive the key's old value as a response, lest we
inadvertently extend the lifetime of the key that we want to expire.
We don't return the deleted value because doing that would have the
unintended consequence of postponing the deletion: downstream caching
servers would cache the deleted value for up to three more minutes. We'd
rather have the key deleted sooner rather than later.
Some APIs, e.g. etcd's, return a list of deleted values on return: those
APIs can afford to do so because they don't need to worry about DNS
propagation.
We also lengthen the timeout of an `etcd` API call from 500 msec to 1928
msec. 500 msec was cutting it too close: some calls routinely took 480
msec to complete, and we wanted more headroom.
We also no longer do two `etcd` operations when we delete a value.
Previously we would do a GET followed by a DELETE, but since we're not
returning the value deleted, there's no point to the GET. Furthermore,
the GET was never necessary, for the `etcd` DELETE API call returned the
values deleted.
Drive-by:
- README: install ginkgo the proper way, with `go install`
[fixes #17]
Now that we no longer create BOSH releases, we don't need to bury the
`src/` directory under `bosh-release`; we can now place it under the
repo root, and we no longer need to fiddle with symbolic links.
We're not creating BOSH releases because, when we decided to implement a
key-value store, we would have had to create an `etcd` BOSH release, and
we didn't want to invest the time.