Also includes a gratuitous change to the HTML in order to trigger a
build.
Fixes <https://ci.nono.io/teams/main/pipelines/dockerfiles/jobs/build-and-push-sslip.io-nginx/builds/33>:
```
error: failed to solve: rpc error: code = Unknown desc = executor failed running [/bin/sh -c dnf install -y bind-utils iproute less lsof neovim net-tools nginx nmap-ncat procps-ng RUN mv /usr/share/nginx/html /usr/share/nginx/html-orig]: exit code: 1
```
We make sure that each of the three nameservers
(ns-{aws,azure,gce}.sslip.io) can set a key-value, that the value
propagates to the remaining nameservers, that a nameserver can delete a
key, and that the deletion propagates to the remaining nameservers.
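Below is a rough sketch, not the repo's actual integration test, of how
such a propagation check could look with the `github.com/miekg/dns`
client; the `put.VALUE.KEY.k-v.io` hostname layout and the key/value
names are assumptions for illustration, and a real test would poll
rather than read once:

```go
package main

import (
	"fmt"

	"github.com/miekg/dns"
)

var nameservers = []string{
	"ns-aws.sslip.io:53",
	"ns-azure.sslip.io:53",
	"ns-gce.sslip.io:53",
}

// txt queries one nameserver for a hostname's TXT records.
func txt(hostname, server string) ([]string, error) {
	m := new(dns.Msg)
	m.SetQuestion(dns.Fqdn(hostname), dns.TypeTXT)
	reply, _, err := new(dns.Client).Exchange(m, server)
	if err != nil {
		return nil, err
	}
	var values []string
	for _, answer := range reply.Answer {
		if t, ok := answer.(*dns.TXT); ok {
			values = append(values, t.Txt...)
		}
	}
	return values, nil
}

func main() {
	// Set a value via the first nameserver...
	if _, err := txt("put.my-value.my-key.k-v.io", nameservers[0]); err != nil {
		panic(err)
	}
	// ...then confirm all three nameservers return it (the real test
	// would poll, since propagation isn't instantaneous).
	for _, ns := range nameservers {
		values, err := txt("my-key.k-v.io", ns)
		fmt.Println(ns, values, err)
	}
}
```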
ns-gce is unable to join the cluster because its source IP address is
that of the node on which it's running, 34.72.45.206, and that address
isn't included in the SANs.
This commit updates the etcd certificate to one which includes the three
GKE nodes' IP addresses in its SANs.
This commit also includes instructions to update the certificates in
the event of an IP address change.
Fixes:
```
Apr 16 14:15:34 ns-aws etcd[500]: rejected connection from "34.72.45.206:43080" (error "tls: \"34.72.45.206\" does not match any of DNSNames [\"ns-aws.sslip.io\" \"ns-azure.sslip.io\" \"ns-gce.sslip.io\" \"ns-aws\" \"ns-azure\" \"ns-gce\"] (lookup ns-gce: Temporary failure in name resolution)", ServerName "ns-aws.sslip.io", IPAddresses ["127.0.0.1" "52.0.56.137" "52.187.42.158" "104.155.144.4" "::1" "2600:1f18:aaf:6900::a"], DNSNames ["ns-aws.sslip.io" "ns-azure.sslip.io" "ns-gce.sslip.io" "ns-aws" "ns-azure" "ns-gce"])
```
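For future reference, here's a hedged sketch (the file path is a
placeholder, and the IP list is just the node address from the log
above) of checking that a regenerated certificate actually carries the
required node IPs in its IP SANs before rolling it out:

```go
package main

import (
	"crypto/x509"
	"encoding/pem"
	"fmt"
	"net"
	"os"
)

func main() {
	pemBytes, err := os.ReadFile("etcd-server.crt") // placeholder path
	if err != nil {
		panic(err)
	}
	block, _ := pem.Decode(pemBytes)
	if block == nil {
		panic("no PEM block found")
	}
	cert, err := x509.ParseCertificate(block.Bytes)
	if err != nil {
		panic(err)
	}
	// The GKE node address from the rejected-connection log above.
	required := []net.IP{net.ParseIP("34.72.45.206")}
	for _, want := range required {
		found := false
		for _, san := range cert.IPAddresses {
			if san.Equal(want) {
				found = true
			}
		}
		fmt.Printf("%s in IP SANs: %v\n", want, found)
	}
}
```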
The original behavior was to return the deleted record, which
inadvertently prolonged the lifetime (in DNS cache) of the record which
was meant to expire as soon as possible.
- Removed the instructions to create a BOSH release. We are no longer
creating a BOSH release because we needed to colocate an etcd release
alongside the BOSH release, and we couldn't find an etcd BOSH release.
- Updated the instructions to run a quick test against the sslip.io DNS
server locally (sanity check) instead of deploying a VM with the BOSH
release & testing against that.
- Updated the instructions for updating ns-azure's DNS server. ns-azure
is no longer a BOSH-deployed VM.
When we check the production servers, we now expect that deleting a
key does NOT return the key's old value in the response, lest we
inadvertently extend the lifetime of the key that we want to expire.
We don't return the deleted value because doing that would have the
unintended consequence of postponing the deletion: downstream caching
servers would cache the deleted value for up to three more minutes. We'd
rather have the key deleted sooner rather than later.
Some APIs, e.g. etcd's, return a list of the deleted values: those
APIs can afford to do so because they don't need to worry about DNS
propagation.
We also lengthen the timeout of an `etcd` API call from 500 msec to
1928 msec; 500 msec was cutting it too close: some calls routinely took
480 msec to complete, and we wanted more headroom.
We also no longer do two `etcd` operations when we delete a value.
Previously we would do a GET followed by a DELETE, but since we're not
returning the value deleted, there's no point to the GET. Furthermore,
the GET was never necessary, for the `etcd` DELETE API call returned the
values deleted.
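A minimal sketch of the resulting deletion path, assuming the etcd
`clientv3` package (the endpoint and key are placeholders): one DELETE
with the 1928-msec timeout, no preceding GET, and no
`clientv3.WithPrevKV()`, so the old value never comes back:

```go
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func deleteKey(etcdCli *clientv3.Client, key string) error {
	// 1928 msec: comfortably above the ~480 msec we've seen calls take.
	ctx, cancel := context.WithTimeout(context.Background(), 1928*time.Millisecond)
	defer cancel()
	// One round-trip: no GET first, and no clientv3.WithPrevKV(), so the
	// deleted value is never returned (and never re-cached downstream).
	resp, err := etcdCli.Delete(ctx, key)
	if err != nil {
		return err
	}
	fmt.Printf("deleted %d key(s)\n", resp.Deleted)
	return nil
}

func main() {
	etcdCli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"}, // placeholder endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer etcdCli.Close()
	if err := deleteKey(etcdCli, "my-key"); err != nil {
		panic(err)
	}
}
```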
Drive-by:
- README: install ginkgo the proper way, with `go install`
[fixes #17]
Now that we're no longer creating BOSH releases, we don't need to bury
the `src/` directory under `bosh-release`; we can now place it under
the repo root, and we no longer need to fiddle with symbolic links.
We're not creating BOSH releases because, when we decided to implement
a key-value store, we would have had to create an `etcd` BOSH release,
and we didn't want to invest the time.
- You can select the port to bind to
- The NS record returned for `_acme-challenge` domains is special
Also, I removed the periods at the ends of bullets to be consistent.
We want to allow users to bind to ports other than 53. A big reason is
that port 53 is a privileged port, and often requires root privileges.
We don't want to force our users to use root privileges in order to run
the tests.
This isn't a problem on macOS, but is on Linux.
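A minimal sketch of the idea (not the server's actual flag handling): a
`-port` flag lets the tests bind to an unprivileged port:

```go
package main

import (
	"flag"
	"fmt"
	"net"
)

func main() {
	port := flag.Int("port", 53, "UDP port to bind to (use >1024 to avoid needing root)")
	flag.Parse()
	// Port 53 requires root on Linux; any high port works for tests.
	conn, err := net.ListenUDP("udp", &net.UDPAddr{Port: *port})
	if err != nil {
		panic(err)
	}
	defer conn.Close()
	fmt.Println("listening on", conn.LocalAddr())
}
```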
Previously we would download the blocklist every hour for every address
we'd bound to, which, on Linux machines, could easily amount to 8
addresses (loopback, IPv4, several IPv6). Linux, if you recall, has a
systemd nameserver bound to 127.0.0.53, forcing us to bind to each
address individually. Downloading multiple identical copies of the
blocklist was inefficient.
With this commit, the DNS server downloads the blocklist only once per
hour, regardless of the number of individual IP addresses it listens on.
But what really excites me about this commit is that I've moved much of
the initialization of the `xip.Xip` struct out of `main()` and into
`xip.NewXip()`. This makes `main()` lean again, and `xip.Xip` has gotten
complex enough that it warrants its own constructor.
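Here's a rough sketch of the shape of that constructor; apart from
`xip.NewXip()` itself, the field and helper names are illustrative, not
the repo's actual ones:

```go
package xip

import (
	"sync"
	"time"
)

type Xip struct {
	mu        sync.RWMutex
	blocklist []string // simplified; the real blocklist also holds CIDRs
}

// NewXip builds the global state and starts a single goroutine that
// refreshes the blocklist hourly, no matter how many addresses the
// server later binds to.
func NewXip(blocklistURL string) *Xip {
	x := &Xip{}
	x.refreshBlocklist(blocklistURL)
	go func() {
		for range time.Tick(time.Hour) {
			x.refreshBlocklist(blocklistURL) // one download per hour, total
		}
	}()
	return x
}

func (x *Xip) refreshBlocklist(url string) {
	// Download from url & parse; elided in this sketch.
	x.mu.Lock()
	defer x.mu.Unlock()
	x.blocklist = []string{} // placeholder for the parsed entries
}
```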
This repo has been forked 36 times, and yet I've done a great disservice
to my would-be developers by not describing how to run/test my code.
This commit addresses that shortcoming by having a _Quick Start_ section
very near the top.
- includes new Ginkgo v2
- includes required `sudo` for Linux
- removed the now-wrong comment about TXT records (there's now a
plethora of TXT records such as `ip.sslip.io`)
- minor formatting tweaks
- updated comments in `blocklist.txt` to include references to CIDRs &
how they're handled
- updated webpage to include description of the upcoming metrics for the
blocklist
There is now a singleton which contains global state (metrics, etcd
client, blocklist, etc.). Singleton is quite the fancy name for a global
variable, which is global by virtue of being passed around by reference.
Prior to this commit, the Xip struct served two masters: global state
(e.g. metrics) and volatile state (querier's source IP address).
This was ugly, but workable.
But with the advent of the blocklist it became untenable. I needed the
Xip struct to be truly global, to download only one copy of the
blocklist, not one copy for each of the network interfaces that
sslip.io-dns-server was listening on. Hence this change.
On the downside, I had to plumb the querier's source IP address through
several layers of function calls.
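Illustratively (these signatures and field names are assumptions, not
the repo's actual ones), the shape is roughly:

```go
package xip

import (
	"net"

	"github.com/miekg/dns"
)

type Xip struct {
	// Global state lives here: blocklist, metrics, etcd client, ...
}

// QueryResponse receives the querier's source address explicitly and
// passes it down, rather than storing it on the (now global) struct.
func (x *Xip) QueryResponse(query *dns.Msg, srcAddr net.Addr) *dns.Msg {
	response := new(dns.Msg)
	response.SetReply(query)
	for _, question := range query.Question {
		x.answerQuestion(question, srcAddr, response)
	}
	return response
}

func (x *Xip) answerQuestion(question dns.Question, srcAddr net.Addr, response *dns.Msg) {
	// srcAddr is needed only for per-query decisions (logging, blocking);
	// everything global stays on x.
	_, _, _ = question, srcAddr, response
}
```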
We conform to the modern one-word usage of "blacklist": in Google
search, "blacklist" appears 45 million times; "black list", 7 million.
Yes, I'm aware that we're using "block", not "black", for the variable
name, but keep in mind that we're using "block" as a drop-in replacement
for "black". And the newer "blocklist" has a puny 1 million appearances
to "blacklist"'s 45.
My initial implementation of blocking phishers was flawed. I thought I
only needed to block by matching strings in a hostname (e.g.
"raiffeisen"), but I was recently served with a second abuse notice
(<https://nf-43-134-66-67.sslip.io/sg>), one which didn't lend itself to
blocking via a substring match. And at that moment I understood why
Roopinder of nip.io blocked by IP address.
The work is not yet complete, but at least I can parse and create an
array of CIDRs to match against.
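A minimal sketch of that parsing and matching, assuming one CIDR per
line with `#` comments (the blocklist format and helper names are
illustrative):

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

// parseCIDRs turns a blocklist-style text blob into a slice of networks,
// skipping blanks, comments, and malformed lines.
func parseCIDRs(text string) []net.IPNet {
	var cidrs []net.IPNet
	for _, line := range strings.Split(text, "\n") {
		line = strings.TrimSpace(line)
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		_, ipNet, err := net.ParseCIDR(line)
		if err != nil {
			continue
		}
		cidrs = append(cidrs, *ipNet)
	}
	return cidrs
}

func blockedByCIDR(ip net.IP, cidrs []net.IPNet) bool {
	for _, cidr := range cidrs {
		if cidr.Contains(ip) {
			return true
		}
	}
	return false
}

func main() {
	cidrs := parseCIDRs("# abuse report\n43.134.66.67/32\n")
	fmt.Println(blockedByCIDR(net.ParseIP("43.134.66.67"), cidrs)) // true
}
```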
Drive-by: I didn't realize Golang had increment ("++") (see [Why are ++
and -- statements and not expressions? And why postfix, not
prefix?](https://go.dev/doc/faq#inc_dec)), so I used the longer "+= 1"
throughout the codebase. Now that I know Golang has them, I use them.
I've refactored the metrics: where I previously used the term
"successful", I now use the term "answered". "Answered" means there was
at least one record in the Answer section of the response to the DNS
query. This is a more precise description.
I re-arranged the metrics integration test. Now it's sorted by type of
record queried (A, AAAA, MX, etc.). It's easier for me to follow.
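A minimal sketch of the "answered" distinction (the metric and field
names here are assumptions, not the repo's actual ones):

```go
package main

import (
	"fmt"

	"github.com/miekg/dns"
)

type Metrics struct {
	Queries         int
	AnsweredQueries int
}

func (m *Metrics) record(response *dns.Msg) {
	m.Queries++
	if len(response.Answer) > 0 {
		m.AnsweredQueries++ // "answered", not merely "successful"
	}
}

func main() {
	var m Metrics
	m.record(new(dns.Msg)) // no Answer records: counted, but not "answered"
	fmt.Println(m.Queries, m.AnsweredQueries)
}
```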
When a queried hostname contains a blocked name, we return the address
of one of our servers (currently ns-aws). For example,
`raiffeisen.94.228.116.140.sslip.io` returns the IP address
`52.0.56.137` (`ns-aws.sslip.io`'s IPv4 address).
Currently we only block one name: "raiffeisen",
<https://en.wikipedia.org/wiki/Raiffeisenbank>.
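A minimal sketch of that behavior (the helper name and blocked-name
list below are illustrative):

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

var nsAwsIPv4 = net.ParseIP("52.0.56.137") // ns-aws.sslip.io

var blockedNames = []string{"raiffeisen"}

// resolve returns ns-aws's address instead of the embedded IP when the
// hostname contains a blocked name.
func resolve(hostname string, embeddedIP net.IP) net.IP {
	for _, blocked := range blockedNames {
		if strings.Contains(strings.ToLower(hostname), blocked) {
			return nsAwsIPv4
		}
	}
	return embeddedIP
}

func main() {
	fmt.Println(resolve("raiffeisen.94.228.116.140.sslip.io",
		net.ParseIP("94.228.116.140"))) // prints 52.0.56.137
}
```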
- We enable the integration tests for the blocklist.
- We don't block private IP addresses; they can't be used in phishing
attacks.
- At the beginning of the integration tests (`ginkgo -r .`), we now
print the DNS server start-up messages. They help me debug.
- We broke some of the code out into its own methods.
`processQuestion()` remains too big, but at least it's now smaller.
- We use `blocklist` rather than `blacklist`. If this modest change
betters the black experience in America, then it was worth it.
TODO:
- Wire up the blocklist so we block the phisher domains
- Migrate the downloading of the blocklist outside the `main()` method
- Uncomment the integration tests
We weren't aggressive enough in making sure our rate-limiting channel
was emptied: we added only ten extra reads on the channel, which worked
perfectly on our Xeon workstation but not on our CI.
Our MetricsBufferSize is 100, our delay is 250ms, and each query on CI
took ~25ms to complete, which meant we needed > 110 reads to exhaust the
channel, and we were on the knife's edge. So we doubled the number of
reads to 200 to make sure we had really, truly exhausted the channel's
buffers.
Fixes <https://ci.nono.io/teams/main/pipelines/sslip.io/jobs/unit/builds/50>:
```
sslip.io-dns-server for more complex assertions a TXT record for an "metrics.status.sslip.io" domain is repeatedly queries [It] rate-limits the queries after some amount requests
/tmp/build/b4e0c68a/sslip.io/bosh-release/src/sslip.io-dns-server/integration_test.go:302
```
`metrics.status.sslip.io` is a vector for a DNS amplification attack; we
mitigate it by latching a 1/4-second throttle onto each query after a
certain number of queries.
That endpoint is a 4x amplifier: a 100-byte request yields a 400-byte reply.
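A rough sketch of the throttling idea, using the buffer size and delay
mentioned above; everything else (names, structure) is illustrative:

```go
package main

import "time"

const (
	metricsBufferSize = 100                    // matches MetricsBufferSize above
	throttleDelay     = 250 * time.Millisecond // the 1/4-second latch
)

var throttle = make(chan struct{}, metricsBufferSize)

// throttleMetricsQuery lets the first metricsBufferSize queries through
// immediately; once the buffered channel is full, every further query
// pays the delay, blunting the amplification.
func throttleMetricsQuery() {
	select {
	case throttle <- struct{}{}:
	default:
		time.Sleep(throttleDelay)
	}
}

func main() {
	for i := 0; i < 105; i++ {
		throttleMetricsQuery() // queries 101-105 are each delayed 250 ms
	}
}
```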
Previously I never checked if `net.ParseIP()` returned `nil` for an IPv4
address—I couldn't imagine my IPv4 regex was incomplete. I was wrong.
Moral of the story: always check for errors, always check for nil.
Oddly, I checked for IPv6 addresses—I guess I wasn't as confident about
the regex used.
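A minimal sketch of the moral (the regex below is a simplified stand-in
for the server's real one): check that `net.ParseIP()` returned a
non-nil IP even when the regex matched:

```go
package main

import (
	"fmt"
	"net"
	"regexp"
	"strings"
)

var ipv4RE = regexp.MustCompile(`\d{1,3}(?:[.-]\d{1,3}){3}`)

// embeddedIPv4 extracts a dotted or dashed IPv4 address from a hostname.
func embeddedIPv4(hostname string) (net.IP, bool) {
	match := ipv4RE.FindString(hostname)
	if match == "" {
		return nil, false
	}
	ip := net.ParseIP(strings.ReplaceAll(match, "-", "."))
	if ip == nil {
		// e.g. "999-999-999-999" matches the regex but isn't a valid address
		return nil, false
	}
	return ip, true
}

func main() {
	fmt.Println(embeddedIPv4("nf-43-134-66-67.sslip.io")) // 43.134.66.67 true
	fmt.Println(embeddedIPv4("999-999-999-999.sslip.io")) // <nil> false
}
```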
Drive-bys:
- updated SOA with today's date
- updated dependencies `go get -u`
[fixes #15]