I don't need this k8s configuration for sslip.io (DNS, NTP) because I'm
no longer hosting on GKE. The cluster now has an ephemeral IP instead of
a reserved IP; keeping a reserved IP would cost me an extra $360 per
year for a premium-tier load balancer.
The GKE cluster's IP address is now ephemeral because otherwise I'd have
to pay $360 extra per year for a premium-tier load balancer.
I don't want my website to point to an ephemeral address that quickly
becomes stale, so I'm repointing it from what was previously the GKE
cluster's address to the AWS nameserver's address.
My integration tests would randomly fail 5% of the time (based on a sample size
of 59 tests), and the reason was that my algorithm for choosing a random
port was flawed. I was very proud of that algorithm, so accepting that
it was flawed was a bitter pill.
One of the problems was that it needlessly limited the pool of candidate
ports to 1,000. This change expands that pool to 64,511.
I changed to a less-clever-but-more-reliable algorithm, and the result
is a stunning order-of-magnitude increase in reliability: a 0.5% failure
rate (based on 210 tests).
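A minimal sketch of a "pick a random port, then check it" approach,
assuming the 64,511 candidates are ports 1024 through 65534. The names
`randomPort` and `unusedUDPPort` are hypothetical, and this is not
necessarily the algorithm the tests now use:

```go
package main

import (
	"fmt"
	"math/rand"
	"net"
)

const (
	minPort = 1024  // assumption: first unprivileged port
	maxPort = 65535 // exclusive upper bound, leaving 64,511 candidates
)

// randomPort returns a port chosen uniformly from [minPort, maxPort).
func randomPort() int {
	return minPort + rand.Intn(maxPort-minPort)
}

// unusedUDPPort keeps drawing random ports until one can be bound on
// IPv4 loopback, giving up after maxTries attempts.
func unusedUDPPort(maxTries int) (int, error) {
	for i := 0; i < maxTries; i++ {
		port := randomPort()
		addr := &net.UDPAddr{IP: net.IPv4(127, 0, 0, 1), Port: port}
		conn, err := net.ListenUDP("udp", addr)
		if err != nil {
			continue // port is taken or otherwise unusable; draw again
		}
		conn.Close()
		return port, nil
	}
	return 0, fmt.Errorf("no free UDP port found after %d tries", maxTries)
}

func main() {
	port, err := unusedUDPPort(10)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("picked port", port)
}
```

The check is still racy: another process can grab the port between
`Close()` and the server's own bind, so a larger pool lowers the odds of
a collision but doesn't remove them.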
Fixes, when running `ginkgo -r -p -until-it-fails .`:
```
Got stuck at:
...
2023/11/19 21:52:15 I couldn't bind via UDP to any IPs on port 1941, so I'm exiting
Waiting for:
Ready to answer queries
```
I'd assert that the server had exited with a 1 (error condition) when it
couldn't bind via UDP to any addresses; however, I wrote the expectation
wrong, and sometimes the server hadn't exited by the time I made the
assertion, resulting in an exit code of -1 (not yet exited) instead of
1.
Switching to the asynchronous assertion `Eventually()`, and from
`ExitCode()` to `Exit()`, fixes that problem.
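The race can be illustrated without Gomega. In this rough sketch,
`eventually`, `fakeProcess`, and `demo` are hypothetical stand-ins (not
Gomega's implementation): the fake process reports -1 until it has
exited, so an immediate check is unreliable while a polled check is not.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// eventually polls check every interval until it returns true or the
// timeout elapses. It reports whether check ever succeeded.
func eventually(check func() bool, timeout, interval time.Duration) bool {
	deadline := time.Now().Add(timeout)
	for {
		if check() {
			return true
		}
		if time.Now().After(deadline) {
			return false
		}
		time.Sleep(interval)
	}
}

// fakeProcess stands in for a server session: its exit code reads -1
// until the "process" has exited.
type fakeProcess struct{ code int32 }

func startFakeProcess(exitAfter time.Duration, exitCode int32) *fakeProcess {
	p := &fakeProcess{code: -1}
	go func() {
		time.Sleep(exitAfter)
		atomic.StoreInt32(&p.code, exitCode)
	}()
	return p
}

func (p *fakeProcess) exitCode() int32 { return atomic.LoadInt32(&p.code) }

// demo returns the exit code seen immediately after start, and whether
// polling eventually observed an exit code of 1.
func demo() (immediate int32, exited bool) {
	p := startFakeProcess(50*time.Millisecond, 1)
	immediate = p.exitCode() // racy: the process is usually still running here
	exited = eventually(func() bool { return p.exitCode() == 1 },
		time.Second, 10*time.Millisecond)
	return immediate, exited
}

func main() {
	immediate, exited := demo()
	fmt.Println("exit code checked immediately:", immediate)
	fmt.Println("eventually exited with 1:", exited)
}
```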
Fixes, during `ginkgo -r -p`:
```
[FAILED] Expected
<int>: -1
to equal
<int>: 1
In [It] at: /home/cunnie/workspace/sslip.io/src/sslip.io-dns-server/integration_test.go:400 @ 10/02/23 03:58:15.824
```
- If it can't bind to all addresses via TCP, log the ones it could &
couldn't bind to & keep running
- If it can't bind to any address via TCP, keep running (unlike UDP
which must fail)
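The TCP behavior in the two bullets above could look roughly like this.
It's a sketch under assumptions: `bindTCP` is a hypothetical helper, the
address list is made up, and the server's real log wording may differ.

```go
package main

import (
	"log"
	"net"
)

// bindTCP tries to listen on every address and returns the listeners
// that succeeded plus the addresses that failed. Failure is never fatal.
func bindTCP(addrs []string) (bound []net.Listener, failed []string) {
	for _, addr := range addrs {
		l, err := net.Listen("tcp", addr)
		if err != nil {
			failed = append(failed, addr)
			continue
		}
		bound = append(bound, l)
	}
	return bound, failed
}

func main() {
	// Port 0 asks the kernel for any free port; "[::1]" may fail on
	// IPv4-only hosts, which is exactly the partial-failure case.
	bound, failed := bindTCP([]string{"127.0.0.1:0", "[::1]:0"})
	for _, l := range bound {
		log.Printf("bound via TCP to %s", l.Addr())
	}
	for _, addr := range failed {
		log.Printf("couldn't bind via TCP to %s", addr)
	}
	if len(bound) == 0 {
		log.Print("no TCP listeners, but carrying on (only UDP must bind)")
	}
	for _, l := range bound {
		l.Close()
	}
}
```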
The big challenge in writing these tests is that the binding behavior is
different for macOS (Ventura 13.6 (22G120)) than for Linux.
Specifically, to "squat" on an address, macOS must listen on ALL TCP
addresses (INADDR_ANY) plus the specific address. Linux only needs to
listen on the specific address.
I have no idea what the behavior on Windows is.
I also removed listenPort as a top-level variable; it didn't need to be
top level.
I was printing out the throughput (queries/second) in the middle of the
ginkgo tests, and it was unseemly and didn't belong.
I changed the test to make sure that the throughput was > 1,000 queries
per second. No unnecessary output.
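The check boils down to a rate and a threshold. A small sketch, where
`queriesPerSecond` is a hypothetical helper and the numbers are
illustrative rather than measured:

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// queriesPerSecond converts a query count and an elapsed time into a rate.
func queriesPerSecond(queries int, elapsed time.Duration) float64 {
	return float64(queries) / elapsed.Seconds()
}

func main() {
	const minQPS = 1000.0 // threshold from the message above

	qps := queriesPerSecond(5000, 2*time.Second) // illustrative numbers only
	if qps <= minQPS {
		fmt.Printf("too slow: %.0f queries/second\n", qps)
		os.Exit(1)
	}
	// On success, print nothing: the point is to keep test output quiet.
}
```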
I've wanted sslip.io to bind to both UDP & TCP, mostly because TCP is
more secure (at least with regard to DNS cache poisoning).
In general, the process to receive a packet, whether TCP or UDP, is
similar.
- UDP uses `net.UDPConn`, TCP uses `net.TCPListener`
- Once bound, UDP uses `ReadFromUDP()` to get the data; TCP first
requires an `AcceptTCP()` followed by a `Read()`
- Technically you can ask several queries over a single TCP socket, but
I close the connection after the first query.
- A DNS TCP packet has a two-byte length field that has no counterpart
in a DNS UDP packet.
- The TCP integration tests are lacking.
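The two-byte length field is the main wire-level difference. Here is a
minimal sketch of the framing (big-endian length, per RFC 1035 section
4.2.2). The helpers `frameTCP`, `readTCPMessage`, and `roundTrip` are
hypothetical, and `io.ReadFull` is my choice here, not necessarily what
the server does:

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
)

// frameTCP prefixes a DNS message with its length as a two-byte
// big-endian integer, as DNS-over-TCP requires.
func frameTCP(msg []byte) ([]byte, error) {
	if len(msg) > 0xFFFF {
		return nil, fmt.Errorf("message too long for TCP framing: %d bytes", len(msg))
	}
	framed := make([]byte, 2+len(msg))
	binary.BigEndian.PutUint16(framed, uint16(len(msg)))
	copy(framed[2:], msg)
	return framed, nil
}

// readTCPMessage reads one length-prefixed DNS message from r.
func readTCPMessage(r io.Reader) ([]byte, error) {
	var length uint16
	if err := binary.Read(r, binary.BigEndian, &length); err != nil {
		return nil, err
	}
	msg := make([]byte, length)
	if _, err := io.ReadFull(r, msg); err != nil {
		return nil, err
	}
	return msg, nil
}

// roundTrip frames msg and reads it back, as a self-check.
func roundTrip(msg []byte) ([]byte, error) {
	framed, err := frameTCP(msg)
	if err != nil {
		return nil, err
	}
	return readTCPMessage(bytes.NewReader(framed))
}

func main() {
	query := []byte{0xAB, 0xCD, 0x01, 0x00} // stand-in bytes, not a full DNS query
	framed, err := frameTCP(query)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("% x\n", framed) // prints: 00 04 ab cd 01 00
}
```

Over UDP the same message would be sent as-is, with the datagram
boundary marking where it ends.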
The integration test which worked fine on my dual-stack laptop failed on
my IPv4-only Concourse.
Fixes, when running `ginkgo -r -p .` on an IPv4-only machine:
```
sslip.io-dns-server When it can't bind to a port on loopback [BeforeEach] prints an informative message and continues
[BeforeEach] /tmp/build/b4e0c68a/sslip.io/src/sslip.io-dns-server/integration_test.go:399
[It] /tmp/build/b4e0c68a/sslip.io/src/sslip.io-dns-server/integration_test.go:409
[FAILED] Unexpected error:
<*net.OpError | 0xc000310d20>:
listen udp [::1]:1918: socket: address family not supported by protocol
```
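One way a test could guard against this is to probe whether the
loopback address is usable before relying on it. This is a sketch of
that idea with a hypothetical `canListenUDP` helper; it isn't
necessarily how the test was fixed.

```go
package main

import (
	"fmt"
	"net"
)

// canListenUDP reports whether a UDP socket can be opened on addr.
// On an IPv4-only machine, "[::1]:0" is expected to return false.
func canListenUDP(addr string) bool {
	conn, err := net.ListenPacket("udp", addr)
	if err != nil {
		return false
	}
	conn.Close()
	return true
}

func main() {
	// Port 0 lets the kernel choose a free port, so only the address
	// family and address are being probed.
	for _, addr := range []string{"127.0.0.1:0", "[::1]:0"} {
		fmt.Printf("%-12s usable: %v\n", addr, canListenUDP(addr))
	}
}
```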
I wasn't using WaitGroups the way they're supposed to be used. I was
using them because they were "cool" and I wanted to force-fit them.
Specifically, I never called `WaitGroup.Done()`. Instead of using
WaitGroups to keep from exiting, I now dive into a readFrom(), which
never returns.
In preparation for TCP binding, I re-worked the UDP binding process so
that it could be more understandable and more easily replicated.
I don't know that it's more understandable. I may have failed.
I was worried that the DNS server had no headroom left after one
incident where the CI was red and the responses were "choppy".
Rebooting (restarting?) fixed the problem.
- ~19k queries/second, Apple M2
- ~8k queries/second, vSphere Xeon D-1736 2.7GHz
- ~6k queries/second, AWS Graviton T2
- ~5k queries/second, Azure Xeon E5-2673 v4 @ 2.30GHz
The busiest server, ns-aws.nono.io, handles ~132 queries/second.
It seems there's enough headroom for 37x (5000/132) the current traffic
on the slowest server.
If I ever want to make sure the results are IDNA2008-compliant, I'll
know which test to start with.
One of the things that held me back was that I couldn't find a spec for
what constitutes IDNA2008 compliance.
[#30]
This commit introduces fuzz-testing for the PTR lookups' integration
test.
This commit does NOT successfully surface the following error condition.
In that sense, this commit is a failure:
```
/usr/bin/dig @ns.sslip.io -x ::11b7:bf0a:0:0:d410 +short
/usr/bin/dig: '--11b7-bf0a-0-0-d410.sslip.io.' is not a legal IDNA2008 name (string start/ends with forbidden hyphen), use +noidnout
```
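The condition dig objects to can be checked directly. In this sketch
the colon-to-hyphen mapping is inferred from the output above, and
`ipv6ToLabel` and `hasEdgeHyphen` are hypothetical helpers, not
functions from the `xip` package:

```go
package main

import (
	"fmt"
	"strings"
)

// ipv6ToLabel turns an IPv6 address into an sslip.io-style label by
// replacing colons with hyphens, e.g. "::1" becomes "--1".
func ipv6ToLabel(addr string) string {
	return strings.ReplaceAll(addr, ":", "-")
}

// hasEdgeHyphen reports whether a label starts or ends with a hyphen,
// which is what dig's IDNA2008 check complains about.
func hasEdgeHyphen(label string) bool {
	return strings.HasPrefix(label, "-") || strings.HasSuffix(label, "-")
}

func main() {
	for _, addr := range []string{"::11b7:bf0a:0:0:d410", "2001:db8::1", "fe80::"} {
		label := ipv6ToLabel(addr)
		fmt.Printf("%-24s edge hyphen: %v\n", label, hasEdgeHyphen(label))
	}
}
```

Any address that begins or ends with `::` produces such a label, which
suggests a fuzzer would need to generate those forms to surface the
error.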
- moves helper functions for test into a separate package,
`xip/testhelper`.
- uses `dig`'s `-x` flag to make PTR lookup tests more readable, e.g.
`dig -x ::1`
This IDN complaint has at least one related commit
([06f1556](06f1556699)).
[#30]
This allows our Concourse CI to pull the new multi-platform OCI Docker
images instead of pulling very stale, old Docker images.
Fixes, from <https://ci.nono.io/teams/main/pipelines/sslip.io/jobs/unit/builds/97>:
```
Ginkgo detected a version mismatch between the Ginkgo CLI and the version of Ginkgo imported by your packages:
Ginkgo CLI Version:
2.5.0
Mismatched package versions found:
2.8.4 used by sslip.io-dns-server, xip
```
...instead of latest release. This happens, for example, if I didn't fix
the specs before rolling out a new release. I may change this back in
the future.
We are no longer doing key-value-over-DNS.
Fixes <https://ci.nono.io/teams/main/pipelines/sslip.io/jobs/dns-servers/builds/1097>
```
rspec './spec/check-dns_spec.rb[1:17:1]' # sslip.io k-v.io tested on the ns-aws.sslip.io. nameserver sets a value, 1678804743, on the key sslipio-spec.k-v.io
rspec './spec/check-dns_spec.rb[1:17:2]' # sslip.io k-v.io tested on the ns-aws.sslip.io. nameserver gets the newly-set value, 1678804743, from the key, sslipio-spec.k-v.io
rspec './spec/check-dns_spec.rb[1:33:1]' # sslip.io k-v.io tested on the ns-azure.sslip.io. nameserver sets a value, 1678804743, on the key sslipio-spec.k-v.io
rspec './spec/check-dns_spec.rb[1:33:2]' # sslip.io k-v.io tested on the ns-azure.sslip.io. nameserver gets the newly-set value, 1678804743, from the key, sslipio-spec.k-v.io
rspec './spec/check-dns_spec.rb[1:49:1]' # sslip.io k-v.io tested on the ns-gce.sslip.io. nameserver sets a value, 1678804743, on the key sslipio-spec.k-v.io
rspec './spec/check-dns_spec.rb[1:49:2]' # sslip.io k-v.io tested on the ns-gce.sslip.io. nameserver gets the newly-set value, 1678804743, from the key, sslipio-spec.k-v.io
```
Fixes, `fly trigger-job ...`:
```
error: resource not found
```
Fixes, `kubectl logs ...`:
```
flag provided but not defined: -etcdHost
Usage of /usr/sbin/sslip.io-dns-server:
```
I'm disabling the key-value store because no one was using it.
There are other reasons, too:
- The removal of the `etcd` library dropped the executable size by over
half from 17MB to 7MB
- I didn't want users who've deployed it internally to be "surprised" by
unexpected key-value features
- Key-value-over-DNS has a seamy side to it: "data exfiltration". I know
there are legitimate uses for it, but I've come to believe that a
Key-value-over-HTTP solution is preferable, not only because it's more
legitimate but also because it eliminates the DNS caching problem.
From
<https://support.google.com/analytics/answer/10759417>:
> Google Analytics 4 is replacing Universal Analytics. On July 1, 2023
all standard Universal Analytics properties will stop processing new
hits.
I wonder if Google Analytics is worth the trouble.