My initial implementation of blocking phishers was flawed. I thought I
only needed to block by matching strings in a hostname (e.g.
"raiffeisen"), but I was recently served with a second abuse notice
(<https://nf-43-134-66-67.sslip.io/sg>), one which didn't lend itself to
blocking via a substring match. And at that moment I understood why
Roopinder of nip.io blocked by IP address.
The work is not yet complete, but at least I can parse and create an
array of CIDRs to match against.
Drive-by: I didn't realize Golang had increment ("++") (see [Why are ++
and -- statements and not expressions? And why postfix, not
prefix?](https://go.dev/doc/faq#inc_dec)), so I used the longer "+= 1"
throughout the codebase. Now that I know Golang has them, I use them.
I've refactored the metrics: where I previously used the term
"successful", I now use the term "answered". "Answered" means there was
at least one record in the Answer section of the response to the DNS
query. This is a more precise description.
I re-arranged the metrics integration test. Now it's sorted by type of
record queried (A, AAAA, MX, etc.). It's easier for me to follow.
When a hostname is queried with a blocked name, we return the address of
one of our servers (currently ns-aws). For example,
`raiffeisen.94.228.116.140.sslip.io` returns the IP address
`52.0.56.137` (`ns-aws.sslip.io`'s IPv4 address).
Currently we only block one name: "raiffeisen",
<https://en.wikipedia.org/wiki/Raiffeisenbank>.
- We enable the integration tests for the blocklist.
- We don't block private IP addresses; they can't be used in phishing
attacks.
- At the beginning of the integration tests (`ginkgo -r .`), we now
print the DNS server start-up messages. They help me debug.
- We broke out some of the code into their own methods.
`processQuestion()` remains too big, but at least it's now smaller.
- We use `blocklist` rather than `blacklist`. If this modest change
betters the black experience in America, then it was worth it.
TODO:
- Wire up the blocklist so we block the phisher domains
- Migrate the downloading of the blocklist outsid the `main()` method
- Uncomment the integration tests
We weren't aggressive enough to make sure our rate limiting channel was
emptied: we added only ten extra reads on the channel on our CI, which
worked perfectly on our Xeon workstation, but not on our CI.
Our MetricsBufferSize is 100, our delay is 250ms, and each query on CI
took ~25ms to complete, which meant we needed > 110 reads to exhaust the
channel, and we were on the knife's edge. So we doubled the number of
reads to 200 to make sure we had really, truly exhausted the channel's
buffers.
Fixes <https://ci.nono.io/teams/main/pipelines/sslip.io/jobs/unit/builds/50>:
```
sslip.io-dns-server for more complex assertions a TXT record for an "metrics.status.sslip.io" domain is repeatedly queries [It] rate-limits the queries after some amount requests
/tmp/build/b4e0c68a/sslip.io/bosh-release/src/sslip.io-dns-server/integration_test.go:302
```
`metrics.status.sslip.io` is a vector for a DNS amplification attack; we
mitigate it by latching a 1/4 second throttle on each query after a
certain amount of queries.
That endpoint is a 4x amplifier: 100byte request with a 400 byte reply.
Previously I never checked if `net.ParseIP()` returned `nil` for an IPv4
address—I couldn't imagine my IPv4 regex was incomplete. I was wrong.
Moral of the story: always check for errors, always check for nil.
Oddly, I checked for IPv6 addresses—I guess I wasn't as confident about
the regex used.
Drive-bys:
- updated SOA with today's date
- updated dependencies `go get -u`
[fixes#15]
We made a mistake: we blindly invoked a function that was sometimes
`nil`. Specifically, if we had a customized domain (e.g. `ns.sslip.io`)
that didn't have a TXT record (a function), we'd try to invoke it
anyway. Bad move.
Now we ensure the function is there before we try to invoke it.
This is a curious affirmation of installing metrics: if we hadn't seen
that the server had been restarted because uptime was too low, we
wouldn't have caught this bug.
Drive-by: we made the lengths of TXT records of `version.status.sslip.io`
exactly match what we replace them with during the linking phase. We
hope that this fixes the wrong-line-numbers we see in the `panic()`
messages.
[fixes#14]
Also, I moved the "versio" endpoint: `version.sslip.io` →
`version.status.sslip.io`. It seemed to make more sense to corral the
special endpoints under `status`.
Now we check first to see if etcd is running before diving in & testing
against it.
fixes:
```
Unexpected error:
<*fmt.wrapError | 0xc0003bc8e0>: {
msg: "couldn't GET \"my-key\": context deadline exceeded",
err: <context.deadlineExceededError>{},
}
```
I want the key-value changes to propagate faster, so I dropped the TTL.
Now, on average, it'll take on average 90 seconds for a stale value to
clear out, 3 minutes max.
I don't think this will affect traffic much; I suspect that there's not
much negative-caching going on (based on the queries I see), and the
lion's share of queries (I believe over 3/4 of my traffic are things I
don't have a record for). And I average ~5
queries-with-a-record-returned per second on ns-aws, which means that'll
bump to a little more than 8 queries a second.
- The metrics aren't fleshed out. In fact, there's only two so far:
1. uptime
2. number of queries
- Even though the metrics aren't complete, I'm checking it in because
this commit is already much too big.
- I moved the version information to `version.status.sslip.io`;
previously it was at `version.sslip.io`. I didn't want one endpoint
for both metrics & version (worry: DNS amplification), and I wanted a
consistent subdomain to find that information (i.e.
`status.sslip.io`).
- I'm not worried about atomic updates to the metrics; if a metric is
off by one, if I skip a count because two lookups are happening at the
exact same time, I don't care.
- The `Metrics` struct is a pointer within `Xip` because I might have
several copies of `Xip` (if I'm binding to several interfaces
individually), but I must only have one copy of `Metrics`
- I only include the metrics I'm interested in, usually because it took
some work to implement that feature. I don't care about MX records,
but I care about IPv6 lookups, DNS-01 challenges, public IP lookups.
- got rid of a section of unreachable code at the end of
`ProcessQuestion()`; I was tired of Goland flagging it. I had it there
mostly because I was paranoid of falling through a `switch` statement
We use a simple map as a key-value store when there's no `etcd`.
We test both: when there's etcd, and when there isn't.
I'm inordinately pleased at my intuitive understanding of
`DescribeTable()`, allowing me to define a function and using it in the
two places it's tested (with `etcd` and without), which is much better
than having duplicate copies of the same test (requiring twice the
maintenance when updating tests).
TODO: Don't run `etcd` unit tests when there's no `etcd`
We now use an interface for `Xip.Etcd` (`V3client` not
`*v3client.Client`). We've also used
[counterfeiter](https://github.com/maxbrunsfeld/counterfeiter) to create
a fake to use in testing.
Drive-by: we updated dependencies: `go get -u && go mod tidy`
The key-value portion is about to get a lot more complicated, and as a
prelude I've moved the three verbs into their own functions.
I plan to do the following:
- use an interface for Xip.Etcd (not `*v3client.Client)
- use counterfeiter to create fakes for above interface
- write vanilla Golang, non-Ginkgo test for getKv, putKv, deleteKv
- add local datastructure (non-etcd) to hold kv if no etcd
We need to determine whether we have `etcd` or use a local key-value
store instead.
This is for people who want to use this DNS server but don't want to
install `etcd`. Most people probably don't care for the key-value
portion anyway.
TODO: modify xip.go to accommodate lack of etcd. Test coverage, too.
Ginkgo v2.0.0 is hot off the press, released yesterday. Let's upgrade!
- `extensions/table` no longer needs to be separately imported
- `BeforeSuite()` must be outermost
fixes:
```
It looks like you are trying to add a [BeforeSuite] node within a container
```
```
imported and not used: "github.com/onsi/ginkgo/v2/extensions/table"
```
```
Entry redeclared during import "github.com/onsi/ginkgo/extensions/table"
```
Previously `etcd` wasn't running, causing the integration tests to fail
because they require `etcd`.
We now run `etcd`.
In the future I plan to add the ability to not require `etcd`, to use a
local table of key-value pairs, but I don't plan to test that option in
CI. It'll be for the very few users who use the sslip.io code but not
the service.
fixes <https://ci.nono.io/teams/main/pipelines/sslip.io/jobs/unit/builds/23>:
```
{"level":"warn","ts":"2021-12-31T01:34:28.089Z","logger":"etcd-client","caller":"v3@v3.5.1/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0001ef340/localhost:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\""}
2021/12/31 01:34:28 couldn't GET "dmy-key": context deadline exceeded
```
We include an invisible "d" in our keys, but we don't want to leak them
to the user (it'll only serve to confuse), so we fix our error messages
to not display them. This code doesn't have coverage, but we don't feel
it's worth the contortions to cover it.
fixes
<https://ci.nono.io/teams/main/pipelines/sslip.io/jobs/unit/builds/23>
(should have been "my-key" not "dmy-key":
```
2021/12/31 17:36:27 couldn't GET "dmy-key": context deadline exceeded
```
...that we can customize for each of our three DNS servers.
Drive-bys:
- Bumped SOA serial 2021080200 → 2021123100. There's something poetic
about it being the last day of the year
- Deleted the old PowerDNS configuration. It's so stale there's no point
in having it. Or mentioning it in the README.
I didn't want a really long domain for the key-value store; I wanted a
short, easy-to-remember domain. And it cost $400 for ten years.
Many good domains (e.g. keyvalue.store, kv.io)
were taken, and some weren't easily registered (e.g. the Albanian
domain, keyv.al).
Browsing these domains that were never put into use is like strolling
along the Boulevard of Broken Dreams: high hopes dashed against the hard
rocks of reality.
Previously we maintained a local table of key-value pairs
(`TxtKvCustomizations`), but this had two drawbacks:
- no persistence: when the server is restarted, all key-value pairs are
wipe.
- no consistency: the key-value pairs on one server are completely
orthogonal to the key-value pairs on another.
By using `etcd` to store our KV pairs, we fix both those problems.
The addition of etcd was enough to inspire me to make a struct (`Xip`)
to hold the important information (source addr, etcd client). That way I
don't have to plumb that information through the hierarchy of function
calls.
Drive-by: fixed a bug in the random-IPv6-address-generator that would,
once in a great while, generate an IPv4 address.
When we implement the key-value store, we want new values to propagate
in a reasonable amount of time. Based on no scientific evidence
whatsoever, based solely on "gut feel", I came up with three minutes
(180 seconds).
The previous value was one week. I can't imagine anyone in their right
mind waiting a full week for their key-value to propagate.
I was uneasy: functions were returning values and mutating arguments
(specifically `response &Response`)--I was mixing meat with dairy, and
the result wasn't kosher.
Now I only return values, and don't mutate.
According to canonical [Go Code Review
Comments](https://github.com/golang/go/wiki/CodeReviewComments#pass-values):
> Don't pass pointers as function arguments just to save a few bytes. If
a function refers to its argument x only as *x throughout, then the
argument shouldn't be a pointer. Common instances of this include
passing a pointer to a string (*string) or a pointer to an interface
value (*io.Reader). In both cases the value itself is a fixed size and
can be passed directly. This advice does not apply to large structs, or
even small structs that might grow.