[talk] Holidaze, AWS, and astounding "clock drift outage"

Mon Dec 18 15:39:47 EST 2017

Hi All,

A bit OT from the pit of internet hell, but perhaps of interest to folks
here: This weekend AWS has been doling out a disruption of service of
the worst kind, clock skew insanity.  And when I say insanity, I mean
true madness.

(For those who don't know me, this loathsome cloud infrastructure is
something I'm paid to use, not tech I think is great or even acceptable
for many uses, and I'm not engaging in any "let's argue the value of the
cloud" here today.)

Is anyone else experiencing the clock drift issue, and if so, do you
have interesting notes to share?

--
BIZARRE:
Clocks drifting up to 7-9min.  Clocks drifting so fast that ntpdate and
rdate can't even "set the time"*.
Clocks drifting past the ~5min window means that cryptographic network
operations in our world fail outright (ssl/tls and http services).
Driftfile worthless- the drifting appears non-deterministic, we have
found no apparent pattern on analysis.
New instances coming up with clocks that are *years* in the past.  ntpd
freaks out when trying to handle that at boot.

First, we thought the problem was skew, so we put in an ntpdate run
ahead of ntp start- that settled things for a bit.  Then 90min later,
hosts were drifting past 5min- NTP was reporting offsets between 3k and
45k and jitter between 2k and 60k on the *second and subsequent
polls*.
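
For anyone who wants to log offsets independently of ntpd while chasing
this, here is a minimal sketch of a single SNTP-style query in Python.
It is not what we run in production; the function names, the timeout,
and the idea of polling from a script are my own assumptions for
illustration.  It sends one client packet, reads the server's receive
and transmit timestamps, and applies the usual NTP offset formula.

```python
import socket
import struct
import time

# Seconds between the NTP epoch (1900-01-01) and the Unix epoch (1970-01-01).
NTP_EPOCH_DELTA = 2208988800


def ntp_to_unix(seconds, fraction):
    """Convert a 64-bit NTP timestamp (two 32-bit halves) to Unix seconds."""
    return seconds - NTP_EPOCH_DELTA + fraction / 2**32


def clock_offset(t1, t2, t3, t4):
    """Usual NTP offset estimate.

    t1 = client send, t2 = server receive, t3 = server transmit,
    t4 = client receive.  Positive means the local clock is behind.
    """
    return ((t2 - t1) + (t3 - t4)) / 2.0


def parse_reply(packet, t1, t4):
    """Pull the receive/transmit timestamps out of a 48-byte reply."""
    words = struct.unpack("!12I", packet[:48])
    t2 = ntp_to_unix(words[8], words[9])     # server receive timestamp
    t3 = ntp_to_unix(words[10], words[11])   # server transmit timestamp
    return clock_offset(t1, t2, t3, t4)


def query_offset(server, timeout=2.0):
    """Send one client-mode packet to `server` and return the offset in seconds."""
    request = b"\x23" + 47 * b"\0"  # LI=0, VN=4, Mode=3 (client)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        t1 = time.time()
        sock.sendto(request, (server, 123))
        packet, _ = sock.recvfrom(512)
        t4 = time.time()
    return parse_reply(packet, t1, t4)
```

Calling query_offset() against whatever server your hosts already use,
once a minute, and writing the result to a log would at least give a
drift curve that does not depend on ntpd's own bookkeeping.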

Just to keep systems functioning, we've got a cron job running every
15min (ironic) to restart ntpd.

--
AWS ACKNOWLEDGEMENT:

AWS is infamous for burying outages in marketing material, so not a lot
to go on here.  Look, all green:
https://status.aws.amazon.com/
We have loose ack from AWS, mostly in the form of other customers
posting to AWS forums from their support tickets, like this:
https://forums.aws.amazon.com/thread.jspa?messageID=819947

Furthermore, AWS support contracts have nasty NDAs precluding customers
from sharing information from support tickets.  Therefore, companies
like mine cannot get much from support- because we'd be in breach of
contract for merely telling our own customers about an AWS outage- let
alone any technical details they'd provide.  So, companies like mine
can't get technical support contracts from AWS. (Of course I can neither
confirm nor deny if this is the case for my employer).
No worries though: having lived with AWS technical support elsewhere, I
can say it's abysmal and nearly useless anyhow.

--
USERLAND EFFECTS OF THIS INSANITY:

We don't see things happening which would indicate CPU cycles are being
affected, just userland notions of time.  So, this makes 2 distinct
problems we see:
- Applications which rely on time, e.g. "do this at that time" are
completely hosed.  Less noticeable with cron, totally happening with our
own apps.
- As mentioned above, cryptographic operations are so compromised they
outright fail when the clocks drift up over 5 min.
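
To make the second point concrete, here is a generic sketch (Python) of
why a skewed clock kills authenticated traffic.  This is NOT our actual
stack and not any specific protocol; the sign()/verify() names and the
message layout are made up for illustration.  The pattern is common,
though: the sender binds its clock reading into a MAC, and the receiver
rejects anything whose timestamp falls outside a tolerance window
(Kerberos, for one, defaults to 5 minutes).  TLS has a related
dependency, since certificate validity periods are checked against the
local clock.

```python
import hashlib
import hmac

MAX_SKEW = 300  # seconds; stands in for the ~5 min window described above


def sign(key, message, timestamp):
    """Sender side: bind the sender's clock reading into the MAC."""
    payload = str(int(timestamp)).encode() + b"." + message
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def verify(key, message, timestamp, signature, now):
    """Receiver side: reject bad MACs and timestamps outside the window."""
    expected = sign(key, message, timestamp)
    if not hmac.compare_digest(expected, signature):
        return False  # tampered message or wrong key
    # Fresh enough, judged by the receiver's own clock.
    return abs(now - timestamp) <= MAX_SKEW
```

With both clocks honest, a message signed at time T verifies fine a
couple of minutes later.  Let either clock wander 7-9 minutes and the
same message, with a perfectly valid MAC, gets rejected.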

--
RANT ON THE PARADE OF THE AMATEUR, (possible root cause, AWS lit up some
chronyc!)

Looks like some fool decided they can do better than ntpd, specifically
for AWS.  Named 'chrony' (daemon 'chronyd', command-line client 'chronyc').
https://aws.amazon.com/about-aws/whats-new/2017/11/introducing-the-amazon-time-sync-service/
Some of the mind-blowingly bad decisions in here:
- deploy/announce an AWS-custom NTP daemon just weeks before the
Christmas shopping crunch!  (What could possibly go wrong.)
- deploy/announce an AWS-custom NTP daemon in the first place,  (Ask
PHK, he makes NTP look easy!)
- keep using the NTP protocol, but abandon existing software, /facepalm

Now here's where it gets even more interesting,
<http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/set-time.html>,
where we learn:
- "The Amazon Time Sync Service is available through NTP at the
169.254.169.123 IP address for any instance running in a VPC. Your
instance does not require access to the internet, and you do not have to
configure your security group rules or your network ACL rules to allow
access...."
That's right- beyond userland config massaging, they appear to have
forced global whitelisting of UDP to that single IP address across your
hand-built VPC ACLs.  (What could go wrong there.)

I don't think chronyc itself is the problem, but that they are smoking
crack over there at AWS.

--
So, as my team hobbles along today, does anyone else have any anecdotes
to share on this one?
- comment on the mechanics of cryptographic operations and time?
- root causes?
- any peek into actual technical detail (kernel/hypervisors/drift)?
- impact to the GDP?

Best,
.ike