[talk] Holidaze, AWS, and astounding "clock drift outage"
pvarga at pvrg.net
Mon Dec 18 16:51:23 EST 2017
Once the time is off by years, or even just months, ntpdate may fail due to IP
problems, more obviously with NTP over TCP. Yes, date is the only command to
bring it close enough so ntp can work. The ironic or chrony 15 mins
ntpdate may just time out and never correct.
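A minimal sketch of that decision, in Python. The 128 ms step threshold and 1000 s panic threshold are ntpd's documented defaults; the function name `plan_correction` and the labels it returns are hypothetical, for illustration only.

```python
# Sketch: choose a correction strategy from a measured clock offset (seconds).
STEP_THRESHOLD = 0.128    # ntpd default: smaller offsets are slewed gradually
PANIC_THRESHOLD = 1000.0  # ntpd default: larger offsets make ntpd exit unless started with -g


def plan_correction(offset_seconds: float) -> str:
    """Return a hypothetical label for how to fix an offset of this size."""
    magnitude = abs(offset_seconds)
    if magnitude < STEP_THRESHOLD:
        return "slew"          # let the daemon adjust the clock rate
    if magnitude <= PANIC_THRESHOLD:
        return "step"          # daemon (or ntpdate) can jump the clock
    return "set-manually"      # e.g. use date(1) first, then start NTP


# A clock roughly three years in the past is far beyond the panic threshold.
print(plan_correction(-3 * 365 * 86400))
```

With an offset of 7-9 minutes this returns "step"; only the years-off case falls through to setting the clock by hand.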
On Mon, Dec 18, 2017, at 20:39, Isaac (.ike) Levy wrote:
> Hi All,
> A bit OT from the pit of internet hell, but perhaps of interest to folks
> here: This weekend AWS has been doling out a disruption of service of
> the worst kind, clock skew insanity. And when I say insanity, I mean
> true madness.
> (For those who don't know me, this loathsome cloud infrastructure is
> something I'm paid to use, not tech I think is great or even acceptable
> for many uses, and I'm not engaging any "let's argue the value of the
> cloud" here today.)
> Is anyone else experiencing the clock/drift issue and have interesting
> notes to share?
> Clocks drifting up to 7-9min. Clocks drifting so fast that ntpdate and
> rdate can't even "set the time"*.
> Clocks drifting past ~5min window means that cryptographic network
> operations in our world fail outright, (ssl/tls and http services).
> Driftfile worthless- the drifting appears non-deterministic, we have
> found no apparent pattern on analysis.
> New instances coming up with clocks that are *years* in the past. ntpd
> freaks out when trying to handle that at boot.
> First, we thought the problem was skew, so we put in the ntpdate run
> ahead of ntp start- that settled things for a bit. Then 90min later,
> hosts were drifting past 5min- NTP was reporting offsets of between
> 3k-45k and jitter of between 2k-60k on the *second and subsequent
> Just to keep systems functioning, we've got a cron job running every
> 15min (ironic) to restart ntpd.
> AWS ACKNOWLEDGEMENT:
> AWS is infamous for burying outages in marketing material, so not a lot
> to go on here. Look, all green:
> We have loose ack from AWS, mostly in the form of other customers
> posting to AWS forums from their support tickets, like this:
> Furthermore, AWS support contracts have nasty NDAs precluding customers
> from sharing information from support tickets. Therefore, companies
> like mine cannot get much from support- because we'd be in breach of
> contract for merely telling our own customers about an AWS outage- let
> alone any technical details they'd provide. So, companies like mine
> can't get technical support contracts from AWS. (Of course I can neither
> confirm nor deny if this is the case for my employ).
> No worries though, after living with AWS technical support elsewhere,
> it's abysmal and nearly useless anyhow.
> USERLAND EFFECTS OF THIS INSANITY:
> We don't see things happening which would indicate CPU cycles are being
> affected, just userland notions of time. So, this makes 2 distinct
> problems we see:
> - Applications which rely on time, e.g. "do this at that time" are
> completely hosed. Less noticeable with cron, totally happening with our
> own apps.
> - As mentioned above, cryptographic operations are so compromised they
> outright fail when the clocks drift up over 5 min.
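A minimal sketch of the kind of check that fails here, assuming a protocol that compares a peer-supplied timestamp against the local clock. The 5-minute tolerance is taken from the figure quoted above; the name `within_skew` is hypothetical.

```python
from datetime import datetime, timedelta, timezone

MAX_SKEW = timedelta(minutes=5)  # assumption: the ~5 min window mentioned above


def within_skew(local_now: datetime, peer_timestamp: datetime,
                max_skew: timedelta = MAX_SKEW) -> bool:
    """Return True if the peer's timestamp is within max_skew of the local clock."""
    return abs(local_now - peer_timestamp) <= max_skew


true_time = datetime(2017, 12, 18, 12, 0, 0, tzinfo=timezone.utc)
drifted_clock = true_time + timedelta(minutes=7)  # a host that has drifted 7 min

# A message stamped with the true time is rejected by the drifted host.
print(within_skew(drifted_clock, true_time))
```

Nothing is wrong with the message itself; the drifted host simply sees a timestamp outside its tolerance and refuses it.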
> RANT ON THE PARADE OF THE AMATEUR, (possible root cause, AWS lit up some
> chronyc!)
> Looks like some fool decided they can do better than ntpd, specifically
> for AWS. Named 'chrony' or 'chronyc' on some platforms.
> https://aws.amazon.com/about-aws/whats-new/2017/11/introducing-the-amazon-time-sync-service/
> Some of the mind-blowingly bad decisions in here:
> - deploy/announce an AWS-custom NTP daemon just weeks before Christmas
> shopping crunch! (What could possibly go wrong.)
> - deploy/announce an AWS-custom NTP daemon in the first place, (Ask
> PHK, he makes NTP look easy!)
> - keep using the NTP protocol, but abandon existing software,
> Now here's where it gets even more interesting,
> where we learn:
> - "The Amazon Time Sync Service is available through NTP at the
> 169.254.169.123 IP address for any instance running in a VPC. Your
> instance does not require access to the internet, and you do not have to
> configure your security group rules or your network ACL rules to allow
> access...."
> That's right- beyond userland config massaging, they appear to have
> forced global whitelisting of UDP to that single IP address across your
> hand-built VPC ACLs. (What could go wrong there.)
> I don't think chronyc itself is the problem, but that they are smoking
> crack over there at AWS.
> So, as my team hobbles along today, does anyone else have any anecdotal
> stories to share on this one?
> - comment on the mechanics of cryptographic operations and time?
> - root causes?
> - any peek into actual technical detail, (kernel/hypervisors/drift?)
> - impact to the GDP?