I just read a blog post off Hacker News: Why loading third party scripts async is not good enough. It reminded me of someone I used to work with at Amazon who would regularly find errors in our applications. This was quite a feat at Amazon because we instrument everything. We have regexes constantly parsing logs looking for errors; we have a dozen kinds of monitors collecting host metrics, server metrics, client metrics, business metrics, coffee temperature metrics, etc., all constantly checking “is your CPU load high?”, “do you have enough free memory?”, “how many times did you show pictures of the Twilight case?”, and so on.
This one engineer (on a team of exceptional engineers) was consistently the only one to find errors. It was definitely very healthy for the team, but… engineers secretly hate this because, by definition, finding errors means he’s pointing out faults in your work. Managers less secretly hate this because it means he’s ‘creating’ high priority work that gets addressed ahead of their projects.
So with all these metrics and monitors on a team of high achievers, how did this one person on our team keep finding errors? He looked at the logs.
That was his secret weapon: reading logs! It’s like grade 1 of service maintenance. With all our monitoring, regexes, and features, we thought we were too good to ‘just’ read logs. The rest of the team would release features, write regexes to detect our errors, trace a few requests after launching, and then move on to the next project. I honestly don’t know how much time he spent on it, but every week or two, he’d come in and explain how our programs were messing something up.
- Requests to a dependency fail. We monitor overall failures, and accept a failure rate under 0.1% (just hiccups and connection problems, right?). Turns out our dependency never worked for 0.1% of our customers.
- We have a dependency known to have errors, but retries often succeed, so we retry every request once before raising an error. Our dependency makes a change which we don’t notice, but our retry rate goes from 2% to 50%.
- You have ‘targeting’ params which you consume if available (e.g. the HTTP referrer header). You make a change which loses this data in the course of a request, and now you’re never using it to target.
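The first failure mode is easy to reproduce in miniature. Here’s a sketch (the log records, customer names, and counts are all made up for illustration) of why an aggregate failure rate hides a customer for whom the dependency *never* works, while grouping the same logs by customer exposes it:

```python
from collections import defaultdict

# Hypothetical parsed log records: (customer_id, request_succeeded)
records = [
    ("alice", True), ("alice", True),
    ("bob", True),   ("bob", True),
    ("carol", False), ("carol", False),  # every request fails for carol
]

# The aggregate failure rate is what the monitor watches...
failures = sum(1 for _, ok in records if not ok)
aggregate_rate = failures / len(records)

# ...but grouping by customer shows the failures aren't random hiccups.
by_customer = defaultdict(lambda: [0, 0])  # customer -> [failures, total]
for customer, ok in records:
    by_customer[customer][1] += 1
    if not ok:
        by_customer[customer][0] += 1

always_failing = [c for c, (f, n) in by_customer.items() if f == n and n > 0]
print(f"aggregate failure rate: {aggregate_rate:.0%}")
print("customers for whom it never works:", always_failing)
```

At real scale the aggregate number would be 0.1%, not 33%, which is exactly why it slips under the threshold; the grouping is what reading the logs effectively did by hand.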
There were three morals of this story:
- Drill down into your metrics and understand where they come from (and their deficiencies).
- Your monitoring will never be perfectly reliable; you regularly need to just randomly re-verify that things are working.
- Every time you catch a problem, install the proper monitoring to make sure it never happens again.
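The third moral applied to the retry story might look like this: once the 2% → 50% retry-rate incident is caught, install a check so it can’t silently recur. This is only a sketch; the function names, baseline, and tolerance are assumptions, not any real monitoring system’s API:

```python
def retry_rate(attempts: int, retries: int) -> float:
    """Fraction of requests that needed a retry."""
    return retries / attempts if attempts else 0.0

def retry_rate_alarm(attempts: int, retries: int,
                     baseline: float = 0.02, tolerance: float = 3.0) -> bool:
    """Fire when the retry rate drifts past tolerance x the historical baseline."""
    return retry_rate(attempts, retries) > baseline * tolerance

# Normal week: ~2% of requests retried -> no alarm.
assert not retry_rate_alarm(attempts=10_000, retries=210)

# Dependency quietly changed: half of all requests now retried -> alarm.
assert retry_rate_alarm(attempts=10_000, retries=5_000)
```

The point isn’t the arithmetic; it’s that the retry rate becomes a first-class metric with an alarm, instead of a number nobody looks at until someone reads the logs.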
In my experience, the most likely error is one you’ve seen before.