Sorry for the loose notes rather than a good writeup, but I wanted to get things collated so I can work with them. Good conference. I ate at Al's for breakfast and Hong Kong Noodles for lunch. And I realized I could take the green line over to the U of MN for lunch anytime I'm downtown at work, which didn't occur to me these last six months. If I had my bicycle, it's even a short ride that way across the campus bridge. I need to get my urban on. There was not as much practical knowledge at this one (for me) as at some of the past events, but that means I can focus on the few things I think have practical value rather than being all over the place.
Observability and the Glorious Future - Charity Majors (Honeycomb.io)
- O'Reilly Database Reliability Engineering (November 2017: http://shop.oreilly.com/product/0636920039761.do)
- How often do you deploy? How long does it take, how often do you fail, and what is your recovery time? The basics.
- Hires for communication skills (the initial tech interview is to get them talking at the in-person interview). "Empowered to do their jobs".
- "How do I know if it breaks?" - all changes, all features
- "Serverless was a harbinger. Deployless is coming."
- Developers (senior+) should amplify the hidden costs.
- Team happiness = customer happiness (Steve says this too)
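A rough sketch (mine, not from the talk) of "the basics" above: deploy frequency, failure rate, and recovery time computed from a list of deploys. The `Deploy` record and `deploy_stats` function are hypothetical names I made up for illustration; real numbers would come from whatever your CI/CD and incident tooling records.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Deploy:
    """Hypothetical record of one deploy (assumed shape, not a real tool's schema)."""
    at: datetime
    failed: bool = False
    recovered_at: Optional[datetime] = None


def deploy_stats(deploys: List[Deploy], window_days: int) -> dict:
    """Compute deploys per day, failure rate, and mean recovery time."""
    total = len(deploys)
    failures = [d for d in deploys if d.failed]
    # Only failures that have a recorded recovery time count toward the mean.
    recoveries = [d.recovered_at - d.at for d in failures if d.recovered_at]
    mean_recovery = (
        sum(recoveries, timedelta()) / len(recoveries) if recoveries else None
    )
    return {
        "deploys_per_day": total / window_days,
        "failure_rate": len(failures) / total if total else 0.0,
        "mean_recovery": mean_recovery,
    }
```

Feed it a week or a month of deploys and watch the trend rather than any single number.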
Observability in Big Analytics - Bonnie Holub, Teradata
50 Years of Observability - Mary Poppendieck
- What is the equivalent of metal fatigue in software? Operator fatigue. >> e.g. what Steve pushes: that a focus on PIs is important.
- Talked planes, bridges, Three Mile Island
- She likes the Control series by Brian out on YouTube. They're deep: https://www.youtube.com/channel/UCq0imsn84ShAe9PBOFnoIrg
- Observable - all critical states known from system outputs
- Observable is at war with complexity.
- Controllable - actuator and sensor can get the system back to a set state in a set time.
- If it's not observable, can it be totally controlled? (no)
- Fault Tolerance: replication and isolation.
- Responsibility (and understanding the big picture) leads to desire for observability (and isolation/duplication). >> PLEX team at VP is a form of big picture.
What's Happening in Your Production Data and ML Systems - Don Sawyer, PhData
- Most practical of the lectures.
- Focus on decoupled systems: Data warehouse, ML Models.
- Talked Provenance as both origin and change over time.
- Timestamp everything in UTC (the Google Time API was given as an example of a way to convert it during compute).
- Focus on: audit trails, data quality, repeatability, added info (pipeline).
- Metadata payload. PROCESS: id/version, start/end, transformations, inputs, configurations. DATA VERSIONS: traces of issues, data change history, defect data. LINEAGE: sources, frequency of read.
- Last point was a little messy (from me) but you want to trace right down to the node data touched in transit so you can hydrate anything from the last known good state.
- NOT ALL DATA RECORDS require granular provenance. Can be expensive (so much data). Use a flexible or generic schema. Don't use S3 (slow). Storage considerations.
- Storage: 1.) attach info to the record (can get big, note that Avro and Parquet are meant to do this), 2.) send a separate event message - separate provenance API, 3.) only track some. Note that for API approaches you may end up going down a rabbit hole of tracking the tracking API.
- Alternatives: Amundsen (Lyft), Marquez (WeWork), Databook (Uber), DataHub (LinkedIn)
- Look at Apache NiFi (there's a Pluralsight class)
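To make the metadata payload and the first two storage options concrete for myself, here's a minimal sketch. This is my own illustration, not code from the talk: the `Provenance` fields, the `_provenance` key, and the `attach` / `as_event` helpers are all made-up names, and a real pipeline would more likely lean on Avro/Parquet metadata or a tool like NiFi.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


def utc_now() -> str:
    """Current time as an ISO 8601 string, always in UTC."""
    return datetime.now(timezone.utc).isoformat()


@dataclass
class Provenance:
    """Hypothetical metadata payload (field names are assumptions)."""
    # PROCESS: which job touched the record, and when
    process_id: str
    process_version: str
    started_at: str
    ended_at: str = ""
    transformations: list = field(default_factory=list)
    inputs: list = field(default_factory=list)
    # LINEAGE: where the data came from
    sources: list = field(default_factory=list)


def attach(record: dict, prov: Provenance) -> dict:
    """Option 1: embed provenance in the record itself (the record grows)."""
    return {**record, "_provenance": asdict(prov)}


def as_event(record_id: str, prov: Provenance) -> str:
    """Option 2: build a separate event message keyed by record id."""
    return json.dumps({"record_id": record_id, "provenance": asdict(prov)})
```

Option 1 keeps everything in one place but bloats each record. Option 2 keeps records small but means you now have a second stream (or API) to keep healthy, which is the rabbit hole mentioned above.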
Evolving Chaos Engineering - Casey Rosenthal, Verica
- Ships, shoes, fruit (apricots), helium mining. He's a very funny guy.
- LOOK FOR A VIDEO to watch with the team:
https://www.youtube.com/watch?v=JfT9UxcEcOE
- Principlesofchaos.org
- Reversibility: blue/green, feature flags, ci/cd, agile to waterfall.
- Moved responsibility away from the people who do the work (hierarchy)
- Myths:
- 1. remove the people causing the accidents.
- 2. document best practices and use runbooks. (most interesting problems are unique)
- 3. defend against prior root causes, aka defense in depth. Root cause analysis: "at best, you are wasting your time." Was our sponsor audience issue an example? The answer was in part to restrict audience size. But the dig highlighted that the system no longer supports system-wide features after growth, the high processing cost of the feature, the inability to test with all users, etc.
- 4. enforce procedures
- 5. avoid risk
- 6. simplify
- 7. add redundancy
- Do NOT eliminate complexity. Navigate it. CI, CD, CV - continuous verification (here's a link to a CV article:
https://thenewstack.io/continuous-verification-the-missing-link-to-fully-automate-your-pipeline/). That's New Relic for us.
- Has two books: Chaos Engineering and Learning Chaos Engineering. First book comes out June 2020.