In this episode, we talk to Michael Nygard about microservices and other things from his current blog series.
You can send feedback about the show to email@example.com, or leave a comment here on the blog. Thanks for listening!
CRAIG: Hello, and welcome to Episode 106 of The Cognicast, a podcast by Cognitect, Inc. about software and the people who create it. I'm your host, Craig Andera.
Well, as we get started here, I just want to throw out a congratulations to the Zetawar project. We mentioned them last time on the show. It was a Kickstarter for a ClojureScript game. It looks like they hit their funding goal, so they will be building Zetawar and sharing some details about how that's made and using it as a platform to help people learn ClojureScript, so very cool. Congratulations on that.
I want to mention a couple things about EuroClojure. EuroClojure is taking place in Bratislava, Slovakia, October 25th and 26th in 2016. The deadline for talk proposals is coming up real quick here. That's August 5th, so talk proposals need to be in by August 5th, 2016. If you're thinking about giving a talk, and you should, make sure you get your proposal in before that deadline. Hopefully you're hearing this before that.
Another deadline associated with EuroClojure is the opportunity grant applications. The opportunity grants are awarded to people to assist them to attend if they would not otherwise be able to. Those applications are being accepted until July 29th, 2016. You can find out more information about everything related to EuroClojure at the website EuroClojure.org.
All right, another thing I want to mention has to do with the podcast itself. As you are aware, if you've been listening, we have transcripts now, word-for-word transcripts that we post within a few days to a couple weeks of the episode going up, as we're able to. We had been putting them right on the podcast feed itself, but that actually caused us some technical issues that I won't bore you with the details of. And so we've actually created a separate feed for the transcripts. You can find that by going to the podcast home on the Web.
If you just look at Cognitect.com/cognicast, look at the episode. You can scroll down. You can see there's a link to the transcript. Go there, and you'll see the whole transcript feed. And so that's going to let you, if you want to subscribe to just the transcripts or, in addition, subscribe to the transcripts. And so when those go up, you'll get notified in whatever blog reader or whatever you use as an RSS feed that you can sign up.
We think that's pretty cool. Actually, that's going to let people consume the show in whatever way they like best. You can continue to listen to it via your podcast application. Nothing has changed there. Or, if you want, you can point whatever blog reader you use to the Cognicast transcript feed and consume them in the written form - either way. Hope you find that helpful.
As with anything, and as we always say at the end of every show, we do welcome feedback, either via Twitter @Cognitect or you can email us at firstname.lastname@example.org. Let us know if you like that, if you have a problem with it, or really anything about the show. We're always happy to hear from you.
I will stop blabbering at you now and we will move on to Episode 106 of The Cognicast.
[Music: "Thumbs Up (for Rock N' Roll)" by Kill the Noise and Feed Me]
CRAIG: Such confidence in authority. Excellent. All right, well, we'll begin then.
Well, welcome, everybody. Today is Wednesday, June 29th in the year 2016, and this is The Cognicast. I am pleased to once again welcome to the show Michael Nygard. Welcome to the show, Mike.
MICHAEL: Thanks. Glad to be here.
CRAIG: Yeah. It's great to have you back. Definitely one of the people I always enjoy talking to. So before we jump into the reason why we're having you on today, although obviously that's the discussion, let's go ahead with the initial question, which is about art. We always are interested in hearing how people experience or understand art, and so we ask them to tell us a little story or share an idea, an experience, or something centered around the idea of art, whatever that means to them. I wonder what you would like to tell us today.
MICHAEL: Well, I've had to answer this question a few times now, so I keep having to dig farther and farther back. I'm going to go back to my first experience going to an actual concert hall and hearing concert music performed live. It was Richard Wagner's The Ring Cycle, which was fabulous. It was the L.A. Philharmonic when I was in college, and it just opened my eyes to a completely different world of music than I'd been listening to. That feeling has still stayed with me. I think that great music has the ability to transport us, and it touches the human mind and heart in ways that few other things do.
CRAIG: Totally agree. I think we've talked many times before about music on this show, and there's something about it as an art form that's unique. I wonder, though, whether that experience was special to you because it was live, because I've been thinking a bit about live music recently, or if it was just the level of performance and the fact that you were there in person was less of a factor.
MICHAEL: I think the thing about the music being performed live, and particularly in a concert hall, is that it surrounds you to such a degree that it's almost a tactile experience as well as an auditory one.
MICHAEL: That's really hard to achieve in any kind of a home setup. Well, I'll say I've never developed the audio engineering expertise to achieve that in a home setup.
CRAIG: Yeah, I think that's right. I have friends–I'm sure you do too–that are serious audiophiles and have really gone to great lengths to set up that type of environment. But I think there still is something about live music. You're not the first person, by the way, to make that comparison even on this show around it being beyond simply listening.
Anyway, no, but the reason I'm wondering is because I've been using Spotify a lot recently to listen to music, and so I'll go to artists. I'll just say, oh, I want to listen to everything by, say, Iron Maiden. That's what I'm doing today. And so I'll listen to all of it, and I've discovered that, for the most part, but with some exceptions – for the most part I don't like live albums, which surprised me.
I do like live music. I do own some live albums that I generally enjoy. But I've found that, by and large, when I'm listening to CDs and I'm trying to figure out what to add to my library to listen to again later, I almost always skip the live ones.
There's something about live music, and I don't know what it is exactly except maybe that it's got none of the presence. There's a certain element in which the music being recorded live, you've got all the crowd noise that, to some degree, detracts from the music itself when you're not there, if that makes any sense.
MICHAEL: I see what you mean. I do listen to a lot of classical music, and pretty much all classical music is recorded live, other than movie scores where the orchestra is in a studio. Mostly it's just mics at an ordinary concert performance, but they tend to be careful to not get the audience sounds into those.
CRAIG: Right. As opposed to Bruce Dickinson yelling, "I can't hear you!"
MICHAEL: Yeah. Yeah, exactly.
CRAIG: That's all right. Anyway, we should move on. Fascinating conversation, as always, but there's another thing that I want to talk to you about today, and it's driven by the series of blog articles that you have been writing for the Cognitect blog at blog.cognitect.com. The title is The New Normal.
CRAIG: A really fascinating series, and it's interesting. When you and I first started to kick around the idea of having you on the show, it wasn't really kicking around. It was more like: We're going to have you on the show. When should we do it?
MICHAEL: Yeah, you kind of grabbed me by the lapels.
CRAIG: Well, it's good stuff, and I really want to share with our audience. But I wanted to say that I think I had been referring to it when I was talking to you and said, "Oh, you should come on the show and talk about microservices since you're writing a microservices blog series." And you were very polite. You didn't correct me.
But having gone back and looked at the series again recently, I think I was extremely mistaken to say that it's about microservices per se. This is just another good reason to have you on today to correct me and to explain about the series and about some of the really interesting things you had to say in there.
I really probably shouldn't put too many words in your mouth. I should really hand it over to you and say, "Hey, man. This series you're writing, share a little bit." I don't think everybody has read it, but I'm sure some people have. But still, in your words, what is it about? What is this idea or set of ideas you're trying to get across?
MICHAEL: Okay. Well, it's easy to see how you might perceive it as microservices because that's in there. You could also perceive it as being about anti-fragility because that's in there, the economics of software and that's in there. What I'm really trying to do is take a number of different threads of conversation, of change in the industry, and weave them together to synthesize them and show how they support each other and how they should inform what we do in probably the coming decade.
Some of the threads are driven by technology change. New platforms make new architectures feasible that would have been not economically viable in the past. They also come from, I think, an increasing sophistication in our management structures and the processes by which we deliver things. We've talked for a long time about the dissolution of the command and control network, the failure of the hierarchic model, and the emergence of network models in corporations. Well, that network model among the people, it's now possible to match the network model among the people with a network model among the artifacts that the people create.
Kind of weaving those things together and saying, "What does this all mean?" It's taken me a number of posts to sort of lay out the story because this is the kind of thing that people who have been on this journey have just assimilated along the way incrementally. But if you're coming into it from a more traditional IT organization or a more traditional software architecture or structure, it can be really overwhelming. And so people will focus on one aspect and you'll get a mandate from your CTO that says, "We have to rebuild everything as microservices," because they pick up on that salient fact. But the problem is if you do just that in isolation, it will fail. You have to do a bunch of things together, incrementally, one little step at a time.
CRAIG: Yeah. That's actually something that, on rereading the series, I really started to see that theme, I think, which is one that I like a lot, the idea of, well, step back and take a look at the big picture. You have to have a strategy.
You talk a bit about Agile, for instance, and how Agile can be naïvely applied. I've come to understand at least that Tim Ewald is one that says you don't take a vacation by going to the airport and looking at whichever ticket is cheapest, buying that one, and then arriving and getting in the first cab and going wherever it's going to go, right? You have to have a plan, even if you also need to have flexibility. I think, again, maybe I'm putting words in your mouth, but it seems like that plays very much into the various threads that you're tying together. Do you think that's a fair statement?
MICHAEL: I think that might be a bit of a straw man presentation of Agile from Tim. I actually do think that Agile development is a great way to handle things at the team scale. The problem that I lay out is that having team scale Agile development and great velocity there is a necessary element of corporate success, but it's not sufficient. You do need to have a strategy at the larger scale that is implemented by the teams. The teams represent the execution of that strategy, but the strategy doesn't emerge from just the aggregate of what all of your teams are doing.
CRAIG: Yeah, big picture, right?
MICHAEL: Mm-hmm, and there have been some efforts to take the Agile approach and sort of scale it up to the level of the whole organization. I think that's mistaken. First of all, I've never seen one of those succeed. Second, it still loses out on that element of saying somebody somewhere needs to be in the position of deciding that you need a new team or that an old team needs to go away, or that you're making this large scale shift in strategy that requires coordinated action among all the different teams. And I don't see that coming out of any of the large-scale Agile efforts.
CRAIG: Mm-hmm. Yeah, and to be clear, I was paraphrasing Tim. I just don't want to put words in his mouth. I guess I did, but he was talking about a naïve application in a narrow sense.
Anyway, I want to come back to the series. You've sort of been building a story arc, I think it's fair to say, as you've made these posts. I wonder if you could take us through that arc in brief just to kind of outline it for people and then maybe we can drill down on a few areas.
MICHAEL: Sure. One of the elements that I try to address is the question of risk and risk management. At one time our approach to risk management in software was essentially to make sure nothing ever breaks, and that was typified by having a single computer–the mainframe era–having all the programs run on there, and strong verification processes.
As soon as you get two computers interacting, there's the possibility that one of them breaks and the other one doesn't. The one that doesn't break keeps running its software, making assumptions about being able to talk to the first one, and we start having to change our risk model from saying it's all or nothing, either the machine is working or it's not, to a more probabilistic approach. There's some likelihood that all of your machines are running.
With just two machines, maybe we can build them both to be super reliable and have high quality parts and redundancy down to dual backplanes and dual buses. This was the era of the Tandem computer. But, increasingly, the workloads we're handling require horizontal scaling, not vertical scaling.
This has been true for some time. We're not entering into the horizontal scaling era. We're hitting its peak, perhaps.
When we're horizontally scaling, if you run the probability, the odds of having everything working at any given moment in time are actually not great. It's likely that something is malfunctioning or something is being deployed somewhere in your network at all times.
MICHAEL: This can be routine, or it can be exciting.
MICHAEL: You may be in the middle of swapping out a bunch of hardware as part of a routine refresh cycle and then some other piece of equipment fails that tries to redirect all the load to the machines that you're swapping out. You know that kind of thing happens all the time.
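The point about horizontal scaling can be made concrete with a little arithmetic. The sketch below is only an illustration: it assumes every machine fails independently and has the same availability, and the 99.9% figure is a made-up number, not one quoted in the conversation.

```python
def probability_all_up(per_machine_availability: float, machine_count: int) -> float:
    """Chance that every machine is healthy at the same moment.

    Assumes failures are independent and every machine has the same
    availability, which is a simplification for illustration only.
    """
    return per_machine_availability ** machine_count


# Hypothetical fleet sizes, each machine assumed to be up 99.9% of the time.
for machine_count in (1, 10, 100, 1000):
    chance = probability_all_up(0.999, machine_count)
    print(f"{machine_count:>5} machines: {chance:.1%} chance that all are up")
```

Under these assumptions a fleet of a thousand machines has everything working only a bit more than a third of the time, which is why treating "something is down" as the normal state is a reasonable starting point.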
CRAIG: This is the birthday problem, right?
MICHAEL: Yeah. In a way, you can frame it as the birthday problem. Maybe you want to describe that for our listeners.
CRAIG: Sure. Yeah, so the non-intuitive result is that, in a given crowd of people, in a given group of people, the probability that two people have the same birthday somewhere in that group goes up much faster than our intuition would say. For instance, if you take a group of two people, the odds that they have the same birthday is 1 in 365, roughly, right? As that number goes up, three, four, five people, you can ask someone who hasn't been exposed to this problem, "What's the number where that probability goes above 50%?" I can't actually remember off the top of my head. It's somewhere in the low–
MICHAEL: It's like 20 or something.
CRAIG: Yeah, it's like 22. It's like a soccer team, I think is how I heard it described, or a soccer game, but it's a very low number. It's way below what you might think: 180. The reason is that–I won't go through the math–it's something along the lines of what you're really computing is the probability that A and B don't have the same birthday, that A and C don't have the same birthday, and that A and D don't have the same birthday, and then the opposite of that. And so that number, even though you start with a very small probability, you exponentiate it and then subtract that from one, and that number goes towards one very, very quickly.
What you're saying is the same in the sense that the system is broken. If any two pieces of it have a bad day, they have the same birthday, and so that probability very quickly goes rapidly towards 100%, as the number of pieces involved moves above some fairly small number.
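For readers who want to check the birthday arithmetic, here is a small sketch. It assumes 365 equally likely birthdays and ignores leap years; the function names are ours. It multiplies out the chance that everyone's birthday is distinct and subtracts that from one.

```python
def shared_birthday_probability(group_size: int, days: int = 365) -> float:
    """Probability that at least two people in the group share a birthday."""
    p_all_distinct = 1.0
    for already_taken in range(group_size):
        # Each new person must avoid every birthday already taken.
        p_all_distinct *= (days - already_taken) / days
    return 1.0 - p_all_distinct


def smallest_group_over(threshold: float) -> int:
    """Smallest group size whose shared-birthday probability exceeds threshold."""
    group_size = 1
    while shared_birthday_probability(group_size) <= threshold:
        group_size += 1
    return group_size


print(smallest_group_over(0.5))
```

With these assumptions the crossover lands at 23 people, right around the figure recalled above and far below the 180 or so that intuition tends to suggest.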
MICHAEL: Yeah, exactly. We have this state that I call continuous partial failure. The interesting thing is we can respond to continuous partial failure in a couple of different ways. One is to say it's an error condition. It's the abnormal state. We need to work to stop it.
The other way we can approach it is to embrace it and say, actually, we're going to use the fact of continuous partial failure to force us to architect our systems so that they are resilient to that kind of failure. Actually, the more of these continuous partial failures we endure, the stronger our systems get. This is the thread of anti-fragility working its way through.
Actually, if we go down the first path and try to make things – treat continuous partial failure with Java exceptions flowing through the code base and so on, it's a recipe for brittleness. And, when things do break, they're going to break in a much larger way and in a much broader way. In contrast, if you embrace continuous partial failure, it becomes much more like you're decomposing your failure domains into smaller and smaller grains and failures don't cross those boundaries.
When you do that, not only is your total risk reduced, but you can actually do things like deliver software all the time whenever the team feels like it instead of trying to wait for 2:00 in the morning when the traffic is low and you can take an outage. Because deploying software looks like a failure in the service that's being deployed, and so everything is just going to continue functioning happily as you deploy your software. So by embracing this anti-fragile approach and embracing continuous partial failure, we actually get the ability to do deployments much more frequently. That presumably delivers value because you're getting features out faster.
CRAIG: Yeah. Well, I think everybody can agree that painful deployments are worse than easy deployments. I mean everybody has been – maybe not everybody, I guess, but many people have been through the case of, well, we rolled this thing out and it didn't work. It was like, "Oh, my gosh. It was a tire fire, and we had to roll everything back. Even that took a while because we bashed the software into an existing system, and then we couldn't undo it because it turned out that the package management system didn't let us install," blah, blah, blah, blah, blah.
MICHAEL: You're going to have to put a trigger warning on this episode because somebody listening probably just had flashbacks.
CRAIG: Yeah, no kidding.
MICHAEL: And that's just the picture when you're only deploying one system. If you extend your viewpoint out to the organizational scale, if you require outages in order to deploy software, you have to coordinate deployments across teams.
MICHAEL: That's a hard enough problem with just a pair of teams. Think about dozens or hundreds of teams across the enterprise and now you're trying to coordinate release windows through some change management board. It's just an unsolvable problem.
CRAIG: Mm-hmm. Okay, so this is something that I wondered about when I was reading the article. You alluded to kind of one of the approaches, one of the techniques, one of the tools that you can use to get some of this capability: the circuit breaker pattern from your book, which I have used very successfully via Netflix's Hystrix library. I'm sure there are many other implementations. That was the one I happened to use. It worked great.
Maybe this is coming in a future article. The series certainly so far has been focused at a different level, but I wonder if you could help us understand some of the particular techniques that one can employ to get this sort of capability.
MICHAEL: Yeah. I definitely can. Actually, Part 11, which I'm working on this week, will address sort of some of the implementations that you can do at the micro scale to give you the macro scale attributes that I'm looking for. Circuit breakers are great.
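A minimal sketch of the circuit breaker idea, for readers who have not met it. This is not Hystrix's API or the version from the book; the class name, the thresholds, and the injectable clock are all assumptions made for illustration. After a run of failures the breaker "opens" and refuses calls immediately, then lets a single trial call through once a timeout has passed.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is skipped."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock          # injectable so the timeout can be tested
        self.failure_count = 0
        self.opened_at = None       # None means the circuit is closed

    def call(self, operation):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Fail fast instead of waiting on a provider we believe is down.
                raise CircuitOpenError("circuit open; call skipped")
            # Timeout elapsed: fall through and allow one trial call.
        try:
            result = operation()
        except Exception:
            self.failure_count += 1
            tripped = self.failure_count >= self.failure_threshold
            if self.opened_at is not None or tripped:
                self.opened_at = self.clock()   # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure history.
        self.failure_count = 0
        self.opened_at = None
        return result
```

The caller decides what to do with `CircuitOpenError`: show a degraded page, use a cached value, or queue the work for later.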
Another approach that I use is bulkheads. You create multiple instances of services. Sometimes they're serving different populations, so maybe you split things up geographically. Even if one country's service is unavailable, other countries continue to function.
The key with bulkheads is that you have to separate things all the way down to the database. Otherwise you're really not separated at all. You still have the common mode issues.
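A toy sketch of the bulkhead idea, split by country as in the example above. Everything here is hypothetical: the class, the region codes, and the in-memory dictionary standing in for each region's own database. The point it tries to show is that each partition owns its data store, so one partition failing does not take the others with it.

```python
class RegionUnavailable(Exception):
    """Raised when a region's service instance is down."""


class RegionalCustomerService:
    def __init__(self, region):
        self.region = region
        self.healthy = True
        self._customers = {}    # stands in for this region's own database

    def add_customer(self, customer_id, name):
        self._require_healthy()
        self._customers[customer_id] = name

    def get_customer(self, customer_id):
        self._require_healthy()
        return self._customers[customer_id]

    def _require_healthy(self):
        if not self.healthy:
            raise RegionUnavailable(self.region)


# One fully separate instance (service plus data store) per region.
bulkheads = {region: RegionalCustomerService(region) for region in ("us", "de", "jp")}


def lookup(region, customer_id):
    """Route the request to the caller's own partition only."""
    return bulkheads[region].get_customer(customer_id)
```

If the three services shared one database, a failure there would hit every region at once, which is the common mode issue described above.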
MICHAEL: Bulkheads are something that you can do pretty easily in a virtualized environment as well. We have the ability to do automatic migrations at this point, so some of the techniques that Jez Humble talks about in Continuous Delivery work really well for this too. Something like blue-green deployments, you can have a supervisor watching a service and, if the service goes unavailable, spin up a new one in a different region - something along those lines.
Something that Hystrix implements, in addition to the circuit breakers, is the idea of a fallback strategy. If I can't communicate with my service provider, can I use the cached response from the last time I called, or can I use a placeholder, some kind of an estimate to show and then follow up later with the real answer?
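The fallback idea can be sketched in a few lines. This is not how Hystrix is implemented or used; `FallbackClient` and its labels are invented for illustration. It tries the live call, falls back to the last good answer for that key, and finally to a placeholder.

```python
class FallbackClient:
    def __init__(self, fetch, placeholder):
        self._fetch = fetch              # the real call to the service provider
        self._placeholder = placeholder  # shown when nothing better is available
        self._last_good = {}             # last successful response per key

    def get(self, key):
        """Return (value, source), where source says how the value was obtained."""
        try:
            value = self._fetch(key)
        except Exception:
            if key in self._last_good:
                return self._last_good[key], "cached"
            return self._placeholder, "placeholder"
        self._last_good[key] = value
        return value, "live"
```

Returning the source alongside the value lets the caller decide whether to follow up later with the real answer.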
Another great strategy is just to make things asynchronous. We default to thinking about things in terms of synchronous request response loops, but there are a great many business processes that can be made into a series of messages and queues. You even have the option of doing that as a fallback. So you can say, "I'm going to make the synchronous call. But if that fails, I'm going to drop something in a queue. And I'll return a response to the user that says he'll get a follow up email in 24 hours or whatever."
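Here is one way the "synchronous first, queue as a fallback" idea might look. The function names, the response shape, and the in-process `Queue` are assumptions for illustration; a real system would more likely use a durable message queue.

```python
from queue import Queue


def submit_order(order, process_now, pending: Queue):
    """Try the synchronous path; on failure, queue the work and tell the user."""
    try:
        confirmation = process_now(order)
    except Exception:
        pending.put(order)
        return {
            "status": "accepted",
            "message": "You will get a follow-up email within 24 hours.",
        }
    return {"status": "completed", "confirmation": confirmation}


def drain(pending: Queue, process_now):
    """Run later, once the downstream service has recovered."""
    completed = []
    while not pending.empty():
        completed.append(process_now(pending.get()))
    return completed
```

Whether "accepted, we will email you" is an acceptable answer is as much a business decision as a technical one.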
Now a lot of these are both technical decisions and business decisions. They do have the effect of kind of shifting work around between teams, so it definitely requires coordination. But they're all viable ways of saying a failure in my service doesn't need to propagate to the others around me.
CRAIG: Yeah. The point about asynchrony made me think because I certainly have that–I don't know what you want to call it–predilection for viewing things as synchronous even though, if you look at the real world, especially right now, when you think about the impetus towards sending a text message instead of picking up the phone and calling.
A phone call is synchronous, right? You're going to call someone. You're going to wait for them to pick up. They're going to answer. You're going to talk back, whatever.
Whereas text messages are asynchronous, and we seem to have, for reasons that maybe aren't super helpful in this conversation, but still we seem to have a strong preference for that asynchronous mode of communication in that realm, even though you're holding the device that is perfectly capable of communicating with the same person synchronously. Yeah, obviously it's something that we're capable of thinking about as a human level process.
MICHAEL: I think that analogy is actually a pretty rich source because we can think about other aspects of making that phone call. Right now I'm talking to you over what's essentially a phone. It's voice over IP. While I'm talking to you this way, I can't also be having a conversation with someone else. But with text I can multiplex it, right? And so if I have a queue for incoming work, I can multiplex work from many different sources with a single processor, so I have a different way of approaching the workload and maybe I don't need to scale as big for the processor.
The other thing is I can get behind a little bit and catch up, so I don't need to scale my service to match the peak demand or the sum of the peak demands of all of my consumers. I only need to scale it to catch up within a reasonable SLA. I think those are both important issues.
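A small simulation of "get behind a little and catch up". The arrival numbers and the capacity are invented; the sketch just tracks how much work is waiting at the end of each hour when a queue sits in front of a service sized below peak demand.

```python
def simulate_backlog(arrivals_per_hour, capacity_per_hour):
    """Return the queue length at the end of each hour."""
    backlog = 0
    history = []
    for arrivals in arrivals_per_hour:
        # Work that cannot be handled this hour waits in the queue.
        backlog = max(0, backlog + arrivals - capacity_per_hour)
        history.append(backlog)
    return history


# Hypothetical day: a two-hour spike of 300 requests/hour against a
# service that can only handle 150/hour, half the peak.
arrivals = [50, 50, 300, 300, 50, 50, 50, 50]
history = simulate_backlog(arrivals, capacity_per_hour=150)
print(history)  # → [0, 0, 150, 300, 200, 100, 0, 0]
```

In this made-up scenario the queue is empty again three hours after the spike ends, so the sizing question becomes whether that delay fits the SLA rather than whether the service can absorb the peak.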
There's one other aspect of making a phone call, which is placing the call and the ringing on the other end. You can think about it two ways. One, it's a handshake to initiate a conversation just like a TCP three-way handshake. But the other way is to say, "When I call you, I'm asking for permission to talk to you and, if you don't pick up, you've denied that permission."
When we build services, I think too often we build a service with a specific consumer in mind and we don't think about enabling our service to be used by other consumers without explicit permission. Whenever I build a service, I want to make sure that I'm not just serving one population of consumers, but anyone else in the company can write a new consumer of my service and start making calls without necessarily informing me first. Of course, I have to provide enough technical documentation and visibility that they can do that. But when you think about that notion of doing things, enabling people to do things without your permission, it causes you to make some different decisions in your protocol design that greatly reduce the amount of coupling, make the context a lot more explicit, and make your service more generally useful and provide more residual value so your service can be reused in new use cases that you haven't considered when you wrote it.
CRAIG: I want to drill down on this a little bit. When you're talking about without permission, do you mean like without prior agreement that they are going to be a customer of your service or a consumer of your service? You're not talking about security implementation like authentication or authorization.
CRAIG: What exactly do you mean by that?
MICHAEL: I'm not talking about authentication or authorization. I'm talking about team scale agreements. If you think about, say, some customer you've been in that has a database of their customers. To get access to that database you probably had to negotiate with the other team. Sort of once you had the credentials, and once you were able to talk to it, maybe it met your needs. Maybe it didn't. But the technical aspect of getting access was probably much easier than the political aspect or the inter-team communication needed to make sure you were going to be allowed to use their service.
We'll go even a step further. If you needed to create customers in that database, so you actually have an app that's a customer touch point. People can create accounts. You need to be able to create customers in that database. You have even a higher level of negotiation to go through.
When I'm saying without permission, I mean literally that anyone can approach your service and use it valuably without having to talk to anyone on your team. Now, obviously in a company you need to have a trust boundary somewhere. Maybe you define that as your business unit, your division, or something like that. I do have mechanisms for putting gateways and policy enforcement at the perimeter of those units. But, within the unit, they can make the call without having to have a conversation first.
CRAIG: Right. That makes perfect sense to me. I am wondering too. We've been thinking a lot about spec recently, obviously clojure.spec. One of the key things about spec or one of the things that kind of jumps out at you when you look at it is the use of namespaced keywords.
I'm thinking to myself, as I hear you say this, that that might be a fairly important aspect of easily implementing the type of thing you're talking about because it lets you have a unique identifier for a piece of data that you can attach semantics to and not have to go and look up. You could still use un-namespaced keywords and have a repository in your team of what this particular piece of data means exactly. But I feel like there's something about using namespaced keywords that makes that easier, that maybe now you can have some sort of registry where you can keep documentation and it doesn't have to be attached to that particular service, and that that might somehow aid people in navigating the sea of capabilities within an organization.
MICHAEL: I see where you're going, and I agree on the value of having a global namespace. As much as I would like everyone to rewrite things in Clojure and use spec, I think we're going to be in a polyglot world for a long, long time.
MICHAEL: The equivalent of that in a polyglot world is a URL. I am not a fan, for example, of passing around a raw ID, just a number or an alphanumeric string. Say we're talking about a policy number for an insurance company. You hand me an ID. I don't know where it came from. Rather, I have to know. I have to make assumptions about where it came from. That means I can only go talk to one place to say, "I want to exchange this policy number for information about the policy." Maybe we have a bunch of different services, but they all have to share the same policy number.
The problem is your company will do something crazy like acquire another company. Now you've got two universes of policy numbers. Maybe they're even formatted the same way, so you can't tell lexically which one it came from. The problem is that the policy number doesn't give me any context about where it came from or where I can go to exchange it for more info.
If I have a full URL, then I can always resolve that to something. That URL carries an explicit context along with it rather than having the implied context of the naked ID.
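To make the raw-ID problem concrete, here is a sketch with two hypothetical policy systems that both issued the number 12345. The host names and records are invented. With a full URL, the host tells the consumer where to go; the bare number cannot.

```python
from urllib.parse import urlparse

# Two universes of policy numbers, e.g. after an acquisition (invented data).
ORIGINAL_POLICIES = {"12345": {"holder": "A. Original", "product": "auto"}}
ACQUIRED_POLICIES = {"12345": {"holder": "B. Acquired", "product": "home"}}

# Stand-in for making an HTTP request to the host named in the URL.
POLICY_STORES = {
    "policies.original.example": ORIGINAL_POLICIES,
    "policies.acquired.example": ACQUIRED_POLICIES,
}


def resolve_policy(policy_url: str) -> dict:
    """Exchange a policy URL for the policy record it identifies."""
    parts = urlparse(policy_url)
    store = POLICY_STORES[parts.netloc]          # context carried by the URL
    policy_number = parts.path.rstrip("/").rsplit("/", 1)[-1]
    return store[policy_number]


print(resolve_policy("https://policies.original.example/policies/12345"))
print(resolve_policy("https://policies.acquired.example/policies/12345"))
```

The same trailing number resolves to two different policies, and the consumer never has to guess which system issued it.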
CRAIG: Mm-hmm. I guess what I'm wondering is what your thoughts are on using URLs as keys because what you're talking about, if I imagine the data going in and out of some service, is a map, which I think is a reasonable thing to do. You're talking about, I think, using URLs as values: the value of some ID is a URL. But is there also value, or a different value, in using URLs in the same way for keys so that I can say, well, this is ID, but I'm not going to use ID because when you use ID it means this, and when I use ID it means this, and when she uses ID it means this?
CRAIG: But rather, a URL that scopes that identifier to some particular semantic concept, if that makes sense.
MICHAEL: Yeah. I might not do it on the level of every key just because of the massive overhead implied there. But, for example, imagine that every message you pass on the wire includes a URN to someplace where I can go and look at that message format.
MICHAEL: Or it includes a URN that lets me go and resolve it to a spec on what the commitments are about that message format. At that point we're starting to make even the language of the system be explicit and discoverable and usable without permission.
There's an outfit out there called Confluent.io that did a really cool thing. They're building this big data streaming platform on top of Kafka. One of the things that they do is they have a Kafka topic, which is their registry of message definitions. When a message is received, you can take the token that represents the type of message and go look up the definition. In their case they're using Avro, which can actually load definitions dynamically and unpack the message according to that definition, so it's not just a documentation thing. It's actually part of the functionality of the system. But even if you just treat it as documentation, it's useful to have that idea that I can publish my message types for my service, and I can publish it without permission and other people can use it without permission.
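A rough sketch of the registry idea in plain Python. It does not use Avro, Kafka, or any Confluent API; the URN, the field names, and the dictionary-based registry are all invented for illustration. Each message carries a URN naming its format, and a consumer looks that URN up before interpreting the payload.

```python
# Anyone can publish a message definition here and anyone can read it.
SCHEMA_REGISTRY = {}


def publish_schema(urn, fields):
    """Register a message format as a mapping of field name to expected type."""
    SCHEMA_REGISTRY[urn] = fields


def decode(message):
    """Look up the message's declared format and check the payload against it."""
    schema = SCHEMA_REGISTRY[message["schema"]]
    payload = message["payload"]
    for field, expected_type in schema.items():
        if not isinstance(payload.get(field), expected_type):
            raise ValueError(f"field {field!r} missing or not {expected_type.__name__}")
    return payload


# A producing team publishes its format (hypothetical URN and fields)...
publish_schema(
    "urn:example:billing:invoice-created:v1",
    {"invoice_id": str, "amount_cents": int},
)

# ...and a consumer it has never spoken to can interpret the message.
message = {
    "schema": "urn:example:billing:invoice-created:v1",
    "payload": {"invoice_id": "inv-42", "amount_cents": 1999},
}
print(decode(message))
```

Even this documentation-only version makes the language of the system discoverable; a format like Avro goes further by using the looked-up definition to unpack the bytes.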
CRAIG: That idea of shipping the URN along with the message and having it apply to the whole message really reminds me of having an XML default namespace, right?
CRAIG: Where you're saying there's a context that flows down through the parts of this, I guess, document in the case of XML.
CRAIG: It gives them semantics without making me take on the burden of saying it every time I express some fact.
CRAIG: Hmm. Interesting. Interesting. Interesting.
MICHAEL: It's almost like there was some good stuff in XML that we threw out when we went to curly braces instead of angle brackets.
CRAIG: You know, I know. I've said before that I think – I'm going to maybe catch flak for this, but I think JSON was really a step backwards from XML. I know people do not like the syntax of XML, but going to something where only – anyway, namespaces is a big deal, and the fact that you can't really do them in JSON is a liability, and there are other issues as well. But I'm not looking to go back to XML. I already have something better than either one of them for my systems. Yeah.
MICHAEL: Yeah, but that notion of namespaces, of providing some context for interpretation around the semantics of a message, is actually pretty important.
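As a rough illustration of the default-namespace idea, the Python sketch below carries one namespace URL per message and uses it to qualify every short key. The `@ns` key, the `expand_keys` helper, and the example.com URLs are assumptions made for the example, not part of any standard discussed in the episode.

```python
def expand_keys(message, ns_key="@ns"):
    """Qualify every short key with the message's default namespace."""
    namespace = message[ns_key].rstrip("/")
    return {
        f"{namespace}/{key}": value
        for key, value in message.items()
        if key != ns_key
    }


billing = expand_keys({"@ns": "https://example.com/billing", "id": 42})
shipping = expand_keys({"@ns": "https://example.com/shipping", "id": "PKG-9"})
```

Both messages say `id` only once and in short form, yet after expansion the two keys no longer collide, because the context stated once at the top flows down to each key.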
MICHAEL: I view it as important not just for correct functioning of systems, but for that without permission characteristic where I want every team to be able to deploy their systems whenever they want, and I want them to be able to communicate about how they act without permission.
MICHAEL: To me, by the way, this answers one of the common questions about microservices from people coming from traditional environments. They'll look at it and they'll go, "Well, these microservices, that's nothing different than SOA." I actually see the differences as being almost entirely non-technical and about process and permission. With SOA, when you publish your service, you then have to communicate with some other group to make it visible through the ESB and to publish your message formats and to make sure that you meet all the governance requirements. In fact, most of the books on SOA are about SOA governance.
The whole approach to microservices says no. No governance. No process. Everybody publishes their own thing because the risk to our organization of being as slow as the governance makes us go is greater than the risk of occasional system breakage.
CRAIG: Right, which comes back to how do you address system breakage, and the answer is to be good at dealing with it–you mentioned this in your series–either by actually making it happen. I have the amusing experience of explaining this, explaining the chaos monkey, that thing that Netflix has that you mention that runs around and actually knocks over various bits of their infrastructure to some people who hadn't been exposed to it. They looked at me like I had two heads. It's a really bizarre concept for people that haven't run into it before.
MICHAEL: Yeah. This is why it's taken me so many words to articulate this whole thing because it does spring from some different premises. If your assumptions are different and your premises are different then, yeah, the conclusions will look insane. They're also kind of self-supporting, right? This idea of being able to do things without permission requires a certain kind of architecture. That architecture requires you to have this approach to failure. But when you have that approach to failure, you can do some other things. Yeah, they're mutually reinforcing ideas, which is why kind of laying them out in a linear fashion in words is challenging.
CRAIG: Mm-hmm. Well, I think you've done a good job. We were actually in the middle of kind of following through the arc today. I guess we even got a little bit of a peek into the stuff you're working on, which is awesome.
But let's see. Where were we? Please continue. Continue to carry us through the story arc of your series.
MICHAEL: Well, the next piece that we really haven't talked about is a question of sort of durability of the software and whether we should think of it as an asset or a liability. A lot of software development in enterprises is capitalized, meaning it's accounted for as if you were buying new machines and the machines were going to make you more productive and, therefore, increase the value of your enterprise. But it doesn't act very much like an asset. A lot of the software we write gets thrown away and rewritten before the asset depreciation period would be done. When accounting finds out about this, they have to take it as a write-off and all kinds of ugliness happens to mark it as a disposed asset and so on.
The problem with that viewpoint is it causes us to write code and keep it around for a lot longer than we should. I've been to a number of places where they have these very large monolithic code bases, some of them ten years old or more. Yeah, legacy systems in Java, really, they can't make changes any more. It embodies old architectures.
They've glued bits on over time, but the degree of coupling throughout this whole code base is such that they view it as a massive asset. But it's actually stopping them from doing things, and they're pouring tons of money into just keeping it running. It behaves a lot more like a liability than an asset, so I really like to have the idea that we need to be refreshing our code, tearing it apart, throwing away piece-by-piece, and rebuilding it piece-by-piece more or less continuously.
We use a lot of building metaphors for software, and I think that gives the false impression that you reach a point where you're essentially done and you just need somebody to go around and change light bulbs and repaint the walls once in a while. But that's not true. The software is only done when it's deleted, so I'd rather see companies build things in smaller pieces that they can throw away more rapidly and rewrite when needed instead of trying to face down a three-year, multimillion-dollar rewrite of something because, at the bottom, it's still a ten-year-old implementation of the Java data objects beta standard that got customized for this company. They just can't maneuver at all with that kind of an anchor.
CRAIG: I'm interested to come and explore this a little bit because I've had problems with the building metaphor for quite a while, I think, for a variety of reasons. One is, of course, that analogies are imperfect and they always get stretched to cover scenarios that are not helpful. But another is that I think there's a sense in which what we do is fundamentally different from building a building in that I think it's maybe more like building a factory. You're not building a home that is, to some degree, static. Of course homes require maintenance. They do. Anyone that owns a home knows that they're also never done, but you're not generally adding new rooms or taking the roof off or building a second home right next to it.
But I wonder whether it's more like building a factory because, first of all, a program, we write source code to produce a program. That program then runs, and it maybe runs many times over the course of its lifetime. It's the actual execution that's the end result. The program itself is static. It's just a set of instructions for how to do something when it actually becomes active.
And so I wonder whether there's any juice in the metaphor of building a factory. Of course, now that's the sort of thing where, okay, if I want to view it as building a factory, well, of course I need to be able to change it because the factory is producing widgets today, but tomorrow it needs to produce wadgets. If the conveyor belt is bolted to the floor, then I'm not going to be able to run it past the new paint shed or whatever it is that needs to happen. You need to be able to move the pieces around so you can make different things because you're building a thing that actually, when run, produces the desired effect. You're not producing the desired effect directly as you would be with constructing a home, if that makes any sense.
MICHAEL: I definitely see where you're going. You're falling into a trap that I often fall into, which is explaining one domain people don't understand by likening it to another domain they don't understand.
CRAIG: Fair enough.
MICHAEL: But I want to throw a different wrinkle at you.
MICHAEL: Which is, in the building metaphor you can think about the cranes and the scaffolding and the piles of materials that you use to construct the building as a temporary thing. Once the building is done, you take all that extra crap away. But increasingly, I want people to regard the machinery that you use for building your source code, deploying it, moving it out to production as its own machine worthy of service level agreements and production level attention. So the machinery that you use to produce the factory is also something that we need to keep around and maintain.
MICHAEL: Then we start getting into factories for making factories and that becomes even harder to visualize.
CRAIG: Yeah. The factory, factory, factory pattern is well-known, right? Yeah, I know you're talking about something slightly different, but I couldn't help but mention that.
CRAIG: Yeah, I think there's a sense in which what I was trying to say was along the lines of what you expressed. But I like the fact that you've carried it even further and that idea that you need to be as cognizant. You need to have the same standards for that stuff. I think if you kind of look at the way that TDD or whatever you want to call it, say what you will about any particular manifestation of that philosophy, I do think most people would agree that having developers write tests, think about testing is useful.
CRAIG: But in any event, that idea that that stuff should also have quality attributes that you care about, all of the stuff around the code, like we all care that our code works, is efficient, et cetera. But the idea that you also are building, using, maintaining, consuming, producing other tools, to that end that those also have quality attributes that you should care about and invest in to appropriate degrees, I think, is a powerful one.
MICHAEL: Yeah. Some of us have been saying development is production. Meaning, just like the systems we create for our users inside a company, those are the production systems by which our users do their job. The development pipeline and tooling is the production system by which we do our job. And so why should we treat ourselves with reckless abandon if our time is valuable as well?
CRAIG: Right. Yup. Yeah, you wouldn't. To go back to that analogy, if you went down to the place where the cranes are being built that you're going to use to lift the scaffolding into place that you're going to use to construct the building that's ultimately the place where your customers are going to live, well, you would kind of hope that the factory that's making those cranes has got good quality control and a big red button that someone can hit if somebody gets their hand caught in the mangler, et cetera.
MICHAEL: Yeah, and the cranes are maintained on site and they're kept in good working order.
CRAIG: Right, right. Yup. Cool. Well, that was a fascinating divergence. Well, not a divergence. It was a fascinating sidebar. I don't know. Whatever it was, it was good. But I want to make sure that we come back to the series since that is sort of our focus today. Before I dragged you off into talking about factories making factories to make factories, bring me back to the arc.
MICHAEL: Well, I think we jumped ahead a bit.
MICHAEL: Into talking about the team scale autonomy, so I want to go back to that idea of team scale autonomy just a little bit and talk about what are the things that typically break down that autonomy. If you think about a dev team that's been deploying code rapidly and something goes wrong, and all of a sudden a new development manager gets put in place. They say something typically condescending like, "We need to have some grownups around here," or, "It's time to get serious," or something like that. What are the things that would often cause them to make that maneuver and start reining in the chaos and getting people under control?
CRAIG: Usually it's some sort of perceived or actual threat or risk. Somebody says this is going to be a problem, and if I want to be cynical a little bit, it's someone thinks they're going to get in hot water with someone else.
MICHAEL: Yeah, or maybe it already has happened.
MICHAEL: But if we sort of drill down under the hypothetical, maybe we had a big system outage.
MICHAEL: Or maybe there was corrupted data that cost a lot of money to clean up. What we usually find is that it's some breakdown in boundaries where this team was able to do something that they couldn't fix themselves. They created problems for other people. Maybe it's just that you have a separate operations group from your dev group, and so every problem from dev becomes an operations problem. That's, by the way, one of the motivators for devops and merging those two teams or blending those two teams.
But another kind of breakdown is a technological breakdown. Maybe somebody was making a call with bad data and it crashed the supplier. Or maybe they were responding with bad data and it crashed a consumer. Sometimes it's just about dependencies. So imagine you're in sort of the large enterprise shop and somebody brings in a new version of a package.
Let's leave Java aside and just say somebody uses NuGet in Visual Studio and they update a dependency. It works fine for them. They push their assembly, and then that dependency is incompatible with everyone else and the whole rest of the application doesn't work. It's a breakdown of boundaries.
In that case you had multiple different sub-teams having a shared dependency at a binary level on something that runs in process. You want to break that down and say, all right, no more shared dependencies. Either we just get rid of all that and we have no shared libraries at all, or we run in different processes so you can upgrade your dependency and I can upgrade my dependency at separate times. I think this is one of the motivators for containers, by the way. We can package up not only all of our library dependencies, but our whole OS stack of dependencies, package it all up as a container.
The way to look at this is to keep saying, "What is the coupling between my team and other teams? What are the dependencies between people and in source code and in libraries? And let me find ways to either invert these dependencies, break them, turn a hard dependency into a soft dependency," meaning it doesn't cause a failure in the consumer if the supplier fails, "or factor dependencies out so two teams are independent from each other but each have a dependency on something new that we can put under a high SLA."
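One way to picture turning a hard dependency into a soft one is the minimal Python sketch below. The recommendation service, the `soft_call` helper, and the choice of an empty list as the fallback are hypothetical; a production version would likely add timeouts and a circuit breaker.

```python
def soft_call(supplier, fallback):
    """Call a supplier, but degrade to a fallback instead of failing the caller."""
    try:
        return supplier()
    except Exception:
        # Assumption: any supplier error is survivable for this feature.
        return fallback


def fetch_recommendations():
    """Hypothetical supplier that is currently down."""
    raise ConnectionError("recommendation service unavailable")


def render_product_page(product_name):
    # A hard dependency would call fetch_recommendations() directly and crash here.
    recs = soft_call(fetch_recommendations, fallback=[])
    return {"product": product_name, "recommendations": recs}


page = render_product_page("widget")
```

The consumer still renders its page while the supplier is down; it simply shows no recommendations, so the supplier's failure does not become the consumer's failure.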
CRAIG: I was just trying to marry that up, and I think it works very well. I was trying to marry that up with the idea that you posited early on, and I don't think you would claim it's your idea, but the idea that if something happens and it's bad, you don't want to create a process that makes it less likely to happen. You want to create a process that makes it easy to deal with.
I think I'm fair in saying in the series you were implying that you would make it easy to deal with through practice in doing so, but also of course through infrastructure to handle that. This is an example where I'm like, okay, well, how would I apply that to dependencies? I think you covered that by saying, well, here are some architectures that will mean you can do that sort of thing all the time and it won't matter as much because it's factored out or you're separate.
CRAIG: Did I get that right?
MICHAEL: Yes. The way that I describe it is that we have two different parts. This is what Part 11 is really going to be about. We have two different aspects to preserving team scale autonomy. One is shrinking the failure domains, and that's largely a technical exercise of drawing different architectural boundaries and having different ways of handling calls across those boundaries that are safer.
Then the other is kind of giving the team tools so that they know what they're doing and they don't create problems for others as a process. I haven't fully articulated all this, which is why Part 11 isn't up yet, but I think of these as sort of safety factors. One of the analogies I want to go to there is in word processors. The biggest safety factor in a word processor is the undo feature, right?
You can do any crazy crap you want as long as you have confidence that undo is going to get you back to where you were. That makes it safe for you to explore the features of the word processor. You can say, "Well, what does this button do? Oh, that button turns everything into," I don't know, "Klingon. I'm going to undo that. I don't want my document in Klingon."
CRAIG: But why wouldn't you, really? Yeah. Or Git, right? Revision control, the same thing.
MICHAEL: Yeah, and in fact a lot of the sharp edges in Git are where it allows power users to sort of do actions that have no undo. That's when people really get into trouble. For instance, pushing a rewritten history to a branch that was previously pushed.
MICHAEL: A terrible idea because it's not undoable. But another thing that you would really like to have – I mean undo is great if you just shut down every virtual machine in your entire region. Being able to undo that is great. But wouldn't it be better to know that that's what was going to happen first? Visibility into the consequences of the action you're about to take is even better.
MICHAEL: I'd like to have preview. I'm going to change a firewall rule. What happens? Oh, everybody breaks. I better not change that firewall rule. Something along those lines, I think that kind of visibility is really key for safety as well.
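The preview idea can be sketched as applying a change to a copy of the current state and reporting what would break, without touching the live state. Everything in this Python sketch is an assumption for illustration: the rule set is a simple allow-list of (source, destination, port) tuples, and `preview_change` and `drop_db_access` are hypothetical names.

```python
def preview_change(rules, change, known_flows):
    """Apply `change` to a copy of the rules and report flows that would break."""
    speculative = change(set(rules))  # the live rule set is never mutated
    return sorted(f for f in known_flows if f in rules and f not in speculative)


# Hypothetical allow-list: (source, destination, port) tuples.
live_rules = {("web", "api", 443), ("api", "db", 5432)}
known_flows = [("web", "api", 443), ("api", "db", 5432)]


def drop_db_access(rules):
    """The change under consideration: remove the api-to-db rule."""
    rules.discard(("api", "db", 5432))
    return rules


broken = preview_change(live_rules, drop_db_access, known_flows)
```

Here `broken` lists the flows that currently work and would stop working, while `live_rules` is left exactly as it was, so the decision not to make the change can be taken before anything breaks.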
CRAIG: Datomic has this, right? With.
MICHAEL: With. Yeah. Exactly.
CRAIG: Yeah, which is cool.
MICHAEL: You can do a speculative value of the DB, so there are a couple of other principles to creating safety. I'm not going to go through all of them now or people won't have a reason to read Part 11.
CRAIG: I realize you haven't published it. We will be happy to have you back again, Mike. It's always great to talk to you, so no obligation to go through those in detail.
MICHAEL: Okay. Overall, the idea is once you've created team scale autonomy, there will be social forces trying to tear it down and reconstruct the command and control structure because it's just sort of a natural primate instinct. And so we need to have mechanisms to make it more likely that we can preserve that team scale autonomy and independent action.
CRAIG: Yeah. People: that's always the hard part. Yeah. I'm looking forward to – I don't know if you have any answers for us there right now or if that's coming in the series, but that's also another thing I'd love to hear about at some point, just what are the structures that enable that. How do you make it in people's best interest?
It's fascinating to me. You can be in an organization where every single person you meet is interested in promoting the interest of the overall organization.
CRAIG: Every last person you meet is like that, and still – and still you wind up with a direction that is, to put it generously, suboptimal. And so it's fascinating to me to think about how. Maybe it doesn't make sense to go there right now, but it's fascinating to think about how you can create systems wherein the organization acts in its own best interest, if that's the right way to put it.
MICHAEL: Yeah, that is a really fascinating question. I'm not likely to address it in this series.
MICHAEL: But I have plenty of examples of the kind of thing you're talking about. I've also observed that there are a number of sort of high velocity, high performance methods, I'll call them, that are fragile to having one person who doesn't understand it acting in some way.
An example is Agile software development. I've seen a great, high performing, Agile team. One where we actually, on the team, decided to shrink the team because we were going too fast. The business couldn't assimilate change as rapidly as we were creating it.
A year later, and one project manager different, the team room was disbanded. Everyone was back in their cubes. Everyone had their ownership of their portion of the code base. Unit tests were breaking and getting deleted all over the place. Basically, everything that had allowed the team to go fast, all that supporting structure was dismantled one piece at a time by a project manager who just didn't understand why it was there.
MICHAEL: That's not the only kind of system, though. I also saw a company go through a large-scale, just-in-time inventory project. And it worked. They got their inventory turns from three per year to nine per year or something like that, so using their capital three times more efficiently.
But in this one particular division the procurement manager got called on the carpet for a work stoppage that had lasted six hours or something like that. He decided it would never happen again, so he ordered something like four months' worth of parts. Well, there went the entire just-in-time system because one guy acted in his own self-interest and harmed the organization.
CRAIG: Yeah. Yeah, one bad apple, right?
MICHAEL: Yeah, and he wasn't even deliberately harming it. He didn't understand the nature of the just-in-time system because he had only a local perspective. One of the questions that really vexes me, and maybe our listeners can write comments about how to avoid this, but how do we find high performance methods that are robust and self-correcting against single individuals who operate not in accord with the method?
CRAIG: I mean this is the same problem in the human scale that you are addressing the technical side of, which is how do we build a system where the failure, because really you could view someone's decision to act in a way that torpedoes the benefits you're getting from some process as a failure.
CRAIG: How do you protect yourself against that failure? Where are the circuit breakers against someone getting yelled at and deciding to take action to avoid getting yelled at again from doing that sort of thing?
MICHAEL: Right. Right. That is the question.
CRAIG: Yeah. Well, awesome. Wow. We went both directions today, didn't we? We went all the way down into technical and all the way up into the organizational philosophical. But I want to make sure that we kind of come back to the middle road where we were talking about this very interesting story that you've been telling around this stuff because I know we didn't get a chance to or we haven't yet completed that arc. There are still a few things that you talk about here that I'd love to get your summary of and/or thoughts on.
MICHAEL: Well, I think as we go through this series, you know the earlier parts are really about explaining why we need to change or why a certain group of companies are doing things so differently. As we get farther on through the series, it shifts much more into, okay, so how do we go about that? What are the preconditions that have to be there?
MICHAEL: I think the one piece that we haven't even touched on yet is the one about using high leverage tools or sharp tools in our local parlance.
CRAIG: Well, let me ask you a question. Unsurprisingly, you and I have already spent most of the really interesting hour talking about this stuff. We certainly don't have to stop right now, but do you think it would make sense to kind of make this part one of two and come back maybe after you've had a chance to write a few more episodes in your series? Episodes? Articles in your series.
Wow. I was talking to myself the other day. I'm like, have I really done 110 of these of this show? It's hard to believe.
Anyway, do you think that would make sense, or is there a good way to spend a few more minutes kind of wrapping up what's out there so far? Certainly, I think we're going to want to have you back, regardless, to talk about the other things that are coming up. But I'm just wondering whether you think it might make sense to break here and do a part two in the not too distant future that picks up from here.
MICHAEL: Yeah. I think that's a great idea.
MICHAEL: As you say, an hour is a big commitment already. I don't want to go too far beyond that. But also the next time we talk, and we talk about sharp tools, we might have some things to unveil.
CRAIG: Yeah. There's always stuff cooking here. I know you and I both got to hear about spec a little bit before that came to light, and sometimes it's hard, right? Sometimes it's hard to say, "Oh, man, I really want to talk about this in general," even though we sometimes have things that are very much an interesting bit of whatever to share, but not commit to anything. I think people see what I'm saying.
CRAIG: There's often things going on where we can talk about them later that we can't talk now, so I don't need to go into any more of that. All right, well, cool. Well, I think then we'll do what I usually do, which is definitely want to have you back on, for sure, even maybe more so in this case than is normally the case, although it's always sincerely meant. But also to leave room for anything. Not to cut it off here, but to leave room for anything else you think makes sense to talk about today, if anything.
MICHAEL: No, I think I've said quite a bit, and I'm going to leave it there and save the rest for the next cast.
CRAIG: Cool. I'm definitely looking forward to it. Yeah, that'll be fun, and I'm glad we did that because I think it'll make a good set of bookends. Maybe there'll be a part three. Who knows? It's all good.
Well, cool. Then there is another question that I have to ask you before we wind down. This is one that I'm particularly – well, I don't want to put any pressure on you, but I always love to hear your take on this. It's the question about advice. We always ask our guest to share with us a piece of advice, whatever they like. Something they've had told to them or whatever it is.
MICHAEL: Yeah, so this is going to sound kind of mundane, but from several conversations I've had lately, I would recommend that everyone listening should learn how to read a company's balance sheet and its cash flow statement. Not just understand what the lines are, but be able to understand what they mean behind it and what you can interpret about a company by looking at those things.
CRAIG: Cool. I love it. We get a lot of really great advice, and I have to say I would definitely count your advice that ended Episode 100 among some of the best that we've gotten. But that's a great one, and I totally agree with you that it's definitely useful. But it's also remarkably practical, and so that is an excellent addition to our rapidly growing stable of really good advice, so thanks for that.
MICHAEL: Maybe we should have a special transcript somewhere of just all the bits of advice collected.
CRAIG: That's not a bad idea. Yeah, that's not a bad idea at all.
MICHAEL: The collected advice of The Cognicast.
CRAIG: There you go. Yeah, our guests have said such wonderful, wonderful, inspiring, interesting, and practical things. Well, anyway, so that's awesome advice, but we are wrapping it up here, so I can't forget and I cannot do this enough, but thank you so much for coming on today.
Really, you definitely are one of my favorite people to have conversations with. Always so interesting. Clearly you think deeply about this stuff. I think your insights are amazing and valuable, and it's really been cool to see the series develop. I think a lot of people are really paying close attention to it. You know we're getting good feedback. People are loving it, and it's fun.
Even though you've written this stuff, I really think it's great to get together and have a conversation about it. It just adds something to it. We were talking about live music earlier. It's almost like the live concert version of your album, right?
MICHAEL: Yeah, exactly.
CRAIG: Yeah, there we go. There's a callback. It's been great, so thanks a lot for coming on the show today.
MICHAEL: It's always a pleasure.
CRAIG: Likewise. This has been The Cognicast.
[Music: "Thumbs Up (for Rock N' Roll)" by Kill the Noise and Feed Me]
CRAIG: You have been listening to The Cognicast. The Cognicast is a production of Cognitect, Inc. Cognitect are the makers of Datomic, and we provide consulting services around it, Clojure, and a host of other technologies to businesses ranging from the smallest startups to the Fortune 50. You can find us on the Web at Cognitect.com and on Twitter, @Cognitect. You can subscribe to The Cognicast, listen to past episodes, and view cover art, show notes, and episode transcripts at our home on the Web, cognitect.com/cognicast. You can contact the show by tweeting @Cognicast or by emailing us at email@example.com.
Our guest today was Michael Nygard, on Twitter @MTNygard. Episode cover art is by Michael Parenteau, audio production by Russ Olsen and Daemian Mack. The Cognicast is produced by Kim Foster. Our theme music is Thumbs Up (for Rock N' Roll) by Kill the Noise with Feed Me. I'm your host, Craig Andera. Thanks for listening.