Seeking Scale

Reflections on my first 60 days at Datadog.

I got a new job! Well, I’ve had it for some time now, but I figured rather than bang on about it on Week 1 I’d hold off for a bit and learn at least a little bit of the lay of the land. After a long, restful break after my last day at Shapeways (approximately 48 hours), I started my first day as an Engineering Manager at Datadog. I’m now leading two fast-growing teams which are responsible for extracting customer data from cloud providers and funneling it into our intakes.

Shameless plug alert: these teams are growing, and fast, so if you’re looking for something new and want to work on critical teams at a large public company…hit me up.

Why Datadog?

“Datadog, wow, that’s a pretty big change from Shapeways!”, you might be thinking to yourself…and you’d be right. After 8.5 years working as both a senior IC and VP of Engineering in a startup environment, I wanted to take the next step in my career. Step 1, of course, was figuring out what that actually meant. Long term, I’d love to be a CTO of a growing company - startup, public, or otherwise. While I enjoyed my stint as a senior IC, cranking out solutions to difficult problems, I knew that I wanted to get back into leadership and management in my next gig - I missed setting strategy, and working with my colleagues to make that strategy into a reality. I figured I had two options - I could go be CTO or VP Engineering for an early- to mid-stage startup, or I could take a title demotion and go work at a larger company and learn about scale. The former, I’d already done, and it felt a bit repetitive. The latter, however…while I’d worked at big companies in the past, I’d never been in a leadership position at one. I realized that this gap would eventually become career limiting - I’d never be hired as a CTO of a mid-sized company if I kept my experience to managing teams of 30 or fewer.

Decision made, I started seeking out companies that I respected and that fit the profile. And, while I spoke with lots of different places (and was ignored by just as many, if not more), I always had a good feeling about Datadog. See, I’ve been a DD customer since 2013, and have loved watching it grow and evolve. Hell, read old posts on this blog - you’ll see they’re full of Datadog charts, graphs, and alerts. I had a deep respect for their product, knew that they were growing, and liked that they were already public. Datadog fit my requirements to a T - and fortunately, they felt the same way about me :)

Operating at Scale

As mentioned above, I knew that I had some gaps that I wanted to fill in my experience. The north star I was pursuing in a larger company was “learn how they do X at scale”. At first glance, I thought of engineering problems - building distributed systems at scale, managing uptime, reliability, multi-cloud presence and deployment. These were all challenges that I had little experience with at the scale Datadog operates, and would definitely be an excellent learning experience. However, what really got me excited about the role was to learn more about how a company ran at that scale, on the human side. How are decisions made in an organization with over 800 engineers? How is a vision set, how do you build a strategy to drive towards that vision, and how do you devise tactics to make progress against that strategy? How do we collaborate, communicate, share knowledge, in a team so vast? The more I thought about it, the more I realized that this was the real gap that I had to fill.

Initial Thoughts and Comparisons

So, I’ve been at Datadog for around 60 days now, which obviously means that I know all there is to know…or not. Here are my initial thoughts on some differences and similarities between my previous gig and my new one, with a GIANT CAVEAT that these are 100% subject to change, and that any mis-representation or error is my own. I look forward to looking back on these in a year and laughing, a lot, about the mistakes and assumptions that I’m about to type up below.

Differences

There are, as one would imagine, significant differences between Shapeways and Datadog. I’m breaking these down into two main categories: Business, and Scale.

From a business perspective, Shapeways is a venture-backed B2C digital manufacturing and technology platform with heavy B2B tendencies. Datadog, on the other hand, is a publicly-traded enterprise SaaS company. The size, scale, and scope of these two businesses lead to some predictable differences. Shapeways is a startup, lacking a huge bankroll, and has to be scrappy and fast with its business and technology decisions, prioritizing time-to-market and first mover advantage to land customers to drive future innovation. Datadog has a much larger balance sheet and an entrenched, high-touch customer base, leading to more considered decisions around product and feature launches, taking care not to regress on previous promises to customers. Additionally, Shapeways is a manufacturing company - it must consider physical margins, COGS, and physical supply chains to deliver products to customers, leading to a complex calculation of EBITDA. Datadog, being a SaaS company, has a much smaller number of factors to consider prior to EBITDA, none of which are physical in nature. This leads to faster growth, faster iteration time vs physical products, and sheds some light on Shapeways’ decision to build a manufacturing technology platform to supplement its 3D Printing Service :)

Then there’s the scale - Datadog is simply a much larger company, with larger technology requirements and more people at its disposal than Shapeways. Note that the following stated figures for both companies will be approximate, and please don’t get too bogged down in numbers - they’re more for proving order-of-magnitude difference than they are a tied-out representation of either.

Technology

Let’s start with technology requirements. Shapeways handles an admirable amount of data every day - thousands of uploads, thousands of orders, tens of thousands of physical parts produced, finished, and shipped globally. At its peak, Shapeways sees a thousand or so concurrent visitors to the website, with hundreds more accessing backend manufacturing and supply chain management systems. Compare this to Datadog, which regularly processes tens of millions of data points per second, in a multi-cloud environment…and that’s just the metric intake. This ignores the breadth of products, including APM, Profiling, Logs… Datadog intakes, tags, processes, and presents a truly planet-scale amount of data.

This difference in scale demands different requirements of each company. In the case of Shapeways, the requirements aren’t nearly as large as Datadog’s. That said, availability and performance remain critical to the business, so redundancy and self-healing systems are critical to success. To meet these requirements, Shapeways maintains two datacenters (one US, one EU), each of which is capable of running the entire suite of tools by itself, and ergo function largely as disaster recovery / hot standby DCs. It takes about 10 minutes to flip the entirety of Shapeways’ web presence between them, which is quite the feat when you consider that the team responsible is made up of two people. Each of these DCs has a few hundred hosts, split between bare metal and virtualization, which handle the databases, apps, storage, and more required to run the business. Shapeways also has the ability to scale compute into the cloud, and does so on a regular basis for the more fungible services.

That system is now becoming too small for Shapeways’ growing business, and will need to be revisited by the team running those systems. That’s a problem they’re well-equipped to solve, and one I believe will result in them migrating their services fully to the cloud.

Shapeways leverages a fairly standard set of services and applications to make the company go. These include, but aren’t limited to:

  • MySQL, MongoDB, Solr for databases
  • Redis, Memcached, APC for caching
  • ActiveMQ, Celery for queueing
  • Luigi, Airflow, Redshift, Looker for BI
  • Puppet, Terraform for configuration management
  • zfs, ubuntu, nginx, cumulus, quagga, strongswan for storage, networking, security, etc

You could consider these to be “Boring” technology choices, and you’d have a strong argument. You know what though? I like Boring technology running my mission-critical stuff. Boring, in the software world, is often coded language for “stable”, “well-understood”, and “has a large community using it.” Those are characteristics I want when I’m building on a tight budget and a high uptime requirement. You can hire people who already know Boring. You can build Boring fast, because it’s well documented, and easy to get help when you need it. Compare that to more interesting technologies, which are often developed by large companies to meet a specific scale challenge they’re experiencing. And, as much as we’d all like to be Google, it’s unlikely that your seed-round startup’s scaling problems require the same solutions as Google.

Datadog, on the other hand, is a cloud-first company. As a SaaS platform, Datadog’s need for compute scales directly with its customer base. In the Shapeways case, it takes the same amount of compute to place an order for 100 of a single model as it does for 1. Datadog, on the other hand, requires 100x compute to process 100 data points when compared to processing just 1. Datadog has to go where its customers are - be they in AWS, GCP, Azure, Oracle, or using their own datacenters. This means that Datadog must support installations in multiple clouds, in multiple regions, or face prohibitive costs extracting that data and loading it into another cloud provider.

At the scale of the company, Datadog needs to focus not just on the uptime and reliability of their services, but also the performance. When a company is processing that much data, performance hits or timeouts can cause cascading effects that seriously degrade or stop service to their customers. As this is Datadog’s entire business, there’s a huge amount of effort spent around both optimizing for performance, and building resilience into our systems to limit the blast radius of service degradation or failure. In order to meet the requirements, Datadog uses a broad set of technologies (too many to list here), but unlike Shapeways, leverages them for specific purposes that are suited to their performance characteristics. For example, if you need a database @ Datadog, you use Postgres. However, if what you need is strong write performance and bulk read access, you would use Cassandra. Need to drink from a firehose of data, and partition it automatically for consumption? Kafka would be your tool of choice. The point is, Datadog has a different set of needs than a company like Shapeways, and those needs are reflected in the considered technology choices they make.
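
To make the “partition it automatically” idea a bit more concrete, here’s a minimal sketch of key-based partitioning, the general technique a system like Kafka relies on to spread a stream across consumers. To be clear, this is not Kafka’s actual partitioner and it’s not anything from Datadog’s codebase - the function names, the partition count, and the use of CRC32 as the hash are all assumptions I’ve made for illustration.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 4  # hypothetical; real topics are sized to their throughput


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a key to a partition with a stable hash (CRC32 is a stand-in here)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_stream(points, num_partitions: int = NUM_PARTITIONS):
    """Group (key, value) data points by partition, preserving arrival order."""
    partitions = defaultdict(list)
    for key, value in points:
        partitions[partition_for(key, num_partitions)].append((key, value))
    return dict(partitions)


if __name__ == "__main__":
    # Hypothetical data points keyed by host name.
    stream = [("host-a", 0.42), ("host-b", 0.17), ("host-a", 0.45)]
    for partition, items in sorted(partition_stream(stream).items()):
        print(partition, items)
```

The property that matters is that the same key always lands on the same partition, so whoever consumes that partition sees that key’s data points in order.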

People

The technological differences are stark between the two companies, but that’s not why I joined. I wanted to see how things were different in the organization of people.

The first, and most obvious, thing to note is that Datadog is around 10x the size of Shapeways from a pure employee count. However, from an Engineering perspective, it’s more like 40x the size. A large portion of Shapeways’ headcount are folks working in our factories - the engineering team makes up around 10% of employees. Compare that to Datadog, where the engineering team is around 50% of total headcount. Things are naturally different at that scale - here are a few things I’ve observed so far.

  • There are many more layers of management at Datadog. At Shapeways, it was pretty simple - 1 VP, 3 Directors, ~20 engineers. At Datadog, it’s 1 CTO, 2 SVPs, ~5 VPs, 10s of Directors, maybe 10 Engineering Managers (a new role, more coming soon), dozens of Team Leads, and hundreds of Engineers. While this might sound like a mess of overhead, it’s actually quite efficient - vision and high-level strategy are set at the VP/Director level, OKRs, roadmaps, and goals are set at the EM/TL level, and the engineers perform the execution.

  • All strategy and OKR documents are public, so people have unfettered access to information. Everyone has a good idea of what they’re working on, why they’re working on it, and how they’re doing against their goal. The engineering org is built on a ton of trust - individual devs have permissions to deploy to production pretty much anything that they deem ready. This removes a lot of unnecessary barriers to context, provides a level of accountability, and enables the individual engineers to make and act upon decisions.

  • While the company’s tech stack is huge, most people at Datadog work on a small, constrained piece of it. This enables them to learn deeply the characteristics of the software that they own, and to feel confident deploying changes to production. Which is good, because…

  • Developers are responsible for the code they write. At Datadog, if you write it - you run it. There are no dedicated ops teams. This is made possible largely because the developers are focused on much smaller product surface areas than at a startup like Shapeways. At Shapeways, one day you may be doing frontend and the next you’re optimizing SQL. At Datadog, you’re working on your team’s product, with your dedicated team.

Similarities

Those are some pretty big differences between two dramatically different companies…but they’re all expected. What’s a bit more surprising is the collection of similarities that I’ve noticed between the two.

First off - despite a dramatic staffing size difference, both Shapeways and Datadog employ a similar “small teams” philosophy. Smaller teams (10 person max) communicate more clearly, know what each other are working on (because it’s probably the same project), and build meaningful bonds with one another. Compare this to larger teams, where it’s more difficult to communicate, difficult to find in-kind work for 15-20 people to do, and therefore harder to build these bonds. This is roughly equivalent to Bezos’ “two-pizza rule”, though I hesitate to call it so because I reject his overall belief that “communication is bad”. Communication isn’t bad - it just becomes overhead as teams grow too large.

Another commonality between the two companies is that the engineers have a general enthusiasm for the product that they’re building. Shapeways is full of 3D printing and manufacturing fiends - they experiment with the product, and use their firsthand knowledge and insights to bring improvements to the work that they do. Same goes for Datadog - many people who work here have used Datadog in their previous jobs, and love the immediate value it brings to their work. Unsurprisingly, we use Datadog’s product suite to monitor our own production software, and I’m not sure how easy it’d be to take those on-call shifts without it.

A third similarity - despite being a publicly traded company and a NYC unicorn with extremely competitive compensation, it’s still hard to hire at Datadog! Datadog has grown 2x a year for basically the past 4 years. It’s difficult to find that many qualified candidates to sustain that growth, and there are tons of teams inside of Datadog who are vying for the same talent pool. Also, Datadog is now a public company - there’s less potential equity to hit a home run with as opposed to a startup. Now, while these hiring challenges are definitely different than those at Shapeways, it’s worth mentioning that hiring didn’t get any easier w/ moving to a larger company with more resources.

Finally, on a bit of a somber note, I’m still working out of my bedroom. The COVID-19 pandemic is now entering its 9th month, and while vaccines are starting to be considered by the FDA, we’ve still got a long way to go here in NYC. I’ve completed an entire job hunt, done a 5 week transition period, started a new job 60 days ago…and I’m still in the same chair, in the same bedroom, in the same small apartment as I was in 9 months ago. And, guess what - I’m lucky. My family works from home, and has remained healthy. I have job security, food security, and housing security that’s not guaranteed to so many others. So, if you’ve got those things, take a moment and be grateful for them - even though this is one of the hardest years of my life, I only have to look out my window to see how lucky I have it. If this sounds like you, I encourage you to do the same, and think about what you can do for your community this holiday season.

Wrapping Up and Looking Ahead

This is the ninth and final post in a series about the transition from a monolith to microservices that I led at Shapeways as the Vice President of Architecture. I hope you find it useful! You can find the series index here.

Regular readers (hah) of this blog will have noticed that I have ended my time at Shapeways. You can read all about it here if that’s of interest to you. With that in mind, I wanted to draw this series to a conclusion, and talk about some things that are on the horizon for Shapeways’ continued journey towards microservices, and what I hope to see them continue to do as a fan and cheerleader from the sidelines.

What We’ve Done So Far

It's been fun!

Before we talk about where we’re going, I’d like to take a moment to reflect on where we are. At the outset of this blog series, all of Shapeways’ production web properties were running out of a single monolithic codebase. This was difficult to work with, but also functional, and was capable of running the business of the day. It took a change in our approach to our business, plus a desire to expand from a service bureau to a technology platform, to make it worth the time and effort to convert this monolith into microservices.

Once the decision was made, of equal importance was identifying a senior individual contributor to lead the charge. It was imperative that this task would have 100% of their focus - this was a job too big for a manager to take on in tandem. That person happened to be me - this isn’t to say that I was the only one capable of doing it (far from it), but my familiarity with the business model, current software stack, and vision of the future was hard to compete against.

The first step after figuring out who was going to do the work was figuring out how to break it down. This required a critical look at our current offering, breaking down the functionalities of our monolith into standalone, digestible chunks. Once we’d done that, the next step was to identify promising candidates for services from both our core and support functionalities, and then to select and build our first series of microservices. It was critical that these services were built with clear APIs and abstractions, and that their APIs were the only way to interact with the data for which they were responsible. And, once you’ve got your services running: go build some more! To date, we’ve built around 10 separate services at Shapeways, and have many more coming as the company continues to evolve its software.

What’s To Come

Engage!

And there, my friends, is where my contribution to Shapeways comes to an end…of course, that doesn’t mean I don’t have opinions on the next steps that should be taken! This was my baby after all :) Here is, in order-ish, where I believe the team should focus future efforts as the usage continues to grow.

  1. Completion - Perhaps the most obvious next step of all is to complete implementing all core features of the monolith as microservices, rendering the monolith redundant and able to be retired. This will give Shapeways permission to stop maintaining the monolith, freeing up people inside the Engineering organization to focus on tending and growing their new architecture. Which is good…cause they’re gonna need it!

  2. Service Discovery - The system as it’s implemented now is not designed to scale immediately. That’s not to say that it’s incapable of doing so, but rather that the implementation is incomplete. One big area that remains unsolved for Shapeways is to implement service discovery. There are many implementations of this, from the battle-tested but brittle DNS to the designed-for-the-task xDS from Envoy Proxy.

  3. Containerization - We kept around our existing deployment system when migrating from the monolith, adapting it along the way to work for our newly virtualized microservice deployments. However, true build+test virtualization is not yet complete for the Shapeways infrastructure. Completing that task is vital to creating an easily scalable, resilient, and autohealing platform that can deliver reliability for Shapeways’ customers. Personally, I’d be pursuing a combination of modern state-of-the-industry tools and standards such as Docker, Kubernetes, and CNAB to keep our infra healthy.

  4. API Gateway - As of the moment, Powered by Shapeways only exposes APIs internally for intra-service communication. However, as the system continues to expand and functionality moves from the monolith to the service mesh, Shapeways will need to expose external API endpoints. These would be used by external software products to programmatically retrieve things like model pricing, geometry, optimized files, or order status. These pieces of information are core to building a truly interconnected platform - you can’t integrate with another company’s ERP system if you don’t give them APIs! Fortunately the API Gateway pattern is here to help, and there are many commercial and open-source gateways available for easy installation and management of data. These gateways not only provide ingress and egress for your data, but can also provide telemetry, security, and observability for your APIs.
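
To illustrate how items 2 and 4 fit together, here’s a rough sketch of a gateway that maps public paths to internal services and asks a registry where those services currently live. This is a toy under stated assumptions, not how Shapeways does (or would do) it: the class names, the round-robin policy, the prefix matching, and the addresses are all hypothetical, and a real deployment would lean on something like DNS or Envoy’s xDS rather than an in-memory dictionary.

```python
class ServiceRegistry:
    """In-memory stand-in for service discovery: name -> instance addresses."""

    def __init__(self):
        self._instances = {}
        self._next = {}

    def register(self, name: str, address: str) -> None:
        self._instances.setdefault(name, []).append(address)
        self._next.setdefault(name, 0)

    def resolve(self, name: str) -> str:
        """Return one instance address, rotating round-robin between calls."""
        instances = self._instances.get(name)
        if not instances:
            raise LookupError(f"no instances registered for {name!r}")
        index = self._next[name] % len(instances)
        self._next[name] = index + 1
        return instances[index]


class ApiGateway:
    """Maps public path prefixes to internal services (naive prefix match)."""

    def __init__(self, registry: ServiceRegistry):
        self._registry = registry
        self._routes = {}

    def add_route(self, prefix: str, service: str) -> None:
        self._routes[prefix] = service

    def route(self, path: str) -> str:
        """Return the internal URL that an external request should go to."""
        matches = [prefix for prefix in self._routes if path.startswith(prefix)]
        if not matches:
            raise LookupError(f"no route for {path!r}")
        # Longest matching prefix wins, so "/orders/status" beats "/orders".
        prefix = max(matches, key=len)
        return self._registry.resolve(self._routes[prefix]) + path


if __name__ == "__main__":
    registry = ServiceRegistry()
    registry.register("pricing", "http://10.0.0.1:8080")  # hypothetical hosts
    registry.register("pricing", "http://10.0.0.2:8080")
    gateway = ApiGateway(registry)
    gateway.add_route("/pricing", "pricing")
    print(gateway.route("/pricing/models/123"))
    print(gateway.route("/pricing/models/123"))
```

The useful part of the split is that external callers only ever see the gateway’s paths, while instances can come and go behind the registry without anyone outside noticing.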

Fill to Me a Parting Glass

From "The Parting Glass", a traditional Irish song of farewell.

I’ve had the privilege of working at Shapeways for the past 8.5 years of my career. I joined as a software engineer, eager to escape the perceived doldrums of management and get back to writing some code! I fairly quickly ended up back in management, becoming a Team Lead, Engineering Manager, and two different flavors of Vice President. And now, after many years filled with wonderful experiences, it’s time for me to move on to my next challenge in a few weeks (which I’ll write about in a different post).

In my time at Shapeways, I did a bunch of awesome technical stuff with my colleagues: wrote and open sourced a functional testing framework, turned a SPOF datacenter into a globally distributed hybrid cloud, converted a PHP monolith to a polyglot microservice mesh, put out dozens of fires, survived 8 Black Friday/Cyber Monday experiences, quintupled our material offerings, partnered with Fortune 500 companies, expanded into 3 new production facilities, launched a SaaS platform, built an ERP from scratch that focused on 3D printing… the list goes on. That’s great, and I’m proud of the work that we’ve done over the years.

However, what I look back most fondly on at Shapeways is the people. We’ve always had exceptional people at Shapeways, people who were mission driven, dedicated to their craft and their teams, and that has been a constant over my entire tenure. Ex-Shapies have been telling me for years how Shapeways to this day holds a special place in their hearts, but only now, as I’m making my own departure, do I have the opportunity to share their perspective. Companies aren’t made or broken by their achievements, by their top-level executives, or even their products. Companies are made by their people, and I’ve been blessed to have spent most of the past decade surrounded by some of the finest I’ve ever worked with.

My biggest regret about my departure in late 2020 is that I won’t have the opportunity to say my goodbyes in person. Instead, I have to hope that this blog post, written from a small bedroom in a small apartment in New York City in the middle of a pandemic, can help me close out what has been the most life-changing experience I’ve had as a professional software engineer. That aside, I cannot WAIT until we’re all able to be together again - you can bet your asses I’ll be back for my own goodbye party once we’re able to have it. But, until then:

But since it fell into my lot
That I should rise and you should not
I'll gently rise and softly call
Good night and joy be to you all

Thank you, Shapeways. Here’s to you.

A Break in the Action

It’s been a little while since my last post on the microservices series - apologies! Unfortunately I managed to fall off my bike and break my arm, which has made typing…well, difficult, to say the least :) I’ve got what’s called a Galeazzi Fracture, which required surgery to repair (and leaves me with some cool plates and screws).

Generic shot of a Galeazzi fracture - not my wrist (thankfully - I have no pin).

Enjoy your summer, and be safer on your bike than I was! I’ll be back in the fall - for right now I’m focused on resting my arm and letting it heal.

Introducing Inshape

This is the eighth in a series of posts about the transition from a monolith to microservices that I led at Shapeways as the Vice President of Architecture. I hope you find it useful! You can find the series index here.
Manufacturing has come a long way.

This series of posts thus far has focused on a common, well-understood space - e-commerce. Sure, we’re selling 3d printing vs traditional goods, and sure, we’re manufacturing them in real time…but just as an API provides a defined interface to an underlying service, so too does e-commerce define a standard web practice: buying stuff on the internet. You’ve got products, they’ve got prices, and you exchange your hard earned cash to get your hands on them.

However, as you can probably guess, Shapeways isn’t a standard e-commerce company. We don’t keep inventory - we manufacture every single item that’s ordered from us, on demand (always fresh, never frozen). What’s more, we have an infinite catalogue of products - every time someone uploads a new file to our site, our catalogue grows. To support the variety of materials and the volume of orders we receive, we have a globally distributed supply chain performing just-in-time manufacturing of parts for us. These parts have to be produced, processed, assembled, QA’d, packaged, and shipped all over the world. There’s a purpose-built suite of software tools that make this possible - we call it Inshape.

Inshape

A customer journey with Shapeways starts with visiting our site, and ends with them receiving their 3d printed product(s). Shapeways.com and the Powered by Shapeways applications mentioned in prior posts are responsible for everything that happens between a customer visiting our site and placing their order. Inshape is responsible for everything else - what happens between the order being placed, and the customer receiving it. Inshape’s functionality can be broken down into a few different silos:

  • Analysis - responsible for determining if files can be fabricated in the material in which they have been ordered. If they cannot, we will work with the customer to ensure that they can be produced. This step also provides the opportunity for file optimizations, such as hollowing, orientation, and scaling.

  • Routing and Planning - responsible for determining where in our global supply chain the customer part will be produced. Determining factors include: who can make the material, how large or small the part is, the customer’s shipping address (we try to produce as close to the customer as possible to get them their parts quickly), required certifications (such as ISO 9001), and more.

  • Manufacturing - responsible for guiding the manufacturing of our orders. This includes job preparation (including packing and nesting, which will be a separate article), workflow management, contract management, part traceability, and quality assurance

  • Fulfillment - responsible for collating multi-part orders, final QC checks, assembly, kitting, packing, shipping, and delivery.
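
To give a feel for the Routing and Planning step, here’s a minimal sketch of picking a facility from the factors listed above: material capability, part size, required certifications, and proximity to the customer. This isn’t Inshape’s actual routing logic - the `Facility` and `Part` types, their fields, and the “closest eligible facility wins” rule are assumptions I’m making purely for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Facility:
    name: str
    materials: set
    max_part_mm: float  # longest part dimension the facility can produce
    certifications: set = field(default_factory=set)
    distance_km: dict = field(default_factory=dict)  # ship-to region -> distance


@dataclass
class Part:
    material: str
    longest_side_mm: float
    ship_to_region: str
    required_certifications: set = field(default_factory=set)


def route_part(part: Part, facilities: List[Facility]) -> Optional[Facility]:
    """Return the closest facility able to make the part, or None if none can."""
    eligible = [
        facility
        for facility in facilities
        if part.material in facility.materials
        and part.longest_side_mm <= facility.max_part_mm
        and part.required_certifications <= facility.certifications
        and part.ship_to_region in facility.distance_km
    ]
    if not eligible:
        return None
    return min(eligible, key=lambda f: f.distance_km[part.ship_to_region])


if __name__ == "__main__":
    # Hypothetical facilities and distances.
    us = Facility("factory-us", {"nylon", "steel"}, 300.0, {"ISO 9001"},
                  {"US": 100.0, "EU": 6000.0})
    eu = Facility("factory-eu", {"nylon"}, 500.0, set(),
                  {"US": 6000.0, "EU": 200.0})
    chosen = route_part(Part("nylon", 100.0, "EU"), [us, eu])
    print(chosen.name if chosen else "no facility can make this part")
```

In practice the “and more” in that list matters a lot (capacity, cost, lead time), so a real router would likely score candidates rather than filter and take the nearest.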

But…why would you build this?

Inshape is what’s known in the business world as an Enterprise Resource Planning system. ERP systems power the backend of most businesses - chances are, if you’re dealing with customer management, inventory, and physical products, you’ve likely got an ERP system helping you out. Like our e-commerce example above, this isn’t a unique problem to Shapeways - there are commercially available ERP systems from big companies you know. Oracle, SAP, Siemens, and others all sell ERP systems. So, why didn’t we use one of them?

Two reasons:

  • Cost
    • ERP systems from big companies come with big price tags. At a few grand per seat, and a 20-30% annual recurring cost on those seats, we’d have spent a good portion of our company budget just on this software.
    • These software packages are incredibly full-featured, doing everything from accounting to CRM to Sales to forecasting. This was way more than we needed - we had neither the scale nor the staffing to leverage these features in a way that made sense for Shapeways.
  • Business
    • Most ERPs rely on your business selling a finite catalogue of products. Shapeways does not do this - as mentioned above, every time someone uploads and purchases a new model, that’s a new entry for us. On average, this happens a few hundred to thousand times a day. Software that would increase in cost as our model base grew didn’t make sense for us.

    • At the time, no ERPs were available that specifically targeted the 3d printing space. There were no machine connections, no file analysis or preparation tools, and no concept of visual sorting or other tools that are specific to the 3DP production cycle.

Inshape in a B2B World

Inshape itself is an application that’s built into our monolith - it’s in the same repo, uses the same database, and is deployed alongside shapeways.com. This made it easy to work with at the start, but now it exhibits all the characteristics of our commercial applications, which have been discussed in previous posts. When we made the business decision to pursue B2B customers, we knew that these customers would want access to Inshape. While the commercial front-end drives volume into our business, Inshape is really the engine that drives Shapeways. Without it, we’d never have been able to scale to producing thousands of parts per day.

So, we have partners that want access to Inshape…but usually not to all of it. Some will have manufacturing capabilities, and only want our manufacturing component to help drive their workflows. Others will have their own internal supply chains, and simply want to plan trays and assign volume to their production partners. We want to have the ability to treat these modules like Lego blocks, snapping them together on a per-partner basis to provide a tailor-made solution for our partners. You can probably see where this is going…composable features implemented as microservices.
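
Here’s a small sketch of what that Lego-block idea could look like in code: each silo is a step, and a partner’s configuration decides which steps get snapped together. The module names mirror the silos described earlier, but everything else (the `make_step` helper, the order dictionary, the `build_pipeline` function) is hypothetical and only there to show the shape of the idea, with each real step standing in for a call to a separate service.

```python
from typing import Callable, Dict, Iterable, List

Order = dict
Step = Callable[[Order], Order]


def make_step(name: str) -> Step:
    """Build a placeholder step that just records that it ran."""
    def step(order: Order) -> Order:
        return {**order, "completed": order.get("completed", []) + [name]}
    return step


# Declared in the order an order flows through them.
AVAILABLE_MODULES: Dict[str, Step] = {
    name: make_step(name)
    for name in ("analysis", "routing_planning", "manufacturing", "fulfillment")
}


def build_pipeline(enabled: Iterable[str]) -> List[Step]:
    """Pick the modules a partner has enabled, keeping the canonical order."""
    enabled = set(enabled)
    unknown = enabled - set(AVAILABLE_MODULES)
    if unknown:
        raise ValueError(f"unknown modules: {sorted(unknown)}")
    return [step for name, step in AVAILABLE_MODULES.items() if name in enabled]


def run(order: Order, pipeline: List[Step]) -> Order:
    for step in pipeline:
        order = step(order)
    return order


if __name__ == "__main__":
    # A hypothetical partner with its own supply chain only wants manufacturing.
    pipeline = build_pipeline(["manufacturing"])
    print(run({"id": 1}, pipeline))
```

A partner that only needs planning and manufacturing would simply get a different list of enabled modules, with no change to the modules themselves.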

Same Song, Different Dance

We approached splitting Inshape into services using the same approach described earlier - identify core vs support functionality, and begin breaking them up into potential services. With Inshape being the Secret Sauce that makes Shapeways go, almost all of its functionality was considered core. That settled, we started to identify user-facing front-end services, and from there figured out the backing services we’d need to drive them. Using the silos above, we decided to start with our Manufacturing functionality. Manufacturing is the tool that is used the most by our partners, and also has the most well-defined scope for its interactions. And, honestly - it was one of our oldest pieces of software at the company, and was badly in need of some re-working. The next post in this series will focus in on how we approached building a new Manufacturing Execution System at Shapeways - stay tuned!