
How The Home Depot gets a single pane of glass for metrics across 2,200 stores
“Are we making it easier for customers to buy hammers or are we stuck on toil?” At The Home Depot, part of SRE’s responsibility is to keep our developers focused on building the technologies that make it easier for our customers to buy home improvement goods and services. So we like to use the hammer question as a barometer of whether a process or technology needs to be automated or outsourced, or whether it is something that deserves our attention as engineers.
We run a highly distributed, hybrid- and multi-cloud IT environment at The Home Depot which connects our stores to our cloud and data centers. You can read about the transformation that occurred when we switched to BigQuery, making sales forecasting, inventory management, and performance scorecards more effective. However, to collect that data for broad analytics, our systems need to be up and running. Monitoring the infrastructure and applications that run across all of our environments used to be a complex process. Google Cloud Managed Service for Prometheus helped us pull together metrics, a key component of our observability stack, so we now have a single-pane-of-glass view for our developers, operators, SRE, and security teams.
Monitoring more than 2,200 stores running bare-metal Kubernetes
We run our applications in on-prem data centers, the cloud, and at the edge in our stores with a mix of managed and self-managed Kubernetes. In fact, we have bare-metal Kubernetes running at each of our store locations — over 2,200 of them. You can imagine the large number of metrics that we’re dealing with; to give you some sense, if we don’t compress data, egress from each of our stores can run in the 20-30 Mbps range. Managing these metrics quickly became a huge operational burden. In particular, we struggled with:
Storage federation: Open-source Prometheus is not built for scale. By default it runs on one machine and stores metrics locally on that machine. As your applications grow, you will quickly exceed the ability of that single machine to scrape and store the metrics. To deal with this you can start federating your Prometheus metrics, which means aggregating them from multiple machines and storing them centrally (see the sketch after this list). We initially tried using Thanos, an open-source solution to aggregate and store metrics, but it took a lot of engineering resources to maintain.
Uptime: As your federation becomes more complex, you need to maintain an ever-increasing infrastructure footprint and be ready to deal with changes to metrics that break the federation structure. Eventually, you have a team that is really good at running a metrics scraping, storage, and querying service. Going back to the question above: as an SRE manager, is running this metrics operation making it easier for customers to buy hammers, or is this operational toil that we need to consider outsourcing?
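To make the federation idea concrete, here is a minimal Go sketch of what a federating scraper does under the hood: it pulls selected series from a downstream Prometheus server’s /federate endpoint. The hostname and job label are hypothetical, and in practice this is configured declaratively in the central Prometheus server’s scrape config rather than written by hand.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
)

func main() {
	// Hypothetical store-local Prometheus server; the real address and
	// job label depend on your environment.
	base := "http://store-prometheus.example.internal:9090/federate"

	// match[] tells the downstream server which series to expose.
	params := url.Values{}
	params.Add("match[]", `{job="store-app"}`)

	resp, err := http.Get(base + "?" + params.Encode())
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// The /federate endpoint returns series in the Prometheus text
	// exposition format; a central server scrapes and stores these.
	body, _ := io.ReadAll(resp.Body)
	fmt.Println(string(body))
}
```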

For us, the right answer was simply to use a service for all of this, and we chose Google Cloud Managed Service for Prometheus. It allows us to keep everything we love about Prometheus, including the ecosystem and the flexibility — we can monitor applications, infrastructure, and literally anything else that emits Prometheus-format metrics — while offloading the heavy operational burden of scaling it.
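As an illustration of that flexibility, here is a minimal Go sketch, using the open-source client_golang library, of a service that exposes Prometheus-format metrics on /metrics; any Prometheus-compatible collector, including the managed one, can scrape an endpoint like this. The metric name, endpoint, and port are hypothetical.

```go
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Hypothetical counter; anything registered here shows up on /metrics
// in the Prometheus exposition format.
var checkoutsTotal = promauto.NewCounter(prometheus.CounterOpts{
	Name: "store_checkouts_total",
	Help: "Total checkout requests handled by this service.",
})

func main() {
	http.HandleFunc("/checkout", func(w http.ResponseWriter, r *http.Request) {
		checkoutsTotal.Inc()
		w.WriteHeader(http.StatusOK)
	})

	// Expose the default registry for any Prometheus-compatible scraper.
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":8080", nil)
}
```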
Creating a “single pane of glass” for observability at The Home Depot
Part of what I do as an SRE director is make the developers and operators on my team more effective by providing them with processes and tools they can use to build better applications. Our observability stack provides a comprehensive view of logs, metrics, and traces that are connected in a way that gives us visibility across our IT footprint and the data we need for root cause analysis.

Logs: We generate a huge volume of logs across our applications and infrastructure, and we use BigQuery to store and query them. The powerful search capability of BigQuery makes it easy to pull up stack traces whenever we encounter an exception in our code workflows (a query sketch follows this list).
Metrics: We can keep an eye on what is happening in real time across our applications and infrastructure. In addition to the metrics we are all used to, I want to call out exemplars as particularly useful elements of our observability strategy. Exemplars add data, such as a trace ID, to the metrics that an application is producing. Without exemplars you have to investigate issues such as latency through guesswork across different UIs. It is inefficient and less precise to review a particular timeframe in the metrics, then review the same timeframe in your traces, and draw the conclusion that some event happened (see the second sketch after this list).
Traces: We use OpenTelemetry and OpenTracing to provide visibility into traces and spans so we can create service and application dependency graphs.
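For the logs piece, here is a rough sketch of how a stack-trace search against BigQuery might look from Go, using the cloud.google.com/go/bigquery client. The project, dataset, table, and column names are hypothetical stand-ins for wherever a log pipeline lands its records.

```go
package main

import (
	"context"
	"fmt"

	"cloud.google.com/go/bigquery"
	"google.golang.org/api/iterator"
)

func main() {
	ctx := context.Background()

	// Hypothetical project ID.
	client, err := bigquery.NewClient(ctx, "my-project")
	if err != nil {
		panic(err)
	}
	defer client.Close()

	// Hypothetical dataset, table, and columns for application logs.
	q := client.Query(`
		SELECT timestamp, severity, message
		FROM ` + "`my-project.app_logs.entries`" + `
		WHERE severity = 'ERROR'
		  AND message LIKE '%NullPointerException%'
		ORDER BY timestamp DESC
		LIMIT 20`)

	it, err := q.Read(ctx)
	if err != nil {
		panic(err)
	}
	for {
		var row []bigquery.Value
		if err := it.Next(&row); err == iterator.Done {
			break
		} else if err != nil {
			panic(err)
		}
		fmt.Println(row)
	}
}
```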
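And to show how exemplars tie metrics to traces, here is a minimal Go sketch, assuming the service is already instrumented with OpenTelemetry: a latency histogram records each observation with the active span’s trace ID attached as an exemplar, so a latency spike on a dashboard can link straight to the trace behind it. The metric, tracer, and handler names are hypothetical, and exemplars only appear when metrics are served in the OpenMetrics format.

```go
package main

import (
	"context"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
	"go.opentelemetry.io/otel"
)

// Hypothetical latency histogram for checkout requests.
var checkoutLatency = promauto.NewHistogram(prometheus.HistogramOpts{
	Name:    "checkout_request_duration_seconds",
	Help:    "Checkout request latency.",
	Buckets: prometheus.DefBuckets,
})

func processCheckout(ctx context.Context) {} // hypothetical business logic

func handleCheckout(w http.ResponseWriter, r *http.Request) {
	// Assumes an OpenTelemetry tracer provider is configured elsewhere.
	ctx, span := otel.Tracer("checkout-service").Start(r.Context(), "handleCheckout")
	defer span.End()

	start := time.Now()
	processCheckout(ctx)
	elapsed := time.Since(start).Seconds()

	// Attach the current trace ID to this observation as an exemplar,
	// so dashboards can jump from a latency bucket to the exact trace.
	checkoutLatency.(prometheus.ExemplarObserver).ObserveWithExemplar(
		elapsed,
		prometheus.Labels{"trace_id": span.SpanContext().TraceID().String()},
	)
	w.WriteHeader(http.StatusOK)
}

func main() {
	http.HandleFunc("/checkout", handleCheckout)
	// Exemplars are only exposed via the OpenMetrics format.
	http.Handle("/metrics", promhttp.HandlerFor(
		prometheus.DefaultGatherer,
		promhttp.HandlerOpts{EnableOpenMetrics: true},
	))
	http.ListenAndServe(":8080", nil)
}
```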
What’s next for The Home Depot
We are working closely with the Google Cloud team to get even more features incorporated into Managed Service for Prometheus to help us round out our single-pane-of-glass goals. Support for exemplars in the managed collector is about to be added by the Google Cloud team, and we will incorporate that as soon as it’s ready. Further, they are working to expand PromQL support throughout their Cloud operations suite so that their built-in alerting can use PromQL.
I am always looking for people who are passionate about Site Reliability Engineering and DevOps, so please take a look at our Home Depot jobs board. And if you want to get more in-depth on this topic, check out the podcast I did with the Google Cloud team.