
When a Tag Breaks: Incident Response

Intermediate · 8 min read

Before You Start: Set Up for Fast Response

Incident response speed depends on preparation. Before your first alert fires, complete this setup:

  • Slack or email alerts configured with per-domain routing. You need the alert to reach the right person within 60 seconds.
  • Dashboard bookmarked so you can open it in one click. Do not rely on searching for the URL.
  • GTM access ready. Ensure you have publish-level access to the GTM containers you monitor. If an agency client owns the container, get your account added before an incident happens.
  • Runbook document shared with your team. List the domains, their GTM containers, the primary contacts, and the escalation paths.

Reading an Alert

A TagDrishti alert (Slack or email) contains these fields:

  • Domain: Which site is affected. Example: www.clientsite.com.
  • Tag name: The specific tag that failed. Example: GA4 - purchase event.
  • Current success rate: The rate right now. Example: 42%.
  • Expected success rate: The baseline. Example: 96%.
  • Affected pages: Which pages show the failure. Example: /checkout/confirmation (100% failure).
  • Duration: How long the failure has been running. Example: 23 minutes.
  • Link: Direct link to the tag detail page on the dashboard.

Read the alert in full before taking action. The affected pages and duration fields tell you whether this is a site-wide incident or a localized issue.

Step 1: Check the Tag Health Dashboard

Click the link in the alert. The tag detail page shows:

  1. Timeline chart: Success rate over the last 24 hours. Look for the exact minute the drop started. A sharp cliff means a deployment or configuration change. A gradual decline means a third-party vendor issue or an intermittent network problem.
  2. Page breakdown: Which pages are affected. If only /checkout pages fail but /product pages are fine, the issue is specific to the checkout data layer or trigger.
  3. Error type distribution: Network block, timeout, missing parameters, or error response. This tells you where in the chain the failure occurs.
  4. Browser and device breakdown: If failures appear only on Safari or only on mobile, the root cause is likely a browser-specific JavaScript issue or a responsive design change that broke a DOM element the tag depends on.

Step 2: Identify the Root Cause

The error type from Step 1 narrows the cause. Here are the five most common root causes and how to confirm each:

Cause A: GTM Container Change

Signal: Sharp drop that correlates with a GTM publish timestamp. Check the GTM container’s Activity Log (Admin → Activity) for recent publishes.

Confirm: Open GTM Preview on the affected page. Check if the tag fires. Check if the trigger matches. Check if a variable returns the expected value. Compare the current container version with the previous version in GTM’s version history.

Cause B: Site Deployment / Code Change

Signal: The tag fires but required data layer variables are undefined or have changed format. The data completeness score dropped at the same time as the success rate.

Confirm: Open the browser console on the affected page. Run dataLayer.filter(e => e.event === 'purchase') and check the output. Compare the data layer schema against the GTM variable configuration.
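The console check above can be expanded into a small schema audit. A minimal sketch follows; the purchase event name and the ecommerce field names (transaction_id, value, currency) are assumptions — substitute whatever your GTM variables actually read. The mock entries exist only so the sketch runs outside a browser.

```javascript
// Paste into the browser console on the affected page.
// Falls back to mock entries when window.dataLayer is unavailable.
var dl = (typeof window !== 'undefined' && window.dataLayer) || [
  { event: 'page_view' },
  { event: 'purchase', ecommerce: { transaction_id: 'T-1001', value: 49.99, currency: 'USD' } }
];

var purchases = dl.filter(function (e) { return e.event === 'purchase'; });

// Field names are assumptions -- use the names your GTM variables read.
var required = ['transaction_id', 'value', 'currency'];
var report = purchases.map(function (e) {
  var ecom = e.ecommerce || {};
  var missing = required.filter(function (k) { return ecom[k] === undefined; });
  return { missing: missing };
});

// Any non-empty `missing` array points at the field the deploy broke.
console.log(purchases.length, report);
```

If the purchase push is absent entirely, the trigger never had a chance to fire; if it is present but a required field is missing or renamed, the GTM variable resolves to undefined and the tag sends an incomplete hit.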

Cause C: Vendor Outage

Signal: Timeout or error-response failures. The tag fires and the browser sends the request, but the endpoint does not respond or returns a 5xx status.

Confirm: Check the vendor’s status page. For GA4, check Google Cloud Status. For Facebook, check Meta Platform Status. If the vendor is down, there is nothing to fix on your end. Document the outage and wait.

Cause D: Consent Management Platform Block

Signal: The tag does not fire at all on pages where the CMP loads. The error type is “tag did not fire” rather than a network error. This often happens after a CMP configuration update.

Confirm: Open the site in a private browser window. Decline all consent. Check if the tag fires (it should not). Accept all consent. Check if the tag fires (it should). If the tag does not fire after consent, the CMP is blocking it incorrectly. Check the CMP’s tag classification settings.
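For sites running Google Consent Mode, the consent state the tag actually sees can be read out of the data layer. This is a rough sketch, not a universal check — CMPs push consent in different shapes, and the mock entries below stand in for a real page. Consent Mode commands arrive as ['consent', 'default' | 'update', {...}].

```javascript
// Scan the data layer for Consent Mode commands and fold them into
// the latest effective state. Mock entries let the sketch run
// outside a browser; on a real page, window.dataLayer is used.
var dl = (typeof window !== 'undefined' && window.dataLayer) || [
  ['consent', 'default', { analytics_storage: 'denied', ad_storage: 'denied' }],
  ['consent', 'update', { analytics_storage: 'granted', ad_storage: 'granted' }]
];

var state = {};
for (var i = 0; i < dl.length; i++) {
  var e = dl[i];
  // Consent commands are array-like: ['consent', 'default'|'update', {...}]
  if (e && e[0] === 'consent' && typeof e[2] === 'object') {
    for (var k in e[2]) { state[k] = e[2][k]; }
  }
}
console.log(state);
```

If the relevant storage signal reads granted here but the tag still does not fire, the blocker is the CMP's tag classification rather than the consent signal itself.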

Cause E: Ad Blocker / Browser Update

Signal: Network block errors. Gradual increase over days, not a cliff. Correlates with a browser update rollout or a popular ad blocker list update.

Confirm: Check the browser breakdown in the dashboard. If the failure is concentrated in one browser version, check that browser’s release notes for tracking prevention changes. For ad blockers, this is expected loss. Document the rate and move on.

Step 3: Fix It

Each root cause has a specific fix path:

  • GTM change: Revert the container to the previous version. Go to GTM → Admin → Container → Versions → select the last known good version → Publish. This restores the tag, trigger, and variable configuration to the pre-incident state. Then debug the new version in Preview before re-publishing.
  • Code change: Work with the development team to restore the data layer schema. Provide the exact variable names and expected formats. If a hotfix is faster than a full deploy, push a patch that adds the missing data layer pushes.
  • Vendor outage: No action required. Monitor the vendor status page. Set a reminder to verify recovery after the vendor reports resolution.
  • CMP block: Update the CMP configuration to re-classify the tag correctly. Test in a private window. Publish the CMP change.
  • Ad blocker: If the blocked rate is above 20%, consider implementing a server-side proxy for critical tags (the GA4 Measurement Protocol, for example). For most cases, document the rate and accept it.
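For the code-change path, the hotfix usually amounts to restoring the missing data layer push with the exact field names the GTM variables read. A hedged sketch, assuming a purchase event and a hypothetical `order` object — both names are illustrative, not your actual schema:

```javascript
// Restore the purchase push the deploy dropped. The event name and
// the `order` shape are assumptions -- match them to the GTM
// variable configuration before shipping.
function pushPurchase(dataLayer, order) {
  dataLayer.push({
    event: 'purchase',
    ecommerce: {
      transaction_id: order.id,
      value: order.total,
      currency: order.currency
    }
  });
}

// On the confirmation page this would be:
//   pushPurchase(window.dataLayer = window.dataLayer || [], order);
var dl = [];
pushPurchase(dl, { id: 'T-1001', total: 49.99, currency: 'USD' });
console.log(dl[0]);
```

Push before the GTM trigger evaluates (typically as early in the page as the order data is available), and keep the field names byte-for-byte identical to what the GTM variables expect — a renamed key fails exactly like a missing one.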

Step 4: Verify the Fix

  1. After deploying the fix, watch the TagDrishti Live Feed for the affected tag.
  2. Confirm new fires appear with success status.
  3. Wait 10 minutes. Check the success rate on the Tag Health panel. It should start climbing toward the baseline.
  4. Check the per-page breakdown to confirm all affected pages are recovering, not just the page you tested.
  5. If the success rate does not recover within 30 minutes, the fix did not fully resolve the issue. Return to Step 2.

Step 5: Document the Incident

TagDrishti maintains an incident timeline automatically. But you should add your own notes:

  1. Go to the incident in the dashboard (it is logged under Incidents in the left sidebar).
  2. Click Add Note. Record: what broke, when it broke, what caused it, what you did to fix it, and how long data was affected.
  3. Tag the incident with a category: GTM change, code deployment, vendor outage, consent issue, or ad blocker.
  4. If you are an agency, include this incident in the client’s next report. Frame it as: incident detected in X minutes, resolved in Y minutes, data impact limited to Z hours.

Every incident should make you faster next time. Review your incident log monthly. Look for patterns: same client, same tag, same root cause. Patterns reveal systemic issues that a process change or GTM restructuring can eliminate.
