# Production Monitoring: SLAs, Errors & User Behavior
Your code deployed successfully. Now what? Here's how to build production monitoring that actually tells you what's happening before your users complain.
Deployment is not the finish line. The most dangerous time for any feature is the first 24 hours in production. Without proper monitoring, you're flying blind - learning about problems from angry user emails instead of dashboards. Good monitoring tells you what's wrong before anyone else notices.
## SLAs: Setting Realistic Expectations
Service Level Agreements define what "working" means. The difference between uptime targets is bigger than it looks:
| SLA | Annual Downtime | Monthly Downtime |
|---|---|---|
| 99% | 3.65 days | 7.2 hours |
| 99.9% | 8.76 hours | 43.8 minutes |
| 99.99% | 52.6 minutes | 4.4 minutes |
| 99.999% | 5.26 minutes | 26.3 seconds |
Each additional nine roughly multiplies the engineering effort and cost by ten. Pick the SLA your business actually needs, not the one that sounds impressive. Most websites don't need five nines.
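The table above is just arithmetic on the allowed failure fraction. A minimal sketch (the function name `downtime_budget` is illustrative, not from any library) for turning an SLA percentage into a downtime budget:

```python
SECONDS_PER_YEAR = 365 * 24 * 3600

def downtime_budget(sla_percent: float, period_seconds: int = SECONDS_PER_YEAR) -> float:
    """Seconds of allowed downtime for a given SLA over a period."""
    return period_seconds * (1 - sla_percent / 100)

# 99.9% over a year leaves roughly 8.76 hours of budget
annual_hours = downtime_budget(99.9) / 3600
```

Running the same calculation per month (≈2,592,000 seconds) reproduces the monthly column.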
## Error Monitoring: What the Numbers Mean
### 4xx Errors: Client Problems
- 400 Bad Request - Client sent malformed data. Check your validation.
- 401 Unauthorized - Auth failed. Check token expiration, login flows.
- 403 Forbidden - Auth worked but permissions denied. Review access control.
- 404 Not Found - Missing resources. Broken links, deleted content, bad URLs.
- 429 Too Many Requests - Rate limiting kicked in. Someone's hammering your API.
4xx errors are often user errors, but spikes indicate UX problems or breaking changes.
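One simple way to catch such spikes is to compare the current 4xx rate against a recent baseline. A sketch, assuming you already compute both rates elsewhere (`is_spike` and the threshold factor are hypothetical choices, not a standard):

```python
def is_spike(current_rate: float, baseline_rate: float, factor: float = 3.0) -> bool:
    """Flag a 4xx spike: current error rate at least `factor` times the baseline."""
    if baseline_rate == 0:
        return current_rate > 0          # any errors where there were none before
    return current_rate / baseline_rate >= factor

# A jump from a 2% to a 9% 4xx rate (4.5x baseline) would trip this check
```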
### 5xx Errors: Your Problems
- 500 Internal Server Error - Something crashed. Check logs immediately.
- 502 Bad Gateway - Upstream service down. Check dependencies.
- 503 Service Unavailable - Overloaded or maintenance. Scale up or investigate.
- 504 Gateway Timeout - Upstream too slow. Database queries? External APIs?
5xx errors are always your responsibility. Each one is a user who had a bad experience.
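Tracking that responsibility starts with bucketing responses by status class. A minimal sketch (the `error_summary` helper is illustrative, not a library function):

```python
from collections import Counter

def error_summary(status_codes):
    """Bucket responses by class (2xx/4xx/5xx) and compute the 5xx rate."""
    classes = Counter(f"{code // 100}xx" for code in status_codes)
    total = len(status_codes)
    rate_5xx = classes["5xx"] / total if total else 0.0
    return dict(classes), rate_5xx

classes, rate = error_summary([200, 200, 404, 500, 200, 502])
```

In practice you would feed this from access logs or your load balancer's metrics rather than an in-memory list.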
## Tools That Actually Work
### Error Tracking
- Sentry - Catches exceptions with full stack traces, context, and user info
- Bugsnag - Similar to Sentry, strong mobile support
- Rollbar - Real-time error tracking with deployment correlation
### APM (Application Performance Monitoring)
- Datadog - Full stack observability, traces, metrics, logs unified
- New Relic - Deep application insights, database query analysis
- Dynatrace - AI-powered root cause analysis
### Alerting
- PagerDuty - On-call scheduling, escalation policies, incident management
- Opsgenie - Alerting with team routing
- Slack/Teams integrations - For non-critical alerts
## Alerting Without Alert Fatigue
The worst monitoring setup: alerts for everything. Your team learns to ignore them, and the real emergency gets lost in noise.
Design alerts with tiers:
- Page immediately (wake someone up) - Service down, data loss, security incident
- Urgent (respond within hours) - Error rate spike, degraded performance
- Normal (next business day) - Elevated warnings, capacity approaching limits
- Informational (don't alert) - Log for debugging, no action needed
If everything is urgent, nothing is.
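The tiers above amount to a routing table from severity to destination. A sketch under assumed destinations (the names `Severity`, `ROUTES`, and `route_alert` are hypothetical; substitute your own paging and ticketing integrations):

```python
from enum import Enum

class Severity(Enum):
    PAGE = "page"        # wake someone up right now
    URGENT = "urgent"    # respond within hours
    NORMAL = "normal"    # next business day
    INFO = "info"        # log only, never alert

# Hypothetical destinations; map these to your real systems.
ROUTES = {
    Severity.PAGE: "pagerduty:high",
    Severity.URGENT: "pagerduty:low",
    Severity.NORMAL: "ticket-queue",
    Severity.INFO: None,             # deliberately silent
}

def route_alert(severity: Severity):
    """Return the destination for an alert, or None to stay quiet."""
    return ROUTES[severity]
```

The key design choice is that `INFO` maps to `None`: making silence an explicit, reviewable decision is what keeps the pager meaningful.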
## User Behavior Analytics
Technical metrics tell you what's broken. Behavior analytics tell you what's working (or not).
### Heatmaps
Where do users actually click? Scroll? Hover? Heatmaps reveal:
- CTAs that nobody notices
- Content users never scroll to
- Elements users try to click that aren't clickable
### Session Replay
Watch recordings of actual user sessions (anonymized). You'll see:
- Rage clicks when something doesn't respond
- Confusion navigating your UI
- Steps where users abandon flows
Tools like FullStory, Hotjar, and LogRocket make this easy.
### Funnel Analysis
For any multi-step process (signup, checkout, onboarding):
- How many users start?
- Where do they drop off?
- What's different about users who complete vs. abandon?
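The first two questions reduce to step-to-step conversion math. A sketch, assuming you can count users reaching each step (`funnel_report` and the step names are illustrative):

```python
def funnel_report(step_counts):
    """Step-to-step conversion for an ordered funnel.

    step_counts: list of (step_name, users_reaching_step) pairs.
    Returns (step_name, conversion_from_previous, drop_off) tuples.
    """
    report = []
    for (_, prev_n), (name, n) in zip(step_counts, step_counts[1:]):
        conversion = n / prev_n if prev_n else 0.0
        report.append((name, conversion, 1 - conversion))
    return report

funnel = [("visited", 1000), ("signed_up", 400), ("onboarded", 120)]
# 40% of visitors sign up; 30% of signups finish onboarding
```

The third question (what's different about completers vs. abandoners) needs segmentation on top of this, which analytics tools handle for you.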
## Acting on Data
Metrics without action are just expensive charts. Create feedback loops:
- Weekly metrics review - What's trending wrong?
- Error budgets - When reliability drops, prioritize fixes over features
- Postmortems - After incidents, document what happened and how to prevent it
- Alerts that create tickets - Don't rely on memory
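An error budget can be made concrete with a request-based SLO: the budget is the fraction of requests you're allowed to fail, and "prioritize fixes over features" kicks in as it runs out. A sketch (the function name and clamping behavior are illustrative choices):

```python
def error_budget_remaining(slo_percent: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of a request-based error budget still unspent (clamped at 0)."""
    allowed_failures = total_requests * (1 - slo_percent / 100)
    if allowed_failures <= 0:
        return 0.0
    return max(0.0, 1 - failed_requests / allowed_failures)

# A 99.9% SLO over 1,000,000 requests allows ~1,000 failures;
# 250 failures so far leaves about 75% of the budget
remaining = error_budget_remaining(99.9, 1_000_000, 250)
```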
## Postmortems That Actually Improve Things
Most postmortems are blame sessions that produce nothing. Effective postmortems:
- Are blameless - Focus on systems, not individuals
- Establish timeline - What happened when?
- Identify root cause - Why did the system allow this?
- List action items - Concrete tasks with owners and deadlines
- Follow up - Did we actually complete those items?
The goal isn't to prevent all failures - it's to prevent the same failure twice.
## Starting Point
If you have nothing today, start with:
- Error tracking - Know when things break
- Uptime monitoring - Know when the site is down
- One key business metric - Signups, purchases, whatever matters most
You can add sophistication later. First, stop being blind.
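Even uptime monitoring can start as a single synthetic probe. A minimal sketch using only the standard library (the `is_healthy`/`check` names and the "2xx within timeout" pass criterion are assumptions, not a standard):

```python
import time
import urllib.request

def is_healthy(status: int, elapsed_s: float, timeout_s: float = 5.0) -> bool:
    """A probe passes only on a 2xx response that arrived within the timeout."""
    return 200 <= status < 300 and elapsed_s <= timeout_s

def check(url: str, timeout_s: float = 5.0) -> bool:
    """Run one synthetic probe; real monitors repeat this from several regions."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return is_healthy(resp.status, time.monotonic() - start, timeout_s)
    except Exception:   # DNS failure, timeout, connection refused: all count as down
        return False
```

Hosted uptime services add what this can't: multiple probe locations, history, and alerting when checks fail.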
Your code isn't done when it deploys. It's done when you can prove it's working. Build monitoring that gives you confidence - and sleep.
## About the Author
RJ Lindelof is a technology executive with 35+ years of experience spanning Fortune 500 companies to startups. He doesn't just talk about AI; he implements it to solve real-world business problems. RJ's approach has led to significant improvements in team velocity, code quality, and time-to-market.