# Production Monitoring: SLAs, Errors & User Behavior
Your code deployed successfully. Now what? Here's how to build production monitoring that actually tells you what's happening before your users complain.
Deployment is not the finish line. The most dangerous time for any feature is the first 24 hours in production. Without proper monitoring, you're flying blind - learning about problems from angry user emails instead of dashboards. Good monitoring tells you what's wrong before anyone else notices.
## SLAs: Setting Realistic Expectations
Service Level Agreements define what "working" means. The difference between uptime targets is bigger than it looks:
| SLA | Annual Downtime | Monthly Downtime |
|---|---|---|
| 99% | 3.65 days | 7.2 hours |
| 99.9% | 8.76 hours | 43.8 minutes |
| 99.99% | 52.6 minutes | 4.4 minutes |
| 99.999% | 5.26 minutes | 26.3 seconds |
Each additional nine roughly multiplies the engineering effort and cost by ten. Pick the SLA your business actually needs, not the one that sounds impressive. Most websites don't need five nines.
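The table above is just arithmetic on the allowed failure fraction. A minimal sketch (the function name `downtime_budget` is illustrative, not from any library) for turning an SLA percentage into a downtime budget:

```python
SECONDS_PER_YEAR = 365 * 24 * 3600

def downtime_budget(sla_percent: float, period_seconds: int = SECONDS_PER_YEAR) -> float:
    """Seconds of allowed downtime for a given SLA over a period."""
    return period_seconds * (1 - sla_percent / 100)

# 99.9% over a year leaves roughly 8.76 hours of budget
annual_hours = downtime_budget(99.9) / 3600
```

Running the same calculation per month (≈2,592,000 seconds) reproduces the monthly column.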
## Error Monitoring: What the Numbers Mean
### 4xx Errors: Client Problems
- 400 Bad Request - Client sent malformed data. Check your validation.
- 401 Unauthorized - Auth failed. Check token expiration, login flows.
- 403 Forbidden - Auth worked but permissions denied. Review access control.
- 404 Not Found - Missing resources. Broken links, deleted content, bad URLs.
- 429 Too Many Requests - Rate limiting kicked in. Someone's hammering your API.
4xx errors are often user errors, but spikes indicate UX problems or breaking changes.
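One simple way to catch such spikes is to compare the current 4xx rate against a recent baseline. A sketch, assuming you already compute both rates elsewhere (`is_spike` and the threshold factor are hypothetical choices, not a standard):

```python
def is_spike(current_rate: float, baseline_rate: float, factor: float = 3.0) -> bool:
    """Flag a 4xx spike: current error rate at least `factor` times the baseline."""
    if baseline_rate == 0:
        return current_rate > 0          # any errors where there were none before
    return current_rate / baseline_rate >= factor

# A jump from a 2% to a 9% 4xx rate (4.5x baseline) would trip this check
```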
### 5xx Errors: Your Problems
- 500 Internal Server Error - Something crashed. Check logs immediately.
- 502 Bad Gateway - Upstream service down. Check dependencies.
- 503 Service Unavailable - Overloaded or maintenance. Scale up or investigate.
- 504 Gateway Timeout - Upstream too slow. Database queries? External APIs?
5xx errors are always your responsibility. Each one is a user who had a bad experience.
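Tracking that responsibility starts with bucketing responses by status class. A minimal sketch (the `error_summary` helper is illustrative, not a library function):

```python
from collections import Counter

def error_summary(status_codes):
    """Bucket responses by class (2xx/4xx/5xx) and compute the 5xx rate."""
    classes = Counter(f"{code // 100}xx" for code in status_codes)
    total = len(status_codes)
    rate_5xx = classes["5xx"] / total if total else 0.0
    return dict(classes), rate_5xx

classes, rate = error_summary([200, 200, 404, 500, 200, 502])
```

In practice you would feed this from access logs or your load balancer's metrics rather than an in-memory list.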
## Tools That Actually Work
### Error Tracking
- Sentry - Catches exceptions with full stack traces, context, and user info
- Bugsnag - Similar to Sentry, strong mobile support
- Rollbar - Real-time error tracking with deployment correlation
### APM (Application Performance Monitoring)
- Datadog - Full stack observability, traces, metrics, logs unified
- New Relic - Deep application insights, database query analysis
- Dynatrace - AI-powered root cause analysis
### Alerting
- PagerDuty - On-call scheduling, escalation policies, incident management
- Opsgenie - Alerting with team routing
- Slack/Teams integrations - For non-critical alerts
## Alerting Without Alert Fatigue
The worst monitoring setup: alerts for everything. Your team learns to ignore them, and the real emergency gets lost in noise.
Design alerts with tiers:
- Page immediately (wake someone up) - Service down, data loss, security incident
- Urgent (respond within hours) - Error rate spike, degraded performance
- Normal (next business day) - Elevated warnings, capacity approaching limits
- Informational (don't alert) - Log for debugging, no action needed
If everything is urgent, nothing is.
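The tiers above amount to a routing table from severity to destination. A sketch under assumed destinations (the names `Severity`, `ROUTES`, and `route_alert` are hypothetical; substitute your own paging and ticketing integrations):

```python
from enum import Enum

class Severity(Enum):
    PAGE = "page"        # wake someone up right now
    URGENT = "urgent"    # respond within hours
    NORMAL = "normal"    # next business day
    INFO = "info"        # log only, never alert

# Hypothetical destinations; map these to your real systems.
ROUTES = {
    Severity.PAGE: "pagerduty:high",
    Severity.URGENT: "pagerduty:low",
    Severity.NORMAL: "ticket-queue",
    Severity.INFO: None,             # deliberately silent
}

def route_alert(severity: Severity):
    """Return the destination for an alert, or None to stay quiet."""
    return ROUTES[severity]
```

The key design choice is that `INFO` maps to `None`: making silence an explicit, reviewable decision is what keeps the pager meaningful.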
## User Behavior Analytics
Technical metrics tell you what's broken. Behavior analytics tell you what's working (or not).
### Heatmaps
Where do users actually click? Scroll? Hover? Heatmaps reveal:
- CTAs that nobody notices
- Content users never scroll to
- Elements users try to click that aren't clickable
### Session Replay
Watch recordings of actual user sessions (anonymized). You'll see:
- Rage clicks when something doesn't respond
- Confusion navigating your UI
- Steps where users abandon flows
Tools like FullStory, Hotjar, and LogRocket make this easy.
### Funnel Analysis
For any multi-step process (signup, checkout, onboarding):
- How many users start?
- Where do they drop off?
- What's different about users who complete vs. abandon?
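The first two questions reduce to step-to-step conversion math. A sketch, assuming you can count users reaching each step (`funnel_report` and the step names are illustrative):

```python
def funnel_report(step_counts):
    """Step-to-step conversion for an ordered funnel.

    step_counts: list of (step_name, users_reaching_step) pairs.
    Returns (step_name, conversion_from_previous, drop_off) tuples.
    """
    report = []
    for (_, prev_n), (name, n) in zip(step_counts, step_counts[1:]):
        conversion = n / prev_n if prev_n else 0.0
        report.append((name, conversion, 1 - conversion))
    return report

funnel = [("visited", 1000), ("signed_up", 400), ("onboarded", 120)]
# 40% of visitors sign up; 30% of signups finish onboarding
```

The third question (what's different about completers vs. abandoners) needs segmentation on top of this, which analytics tools handle for you.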
## Acting on Data
Metrics without action are just expensive charts. Create feedback loops:
- Weekly metrics review - What's trending wrong?
- Error budgets - When reliability drops, prioritize fixes over features
- Postmortems - After incidents, document what happened and how to prevent it
- Alerts that create tickets - Don't rely on memory
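An error budget can be made concrete with a request-based SLO: the budget is the fraction of requests you're allowed to fail, and "prioritize fixes over features" kicks in as it runs out. A sketch (the function name and clamping behavior are illustrative choices):

```python
def error_budget_remaining(slo_percent: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of a request-based error budget still unspent (clamped at 0)."""
    allowed_failures = total_requests * (1 - slo_percent / 100)
    if allowed_failures <= 0:
        return 0.0
    return max(0.0, 1 - failed_requests / allowed_failures)

# A 99.9% SLO over 1,000,000 requests allows ~1,000 failures;
# 250 failures so far leaves about 75% of the budget
remaining = error_budget_remaining(99.9, 1_000_000, 250)
```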
## Postmortems That Actually Improve Things
Most postmortems are blame sessions that produce nothing. Effective postmortems:
- Are blameless - Focus on systems, not individuals
- Establish timeline - What happened when?
- Identify root cause - Why did the system allow this?
- List action items - Concrete tasks with owners and deadlines
- Follow up - Did we actually complete those items?
The goal isn't to prevent all failures - it's to prevent the same failure twice.
## Starting Point
If you have nothing today, start with:
- Error tracking - Know when things break
- Uptime monitoring - Know when the site is down
- One key business metric - Signups, purchases, whatever matters most
You can add sophistication later. First, stop being blind.
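Even uptime monitoring can start as a single synthetic probe. A minimal sketch using only the standard library (the `is_healthy`/`check` names and the "2xx within timeout" pass criterion are assumptions, not a standard):

```python
import time
import urllib.request

def is_healthy(status: int, elapsed_s: float, timeout_s: float = 5.0) -> bool:
    """A probe passes only on a 2xx response that arrived within the timeout."""
    return 200 <= status < 300 and elapsed_s <= timeout_s

def check(url: str, timeout_s: float = 5.0) -> bool:
    """Run one synthetic probe; real monitors repeat this from several regions."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return is_healthy(resp.status, time.monotonic() - start, timeout_s)
    except Exception:   # DNS failure, timeout, connection refused: all count as down
        return False
```

Hosted uptime services add what this can't: multiple probe locations, history, and alerting when checks fail.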
Your code isn't done when it deploys. It's done when you can prove it's working. Build monitoring that gives you confidence - and sleep.
## About the Author
RJ Lindelof is a technology executive with 35+ years of experience spanning Fortune 500 companies to startups. He doesn't just talk about AI; he implements it to solve real-world business problems. RJ's approach has led to significant improvements in team velocity, code quality, and time-to-market.