Here's what worked for the last two teams I managed. We used Slack as the hub for everything. We had some brittle [1] services, so cron scripts ran every minute and @-mentioned someone in Slack when one went down. We also rolled our own dumb logger: initially just a middleware that captured requests and responses, plus certain events logged to Slack and to disk from the code via `event(enum, json)`. The Slack bot could also dump info about users or events through slash commands, provided you had an id. From these logs and other bits added over time we could see when execs connected, when someone had trouble with auth, when a job or method took abnormally long, the current active sessions, and so on.

This grew to support the CEO, marketing, and other devs, and got pretty involved; at some point we had small services tied in that could visualize GeoJSON on a map for a completed trip, dump a session replay, or pull all stack traces for the day. For third-party services we couldn't tie into directly, we used a proxy setup: instead of calling them directly, we called them through a wrapper where we could capture data, so a call to `api.somesite.com/v1/events` became `mysite.com/proxy?site=api.somesite.com/v1/events` in our apps, and whenever our clients hit it we knew about it and could log again.
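For a concrete picture, here's a minimal sketch of what the `event(enum, json)` logger and the proxy wrapper could look like, assuming Flask and requests; the webhook URL, log path, event names, and the 5-second threshold are my own placeholders, not the actual setup described above.

```python
# Sketch only: the "dumb logger" middleware + proxy wrapper, assuming Flask and requests.
# SLACK_WEBHOOK_URL, LOG_PATH, the Event names, and the thresholds are illustrative.
import enum
import json
import time

import requests
from flask import Flask, Response, g, request

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # assumed incoming webhook
LOG_PATH = "/var/log/app-events.jsonl"                              # assumed log file

app = Flask(__name__)


class Event(enum.Enum):  # example event types, not the original enum
    AUTH_FAILED = "auth_failed"
    SLOW_REQUEST = "slow_request"
    THIRD_PARTY_CALL = "third_party_call"


def event(kind: Event, payload: dict) -> None:
    """Log an event to disk and mirror it to a Slack channel."""
    record = {"ts": time.time(), "event": kind.value, **payload}
    with open(LOG_PATH, "a") as f:
        f.write(json.dumps(record) + "\n")
    # Slack incoming webhooks accept a plain {"text": ...} JSON body.
    requests.post(SLACK_WEBHOOK_URL, json={"text": f"`{kind.value}` {json.dumps(payload)}"})


@app.before_request
def _start_timer():
    g.t0 = time.time()


@app.after_request
def _flag_slow_requests(resp):
    took = time.time() - getattr(g, "t0", time.time())
    if took > 5:  # arbitrary "abnormally long" threshold
        event(Event.SLOW_REQUEST, {"path": request.path, "seconds": round(took, 2)})
    return resp


@app.route("/proxy")
def proxy():
    # Wrap third-party calls so they stay visible to us:
    # /proxy?site=api.somesite.com/v1/events -> https://api.somesite.com/v1/events
    site = request.args.get("site", "")
    upstream = requests.get(f"https://{site}", timeout=30)
    event(Event.THIRD_PARTY_CALL, {"site": site, "status": upstream.status_code})
    return Response(upstream.content, status=upstream.status_code,
                    content_type=upstream.headers.get("Content-Type", "application/octet-stream"))
```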
Since this seems close enough to the problem I had, you could take a similar approach: start with what's being requested or with the recurring problems, and build a central hub others can follow, ingesting all of this into Discord or Slack with appropriate channels such as #3rd-party-uptimes, #backups, #raw-logs, #events (a rough sketch of one such check is below). With this in place we rarely used our dashboards or Bugsnag, or needed to SSH into a server to pull access or error logs.
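As an example of the kind of per-minute cron check that could feed a channel like #3rd-party-uptimes: every URL, the webhook, and the member ID below are hypothetical.

```python
# Hypothetical per-minute health check run from cron, posting failures to a
# Slack channel such as #3rd-party-uptimes. All endpoints and IDs are placeholders.
import requests

CHECKS = {
    "payments-api": "https://api.example.com/healthz",   # assumed endpoint
    "backup-host": "https://backups.example.com/ping",   # assumed endpoint
}
UPTIME_WEBHOOK = "https://hooks.slack.com/services/AAA/BBB/CCC"  # webhook for #3rd-party-uptimes
ON_CALL = "<@U0123456>"  # Slack member ID to @-mention when something is down


def check(name: str, url: str) -> None:
    try:
        ok = requests.get(url, timeout=10).status_code == 200
    except requests.RequestException:
        ok = False
    if not ok:
        requests.post(UPTIME_WEBHOOK, json={
            "text": f"{ON_CALL} :red_circle: {name} failed its health check ({url})",
        })


if __name__ == "__main__":
    # assumed crontab entry: * * * * * /usr/bin/python3 /opt/checks/uptime.py
    for name, url in CHECKS.items():
        check(name, url)
```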
- [1] This one was particularly brittle because they had an org policy of randomly resetting VPN passwords, and the only way to change it was to use a desktop client to basically set the same password again.
I'd argue this is an issue not just with "non-tech" folks, but even with engineers who don't have experience with Prometheus and other time-series databases. Learning PromQL always seemed like a lot to ask of other engineers. Grafana has made it easier to explore and build queries over time, but there are still quirks and nuances that are difficult to explain to people whose role doesn't typically involve scouring through metrics.
yeah promql is a nightmare... don't know if datadog's better at that
https://www.statuspage.io/ or an open equivalent (https://github.com/oneuptime/oneuptime), perhaps? In my experience it's a straightforward shim to implement, distilling engineering telemetry down to a non-tech interface. Think of it as scaling engineering comms to the non-engineering parts of the org.
yeah but in my experience the information is either very shallow (like a status page) or too technical (like traces). i feel like the non-engineering parts of the org still want a view of what's slow, what's fast, how much data we ingested, and things like that
Engineering is always about tradeoffs. Observe, measure, and iterate accordingly.