I know you can export directly to the backend, but the collector typically uses less than 50MB of RAM in my experience (even when handling lots of traces), and it's pretty easy to add a sidecar to however you deploy your backends nowadays. With Grafana's SaaS, metrics could look a little spiky or generally weird without the collector but normal with it, so I just default to using it now.
It's a shame the docs on it are still quite bad. The example config in the article here looks almost identical to the one we use everywhere, just without the redaction processor, and should probably be pasted somewhere into the official docs.
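Roughly the shape, as a minimal sketch (the backend endpoint and the redaction allow-list here are placeholders, not our real values):

    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318

    processors:
      memory_limiter:     # refuse data before the collector OOMs
        check_interval: 1s
        limit_mib: 400
      redaction:          # drop span attributes that aren't explicitly allowed
        allow_all_keys: false
        allowed_keys:
          - http.method
          - http.status_code
      batch:              # batch before export to cut request overhead
        timeout: 5s

    exporters:
      otlphttp:
        endpoint: https://otlp.example.com   # placeholder backend

    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter, redaction, batch]
          exporters: [otlphttp]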
Every provider seems to produce their own soft fork of the collector for branding (e.g. Alloy, ADOT) and slightly changes the configuration, which doesn't help.
I happen to work with otel a lot so I'll offer a few of my thoughts:
- Consider decoupling your collector from whatever is consuming your traces with something like Kafka. Traces can be pretty heavy, and it can be tricky to scale collectors. If something downstream goes down, it's probably a good idea to keep writing the traces to a queue or topic (see the sketch after this list).
- https://www.otelbin.io is a nice little tool to help with collector configuration
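On the Kafka point, the decoupling is just two collector tiers: an edge tier that writes spans to a topic and a consumer tier that drains it. A hedged sketch (broker addresses and the topic name are placeholders):

    # edge collector: receive OTLP, write to Kafka
    receivers:
      otlp:
        protocols:
          grpc:
    exporters:
      kafka:
        protocol_version: 2.0.0
        brokers: ["kafka-1:9092", "kafka-2:9092"]   # placeholder brokers
        topic: otlp_spans
    service:
      pipelines:
        traces:
          receivers: [otlp]
          exporters: [kafka]

    # consumer collector: drain Kafka, write to the backend
    receivers:
      kafka:
        protocol_version: 2.0.0
        brokers: ["kafka-1:9092", "kafka-2:9092"]
        topic: otlp_spans
        group_id: otel-consumers
    exporters:
      otlphttp:
        endpoint: https://otlp.example.com   # placeholder backend
    service:
      pipelines:
        traces:
          receivers: [kafka]
          exporters: [otlphttp]

If the backend goes down, spans keep accumulating in the topic and the consumer tier catches up when it recovers.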
I've been putting off a self-hosted observability setup for a long time. Any recommendations on the basis of ease of setup and operation? (For something low-to-medium scale.)
My ideal setup would be to just write SQL on telemetry data and plot dashboards / set alerts.
Also, thoughts on Vector vs otel agent?
> Also, thoughts on Vector vs otel agent?
IMO, with the current tech, it entirely depends on what data you're talking about.
For metrics and traces, I would personally use the OTel collector. You get much more flexibility, and it's pretty easy to write custom processors in Go. Support for traces is quite mature, and metrics isn't far off. We've been running collectors at production scale for metric and trace ingest for the past couple of years, on the order of 1M events/sec (metric datapoints or spans). You mentioned low volume, so that's less important, but I wanted to mention it in case others find this comment.
Logs are a bit different. We looked into this in the past year. Vector has emerging support for OTLP, but it's pretty early. Still, I bet it's pretty straightforward if your backend can ingest via OTLP. Our main concern with running the otel-collector as the log ingest agent was throughput/performance: Vector is battle-tested, while otel is still a bit early in this space. I imagine the gap will close over time, but I would probably still reach for Vector for this use case at higher scale. That said, YMMV, and as with any technical decision, empirical data and benchmarking on your workloads are the best way to determine the tradeoffs.
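If you do go the Vector route for logs, the OTLP side is a single source; a hedged sketch (the sink type and addresses are placeholders for whatever your backend actually ingests):

    sources:
      otel:
        type: opentelemetry
        grpc:
          address: 0.0.0.0:4317
        http:
          address: 0.0.0.0:4318
    sinks:
      backend:
        type: http                        # stand-in for your real sink type
        inputs: ["otel.logs"]             # the source exposes a named `logs` output
        uri: https://logs.example.com/    # placeholder endpoint
        encoding:
          codec: json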
For your scale you could probably get away with an OTel Collector DaemonSet, and maybe a Deployment with the Target Allocator (to distribute Prometheus scrape targets), and call it a day :)
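With the OTel Operator, that setup is roughly one custom resource like this (a sketch; names and the backend endpoint are placeholders, and I've used statefulset mode since the Target Allocator shards targets across stable replicas; the node-local DaemonSet would be a second, similar resource with mode: daemonset):

    apiVersion: opentelemetry.io/v1beta1
    kind: OpenTelemetryCollector
    metadata:
      name: metrics-scraper             # placeholder name
    spec:
      mode: statefulset
      replicas: 2
      targetAllocator:
        enabled: true                   # shards Prometheus targets across replicas
        prometheusCR:
          enabled: true                 # also pick up ServiceMonitor/PodMonitor objects
      config:
        receivers:
          prometheus:
            config:
              scrape_configs: []        # populated by the Target Allocator
        exporters:
          otlphttp:
            endpoint: https://otlp.example.com   # placeholder backend
        service:
          pipelines:
            metrics:
              receivers: [prometheus]
              exporters: [otlphttp]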
I'm using OpenObserve - it does logs, metrics and traces all under one roof. Handles alerts too.
It's been solid, but the UI is kind of clunky and a little buggy here and there, and dashboards are tricky to set up. But it has no dependencies, it was easy to set up, and I couldn't find anything else that handled logs too.
HyperDX is really great. It is basically SQL on telemetry data in ClickHouse.
Don't use Vector or otel-agent. Add a materialized view in ClickHouse to transform the data, and swap HyperDX to load from your view (in the UI).
Sounds like you should take a look at ClickStack (HyperDX) to me
OneUptime does this with otel. Happy to help! Feel free to reach out at nawazdhandala [at] oneuptime [dot] com
Seq?
I've been looking at HyperDX (ClickStack) and SigNoz, but those are indeed coupled.
I tried both. SigNoz is pretty sloppily built. For example, the self-hosted option starts a ZooKeeper instance alongside a single ClickHouse host, with no way to disable it, and it eats ~800MB of RAM. SigNoz's log transformation tool is broken and confusing.
HyperDX is just a lot better. Sure, there are a few papercuts, but they got all the important stuff right, IMO.
Otel stuff always seems overly complicated to me, but it must just be the types of projects I generally work with. It feels like observability meets Java.
I've dabbled in building a project that collects metrics from the logs for smaller projects. Everyone tells me it's a bad idea, but it seems to work well for me.
I did not like working with OpenTelemetry; made me miss the good old days (monolith).
I'm evaluating GreptimeDB in prod, and while I generally hate having a needless component like the OTel Collector, here it serves as a gate between users and the database: GreptimeDB keeps listening on localhost only, and the collector guarantees nobody writes to the database directly.
If GreptimeDB gave more fine-grained control over write-only access, I would probably just write to it directly and let it handle the load.
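The gate amounts to a tiny collector config like this (a sketch; GreptimeDB's OTLP/HTTP port and path here are my assumptions from its docs, so verify them):

    receivers:
      otlp:
        protocols:
          http:
            endpoint: 0.0.0.0:4318               # the only port clients can reach
    exporters:
      otlphttp:
        endpoint: http://127.0.0.1:4000/v1/otlp  # GreptimeDB bound to localhost
    service:
      pipelines:
        metrics:
          receivers: [otlp]
          exporters: [otlphttp]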