Observability on a Budget: Logging, Tracing, and Metrics Without Enterprise Tooling

Sandor Farkas - Founder & Lead Developer at Wolf-Tech

Expert in software development and legacy code optimization

The bill from Datadog landed on a Wednesday morning and everyone in the engineering meeting got very quiet. A SaaS team we worked with last year was paying more for observability than for their entire staging infrastructure, and every new service pushed the number higher. The product was shipping well, traffic was growing, and the monitoring that had felt like a safety net at Series A had become a recurring cost that shaped every roadmap conversation. The question on the table was whether the same visibility could be delivered for a tenth of the price. The answer, after three months of migration, was yes.

Observability on a budget is not a compromise on quality. It is a deliberate architecture choice that uses open standards and open-source tooling to get the same three signals—logs, metrics, and traces—that the enterprise platforms sell. OpenTelemetry for instrumentation, Prometheus and Grafana for metrics and dashboards, structured Monolog logs shipped to Loki for aggregation, and Tempo or Jaeger for distributed tracing. Each piece is production-ready, widely deployed, and well-documented. Together they form a stack that scales from a two-person startup to a mid-size SaaS without a five-figure monthly bill.

This post covers the architecture of that stack, the reasoning behind each tool choice, and the trade-offs you accept when you own the pipeline instead of outsourcing it.

The Three Pillars and What Each Actually Answers

Observability is often reduced to "logs, metrics, and traces," but the phrase obscures why you need all three. Each signal answers a different kind of question, and a monitoring stack is only as strong as its weakest pillar.

Metrics answer questions about rates and totals over time: how many requests per second, what is the p95 latency of the checkout endpoint, how many background jobs are failing, how full is the queue. Metrics are cheap to store because they aggregate. They are fast to query because they are pre-indexed time series. They power alerts and dashboards, and they are the first thing you look at when something feels slow.

Logs answer questions about specific events: what happened when user 4523 tried to check out at 14:32, what was the stack trace of the exception that fired twice in the last hour, which parameters did the failing webhook receive. Logs are expensive at scale because they are high-volume and require full-text indexing to be useful. The quality of a logging system depends almost entirely on whether logs are structured and correlated.

Traces answer questions about causality across services: a request came in at the API gateway, fanned out to three backend services and two external providers, and took 4.2 seconds—where did the time go. Traces are the most valuable signal in a distributed system and the one that is most commonly skipped because teams underestimate how hard it is to reconstruct them from logs alone.

A good stack delivers all three with shared identifiers so a trace ID in a log line opens the corresponding trace, and a slow span in a trace points to the exact log output of the handler that ran. This correlation is what turns raw telemetry into actual observability.

Structured Logging with Monolog and a Shared Correlation ID

For PHP/Symfony applications, Monolog is the logging foundation. Out of the box it supports multiple handlers, log rotation, and a handful of built-in processors, but the meaningful work is in configuring it for structured output and trace correlation.

Structured means JSON. A log line like "User 4523 failed checkout" is searchable only by substring; a JSON object with user_id, event, order_id, and error_code fields can be filtered, aggregated, and correlated at query time. The shift to JSON logging is the single highest-leverage observability change most teams can make, and it costs nothing.
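As a concrete illustration, the same event as a structured record might look like the following (the field values, including the order ID, are hypothetical):

```json
{
  "message": "checkout failed",
  "level": "error",
  "event": "checkout.failed",
  "user_id": 4523,
  "order_id": "ord_8812",
  "error_code": "card_declined",
  "datetime": "2025-01-15T14:32:07+00:00"
}
```

Every field here is an exact-match filter in the log backend: all checkout failures, all events for one user, all occurrences of one error code.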

A minimal Monolog setup for a Symfony service looks like this:

# config/packages/monolog.yaml
monolog:
  handlers:
    main:
      type: stream
      path: 'php://stdout'
      level: info
      formatter: monolog.formatter.json
      channels: ['!event']

# config/services.yaml: MonologBundle picks up processors as services
# tagged with monolog.processor, not from monolog.yaml itself
services:
  Monolog\Processor\UidProcessor:
    tags: ['monolog.processor']
  App\Logging\TraceIdProcessor:
    tags: ['monolog.processor']

The TraceIdProcessor is where correlation happens. It reads the current OpenTelemetry trace ID from context and attaches it to every log record so the logs for a single request can be pulled back by searching for one ID.

// src/Logging/TraceIdProcessor.php
declare(strict_types=1);

namespace App\Logging;

use Monolog\LogRecord;
use Monolog\Processor\ProcessorInterface;
use OpenTelemetry\API\Trace\Span;

final class TraceIdProcessor implements ProcessorInterface
{
    // Monolog 3 signature; in Monolog 2 this would be
    // __invoke(array $record): array with $record['extra'] instead.
    public function __invoke(LogRecord $record): LogRecord
    {
        $context = Span::getCurrent()->getContext();
        if ($context->isValid()) {
            $record->extra['trace_id'] = $context->getTraceId();
            $record->extra['span_id'] = $context->getSpanId();
        }
        return $record;
    }
}

Logs written to stdout in JSON are trivial to collect. Promtail or Vector ships them into Loki, Grafana's log storage engine, which costs a fraction of Elasticsearch or a SaaS log provider because it only indexes labels and relies on cheap object storage for the rest. For many teams, Loki on a €40/month VPS handles the entire log volume of a growing SaaS.
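As a sketch of the shipping side, a minimal Promtail configuration for a single-node Docker host might look like this; the Loki hostname and log path are assumptions to adapt to your environment:

```yaml
# promtail-config.yaml: tail Docker's JSON log files and push to Loki
server:
  http_listen_port: 9080

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: containers
    static_configs:
      - targets: [localhost]
        labels:
          job: app
          __path__: /var/lib/docker/containers/*/*-json.log
    pipeline_stages:
      # Unwrap Docker's JSON envelope; the Monolog JSON stays in the line body
      - docker: {}
```

Keep high-cardinality fields such as trace IDs in the log body rather than as Loki labels; Loki's query-time JSON parsing finds them without inflating the label index.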

Metrics with Prometheus and Grafana

Prometheus is the de facto standard for metrics in open-source infrastructure, and there is almost no scenario in a mid-size SaaS where it is the wrong choice. It scrapes metrics endpoints on a schedule, stores them as efficient time series, and exposes a query language (PromQL) that is expressive enough for virtually any alert or dashboard you might build.
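To make "expressive enough" concrete, here are two PromQL queries of the kind this post leans on, assuming the app_http_requests_total counter and app_http_request_duration_seconds histogram defined later in this section:

```promql
# p95 latency of the checkout route over the last 5 minutes
histogram_quantile(0.95,
  sum by (le) (rate(app_http_request_duration_seconds_bucket{route="/checkout"}[5m])))

# 5xx errors as a fraction of all requests
sum(rate(app_http_requests_total{status=~"5.."}[5m]))
  / sum(rate(app_http_requests_total[5m]))
```

Both queries work unchanged as dashboard panels and as alert conditions.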

Instrumenting a PHP application for Prometheus requires a client library that exposes a /metrics endpoint in the Prometheus text format. promphp/prometheus_client_php handles this. Counters, histograms, and gauges each take a few lines:

use Prometheus\CollectorRegistry;
use Prometheus\RenderTextFormat;
use Prometheus\Storage\Redis;

// Redis-backed storage so metrics survive across PHP-FPM workers
$registry = new CollectorRegistry(new Redis(['host' => 'redis']));

$counter = $registry->getOrRegisterCounter(
    'app',
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'route', 'status']
);
$counter->inc([$method, $route, (string) $status]);

$histogram = $registry->getOrRegisterHistogram(
    'app',
    'http_request_duration_seconds',
    'HTTP request duration',
    ['route'],
    [0.05, 0.1, 0.25, 0.5, 1, 2, 5]
);
$histogram->observe($durationSeconds, [$route]);

// The /metrics endpoint renders the registry in Prometheus text format
$renderer = new RenderTextFormat();
header('Content-Type: ' . RenderTextFormat::MIME_TYPE);
echo $renderer->render($registry->getMetricFamilySamples());

The metrics you should expose by default are request rate and latency histograms per route, database query duration, queue depth, job duration, error rates, and one or two business metrics that matter for your product (signups per minute, orders per minute, active WebSocket connections). Grafana dashboards built on these five or six metric families cover 90% of operational questions.
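On the scrape side, the corresponding prometheus.yml fragment is short; the target hostnames and ports here are placeholders for your own service discovery or static layout:

```yaml
# prometheus.yml: scrape the app and infrastructure exporters
scrape_configs:
  - job_name: app
    metrics_path: /metrics
    static_configs:
      - targets: ['app:8080']

  - job_name: node
    static_configs:
      - targets: ['node-exporter:9100']
```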

Grafana itself is free, runs on any small server, and has a mature alerting system that fires to Slack, PagerDuty, Opsgenie, or email. Alert rules live as code in Grafana's provisioning files, so they are reviewed and versioned the same way as application code. This is often a better workflow than SaaS alerting UIs where alert definitions drift from the assumptions of the team that wrote them.

Distributed Tracing with OpenTelemetry and Tempo

OpenTelemetry is the critical piece of the budget stack because it removes vendor lock-in. The OpenTelemetry SDK instruments your application with vendor-neutral semantics, and the OpenTelemetry Collector routes the resulting telemetry to any compatible backend. Start with Tempo (Grafana's trace backend), and if you later migrate to a commercial vendor, the instrumentation does not change—only the Collector's export configuration.
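A minimal Collector configuration for this arrangement might look like the following sketch; the tempo:4317 endpoint is an assumed hostname, and a migration to a commercial backend would touch only the exporters section:

```yaml
# otel-collector.yaml: receive OTLP from the app, batch, forward to Tempo
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch: {}

exporters:
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/tempo]
```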

A Symfony application instruments itself with open-telemetry/opentelemetry-auto-symfony and a tracer provider configured in the service container. The auto-instrumentation wraps the kernel, Doctrine queries, HTTP clients, and cache operations, and produces a trace tree for every inbound request. For manual spans around specific operations, a thin helper makes the code ergonomic:

public function processPayment(Order $order): PaymentResult
{
    return $this->tracer->inSpan('payment.process', function () use ($order) {
        $span = Span::getCurrent();
        $span->setAttribute('order.id', $order->getId());
        $span->setAttribute('order.amount_cents', $order->getAmountCents());

        $result = $this->paymentProvider->charge($order);

        $span->setAttribute('payment.provider_ref', $result->getReference());
        return $result;
    });
}

Traces flow through the OpenTelemetry Collector into Tempo, which stores them in S3-compatible object storage. Because Tempo does not index trace content—it indexes only the trace ID and uses external systems (Loki, Prometheus) for search—it is dramatically cheaper than Jaeger with Elasticsearch at the same volume. The typical budget stack stores weeks of traces for the price of a day or two on a commercial platform.
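The storage side of that claim is a few lines of Tempo configuration; the bucket name and endpoint below are placeholders, and credentials belong in environment variables rather than the file:

```yaml
# tempo.yaml: trace storage block, S3-compatible object storage
storage:
  trace:
    backend: s3
    s3:
      bucket: tempo-traces
      endpoint: s3.eu-central-1.amazonaws.com
    wal:
      path: /var/tempo/wal
```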

The practical value shows up when something goes wrong: a slow checkout trace shows exactly which span dominated the request, the logs for that span are one click away via the shared trace ID, and the relevant Prometheus metrics are visible in the same Grafana instance. That integration is what makes observability work, and it comes for free when all three signals live in the same place.

The Honest Trade-offs

A budget stack is not identical to Datadog or New Relic. There are real differences worth naming.

The first is operational overhead. Running Prometheus, Grafana, Loki, Tempo, and the OpenTelemetry Collector requires someone on the team who understands how to upgrade them, configure retention, tune resource limits, and recover from disk pressure. Teams with no platform engineer often underestimate this cost. In practice, a well-configured stack on Docker or Kubernetes needs a few hours per month of maintenance—much less than the enterprise bill, but not zero.

The second is advanced features. Commercial platforms include anomaly detection, automatic dependency mapping, session replay, real user monitoring, and AI-assisted root cause analysis. Most of these features are useful at scale but unnecessary for a team below a few hundred services. If you need them later, OpenTelemetry instrumentation ports cleanly to commercial backends.

The third is retention at very high volumes. At tens of billions of log lines per day, running Loki yourself becomes non-trivial. For most SaaS companies this is a problem for later; for very high-volume products it is worth validating early.

For mid-size European SaaS companies, the budget stack is usually the right choice—not because it is cheaper but because it is inspectable, portable, and built on standards that will outlast any single vendor. It is also the stack that pairs best with legacy code optimization and broader custom software development work, because it gives engineering teams the visibility they need to change systems with confidence.

A Reference Architecture You Can Ship This Quarter

A concrete end-to-end setup that we have deployed successfully across several Berlin-based SaaS products looks like this: Symfony or Next.js applications instrumented with OpenTelemetry and Monolog; an OpenTelemetry Collector running as a sidecar or single instance per environment; Prometheus scraping /metrics from each service and from infrastructure exporters (node_exporter, mysqld_exporter, redis_exporter); Loki collecting JSON logs from stdout via Promtail; Tempo storing traces in S3; and Grafana as the single pane of glass for dashboards, logs, and traces, with alerting routed to Slack and PagerDuty.

The entire stack runs on two or three modest Hetzner or OVH servers for the vast majority of SaaS products, including some with millions of requests per day. The monthly cost is typically under €200, and the visibility is indistinguishable from what a commercial platform provides for day-to-day operational work.

Observability is not a problem you solve once. It is a practice that grows with the product, and the early choices compound. Teams that commit to OpenTelemetry instrumentation and structured logging early can change backends freely as their needs evolve; teams locked into a proprietary agent pay for that decision in every future migration.

If your team is evaluating an observability migration, scoping the first instrumentation pass, or trying to stop a Datadog bill from defining the quarter, we help mid-size European SaaS companies design and deploy production-grade open-source observability stacks. Contact us at hello@wolf-tech.io or visit wolf-tech.io for a free consultation.