SigNoz is an open-source APM (Application Performance Management) tool and a practical implementation of application observability. It helps developers monitor their applications and troubleshoot problems, and it is an open-source alternative to DataDog, New Relic, and similar products. It uses the OpenTelemetry protocol to integrate traces, metrics, and logs.
This article focuses on the implementation of traces and metrics.
1. Overall structure
- The app is instrumented with the OpenTelemetry SDK (Java, Go, Python, and other languages are supported);
- The app is configured to send metrics and traces to otel-collector;
- otel-collector adds two custom exporters:
  - clickhouse-metrics-exporter: sends metrics to ClickHouse;
  - clickhouse-trace-exporter: sends traces to ClickHouse;
- query-service queries metrics and traces from ClickHouse and provides the API for the front end.
2. otel-collector
1. The app sends metrics and traces to otel-collector
Add the OpenTelemetry package to the app:
go.opentelemetry.io/otel
If you use gin as the HTTP framework, register the otelgin middleware:
import "go.opentelemetry.io/contrib/instrumentation/github.com/gin-gonic/gin/otelgin"

r := gin.Default()
r.Use(otelgin.Middleware(serviceName))
When running the app, specify the service name and the collector address:
# SERVICE_NAME=goApp INSECURE_MODE=true OTEL_EXPORTER_OTLP_ENDPOINT=192.168.0.1:4317 go run main.go
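How the sample app consumes these variables is not shown in the article. The following is a minimal sketch, assuming the app reads all three itself at startup; `exporterConfig` and `loadExporterConfig` are hypothetical names, and the real app may rely on the OpenTelemetry SDK to read `OTEL_EXPORTER_OTLP_ENDPOINT` instead.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// exporterConfig holds the settings passed through the environment.
// The type is hypothetical and only illustrates the idea.
type exporterConfig struct {
	ServiceName string
	Endpoint    string
	Insecure    bool
}

// loadExporterConfig reads the variables through getenv, so it can be
// called with os.Getenv or with a fake environment.
func loadExporterConfig(getenv func(string) string) (exporterConfig, error) {
	cfg := exporterConfig{
		ServiceName: getenv("SERVICE_NAME"),
		Endpoint:    getenv("OTEL_EXPORTER_OTLP_ENDPOINT"),
	}
	if cfg.ServiceName == "" {
		return cfg, fmt.Errorf("SERVICE_NAME is not set")
	}
	if cfg.Endpoint == "" {
		return cfg, fmt.Errorf("OTEL_EXPORTER_OTLP_ENDPOINT is not set")
	}
	// INSECURE_MODE is optional here and defaults to false (an assumption).
	if raw := getenv("INSECURE_MODE"); raw != "" {
		insecure, err := strconv.ParseBool(raw)
		if err != nil {
			return cfg, fmt.Errorf("INSECURE_MODE: %w", err)
		}
		cfg.Insecure = insecure
	}
	return cfg, nil
}

func main() {
	cfg, err := loadExporterConfig(os.Getenv)
	if err != nil {
		fmt.Println("config error:", err)
		return
	}
	fmt.Printf("service=%s endpoint=%s insecure=%t\n", cfg.ServiceName, cfg.Endpoint, cfg.Insecure)
}
```

Failing fast on a missing service name or endpoint avoids silently exporting telemetry to nowhere.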
2. Configuration of otel-collector
The otel-collector configuration is divided into receivers, processors, and exporters, which the service section then wires together into complete pipelines.
receivers:
  opencensus:
    endpoint: 0.0.0.0:55678
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
  jaeger:
    protocols:
      grpc:
        endpoint: 0.0.0.0:14250
      thrift_http:
        endpoint: 0.0.0.0:14268
processors:
  batch:
    send_batch_size: 10000
    send_batch_max_size: 11000
    timeout: 10s
exporters:
  clickhousetraces:
    datasource: tcp://clickhouse:9000/?database=signoz_traces
  clickhousemetricswrite:
    endpoint: tcp://clickhouse:9000/?database=signoz_metrics
    resource_to_telemetry_conversion:
      enabled: true
service:
  telemetry:
    metrics:
      address: 0.0.0.0:8888
  extensions:
    - health_check
    - zpages
    - pprof
  pipelines:
    traces:
      receivers: [jaeger, otlp]
      processors: [signozspanmetrics/prometheus, batch]
      exporters: [clickhousetraces]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [clickhousemetricswrite]
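To make the wiring rule concrete, here is a rough sketch of a check that every component named in a pipeline is also declared at the top level. All types and functions are hypothetical and this is not the collector's own validation code. The data mirrors the configuration above, which shows no declaration for `signozspanmetrics/prometheus` (presumably the excerpt is abridged), so the check reports that name.

```go
package main

import (
	"fmt"
	"sort"
)

// pipeline mirrors one entry under service.pipelines.
type pipeline struct {
	Receivers, Processors, Exporters []string
}

// collectorConfig keeps only component names (a hypothetical simplification).
type collectorConfig struct {
	Receivers, Processors, Exporters map[string]bool
	Pipelines                        map[string]pipeline
}

// undefinedComponents lists every pipeline reference to a component
// that is not declared in the matching top-level section.
func undefinedComponents(cfg collectorConfig) []string {
	var problems []string
	check := func(pipelineName, kind string, used []string, declared map[string]bool) {
		for _, name := range used {
			if !declared[name] {
				problems = append(problems,
					fmt.Sprintf("%s: %s %q is not declared", pipelineName, kind, name))
			}
		}
	}
	// Sort pipeline names so the report order is deterministic.
	names := make([]string, 0, len(cfg.Pipelines))
	for name := range cfg.Pipelines {
		names = append(names, name)
	}
	sort.Strings(names)
	for _, name := range names {
		p := cfg.Pipelines[name]
		check(name, "receiver", p.Receivers, cfg.Receivers)
		check(name, "processor", p.Processors, cfg.Processors)
		check(name, "exporter", p.Exporters, cfg.Exporters)
	}
	return problems
}

// articleConfig reproduces the component names from the configuration above.
func articleConfig() collectorConfig {
	return collectorConfig{
		Receivers:  map[string]bool{"opencensus": true, "otlp": true, "jaeger": true},
		Processors: map[string]bool{"batch": true},
		Exporters:  map[string]bool{"clickhousetraces": true, "clickhousemetricswrite": true},
		Pipelines: map[string]pipeline{
			"traces": {
				Receivers:  []string{"jaeger", "otlp"},
				Processors: []string{"signozspanmetrics/prometheus", "batch"},
				Exporters:  []string{"clickhousetraces"},
			},
			"metrics": {
				Receivers:  []string{"otlp"},
				Processors: []string{"batch"},
				Exporters:  []string{"clickhousemetricswrite"},
			},
		},
	}
}

func main() {
	for _, problem := range undefinedComponents(articleConfig()) {
		fmt.Println(problem)
	}
}
```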
3. ClickHouse exporters in otel-collector
otel-collector saves metrics and traces to ClickHouse. For this, SigNoz implements two exporters in otel-collector:
- clickhousemetricsexporter;
- clickhousetracesexporter.
Metric data stored in ClickHouse:
## time_series information
73048f54a32c :) select * from time_series_v2 limit 3;

SELECT * FROM time_series_v2 LIMIT 3

Query id: 2540c93a-a396-42a7-b4e5-bed772ac00c5

Row 1:
metric_name:   otelcol_exporter_enqueue_failed_log_records
fingerprint:   136087390202145591
timestamp_ms:  1659598169051
labels:        {"__name__":"otelcol_exporter_enqueue_failed_log_records","exporter":"clickhousemetricswrite","instance":"localhost:8888","job":"otel-collector-metrics","service_instance_id":"b3d22c64-5c09-44d7-9367-9716747842db","service_version":"latest"}
labels_object: ('otelcol_exporter_enqueue_failed_log_records','','','','','','','clickhousemetricswrite','','','localhost:8888','otel-collector-metrics','','','','','','','','','','b3d22c64-5c09-44d7-9367-9716747842db','','','latest','','','','','')

Row 2:
metric_name:   otelcol_exporter_enqueue_failed_log_records
fingerprint:   7551020479515194203
timestamp_ms:  1659598139047
labels:        {"__name__":"otelcol_exporter_enqueue_failed_log_records","exporter":"prometheus","instance":"otel-collector:8888","job":"otel-collector","service_instance_id":"ed7abeec-d18f-4c4d-8bcb-64a9dea37976","service_version":"latest"}
labels_object: ('otelcol_exporter_enqueue_failed_log_records','','','','','','','prometheus','','','otel-collector:8888','otel-collector','','','','','','','','','','ed7abeec-d18f-4c4d-8bcb-64a9dea37976','','','latest','','','','','')

Row 3:
metric_name:   otelcol_exporter_enqueue_failed_log_records
fingerprint:   11388370004215594241
timestamp_ms:  1659598139047
labels:        {"__name__":"otelcol_exporter_enqueue_failed_log_records","exporter":"clickhousetraces","instance":"otel-collector:8888","job":"otel-collector","service_instance_id":"ed7abeec-d18f-4c4d-8bcb-64a9dea37976","service_version":"latest"}
labels_object: ('otelcol_exporter_enqueue_failed_log_records','','','','','','','clickhousetraces','','','otel-collector:8888','otel-collector','','','','','','','','','','ed7abeec-d18f-4c4d-8bcb-64a9dea37976','','','latest','','','','','')
## samples information
73048f54a32c :) select * from samples_v2 limit 3

SELECT * FROM samples_v2 LIMIT 3

Query id: 40dfc9f5-e76e-4ecc-8b50-c791fa9e041c

┌─metric_name─────────────────────────────────┬────────fingerprint─┬──timestamp_ms─┬─value─┐
│ otelcol_exporter_enqueue_failed_log_records │ 136087390202145591 │ 1659598160222 │     0 │
│ otelcol_exporter_enqueue_failed_log_records │ 136087390202145591 │ 1659598220222 │     0 │
│ otelcol_exporter_enqueue_failed_log_records │ 136087390202145591 │ 1659598280222 │     0 │
└─────────────────────────────────────────────┴────────────────────┴───────────────┴───────┘

3 rows in set. Elapsed: 0.016 sec.
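The two tables are linked by the fingerprint column: a label set is stored once in time_series_v2, and each sample in samples_v2 carries only the fingerprint, a timestamp, and a value. The sketch below illustrates that layout in memory. The FNV-1a hash and all type names are assumptions made for illustration; the article does not show how SigNoz actually computes fingerprints.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// fingerprint hashes a label set in sorted key order, so the same labels
// always map to the same value. FNV-1a is an assumed choice of hash.
func fingerprint(labels map[string]string) uint64 {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	h := fnv.New64a()
	for _, k := range keys {
		h.Write([]byte(k))
		h.Write([]byte{0xff}) // separator between key and value
		h.Write([]byte(labels[k]))
		h.Write([]byte{0xff}) // separator between pairs
	}
	return h.Sum64()
}

// sample mirrors a row of samples_v2 without the metric name.
type sample struct {
	Fingerprint uint64
	TimestampMs int64
	Value       float64
}

// store keeps label sets and samples apart, like the two tables.
type store struct {
	series  map[uint64]map[string]string
	samples []sample
}

// add records the label set once and appends one sample row.
func (s *store) add(labels map[string]string, timestampMs int64, value float64) uint64 {
	fp := fingerprint(labels)
	if _, known := s.series[fp]; !known {
		s.series[fp] = labels
	}
	s.samples = append(s.samples, sample{Fingerprint: fp, TimestampMs: timestampMs, Value: value})
	return fp
}

func main() {
	s := &store{series: map[uint64]map[string]string{}}
	labels := map[string]string{
		"__name__": "otelcol_exporter_enqueue_failed_log_records",
		"exporter": "clickhousemetricswrite",
		"instance": "localhost:8888",
	}
	for _, ts := range []int64{1659598160222, 1659598220222, 1659598280222} {
		s.add(labels, ts, 0)
	}
	fmt.Printf("series=%d samples=%d\n", len(s.series), len(s.samples))
}
```

Storing the labels once per series keeps the samples table narrow, which matters because it receives a new row for every scrape.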
Trace data stored in ClickHouse:
73048f54a32c :) select * from signoz_spans limit 3;

SELECT * FROM signoz_spans LIMIT 3

Query id: d9d84316-c6f0-4089-87a1-187132a295b1

Connecting to database signoz_traces at localhost:9000 as user default.
Connected to ClickHouse server version 22.4.5 revision 54455.

Row 1:
timestamp: 2022-08-04 14:42:21.920273000
traceID:   00000000000000000000976d1a8457bf
model:     {"traceId":"00000000000000000000976d1a8457bf","spanId":"0c8946ee620b3b72","name":"HTTP GET /route","durationNano":35578000,"startTimeUnixNano":1659624141920273000,"serviceName":"route","kind":2,"references":[{"traceId":"00000000000000000000976d1a8457bf","spanId":"75509681018b9e65","refType":"CHILD_OF"}],"tagMap":{"client-uuid":"1f460e28dcd19136","component":"net/http","host.name":"93a9af66f59d","http.method":"GET","http.status_code":"200","http.url":"/route?dropoff=728%2C326\u0026pickup=232%2C495","ip":"172.18.0.3","opencensus.exporterversion":"Jaeger-Go-2.30.0","service.name":"route"},"event":["{\"timeUnixNano\":1659624141920308000,\"attributeMap\":{\"event\":\"HTTP request received\",\"level\":\"info\",\"method\":\"GET\",\"url\":\"/route?dropoff=728%2C326\\u0026pickup=232%2C495\"}}"]}

Row 2:
timestamp: 2022-08-04 14:42:21.920759000
traceID:   00000000000000000000976d1a8457bf
model:     {"traceId":"00000000000000000000976d1a8457bf","spanId":"3a48059caa1e422a","name":"HTTP GET /route","durationNano":50933000,"startTimeUnixNano":1659624141920759000,"serviceName":"route","kind":2,"references":[{"traceId":"00000000000000000000976d1a8457bf","spanId":"37835dd357b1d10c","refType":"CHILD_OF"}],"tagMap":{"client-uuid":"1f460e28dcd19136","component":"net/http","host.name":"93a9af66f59d","http.method":"GET","http.status_code":"200","http.url":"/route?dropoff=728%2C326\u0026pickup=745%2C522","ip":"172.18.0.3","opencensus.exporterversion":"Jaeger-Go-2.30.0","service.name":"route"},"event":["{\"timeUnixNano\":1659624141920796000,\"attributeMap\":{\"event\":\"HTTP request received\",\"level\":\"info\",\"method\":\"GET\",\"url\":\"/route?dropoff=728%2C326\\u0026pickup=745%2C522\"}}"]}

Row 3:
timestamp: 2022-08-04 14:42:21.920505000
traceID:   00000000000000000000976d1a8457bf
model:     {"traceId":"00000000000000000000976d1a8457bf","spanId":"2754f5b55262e27b","name":"HTTP GET /route","durationNano":71846000,"startTimeUnixNano":1659624141920505000,"serviceName":"route","kind":2,"references":[{"traceId":"00000000000000000000976d1a8457bf","spanId":"44b47488d0e49c98","refType":"CHILD_OF"}],"tagMap":{"client-uuid":"1f460e28dcd19136","component":"net/http","host.name":"93a9af66f59d","http.method":"GET","http.status_code":"200","http.url":"/route?dropoff=728%2C326\u0026pickup=462%2C723","ip":"172.18.0.3","opencensus.exporterversion":"Jaeger-Go-2.30.0","service.name":"route"},"event":["{\"timeUnixNano\":1659624141920525000,\"attributeMap\":{\"event\":\"HTTP request received\",\"level\":\"info\",\"method\":\"GET\",\"url\":\"/route?dropoff=728%2C326\\u0026pickup=462%2C723\"}}"]}
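The model column holds the whole span as JSON, and each entry in its event array is itself a JSON document encoded as a string. The sketch below decodes both layers. The struct names are hypothetical, the field tags follow the keys visible in the rows above, and `sampleModel` is an abridged copy of the first row.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// spanReference mirrors one element of the "references" array.
type spanReference struct {
	TraceID string `json:"traceId"`
	SpanID  string `json:"spanId"`
	RefType string `json:"refType"`
}

// spanModel mirrors the JSON in the model column (hypothetical type).
type spanModel struct {
	TraceID           string            `json:"traceId"`
	SpanID            string            `json:"spanId"`
	Name              string            `json:"name"`
	DurationNano      uint64            `json:"durationNano"`
	StartTimeUnixNano uint64            `json:"startTimeUnixNano"`
	ServiceName       string            `json:"serviceName"`
	Kind              int               `json:"kind"`
	References        []spanReference   `json:"references"`
	TagMap            map[string]string `json:"tagMap"`
	Events            []string          `json:"event"` // each entry is a JSON string
}

// spanEvent is the second layer: one decoded entry of Events.
type spanEvent struct {
	TimeUnixNano uint64            `json:"timeUnixNano"`
	AttributeMap map[string]string `json:"attributeMap"`
}

// sampleModel is abridged from the first row shown above.
const sampleModel = `{"traceId":"00000000000000000000976d1a8457bf","spanId":"0c8946ee620b3b72","name":"HTTP GET /route","durationNano":35578000,"startTimeUnixNano":1659624141920273000,"serviceName":"route","kind":2,"references":[{"traceId":"00000000000000000000976d1a8457bf","spanId":"75509681018b9e65","refType":"CHILD_OF"}],"tagMap":{"http.method":"GET","http.status_code":"200"},"event":["{\"timeUnixNano\":1659624141920308000,\"attributeMap\":{\"event\":\"HTTP request received\",\"level\":\"info\"}}"]}`

// decodeSpan parses the outer document, then every embedded event string.
func decodeSpan(raw string) (spanModel, []spanEvent, error) {
	var span spanModel
	if err := json.Unmarshal([]byte(raw), &span); err != nil {
		return span, nil, err
	}
	events := make([]spanEvent, 0, len(span.Events))
	for _, encoded := range span.Events {
		var ev spanEvent
		if err := json.Unmarshal([]byte(encoded), &ev); err != nil {
			return span, nil, err
		}
		events = append(events, ev)
	}
	return span, events, nil
}

func main() {
	span, events, err := decodeSpan(sampleModel)
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	fmt.Println(span.Name, time.Duration(span.DurationNano), "parent:", span.References[0].SpanID)
	fmt.Println("first event:", events[0].AttributeMap["event"])
}
```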
3. Query Service
query-service reads metrics and traces from ClickHouse and provides an HTTP interface for the front end.
In addition, query-service embeds Prometheus as a library to implement alert rule creation and alerting.
1. Query metrics
API entry: /api/v1/query_range
// app/http_handler.go
func (aH *APIHandler) RegisterRoutes(router *mux.Router) {
	router.HandleFunc("/api/v1/query_range", ViewAccess(aH.queryRangeMetrics)).Methods(http.MethodGet)
	...
}

func (aH *APIHandler) queryRangeMetrics(w http.ResponseWriter, r *http.Request) {
	query, apiErrorObj := parseQueryRangeRequest(r)
	...
	ctx := r.Context()
	...
	res, qs, apiError := (*aH.reader).GetQueryRangeResult(ctx, query)
	...
	response_data := &model.QueryData{
		ResultType: res.Value.Type(),
		Result:     res.Value,
		Stats:      qs,
	}
	aH.respond(w, response_data)
}
Prometheus remote read is implemented for ClickHouse, so the query code is similar to that of Prometheus:
- Construct the query object;
- Call the engine to execute the query.
// app/clickhouseReader/reader.go
func (r *ClickHouseReader) GetQueryRangeResult(ctx context.Context, query *model.QueryRangeParams) (*promql.Result, *stats.QueryStats, *model.ApiError) {
	qry, err := r.queryEngine.NewRangeQuery(r.remoteStorage, query.Query, query.Start, query.End, query.Step)
	...
	res := qry.Exec(ctx)

	// Optional stats field in response if parameter "stats" is not empty.
	var qs *stats.QueryStats
	if query.Stats != "" {
		qs = stats.NewQueryStats(qry.Stats())
	}

	qry.Close()
	return res, qs, nil
}
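The article does not show the body of parseQueryRangeRequest. As a rough sketch of what extracting Query, Start, End, and Step could look like, the code below assumes the Prometheus-style parameter names `query`, `start`, `end`, and `step`, with times in Unix seconds and the step in whole seconds. `rangeParams` and its helpers are hypothetical names, not SigNoz code.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
	"time"
)

// rangeParams is a hypothetical stand-in for model.QueryRangeParams.
type rangeParams struct {
	Query string
	Start time.Time
	End   time.Time
	Step  time.Duration
}

// parseRangeParams validates and converts the four query parameters.
func parseRangeParams(v url.Values) (rangeParams, error) {
	p := rangeParams{Query: v.Get("query")}
	if p.Query == "" {
		return p, fmt.Errorf("query is required")
	}
	start, err := strconv.ParseInt(v.Get("start"), 10, 64)
	if err != nil {
		return p, fmt.Errorf("start: %w", err)
	}
	end, err := strconv.ParseInt(v.Get("end"), 10, 64)
	if err != nil {
		return p, fmt.Errorf("end: %w", err)
	}
	step, err := strconv.ParseInt(v.Get("step"), 10, 64)
	if err != nil {
		return p, fmt.Errorf("step: %w", err)
	}
	if end < start {
		return p, fmt.Errorf("end is before start")
	}
	if step <= 0 {
		return p, fmt.Errorf("step must be positive")
	}
	p.Start = time.Unix(start, 0)
	p.End = time.Unix(end, 0)
	p.Step = time.Duration(step) * time.Second
	return p, nil
}

// parseRangeQuery is a convenience wrapper for a raw query string.
func parseRangeQuery(rawQuery string) (rangeParams, error) {
	v, err := url.ParseQuery(rawQuery)
	if err != nil {
		return rangeParams{}, err
	}
	return parseRangeParams(v)
}

// points returns how many evaluation timestamps the range covers,
// counting both the start and the last step that fits before the end.
func (p rangeParams) points() int {
	return int(p.End.Sub(p.Start)/p.Step) + 1
}

func main() {
	p, err := parseRangeQuery("query=up&start=1659598100&end=1659598400&step=60")
	if err != nil {
		fmt.Println("bad request:", err)
		return
	}
	fmt.Printf("query=%s step=%s points=%d\n", p.Query, p.Step, p.points())
}
```

Rejecting a non-positive step before the values reach the query engine avoids a range query that can never terminate.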
2. Query traces
Query URL: /api/v1/traces/00000000000000001348070b30cd6aec
Entry point: /api/v1/traces/{traceId}
// app/http_handler.go
func (aH *APIHandler) RegisterRoutes(router *mux.Router) {
	...
	router.HandleFunc("/api/v1/traces/{traceId}", ViewAccess(aH.searchTraces)).Methods(http.MethodGet)
	...
}

func (aH *APIHandler) searchTraces(w http.ResponseWriter, r *http.Request) {
	vars := mux.Vars(r)
	traceId := vars["traceId"]

	result, err := (*aH.reader).SearchTraces(r.Context(), traceId)
	if aH.handleError(w, err, http.StatusBadRequest) {
		return
	}
	aH.writeJSON(w, r, result)
}
The query is executed as a ClickHouse SQL statement:
// app/clickhouseReader/reader.go
func (r *ClickHouseReader) SearchTraces(ctx context.Context, traceId string) (*[]model.SearchSpansResult, error) {
	var searchScanReponses []model.SearchSpanDBReponseItem

	query := fmt.Sprintf("SELECT timestamp, traceID, model FROM %s.%s WHERE traceID=$1", r.traceDB, r.spansTable)
	err := r.db.Select(ctx, &searchScanReponses, query, traceId)
	...
	searchSpansResult := []model.SearchSpansResult{
		{
			Columns: []string{"__time", "SpanId", "TraceId", "ServiceName", "Name", "Kind", "DurationNano", "TagsKeys", "TagsValues", "References", "Events", "HasError"},
			Events:  make([][]interface{}, len(searchScanReponses)),
		},
	}

	for i, item := range searchScanReponses {
		var jsonItem model.SearchSpanReponseItem
		json.Unmarshal([]byte(item.Model), &jsonItem)
		jsonItem.TimeUnixNano = uint64(item.Timestamp.UnixNano() / 1000000)
		spanEvents := jsonItem.GetValues()
		searchSpansResult[0].Events[i] = spanEvents
	}

	return &searchSpansResult, nil
}
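The response is columnar: one list of column names, then one value row per span in the same order. Note that the code above divides UnixNano() by 1000000, so the value placed in the `__time` column is in milliseconds even though the field is named TimeUnixNano. The sketch below reproduces this layout with a reduced set of columns; the `span` and `spansResult` types are hypothetical simplifications of the model types.

```go
package main

import (
	"fmt"
	"time"
)

// span is a reduced, hypothetical version of a decoded span row.
type span struct {
	Timestamp    time.Time
	SpanID       string
	TraceID      string
	ServiceName  string
	Name         string
	DurationNano uint64
}

// spansResult mirrors the Columns/Events shape of the response.
type spansResult struct {
	Columns []string
	Events  [][]interface{}
}

// toColumnar emits one row per span, with values in column order.
func toColumnar(spans []span) spansResult {
	res := spansResult{
		Columns: []string{"__time", "SpanId", "TraceId", "ServiceName", "Name", "DurationNano"},
		Events:  make([][]interface{}, len(spans)),
	}
	for i, s := range spans {
		// Nanoseconds to milliseconds, as in the original division by 1000000.
		ms := uint64(s.Timestamp.UnixNano() / int64(time.Millisecond))
		res.Events[i] = []interface{}{ms, s.SpanID, s.TraceID, s.ServiceName, s.Name, s.DurationNano}
	}
	return res
}

// sampleSpans returns one span taken from the rows shown earlier.
func sampleSpans() []span {
	return []span{{
		Timestamp:    time.Unix(0, 1659624141920273000),
		SpanID:       "0c8946ee620b3b72",
		TraceID:      "00000000000000000000976d1a8457bf",
		ServiceName:  "route",
		Name:         "HTTP GET /route",
		DurationNano: 35578000,
	}}
}

func main() {
	res := toColumnar(sampleSpans())
	fmt.Println(res.Columns)
	for _, row := range res.Events {
		fmt.Println(row...)
	}
}
```

Sending the column names once keeps the payload smaller than repeating every key for every span, at the cost of the client having to match values by position.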
3. Create an alert rule
An alert rule created on the page is saved in SQLite:
sqlite> select * from rules;
1|2022-08-05 07:18:31.989405383+00:00|0|{"condition":{"compositeMetricQuery":{"builderQueries":{"A":{"queryName":"A","name":"A","formulaOnly":false,"metricName":"up","tagFilters":{"op":"AND","items":[]},"groupBy":["job"],"aggregateOperator":1,"expression":"A","disabled":false,"toggleDisable":false,"toggleDelete":false}},"promQueries":{"A":{"query":"up ","stats":"","name":"A","legend":"","disabled":false}},"queryType":3},"op":"3","matchType":"1"},"labels":{"severity":"warning"},"annotations":{"description":"A new alert"},"evalWindow":"5m0s","alert":"service_down","source":"http://192.168.0.1:3301/alerts/new","ruleType":"promql_rule"}
sqlite>
Entry point for creating an alert rule:
// app/http_handler.go
func (aH *APIHandler) RegisterRoutes(router *mux.Router) {
	...
	router.HandleFunc("/api/v1/rules", EditAccess(aH.createRule)).Methods(http.MethodPost)
	...
}

func (aH *APIHandler) createRule(w http.ResponseWriter, r *http.Request) {
	decoder := json.NewDecoder(r.Body)
	var postData map[string]string
	err := decoder.Decode(&postData)
	...
	apiErrorObj := (*aH.reader).CreateRule(postData["data"])
	...
	aH.respond(w, "rule successfully added")
}
Creating a rule takes three steps:
- First, insert a rules record into the database;
- Then, add the alert rule to Prometheus;
- Finally, commit the transaction.
// app/clickhouseReader/reader.go
func (r *ClickHouseReader) CreateRule(rule string) *model.ApiError {
	tx, err := r.localDB.Begin()
	...
	var lastInsertId int64
	{
		// insert a record into db
		stmt, err := tx.Prepare(`INSERT into rules (updated_at, data) VALUES($1,$2);`)
		...
		defer stmt.Close()

		result, err := stmt.Exec(time.Now(), rule)
		...
		lastInsertId, _ = result.LastInsertId()
		groupName := fmt.Sprintf("%d-groupname", lastInsertId)

		// Let prometheus add an AlertRule
		err = r.ruleManager.AddGroup(time.Duration(r.promConfig.GlobalConfig.EvaluationInterval), rule, groupName)
		if err != nil {
			tx.Rollback()
			return &model.ApiError{Typ: model.ErrorInternal, Err: err}
		}
	}
	err = tx.Commit()
	if err != nil {
		zap.S().Errorf("Error in committing transaction for INSERT to rules\n", err)
		return &model.ApiError{Typ: model.ErrorInternal, Err: err}
	}
	return nil
}
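The ordering matters: the row is committed only after the rule manager has accepted the rule, so a rule that Prometheus rejects never becomes visible in the database. The sketch below isolates that pattern with in-memory fakes. The `tx` and `ruleManager` interfaces and both fake types are hypothetical; they stand in for the SQLite transaction and the Prometheus rule manager and are not SigNoz code.

```go
package main

import (
	"errors"
	"fmt"
)

// tx is the part of a database transaction this flow needs.
type tx interface {
	Insert(rule string) (int64, error)
	Commit() error
	Rollback() error
}

// ruleManager is the part of the rule engine this flow needs.
type ruleManager interface {
	AddGroup(groupName, rule string) error
}

// createRule inserts, registers the rule, and commits only if both succeed.
func createRule(begin func() (tx, error), rm ruleManager, rule string) error {
	t, err := begin()
	if err != nil {
		return err
	}
	id, err := t.Insert(rule)
	if err != nil {
		t.Rollback()
		return err
	}
	groupName := fmt.Sprintf("%d-groupname", id)
	if err := rm.AddGroup(groupName, rule); err != nil {
		t.Rollback() // the engine rejected the rule, so discard the row
		return err
	}
	return t.Commit()
}

// memStore and memTx fake a database with pending, uncommitted rows.
type memStore struct {
	rules  map[int64]string
	nextID int64
}

type memTx struct {
	store   *memStore
	pending map[int64]string
}

func (s *memStore) begin() (tx, error) {
	return &memTx{store: s, pending: map[int64]string{}}, nil
}

func (t *memTx) Insert(rule string) (int64, error) {
	t.store.nextID++
	t.pending[t.store.nextID] = rule
	return t.store.nextID, nil
}

func (t *memTx) Commit() error {
	for id, rule := range t.pending {
		t.store.rules[id] = rule
	}
	t.pending = nil
	return nil
}

func (t *memTx) Rollback() error {
	t.pending = nil
	return nil
}

// fakeManager accepts or rejects every rule depending on reject.
type fakeManager struct {
	reject bool
	groups []string
}

func (m *fakeManager) AddGroup(groupName, rule string) error {
	if m.reject {
		return errors.New("rule rejected")
	}
	m.groups = append(m.groups, groupName)
	return nil
}

func main() {
	store := &memStore{rules: map[int64]string{}}
	err := createRule(store.begin, &fakeManager{}, `{"alert":"service_down"}`)
	fmt.Println("accepted rule:", err, "stored rows:", len(store.rules))
	err = createRule(store.begin, &fakeManager{reject: true}, `{"alert":"broken"}`)
	fmt.Println("rejected rule:", err, "stored rows:", len(store.rules))
}
```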
SigNoz also deploys an Alertmanager instance. When the Prometheus engine in query-service triggers an alert, the alert is sent to Alertmanager.
The SigNoz front end retrieves alerts by querying Alertmanager:
HTTP GET /api/alertmanager/alerts?active=true&inhibited=true&silenced=false
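For completeness, a small sketch of building that request URL from its three filters. The base address `http://192.168.0.1:3301` is taken from the `source` field of the stored rule and is only an assumption about where the front end is reachable; `alertsURL` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
)

// alertsURL builds the alert listing URL for the given filter flags.
func alertsURL(base string, active, inhibited, silenced bool) (string, error) {
	u, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	u.Path = "/api/alertmanager/alerts"
	q := url.Values{}
	q.Set("active", strconv.FormatBool(active))
	q.Set("inhibited", strconv.FormatBool(inhibited))
	q.Set("silenced", strconv.FormatBool(silenced))
	u.RawQuery = q.Encode() // Encode sorts the parameters by key
	return u.String(), nil
}

func main() {
	// The base address is an assumption, see the note above.
	target, err := alertsURL("http://192.168.0.1:3301", true, true, false)
	if err != nil {
		fmt.Println("bad base URL:", err)
		return
	}
	fmt.Println(target)
}
```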
4. Features
Metrics and traces are stored in and read from a single backend, ClickHouse:
- This removes the traditional pain point of keeping metrics in Prometheus and traces in Jaeger;
The Prometheus alert rule functionality is integrated:
- It is embedded as a Prometheus library, not run as a separate Prometheus process;
- Deduplication, aggregation, and sending of Prometheus alerts are handled by the Alertmanager instance.