Ideas for "tagging" and listing services/hosts for monitoring? #14261

etfz · 2023-11-14T08:57:47Z

Hi,

I'm looking to use the NetBox API together with Prometheus blackbox exporter and HTTP discovery in order to determine what needs monitoring. I will have a translation layer in between for transforming the results into the appropriate format for Prometheus. So, what I need is something like one or more lists of monitoring targets. Utilizing the NetBox Service model seems like the obvious starting point here. I would add, for example, a TCP 443 service to each host that needs a web server monitored, and then retrieve a list of all those services, along with the appropriate IP address. GraphQL seems to be able to easily accomplish this.

But then I might want to disable monitoring for some services, or set different service priorities, which determine alerting behaviour, etc. Using boolean (enabled) and choice (priority) custom fields seems appropriate, but unfortunately it is not possible to filter on custom fields using GraphQL at this time. I could, of course, perform the additional filtering in the translation layer, but still. Using tags would be another option that I think would support GraphQL filtering, but it doesn't seem as appropriate.
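
For illustration, the filtering in the translation layer could look roughly like the sketch below. It assumes the services have already been fetched from NetBox into plain dictionaries; the record shape, the custom field names `monitor` and `priority`, and the helper `to_http_sd` are all hypothetical. Only the output shape (a JSON list of objects with `targets` and `labels`) follows what Prometheus HTTP SD expects.

```python
import json


def to_http_sd(services):
    """Convert service records into Prometheus HTTP SD target groups.

    Each record is assumed (hypothetically) to look like:
    {"name": "HTTP", "address": "192.0.2.10", "port": 443,
     "custom_fields": {"monitor": True, "priority": "high"}}
    """
    groups = []
    for svc in services:
        cf = svc.get("custom_fields", {})
        # Skip services whose (hypothetical) "monitor" flag is not set.
        if not cf.get("monitor"):
            continue
        groups.append({
            "targets": [f"{svc['address']}:{svc['port']}"],
            # HTTP SD label values must be strings.
            "labels": {
                "service": svc["name"],
                "priority": str(cf.get("priority") or "default"),
            },
        })
    return groups


services = [
    {"name": "HTTP", "address": "192.0.2.10", "port": 443,
     "custom_fields": {"monitor": True, "priority": "high"}},
    {"name": "HTTP", "address": "192.0.2.11", "port": 443,
     "custom_fields": {"monitor": False, "priority": "low"}},
]
print(json.dumps(to_http_sd(services), indent=2))
```

Only the first service ends up in the output; the second is dropped because its flag is false. IPv6 addresses would additionally need square brackets around the address part.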

Just curious if anyone has any comments or ideas, or even does something similar.

candlerb · 2023-11-15T11:26:58Z

I use Netbox to drive node_exporter, windows_exporter and snmp_exporter - but not blackbox services.

I have tags "prom_node", "prom_windows" and "prom_snmp" to determine which devices and VMs to monitor. The relevant prometheus scrape configs are:

  - job_name: node
    scrape_interval: 1m
    http_sd_configs:
      - url: https://netbox.example.net/api/plugins/prometheus-sd/devices/?tag=prom_node&status=active
        refresh_interval: 10m
        authorization:
          type: Token
          credentials_file: /etc/prometheus/netbox.token
      - url: https://netbox.example.net/api/plugins/prometheus-sd/virtual-machines/?tag=prom_node&status=active
        refresh_interval: 10m
        authorization:
          type: Token
          credentials_file: /etc/prometheus/netbox.token
    relabel_configs:
      # Labels which control scraping
      - source_labels: [__address__]
        target_label: instance
      - source_labels: [__meta_netbox_primary_ip4]
        regex: '(.+)'
        target_label: __address__
      - source_labels: [__meta_netbox_primary_ip6]
        regex: '(.+)'
        target_label: __address__
        replacement: '[${1}]'
      - source_labels: [__address__]
        target_label: __address__
        replacement: '${1}:9100'
      # Optional extra metadata labels
      - source_labels: [__meta_netbox_cluster_slug]
        target_label: cluster
      - source_labels: [__meta_netbox_device_type_slug]
        target_label: device_type
      - source_labels: [__meta_netbox_model]
        target_label: netbox_model
      - source_labels: [__meta_netbox_platform_slug]
        target_label: platform
      - source_labels: [__meta_netbox_role_slug]
        target_label: role
      - source_labels: [__meta_netbox_site_slug]
        target_label: site
      - source_labels: [__meta_netbox_tag_slugs]
        target_label: tags
      - source_labels: [__meta_netbox_tenant_slug]
        target_label: tenant

  # ... Windows same except prom_windows and port 9182 ...

  - job_name: snmp
    scrape_interval: 15s
    http_sd_configs:
      # See also https://github.com/netbox-community/netbox/issues/11538#issuecomment-1635839720
      - url: https://netbox.example.net/api/plugins/prometheus-sd/devices/?tag=prom_snmp&status=active
        refresh_interval: 10m
        authorization:
          type: Token
          credentials_file: /etc/prometheus/netbox.token
      - url: https://netbox.example.net/api/plugins/prometheus-sd/virtual-machines/?tag=prom_snmp&status=active
        refresh_interval: 10m
        authorization:
          type: Token
          credentials_file: /etc/prometheus/netbox.token
    metrics_path: /snmp
    relabel_configs:
      # Labels which control scraping
      - source_labels: [__address__]
        target_label: instance
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__meta_netbox_primary_ip]
        regex: '(.+)'
        target_label: __param_target
      - source_labels: [__meta_netbox_custom_field_snmp_module]
        target_label: __param_module
      # Ugh: multiselect is of form ['foo','bar'] and we need foo,bar. There is no gsub.
      - source_labels: [__param_module]
        regex: "\\['(.*)'\\]"
        target_label: __param_module
      - source_labels: [__param_module]
        regex: "(.*)', *'(.*)"
        replacement: "$1,$2"
        target_label: __param_module
      - source_labels: [__param_module]
        regex: "(.*)', *'(.*)"
        replacement: "$1,$2"
        target_label: __param_module
      - source_labels: [__param_module]
        regex: "(.*)', *'(.*)"
        replacement: "$1,$2"
        target_label: __param_module
      - source_labels: [__meta_netbox_custom_field_snmp_auth]
        target_label: __param_auth
      - target_label: __address__
        replacement: 127.0.0.1:9116  # SNMP exporter
      # Optional extra metadata labels
      - source_labels: [__param_module]
        target_label: module
      - source_labels: [__meta_netbox_cluster_slug]
        target_label: cluster
      - source_labels: [__meta_netbox_device_type_slug]
        target_label: device_type
      - source_labels: [__meta_netbox_model]
        target_label: netbox_model
      - source_labels: [__meta_netbox_platform_slug]
        target_label: platform
      - source_labels: [__meta_netbox_role_slug]
        target_label: role
      - source_labels: [__meta_netbox_site_slug]
        target_label: site
      - source_labels: [__meta_netbox_tag_slugs]
        target_label: tags
      - source_labels: [__meta_netbox_tenant_slug]
        target_label: tenant

  # Monitoring of netbox itself
  - job_name: netbox
    scrape_interval: 2m
    scheme: https
    authorization:
      type: Token
      credentials_file: /etc/prometheus/netbox.token
    static_configs:
      - targets: ['netbox.example.net:443']

  # Monitoring of other generic exporters
  - job_name: exporter
    scrape_interval: 1m
    static_configs:
      - targets:
          - localhost:9115  # blackbox_exporter
          - localhost:9116  # snmp_exporter

For blackbox_exporter, I just maintain the files statically; it might be possible from Netbox services with custom fields but I haven't thought about it too much, and in any case there are services I want to monitor which are not attached to Netbox devices or VMs. Scrape config:

  - job_name: blackbox
    file_sd_configs:
      - files:
          - /etc/prometheus/blackbox.d/*.yml
    metrics_path: /probe
    relabel_configs:
      - source_labels: [__address__]
        regex: '([^ ]+)'    # name or address only
        target_label: instance
      - source_labels: [__address__]
        regex: '([^ ]+)'    # name or address only
        target_label: __param_target
      - source_labels: [__address__]
        regex: '(.+) (.+)'  # name address
        target_label: instance
        replacement: '${1}'
      - source_labels: [__address__]
        regex: '(.+) (.+)'  # name address
        target_label: __param_target
        replacement: '${2}'
      - source_labels: [module]
        target_label: __param_module
      - target_label: __address__
        replacement: 127.0.0.1:9115  # Blackbox exporter

The files in blackbox.d look like this:

- labels:
    module: icmp
  targets:
    - dns.google-pri 8.8.8.8
    - dns.google-sec 8.8.4.4

- labels:
    module: certificate
  targets:
    - gw1.example.net:8443
    - gw2.example.net:8443

This lets me scrape by IP address but give a different instance label, if I so choose.
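
To make the effect of those four relabel rules easier to follow, here is a small Python sketch that mimics them. It relies on the fact that Prometheus anchors relabel regexes at both ends, so `re.fullmatch` is the closest equivalent; the function name `split_target` is made up for this example.

```python
import re


def split_target(address):
    """Return (instance, probe_target) for one entry in a blackbox.d file.

    "name address" entries are split at the space; entries without a
    space use the same value for both, as the relabel rules above do.
    """
    match = re.fullmatch(r"(.+) (.+)", address)
    if match:
        return match.group(1), match.group(2)
    return address, address


print(split_target("dns.google-pri 8.8.8.8"))
print(split_target("gw1.example.net:8443"))
```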

As for "disable monitoring for some services, or set different service priorities, which determine alerting behaviour": disabling monitoring and disabling alerting are two different things. To disable monitoring I use the Status field, i.e. I only monitor devices/VMs with status "active". I would generally disable alerting via labels, typically by creating silences in alertmanager if this is a short-term thing.

Alertmanager routing rules use tenant and/or role to route alerts. If you wanted additional control via Netbox you'd add some additional label(s) and match them in your alertmanager routing rules; or use regexp matches in alertmanager to match the Netbox "tags" directly.

7 replies

xkilian Nov 17, 2023

Absolutely correct that services should be bound to an IP, as the IP could move from one VM/device to another.
Thanks for the monitoring integration breakdown with prometheus. Much appreciated.

candlerb Nov 17, 2023

I think that "services" in Netbox can be used for two complementary source-of-truth purposes:

1. Activating services on a device or VM. In that case, you may wish to bind the service to a subset of interface IPs, but if you don't bind then it's available on "all IPs". (For service activation you probably need to add tags or a custom field to define which service you're enabling, since the UDP/TCP port by itself is probably not sufficient)
2. Monitoring services. (Again, you probably need a custom field to say how to monitor the service)

Most probably, everything you activate you also want to be monitoring. But you might also want to monitor things you don't activate yourself, such as hosted services / SaaS. In that case, the monitoring target could be an IP address (e.g. 8.8.8.8) but more likely it is a hostname, since the provider controls what IP address it resolves to.

etfz Nov 22, 2023
Author

Alright, so service discovery using the Prometheus SD plugin works near perfectly. What I have done so far is I have added a boolean custom field "is monitored" that's used on the service, device and virtual machine models. Every monitored service named "HTTP" I use for the HTTP probe. I am not concerned with HTTP hosts for now; only that the web server is operational. I might do individual website probing in some other way, and I might add a custom hostname field to make sure it probes the "default" site. Then I use all monitored devices and VMs for the ICMP probe. Something like this:

  - job_name: "icmp"
    metrics_path: /probe
    params:
      module: [icmp]
    http_sd_configs:
      - url: https://netbox/api/plugins/prometheus-sd/devices/?cf_monitor=true&status=active
      - url: https://netbox/api/plugins/prometheus-sd/virtual-machines/?cf_monitor=true&status=active
    relabel_configs:
      - source_labels: [__meta_netbox_primary_ip]
        target_label: __param_target
      - source_labels: [__meta_netbox_name]
        target_label: instance
      - target_label: __address__
        replacement: blackbox_exporter:9115

  - job_name: "http"
    metrics_path: /probe
    params:
      module: [http_2xx]
    http_sd_configs:
      - url: https://netbox/api/plugins/prometheus-sd/services/?cf_monitor=true&name=HTTP
    relabel_configs:
      - source_labels: [__meta_netbox_primary_ip]
        target_label: __param_target
      - source_labels: [__meta_netbox_parent]
        target_label: instance
      - target_label: __address__
        replacement: blackbox_exporter:9115

I think this makes sense. Similarly for MySQL, SSH and RDP. Probably combine those in a single TCP probe job, if possible. The name-based service match is not ideal. I guess using a probe type custom field like you mentioned might be better. For the few services I have set up so far I have basically just used the protocol name as the service name, which might not always be very useful.

But then I do indeed have additional exporters such as node and SNMP exporter, which are not really services, NetBox semantically speaking. On the other hand, you could reasonably argue that they would fall under the "is monitored" custom field, but you might, of course, not always want to apply all of them. Perhaps a multiple selection custom field.

candlerb Nov 22, 2023

Thanks for closing the loop and confirming this works; I may use the sd "services" approach myself in future.

I agree that SNMP and node exporter aren't really monitoring a "service", they are monitoring the device or VM itself - hence why I use tags on the device/VM to enable this. Arguably if there were something that doesn't support SNMP or an exporter, then you might want to monitor it by ICMP instead (which could also be a tag). Or "is monitored" could become a selection list, which seems quite a nice way to do it as long as you don't want to monitor the same device in more than one way. (Multi-select lists with netbox-prometheus-sd are messy)

> Similarly for MySQL, SSH and RDP. Probably combine those in a single TCP probe job, if possible.

That should be fine, except it's unclear how that would interact with "ports" if there's more than one, since you can't use rewriting rules to expand one SD target to multiple scrape targets. That's unless prometheus-sd can expand multiple ports into multiple service entries, but apparently it doesn't:

    "labels": {
...
      "__meta_netbox_ports": "9999,9093,9998,9095",
...
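
Since relabelling cannot fan one target out into several, the expansion would have to happen before Prometheus sees the targets, for example in a small proxy sitting between the plugin and Prometheus. A rough sketch of that step, assuming the proxy already has the parsed SD JSON; the function `expand_ports` and the extra label `__meta_netbox_port` are hypothetical, only `__meta_netbox_ports` comes from the plugin output above:

```python
def expand_ports(group):
    """Split one HTTP SD target group into one group per port.

    Ports are read from the comma-separated "__meta_netbox_ports" label;
    a group without that label is returned unchanged.
    """
    labels = group.get("labels", {})
    raw = labels.get("__meta_netbox_ports", "")
    ports = [p.strip() for p in raw.split(",") if p.strip()]
    if not ports:
        return [group]
    expanded = []
    for port in ports:
        new_labels = dict(labels)
        # Hypothetical extra label carrying the single port of this entry.
        new_labels["__meta_netbox_port"] = port
        expanded.append({"targets": list(group["targets"]), "labels": new_labels})
    return expanded


group = {
    "targets": ["web1"],
    "labels": {"__meta_netbox_ports": "9999,9093,9998,9095"},
}
for entry in expand_ports(group):
    print(entry["targets"], entry["labels"]["__meta_netbox_port"])
```

A relabel rule would still have to copy the per-entry port into a permanent label or into the probe target, because `__meta_*` labels are dropped after relabelling and the four entries would otherwise collapse into one.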

Also, I'd want to be able to set the "module" parameter for each probe, as usually I want to do a better test than just opening a TCP connection. Maybe I would just choose the service name in Netbox to match the "module" parameter in blackbox_exporter, which forces a new module to be created for each distinct service type.

The "target" parameter wants to be the service IP that you've bound to; again, it's a problem if there's more than one, but you could just pick the first and test that. If one isn't set then you can fall back to the primary_ip of the device/VM, which is provided as __meta_netbox_primary_ip.

Some care is needed with the "instance" label. I'd probably use the "parent" label for that, which together with "module" ought to give a unique-ish label set. I say "unique-ish" because it's possible to have multiple devices with the same name in multiple sites and/or tenants, and it's possible to have devices and VMs with the same names, so maybe the service_id needs to be added as a label too just to be sure (from __meta_netbox_id). The parent id (e.g. device_id or virtualmachine_id) is not available from SD.
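
The target fallback and the label choice described in the last two paragraphs could be sketched like this in Python. It is only an illustration of the selection logic: the function `probe_labels` and the `service_ips` argument are hypothetical, while the `__meta_netbox_*` names are the ones mentioned above.

```python
def probe_labels(meta, service_ips=()):
    """Build blackbox probe labels for one service discovered via SD.

    meta: the label dict of one SD target.
    service_ips: hypothetical list of IPs the service is bound to.
    """
    # Prefer the first bound IP; otherwise fall back to the parent's primary IP.
    target = service_ips[0] if service_ips else meta["__meta_netbox_primary_ip"]
    return {
        "__param_target": target,
        "instance": meta["__meta_netbox_parent"],
        # The service ID disambiguates parents that share a name.
        "service_id": meta["__meta_netbox_id"],
    }


meta = {
    "__meta_netbox_primary_ip": "192.0.2.10",
    "__meta_netbox_parent": "web1",
    "__meta_netbox_id": "42",
}
print(probe_labels(meta))
print(probe_labels(meta, ["192.0.2.20", "192.0.2.21"]))
```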

etfz Nov 23, 2023
Author

> Or "is monitored" could become a selection list, which seems quite a nice way to do it as long as you don't want to monitor the same device in more than one way. (Multi-select lists with netbox-prometheus-sd are messy)

I was thinking just use query parameters to filter them, and still use separate jobs. I think multiple selection works for that.

  - job_name: "node"
    http_sd_configs:
      - url: https://netbox/api/plugins/prometheus-sd/devices/?cf_monitoring_methods=node

  - job_name: "snmp"
    http_sd_configs:
      - url: https://netbox/api/plugins/prometheus-sd/devices/?cf_monitoring_methods=snmp

> That should be fine, except it's unclear how that would interact with "ports" if there's more than one, since you can't use rewriting rules to expand one SD target to multiple scrape targets.

I think it makes sense to define separate services in that case, as each individual service normally only uses a single port. At least, for the above-mentioned protocols.

> Also, I'd want to be able to set the "module" parameter for each probe, as usually I want to do a better test than just opening a TCP connection. Maybe I would just choose the service name in Netbox to match the "module" parameter in blackbox_exporter, which forces a new module to be created for each distinct service type.

Probably, but do you mean custom made modules? Because except for HTTP I can't see any modules more specialised than TCP. Feels like I would need to involve a lot of credentials in order to fully test those services.

> The "target" parameter wants to be the service IP that you've bound to; again, it's a problem if there's more than one, but you could just pick the first and test that. If one isn't set then you can fall back to the primary_ip of the device/VM, which is provided as __meta_netbox_primary_ip

Yeah, I'm just using the parent addresses for now. I have very few hosts with multiple IP addresses, and I don't anticipate having services bound to multiple addresses. For that matter, I am largely setting up these services for the purpose of monitoring, so I will just tailor them accordingly (while still not abusing NetBox modeling philosophy). Although, come to think of it, that's a lot of extra work compared to just using the multiple selection custom field... I'll have to think about it.

> Some care is needed with the "instance" label. I'd probably use the "parent" label for that, which together with "module" ought to give a unique-ish label set. I say "unique-ish" because it's possible to have multiple devices with the same name in multiple sites and/or tenants, and it's possible to have devices and VMs with the same names, so maybe the service_id needs to be added as a label too just to be sure (from __meta_netbox_id). The parent id (e.g. device_id or virtualmachine_id) is not available from SD.

Good point!

candlerb · 2023-11-23T12:46:35Z

> I was thinking just use query parameters to filter them, and still use separate jobs. I think multiple selection works for that.

You're right - just tested and it does.

The awkward part I found is with snmp_exporter, where you might have multiple modules selected and you want to pass module=foo,bar,baz as a parameter. prometheus-sd gives you literally

"__meta_netbox_custom_field_snmp_module": "['foo', 'bar', 'baz']",

(with single quotes - not even JSON!)
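
If a translation layer sits in front of Prometheus anyway, that string is easy to turn into the comma-separated form there instead of with chained relabel rules. A minimal sketch, relying on the fact that the value happens to be a Python list literal; the function name `modules_param` is made up:

```python
import ast


def modules_param(raw):
    """Turn "['foo', 'bar', 'baz']" into "foo,bar,baz".

    Values that are not a list literal (e.g. a plain module name)
    are returned unchanged.
    """
    try:
        value = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return raw
    if isinstance(value, (list, tuple)):
        return ",".join(str(item) for item in value)
    return raw


print(modules_param("['foo', 'bar', 'baz']"))
print(modules_param("if_mib"))
```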

> > Also, I'd want to be able to set the "module" parameter for each probe, as usually I want to do a better test than just opening a TCP connection. Maybe I would just choose the service name in Netbox to match the "module" parameter in blackbox_exporter, which forces a new module to be created for each distinct service type.

> Probably, but do you mean custom made modules? Because except for HTTP I can't see any modules more specialised than TCP. Feels like I would need to involve a lot of credentials in order to fully test those services.

As you've found, blackbox_exporter is capable of much more than just establishing a TCP connection. You can do HTTP(S) exchanges, you can specify the path, you can provide authentication, you can check the headers and/or body in the response etc.

Whilst you might not necessarily want to exercise the whole service (and indeed that could take multiple tests), you might at least want to check that the login page is displayed correctly, with some expected content, that it is not returning a 500 error, and that the certificate is valid.

Perhaps in future blackbox_exporter will be expanded to allow a list of modules in the same call, which would be neat.

candlerb · 2023-11-30T22:13:11Z

Having thought about this a bit more, and having looked at how some monitoring platforms like check_mk handle this, I'm starting to think that config contexts rather than services are the best way to handle this.

Config contexts can be assigned to all devices with a particular role, or in a site, or with a given tag, or with a particular platform. This means you wouldn't have to set up specific monitoring rules for every single one, but they could all inherit a consistent set of monitoring.

Given the deep merge, you could put everything under a single "monitoring" key and still be able to override selectively. A single service might be like this:

{
    "monitoring": {
        "ssh": {
            "module": "tcp",
            "target": "primary_ip4:22"
        }
    }
}

Using the Config Context tab on an individual device you could see exactly what's being monitored on that device; but the generation of this complete context can be inherited from multiple levels, and overridden at device level only if necessary. Using tags to generate most of it would mean most manual JSON writing could be avoided.

Anyway, it's just an idea I thought was worth mentioning. The downside is that this wouldn't work using the out-of-the-box prometheus sd plugin.
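
To illustrate the override idea, here is a small Python sketch of a recursive merge applied to two contexts. It is not NetBox's own merge code and may differ from it in details; the function `deep_merge`, the two example contexts and the module name `ssh_banner` are assumptions made for the example.

```python
def deep_merge(base, override):
    """Recursively merge two dicts; values in override win on conflict."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


# Context inherited from, say, a device role.
role_context = {
    "monitoring": {"ssh": {"module": "tcp", "target": "primary_ip4:22"}}
}
# Device-level override that only changes the probe module.
device_context = {"monitoring": {"ssh": {"module": "ssh_banner"}}}

print(deep_merge(role_context, device_context))
```

The device keeps the inherited target and only replaces the module, which is the selective override described above. A custom discovery endpoint or script would then have to walk the merged "monitoring" key and emit one target per entry.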

