Nitpicking Hashicorp Nomad

Davinder Pal
Published in DevOps.dev · Dec 26, 2023


Premises

  1. OSS Hashicorp Nomad
  2. OSS Hashicorp Consul
  3. My pain points from the last year
  4. KV below refers to Consul KV
  5. I am only covering a few of the issues reported by myself or the community, so please explore the GitHub issues if you are interested.

All Tasks Restart at Once When Consul/Vault Changes Happen

Now, let me explain the pain point here. Let's say you are running a task/microservice that uses the following template to render variables. Since it reads Consul KV, a change to any key will trigger a restart of all tasks/microservices simultaneously. Nomad offers no option for a rolling restart, canary restart, or anything similar, so your service has an outage every time a change happens. This was requested at least 4 years ago, and there has been little to no progress on it.

  template {
    data = <<EOF
# Read all keys in the path `app/environment` from Consul KV.
{{ range ls "app/environment" }}
{{ .Key }}={{ .Value }}
{{ end }}
EOF

    destination = "local/env"
  }
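
One partial mitigation, as far as I know, is the template block's splay parameter: it waits a random amount of time before firing the change_mode, so allocations at least don't all bounce at the same instant. It is jitter, not a rolling restart. A minimal sketch:

  template {
    data        = "..." # same template as above
    destination = "local/env"

    # On change, restart the task, but wait a random delay between
    # 0s and 60s first so allocations don't all restart together.
    change_mode = "restart"
    splay       = "60s"
  }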

Unreliable Integration with Hashicorp Consul Discovery

Again, let me explain the pain point. Let's say you are running a task/microservice that uses the following template to render the addresses/ports of dependent services into its configuration. You may then see duplicate entries or deleted entries, on top of the previous issue of all tasks restarting at the same time. These issues make the template logic very hard to get right.

  template {
    data = <<EOF
# Configuration for a single upstream service.
upstream my_app {
{{- range service "my-app" }}
server {{ .Address }}:{{ .Port }};{{- end }}
}

# Configuration for all services in the catalog.
{{ range services }}
# Configuration for service {{ .Name }}.
upstream {{ .Name | toLower }} {
{{- range service .Name }}
server {{ .Address }}:{{ .Port }};{{- end }}
}
{{ end -}}
EOF

    destination = "local/nginx.conf"
  }
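
One thing that helps a little, in my experience, is restricting the query to healthy instances so that flapping health checks churn the rendered file less. A sketch using consul-template's built-in health filter:

# Only render instances whose Consul health checks are passing.
upstream my_app {
{{- range service "my-app|passing" }}
server {{ .Address }}:{{ .Port }};
{{- end }}
}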

Inter/Intra Job Dependency Nightmare

It is a long-standing request that the Nomad orchestrator provide a way to control how jobs are deployed, updated, etc. It was requested back in 2015 with two minimal controls:
1. How to control ordering within a Job/Group.
2. How to control ordering across a Nomad environment, e.g. Job 1 depends on Job 2.

Fortunately, after 5 years, the Nomad team shipped task lifecycle hooks, which cover dependencies within a Job/Group.
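
For reference, here is a minimal sketch of such a hook (the task name and command are my own illustration): an init-style prestart task that blocks the group's main task until a dependency is reachable.

task "wait-for-db" {
  driver = "exec"

  # Run before the main tasks in the group; not a sidecar, so the
  # main tasks start only after this task exits successfully.
  lifecycle {
    hook    = "prestart"
    sidecar = false
  }

  config {
    command = "/bin/sh"
    args    = ["-c", "until nc -z db.service.consul 5432; do sleep 1; done"]
  }
}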

Unfortunately, more than 8 years later, the Nomad team still has no plan to implement inter-job dependencies.

The community has created tools to fill the gap:
1. https://github.com/sagarrakshe/nomad-dtree
2. https://github.com/joyent/containerpilot

Volume Integration Issues

So what is the pain here? Nomad supports docker/exec/java/etc. drivers, which can host a variety of applications. Some of those applications need storage, either local or network-attached, and both kinds have this problem.

Example 1 (Exec Driver)
Useful things to keep in mind:
1. The Nomad exec driver runs tasks as the nobody user.
2. Nomad does not modify volume permissions. (Kubernetes does; that is the hack it uses.)

Now, when a process running as nobody tries to write to a host path like /data/<volume>, it gets permission denied, because that is not allowed by default.
Hack: the host volume path needs 777 permissions, or nobody as its owner.

# pseudo code, please don't take it literally :)
job "kerberos" {
  datacenters = ["*"]
  type        = "service"

  group "primary" {
    volume "backup" {
      type      = "host"
      source    = "kerberos-backup"
      read_only = false
    }

    task "test" {
      driver = "exec"

      config {
        command = "sleep"
        args    = ["infinity"]
        cap_add = ["sys_chroot"]
      }

      volume_mount {
        # must reference the group-level volume name, not its source
        volume           = "backup"
        destination      = "/mnt/backups"
        propagation_mode = "bidirectional"
      }
    }
  }
}
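
A related workaround I have seen (a sketch, assuming the client is configured to allow root for exec tasks) is a privileged prestart task that fixes ownership of the volume before the main task starts:

task "fix-perms" {
  driver = "exec"
  user   = "root" # assumes the client's user.denylist permits root

  lifecycle {
    hook = "prestart"
  }

  volume_mount {
    volume      = "backup"
    destination = "/mnt/backups"
  }

  config {
    command = "chown"
    args    = ["-R", "nobody:nobody", "/mnt/backups"]
  }
}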

Example 2 (Docker/Containerd Driver)
The same problem exists with this driver as well. The interesting thing here is that all of the Nomad documentation uses the docker driver for its examples, and none of them work because of this problem, yet the Nomad team says they cannot document it since it is a feature request, lol. Again, it was requested in 2020; more than 3 years later there is no progress on development or documentation :( which is pretty lame.

No Templating/Packaging Strategy For Nomad

A bunch of tools have been developed in parallel, with no roadmap for choosing one. Before I rant more, let me list the options:
1. https://github.com/hashicorp/consul-template
2. https://github.com/hashicorp/nomad-pack
3. https://github.com/hashicorp/levant
4. https://registry.terraform.io/providers/hashicorp/nomad

Consul Template uses go-template internally and is heavily biased towards Consul.
Nomad Pack has been in tech preview for 2 years.
Levant was started in 2017; development stopped in March 2019, with no updates since.
The Terraform Nomad provider is another promising option, since Terraform templates are well known, but it adds an extra dependency on Terraform just to submit jobs to Nomad.

What should be used as of now, aka in 2024?
1. Consul Template
2. Terraform Nomad Provider (see the sketch below)
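
For what it's worth, the provider route looks roughly like this (the address and file name are my own example values):

provider "nomad" {
  address = "http://127.0.0.1:4646"
}

# Registers the job with Nomad; re-running `terraform apply`
# updates the job in place.
resource "nomad_job" "app" {
  jobspec = file("${path.module}/app.nomad.hcl")
}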

No Proper Testing of Releases

Why release testing? I am quite certain that the Nomad team does some testing, but it does not cover general cases like people upgrading Nomad from version x to y. I have been watching Nomad releases for almost a year; here are some insights:

v1.5.0 - 2 March 2023
v1.5.1 - 13 March 2023 # first bugfix release within a week
v1.5.2 - 22 March 2023 # second bugfix release within the next week

v1.6.0 - 18 July 2023
v1.6.1 - 21 July 2023 # same pattern, first bugfix release within a week

v1.7.0 - 7 Dec 2023
v1.7.1 - 8 Dec 2023 # same pattern, first bugfix release within a week
v1.7.2 - 13 Dec 2023 # second bugfix release within a week

The above release pattern has a positive side: the Nomad team actively resolves issues after a major release. The negative side: the Nomad team is not doing enough pre-release testing to catch these issues before release.

I guess this much rant is enough for one article :) and I am sorry this article was published on a holiday!
