Upgrade to Vault 1.14.x

The Vault 1.14.x upgrade guide contains information on deprecations, important or breaking changes, and remediation recommendations for anyone upgrading from Vault 1.13. Please read carefully.

Feature deprecations and EOL

Duplicative Docker images

As of Vault 1.14, we will no longer update the vault Docker image. Only the Verified Publisher hashicorp/vault image will be updated on DockerHub.

Users of Official Images need to use docker pull hashicorp/vault:<version> instead of docker pull vault:<version> to get newer versions of Vault in Docker images. Currently, HashiCorp publishes and updates identical Docker images of Vault as Verified Publisher and Official images separately.
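
For build files that still reference the old image name, the switch can be scripted. The following is a minimal sketch, assuming Dockerfiles whose FROM lines name the image directly; rewrite_vault_image is a hypothetical helper name, not part of Docker or Vault.

```shell
#!/bin/sh
# Rewrite Dockerfile FROM lines that reference the Official "vault" image
# so they point at the Verified Publisher "hashicorp/vault" image instead.
# Reads a Dockerfile on stdin and writes the result to stdout.
# rewrite_vault_image is a hypothetical helper name used for illustration.
rewrite_vault_image() {
  # Match "vault" only when it is the whole image name, i.e. followed by
  # a tag (:), a digest (@), whitespace, or end of line.
  sed -E 's#^(FROM[[:space:]]+)vault(:|@|[[:space:]]|$)#\1hashicorp/vault\2#'
}
```

For example, rewrite_vault_image < Dockerfile > Dockerfile.new changes a line such as FROM vault:1.13.3 to FROM hashicorp/vault:1.13.3 and leaves lines that already use hashicorp/vault untouched.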

Important changes

vault.raft_storage.bolt.write.time has been corrected from a summary to a counter to more accurately reflect that it is measuring cumulative time writing, and not the distribution of individual write times.

Application of Sentinel Role Governing Policies (RGPs) via identity groups

As of versions 1.15.0, 1.14.4, and 1.13.8, the Sentinel RGPs derived from membership in identity groups apply only to entities in the same and child namespaces, relative to the identity group.

Also, the group_policy_application_mode only applies to ACL policies. Vault Sentinel Role Governing Policies (RGPs) are not affected by group policy application mode.

Activity Log Changes

Disable client counting activity

License utilization cannot be reported if client counting is disabled.

In Vault Enterprise 1.14.10 and later, client counting cannot be disabled using the /sys/internal/counters/config endpoint because manual license utilization reporting is always enabled.

In Vault Enterprise 1.14.0 and later, client counting cannot be disabled using the /sys/internal/counters/config endpoint when automated license utilization reporting is enabled.

Known issues and workarounds

Users limited by control groups can only access issuer detail from PKI overview page

Affected versions

Vault 1.14.x

Issue

Vault UI users who require control group approval to read issuer details are directed to the Control Group Access page when they try to view issuer details from links on the Issuer list page.

Workaround

Vault UI users constrained by control groups should select issuers from the PKI overview page to view detailed information instead of the Issuers list page.

API calls to update-primary may lead to data loss

Affected versions

All versions of Vault before 1.14.1, 1.13.5, 1.12.9, and 1.11.12.

Issue

The update-primary endpoint temporarily removes all mount entries except for those that are managed automatically by Vault (e.g., identity mounts). In certain situations, a race condition between mount table truncation and replication repairs may lead to data loss when updating secondary replication clusters.

Situations where the race condition may occur:

When the cluster has local data (e.g., PKI certificates, app role secret IDs) in shared mounts. Calling update-primary on a performance secondary with local data in shared mounts may corrupt the merkle tree on the secondary. The secondary still contains all the previously stored data, but the corruption means that downstream secondaries will not receive the shared data and will interpret the update as a request to delete the information. If the downstream secondary is promoted before the merkle tree is repaired, the newly promoted secondary will not contain the expected local data. The missing data may be unrecoverable if the original secondary is lost or destroyed.
When the cluster has Allow paths defined. As of Vault 1.0.3.1, startup, unseal, and calling update-primary all trigger a background job that looks at the current mount data and removes invalid entries based on path filters. When a secondary has Allow path filters, the cleanup code may misfire in the window of time after update-primary truncates the mount tables but before the mount tables are rewritten by replication. The cleanup code deletes data associated with the missing mount entries but does not modify the merkle tree. Because the merkle tree remains unchanged, replication will not know that the data is missing and needs to be repaired.

Workaround 1: PR secondary with local data in shared mounts

Watch for the message "cleaning key in merkle tree" in the TRACE log immediately after an update-primary call on a PR secondary. The message indicates that the merkle tree may be corrupt. Repair the merkle tree by issuing a replication reindex request to the PR secondary.

If TRACE logs are no longer available, we recommend pre-emptively reindexing the PR secondary as a precaution.

Workaround 2: PR secondary with "Allow" path filters

Watch for the message "deleted mistakenly stored mount entry from backend" in the INFO log. Reindex the performance secondary to update the merkle tree with the missing data and allow replication to disseminate the changes. You will not be able to recover local data on shared mounts (e.g., PKI certificates).

If INFO logs are no longer available, query the shared mount in question to confirm whether your role and configuration data are present on the primary but missing from the secondary.
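
The log checks from both workarounds can be sketched as a single helper. This is a minimal sketch, assuming Vault logs are written to a file you can read; merkle_risk_in_log is a hypothetical helper name and the log path in the usage comment is an assumption. The two search strings are the messages named in the workarounds above.

```shell
#!/bin/sh
# Return success (0) if the given log file contains either message that
# signals possible merkle tree corruption after an update-primary call.
# merkle_risk_in_log is a hypothetical helper name used for illustration.
merkle_risk_in_log() {
  log_file=$1
  # -F treats the patterns as fixed strings; -q suppresses output.
  grep -F -q \
    -e 'cleaning key in merkle tree' \
    -e 'deleted mistakenly stored mount entry from backend' \
    "$log_file"
}

# Usage (not run here; the log path is an assumption):
# if merkle_risk_in_log /var/log/vault/vault.log; then
#   # Issue a replication reindex request against the PR secondary.
#   vault write -f sys/replication/reindex
# fi
```

The helper only detects the messages; the reindex request itself is left as an explicit operator step because reindexing can be expensive on large clusters.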

Using 'update_primary_addrs' on a demoted cluster causes Vault to panic

Affected versions

1.13.3, 1.13.4 & 1.14.0

Issue

If the update_primary_addrs parameter is used on a recently demoted cluster, Vault will panic due to no longer having information about the primary cluster.

Workaround

Instead of using update_primary_addrs on the recently demoted cluster, provide an activation token.
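
As a rough sketch of this workaround, the call below passes an activation token to the update-primary endpoint instead of update_primary_addrs. It assumes performance replication and a token that was already generated on the current primary; update_primary_with_token is a hypothetical wrapper name.

```shell
#!/bin/sh
# Point a demoted performance secondary at the primary by supplying an
# activation token rather than update_primary_addrs.
# update_primary_with_token is a hypothetical wrapper used for illustration.
update_primary_with_token() {
  token=$1
  if [ -z "$token" ]; then
    echo "activation token required" >&2
    return 1
  fi
  vault write sys/replication/performance/secondary/update-primary \
    token="$token"
}

# Usage (not run here): run against the demoted cluster.
# update_primary_with_token "<activation-token>"
```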

Safari login screen appears broken on the UI

Affected versions

1.14.0

Issue

The login screen on Safari appears to be broken, presenting as a blank white screen.

Workaround

Scroll down to find the login section.

AWS static roles ignore changes to rotation period

Affected versions

1.14.0+

Issue

AWS static roles currently ignore configuration changes made to the key rotation period. As a result, Vault will continue to use whatever rotation period was set when the roles were originally created.

Workaround

Delete and recreate any static role objects that should use the new rotation period.
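
The delete-and-recreate step can be sketched as follows. This is an illustration only: it assumes the AWS secrets engine is mounted at the path you pass in, and recreate_static_role is a hypothetical helper name. The role name, IAM username, and period in the usage comment are placeholders.

```shell
#!/bin/sh
# Delete an AWS static role and recreate it so that the new rotation
# period takes effect.
# recreate_static_role is a hypothetical helper used for illustration.
# Arguments: <mount path> <role name> <IAM username> <rotation period>
recreate_static_role() {
  mount=$1
  role=$2
  username=$3
  period=$4
  # Only recreate the role if the delete succeeded.
  vault delete "$mount/static-roles/$role" &&
    vault write "$mount/static-roles/$role" \
      username="$username" \
      rotation_period="$period"
}

# Usage (not run here; all values are placeholders):
# recreate_static_role aws my-role my-iam-user 24h
```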

Transit Encryption with Cloud KMS managed keys causes a panic

Affected versions

1.13.1 through 1.13.8, inclusive
1.14.0 through 1.14.4, inclusive
1.15.0

Issue

Vault panics when it receives a Transit encryption API call that is backed by a Cloud KMS managed key (Azure, GCP, AWS).

Note

The issue does not affect encryption and decryption with the following key types:

PKCS#11 managed keys
Transit native keys

Workaround

None at this time

Transit Sign API calls with managed keys fail

Affected versions

1.14.0 through 1.14.4, inclusive
1.15.0

Issue

Vault responds to Transit sign API calls with the following error when the request uses a managed key:

requested version for signing does not contain a private part

Note

The issue does not affect signing with the following key types:

Transit native keys

Workaround

None at this time

Collapsed navbar does not allow you to click inside the console or namespace picker

Affected versions

The UI issue affects Vault versions 1.14.0+ and 1.15.0+. A fix is expected for Vault 1.16.0.

Issue

The Vault UI currently uses a version of HDS that does not allow users to click within collapsed elements. In particular, the dev console and namespace picker become inaccessible when viewing the components in smaller viewports.

Workaround

Expand the width of the screen until you deactivate the collapsed view. Once the full navbar is displayed, click the desired components.

User lockout potential double logging

Affected versions

1.14.5

Issue

A logging enhancement was released in 1.14.5 earlier than intended. In some cases, the additional logging does not trigger, or it generates duplicate log entries.

Internal error when vault policy in namespace does not exist

If a user is a member of a group that gets a policy from a namespace other than the one they’re trying to log into, and that policy doesn’t exist, Vault returns an internal error. This impacts all auth methods.

Affected versions

1.13.8 and 1.13.9
1.14.4 and 1.14.5
1.15.0 and 1.15.1

A fix has been released in Vault 1.13.10, 1.14.6, and 1.15.2.

Workaround

During authentication, Vault derives inherited policies based on the groups an entity belongs to. Vault returns an internal error when attaching the derived policy to a token when:

the token belongs to a different namespace than the one handling authentication, and
the derived policy does not exist under the namespace.

You can resolve the error by adding the policy to the relevant namespace or deleting the group policy mapping that uses the derived policy.

As an example, consider the following userpass auth method failure. The login fails because Vault expects a group policy that does not exist under the namespace.

# Failed login
$ vault login -method=userpass username=user1 password=123
Error authenticating: Error making API request.

URL: PUT http://127.0.0.1:8200/v1/auth/userpass/login/user1
Code: 500. Errors:

* internal error

To confirm the problem is a missing policy, start by identifying the relevant entity and group IDs:

$ vault read -format=json identity/entity/name/user1 | \
  jq '{"entity_id": .data.id, "group_ids": .data.group_ids} '
{
  "entity_id": "420c82de-57c3-df2e-2ef6-0690073b1636",
  "group_ids": [
    "6cb152b7-955d-272b-4dcf-a2ed668ca1ea"
  ]
}

Use the group ID to fetch the relevant policies for the group under the ns1 namespace:

$ vault read -format=json -namespace=ns1 \
  identity/group/id/6cb152b7-955d-272b-4dcf-a2ed668ca1ea | \
  jq '.data.policies'
[
  "group_policy"
]

Now that we know Vault is looking for a policy called group_policy, we can check whether that policy exists under the ns1 namespace:

$ vault policy list -namespace=ns1
default

The only policy in the ns1 namespace is default, which confirms that the missing policy (group_policy) is causing the error.

To fix the problem, we can either remove the missing policy from the 6cb152b7-955d-272b-4dcf-a2ed668ca1ea group or create the missing policy under the ns1 namespace.

To remove group_policy from group ID 6cb152b7-955d-272b-4dcf-a2ed668ca1ea, use the vault write command to set the applicable policies to just include default:

$ vault write                                             \
  -namespace=ns1                                          \
  identity/group/id/6cb152b7-955d-272b-4dcf-a2ed668ca1ea  \
  name="test"                                             \
  policies="default"

To create the missing policy, use vault policy write and define the appropriate capabilities:

$ vault policy write -namespace=ns1 group_policy - << EOF
    path "secret/data/*" {
        capabilities = ["create", "update"]
    }
EOF

Verify the fix by re-running the login command:

$ vault login -method=userpass username=user1 password=123

Vault is storing references to ephemeral sub-loggers leading to unbounded memory consumption

Affected versions

This memory consumption bug affects Vault Community and Enterprise versions:

1.13.7 - 1.13.9
1.14.3 - 1.14.5
1.15.0 - 1.15.1

The change that introduced this bug has been reverted as of 1.13.10, 1.14.6, and 1.15.2.

Issue

Vault is unexpectedly storing references to ephemeral sub-loggers, which prevents them from being cleaned up and leads to unbounded memory consumption for loggers. The bug came from a change that addressed a previously known issue where sub-logger levels were not adjusted on reload. It impacts many areas of Vault, but primarily logins in Enterprise.

Workaround

There is no workaround.

Sublogger levels not adjusted on reload

Affected versions

This issue affects all Vault Community and Vault Enterprise versions.

Issue

Vault does not honor a modified log_level configuration for certain subsystem loggers on SIGHUP.

The issue is known to specifically affect resolver.watcher and replication.index.* subloggers.

After modifying the log_level and issuing a reload (SIGHUP), some loggers are updated to reflect the new configuration, while some subsystem logger levels remain unchanged.

For example, after starting a server with log_level: "trace" and modifying it to log_level: "info" the following lines appear after reload:

[TRACE] resolver.watcher: dr mode doesn't have failover support, returning
...
[DEBUG] replication.index.perf: saved checkpoint: num_dirty=5
[DEBUG] replication.index.local: saved checkpoint: num_dirty=0
[DEBUG] replication.index.periodic: starting WAL GC: from=2531280 to=2531280 last=2531536

Workaround

The workaround is to restart the Vault server.

Fatal error during expiration metrics gathering causing Vault crash

Affected versions

This issue affects Vault Community and Enterprise versions:

1.13.9
1.14.5
1.15.1

A fix has been issued in Vault 1.13.10, 1.14.6, and 1.15.2.

Issue

A recent change to Vault that improved the speed of state changes (e.g., becoming active or standby) introduced a concurrency issue that can lead to concurrent iteration and write on a map, causing a fatal error and crashing Vault. The error occurs when gathering lease and token metrics from the expiration manager. These metrics originate from the active node in an HA cluster, so a standby node will take over active duties and the cluster will remain functional if the original active node encounters the bug. The new active node is vulnerable to the same bug, but may not encounter it immediately.

There is no workaround.

Deadlock can occur on performance secondary clusters with many mounts

Affected versions

1.15.0 - 1.15.5
1.14.5 - 1.14.9
1.13.9 - 1.13.13

Issue

Vault 1.15.0, 1.14.5, and 1.13.9 introduced a worker pool to schedule periodic rollback operations on all mounts. This worker pool defaulted to using 256 workers. The worker pool introduced a risk of deadlocking on the active node of performance secondary clusters, leaving that cluster unable to service any requests.

The conditions required to cause the deadlock on the performance secondary:

Performance replication is enabled
The performance primary cluster has more than 256 non-local mounts. The more mounts the cluster has, the more likely the deadlock becomes
One of the following occurs:
- A replicated mount is unmounted or remounted OR
- A replicated namespace is deleted OR
- Replication path filters are used to filter at least one mount or namespace

Workaround

Set the VAULT_ROLLBACK_WORKERS environment variable to a number larger than the number of mounts in your Vault cluster and restart Vault:

$ export VAULT_ROLLBACK_WORKERS=1000
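
One way to choose the value is to count mounts and add headroom. The sketch below is illustrative only: rollback_workers_for is a hypothetical helper name, the default headroom of 100 is an arbitrary assumption, and the usage comment assumes jq is installed and counts mounts in the current namespace only.

```shell
#!/bin/sh
# Print a worker count that is larger than the given number of mounts.
# rollback_workers_for is a hypothetical helper used for illustration.
# Arguments: <mount count> [headroom, default 100 (an assumption)]
rollback_workers_for() {
  mounts=$1
  headroom=${2:-100}
  echo $((mounts + headroom))
}

# Usage (not run here; requires jq, current namespace only):
# secrets=$(vault secrets list -format=json | jq 'length')
# auths=$(vault auth list -format=json | jq 'length')
# export VAULT_ROLLBACK_WORKERS=$(rollback_workers_for $((secrets + auths)))
```

If the cluster uses namespaces, the count needs to include mounts in every namespace before the result is a safe value.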

PKI OCSP GET requests can return HTTP redirect responses

If a base64 encoded OCSP request contains consecutive '/' characters, the GET request will return a 301 permanent redirect response. If the redirection is followed, the request will not decode as it will not be a properly base64 encoded request.

As a workaround, use OCSP POST requests, which are unaffected.
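
A client can check an encoded request before deciding between GET and POST. The following is a minimal sketch; ocsp_get_would_redirect is a hypothetical helper name, and the pki mount path, file names, and VAULT_ADDR variable in the usage comment are assumptions.

```shell
#!/bin/sh
# Return success (0) if a base64-encoded OCSP request contains
# consecutive '/' characters and would therefore trigger the redirect.
# ocsp_get_would_redirect is a hypothetical helper used for illustration.
ocsp_get_would_redirect() {
  case "$1" in
    *//*) return 0 ;;
    *)    return 1 ;;
  esac
}

# Usage (not run here; paths and file names are assumptions):
# openssl ocsp -issuer issuer.pem -cert cert.pem -reqout req.der
# req_b64=$(base64 < req.der | tr -d '\n')
# if ocsp_get_would_redirect "$req_b64"; then
#   # Fall back to POST, which is unaffected.
#   curl --silent --header 'Content-Type: application/ocsp-request' \
#     --data-binary @req.der "$VAULT_ADDR/v1/pki/ocsp"
# fi
```

Sending every OCSP request as a POST avoids the check entirely and is the simpler option when the client allows it.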

Impacted versions

Affects all current versions of 1.12.x, 1.13.x, 1.14.x, 1.15.x, 1.16.x, 1.17.x.

Performance Standbys revert to Standby mode on unseal

Affected versions

1.14.12
1.15.8
1.16.2

Issue

If you previously set a value for retention_months via the sys/internal/counters/config endpoint, upgrading to Vault Enterprise versions 1.14.12, 1.15.8, and 1.16.2 will cause Performance Standby nodes to revert to Standby mode.

Nodes running Vault Enterprise versions 1.14.12, 1.15.8, or 1.16.2 that join a cluster with an older-versioned leader will see any previously set retention_months value and attempt to write the new minimum value of 48. The storage write results in a read-only error:

[ERROR] core: performance standby post-unseal setup failed: error="cannot write to readonly storage"

You can verify the status of your nodes by checking the /sys/health endpoint.
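
With default settings, the /sys/health endpoint signals the node role through its HTTP status code, so a check can be scripted without parsing the response body. The sketch below assumes those default status codes have not been overridden with query parameters; health_role is a hypothetical helper name.

```shell
#!/bin/sh
# Map the default HTTP status codes returned by /sys/health to a node role.
# health_role is a hypothetical helper used for illustration.
health_role() {
  case "$1" in
    200) echo "active" ;;
    429) echo "standby" ;;
    473) echo "performance standby" ;;
    *)   echo "other ($1)" ;;
  esac
}

# Usage (not run here; assumes VAULT_ADDR points at the node to check):
# code=$(curl --silent --output /dev/null --write-out '%{http_code}' \
#   "$VAULT_ADDR/v1/sys/health")
# health_role "$code"
```

A node that should be a Performance Standby but reports "standby" is likely affected by this issue.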

Deployments that rely on scaling across Performance Standbys will now forward all requests to the active node, increasing the utilization of the active node.

Post-upgrade cluster membership

During the last step of a full upgrade, the old leader steps down, causing one of the Standby nodes to become leader.

A fix for the read-only storage error has been prioritized and escalated. The fix will be in releases 1.14.13, 1.15.9 and 1.16.3.

Important

If you have already upgraded to versions 1.14.12, 1.15.8, or 1.16.2, please refer to the workaround section for options.

Workaround

Once the leader of the cluster has been upgraded to version 1.14.12, 1.15.8, or 1.16.2, the workaround is to update the retention_months value on the active node via the sys/internal/counters/config endpoint:

$ vault write sys/internal/counters/config retention_months=48

This storage entry will be written to all nodes in the cluster, allowing them to immediately unseal as Performance Standbys.

After the new retention_months value is written to storage on the active node, adding new nodes to the cluster will not cause the read-only error.

Sending SIGHUP to vault standby node causes panic

Affected versions

1.13.4+
1.14.0+
1.15.0+
1.16.0+

Issue

Sending a SIGHUP to a Vault standby node running an Enterprise build can cause a panic if there is a change to the license or reporting configuration. Active and performance standby nodes are not affected. We recommend that operators stop and restart Vault nodes individually if configuration changes are required.

Workaround

Instead of issuing a SIGHUP, stop each Vault node individually, update the configuration or license, and then restart the node.
