XaBbl4 commented Jan 10, 2025

Currently, if replication.conf contains an error, replication is not initialized at all, which can lead to desynchronization of an already configured replica.

For example, a synchronous replica is configured and works correctly. The administrator decides to add asynchronous replication: stops the DBMS, adds journal_directory = /path/to/journals to the configuration, and starts the DBMS again. At first glance everything is fine, but OS-level access to the directory has not been configured. On the first connection, access to the directory is checked and an error occurs (it is written to replication.log). The user connects and can continue working with the DB, but with no replication at all, and without knowing about it.
Apart from the message in the log, the administrator has no way to tell that the configuration is broken. Moreover, if there is a trigger on connect, the synchronous replica will most likely become out of date immediately after the first connection, which is not good.
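To make the failure mode concrete, here is a minimal Python sketch of the kind of access check described above. It is not the server's actual code: the function name `check_journal_directory` and the message texts are hypothetical, and the real check may test different permissions.

```python
import os
from typing import Optional


def check_journal_directory(path: str) -> Optional[str]:
    """Return an error message if the directory cannot be used for journals, else None.

    Hypothetical helper for illustration; the real server performs its own check.
    """
    if not os.path.isdir(path):
        return f"journal directory '{path}' does not exist or is not a directory"
    # Assumption: journals need read, write and search permission on the directory.
    if not os.access(path, os.R_OK | os.W_OK | os.X_OK):
        return f"journal directory '{path}' is not accessible"
    return None
```

In the scenario above, such a check fails only at the first connection, and its result ends up in replication.log rather than in front of the user or the administrator.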

This patch offers a way to fix this situation:

  1. Do not stop reading the config at the first error; instead, combine all errors found into one message. This makes it easier to fix them all at once rather than one by one.

  2. Write a message to the log if the administrator specified a parameter name without a value. Previously, such a parameter was silently ignored. But if the administrator wrote it, they presumably intended to use it. This is not a critical error, but the log message should prompt them to either remove the parameter from the configuration or set a value for it.

  3. Use the disable_on_error configuration parameter: when it is enabled, allow one or more replicas to be disabled during replication initialization and still let the user connect to the DB. In the example above, the connection succeeds, asynchronous replication is disabled because the journal directory is not accessible, and the synchronous replica continues to work. If disable_on_error = false, the user receives the error "One or more replicas configured with errors" when connecting.
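The three points can be sketched roughly as follows. This is an illustrative Python model, not the patch itself: it assumes a flat `name = value` format with `#` comments, handles only a small subset of parameters, and the names `ParseResult`, `parse_config`, `combined_error` and `init_replicas` are hypothetical.

```python
from dataclasses import dataclass, field

# Small subset of parameters, chosen for illustration only.
KNOWN_BOOLEANS = {"disable_on_error", "report_errors"}
KNOWN_STRINGS = {"journal_directory", "sync_replica"}


@dataclass
class ParseResult:
    settings: dict = field(default_factory=dict)
    errors: list = field(default_factory=list)    # configuration errors
    warnings: list = field(default_factory=list)  # non-critical, log only


def parse_config(text: str) -> ParseResult:
    """Parse 'name = value' lines, collecting every problem instead of stopping at the first."""
    result = ParseResult()
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # assumption: '#' starts a comment
        if not line:
            continue
        name, _, value = (part.strip() for part in line.partition("="))
        if not value:
            # Point 2: a name without a value is reported, not silently dropped.
            result.warnings.append(f"line {number}: parameter '{name}' has no value and is ignored")
            continue
        if name in KNOWN_BOOLEANS:
            if value.lower() not in ("true", "false"):
                result.errors.append(f"line {number}: '{name}' expects true or false, got '{value}'")
                continue
            result.settings[name] = value.lower() == "true"
        elif name in KNOWN_STRINGS:
            result.settings[name] = value
        else:
            result.errors.append(f"line {number}: unknown parameter '{name}'")
    return result


def combined_error(result: ParseResult) -> str:
    """Point 1: join all collected errors into a single message."""
    return "\n".join(result.errors)


def init_replicas(replica_errors: dict, disable_on_error: bool) -> list:
    """Point 3: replica_errors maps a replica name to an error text or None.

    Returns the replicas that stay active, or raises if disabling is not allowed.
    """
    failed = [name for name, err in replica_errors.items() if err]
    if failed and not disable_on_error:
        raise RuntimeError("One or more replicas configured with errors")
    return [name for name, err in replica_errors.items() if not err]
```

With disable_on_error enabled, `init_replicas({"sync": None, "async": "no access"}, True)` keeps only the synchronous replica active; with it disabled, the same call raises the error quoted in point 3.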

…correct

Configuration errors can lead to loss of synchronization between master and replica.

aafemt commented Jan 10, 2025

Reporting replication errors to the user is governed by the report_errors configuration option. disable_on_error has a different purpose.


XaBbl4 commented Jan 10, 2025

I know that report_errors controls whether errors that occur during replication are sent to the client. But configuration errors cannot be ignored at replication initialization time; otherwise, a configuration change may desynchronize previously configured replicas.


aafemt commented Jan 10, 2025

Yes, and this is exactly how this replication was designed: no matter whether replicas get de-synchronized, the main database must continue operating as if nothing happened. If a DBA doesn't like this behavior, they must explicitly set report_errors to true to prevent operations on the main database.


XaBbl4 commented Jan 10, 2025

Here the situation is a little different. In the current implementation, even if report_errors is set to true, the user will still not receive an error in this situation and the problem will persist: the replica will become de-synchronized because the DBMS does not initialize replication.


aafemt commented Jan 12, 2025

I completely agree with the configuration checks in this PR; I just want to point out that the condition for error reporting must be different.
