Skip to content

Conversation

vepadulano
Copy link
Member

The default Snapshot compression setting has always been 101. Historically, this was done for simplicity reasons and following the principle of least surprise. TTree was the only output format available with Snapshot, so the operation was defaulting to the same value used by TTree. Now that Snapshot supports more than one output format, this reason is less strong than before. It has been established that 505 is a better default compression setting overall, so RDataFrame should follow that. The main disadvantage is that this change is hard to communicate. This commit proposes to introduce an information message being shown once per program execution, and only if the program is calling Snapshot. This message is supposed to help the users detect changes in their output file size with the next development cycle (6.38.x) and should be removed afterwards.

As discussed during the ROOT I/O meeting

Copy link
Member

@pcanal pcanal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

Copy link

github-actions bot commented Oct 3, 2025

Test Results

    21 files      21 suites   3d 17h 45m 44s ⏱️
 3 688 tests  3 680 ✅ 0 💤   8 ❌
75 597 runs  75 461 ✅ 0 💤 136 ❌

For more details on these failures, see this check.

Results for commit 09afe35.

♻️ This comment has been updated with latest results.

@vepadulano vepadulano force-pushed the rdf-change-snapshot-comp-settings branch from 81c2ac0 to 7c5c04f Compare October 3, 2025 23:08
@vepadulano
Copy link
Member Author

@pcanal I am now slightly in doubt about this current approach. For the information message to be actually visible, a user would have to explicitly activate the "info" logging in their application by writing

#include <ROOT/RLogger.hxx>

// this increases RDF's verbosity level as long as the `verbosity` variable is in scope
auto verbosity = ROOT::RLogScopedVerbosity(ROOT::Detail::RDF::RDFLogChannel(), ROOT::ELogLevel::kInfo);

Whereas I believe our goal here is to show the message once per application anytime the user is writing at least one Snapshot call in their program. By turning the "info" to a "warning", we would get that behaviour.

@silverweed
Copy link
Contributor

I think a Warning is fine, considering that you are warning the user about the change in behavior

@dpiparo
Copy link
Member

dpiparo commented Oct 6, 2025

Thanks for adding it to the RN, which do not reach all users, but at least make the change officially documented in writing.

Copy link
Contributor

@enirolf enirolf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! I agree with the suggestion to turn this into a warning. I added one suggestion to the RN to add some additional context, but it's also okay for me to leave them as-is.

@vepadulano vepadulano requested a review from hageboeck October 6, 2025 12:46
Copy link
Member

@hageboeck hageboeck left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm in favour of changing the default, and also think that it's good to notify users, but I think the way how the message gets out could be improved:

  • An entry in the release notes is warranted
  • The Info message should show as info, so we might have to move to TError.h's Info. (A warning would be too high in my opinion).
  • And I think the message should be silenceable using an entry in TEnv.

How about:

The default compression settings of Snapshot have been changed from 101 (ZLIB with compression level 1) to 505 (ZSTD with compression level 5).
See https://root.cern/
Silence this message adding RDF_SNAPSHOT_INFO=0 to the process environment or .rootrc

Copy link
Contributor

@martamaja10 martamaja10 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Vincenzo!

The default Snapshot compression setting has always been 101. Historically, this was done for simplicity reasons and following the principle of least surprise. TTree was the only output format available with Snapshot, so the operation was defaulting to the same value used by TTree. Now that Snapshot supports more than one output format, this reason is less strong than before. It has been established that 505 is a better default compression setting overall, so RDataFrame should follow that. The main disadvantage is that this change is hard to communicate. This commit proposes to introduce an information message being shown once per program execution, and only if the program is calling Snapshot. This message is supposed to help the users detect changes in their output file size with the next development cycle (6.38.x) and should be removed afterwards.
@vepadulano vepadulano force-pushed the rdf-change-snapshot-comp-settings branch from 7c5c04f to 09afe35 Compare October 6, 2025 17:31
@vepadulano
Copy link
Member Author

I have found a way to show kInfo level messages with RLogger, which I prefer over using the legacy TError.h Info function. The environment variable ROOT_RDF_SILENCE_SNAPSHOT_INFO=1 can be used to suppress the message on the user side. I would avoid adding another knob to .rootrc.

The remaining discussion point is about how long this info message should stay around. I believe it should only stay in 6.38 as leaving it in an LTS release has the concrete risk of sticking around in user code for many years.

@hageboeck
Copy link
Member

I have found a way to show kInfo level messages with RLogger, which I prefer over using the legacy TError.h Info function. The environment variable ROOT_RDF_SILENCE_SNAPSHOT_INFO=1 can be used to suppress the message on the user side. I would avoid adding another knob to .rootrc.

How about the shorter ROOT_RDF_SNAPSHOT_INFO=0?
Having the same in .rootrc would certainly be easier for people who configure other things in rootrc as well. These would have to do some configs in the environment, others in .rootrc.

The remaining discussion point is about how long this info message should stay around. I believe it should only stay in 6.38 as leaving it in an LTS release has the concrete risk of sticking around in user code for many years.

After today's discussion, 6.38 seems to be enough.

@pcanal
Copy link
Member

pcanal commented Oct 7, 2025

The message appears to be:

Info in <[ROOT.RDF] Info /github/home/ROOT-CI/build/include/ROOT/RDF/RInterface.hxx:1348 in auto ROOT::RDF::RInterface<ROOT::Detail::RDF::RLoopManager, void>::Snapshot(std::string_view, std::string_view, const ColumnNames_t &, const RSnapshotOptions &)::(anonymous class)::operator()() const [Proxied = ROOT::Detail::RDF::RLoopManager, DataSource = void]>: In ROOT 6.38, the default compression settings of Snapshot have been changed from 101 (ZLIB with compression level 1, the TTree default) to 505 (ZSTD with compression level 5). This change may result in smaller Snapshot output dataset size by default. In order to suppress this message, set ROOT_RDF_SILENCE_SNAPSHOT_INFO=1 in your environment.

The message is being obscured by the prefix which in this case in uninformative. Something closer to the following might be more readable.

Info in <[ROOT.RDF] Snapshot>:  In ROOT 6.38, the default compression settings of Snapshot have been changed
	from 101 (ZLIB with compression level 1, the TTree default) to 505 (ZSTD with compression level 5). 
	This change may result in smaller Snapshot output dataset size by default. In order to suppress this message, 
		set ROOT_RDF_SILENCE_SNAPSHOT_INFO=1 in your environment.

@pcanal pcanal self-requested a review October 7, 2025 16:48
@vepadulano
Copy link
Member Author

The message appears to be:

Info in <[ROOT.RDF] Info /github/home/ROOT-CI/build/include/ROOT/RDF/RInterface.hxx:1348 in auto ROOT::RDF::RInterface<ROOT::Detail::RDF::RLoopManager, void>::Snapshot(std::string_view, std::string_view, const ColumnNames_t &, const RSnapshotOptions &)::(anonymous class)::operator()() const [Proxied = ROOT::Detail::RDF::RLoopManager, DataSource = void]>: In ROOT 6.38, the default compression settings of Snapshot have been changed from 101 (ZLIB with compression level 1, the TTree default) to 505 (ZSTD with compression level 5). This change may result in smaller Snapshot output dataset size by default. In order to suppress this message, set ROOT_RDF_SILENCE_SNAPSHOT_INFO=1 in your environment.

The message is being obscured by the prefix which in this case in uninformative. Something closer to the following might be more readable.

Info in <[ROOT.RDF] Snapshot>:  In ROOT 6.38, the default compression settings of Snapshot have been changed
	from 101 (ZLIB with compression level 1, the TTree default) to 505 (ZSTD with compression level 5). 
	This change may result in smaller Snapshot output dataset size by default. In order to suppress this message, 
		set ROOT_RDF_SILENCE_SNAPSHOT_INFO=1 in your environment.

I really do not think a rewriting of the implementation of RLogger is in scope for this PR

@hageboeck
Copy link
Member

I really do not think a rewriting of the implementation of RLogger is in scope for this PR

Maybe that's not required. Adding \n\t at the beginning and somewhere in the middle of the message may help making it more readable.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants