Commit 09afe35

[df] Change default Snapshot compression settings
The default Snapshot compression setting has always been 101. This was chosen for simplicity and to follow the principle of least surprise: TTree was the only output format available with Snapshot, so the operation defaulted to the same value used by TTree. Now that Snapshot supports more than one output format, this reason carries less weight. 505 has been established as a better default compression setting overall, so RDataFrame should follow it.

The main disadvantage is that this change is hard to communicate. This commit introduces an information message that is shown once per program execution, and only if the program calls Snapshot. The message is meant to help users detect changes in their output file size during the next development cycle (6.38.x) and should be removed afterwards.
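The two settings discussed here pack an algorithm and a level into one integer; the release note in this commit spells out 101 as ZLIB level 1 and 505 as ZSTD level 5. The following is a minimal, ROOT-independent sketch of that arithmetic, assuming the algorithm code is the hundreds part and the level is the remainder. `Algo`, `CompressionSetting`, `Decode` and `Encode` are hypothetical names used only for illustration and are not part of ROOT.

```cpp
// Local mirror of the algorithm codes implied by the settings 101 and 505
// (assumption for this sketch: ZLIB = 1, ZSTD = 5).
enum class Algo { kZLIB = 1, kZSTD = 5 };

struct CompressionSetting {
   int algorithm; // hundreds part of the combined setting
   int level;     // remainder of the combined setting
};

// Split a combined setting such as 505 into its algorithm and level parts.
constexpr CompressionSetting Decode(int setting)
{
   return {setting / 100, setting % 100};
}

// Combine an algorithm and a level into a single integer setting.
constexpr int Encode(Algo algo, int level)
{
   return static_cast<int>(algo) * 100 + level;
}

// Compile-time checks against the two values named in the commit message.
static_assert(Encode(Algo::kZLIB, 1) == 101, "old Snapshot default");
static_assert(Encode(Algo::kZSTD, 5) == 505, "new Snapshot default");
```

Keeping the helpers `constexpr` lets the two defaults be checked at compile time, so the sketch documents the encoding without running anything.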
1 parent 4157d78 commit 09afe35

3 files changed: +26 -2 lines changed
README/ReleaseNotes/v638/index.md

Lines changed: 3 additions & 0 deletions
@@ -115,7 +115,10 @@ If you want to keep using `TList*` return values, you can write a small adapter
 RDF uses one copy of each histogram per thread. Now, RDataFrame can reduce the number of clones using `ROOT::RDF::Experimental::ThreadsPerTH3()`. Setting this
 to numbers such as 8 would share one 3-d histogram among 8 threads, greatly reducing the memory consumption. This might slow down execution if the histograms
 are filled at very high rates. Use lower number in this case.
+
+### Snapshot
 - The Snapshot method has been refactored so that it does not need anymore compile-time information (i.e. either template arguments or JIT-ting) to know the input column types. This means that any Snapshot call that specifies the template arguments, e.g. `Snapshot<int, float>(..., {"intCol", "floatCol"})` is now redundant and the template arguments can safely be removed from the call. At the same time, Snapshot does not need to JIT compile the column types, practically giving huge speedups depending on the number of columns that need to be written to disk. In certain cases (e.g. when writing O(10000) columns) the speedup can be larger than an order of magnitude. The Snapshot template is now deprecated and it will issue a compile-time warning when called. The function overload is scheduled for removal in ROOT 6.40.
+- The default compression setting for the output dataset used by Snapshot has been changed from 101 (ZLIB level 1, the TTree default) to 505 (ZSTD level 5). This is a better setting on average, and makes more sense for RDataFrame since now the Snapshot operation supports more than just the TTree output data format. This change may result in smaller output file sizes for your analyses that use Snapshot with default settings. During the 6.38 development release cycle, Snapshot will print information about this change once per program run. Starting from 6.40.00, the information will not be printed. The message can be suppressed by setting ROOT_RDF_SILENCE_SNAPSHOT_INFO=1 in your environment.

 ## Python Interface

tree/dataframe/inc/ROOT/RDF/RInterface.hxx

Lines changed: 21 additions & 0 deletions
@@ -32,6 +32,7 @@
 #include "ROOT/RResultPtr.hxx"
 #include "ROOT/RSnapshotOptions.hxx"
 #include <string_view>
+#include "ROOT/RLogger.hxx"
 #include "ROOT/RVec.hxx"
 #include "ROOT/TypeTraits.hxx"
 #include "RtypesCore.h" // for ULong64_t
@@ -44,8 +45,12 @@
 #include "TProfile2D.h"
 #include "TStatistic.h"

+#include "ROOT/RVersion.hxx"
+
 #include <algorithm>
 #include <cstddef>
+#include <cstdlib>
+#include <cstring>
 #include <initializer_list>
 #include <iterator> // std::back_insterter
 #include <limits>
@@ -1331,6 +1336,22 @@ public:
          const ColumnNames_t &columnList,
          const RSnapshotOptions &options = RSnapshotOptions())
    {
+      // TODO: Remove before releasing 6.40.00
+#if ROOT_VERSION_CODE >= ROOT_VERSION(6, 40, 0)
+      static_assert(false && "Remove information about change of Snapshot defaut compression settings.");
+#endif
+      [[maybe_unused]] static bool once = []() {
+         if (const char *suppress = std::getenv("ROOT_RDF_SILENCE_SNAPSHOT_INFO"))
+            if (std::strcmp(suppress, "1") == 0)
+               return true;
+         RLogScopedVerbosity showInfo{ROOT::Detail::RDF::RDFLogChannel(), ROOT::ELogLevel::kInfo};
+         R__LOG_INFO(ROOT::Detail::RDF::RDFLogChannel())
+            << "In ROOT 6.38, the default compression settings of Snapshot have been changed from 101 (ZLIB with "
+               "compression level 1, the TTree default) to 505 (ZSTD with compression level 5). This change may result "
+               "in smaller Snapshot output dataset size by default. In order to suppress this message, set "
+               "ROOT_RDF_SILENCE_SNAPSHOT_INFO=1 in your environment.";
+         return true;
+      }();
       // like columnList but with `#var` columns removed
       auto colListNoPoundSizes = RDFInternal::FilterArraySizeColNames(columnList, "Snapshot");
       // like columnListWithoutSizeColumns but with aliases resolved
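The added block relies on a function-local `static` whose initializer is an immediately invoked lambda, so the message logic runs at most once per process, with an environment variable as the opt-out. The sketch below isolates that pattern without any ROOT dependency. The variable name `DEMO_SILENCE_INFO`, the helpers `NoticeCount`, `IsSilenced` and `DoWork`, and the use of `std::clog` in place of ROOT's logger are all assumptions made for illustration.

```cpp
#include <cstdlib>
#include <cstring>
#include <iostream>

// Hypothetical environment variable name, used only in this sketch.
constexpr const char *kSilenceVar = "DEMO_SILENCE_INFO";

// Counts how many times the notice was actually emitted (illustration only).
inline int &NoticeCount()
{
   static int count = 0;
   return count;
}

// True if the variable is set to exactly "1", mirroring the check in the diff.
inline bool IsSilenced()
{
   const char *value = std::getenv(kSilenceVar);
   return value != nullptr && std::strcmp(value, "1") == 0;
}

// Stand-in for an operation that should print a notice at most once per process.
inline void DoWork()
{
   // The initializer of a function-local static runs only on the first call;
   // since C++11 that initialization is also thread-safe.
   [[maybe_unused]] static const bool once = []() {
      if (!IsSilenced()) {
         std::clog << "[info] default settings have changed\n";
         ++NoticeCount();
      }
      return true;
   }();
   // ... the real work would follow here ...
}
```

Calling `DoWork()` repeatedly emits the notice on the first call only. Because the environment is read inside the initializer, setting the variable after the first call has no effect in this sketch.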

tree/dataframe/inc/ROOT/RSnapshotOptions.hxx

Lines changed: 2 additions & 2 deletions
@@ -46,8 +46,8 @@ struct RSnapshotOptions {
    }
    std::string fMode = "RECREATE"; ///< Mode of creation of output file
    ECAlgo fCompressionAlgorithm =
-      ROOT::RCompressionSetting::EAlgorithm::kZLIB; ///< Compression algorithm of output file
-   int fCompressionLevel = 1; ///< Compression level of output file
+      ROOT::RCompressionSetting::EAlgorithm::kZSTD; ///< Compression algorithm of output file
+   int fCompressionLevel = 5; ///< Compression level of output file
    int fAutoFlush = 0; ///< AutoFlush value for output tree
    int fSplitLevel = 99; ///< Split level of output tree
    bool fLazy = false; ///< Do not start the event loop when Snapshot is called
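Since only the default member initializers change, code that needs the previous behaviour can set both fields explicitly. The following is a self-contained sketch of that idea using a stand-in struct. `SnapshotOptionsSketch`, `LegacyOptions` and `CombinedSetting` are hypothetical names; only the two field names and their old and new values are taken from the diff above.

```cpp
// Stand-in for the two fields touched by this diff; this is not the real ROOT struct.
enum class EAlgorithm { kZLIB = 1, kZSTD = 5 };

struct SnapshotOptionsSketch {
   EAlgorithm fCompressionAlgorithm = EAlgorithm::kZSTD; // new default
   int fCompressionLevel = 5;                            // new default
};

// Build options that reproduce the pre-change default (ZLIB level 1).
inline SnapshotOptionsSketch LegacyOptions()
{
   SnapshotOptionsSketch opts;
   opts.fCompressionAlgorithm = EAlgorithm::kZLIB;
   opts.fCompressionLevel = 1;
   return opts;
}

// Express the options as a single integer (algorithm * 100 + level).
inline int CombinedSetting(const SnapshotOptionsSketch &opts)
{
   return static_cast<int>(opts.fCompressionAlgorithm) * 100 + opts.fCompressionLevel;
}
```

With ROOT itself, the same override would presumably be written against `RSnapshotOptions` by assigning `fCompressionAlgorithm` and `fCompressionLevel` and passing the object as the `options` argument of `Snapshot`.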
