Skip to content

Conversation

wkliao
Copy link
Collaborator

@wkliao wkliao commented Jul 15, 2025

This is in case an interrupt occurs while processing the log file, resulting the log file corrupted.
See issue #1052.

@wkliao wkliao requested a review from carns July 15, 2025 21:48
Copy link
Contributor

@carns carns left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good, just a few minor suggestions

final_core->mmap_log_name);
mmap_fd = open(dup_log_fame, O_CREAT | O_WRONLY, 0644);
if (mmap_fd != -1) {
write(mmap_fd, fileSize, buf);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we probably need to check the return code of the write here as well. If it fails we just do a best effort cleanup (unlink etc.).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I on purpose ignore all the error codes in this code block, as we
do not want any error to affect the log processing after this code block.
The duplicated file will be unlinked at the end, line 809, any way.

If this is OK, I can add some comments into the codes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that strategy makes sense. It may be good to add a comment explaining.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comments have been added in commit b3ce870.

@wkliao
Copy link
Collaborator Author

wkliao commented Jul 18, 2025

I also notice that no CI is testing the configure option --enable-mmap-logs.
Maybe we should add one.

@wkliao
Copy link
Collaborator Author

wkliao commented Jul 18, 2025

The solution proposed in this PR only works when --enable-mmap-logs is set at configure time.
When this is not the case, we may need another solution.

@carns
Copy link
Contributor

carns commented Jul 20, 2025

I also notice that no CI is testing the configure option --enable-mmap-logs. Maybe we should add one.

Yeah we really should.

carns
carns previously approved these changes Jul 28, 2025
Copy link
Contributor

@carns carns left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great. Let's see if Mikaila can test it. If it works for her then this can be merged.

if (mmap_fd != -1) {
int dup_mmap_fd;
void *buf = (void*) malloc(final_core->mmap_size);
read(mmap_fd, buf, final_core->mmap_size);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI, this block of code produces the following warning with gcc 14.2:

../../../darshan-runtime/lib/darshan-core.c: In function ‘darshan_core_shutdown’:
../../../darshan-runtime/lib/darshan-core.c:494:9: warning: ignoring return value of ‘read’ declared with attribute ‘warn_unused_result’ [-Wunused-result]
  494 |         read(mmap_fd, buf, final_core->mmap_size);
      |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
../../../darshan-runtime/lib/darshan-core.c:500:13: warning: ignoring return value of ‘write’ declared with attribute ‘warn_unused_result’ [-Wunused-result]
  500 |             write(dup_mmap_fd, buf, final_core->mmap_size);
      |             ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ignored them on purpose, as these codes are for file duplication, wouldn't
want any error stop the duplication.

Anyway, I added checks in 14a755c.
Strangely, I also run gcc 14.2.1, but did not get those warnings.
Please see if they are silenced.

void *buf = (void*) malloc(final_core->mmap_size);
read(mmap_fd, buf, final_core->mmap_size);
close(mmap_fd);
snprintf(dup_log_fame, strlen(final_core->mmap_log_name), "%s.dup",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This snprintf truncates the files name to something like /tmp/carns_python3_id1445957_mmap-log-3779006568372861245-0.darsha because the 2nd argument to is calling strlen() on the original mmap_log_name. Maybe this size limit on the snprintf should be based on the size of the dup_log_fname instead?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be fixed in 6a5c60f.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I have one other request here. Sorry this didn't occur to me sooner. I like that the duplicate file has "dup" in the file name, but it may be better if it weren't used as the file extension. I find myself doing things like ls /tmp/*.darshan to look for intermediate files, when with this patch it would really need to be ls /tmp/*.darshan /tmp/*.darshan.dup to find them all.

Can we change the naming convention so that even the duplicate file names end with .darshan?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another way is to run "ls /tmp/*.darshan*".

The log file name is constructed in:

snprintf(core->mmap_log_name, __DARSHAN_PATH_MAX,
"/%s/%s_%s_id%d_mmap-log-%" PRIu64 "-%d.darshan",
core->config.mmap_log_path, cuser, __progname, jobid, logmod, my_rank);

Where do you suggest to insert substring "dup"?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There can be other things in /tmp with "darshan" in the name that aren't log files, though (I have a bunch for sure, though I'm an outlier because I've been experimenting with a lot of Darshan things :))

I'm not picky about where to put dup in the name (other than not at the end). Maybe just making the extension .dup.darshan?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated. Please give it a try.

@wkliao
Copy link
Collaborator Author

wkliao commented Aug 22, 2025

I have a doubt this issue may be related to the data sieving bug in OpenMPI in #1064.
Is it possible to get @mikail-g to confirm whether or not she is using OpenMPI, etc.?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants