Fix and improve nldd timing and output reporting #6124

Open
wants to merge 11 commits into
base: master

Conversation

jakobtorben
Contributor

Improving reporting of NLDD

Changes

This PR fixes some bugs that were present in the Non-Linear Domain Decomposition (NLDD) reporting and improves how statistics from the NLDD solver are collected, displayed, and analyzed.

The key improvements are:

  1. Fixed a bug where the simulation summary reported the NLDD breakdown for rank 0 only. The simulation summary now reports only the total local solve time, and the more detailed reports are written to the DBG file. The breakdown was removed from the final summary because it is not meaningful for NLDD in parallel, where all ranks have to wait for the slowest rank. Detailed reports per domain and per rank in the DBG file are more informative.
  2. There were also some inconsistencies in the NLDD assembly reporting, which have been fixed.
  3. Better reporting structure: Changed BlackoilModel::localAccumulatedReports() from returning a simple SimulatorReportSingle to a more comprehensive SimulatorReport that can track both successful and failed solves.
  4. Per-domain statistics: Added a new method BlackoilModel::domainAccumulatedReports() that provides detailed statistics for each individual domain on the current rank.
  5. Cell-level visualization: Implemented BlackoilModel::writeNonlinearIterationsPerCell() to generate ResInsight-compatible files showing nonlinear iteration counts at the cell level.
  6. Added a helper method BlackoilModel::hasNlddSolver() that makes it cleaner to check if an NLDD solver exists before attempting to use it.
  7. Created an NLDD-specific report writer with more details, used for rank- and domain-level statistics.

New Features

1. Domain Distribution Summary

The solver now provides a summary of how cells, wells and domains are distributed across ranks:

NLDD domain distribution summary:
  rank   owned cells   overlap cells   total cells   wells   domains
--------------------------------------------------------------------
     0         9682           2227         11909       5       12
     1        11809           1777         13586      14       12
     2        12131            536         12667      12       14
     3        10809            590         11399       5       13
--------------------------------------------------------------------
   sum        44431           5130         49561      36       51
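
A table like the one above could be produced along the following lines. This is only a sketch: the `RankDomainInfo` struct and `formatDistributionSummary` function are hypothetical names for illustration, not the PR's actual implementation, and the column widths are approximate.

```cpp
#include <iomanip>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical per-rank record; field names are illustrative only.
struct RankDomainInfo {
    int rank;
    int ownedCells;
    int overlapCells;
    int wells;
    int domains;
};

// Format one row per rank followed by a "sum" row.
std::string formatDistributionSummary(const std::vector<RankDomainInfo>& info)
{
    std::ostringstream os;
    const std::string rule(68, '-');
    os << "NLDD domain distribution summary:\n"
       << "  rank   owned cells   overlap cells   total cells   wells   domains\n"
       << rule << '\n';

    // Writes a single right-aligned row; total cells = owned + overlap.
    auto row = [&os](const std::string& label, const RankDomainInfo& r) {
        os << std::setw(6) << label
           << std::setw(14) << r.ownedCells
           << std::setw(16) << r.overlapCells
           << std::setw(14) << (r.ownedCells + r.overlapCells)
           << std::setw(8) << r.wells
           << std::setw(10) << r.domains << '\n';
    };

    RankDomainInfo sum{0, 0, 0, 0, 0};
    for (const auto& r : info) {
        row(std::to_string(r.rank), r);
        sum.ownedCells += r.ownedCells;
        sum.overlapCells += r.overlapCells;
        sum.wells += r.wells;
        sum.domains += r.domains;
    }
    os << rule << '\n';
    row("sum", sum);
    return os.str();
}
```

In a parallel run the per-rank records would first have to be gathered on the output rank; that step is omitted here.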

2. Per-Domain Performance Reports

Each domain's performance is now reported with detailed statistics in the DBG file:

======  Accumulated local solve data for domain 0 on rank 1 ======
Owned + overlap cells:          1086
Number of wells:                   2
Number of domains:                 1
-------------------------------------------------------
Total time:                        0.00 s
  Pre/post/wait time:              0.00 s
  Solver time:                     2.87 s (Wasted: 0.1 s; 1.9%)
    Assembly time:                 1.39 s (Wasted: 0.0 s; 1.5%)
      Well assembly:               0.06 s (Wasted: 0.0 s; 1.3%)
    Linear solve time:             0.45 s (Wasted: 0.0 s; 2.3%)
      Linear setup:                0.19 s (Wasted: 0.0 s; 2.3%)
    Props/update time:             1.02 s (Wasted: 0.0 s; 2.2%)
Overall Linearizations:          1250   (Wasted:    22; 1.8%)
Overall Nonlinear Iterations:     787   (Wasted:    21; 2.7%)
Overall Linear Iterations:       1253   (Wasted:    12; 1.0%)
-------------------------------------------------------
Converged domain solves:         462
  Accepted with relaxed tol:       0
Unconverged domain solves:         1
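
The "Wasted" entries can be read as the share of work spent in solves that did not converge. The sketch below assumes the percentage is taken relative to the total (successful plus failed) count, which is consistent with the sample above (22 of 1250 linearizations is about 1.8%); the `Counts` struct and `formatWithWaste` function are hypothetical and not taken from the PR.

```cpp
#include <cstdio>
#include <string>

// Hypothetical counters: work done in converged solves and in failed solves.
struct Counts {
    int success;
    int failure;
};

// Format "total (Wasted: failed; pct%)", where pct = failed / total.
std::string formatWithWaste(const Counts& c)
{
    const int total = c.success + c.failure;
    const double pct = total > 0 ? 100.0 * c.failure / total : 0.0;
    char buf[64];
    std::snprintf(buf, sizeof buf, "%d (Wasted: %d; %.1f%%)", total, c.failure, pct);
    return buf;
}
```

For example, `formatWithWaste(Counts{1228, 22})` gives the 1250 total and 1.8% wasted share shown for linearizations in the domain report.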

3. Per rank statistics

The detailed NLDD statistics accumulated per rank are now written to the DBG file (they were formerly shown in the simulation summary).

======  Accumulated local solve data for rank 1 ======
Owned + overlap cells:         13586
Number of wells:                  14
Number of domains:                12
-------------------------------------------------------
Total time:                       31.76 s
  Pre/post/wait time:              2.10 s
  Solver time:                    29.66 s (Wasted: 0.1 s; 0.2%)
    Assembly time:                13.91 s (Wasted: 0.0 s; 0.1%)
      Well assembly:               0.30 s (Wasted: 0.0 s; 0.2%)
    Linear solve time:             5.35 s (Wasted: 0.0 s; 0.2%)
      Linear setup:                1.92 s (Wasted: 0.0 s; 0.2%)
    Props/update time:            10.22 s (Wasted: 0.0 s; 0.2%)
Overall Linearizations:         14517   (Wasted:    22; 0.2%)
Overall Nonlinear Iterations:    8961   (Wasted:    21; 0.2%)
Overall Linear Iterations:      13370   (Wasted:    12; 0.1%)
-------------------------------------------------------
Converged domain solves:        5555
  Accepted with relaxed tol:       0
Unconverged domain solves:         1

4. Visualisation Capabilities

The PR adds the ability to visualise nonlinear iteration counts:

  • Cell-level NLDD iteration count: A ResInsight-compatible output file that allows visualisation of nonlinear iteration counts per cell, helping identify problematic regions in the model.

[Image: ResInsight_nonlinear_iterations]

  • Domain-level performance: The new detailed domain statistics make it possible to analyse the NLDD method in more detail and create useful plots like the one below.

[Image: NORNE_ATW2013_nonlinear]

Implementation Details

  • The implementation extends the existing SimulatorReport structures for domain-specific performance reporting. A dedicated report type could be created for NLDD to avoid carrying unused fields when running the default Newton method. However, quite a lot of functionality is built around SimulatorReport, such as adding reports together, so I kept it as one report type to avoid too much duplication.
  • To produce the NLDD statistics there is a significant amount of processing that needs to be done. However, all the processing-intensive and global calls are only done once per simulation, and I argue that the value it provides justifies the cost.
  • The NLDD domain statistics report contains the following three time fields:
      • Total time: Total time spent in NLDD for this domain
      • Pre/post/wait time: Total time spent preparing for the NLDD solve and waiting for other ranks to finish
      • Solver time: Total time spent doing an actual solve for this domain
  • The total time will therefore be about the same for all domains, and about the same as the local solve time in the simulation summary, whereas the pre/post/wait time will differ based on the workload on that rank.
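
The relation between the three time fields can be sketched as below. It assumes that the total time is the sum of the solver time and the pre/post/wait time, which matches the per-rank sample above (29.66 s + 2.10 s = 31.76 s). The `DomainTiming` struct is a hypothetical stand-in, not a type from the PR.

```cpp
// Hypothetical timing record for one domain or one rank.
struct DomainTiming {
    double totalTime;   // total time spent in NLDD
    double solverTime;  // time spent in actual local solves
};

// Whatever is not spent solving is attributed to preparation,
// post-processing and waiting for other ranks.
double prePostWaitTime(const DomainTiming& t)
{
    return t.totalTime - t.solverTime;
}
```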

@jakobtorben
Contributor Author

jenkins build this please

Member

@bska bska left a comment

To me this looks good for the most part. I haven't looked very hard into the details, but the parts that I have checked seem sound to me. I personally prefer using std::size_t from <cstddef> over "naked" size_ts (from <stddef.h>). You may consider switching to the std:: variant.

Other than that, I have only minor comments.

Comment on lines 395 to 435
{
    // Write the number of nonlinear iterations per cell to a file in ResInsight compatible format
    const auto& odir = eclState().getIOConfig().getOutputDir();
    simulator_->model().writeNonlinearIterationsPerCell(odir);

    // Create a deferred logger and add the report with rank as tag
    DeferredLogger local_log;
    std::ostringstream ss;

    // Accumulate reports per domain
    const auto& domain_reports = simulator_->model().domainAccumulatedReports();
    for (size_t i = 0; i < domain_reports.size(); ++i) {
        const auto& dr = domain_reports[i];
        ss << "====== Accumulated local solve data for domain " << i << " on rank " << mpi_rank_ << " ======\n";
        dr.reportNLDD(ss);
        // Use combined rank and domain index as tag to ensure global uniqueness and correct ordering
        local_log.debug(fmt::format("R{:05d}D{:05d}", mpi_rank_, i), ss.str());
        ss.str(""); // Clear the stringstream
        ss.clear(); // Clear any error flags
    }
    // Gather all logs and output them in sorted order
    auto global_log = gatherDeferredLogger(local_log, FlowGenericVanguard::comm());
    if (this->output_cout_) {
        global_log.logMessages();
    }
}
{
    // Create a deferred logger and add the report with rank as tag
    DeferredLogger local_log;
    std::ostringstream ss;
    ss << "====== Accumulated local solve data for rank " << mpi_rank_ << " ======\n";
    simulator_->model().localAccumulatedReports().reportNLDD(ss);
    // Use rank number as tag to ensure correct ordering
    local_log.debug(fmt::format("{:05d}", mpi_rank_), ss.str());

    // Gather all logs and output them in sorted order
    auto global_log = gatherDeferredLogger(local_log, FlowGenericVanguard::comm());
    if (this->output_cout_) {
        global_log.logMessages();
    }
}

Member

Maybe we could push this logic to a separate helper function? That way, runSimulatorAfterSim_() wouldn't be dominated by NLDD-related output.

Contributor Author

I thought the same thing but never got around to doing it. Added this now and also moved some functions out from BlackOilModelNldd.hpp into this new NlddReporting.hpp file.
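
The reviewed code relies on zero-padded tags so that messages gathered from all ranks come out in rank and domain order. The sketch below illustrates why the padding matters, using only the standard library; `domainTag` and `concatenateInTagOrder` are hypothetical helpers, and the `std::map` merely stands in for a logger that emits messages sorted by tag.

```cpp
#include <cstdio>
#include <map>
#include <string>

// Build a tag such as "R00001D00012" for rank 1, domain 12.
std::string domainTag(int rank, int domain)
{
    char buf[32];
    std::snprintf(buf, sizeof buf, "R%05dD%05d", rank, domain);
    return buf;
}

// std::map iterates its keys in lexicographic order, so with fixed-width
// tags the messages are concatenated in (rank, domain) order.
std::string concatenateInTagOrder(const std::map<std::string, std::string>& messages)
{
    std::string out;
    for (const auto& kv : messages) {
        out += kv.second;
    }
    return out;
}
```

Without the padding, a tag like "R10D0" would sort before "R2D0", because string comparison proceeds character by character.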

@bska
Member

bska commented Apr 1, 2025

One more general question: If a model has many local domains (e.g. more than 5000), either because the operator requests a very fine-grained partition or because the model "just" has a lot of active cells, how do the statistics plots look then? Do they become very crowded?

@atgeirr atgeirr added this to the Release 2025.04 milestone Apr 2, 2025
@jakobtorben jakobtorben force-pushed the fix_and_improve_nldd_timing_and_output_reporting branch from 19151fd to 9faa3f8 Compare April 2, 2025 14:17
@jakobtorben
Contributor Author

The uses of naked size_t should be removed now.

For the NLDD_ITER plot in ResInsight, I don't think this would be an issue. Here we could also add a filter to only show the cells above a certain value. For the other plot I included, this is just something I made myself based on the domain statistics so here it is up to the user how they want to plot the statistics.

What I was slightly worried about is that writing many domain reports to the DBG file would clutter the file or make it too large. This is already starting to become a problem with NLDD, which has a lot more output. But I ended up adding it to the DBG file, as the logger in OPM conveniently solved all the problems I had with logging and sorting this info. So for now I think this is fine, but it is something we can consider fixing later.

@jakobtorben
Contributor Author

jenkins build this please

Member

@bska bska left a comment

Thanks a lot for the improvements here. This is starting to look good to me. I (think I) have identified a couple of minor issues, the most important of which may be a differing function type for the domainAccumulatedReports() function. Is that function called at all?

Other than that, we should use ::Opm:: as the namespace prefix when we need to disambiguate Opm from within the Opm namespace.

Comment on lines 157 to 160
if (!resInsightFile) {
    OPM_THROW(std::runtime_error,
              "Failed to open NLDD nonlinear iterations output file: " + fname.string());
}

Member

I don't think you really need this check. The std::ofstream constructor will itself throw an exception if it fails to open the output file.
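
For reference, the sketch below shows how open failures surface with a plain standard-library stream. By default a failed open only sets failbit, and the stream throws only after exceptions() has been enabled. Whether the surrounding OPM code enables stream exceptions is not visible in this thread, so this illustrates the standard behaviour only; the two function names are hypothetical.

```cpp
#include <fstream>
#include <ios>
#include <string>

// Default behaviour: a failed open sets failbit and nothing is thrown.
// Returns true if the file could be opened.
bool opensSilently(const std::string& path)
{
    std::ofstream out(path);
    return static_cast<bool>(out);
}

// Opt-in behaviour: with exceptions enabled, a failed open() throws
// std::ios_base::failure. Returns true if an exception was thrown.
bool openThrows(const std::string& path)
{
    std::ofstream out;
    out.exceptions(std::ios::failbit | std::ios::badbit);
    try {
        out.open(path);
    } catch (const std::ios_base::failure&) {
        return true;
    }
    return false;
}
```

With a path inside a directory that does not exist, `opensSilently` returns false without throwing, while `openThrows` returns true.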

Contributor Author

You are right. And apologies for not fixing this even though you pointed it out in the last PR. This code was written before that...

Member

You should probably add this new file (NlddReporting.hpp) to the set of build files too (top-level CMakeLists_files.cmake file). If the file is supposed to be installed and usable from outside of opm-simulators, and I think it probably should be, then it should go into the PUBLIC_HEADER_FILES. Otherwise, I think it should go into the MAIN_SOURCE_FILES. It is, admittedly, not quite as critical in that case, but it does nevertheless help the build system to generate file listings for those who use IDEs.

Contributor Author

Thanks for clarifying what should go into this file! Added it to PUBLIC_HEADER_FILES now.

domain_reports_accumulated_.resize(num_domains);

// Print domain distribution summary
Opm::printDomainDistributionSummary(

Member

We're already in the Opm namespace, so this Opm:: prefix is a little misleading. If you want it, then Opm:: itself should be prefixed by the global namespace, meaning this function call becomes

::Opm::printDomainDistributionSummary(...);

instead.
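
A small illustration of why the leading :: matters. The nested namespace below is purely hypothetical and is introduced only to show how an unanchored Opm:: prefix could resolve to something other than the top-level namespace.

```cpp
namespace Opm {
    int whichScope() { return 1; }       // ::Opm::whichScope

    namespace Opm {                      // hypothetical nested namespace ::Opm::Opm
        int whichScope() { return 2; }   // ::Opm::Opm::whichScope
    }

    // Inside namespace Opm, the name "Opm" is looked up from the current
    // scope outwards, so it finds the nested ::Opm::Opm first.
    int callWithPlainPrefix() { return Opm::whichScope(); }

    // The leading :: anchors the lookup at the global namespace.
    int callWithGlobalPrefix() { return ::Opm::whichScope(); }
}
```

Here `Opm::callWithPlainPrefix()` returns 2 and `Opm::callWithGlobalPrefix()` returns 1, which is the kind of ambiguity the ::Opm:: spelling rules out.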

Contributor Author

Ah yes, I was following the pattern in the partitionCells call but hastily overlooked the leading ::. Added this prefix now.

{
    return local_reports_accumulated_;
}

/// return the statistics of local solves accumulated for each domain on this rank
std::vector<SimulatorReport>& domainAccumulatedReports()

Member

Did you mean to return a reference here? Unless I'm reading it incorrectly, the header file declares this function as returning an object (a vector<SimulatorReport>) rather than a reference to a mutable vector.

Contributor Author

Yes, this was intentional, as I don't want to make a copy of the reports, given that the object can be quite large.
There was indeed a mismatch in the function definitions in BlackOilModel.hpp and BlackOilModelNldd.hpp.

The complication here was that domainAccumulatedReports() needed to count the wells for each domain using the mapping found in the well model for all the wells seen so far. So, we need to mutate this internal variable whenever we use this function to ensure this is updated. We could, of course, always keep this count updated, but I didn't want to do extra work since this is only used once in the end. Using the wells at the end from the schedule like we do for the per-rank reports is not possible since we do not have the well to domain mapping available here.

To overcome this issue, I made both functions consistently const and made the domain_reports_accumulated_ member mutable. To also ensure that a reference is returned, I removed the fallback of returning an empty object when NLDD is not in use, and instead raise an error.
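
The pattern described above (a const accessor that refreshes a mutable member and returns a reference, throwing when NLDD is not in use) might look roughly like this. `Report` and `Model` are simplified, hypothetical stand-ins for the real classes, the well count is a placeholder, and returning a const reference is an assumption of this sketch.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Simplified stand-in for a per-domain report.
struct Report {
    int wells = 0;
};

class Model {
public:
    explicit Model(bool hasNldd)
        : hasNldd_(hasNldd), reports_(hasNldd ? 3 : 0)
    {}

    bool hasNlddSolver() const { return hasNldd_; }

    // Const accessor: refreshes derived data in the mutable member and
    // returns a reference, so the (possibly large) vector is not copied.
    const std::vector<Report>& domainAccumulatedReports() const
    {
        if (!hasNldd_) {
            throw std::logic_error("No NLDD solver available");
        }
        for (std::size_t i = 0; i < reports_.size(); ++i) {
            // Placeholder for counting the wells of domain i.
            reports_[i].wells = static_cast<int>(i) + 1;
        }
        return reports_;
    }

private:
    bool hasNldd_;
    mutable std::vector<Report> reports_;  // updated lazily from const accessors
};
```

Callers would check `hasNlddSolver()` first, since the accessor no longer has an empty object to fall back on.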

@jakobtorben jakobtorben force-pushed the fix_and_improve_nldd_timing_and_output_reporting branch from 9faa3f8 to 90d0ff7 Compare April 3, 2025 07:09
@jakobtorben
Contributor Author

jenkins build this please

@jakobtorben jakobtorben force-pushed the fix_and_improve_nldd_timing_and_output_reporting branch from 90d0ff7 to ddbb1dd Compare April 3, 2025 08:10
@jakobtorben
Contributor Author

jenkins build this please

@jakobtorben jakobtorben force-pushed the fix_and_improve_nldd_timing_and_output_reporting branch from ddbb1dd to bc20dfd Compare April 3, 2025 08:15
@jakobtorben
Contributor Author

jenkins build this please

@jakobtorben jakobtorben force-pushed the fix_and_improve_nldd_timing_and_output_reporting branch from bc20dfd to 35e1f59 Compare April 3, 2025 08:24
@jakobtorben
Contributor Author

jenkins build this please

@jakobtorben
Contributor Author

Thanks for your comments @bska @akva2 ! This should be ready from my side now.


3 participants