-
Notifications
You must be signed in to change notification settings - Fork 122
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix and improve nldd timing and output reporting #6124
base: master
Are you sure you want to change the base?
Fix and improve nldd timing and output reporting #6124
Conversation
jenkins build this please |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To me this looks good for the most part. I haven't looked very hard into the details, but the parts that I have checked seem sound to me. I personally prefer using std::size_t
from <cstddef>
over "naked" size_t
s (from <stddef.h>
). You may consider switching to the std::
variant.
Other than that, I have only minor comments.
opm/simulators/flow/FlowMain.hpp
Outdated
{ | ||
// Write the number of nonlinear iterations per cell to a file in ResInsight compatible format | ||
const auto& odir = eclState().getIOConfig().getOutputDir(); | ||
simulator_->model().writeNonlinearIterationsPerCell(odir); | ||
|
||
// Create a deferred logger and add the report with rank as tag | ||
DeferredLogger local_log; | ||
std::ostringstream ss; | ||
|
||
// Accumulate reports per domain | ||
const auto& domain_reports = simulator_->model().domainAccumulatedReports(); | ||
for (size_t i = 0; i < domain_reports.size(); ++i) { | ||
const auto& dr = domain_reports[i]; | ||
ss << "====== Accumulated local solve data for domain " << i << " on rank " << mpi_rank_ << " ======\n"; | ||
dr.reportNLDD(ss); | ||
// Use combined rank and domain index as tag to ensure global uniqueness and correct ordering | ||
local_log.debug(fmt::format("R{:05d}D{:05d}", mpi_rank_, i), ss.str()); | ||
ss.str(""); // Clear the stringstream | ||
ss.clear(); // Clear any error flags | ||
} | ||
// Gather all logs and output them in sorted order | ||
auto global_log = gatherDeferredLogger(local_log, FlowGenericVanguard::comm()); | ||
if (this->output_cout_) { | ||
global_log.logMessages(); | ||
} | ||
} | ||
{ | ||
// Create a deferred logger and add the report with rank as tag | ||
DeferredLogger local_log; | ||
std::ostringstream ss; | ||
ss << "====== Accumulated local solve data for rank " << mpi_rank_ << " ======\n"; | ||
simulator_->model().localAccumulatedReports().reportNLDD(ss); | ||
// Use rank number as tag to ensure correct ordering | ||
local_log.debug(fmt::format("{:05d}", mpi_rank_), ss.str()); | ||
|
||
// Gather all logs and output them in sorted order | ||
auto global_log = gatherDeferredLogger(local_log, FlowGenericVanguard::comm()); | ||
if (this->output_cout_) { | ||
global_log.logMessages(); | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we could push this logic to a separate helper function? That way, runSimulatorAfterSim_()
wouldn't be dominated by NLDD-related output.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought the same thing but never got around to doing it. Added this now and also moved some functions out from BlackOilModelNldd.hpp
into this new NlddReporting.hpp
file.
One more general question: If a model has many local domains (e.g. more than 5000), either because the operator requests a very fine-grained partition or because the model "just" has a lot of active cells, how do the statistics plots look then? Do they become very crowded? |
19151fd
to
9faa3f8
Compare
The use of For the NLDD_ITER plot in ResInsight, I don't think this would be an issue. Here we could also add a filter to only show the cells above a certain value. For the other plot I included, this is just something I made myself based on the domain statistics so here it is up to the user how they want to plot the statistics. What I was slightly worried about is if we write many domain reports to the DBG file, this would clutter the file or the file would be too large. This is already starting to become a problem with NLDD, which has a lot more output. But I ended up adding it to the DBG file as the logger in OPM conveniently solved all the problems I had for logging and sorting this info. So for now, I think this is fine, but this is something we can consider fixing later. |
jenkins build this please |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot for the improvements here. This is starting to look good to me. I (think I) have identified a couple of minor issues, the most important of which may be a differing function type for the domainAccumulatedReports()
function. Is that function called at all?
Other than that, we should use ::Opm::
as the namespace prefix when we need to disambiguate Opm
from within the Opm namespace.
if (!resInsightFile) { | ||
OPM_THROW(std::runtime_error, | ||
"Failed to open NLDD nonlinear iterations output file: " + fname.string()); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think you really need this check. The std::ofstream
constructor will itself throw an exception if the it fails to open the output file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right. And apologies for not fixing this since you pointed this about in the last PR. This code was written before that...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should probably add this new file (NlddReporting.hpp
) to the set of build files too (top-level CMakeLists_files.cmake
file). If the file is supposed to be installed and usable from outside of opm-simulators, and I think it probably should be, then it should go into the PUBLIC_HEADER_FILES
. Otherwise, I think it should go into the MAIN_SOURCE_FILES
. It is, admittedly, not quite as critical in that case, but it does nevertheless help the build system to generate file listings for those who use IDEs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for clarifying what should go into this file! Added it to PUBLIC_HEADER_FILES
now.
domain_reports_accumulated_.resize(num_domains); | ||
|
||
// Print domain distribution summary | ||
Opm::printDomainDistributionSummary( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're already in the Opm namespace, so this Opm::
prefix is a little misleading. If you want it, then Opm::
itself should be prefixed by the global namespace, meaning this function call becomes
::Opm::printDomainDistributionSummary(...);
instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh yes I was following the pattern in the partitionCells call but hastly overlooked the prefixed ::
, added this prefix now.
{ | ||
return local_reports_accumulated_; | ||
} | ||
|
||
/// return the statistics of local solves accumulated for each domain on this rank | ||
std::vector<SimulatorReport>& domainAccumulatedReports() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you mean to return a reference here? Unless I'm reading it incorrectly, the header file declares this function as returning an object (a vector<SimulatorReport>
) rather than a reference to a mutable vector.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this was intentional, as I don't want to make a copy of the reports, given that the object can be quite large.
There was indeed a mismatch in the function definitions in BlackOilModel.hpp
and BlackOilModelNldd.hpp
.
The complication here was that domainAccumulatedReports()
needed to count the wells for each domain using the mapping found in the well model for all the wells seen so far. So, we need to mutate this internal variable whenever we use this function to ensure this is updated. We could, of course, always keep this count updated, but I didn't want to do extra work since this is only used once in the end. Using the wells at the end from the schedule like we do for the per-rank reports is not possible since we do not have the well to domain mapping available here.
To overcome this issue, I made the functions consistently a const functions but made the domain_reports_accumulated_
member mutable
. To also ensure that I returned a reference, I removed returning an empty object if not used by NLDD, and instead raised an error.
9faa3f8
to
90d0ff7
Compare
jenkins build this please |
90d0ff7
to
ddbb1dd
Compare
jenkins build this please |
ddbb1dd
to
bc20dfd
Compare
jenkins build this please |
bc20dfd
to
35e1f59
Compare
jenkins build this please |
Improving reporting of NLDD
Changes
This PR fixes some bugs that were present in the Non-Linear Domain Decomposition (NLDD) reporting and improves how statistics from the NLDD solver are collected, displayed, and analyzed.
The key improvements are:
BlackoilModel::localAccumulatedReports()
from returning a simpleSimulatorReportSingle
to a more comprehensiveSimulatorReport
that can track both successful and failed solves.BlackoilModel::domainAccumulatedReports()
that provides detailed statistics for each individual domain on the current rank.BlackoilModel::writeNonlinearIterationsPerCell()
to generate ResInsight-compatible files showing nonlinear iteration counts at the cell level.BlackoilModel::hasNlddSolver()
that makes it cleaner to check if an NLDD solver exists before attempting to use it.New Features
1. Domain Distribution Summary
The solver now provides a summary of how cells, wells and domains are distributed across ranks:
2. Per-Domain Performance Reports
Each domain's performance is now reported with detailed statistics in the DBG file:
3. Per rank statistics
Detailed NLDD statistics accumulated per rank is now shown per rank in the DBG file (what was former shown in the simulation summary).
4. Visualisation Capabilities
The PR adds the ability to visualise nonlinear iteration counts:
Implementation Details
SimulatorReport
structures for domain-specific performance reporting. A dedicated one can be created for NLDD to avoid having some unused fields during the current default Newton method. However, there is quite a lot of functionality built around it for things like adding reports and etc. So, I kept it as one report to avoid too much duplication.