PTE proposal with load-side barrier bits #234
CC: @nwf |
If you mprotect a page it can lose its capability write permission but still be capability dirty |
OK, let me give this a shot. I'm sorry this is long, but I hope it's all accurate and provides, if nothing else, a useful software-driven perspective. It's also worth remembering that we have done... very little experimentation, relative to the body of work on MMUs in general, when it comes to "what software wants from CHERI PTE bits". Some of this, then, is necessarily a best guess extrapolating from very few data points.

I've said it before, but to repeat: it's a bit sad we have to resort to using PTE bits for this, and I hope that in the future there are alternative extensions that build on physically-indexed metadata tables, like AMD's SEV-SNP's RMP or Arm's RME's GPTs. In any case, until such time, we are stuck using PTEs to reflect properties of a logical page (or even of the underlying physical frame), but hopefully future Zcheripte editions can use fewer PTE bits in composition with those other mechanisms.

Some of the design space is dependent on how willing microarchitecture is to take on data dependence in operations. It's most useful to software if we can trigger events only on tagged loads or stores, but these are new, and possibly expensive, dependencies, keeping state speculative for longer.

Some of the design space is also dependent on how willing software is to perform expensive operations on the set of page mappings for a given page. Historically, such need was limited to things like paging, and so such operations are, generally, not particularly speedy. The use of PTE mapping bits to approximate page state sometimes requires traversing this set.

OK, with all that out of the way... it's probably most useful to think about what logical states we need a logical page or an individual page mapping to be in. As relevant here, pages can be given four states:
Correctness absent temporal safety concerns really only differentiates between 1 and 2, smashing 3 and 4 up into an amalgamated 2. This is how the TLBE bits of CHERI-MIPS were used.

Under Cornucopia (Reloaded) revocation, the revoker must visit (at least) all cap-dirty and cap-dirtyable pages. Transition from ephemerally cap-clean is induced by traps or new mappings, and may require broadcast updates of PTEs, depending on exactly which things PTEs can represent (see below). Re-entry to ephemerally cap-clean requires broadcast down-grading all of a page's mappings, and may require retries across multiple revocation epochs.

Mappings also come in multiple flavors. The easy cases:
And then there is the hard case involving ephemerally cap-clean (or "recently" ephemerally cap-clean) pages. We would, ideally, like for the revoker to be able to pay no attention to cap-clean pages, be they ephemerally or always so. This requires that software either be able to...

a. broadcast updates to PTEs for pages transitioning from cap-clean. That's potentially expensive.
b. configure mappings of cap-clean pages so that they become as if they were the wrong CRG whenever the page -- not this mapping, but the page via another mapping -- transitions away from cap-clean.

The latter seems preferable to me, but it seems to necessitate a PTE state of "all tagged loads trap" and we almost surely want that to be a true dependence on the loaded value, so that cap-width reads of data don't cause software to have to start paying attention to the page again.

In any case, assuming it were microarchitecturally agreeable, I think the following set of six states suffice for all software we've written so far:
If we had to remove the last one, we could (by using state 1 instead, and broadcasting to PTEs). That's still 5 states, tho', so still 3 bits. The distinction between TP and TA states is only meaningful if the W bit is asserted, but that's probably not useful for any kind of representation compression. Specifically, to the point of
Yes, but that cap-dirty-ness doesn't need to be reflected in the updated mapping, so long as it's tracked in the logical page state in software. Whew. |
@nwf: Thanks for the explanation and your proposal. It's very informative! First of all, I am new to temporal safety using CHERI! Could you elaborate a bit on how a page could get into the "ephemeral cap-clean" state? Is this purely due to the revocation algorithm? My guess is that the revoker can ignore pages that are ephemerally cap-clean altogether. However, these pages are distinct from cap-dirtyable because in the latter you may have other mappings that are cap-dirty, for example -- would the revoker at any point scan all mappings of a cap-dirtyable page and, for example, conclude that they are all "cap-clean", so the page should really be ephemeral and is safe to ignore in the next revocation cycle? P.S. I can see why it would be helpful to have physically-indexed metadata tables! |
Great to hear that that brain-dump communicated something. :)
That's exactly right: it's easy for the revoker to keep track of whether or not a given page was found to have a capability on it during the sweep, and, if not, that's a good reason to try downgrading it to ephemerally cap-clean (even though the application could write a capability later, we might hope that it'll be a few revocation epochs in the future). That downgrade is a frightful wad of complexity that eventually involves looking at all aliases of a page and which we stage across multiple revocation epochs (so that we can avoid per-page TLB shoot-downs and IPIs on platforms that require those). I have a... long-abandoned, long-winded draft of a technical report that has an exhaustively tested Murφ model of multiple TLBs and multiple page aliases and all the permitted flows through the system. While there's a huge semantic gap between that and the implementation in CheriBSD, it was, nevertheless, quite useful for understanding. I'll try to dig that up, stamp "DRAFT" on it in big letters, and put it up as a PDF somewhere for people to look at. |
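The bookkeeping described above (noting, per page, whether the sweep found any capability) can be sketched roughly as follows. This is a toy Python model, not CheriBSD code: the `Page` class, the `(value, tag)` word representation and the `sweep` function are all hypothetical, and the actual downgrade (alias handling, staging across epochs, TLB shoot-downs) is deliberately left out.

```python
from dataclasses import dataclass


@dataclass
class Page:
    # Hypothetical model: each word is (value, tag); tag=True marks a valid capability.
    words: list


def sweep(pages, revoked):
    """Clear tags on revoked capabilities and return the pages on which
    no capability was found at all (candidates for a later downgrade)."""
    candidates = []
    for page in pages:
        saw_cap = False
        for i, (value, tag) in enumerate(page.words):
            if not tag:
                continue
            saw_cap = True  # the page was found to hold a capability
            if value in revoked:
                page.words[i] = (value, False)  # revoke by clearing the tag
        if not saw_cap:
            candidates.append(page)
    return candidates


data_only = Page(words=[(1, False), (2, False)])
holds_cap = Page(words=[(0x1000, True), (3, False)])
candidates = sweep([data_only, holds_cap], revoked={0x1000})
```

Here only `data_only` ends up as a candidate; whether a page whose capabilities were all revoked in the same sweep should also count is a policy choice this sketch does not try to settle.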
@nwf: Thanks for the info! My understanding so far is that there are 3 key goals for the PTEs:
Is that correct? For the optimizations, are there any studies/figures showing what improvements (e.g. performance, pause times, etc) the dirty/ephemeral ideas provide? Is it also feasible to consider these optimizations as separate features? |
@nwf and @andresag01 (I didn't quite parse how the 6 states behaved with respect to capability loads, but I assume the "with 1 CRG bit" meant that it traps on capability read if CRG doesn't match.) |
To try to summarise a conversation from yesterday, a seemingly sensible 2-bit proposal is as follows: Bit 1: CW (Trap if a tagged capability is stored to this page) Desired behaviours: Capability revocation sweep - Capability presence/dirty tracking - Relation to Wes' states: This is more-or-less 2 & 3 of Wes' states; TPg and TAg. =========================== What is missing?/What is our wishlist for an extra bit?
============================ Tests required to make decision:
2.1 Is it possible to ignore CRG when CW is disallowed? If so, we can actually solve the "Missing 2" above, and reduce our states to 3. Can this be mocked up in hardware or software to detect if it is ever necessary? 2.2 If Test#2.1 is not possible, we need to measure the cost of manually flipping CRG bits in the STW phase for pages that need not be swept. Presumably this can be emulated either in Morello or Toooba. |
If CRG, the Capability Read Generation bit, is ignored when CW, Capability Write, is not set, then I think we might get the important states we want? State 1 & 2: CW True, CRG 1/0 State 3: CW False, CRG 0 (for example) State 4: CW False, CRG 1 (for example) Only need to answer 2.1 above. |
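A minimal Python sketch of that variant, under two assumptions that are mine rather than the thread's: CW set means capability stores are permitted, and trapping is tag-sensitive. The function names are hypothetical and this is not the specification's pseudocode.

```python
def store_traps(cw: bool, tagged: bool) -> bool:
    # Assumption: with CW clear, storing a tagged capability traps;
    # untagged data stores are unaffected by the CHERI bits.
    return tagged and not cw


def load_traps(cw: bool, crg: int, current_gen: int, tagged: bool) -> bool:
    # Variant under discussion: CRG is only consulted when CW is set.
    if not cw:
        return False
    return tagged and crg != current_gen


# With CW clear, both CRG values behave identically in hardware, so
# software is free to use CRG to tell two no-capability-write states apart.
cw_clear_results = {
    (crg, gen): load_traps(False, crg, gen, tagged=True)
    for crg in (0, 1)
    for gen in (0, 1)
}
```

Every entry of `cw_clear_results` is `False`, which is what would let the two CW-clear encodings stand for distinct software states without any change in trapping behaviour.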
After a little bit of discussion elsewhere, here's a different 2-bit proposal. It's not great -- I think the 3-bit proposals are nicer -- but I think this is at least workable, if a little trap-heavy in use.

Revocation with four-state PTEs
2024/06/07

The four states proposed here are
That is, this approach jettisons CAP-DIRTYABLE / AMO-cap-dirtying PTEs in favor of the "CapsTrap" state.

Steady states
State transitions
Other Caveats
|
@nwf Thanks for the detailed thoughts! Is there a justification for jettisoning the pre-CRW state for the CapsTrap state? This would maybe also eliminate the tag sensitivity for loads? |
@jonwoodruff has updated the proposal as discussed last week! Thanks for taking the time to do this @jonwoodruff! andresag01#2 |
With some help, I've now performed a large-scale experiment on the feasibility of a restricted CHERI PTE scheme using Morello. I created a branch of CheriBSD that does not flip individual CHERI protection bits (of which there are four), but simply assigns the bits to one of our four states. Quite surprisingly, I got this to work correctly such that it runs all attempted software and benchmarks with revocation enabled without crashing. Sadly, tracing PTE bits on QEMU indicates that I never get the pre-CRW state, and thus am only using 3 states. That is, NoCaps, Caps0, and Caps1.

We then booted Morello hardware with this kernel and ran spec benchmarks. The changes were confirmed by tracing PTE cap-dirtying events, which were zero with the new kernel (and non-zero with the old one). Surprisingly, we see no overhead from running with these reduced states, with performance being consistently slightly faster than previously. This is thought to be due to lack of optimisation of the baseline and some unknown effects that we will now be tracking down.

Nevertheless, the average overhead for temporal safety/revocation on the ref workload for Spec2006 benchmarks that would run (all int workloads but gcc and perlbench) is about 5%. It's likely that this can be reduced significantly with optimisation such that the three-state solution would be slower than the four-state solution with hardware dirtying, which this proposal would also support. It seems more of a stretch to say that any other missing states would have a significant impact on performance.

@nwf has made a cursory review of my modifications, and believes that they should be safe.

In summary, I believe we have demonstrated that the 3-state subset of the proposal works and does not have surprising overheads with Spec2006 benchmarks. The full 4-state solution should be very close indeed to the best we can do.

========
Example modification note: |
I think that |
The key is that virtual memory is associated to S-mode, but not M-Mode. Perhaps the right place for the |
fair point - but let's stick with regularity for now and add it into menvcfg, which I'm working on. All these extra bits will change when we get to ARC review anyway. |
it's probably worth adding something about the benefits/consequences of each of the 4 options, but I think the actual spec is ok now
Jon has pointed out that this is already covered in the doc |
Looking through the PR, I don't think there is a CW bit in either |
(v)sstatus would make the most sense for it. There's precedent from SUM and MXR (and TVM in the full mstatus). |
I agree, |
@jrtc27 I agree that (v)sstatus would be a good place to put this flag. Does hstatus also need its own CRG bit? The RISC-V privileged spec says this:
Wouldn't we expect the OS or hypervisor running in HS-mode to also take advantage of revocation? |
Probably. I think we can safely rule out mstatus. If in doubt maybe add the bit in hstatus? |
yes ok - but I noticed that hstatus doesn't have all the mstatus/status fields. |
I migrated the CRG bit from senvcfg to xstatus excluding hstatus. We can add it to hstatus if it's necessary. |
Signed-off-by: Andres Amaya Garcia <[email protected]>
any objections to merging this? |
ping @jonwoodruff @nwf @nwf-msr: Is this ready to be merged? |
LGTM as a minimal baseline. I'm a little hesitant about the requirements on software, which are not fully correctly realized (that is, there are known bugs, sorry) in the existing CheriBSD prototype, but I think it is possible to be correct atop this specification, which is a very good first step. :)
I think we can merge this now since Wes is also ok with it.
Signed-off-by: Tariq Kurd <[email protected]>
Just found one run-on sentence. Otherwise, it looked good to me! --- a/src/cheri-pte-ext.adoc |
Thanks for spotting this! I applied your change in this commit: 48d5761 |
Refactor the Zcheripte extension to use load-side barriers and only allocate 2 PTE bits for CHERI as follows:

* Bit 1: CW (Trap if a tagged capability is stored to this page)
* Bit 2: CRG (Trap if a tagged capability might be loaded from this page and the generation bit doesn't match the current one in a CSR. Potentially either tag-sensitive trapping, or capability-width trapping allowed.)

Fixes riscv#17

---------

Signed-off-by: Andres Amaya Garcia <[email protected]>
Signed-off-by: Tariq Kurd <[email protected]>
Co-authored-by: Jonathan Woodruff <[email protected]>
Co-authored-by: Tariq Kurd <[email protected]>
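To illustrate how a load-side barrier built from these two bits might behave across a revocation epoch, here is a toy Python model. It rests on assumptions: CW set is taken to mean capability stores are permitted, trapping is modelled as tag-sensitive, and the `Pte`, `Mmu`, `start_epoch` and `handle_load_trap` names are hypothetical. It is a sketch of the idea, not the extension's normative behaviour.

```python
class Pte:
    def __init__(self, cw: bool, crg: int):
        self.cw = cw    # assumption: set means capability stores are permitted
        self.crg = crg  # generation this mapping was last brought up to date with


class Mmu:
    def __init__(self):
        self.gen = 0  # stand-in for the current generation bit held in a CSR

    def cap_load(self, pte: Pte, tagged: bool = True) -> str:
        # Tag-sensitive variant: only loads of tagged capabilities can trap.
        return "trap" if tagged and pte.crg != self.gen else "ok"

    def cap_store(self, pte: Pte, tagged: bool = True) -> str:
        return "trap" if tagged and not pte.cw else "ok"


def start_epoch(mmu: Mmu) -> None:
    # Flipping the generation makes every not-yet-visited mapping stale at once.
    mmu.gen ^= 1


def handle_load_trap(mmu: Mmu, pte: Pte) -> None:
    # A real handler would sweep the page for revoked capabilities first.
    pte.crg = mmu.gen


mmu = Mmu()
pte = Pte(cw=True, crg=0)
before = mmu.cap_load(pte)      # generations match
start_epoch(mmu)
during = mmu.cap_load(pte)      # stale generation
handle_load_trap(mmu, pte)
after = mmu.cap_load(pte)       # mapping brought up to date
```

The point of the model is that starting an epoch touches a single bit of global state, while per-page work is deferred until a capability load actually hits a stale mapping (or until the revoker reaches the page itself).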
This is a draft proposal that reworks the Zcheripte extension to include load-side revocation support in the PTE. The goal of the extension is to support the following:
In summary, CHERI would add 3 bits to the PTE:
Note that this is only a draft to stimulate discussion, not a finished proposal. For example, @jrtc27 previously mentioned that conflating read/write into CA is not desirable. My intuition is that some combinations are not useful, so more efficient encodings are possible. Specifically:
However, is it always the case that the capability read/write bit of a page is set on allocation and remains unchanged throughout the lifetime of a page? If so, then how about something like this:
If this works, then we must be able to encode 7 values which is possible in 3 bits (and we get one spare encoding!). What do you think @jrtc27 @tariqkurd-repo @arichardson?
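The bit-count arithmetic used here and elsewhere in the thread can be checked with a short Python snippet; `bits_needed` is a hypothetical helper name, not something from the spec.

```python
def bits_needed(n_states: int) -> int:
    """Smallest number of bits that can encode n_states distinct values."""
    if n_states < 1:
        raise ValueError("need at least one state")
    return max(1, (n_states - 1).bit_length())


seven = bits_needed(7)          # the 7 values counted above
spare = 2 ** seven - 7          # encodings left over
six, five, four = bits_needed(6), bits_needed(5), bits_needed(4)
```

Seven values fit in 3 bits with one spare encoding; the 6-state and 5-state proposals discussed in the conversation also need 3 bits, and only a 4-state scheme fits in 2.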
Fixes #17