I need to re-add BFI, the feature got missed accidentally in GM releases after June. I'll read your suggestion calmly. Anyway, multithreaded rendering has always been problematic. I haven't seen a single implementation that doesn't crash under certain stress conditions. We already have a "blitting" thread in GM for the triplebuffer implementation. The roadmap we have goes in the direction of implementing a cross-platform software vsync "interrupt" library, using threading to keep track of vsync while keeping rendering in the main thread, similar to what we discussed on your site. Not sure how BFI and your other suggestions will fit in this scheme.
It’s not multithreaded rendering, and actually similar to your triple buffer workflow, I think
Actually, what I am suggesting is technically not multithreaded rendering; see the comments at
https://github.com/libretro/RetroArch/pull/11342
Basically, it’s like your blitter thread, where the main thread renders and the other thread does Present(). It’s a thread that only does timing and presenting. The only things it really does are conditional blitting, waiting for a timer, busywaiting, and presenting to the screen. No rendering per se, technically.
Basically, you want to extend your triple-buffering-only “blitting thread” to all sync technologies, not just triple buffering. It would do its own software emulation of waitable swapchains for lowest latency (but may still use actual waitable swapchains at the output level, if it’s actually VSYNC ON).
Full frame workflows
It would provide the framepacing for triple buffering, G-SYNC, and VSYNC OFF, making sure that emulator frames stay framepaced at the emulator Hz.
1. Rendering thread blits frame to presenting thread
2. Presenting thread will busywait until correct time to present relative to last presentation
This will actually work universally, even for VSYNC ON. It will signal the rendering thread when the frame has been presented (if needed for the emulator thread to continue, since we’re emulating VSYNC ON blocking behavior in software for all sync technologies). If it’s VSYNC ON at the output, the frame presentation thread is just a passthrough.
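A minimal sketch of step 2, written in Python for brevity. The names `wait_until`, `FramePacer` and the `present` callback are hypothetical and not part of GroovyMAME or RetroArch; a real implementation would wrap the graphics API's own present call, and the 2 ms spin margin is just an assumed value.

```python
import time

def wait_until(deadline, spin_margin=0.002, clock=time.perf_counter, sleep=time.sleep):
    """Sleep coarsely, then busywait the final stretch for precision."""
    while True:
        remaining = deadline - clock()
        if remaining <= 0:
            return
        if remaining > spin_margin:
            sleep(remaining - spin_margin)   # coarse sleep, frees the CPU
        # otherwise fall through and spin (busywait) until the deadline

class FramePacer:
    """Keeps presents at least one emulator refresh period apart."""
    def __init__(self, emu_hz, present, clock=time.perf_counter):
        self.period = 1.0 / emu_hz
        self.present = present    # hypothetical callback standing in for Present()
        self.clock = clock
        self.last = None          # time of the last presentation

    def submit(self, frame):
        # Pace relative to the last presentation, as in step 2.
        if self.last is not None:
            wait_until(self.last + self.period, clock=self.clock)
        self.present(frame)
        self.last = self.clock()
        return self.last
```

If the output is already blocking (VSYNC ON at the same Hz), the wait falls through almost immediately, which matches the passthrough behavior described above.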
Beamraced workflows
Every raster, you would blit one “scanline slice” at a time from the rendering thread to the presenting thread. This could be one row of pixels (if 1:1 mapped) or multiple rows of pixels (if CRT filtered). Don’t worry about curved CRT filters and missing data; you’d just let the user adjust the jittermargin (the beam race margin) accordingly to prevent glitching. This is just a per-emulator-pixel-row frameslice blit, and would not be the same frameslice size as in the presenting thread. The presenting thread would decide when it’s time to present a frameslice (i.e. a block of however many scanlines).
1. Rendering thread would call a scan line blitter (perhaps a PresentScanLineFrameSlice() or whatever blitter wrapper name you do) to blit the scanline from the emulator to the presenter thread framebuffer. In the wrapper, busywaiting can occur there (to pace the calling emulator execution), a scanline-blocking behavior version of frame-blocking behavior (VSYNC ON). Maybe this is like a “HSYNC ON” — lol (horizontal sync)
2. Presenter thread will decide whether enough scanlines have been added to start a frameslice present. If so, it will immediately present the frameslice to the display (with all appropriate busywaiting logic, which can be decoupled from the busywaiting in #1; this presents advantages, like scanline-level busywaiting in the blit wrapper but frameslice-level busywaiting in the present thread). The rest of the emulator wouldn’t need to know how big or small the configured frameslices are, or even about frontbuffer execution (NVIDIA VRWorks API to make frontbuffer the display buffer = perfect for single-scanline beam racing). In fact, the frameslice size could differ between the rendering thread and the present thread: the amount blitted between threads need not be the same size as the amount actually presented to the screen. It is basically a jittery rolling queue of scanlines that can be chunked inwards and chunked outwards in independent chunk sizes (i.e. single scanlines blitted in, but frameslices blitted out). It can stay synchronous (same frameslice in, same frameslice out) or not (different size frameslice in, different size frameslice out, due to weird shapes of CRT filters), with the present thread maintaining the jittermargin as needed.
3. There’d be a final Present() in the main emulator module which probably does nothing except make sure the sync is still aligned (but might busywait if the destination display has scanned far ahead of the emulator).
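The “rolling queue of scanlines” in step 2 could look roughly like the sketch below. `ScanlineQueue` and its method names are made up for illustration, and locking between the two threads is left out to keep it short.

```python
from collections import deque

class ScanlineQueue:
    """Rows go in at one chunk size and come out at another."""
    def __init__(self, slice_rows):
        self.slice_rows = slice_rows   # rows per presented frameslice (presenter's choice)
        self.rows = deque()

    def blit_in(self, rows):
        """Rendering thread: add one or more freshly rendered rows."""
        self.rows.extend(rows)

    def take_slice(self):
        """Presenter thread: a full frameslice, or None if not enough rows yet."""
        if len(self.rows) < self.slice_rows:
            return None
        return [self.rows.popleft() for _ in range(self.slice_rows)]
```

Here the renderer can feed single scanlines while the presenter pulls out 4-row (or any size) frameslices, so neither side needs to know the other's chunk size.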
BFI Workflow
1. The blitter thread would pre-generate the series of black framebuffers and pass them to the present thread
2. The present thread will accurately sequence the black frames with proper timing precision.
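As a rough illustration of these two steps: the function names are hypothetical, and putting the visible frame first in the sequence is my assumption, not something GroovyMAME prescribes.

```python
def bfi_sequence(frame, black, length):
    """Step 1: one visible frame followed by (length - 1) black frames."""
    if length < 1:
        raise ValueError("length must be >= 1")
    return [frame] + [black] * (length - 1)

def bfi_schedule(start, emu_hz, length):
    """Step 2: present times for one emulator refresh, evenly spaced."""
    step = 1.0 / (emu_hz * length)    # e.g. 60 Hz x 3 = 180 Hz spacing
    return [start + i * step for i in range(length)]
```

For a 60 Hz emulator and a 3-frame sequence, the present thread would sequence the buffers 1/180 sec apart.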
So you see, it’s exactly your blitter thread, except extended to cover all use cases (including beam racing). No rendering done in the present thread.
- It allows high precision VRR
- It allows high precision beam racing
- It allows high precision BFI
- It allows future rolling-scan software BFI (that can also be simultaneously beam-raced)
- It continues to allow high precision triple buffering
You’d follow the appropriate thread safety practices to make sure that the framebuffers at the present thread level aren’t accessed simultaneously. So during a blit operation, you’d lock the present thread’s framebuffer being blitted to. And during a present operation, you’d lock the framebuffer too before presenting it, and then unlock the framebuffer. That way, you get complete buffer thread-safety, and ZERO RENDERING in the frame presentation thread, while achieving the hit-many-birds-at-once goals.
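The locking described here is the ordinary mutex-around-a-shared-buffer pattern. A small sketch, with `SharedFramebuffer` and the `output` callback being hypothetical names:

```python
import threading

class SharedFramebuffer:
    """The present thread's framebuffer, guarded by one lock."""
    def __init__(self, size):
        self._lock = threading.Lock()
        self._pixels = bytearray(size)

    def blit(self, data):
        """Rendering thread: copy a finished frame in while holding the lock."""
        with self._lock:
            self._pixels[:] = data

    def present(self, output):
        """Present thread: lock, present, unlock. No rendering happens here."""
        with self._lock:
            output(bytes(self._pixels))
```

Because both operations take the same lock, the presenter can never see a half-blitted buffer.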
And magically, this makes a lot of behaviours combinable, such as input delays combined with BFI, or doing BFI on a non-blocking sync (BFI onto VRR), and you’d be able to program new sync technologies not yet invented without needing to modify the rest of the emulator. That is because the blit thread is just a VSYNC ON emulator regardless of what the output is doing.
Possible Architecture / Concept / Idea
What I suggest is that you have blitter wrappers, BlitScanLineFrameSlice() and BlitFullFrame(). Let’s say you already implemented BlitFullFrame() for your triplebuffering implementation (I am not sure what naming convention you actually used).
Blitting the scanline
BlitScanLineFrameSlice(fullbuffer, emupixelrow) would potentially blit a frameslice of (1/emu-vert-rez)th of (actual-vert-resolution), corresponding to the emulator pixel row, to the other thread (presenting thread). This call would block for (1/emu-vert-rez)th of one emulator refresh cycle (1/emulator Hz) since the last raster. This would only be called for beamraceable emulators, even if beamracing is not currently enabled. The scanline blitted doesn’t have to be the same scanline as the one the emulator rendered, just approximately the same territory; we don’t need exact 1:1 since it’s near the end of the jittermargin territory, though it could be a perfect 1:1. If the emulator framebuffer and output framebuffer are the same resolution, with CRT filter disabled, then it’d be a one-pixel-row frameslice corresponding to the most recently rendered emulator pixel row, which is pretty literal to its name. But what matters is that it’s blitting frameslices in ultrafine granularity, tinier than the actual frameslicing on the output end.
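The arithmetic behind this can be sketched as follows. Both helper names are hypothetical, and the timing follows the simple (1/emu-vert-rez) division described above, so it ignores vertical blanking lines.

```python
def scanline_block_time(emu_hz, emu_rows):
    """Seconds each scanline blit should take: (1/emu_rows) of one refresh cycle."""
    return 1.0 / (emu_hz * emu_rows)

def output_rows_for(emu_row, emu_rows, out_rows):
    """Half-open range of output rows covered by one emulator pixel row."""
    top = emu_row * out_rows // emu_rows
    bottom = (emu_row + 1) * out_rows // emu_rows
    return top, bottom
```

For a 60 Hz, 240-row emulator that is 1/14400 sec (about 69 microseconds) per call, and on a 1080-row output each emulator row maps to a slice of 4 or 5 output rows. With matching resolutions the slice is exactly one row.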
Blitting the full framebuffer
BlitFullFrame() would potentially blit the full framebuffer to the other thread (presenting thread). This would be called every time the emulator module finished rendering a frame.
You’d call both BlitScanLineFrameSlice() and BlitFullFrame() regardless of the current sync tech / beamrace setting.
Behavior at the render thread level
- BlitScanLineFrameSlice() would be a no-operation (return immediately) if beam racing is disabled or undesired (e.g. RetroArch-style RunAhead workflows).
- BlitFullFrame() would be a no-operation (except potential timing-alignment busywaiting) if beamracing is enabled.
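A sketch of how the two wrappers could switch themselves off, so the emulator can call both unconditionally. `BlitWrappers` and `sink` are hypothetical names, with `sink` standing in for the hand-off to the present thread, and the timing-alignment busywait is omitted.

```python
class BlitWrappers:
    """The emulator calls both wrappers; each decides whether to do anything."""
    def __init__(self, beamracing, sink):
        self.beamracing = beamracing
        self.sink = sink            # stand-in for the hand-off to the present thread

    def blit_scanline_frameslice(self, framebuffer, emu_row):
        if not self.beamracing:
            return False            # no-op: return immediately
        self.sink(("slice", emu_row, framebuffer[emu_row]))
        return True

    def blit_full_frame(self, framebuffer):
        if self.beamracing:
            return False            # no-op (timing-alignment busywait omitted)
        self.sink(("frame", list(framebuffer)))
        return True
```

The emulator module never checks the beamrace setting itself; flipping the flag changes which of the two calls actually moves data.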
Behavior at the frame-flipper thread level (the present thread)
For the workflow you already use today for triple buffering, the newly added BlitScanLineFrameSlice() is ignored (no delay, no data blitted inwards), while the existing BlitFullFrame() has already done a full blit, much like you already do today with triple buffering. What happens in the thread is now extended to also include all sync technologies AND BFI AND beam racing, not just triple buffering.
Example situation of the presenter thread (frame flipper):
1. VSYNC ON + nonbeamraced: Behaves as passthrough VSYNC ON. BlitFullFrame becomes a synonym for a waitable swapchain Present(). Then do a busywait if Present() unblocked less than one emulator refresh period since the last Present(). (This occurs if output Hz is higher than emulator Hz, so it looks good for a 60fps emulator at 120Hz.) Return immediately if VSYNC ON blocking behavior was predictable.
2. VSYNC ON + BFI + nonbeamraced: Same algorithm as #1 (VSYNC ON + nonbeamraced), including the busywait, with one exception: the busywait is at output Hz granularity instead of emulator Hz granularity, so the code is identical with only minor modifications. Cycle the whole sequence of prerendered black frames this way.
BFI antiflicker logic for emulator-running-faster-than-output-Hz-multiple: If the emulator runs fast (or the output refresh rate is blocking VSYNC ON at too low a refresh rate), the number of framebuffers queued will grow (basically too many framebuffers blitted from the rendering thread to the frame flipper thread). If we build up enough BFI framebuffers for more than one emulator refresh cycle (example: if doing 180Hz and 3-frame-sequence BFI, then an overflow condition is 6 buffers queued in the frame-presenter thread), then throw away the unwanted BFI buffers and only cycle the newest emulator refresh cycle’s BFI sequence (e.g. keeping only 3). Result: we’ve dropped an emulator frame’s BFI sequence of frames without creating interrupting flicker.
3. Unsynced (triple buffer / VRR / DWM / VSYNC OFF): Execute exactly the same algorithm as #1. It already has a conditional busywait, so it automatically works correctly, much like today’s GroovyMAME during triple buffering / VRR. So the existing algorithm is universal for all unsynced technologies. De facto, timing-wise, it’d behave the same as your existing threaded triple buffering algorithm.
4. Unsynced + BFI + nonbeamraced: Execute exactly the same algorithm as #1, except with busywaits at the custom software-defined Hz. For VRR you can do any software-defined Hz within the VRR range; it can be 120 or 180 or 240 for 240Hz VRR. I think GroovyMAME already works with VRR simply by using its triple buffering algorithm, so this is just a different workflow to achieve a BFI-compatible result.
Notice 1/2/3/4 is essentially the same algorithm — essentially the same as your existing triple buffering algorithm (slightly modified to be compatible with all sync technologies).
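The BFI antiflicker logic from case 2 is the only part that needs extra queue handling. A sketch of it, where `trim_bfi_queue` is a hypothetical name and the queue is assumed to start on a sequence boundary (oldest buffer first):

```python
from collections import deque

def trim_bfi_queue(queue, sequence_len):
    """Drop stale BFI sequences once two or more whole sequences are queued.

    Keeps the newest complete sequence (plus any partial one after it), so a
    whole emulator frame is skipped instead of part of a sequence.
    Returns the number of buffers dropped.
    """
    whole = len(queue) // sequence_len
    if whole < 2:
        return 0
    drop = (whole - 1) * sequence_len
    for _ in range(drop):
        queue.popleft()
    return drop
```

With a 3-frame sequence, 6 queued buffers is the overflow condition: the oldest 3 are thrown away and the newest 3 keep cycling.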
Now, that makes it much easier to add future beamracing workflows:
A. Beamraced: Thanks to BlitScanLineFrameSlice() from the rendering thread, the frame flipper thread’s output framebuffer is already built up to almost the current emulator scanline territory, so you present the frameslice as you already do in your GroovyMAME patch, including the small raster-based busywaiting you already do. The frame flipper thread will do the raster busywaiting, while the rendering thread will spin on a raster mutex maintained by the frame flipper thread (respecting the configurable jitter margin). Essentially hardware-based beamracing (emuraster=realraster).
B. Rolling-scan BFI: Electron gun emulator. BlitScanLineFrameSlice() will have built up the framebuffer and could then render a “rolling bar” at the output framebuffer (in the rendering thread within the wrapper, before blitting) and finally blit that to the presenter thread. At 360Hz, that is six rolling-bar positions (with alphablend overlaps), like six different 1/6th screenfuls (with bleed overlaps for alphablends to prevent that “stationary tearline artifact” problem). So most BlitScanLineFrameSlice() calls would return immediately, with about 6 of the calls (evenly spaced 1/6th of a screenful apart in raster terms) rendering the framebuffer containing the rolling bar and passing it to the presenter thread on the spot. For 360Hz, it’d be full-refresh-cycle beamracing at the destination, 1/360sec behind the emulator refresh. No modification is needed to existing emulator modules that are already calling the blitter wrappers; all of these modifications are within the blit wrapper and the presenter thread. Essentially software-based beamracing (output Hz granularity, don’t care about actual hardware raster position).
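To make the rolling-scan case (B) a bit more concrete, here is a sketch of which scanline blit calls would trigger a hand-off and what each bar would cover. The function names and the overlap value are illustrative only, and a 60 Hz emulator is assumed for the 360Hz example.

```python
def rolling_bar_triggers(emu_rows, emu_hz, out_hz):
    """Emulator rows after which a rolling-bar frame goes to the presenter."""
    bars = round(out_hz / emu_hz)          # e.g. 360 / 60 = 6 bar positions
    return [(i + 1) * emu_rows // bars - 1 for i in range(bars)]

def bar_bounds(index, bars, rows, overlap):
    """Half-open row range lit for one bar position, widened for alpha-blend overlap."""
    top = max(0, index * rows // bars - overlap)
    bottom = min(rows, (index + 1) * rows // bars + overlap)
    return top, bottom
```

All other scanline blit calls would return immediately; only the rows in the trigger list cause a full framebuffer with the bar drawn to be passed across.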
Hopefully this is a catch-all architecture that is a minor modification of your triple buffer workflow
There’s only one rendering thread. The other thread’s job is only to flip framebuffers. This would be a universal workflow that works with all sync technologies. CRT filtering will continue to stay in the rendering thread, as today.
I think it is your existing triple buffering workflow, with minor modifications to also be compatible with:
- VSYNC ON
- VSYNC OFF
- VRR (FreeSync, G-SYNC)
- BFI on VSYNC ON
- BFI on VRR
- Beamrace enhanced (hardware beamracing, software beam racing)
- Future rolling BFI / electron gun emulators
- Future sync technologies not yet invented
If I understand correctly what you already implemented (correct me if I am wrong), then hopefully you can wrap your head around how conveniently futureproof your triple buffer algorithm is: make minor modifications to accommodate all sync technologies, plus add a scanline blitter hook (in addition to your existing frame blitter hook) to be compatible with all future workflows.
By default, your emulator would then work correctly with most sync technologies out of the box. Launching the emulator into VSYNC ON or VRR or triple buffering would work correctly without mandatory user configuration. So would DWM + VSYNC OFF or Enhanced Sync or Fast Sync. Then by specifying a custom framerate cap (for VRR, that’s a software-based refresh rate) such as “180”, it would correctly do BFI on VRR. The cap could also be calculated from the BFI sequence size (e.g. 2, 3, 4) you specify; it would assume that refresh rate, and framepace at that cap successfully regardless of VSYNC ON, VSYNC OFF, VRR, or triple buffering. It’d only erratically flicker if the output was fixed-Hz and not divisible by emulator Hz, but it’d not flicker in all other situations (even triple buffering would look fine, as long as it’s framepacing extremely accurately and the output Hz is an exact multiple of emulator Hz).
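The cap calculation and the flicker condition can be written down in a few lines. Both function names are hypothetical, and the 0.01 tolerance on the Hz ratio is just an assumed value.

```python
def bfi_frame_cap(emu_hz, sequence_len):
    """Software refresh rate implied by a BFI sequence size."""
    return emu_hz * sequence_len

def flickers_erratically(output_hz, emu_hz, fixed_hz, tol=0.01):
    """True for a fixed-Hz output that is not a whole multiple of emulator Hz."""
    if not fixed_hz:
        return False                      # VRR: the software picks the rate
    ratio = output_hz / emu_hz
    return abs(ratio - round(ratio)) > tol
```

So a 60 Hz emulator with a 3-frame sequence gives a cap of 180, and a fixed 144Hz display would be flagged while 120Hz or any VRR mode would not.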
So basically a lot of automagic compatibility, simply by using your triplebuffer workflow for everything (even including VSYNC ON, even including BFI) by default.