Personally I don't get all the technique behind it, but as far as I understand, most of the games/hardware emulated are not frame-based; rather, they do everything their developers programmed them to do within a certain time, which is better measured in milliseconds.
When MAME emulates a game, if the driver is correct/accurate and doesn't rely on additional software to assist it (which is sometimes the case for a number of games), it neither expands nor shrinks that time; it should be the same as the real thing.
On top of that, MAME's renderer creates as many frames as it can in succession, independently of the emulation, and alongside those there are other frames used for vblank and inputs.
(I think understanding how all of this is organized in terms of frames/threads/priority is essential to fully explain it, but that part is beyond my current level.)
What I'll say here is speculation again because I don't fully understand it, but I think the moment the inputs register happens close to (or, under some conditions, at the same time as?) the render frame. If the render frame you see happens late because several buffer frames came before it (like what happens with vsynced D3D/OGL/BGFX), then of course you feel a lot of lag, because it lands several frames on top of the game's natural delay.
Nth edit: or maybe it's the other way around, and the cause of perceptible lag is that they're too far apart.
Using plain D3D9ex (or frame_delay 1) there's only one frame used for vsync, so the render, and therefore the inputs, are already much closer to the game.
Using higher values of frame_delay then... delays the rendering of the next/upcoming frame, giving the inputs more chances to register close to (or within the same time frame as?) that 'closest' render frame, instead of missing the opportunity and registering with the next one.
frame_delay 0 to 9 represents 10 fractions of a frame (one step being a bit over 1.6ms at 60Hz); the higher your setting, the more the next frame is delayed and the better the chance the inputs register within the time of the current rendered frame.
In theory, if you can set it to 9, the inputs register as close to the base game's delay as possible, and the extra delay should be almost unmeasurable (too short).
So in practice, if you set frame_delay to about 7~8, there's only 5~3.3ms of delay on top of the game's own.
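If I got the arithmetic right, the leftover delay works out like this. A quick sketch (the 60Hz frame period and the 10-step split are my assumptions from the description above; the real timing depends on GroovyMAME's internals):

```python
# Rough sketch of the frame_delay arithmetic described above.
# Assumes a 60 Hz game; actual behaviour depends on GroovyMAME internals.
FRAME_MS = 1000 / 60      # ~16.67 ms per frame
STEP_MS = FRAME_MS / 10   # one frame_delay step, a bit over 1.6 ms

def extra_delay_ms(frame_delay: int) -> float:
    """Worst-case extra delay left on top of the game's own delay,
    for a frame_delay setting from 0 to 9."""
    if not 0 <= frame_delay <= 9:
        raise ValueError("frame_delay must be 0..9")
    # The remaining (10 - frame_delay) steps of the frame are still
    # ahead of the render when the inputs get their chance to register.
    return (10 - frame_delay) * STEP_MS

for fd in (1, 7, 8, 9):
    print(f"frame_delay {fd}: ~{extra_delay_ms(fd):.2f} ms extra")
```

With frame_delay 7 that gives ~5ms and with 8 it gives ~3.33ms, which is where the 5~3.3ms figure above comes from.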
Of course, there you also have to consider your PC+OS capability and settings, controller port input polling, the controller PCB or adapter's own lag, and display lag.
With carefully selected and configured hardware you can keep that lag chain very low, adding only a few more ms on top of GroovyMAME's, and stay under half a frame (8.3ms) in total on top of the game's normal lag/time.
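Just to illustrate how such a budget could add up (every number below is hypothetical, made up for the example; only the half-frame 8.3ms target comes from the point above):

```python
# Hypothetical lag-chain budget for a 60 Hz game.
# All component values are invented for illustration; only the
# half-frame target (16.67 / 2 = ~8.3 ms) is from the discussion above.
HALF_FRAME_MS = (1000 / 60) / 2

budget_ms = {
    "frame_delay 8 leftover": 3.33,   # see the arithmetic above
    "USB polling (1000 Hz)": 1.0,     # hypothetical
    "controller PCB/adapter": 1.0,    # hypothetical
    "display processing": 2.0,        # hypothetical
}

total = sum(budget_ms.values())
print(f"total: {total:.2f} ms, under half a frame: {total < HALF_FRAME_MS}")
```

The point is just that each link in the chain eats into the same small budget, so a slow display or adapter alone can blow past the half-frame mark.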
This is how I understand it, but of course my explanation is probably wrong in various places. Calamity and a handful of more educated members here would certainly correct it and explain better than I do.
In any case, contrary to run-ahead, frame_delay is not designed to eliminate frames corresponding to the game's original time/delay, so 'lower lag than real hardware' is impossible with it.
There are more matters worth considering on top of all this, namely proper refresh rates, and in your case the alternative of variable-refresh-capable hardware such as G-Sync and FreeSync.
Wordy, sure