
Topic: The input lag issue in the context of emulation [about new -frame_delay option]

Dr.Venom

  • Trade Count: (0)
  • Full Member
  • ***
  • Offline Offline
  • Posts: 270
  • Last login:May 08, 2018, 05:06:54 am
  • I want to build my own arcade controls!
Hi Calamity,

I've stumbled upon an issue with GroovyUME. I'm currently (only) using tweaked Soft15Khz modelines in my system and I'm (only) making use of the 'switchres' functionality of GroovyUME 0146u4 (modelines=0). The settings that I use are :

- waitvsync=1
- throttle=1
- multithreading=1
- changeres=1 
- switchres=1
- Display on my secondary (15Khz) monitor

This works very nicely (smooth scrolling and in-game screen switching for genesis, snes, etc.) with all MAME/MESS machines.

Recently I thought I'd try the "syncrefresh=1" option. But if I use this option in combination with the above settings, GroovyUME runs far too fast! :-/ I also tested the exact same settings on the latest official UME 0147u2 (of course without the 'changeres=1' functionality), and there it runs at the correct (smooth) speed.

The testcase was with Genesis emulation, but it also seems to be the case for other systems. Some information on my setup: I have set up a PC with two graphics cards (first slot AMD HD 7870, second slot HD4850) and two monitors. The first monitor is an LED and the second is a 15Khz monitor.

Any idea why adding the syncrefresh=1 setting to the existing config is causing GroovyUME to run way too fast? Hopefully it's something that can be resolved.
« Last Edit: December 01, 2012, 07:35:49 am by Calamity »

Calamity

  • Moderator
  • Trade Count: (0)
  • Full Member
  • *****
  • Offline Offline
  • Posts: 7473
  • Last login:Today at 02:50:19 pm
  • Quote me with care
When using -syncrefresh, GroovyMAME/GroovyUME relies exclusively on the vertical retrace signal to throttle the game. So a crazy speed with -syncrefresh usually means the hardware is not reporting a proper vsync. This is often caused by DirectDraw's hardware acceleration being disabled (see dxdiag, screen tab), which in turn can be due to a badly installed video driver or a mirror driver installed by some remote control app like RealVnc (http://forum.arcadecontrols.com/index.php/topic,113382.msg1310183.html#msg1310183). Even cloning the desktop over two monitors with different specs might be the issue. You can try using -video d3d too.

The odd thing here is that you stated you were getting smooth scrolling with the above settings. I'm almost sure that what you've seen was only almost-smooth scrolling. The only way to get truly smooth scrolling with GroovyMAME is by means of the -syncrefresh option.
Important note: posts reporting GM issues without a log will be IGNORED.
Steps to create a log:
 - From command line, run: groovymame.exe -v romname >romname.txt
 - Attach resulting romname.txt file to your post, instead of pasting it.

CRT Emudriver, VMMaker & Arcade OSD downloads, documentation and discussion:  Eiusdemmodi

Dr.Venom

I don't think it's related to a driver installation issue, as I don't have problems with other emulators like WinUAE (low latency vsync), bSNES, and others. But nonetheless, I've been digging deeper into the matter, taking your suggestions into account, and have some interesting findings. I created a clean ume.ini (GroovyUME -cc) and only changed modeline=0 and put my rom paths in.

All testcases are done by starting the genesis emulation with the game Mega Turrican (US), as it nicely switches a few times at start between manufacturer logo, title screen and game.

The first thing I did was remove the HD7870 from my PC, to test *only* with the HD4850 installed. This gave the identical results (GroovyMAME running too fast when using syncrefresh). This rules out the possibility that it has something to do with multi-gfx-card setup or cloning/spanning of desktops over multiple monitors.

The second thing I did was test different ATI Catalyst driver versions. I tested a number of drivers, from the latest AMD 12.6 driver back to the ATI versions from a few years ago (10.3), plus a few versions in between. They all resulted in GroovyMAME running too fast with syncrefresh. Just to be safe, I checked dxdiag with all the drivers, and they were all properly installed with 2D hardware acceleration enabled. These driver tests pretty much rule out that it's a (specific) driver issue. (Which was to be expected, since I have no issues with other emulators.)

Then I got the idea to disable switchres in ume.ini, keeping everything else the same, so that it would open its screen on my desktop setting 740x240@60hz, and guess what? It ran smoothly!  Hmm... so that got me thinking, it seems to have something to do with the particular resolution used. So then I started setting my desktop to various resolutions, and found out that the ones that had a vertical resolution of 224 resulted in GroovyUME running too fast with syncrefresh, but the resolutions of vertical 240 or above didn't show the problem.

To try to pinpoint the problem even more I set the switchres option back to 1 in UME.ini, but now doing testcases with *only* four modelines installed on my system. The only difference between them being that they alternately have a 224 or 240 *visible* vertical resolution (but note:  refresh rates are exactly the same!)

[1] modeline "320x224x60.00 (Genesis)" 6.72 320 341 372 426 224 236 239 263 -hsync -vsync
[2] modeline "320x240x60.00 (Genesis)" 6.72 320 341 372 426 240 243 246 263 -hsync -vsync
[3] modeline "640x224x60.00 (Genesis)" 13.44 640 682 744 852 224 236 239 263 -hsync -vsync
[4] modeline "640x240x60.00 (Genesis)" 13.44 640 682 744 852 240 243 246 263 -hsync -vsync
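As a sanity check on the "refresh rates are exactly the same" claim: the vertical refresh of a modeline is pixel clock / (htotal x vtotal), so the visible height doesn't enter into it. A minimal sketch in C; the struct and function names are made up for illustration and are not taken from GroovyMAME or Soft15Khz:

```c
#include <assert.h>

/* Illustrative only: a pared-down modeline holding just the fields
   needed for timing. Field names are invented for this sketch. */
struct modeline {
    double pclock_mhz;  /* pixel clock in MHz, e.g. 6.72 */
    int htotal;         /* total pixels per scanline, e.g. 426 */
    int vactive;        /* visible lines, e.g. 224 (unused in the timing math) */
    int vtotal;         /* total lines per frame, e.g. 263 */
};

/* Horizontal scan rate in Hz: pixel clock / pixels per line. */
double modeline_hfreq(const struct modeline *m)
{
    assert(m->htotal > 0);
    return m->pclock_mhz * 1e6 / m->htotal;
}

/* Vertical refresh in Hz: line rate / lines per frame. */
double modeline_vfreq(const struct modeline *m)
{
    assert(m->vtotal > 0);
    return modeline_hfreq(m) / m->vtotal;
}
```

Plugging in any of the four modelines gives about 15.77 kHz horizontal and about 59.98 Hz vertical, since 6.72 MHz / (426 x 263) and 13.44 MHz / (852 x 263) are the same number.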

I've logged four testcases. Testcase one starts out with the above four modelines (GroovyMAME prefers them from top to bottom); for each subsequent testcase I removed one modeline from the system (reboot, test, etc.), so that each testcase picks the next modeline.

I've attached the full logs in the zip. The summary is as follows:

[1] 320x 224@ 60Hz : GroovyUME runs into the problem of running too fast with syncrefresh (Average speed: 177.24%)
Code: [Select]
blit_lock = TRUE
DirectDraw: Configuring device ATI Radeon HD 4800 Series         
Target refresh = 60.000000
DirectDraw: Selecting video mode...
   320x 224@ 60Hz -> 3000.000000
   320x 240@ 60Hz -> 1058.823530
   640x 224@ 60Hz -> 1003.115265
   640x 240@ 60Hz -> 1002.967359
DirectDraw: Mode selected =  320x 224@ 60Hz
DirectDraw: primary surface created: 320x224x32 (R=00FF0000 G=0000FF00 B=000000FF)
DirectDraw: New blit size = 320x224
DirectDraw: blit surface created: 320x224x32 (R=00FF0000 G=0000FF00 B=000000FF)
blit_unlock = TRUE
window_proc: WM_PAINT
blit_lock = FALSE
window_proc: WM_PAINT:END
Average speed: 177.24% (19 seconds)

[2]  320x 240@ 60Hz : It works correctly with syncrefresh!
Code: [Select]
blit_lock = TRUE
DirectDraw: Configuring device ATI Radeon HD 4800 Series         
Target refresh = 60.000000
DirectDraw: Selecting video mode...
   320x 240@ 60Hz -> 1058.823530
   640x 224@ 60Hz -> 1003.115265
   640x 240@ 60Hz -> 1002.967359
DirectDraw: Mode selected =  320x 240@ 60Hz
DirectDraw: primary surface created: 320x240x32 (R=00FF0000 G=0000FF00 B=000000FF)
DirectDraw: New blit size = 320x224
DirectDraw: blit surface created: 320x224x32 (R=00FF0000 G=0000FF00 B=000000FF)
blit_unlock = TRUE
window_proc: WM_PAINT
blit_lock = FALSE
window_proc: WM_PAINT:END
Average speed: 97.19% (29 seconds)
Note that it does not report 100% on average speed because of the screen switching it is doing, but I can assure you that it is running correct/smooth in between the switches.
 
[3] 640x 224@ 60Hz : Again (just like testcase 1) with the 224 pixels vertical resolution, GroovyUME runs too fast with syncrefresh on (Average speed: 180.73%)
Code: [Select]
blit_lock = TRUE
DirectDraw: Configuring device ATI Radeon HD 4800 Series         
Target refresh = 60.000000
DirectDraw: Selecting video mode...
   640x 224@ 60Hz -> 1003.115265
   640x 240@ 60Hz -> 1002.967359
DirectDraw: Mode selected =  640x 224@ 60Hz
DirectDraw: primary surface created: 640x224x32 (R=00FF0000 G=0000FF00 B=000000FF)
DirectDraw: New blit size = 640x224
DirectDraw: blit surface created: 640x224x32 (R=00FF0000 G=0000FF00 B=000000FF)
blit_unlock = TRUE
window_proc: WM_PAINT
blit_lock = FALSE
window_proc: WM_PAINT:END
Average speed: 180.73% (22 seconds)

[4]  640x 240@ 60Hz: It works correctly with syncrefresh!
Code: [Select]
blit_lock = TRUE
DirectDraw: Configuring device ATI Radeon HD 4800 Series         
Target refresh = 60.000000
DirectDraw: Selecting video mode...
   640x 240@ 60Hz -> 1002.967359
DirectDraw: Mode selected =  640x 240@ 60Hz
DirectDraw: primary surface created: 640x240x32 (R=00FF0000 G=0000FF00 B=000000FF)
DirectDraw: New blit size = 640x224
DirectDraw: blit surface created: 640x224x32 (R=00FF0000 G=0000FF00 B=000000FF)
blit_unlock = TRUE
window_proc: WM_PAINT
blit_lock = FALSE
window_proc: WM_PAINT:END
Average speed: 96.04% (20 seconds)

To me these results suggest that there's possibly an issue with the way GroovyUME handles/calculates the really low vertical resolution (< 240?) cases. What do you think?

At least hopefully this will bring us closer to pinpointing the problem and finding a solution.
« Last Edit: November 07, 2012, 10:18:32 am by Dr.Venom »

Calamity

Thanks for doing this detailed test. It's very strange that two almost identical modelines behave so differently. Your setup happens to be one that I hardly ever test myself: the one with the -modeline option disabled.

I'd perform the following tests before going on:

- Try switching to video d3d just in case
- Try disabling the -changeres option
- Try disabling -multithreading

The problem could be that, because of the resolution switch, the display didn't complete its setup the second time for some reason, so the vsync signal is not available. But it's strange that it doesn't always happen.

We also need to make sure that those modes are actually properly formed by the system. Please launch Arcade_OSD (it's in the CRT_Emudriver download), so you can test each of those modes full screen and perform a speed measurement in order to find if the vsync signal is actually supported. Once you make sure everything is fine we can focus on the possible issue in GM.


Dr.Venom

Quote
Thanks for doing this detailed test. It's very strange that two almost identical modelines behave so differently. Your setup happens to be one that I hardly ever test myself: the one with the -modeline option disabled.

Yes, I would also rather use -modelines, but unfortunately I'm on Windows 7 64-bit and haven't been able to get that option running. Since I gathered that it also isn't supposed to work without the CRT_Emudriver, I've focused on getting GroovyMAME to run with the Soft15Khz modelines. That works great btw, apart from the mentioned issue with syncrefresh.

Quote
I'd perform the following tests before going on:

- Try switching to video d3d just in case

This works (correct speed), but unfortunately it has a short sound glitch (pitch shift) upon screen switches, plus very small graphic glitches.  (These don't happen in ddraw).

Quote
- Try disabling the -changeres option
- Try disabling -multithreading

Both make no difference.

Quote
The problem could be that, because of the resolution switch, the display didn't complete its setup the second time for some reason, so the vsync signal is not available. But it's strange that it doesn't always happen.

Yes indeed.. Are there any recalculations done in the code based on screen dimensions to track the refresh rate? Or does it (try to) poll the screen rate shortly? (Or both?)

Quote
We also need to make sure that those modes are actually properly formed by the system. Please launch Arcade_OSD (it's in the CRT_Emudriver download), so you can test each of those modes full screen and perform a speed measurement in order to find if the vsync signal is actually supported. Once you make sure everything is fine we can focus on the possible issue in GM.

I've run Arcade_OSD and done the speed measurements per screenmode (5/coin). The results are:

320x224 -> 60.002 hz
320x240 -> 60.001 hz
640x224 -> 60.005 hz
640x240 -> 60.006 hz

Also with every screen when doing the speed test, the rastered background is scrolling very smoothly (nice feature btw :) ). I guess this confirms that the screenmodes are ok?

Calamity

Quote
Yes, I would also rather use -modelines, but unfortunately I'm on Windows 7 64-bit and haven't been able to get that option running. Since I gathered that it also isn't supposed to work without the CRT_Emudriver, I've focused on getting GroovyMAME to run with the Soft15Khz modelines. That works great btw, apart from the mentioned issue with syncrefresh.

You should have started with this ;)

Quote
I'd perform the following tests before going on:

- Try switching to video d3d just in case

This works (correct speed), but unfortunately it has a short sound glitch (pitch shift) upon screen switches, plus very small graphic glitches.  (These don't happen in ddraw).

The sound glitch upon mode change is something that you have to live with. On the other hand, I'm interested in those small graphic glitches: what are they exactly?

That said, unfortunately DirectDraw seems to be extremely buggy in Windows 7. Even in Vista I've seen very odd things when testing GM + ddraw on people's laptops, frames chopped by the middle and stuff like that.

Do you happen to have your desktop mode set as interlaced? There's a well known bug affecting W7 and ddraw when switching from progressive to interlaced and vice versa.

I don't know the underlying reasons, but it seems that DirectDraw is emulated to some degree under W7/Vista, so the only interface you can trust to work as advertised is Direct3D.

I'm even considering switching back to d3d as the default video setup in GM for the new version for compatibility reasons, once the classic drawbacks of using d3d have been solved by means of patches (-cleanstretch, etc.).

Quote
I've run Arcade_OSD and done the speed measurements per screenmode (5/coin). The results are:

320x224 -> 60.002 hz
320x240 -> 60.001 hz
640x224 -> 60.005 hz
640x240 -> 60.006 hz

Also with every screen when doing the speed test, the rastered background is scrolling very smoothly (nice feature btw :) ). I guess this confirms that the screenmodes are ok?

Yes, this confirms the screen modes are ok. I'm thinking of a possible test if you have time and energy. Arcade_OSD uses ddraw's flip function in order to synchronize, however GM uses ddraw's waitvsync, which is a different although related DX's feature. You can force GM to use ddraw's flip function by enabling -triplebuffer, so something you could test would be this:

groovymame game -video ddraw -triplebuffer -nothrottle -nomt

I don't believe it would make any difference, but just for the sake of science...
« Last Edit: November 08, 2012, 11:49:45 am by Calamity »

Dr.Venom

Quote
You should have started with this ;)

No, no, you should have started asking that ;)

Quote
The sound glitch upon mode change is something that you have to live with. On the other hand, I'm interested in those small graphic glitches: what are they exactly?

What happens upon a switch is that it briefly shows the desktop wallpaper (this is normal), then it switches back to the game visuals (also normal), but then, while already showing the game visuals, it again shows the desktop wallpaper for an instant (a frame or two). That last part I was/am perceiving as a "glitch". (Mainly because ddraw switches cleanly from game visuals -> desktop briefly -> game visuals.)

Quote
That said, unfortunately DirectDraw seems to be extremely buggy in Windows 7. Even in Vista I've seen very odd things when testing GM + ddraw on people's laptops, frames chopped by the middle and stuff like that.

Do you happen to have your desktop mode set as interlaced? There's a well known bug affecting W7 and ddraw when switching from progressive to interlaced and vice versa.

I don't have my desktop set to interlaced mode, but I'm aware of this bug. I encountered it when running the psx emulation. At one point it then just shows a black screen.  I've been involved with WinUAE testing some time ago, and from my experience with that, the problem mainly shows when the desktop is set to interlace and the program tries to open a progressive mode. The other way around shouldn't be a problem. And once the program is running it also doesn't have an issue switching between the two.

Quote
I'm even considering switching back to d3d as the default video setup in GM for the new version for compatibility reasons, once the classic drawbacks of using d3d have been solved by means of patches (-cleanstretch, etc.).

Yeah, that automatic resizing is definitely one of the drawbacks of using d3d for pixel-perfect emulation. GM's -cleanstretch does seem to go a long way towards getting 1:1 pixel mapping though, which is a good thing.

Quote
Yes, this confirms the screen modes are ok. I'm thinking of a possible test if you have time and energy. Arcade_OSD uses ddraw's flip function in order to synchronize, however GM uses ddraw's waitvsync, which is a different although related DX's feature. You can force GM to use ddraw's flip function by enabling -triplebuffer, so something you could test would be this:

groovymame game -video ddraw -triplebuffer -nothrottle -nomt

I don't believe it would make any difference, but just for the sake of science...

Forcing the above makes it run perfectly at the correct speed! (And the switching is clean without artifacts.)

But... IMO, there's a large drawback to using -triplebuffer, as it introduces quite a large amount of "input lag". So while it looks good, the playability of fast shoot 'em ups goes down the drain. IMHO this is sadly overlooked by many people, but it becomes painfully obvious when comparing side by side with real hardware.

Is there any chance of getting the flip function to work correctly without -triplebuffer in GM? That would be perfect :) Possibly you're already familiar with it, but you can ask a Direct3D device whether it is currently in VBlank via D3DRASTER_STATUS:

http://msdn.microsoft.com/en-us/library/windows/desktop/bb172596%28v=vs.85%29.aspx . Maybe that could open up the possibility of using the flip method with (no-buffer) vblank timing?
« Last Edit: November 08, 2012, 02:21:42 pm by Dr.Venom »

Calamity

Quote
Forcing the above makes it run perfectly at the correct speed! (And the switching is clean without artifacts.)

Good to know, fairly interesting.

Quote
But... IMO, there's a large drawback to using -triplebuffer, as it introduces quite a large amount of "input lag". So while it looks good, the playability of fast shoot 'em ups goes down the drain. IMHO this is sadly overlooked by many people, but it becomes painfully obvious when comparing side by side with real hardware.

Yeah, and that's a paradox because the whole idea behind triplebuffering is to reduce the input lag associated with double buffering to the minimum while preventing tearing.

IMHO it's not the concept of triplebuffering that's wrong; the problem seems to be the implementation of DirectX's flip function, because it doesn't return immediately as advertised, thus preventing truly asynchronous (lagless) rendering.

We have implemented an asynchronous triple buffer in GM that works when -multithreading is enabled, by moving the rendering code into a third execution thread, thus bypassing the flip wait bottleneck, and in theory should be lagless. Don't expect smooth scrolling, obviously.

However we can't be sure, even in this case, that the behaviour of DirectX is the correct one when dealing with more than two buffers. We would expect DirectX to always flip to the most recent rendered frame but I suspect it could be just arranging a damned chain, that would explain some of the extra lag noticed by people.

Additionally, there's another source of lag in main line MAME that gets exposed when using -triplebuffer, especially with -multithreading, when the video card's refresh and the game refresh are different enough, that's truly dramatic. This happens because the input is received through the window thread but this one is locked during a flip operation (for the reason explained above). As the main emulation thread runs freely this often results in several consecutive frames being virtually deaf to the input messages.

Many 60 Hz vertical games are often forced to run rotated on horizontal monitors at frequencies of 50-53Hz or so in order to allow 256-288 lines in 15 kHz, this is the perfect test case, and I wonder if most horror tales about -triplebuffer don't come from this fact. (This is also fixed by GM.)
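To put rough numbers on that: at a fixed horizontal rate, the total number of lines per frame (visible plus blanking) is just hfreq / vfreq. A quick sketch in C; the function name is made up for this example, and the 15625 Hz line rate used below is only an assumed value for a "15 kHz" monitor:

```c
#include <assert.h>

/* Total scanlines (visible + blanking) that fit in one frame when the
   monitor scans hfreq_hz lines per second and refreshes at vfreq_hz. */
int max_total_lines(double hfreq_hz, double vfreq_hz)
{
    assert(hfreq_hz > 0.0 && vfreq_hz > 0.0);
    return (int)(hfreq_hz / vfreq_hz);  /* truncate: partial lines don't count */
}
```

With that assumed line rate, max_total_lines(15625.0, 60.0) gives 260 and max_total_lines(15625.0, 50.0) gives 312, so 256-288 visible lines plus blanking only fit once the refresh drops well below 60 Hz.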

If it wasn't clear: of course *normally* you don't need triplebuffer, -syncrefresh (vsync) is enough. We only need triplebuffer when video card and game speed are too different and we can't synchronize without affecting speed but we still want to get rid of tearing.

Anyway there's a lot of confusion because most bad press about vsync/triplebuffer comes from articles written for the pc 3d game scenario where they want their game loops to run at as many fps as possible regardless the video card refresh. Our case is totally different because in emulation we want the loop and the screen to update at the same pace.

Quote
Is there any chance of getting the flip function to work correctly without -triplebuffer in GM? That would be perfect :) Possibly you're already familiar with it, but you can ask a Direct3D device whether it is currently in VBlank via D3DRASTER_STATUS:

http://msdn.microsoft.com/en-us/library/windows/desktop/bb172596%28v=vs.85%29.aspx . Maybe that could open up the possibility of using the flip method with (no-buffer) vblank timing?

Well that is a very easy patch to implement if you have the time to compile and test. This change in ddraw.c will revert -triplebuffer behaviour to classic double buffer:

Code: [Select]
// for triple-buffered full screen mode, allocate flipping surfaces
if (window->fullscreen && video_config.triplebuf)
{
    dd->primarydesc.dwFlags |= DDSD_BACKBUFFERCOUNT;
    dd->primarydesc.ddsCaps.dwCaps |= DDSCAPS_FLIP | DDSCAPS_COMPLEX;
    //dd->primarydesc.dwBackBufferCount = 2;  // original: two back buffers (triple buffering)
    dd->primarydesc.dwBackBufferCount = 1;    // patched: one back buffer (classic double buffering)
}

Actually what I had in mind for the future(?) would be to get rid of flipping altogether, in order to manually implement a -triplebuffering model that actually worked as the theory says, by bypassing the whole DirectX black box but for the waitvsync function, but unfortunately today I learned that can't be trusted exclusively :)

Another option is to create a manual loop to poll the vsync status as you say, that's a good possibility now that we already have a separate thread for that.
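Just to illustrate the idea, such a polling loop could look roughly like the sketch below. This is not GM code: raster_query_fn is a hypothetical wrapper (on Direct3D 9 it would presumably call IDirect3DDevice9::GetRasterStatus and copy out InVBlank and ScanLine), and sim_query is a fake raster so the loop can be exercised without hardware.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical abstraction over the raster query (not a real DirectX type). */
struct raster_status { bool in_vblank; unsigned scanline; };
typedef bool (*raster_query_fn)(void *ctx, struct raster_status *out);

/* Spin until the START of a vblank: first leave any vblank already in
   progress, then wait for the next active -> vblank transition.
   Returns the number of polls used, or -1 on query failure or timeout. */
long wait_for_vblank_start(raster_query_fn query, void *ctx, long max_polls)
{
    struct raster_status rs;
    long polls = 0;
    bool seen_active = false;  /* true once visible scanout was observed */

    assert(query != NULL);
    while (polls < max_polls) {
        if (!query(ctx, &rs))
            return -1;         /* query failed */
        polls++;
        if (!rs.in_vblank)
            seen_active = true;
        else if (seen_active)
            return polls;      /* just entered vblank */
    }
    return -1;                 /* gave up after max_polls */
}

/* Stand-in for real hardware: advances one scanline per poll. */
struct sim_raster { unsigned line, vactive, vtotal; };

bool sim_query(void *ctx, struct raster_status *out)
{
    struct sim_raster *s = ctx;
    out->scanline = s->line;
    out->in_vblank = s->line >= s->vactive;
    s->line = (s->line + 1) % s->vtotal;
    return true;
}
```

Waiting for the transition rather than just "in vblank" avoids flipping at the tail end of a blanking period that is already nearly over. A real loop would also want to yield or sleep for most of the frame instead of spinning the whole time.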


Dr.Venom

Quote
Yeah, and that's a paradox because the whole idea behind triplebuffering is to reduce the input lag associated with double buffering to the minimum while preventing tearing.

True, and when it works as expected it would probably still lower the latency.

But that said, personally I'm not a big fan of using either double or triplebuffering technologies when it comes to emulation. To me all buffering technologies have been invented to facilitate "modern day" computing. E.g.
- Watching a movie on a computer where background programs can interrupt the flow --> buffering comes to the rescue
- Playing a game with gfx settings that are too demanding for the hardware --> buffering comes to the rescue
- etc...

Don't get me wrong, these are all very much valid and useful applications of buffering. The problem is that the buffering model and the way 80's and 90's arcade and home consoles work are too far apart. For the sake of science, as a generalization the old hardware is simply a "no buffer" design, video ram is prepped during vertical blank and then displayed, sound runs directly as it is generated each cycle and the state of input (joystick/keyboard) is available to the system in realtime. (At least that is how I understand it works, correct me if I'm wrong..)

To come anywhere near this design with emulation you have to have video, sound and input polling running within one frame. This can be done *only* when using a single buffer design that flips within the same frame, has a soundbuffer of less than one frame, and polls input as often as possible. Looking at the emulation of a single frame (suppose 50hz refresh), in 20 milliseconds of real world time, it would need to do:

Code: [Select]
0ms ---------------------------------------------------17ms-- vertical blank -- 20ms
|| emulate frame (render to buffer) --> wait for sync ----> flip in vblank ----- ||
||<--------------------------- poll input continuously --------------------------->

I guess we can call this the "holy grail" for emulation. To me this should be possible to achieve given enough computing power on the users end, and a good software implementation of the emulation.
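To make the timing in that diagram concrete, here is a small sketch in C that splits one frame into its visible part and its vblank window. It assumes every scanline takes the same time and treats all blanking lines as one block at the end of the frame; the struct and function names are invented for this example.

```c
#include <assert.h>

struct frame_budget {
    double period_ms;        /* full frame time */
    double vblank_start_ms;  /* time from first visible line to start of vblank */
    double vblank_ms;        /* length of the vblank window */
};

/* Split one frame into visible scanout and vblank, by line count. */
struct frame_budget frame_budget_for(double refresh_hz, int vactive, int vtotal)
{
    struct frame_budget b;
    assert(refresh_hz > 0.0 && vactive > 0 && vtotal >= vactive);
    b.period_ms = 1000.0 / refresh_hz;
    b.vblank_start_ms = b.period_ms * vactive / vtotal;
    b.vblank_ms = b.period_ms - b.vblank_start_ms;
    return b;
}
```

frame_budget_for(50.0, 17, 20) reproduces the 20 ms frame with vblank starting at 17 ms from the diagram. For the 263-line Genesis modelines at roughly 60 Hz, the same arithmetic gives a vblank window of about 2.5 ms for the 224-line mode and about 1.5 ms for the 240-line mode, which is the slot the flip has to hit.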

What's also becoming clear from this model is that either form of double or triple buffering will simply break the "holy grail", as it will always cause - at least - the problem of one frame of additional "input lag" (actually the video is delayed, but it is perceived as input lag)

Given this, it seems there's probably room for two (configurable) emulation/display update methods:

1) The "holy grail" (users hardware is powerful enough to render full frame rate and has a matching display refresh)
2) triple buffering  (users hardware is not powerful enough to render full frame rate and/or doesn't have a matching display refresh)

Given what you said earlier (quote below), I guess we're on the same page regarding this already :)

Quote
If it wasn't clear: of course *normally* you don't need triplebuffer, -syncrefresh (vsync) is enough. We only need triplebuffer when video card and game speed are too different and we can't synchronize without affecting speed but we still want to get rid of tearing.

Quote
Additionally, there's another source of lag in main line MAME that gets exposed when using -triplebuffer, especially with -multithreading, when the video card's refresh and the game refresh are different enough, that's truly dramatic. This happens because the input is received through the window thread but this one is locked during a flip operation (for the reason explained above). As the main emulation thread runs freely this often results in several consecutive frames being virtually deaf to the input messages.

Many 60 Hz vertical games are often forced to run rotated on horizontal monitors at frequencies of 50-53Hz or so in order to allow 256-288 lines in 15 kHz, this is the perfect test case, and I wonder if most horror tales about -triplebuffer don't come from this fact. (This is also fixed by GM.)

Thanks for explaining. Years ago, before I got into the whole Soft15Khz/CRT/modeline tweaking I had a LCD monitor at fixed refresh and had the described dramatic experience too many times with MAME (whatever config I tried), which made me abandon it all together for many years. Luckily I got back into it now with GroovyMAME :)

I'm not sure how it works, but your comment might also explain a quote from the official MAME documentation re triplebuffer, that I still don't understand fully. It's found in the newvideo.txt in the docs folder (http://mamedev.org/source/docs/newvideo.txt.html) under the description for the "Category 1" user:

Quote
To avoid tearing artifacts, I recommend using the -triplebuffer option as well. Just make sure your monitor's refresh rate is higher than the game you are running.

The only thing I can think of is that running at a lower monitor refresh will make MAME render and drop frames (to adjust to the lower speed), which is more of a bad thing than just skipping ahead (having the "benefit" of not rendering the frame)?

Quote
Well that is a very easy patch to implement if you have the time to compile and test. This change in ddraw.c will revert -triplebuffer behaviour to classic double buffer:

I tried compiling it, but unfortunately I get an error at the end, which seems to be because my MinGW installation is already updated for the new compile chain (I've been compiling the mainline 0147 versions successfully).

Quote
Actually what I had in mind for the future(?) would be to get rid of flipping altogether, in order to manually implement a -triplebuffering model that actually worked as the theory says, by bypassing the whole DirectX black box but for the waitvsync function, but unfortunately today I learned that can't be trusted exclusively :)

:)

Quote
Another option is to create a manual loop to poll the vsync status as you say, that's a good possibility now that we already have a separate thread for that.

It would at least be worth exploring I guess. One of the advantages is that you keep full control of "the box" and at any time you know what's going on (the function returns the rough scanline number when it's not in vblank). So you could do nice things like still flipping a frame if it only missed vblank by a small fraction.  You would also be able to quickly gauge the real refresh of the video card, which opens a door to sync sound in line with the refresh rate.
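A rough sketch of those three ideas in C, purely as illustration: all names are hypothetical, the timestamps are assumed to come from whatever high-resolution timer the polling thread uses, and the audio formula assumes the emulation is locked to the display's vblank rather than to the game's nominal refresh.

```c
#include <assert.h>
#include <stdbool.h>

/* Average refresh rate from n timestamps (in seconds) taken at
   consecutive vblank starts. */
double estimate_refresh_hz(const double *vblank_times, int n)
{
    assert(n >= 2 && vblank_times[n - 1] > vblank_times[0]);
    return (n - 1) / (vblank_times[n - 1] - vblank_times[0]);
}

/* Flip inside vblank, or when the raster has only just re-entered the
   visible area (at most tolerance_lines lines in). */
bool should_flip_now(bool in_vblank, unsigned scanline, unsigned tolerance_lines)
{
    return in_vblank || scanline <= tolerance_lines;
}

/* Audio playback rate that stays in step when frames are paced by the
   measured display refresh instead of the game's nominal refresh. */
double synced_sample_rate(double nominal_rate_hz, double game_hz, double measured_hz)
{
    assert(game_hz > 0.0);
    return nominal_rate_hz * measured_hz / game_hz;
}
```

The late flip trades a possible tear in the top few lines for not dropping a whole frame, so the tolerance would have to be small and probably configurable. As an example of the audio part, a nominally 60 Hz game on a 59.98 Hz mode would play 48000 Hz audio at about 47984 Hz.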

I would expect it to take some time and testing to get implemented correctly though. If you'd choose to experiment with it and at any time you'd want me to do some testing just let me know.
« Last Edit: November 09, 2012, 11:07:25 am by Dr.Venom »

Calamity

Thank you for your elaborate answer, it's a pleasure to discuss with you.

We're basically talking about the same thing here. It's just a matter of terms and I think I can help clarify. This scheme of yours:

Code: [Select]
0ms ---------------------------------------------------17ms-- vertical blank -- 20ms
|| emulate frame (render to buffer) --> wait for sync ----> flip in vblank ----- ||
||<--------------------------- poll input continuously --------------------------->

I guess we can call this the "holy grail" for emulation. To me this should be possible to achieve given enough computing power on the users end, and a good software implementation of the emulation.

Well, this is what we know as double buffering. This is actually what you'd get by compiling the suggested patch. You have two buffers:

- buffer #1: the visible VRAM being transferred to the screen*
- buffer #2: the back buffer where you render to.

Quote
What's also becoming clear from this model is that either form of double or triple buffering will simply break the "holy grail", as it will always cause - at least - the problem of one frame of additional "input lag" (actually the video is delayed, but it is perceived as input lag)

Indeed, but it's not the fact of having 2 or more buffers that adds a frame of lag as one would think, it's the very concept of "frame-based" emulation that causes this. The reason behind this is that transferring the contents of the VRAM to the screen (*) is actually a process that consumes time too (17 ms), as the raster travels through the screen, so once you "flip in vblank" you need to wait some time to see the whole frame displayed, but in the meantime there's a new frame being cooked that won't contain your reactions to what's happening on the screen.

By using the option -syncrefresh in MAME you get a slightly different implementation of double buffering: instead of "flipping" (a low-level change of the visible VRAM offset, with no memory transfer involved), we do a plain copy of our back buffer into the visible VRAM ("blitting"), taking care to do it during VBLANK. Obviously this approach consumes more resources, but I tend to prefer it to the flipping black box.
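The difference between the two can be shown with a toy model. This is only a sketch in Python with plain lists standing in for video memory; the class and attribute names are made up for illustration and have nothing to do with MAME's or DirectDraw's actual code.

```python
class ToyDisplay:
    """Two buffers: 'front' is what gets scanned out, 'back' is where we render."""

    def __init__(self, size):
        self.front = [0] * size
        self.back = [0] * size

    def flip(self):
        # Flipping: only the roles of the buffers are exchanged, no pixel data moves.
        self.front, self.back = self.back, self.front

    def blit(self):
        # Blitting: the back buffer contents are copied into the visible buffer.
        self.front[:] = self.back


if __name__ == "__main__":
    d = ToyDisplay(4)
    d.back[:] = [1, 2, 3, 4]
    d.blit()
    print(d.front)  # same front buffer object, new contents
    d.back[0] = 9
    d.flip()
    print(d.front)  # the old back buffer is now the visible one
```

The copy costs time proportional to the frame size, which is the "more resources" part, but it happens exactly when we ask for it.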

But even if we used a single buffer, which is certainly possible on a fast present-day computer, so that we would render everything directly into the visible VRAM during the VBLANK time without previous buffering, we would be running into the same 1-frame-of-lag issue, as long as our emulator design is frame-based.

On a different plane of things, we have to consider how the input is polled. In an event driven OS like Windows we don't poll input continuously. The system will send us a message when some new input happens, these messages will get buffered and we usually read them once per frame. Now, this model should be good enough, leaving aside the built-in system input lag, which in theory should be possible to reduce to a minimum as hardware improves.
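As a rough illustration of that model, here is a small Python sketch of an event buffer that the system side fills as events arrive and the emulator drains once per frame. The class and method names are hypothetical; this is not how MAME or Windows actually name things.

```python
from collections import deque


class InputBuffer:
    """Events are appended as the OS reports them and read in one go per frame."""

    def __init__(self):
        self._events = deque()

    def post(self, timestamp_ms, key, pressed):
        # Called whenever the system notifies us of an input change.
        self._events.append((timestamp_ms, key, pressed))

    def drain(self):
        # Called once per emulated frame: returns everything buffered so far.
        events = list(self._events)
        self._events.clear()
        return events


if __name__ == "__main__":
    buf = InputBuffer()
    buf.post(3.0, "BUTTON1", True)
    buf.post(9.5, "BUTTON1", False)
    print(buf.drain())  # both events are seen together at the frame boundary
    print(buf.drain())  # nothing new until further events arrive
```

Whatever happens between two drains is only seen at the next one, which is why the moment of the read within the frame matters.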

But due to the design of MAME, when vsync is enabled we can get some extra lag as the input remains locked during the wait for vsync, which is represented in the following scheme, as compared to the GM case where this problem is solved:

Code: [Select]
Vanilla MAME + vsync:

0ms --------------------------------------------------------15.4ms --- vertical blank -- 16.7ms
||...emulate frame (render to buffer) --> wait for sync ----> blit --> emulate next---...... ||
||<---------- input enabled ----------> <----- input locked ---------> <--- input enabled ---->

GroovyMAME + vsync + multithreading:

0ms --------------------------------------------------------15.4ms --- vertical blank -- 16.7ms
||...emulate frame (render to buffer) --> wait for sync ----> blit --> emulate next---...... ||
||<---------------------------------- input enabled ------------------------------------------>

Notice that the scale is not correct and in a normal situation the wait for vsync will take most of the frame time, especially on a fast computer.

So this is the point where emulator writers tell you that these are the limits of emulation. But I do believe that the "holy grail" of emulation is actually feasible in practice, understanding it as a piece of software that works as an *exact* substitute for the emulated hardware, in terms of response. It's only that, IMHO, the frame-based concept would need to be replaced by a scanline-based model, where only the next scanline is buffered and we use hsync instead of vsync for synchronizing.

Considering that emulator writers use flat panels, such an emulator is not likely going to see the light :)

Quote
Thanks for explaining. Years ago, before I got into the whole Soft15Khz/CRT/modeline tweaking I had a LCD monitor at fixed refresh and had the described dramatic experience too many times with MAME (whatever config I tried), which made me abandon it all together for many years. Luckily I got back into it now with GroovyMAME :)

It's good to hear that.

Quote
I'm not sure how it works, but your comment might also explain a quote from the official MAME documentation re triplebuffer, that I still don't understand fully. It's found in the newvideo.txt in the docs folder (http://mamedev.org/source/docs/newvideo.txt.html) under the description for the "Category 1" user:

Quote
To avoid tearing artifacts, I recommend using the -triplebuffer option as well. Just make sure your monitor's refresh rate is higher than the game you are running.

The only thing I can think of is that running at a lower monitor refresh will make MAME render and drop frames (to adjust to the lower speed), which is more of a bad thing then just skipping ahead (having the "benefit" of not rendering the frame)?

:)

Yeah, that's a good point.

The word "triple" in -triplebuffer is misleading as it suggests an additional degree of buffering when that's not the concept. It took me some time to visualize this. But we must see triple buffering just as an asynchronous version of double buffering.

The double buffering model anchors the game loop to the refresh rate of the video card. I believe that PC game developers wanted to free themselves from the tyranny of refresh rates so they invented triple buffering. We can visualize it as two separate loops running in parallel, the game loop and the flip loop. So the game loop can run at any absurd speed, sending new frames to the flip loop, which will obviously need to drop some of them depending on the video card's refresh but in theory will always draw the most recent one at the time the VBLANK happens.
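A small simulation may help to visualize the two loops. This is just a sketch in Python of the *theoretical* triple buffering model described above, assuming an idealized game loop that finishes frames at a constant rate; the function name and parameters are made up for the example.

```python
from fractions import Fraction
from math import floor


def shown_frames(game_hz, refresh_hz, vblanks):
    """For each vblank, return the index of the most recent frame the game loop
    has completed by then (None if no frame is ready yet).

    Frame i is assumed to be completed at time (i + 1) / game_hz.
    """
    shown = []
    for k in range(1, vblanks + 1):
        t = Fraction(k, refresh_hz)      # time of the k-th vblank, in seconds
        completed = floor(t * game_hz)   # frames finished by that time
        shown.append(completed - 1 if completed else None)
    return shown


if __name__ == "__main__":
    print(shown_frames(60, 60, 4))   # game loop matches the refresh
    print(shown_frames(120, 60, 4))  # game loop twice as fast: every other frame dropped
    print(shown_frames(90, 60, 4))   # uneven ratio: irregular drops
```

With a 120 Hz game loop on a 60 Hz display the flip loop shows frames 1, 3, 5, 7, so the game loop is never held back by the display.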

Now, as MAME is designed to use the CPU clock for accurate timing of emulation, it needs to be decoupled from the screen refresh, but this leads to horrible tearing. So someone thought it would be a good idea to use the triple buffering model, and actually it would be, if it wasn't for the fact that DX's flip functions don't work as advertised, i.e. creating a second back buffer doesn't result in asynchronous flipping (notice I mean asynchronous to the game loop, the flip is always synced to the vertical retrace).

This results in MAME's -triplebuffer option anchoring the game loop to the video card's refresh when this is lower than the desired speed, so the benefits of triple buffering don't apply here and we have only a sophisticated version of double buffering.
« Last Edit: November 09, 2012, 01:58:43 pm by Calamity »
Important note: posts reporting GM issues without a log will be IGNORED.
Steps to create a log:
 - From command line, run: groovymame.exe -v romname >romname.txt
 - Attach resulting romname.txt file to your post, instead of pasting it.

CRT Emudriver, VMMaker & Arcade OSD downloads, documentation and discussion:  Eiusdemmodi

Dr.Venom

Quote
Thank you for your elaborate answer, it's a pleasure to discuss with you.

Likewise :)

Quote
Well, this is what we know as double buffering. This is actually what you'd get by compiling the suggested patch. You have two buffers:

- buffer #1: the visible VRAM being transferred to the screen*
- buffer #2: the back buffer where you render to.

Quote
What's also becoming clear from this model is that either form of double or triple buffering will simply break the "holy grail", as it will always cause - at least - the problem of one frame of additional "input lag" (actually the video is delayed, but it is perceived as input lag)

Indeed, but it's not the fact of having 2 or more buffers that adds a frame of lag as one would think, it's the very concept of "frame-based" emulation that causes this. The reason behind this is that transferring the contents of the VRAM to the screen (*) is actually a process that consumes time too (17 ms), as the raster travels through the screen, so once you "flip in vblank" you need to wait some time to see the whole frame displayed, but in the meantime there's a new frame being cooked that won't contain your reactions to what's happening on the screen.

Thanks for explaining, I see your point. Given your explanation, aren't we actually talking about a -two- frame delay with this method? So because frame emulation is not directly attached to the copy/flip in vblank (but done way before that) and will only take a fraction of the frame (the rest is mostly waiting for vblank) you get approximately one frame of delay; add to this the one frame of delay because of the whole concept of "frame based" emulation, and don't we end up with two frames of delay?

So to visualize (example case where there's an input event mid frame):

Code: [Select]
0ms ----------------display------------15.4ms -vblank- 16.7ms||0ms ----------------display----------15.4ms -vblank- 16.7ms||0ms ----------------display----------15.4ms -vblank- 16.7ms||
||...emulate | -------> wait sync -----> blit ----> emulate..||...emulate | -------> wait sync -----> blit ----> emulate..||...emulate | -------> wait sync -----> blit ----> emulate..||
||<---------------  (x) user input ------------------------->||<-----------------  (x) not shown ------------------------>||<-----------------  (x) = shown! ------------------------> ||

Quote
But even if we used a single buffer, which is certainly possible for a fast nowadays' computer, so we would directly render everything into the visible VRAM during the VBLANK time without previous buffering, we would be running in the same 1-frame-of-lag issue(*), as long as our emulator design is frame-based.

(*)I guess this is not entirely the same 1-frame-of-lag issue as the double buffer (there are two there), but isn't this exclusively the minimum delay (1-frame) that is achievable when dealing with frame based emulation?

So to visualize the input delay in the case of "emulate+blit in vblank" :
Code: [Select]
0ms ----------------display------------15.4ms -vblank- 16.7ms||0ms ----------------display------------15.4ms -vblank- 16.7ms||
||--------------------------------------| emulate + blit.....||----------------------------------------| emulate + blit.....||
||<---------------  (x) user input ------------------------->||<----------------- (x) = shown! --------------------------->||

Isn't this then the one and only "holy grail" when talking about frame based emulation? If so, would it be an idea to add this as, say, an "accurate" = yes/no option to groovymame? That would be wonderful! :) I guess it would probably need to be a separate option, because the emulation + blit/copy probably can't be done within the vertical blank for all MAME/MESS systems? But imagine if it would work for the "old boys" like Genesis/SNES/MSX2/Colecovision/C64/etc etc.. :)
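To put numbers on the two schemes, here is a rough Python sketch that computes when the frame containing a given input starts being scanned out. It uses the 16.7 ms frame and 15.4 ms vblank start from the diagrams, and it assumes (simplifying) that input is sampled exactly at the start of the frame in the "emulate early" case and exactly at the start of vblank in the "emulate + blit in vblank" case. The function names are just for this example.

```python
import math

FRAME_MS = 16.7         # frame period used in the diagrams
VBLANK_START_MS = 15.4  # offset of vblank within each frame


def scanout_start_emulate_early(input_ms, frame_ms=FRAME_MS):
    """Emulation runs right at the start of each frame, the blit waits for vblank.
    The input is picked up at the next frame start and shown one frame after that."""
    sampled_frame = math.ceil(input_ms / frame_ms)
    return (sampled_frame + 1) * frame_ms


def scanout_start_emulate_in_vblank(input_ms, frame_ms=FRAME_MS,
                                    vblank_start_ms=VBLANK_START_MS):
    """Emulation and blit both happen inside vblank.
    The input is picked up at the next vblank and shown in the frame that follows."""
    frame = max(0, math.ceil((input_ms - vblank_start_ms) / frame_ms))
    return (frame + 1) * frame_ms


if __name__ == "__main__":
    t = 8.0  # input event mid-frame, as marked with (x) in the diagrams
    print(scanout_start_emulate_early(t) - t)      # roughly a frame and a half
    print(scanout_start_emulate_in_vblank(t) - t)  # roughly half a frame
```

For an input at 8 ms the first scheme starts showing the result at 33.4 ms and the second at 16.7 ms, i.e. one full frame earlier, which matches the two diagrams.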

Quote
On a different plane of things, we have to consider how the input is polled. In an event driven OS like Windows we don't poll input continuously. The system will send us a message when some new input happens, these messages will get buffered and we usually read them once per frame. Now, this model should be good enough, leaving apart the built-in system input lag that in theory should be possible to get reduced to a minimum as hardware improves.

Thanks, that makes it clear. There is however one question that I'm wondering about. I read all of these stories about the HID USB polling rate in Windows being 125 Hz, i.e. polling about every 8 ms, and people trying to overclock the USB ports (through USBPORT/HIDUSB patches) to 250/500/1000 Hz. I'm not sure what to believe of all this, and whether it's true for all versions of Windows. But if there's some truth to it, then it would actually mean that input changes are only signalled about 2 times a frame? If so, it would also raise the question of whether something could be done about it from a software developer's point of view?
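For reference, the arithmetic behind those figures, as a tiny Python sketch. The 125 Hz default is the figure from the stories mentioned above, not something verified here, and 60 Hz is assumed for the game.

```python
def poll_interval_ms(poll_hz):
    """Time between two USB polls, in milliseconds."""
    return 1000.0 / poll_hz


def polls_per_frame(poll_hz, refresh_hz=60.0):
    """How many times the device gets polled during one video frame."""
    return poll_hz / refresh_hz


if __name__ == "__main__":
    for rate in (125, 250, 500, 1000):
        print(rate, poll_interval_ms(rate), round(polls_per_frame(rate), 2))
```

At 125 Hz that gives 8 ms between polls and about 2.08 polls per 60 Hz frame; at 1000 Hz it is 1 ms and about 16.67 polls per frame.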

Quote
But due to the design of MAME, when vsync is enabled we can get some extra lag as the input remains locked during the wait for vsync, which is represented in the following scheme, as compared the GM case where this problem is solved:

Code: [Select]
Vanilla MAME + vsync:

0ms --------------------------------------------------------15.4ms --- vertical blank -- 16.7ms
||...emulate frame (render to buffer) --> wait for sync ----> blit --> emulate next---...... ||
||<---------- input enabled ----------> <----- input locked ---------> <--- input enabled ---->

GroovyMAME + vsync + multithreading:

0ms --------------------------------------------------------15.4ms --- vertical blank -- 16.7ms
||...emulate frame (render to buffer) --> wait for sync ----> blit --> emulate next---...... ||
||<---------------------------------- input enabled ------------------------------------------>

Very cool :)

Quote
So now it's when emulator writers tell you that these are the limits of emulation. But I do believe that the "holy grail" of emulation is actually feasible in practice, understanding it as a piece of software that works as an *exact* substitution of the emulated hardware, in terms of response. It's only that, IMHO, the frame based concept would need to be replaced by a scanline based model, where only the next scanline is buffered and we use hsync instead of vsync for synchronizing.

Considering that emulator writers use flat panels, such an emulator is not likely going to see the light :)

I wholeheartedly agree, this would indeed be perfection.  Re the emulator writers, I guess we need to donate them some CRT "3D" panels :)

Quote
Now as MAME is designed to use the CPU clock for accurate timing of emulation it needs to be decoupled from the screen refresh but this leads to horrible tearing, so someone thought it would be a good idea to use the triple buffering model, and actually it is, if it wasn't for the fact that the DX's flip functions don't work as advertised, i.e. creating a second back buffer doesn't result in asynchronous flipping (notice I mean asynchronous to the game loop, the flip is always synced to the vertical retrace).

This results in MAME's -triplebuffer option anchoring the game loop to the video card's refresh when this is lower than the desired speed, so the benefits of triple buffering don't apply here and we have only a sophisticated version of double buffering.

Thanks for explaining, and providing some more insight into how these things actually work.

Lastly, with regard to potential causes of display lag, you may or may not be familiar with these, but since we're on the topic I thought I'd just flag them.

First, about the display model that is used in Windows Vista and 7. The Desktop Window Manager (DWM), on which the Aero interface is built, has its own vertical synchronization routines, which are known to possibly interfere with emulator vertical sync routines. I've encountered this with bsnes and some other emulators, where the (smooth) scrolling would show a hiccup every now and then. After disabling the desktop composition for such a program it ran flawlessly. So just in case you encounter weird things when testing GM stuff on W7..

The second issue may have an even greater effect. It's about the video drivers in Windows (from NVidia/AMD) that sort of seem to have a will of their own when it comes to buffering. The culprit in question is the so-called "flip queue size" (ATI/AMD) or "Maximum Pre-rendered Frames" (Nvidia) variable in these drivers, which normally defaults to three. In my experience this value can be a cause of serious additional lag, especially when the emulator is intending to use the minimal amount of buffering.

The flip queue size for ATI/AMD cannot be configured through the Catalyst Control Center (why oh why?). But luckily a solution was written in the form of the RadeonPro tool (http://www.radeonpro.info/en-US/), where you can change the flip queue size per application to any value between 5 and 0. I'm not an NVidia user, but apparently the Maximum Pre-rendered Frames can be set through the video control panel. I read that in the newer drivers the setting of 0 has been removed, and the lowest is 1. In my experience using a setting of 0 (versus the default of 3) can make a world of difference on most of the emulators.

I originally got triggered on this subject by the PC-Engine "Ootake" emulator author, the original topic can be found at the Ootake page here: http://www.ouma.jp/ootake/delay-win7vista.html. It's about lowering causes of input delay in Vista and 7, but halfway down the page it also mentions that the flip queue size settings have an effect in WindowsXP too (given modern enough PC).
« Last Edit: November 10, 2012, 09:36:50 am by Dr.Venom »

Calamity

Quote
Thanks for explaining, I see your point. Given your explanation, aren't we actually talking about a -two- frame delay with this method? So because frame emulation is not directly attached to the copy/flip in vblank (but done way before that) and will only take a fraction of the frame (rest is mostly waiting for vblank) you get approximately one frame delay; add to this the one frame of delay because of the whole concept of "frame based" emulation, don't we end up with two frames of delay?

Indeed. I meant that this model only represents 1 additional frame of delay with respect to the original hardware, which probably already worked with 1 frame of delay in most situations, as long as it was designed to poll input once per frame during vblank.

Quote
So to visualize (example case where there's an input event mid frame):

Code: [Select]
0ms ----------------display------------15.4ms -vblank- 16.7ms||0ms ----------------display----------15.4ms -vblank- 16.7ms||0ms ----------------display----------15.4ms -vblank- 16.7ms||
||...emulate | -------> wait sync -----> blit ----> emulate..||...emulate | -------> wait sync -----> blit ----> emulate..||...emulate | -------> wait sync -----> blit ----> emulate..||
||<---------------  (x) user input ------------------------->||<-----------------  (x) not shown ------------------------>||<-----------------  (x) = shown! ------------------------> ||

Yes, this is exactly what's going on, provided the OS is fast enough to notify us of the input within the current frame time.

Quote
(*)I guess this is not entirely the same 1-frame-of-lag issue as the double buffer (there are two there), but isn't this exclusively the minimum delay (1-frame) that is achievable when dealing with frame based emulation?

Exactly, but what I had in mind when I wrote "single buffering" is not what you drew on your scheme below. By "fast computer" I meant fast enough for *compositing* the frame directly in vram during vblank, but I was not considering the whole emulation of it. Just wanted to prove that we have the same concept here even if no intermediate buffer exists.

But of course this:

Quote
So to visualize the input delay in the case of "emulate+blit in vblank" :
Code: [Select]
0ms ----------------display------------15.4ms -vblank- 16.7ms||0ms ----------------display------------15.4ms -vblank- 16.7ms||
||--------------------------------------| emulate + blit.....||----------------------------------------| emulate + blit.....||
||<---------------  (x) user input ------------------------->||<----------------- (x) = shown! --------------------------->||

... is a completely different animal, and I agree with you it would be *nearly* the holy grail of emulation. This is probably the best we can get on a Windows-like OS as hardware gets faster. And probably could be considered perfect emulation for many systems.

However, many old systems were capable of running code during hblank, this was often used for changing video settings in order to create interesting effects, but probably some games could have also polled inputs during this period. We would need to check case by case and I don't know the details, but it's obvious that with the above scheme we would be missing this sub-frame precision (if that matters, that's another story).

I believe this scheme is not too difficult to achieve, though it would need some non-trivial reorganization of MAME rendering. However, I guess the CPU requirements would be very high:

16.67 / 1.33 = 12.50 x 100 = 1250%

... so in order to have fluent emulation of a 60 Hz game you'd need MAME to be able to emulate it at 1250% speed at least.
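Spelled out: the whole frame has to be emulated within the vblank, which lasts about 1.33 ms out of the 16.67 ms frame period, so the emulation must run as many times faster than real time as the frame is longer than the vblank. A short Python sketch of that ratio follows; the function name is only for illustration, and the 1.33 ms vblank is the figure used in the calculation above.

```python
def required_speed_percent(frame_ms, vblank_ms):
    """Unthrottled emulation speed needed to fit one whole frame inside the vblank."""
    return frame_ms / vblank_ms * 100.0


if __name__ == "__main__":
    # About 1253%, rounded to 1250% in the text above.
    print(required_speed_percent(16.67, 1.33))
```

In other words, a game that MAME can run unthrottled at roughly 12.5 times its normal speed would be a candidate for this scheme, leaving aside the time needed for the blit itself.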

For truly perfect emulation we would need to emulate and render line by line, synchronizing to hblank instead of vblank. This is to avoid the need of pre-rendering a whole frame and allow us to read input at any point in the frame. This is feasible as video hblank triggers interrupts much like vblank, but unfortunately under Windows we don't have reliable access to this information, as far as I know. It's possible to read the current scanline so something could be done, but I doubt it would be accurate enough. The pros: as we'd be spreading the emulation time over the whole frame time, a very modest PC would do. The cons: the emulators would possibly need a very complete rewrite. With your idea, on the other hand, the same basic emulation code will serve.

Quote
Thanks, makes it clear. There is however one question that I'm wondering about. I read all of these stories about the HID USB polling rate in windows being 125hz, i.e. polling about every 8ms, and people trying to overclock the USB ports (through USBPORT/HIDUSB patches) to 250/500/1000hz. I'm not sure what to believe of this all, and whether it's true for all versions of windows. But if there's some truth to it, then it would actually mean that input changes are only signalled 2 times a frame? If so, it would also raise the question whether or not something could be done about it from a software developers point of view?

I'm sorry I don't have much information about this. I've read about this too here and there, but have never got into tweaking USB inputs. As far as I understand it, if we could poll the hardware *directly* once per frame in sync with vblank it should be enough, but I guess that even if we use DirectInput to poll the keyboard state this matrix will only get updated at the usb polling rate which is independent from us, so yes, in theory increasing the polling rate will improve our chances that the information returned by DirectInput is up-to-date when we read it.

Quote
First, about the display model that is used in Windows Vista and 7. The Desktop Window Manager (DWM), on which the Aero interface is built, has its own vertical synchronization routines, which are known to possibly interfere with emulator vertical sync routines. I've encountered this with bsnes and some other emulators, where the (smooth) scrolling would show a hiccup every now and then. After disabling the desktop composition for such a program it ran flawlessly. So just in case you encounter weird things when testing GM stuff on W7..

Yeah I had read about the Aero thing. Well, actually this buffering *should* be disabled while in full screen mode; if that's not the case then W7 should definitely not be an option for emulation. Anyway, BSNES does not work in full screen mode, it just runs at your desktop resolution if I remember right, so that could be the reason.

Quote
Second issue may have an even greater effect. It's about the video drivers in Windows (from NVidia/AMD) that sort of seem to have a will of their own when it comes to buffering. The culprit in question is the so called "flip queue size" (ATI/AMD) or "Maximum Pre-rendered Frames" (Nvidia) variable in these drivers, which normally defaults to three. To my experience this value can be a cause for serious additional lag, especially when the emulator is intending to use the minimal amount of buffering.

Well this is something new to me, and it sounds like it could be a possible reason why triplebuffering, which uses flipping, has such a bad reputation. I doubt I've experienced that with XP + Catalyst but will definitely investigate it.

Quote
I originally got triggered on this subject by the PC-Engine "Ootake" emulator author, the original topic can be found at the Ootake page here: http://www.ouma.jp/ootake/delay-win7vista.html. It's about lowering causes of input delay in Vista and 7, but halfway down the page it also mentions that the flip queue size settings have an effect in WindowsXP too (given modern enough PC).

Yeah I had read that article where the author explains the mechanism he uses for reducing input lag in his emulator, very inspiring! A good friend pointed it out to me long ago.
« Last Edit: November 10, 2012, 11:52:59 am by Calamity »

Dr.Venom

Quote
But of course this:

Quote
So to visualize the input delay in the case of "emulate+blit in vblank" :
Code: [Select]
0ms ----------------display------------15.4ms -vblank- 16.7ms||0ms ----------------display------------15.4ms -vblank- 16.7ms||
||--------------------------------------| emulate + blit.....||----------------------------------------| emulate + blit.....||
||<---------------  (x) user input ------------------------->||<----------------- (x) = shown! --------------------------->||

... is a completely different animal, and I agree with you it would be *nearly* the holy grail of emulation. This is probably the best we can get on a Windows-like OS as hardware gets faster. And probably could be considered perfect emulation for many systems.

That is great, and holds some promise at least for the future of frame based emulation.

Quote
However, many old systems were capable of running code during hblank, this was often used for changing video settings in order to create interesting effects, but probably some games could have also polled inputs during this period. We would need to check case by case and I don't know the details, but it's obvious that with the above scheme we would be missing this sub-frame precision (if that matters, that's another story).

I believe this scheme is not too difficult to achieve, though it would need some non-trivial reorganization of MAME rendering. However, I guess the CPU requirements would be very high:

16.67 / 1.33 = 12.50 x 100 = 1250%

... so in order to have fluent emulation of a 60 Hz game you'd need that MAME could emulate it at least at 1250%.

I don't fully understand your calculation, would be great if you could elaborate a bit on this..

Quote
For truly perfect emulation we would need to emulate and render line by line, synchronizing to hblank instead of vblank. This is to avoid the need of pre-rendering a whole frame and allow us to read input at any point in the frame. This is feasible as video hblank triggers interrupts much like vblank, but unfortunately under Windows we don't have reliable access to this information, as far as I know. It's possible to read the current scanline so something could be done, but I doubt it would be accurate enough. The pros are that as we'd be spreading the emulation time during the whole frame time, a very modest PC could do. The contras: the emulators would possibly need a very complete rewrite. With your idea, on the other hand, the same basic emulation code will serve.

I definitely agree with your reasoning for using hblank for truly perfect emulation, and the pros and cons you mention of the various methods. While contemplating what you wrote another idea popped into my mind, that sort of combines the things we spoke about. I'm thinking of a method that *could* very much be a close approximation of a line by line sync, while possibly only making "modest" changes to the emulator core.

Biggest question would be if it is currently possible to PAUSE/START the MAME emulation core at will -multiple times- during a frame with only a -modest- change to the code?

If that's possible, then the following model should be possible (in theory for now at least):

  • Chop the frame emulation of the core in N chunks of visible lines (by using "pause/start"), with each N chunk consisting of [1/N * total visible scanlines]
  • spread the N chunks over the realworld frame time by using PAUSE/WAIT/START after emulating each chunk
  • in parallel use D3DRASTER_STATUS to read where the real monitor scanline is approximately, and continually make sure to blit ahead (with some margin) the next chunk

So suppose we're running an NTSC screen with 240 visible lines and 262 total lines, we're chopping the frame in *3* chunks (80 lines each) + vsync, then it would look something like this:

Code: [Select]
real line nr.            -> 240---------------------------------------------262/0--------------------------------------------------80-------------------------------------------------160-------------------------------------------------240(wrap)         
real display (>front buf)->  |-----------------REAL VBLANK---------------------| display of chunk *1* (lines 0-80)---------------->| display of chunk *2* (lines 81-160)--------------->| display of chunk *3* (lines 161-240)------------->||
emu core (>back buf)     ->  | emu chunk *1* (lines 0-80) -> pause+blit+wait-->| emu chunk *2* (lines 81-160) -> pause+blit+wait-->| emu chunk *3* (lines 161-240)-> pause+blit+wait--->| ---------> wrap when next frame------------------>||
D3DRASTER_STATUS         ->  | poll/wait for real display line 0 ------------->| poll/wait for real display line 80--------------->| poll/wait for real display line 160--------------->| poll/wait for real display line 240 (vblank)----->||   
input polling            ->  |-------------------input enabled---------------->| --------------------input enabled---------------->| --------------------input enabled----------------->|--------------------input enabled----------------->||

Advantages:
- The maximum amount of lag versus a real system would be in the order of magnitude of *only* 1/3 of a frame!
- Basic emulation code would serve, -if- it's possible to PAUSE/START the main emulation thread? (Only modest adjustment to code base?)
- Spreads emulation time over the real frame time: a relatively modest PC could do?
- Could be an approximation for sub frame precision?

I guess the above is more or less my final (high-level, I admit) thought on getting to perfection within the frame-based emulation core, given the information that has come forward from our (very nice and useful) discussions. Hopefully the above comes across and could actually make some sense from a real coding perspective. And if so, might it be improved further? (I guess that would be a yes :) )
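To make the timing in the diagram a bit more concrete, here is a minimal Python sketch of the chunked loop. It is only an illustration under stated assumptions: `chunk_plan`, `run_frame` and the `emulate`, `blit` and `read_scanline` callbacks are hypothetical names, `read_scanline` stands in for polling D3DRASTER_STATUS, and nothing here reflects how MAME's core is actually structured. Chunks are written as half-open ranges (0-80, 80-160, 160-240), which is the same split as in the diagram.

```python
from typing import Callable, List, Tuple

Chunk = Tuple[int, int]  # half-open range of emulated lines: [first, end)


def chunk_plan(visible_lines: int = 240, chunks: int = 3) -> List[Chunk]:
    """Split the visible lines into equally sized chunks."""
    if chunks <= 0 or visible_lines % chunks != 0:
        raise ValueError("visible_lines must divide evenly into chunks")
    size = visible_lines // chunks
    return [(i * size, (i + 1) * size) for i in range(chunks)]


def run_frame(
    plan: List[Chunk],
    emulate: Callable[[int, int], None],
    blit: Callable[[int, int], None],
    read_scanline: Callable[[], int],
) -> None:
    """Emulate and blit one frame chunk by chunk, starting inside vblank."""
    visible_end = plan[-1][1]  # first line of vblank (240 in the example)
    for first, end in plan:
        emulate(first, end)  # run the core for this chunk only (hypothetical hook)
        blit(first, end)     # copy the chunk before the beam reaches it
        # Pause until the beam is in the visible area at or past `first`,
        # i.e. it has started drawing the chunk that was just blitted.
        while not (first <= read_scanline() < visible_end):
            pass
    # After the last chunk, wait for the next vblank before wrapping around.
    while read_scanline() < visible_end:
        pass
```

Called once per frame starting at vblank, chunk 1 is blitted during vblank, chunk 2 while the beam draws chunk 1, and chunk 3 while the beam draws chunk 2, so the emulation stays at most one chunk ahead of the display.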

Quote
I'm sorry I don't have much information about this. I've read about this too here and there, but have never got into tweaking USB inputs. As far as I understand it, if we could poll the hardware *directly* once per frame in sync with vblank it should be enough, but I guess that even if we use DirectInput to poll the keyboard state this matrix will only get updated at the usb polling rate which is independent from us, so yes, in theory increasing the polling rate will improve our chances that the information returned by DirectInput is up-to-date when we read it.

In theory it should indeed work, but unfortunately I'm not finding any "hard" evidence on the topic. I wish there were some official Microsoft spec sheets on how these things are actually implemented, so that we would know what the default rates are in the different versions of Windows, and whether/how they apply to different HIDs like mice, keyboards and joypads/sticks. Unfortunately this seems rather hard to come by.

Quote
Yeah I had read about the Aero thing. Well, actually this buffering *should* be disabled while in full screen mode; if that's not the case then W7 should definitely not be an option for emulation. Anyway, BSNES does not work in full screen mode, it just runs at your desktop resolution if I remember right, so that could be the reason.

Yes, you're quite right: bsnes runs in a full-screen window, so that's why it's affected by the DWM (the Desktop Window Manager behind Aero). That indeed shouldn't be the case with real fullscreen applications.

Quote
Quote
The second issue may have an even greater effect. It's about the video drivers in Windows (from NVidia/AMD), which seem to have a will of their own when it comes to buffering. The culprit in question is the so-called "flip queue size" (ATI/AMD) or "Maximum Pre-rendered Frames" (Nvidia) variable in these drivers, which normally defaults to three. In my experience this value can be a cause of serious additional lag, especially when the emulator intends to use the minimal amount of buffering.

Well this is something new to me, and it sounds like it could be a possible reason why triplebuffering, which uses flipping, has such a bad reputation. I doubt I've experienced that with XP + Catalyst but will definitely investigate it.

Ah yes, I forgot to mention that the Radeon Pro utility -only- works with 32-bit applications. It uses some kind of "hook" system, that simply doesn't work for 64 bit applications. It's one of the reasons I'm compiling most of the emulator stuff as 32-bit applications.

Whether or not RP is really applying your settings is shown by the taskbar status icon, like in the image below.



Lastly, regarding compiling GroovyMAME/UME with the double buffer patch you mentioned earlier: I tried compiling the 146 + u releases again, this time with an older MinGW-MAME distribution installed, but it still gives me compilation errors :(. I found your suggestion in another forum post to use Compile MAME 64 v1.22. I've searched everywhere for this util, but I could only find v1.23, which (unfortunately) is already updated for the new toolchain. Do you have any other tips regarding this, or could the v1.22 version be put up somewhere shortly? Otherwise I guess I'll have to wait until the new patch comes out. (Which isn't that big a problem, but I thought I'd just ask.)
« Last Edit: November 11, 2012, 06:40:34 pm by Dr.Venom »

Calamity

  • Moderator
  • Trade Count: (0)
  • Full Member
  • *****
  • Offline Offline
  • Posts: 7473
  • Last login:Today at 02:50:19 pm
  • Quote me with care
Quote
I don't fully understand your calculation; it would be great if you could elaborate a bit on this.

Well, it's just a rough calculation. Based on an average VBLANK duration of around 1.33 ms for a 15 kHz CRT, in order to fit one complete emulation cycle into the VBLANK period you need the emulator to run at least at:

16.67 / 1.33 = 12.50 times faster than the original machine did (x 100 = 1250% as MAME expresses it).

So unless a certain game runs at 1250% unthrottled on your hardware, it won't be possible to emulate a whole frame inside VBLANK.
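The same arithmetic as a small Python sketch, for anyone who wants to try other timings. `required_speedup` is just an illustrative helper name, and the 60 Hz frame time and 1.33 ms window are the rough figures used above, not measured values.

```python
def required_speedup(frame_ms: float, window_ms: float) -> float:
    """How many times faster than real time the emulator must run
    to emulate one full frame inside a window of `window_ms`."""
    if not 0 < window_ms <= frame_ms:
        raise ValueError("window must be positive and no longer than a frame")
    return frame_ms / window_ms


frame_ms = 1000.0 / 60.0                     # about 16.67 ms per frame at 60 Hz
speedup = required_speedup(frame_ms, 1.33)   # about 12.5x for a 1.33 ms VBLANK
print(f"needs roughly {speedup * 100:.0f}% unthrottled speed")
```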

Quote
Biggest question would be if it is currently possible to PAUSE/START the MAME emulation core at will -multiple times- during a frame with only a -modest- change to the code?

That's the point. I'm afraid I have no clue how the actual architecture of MAME works on the emulator side, or even if there's a general way of doing things common to all systems. However, I would be deeply surprised if this PAUSE/START thing could be implemented through modest changes.

Quote
Code: [Select]
real line nr.            -> 240---------------------------------------------262/0--------------------------------------------------80-------------------------------------------------160-------------------------------------------------240(wrap)         
real display (>front buf)->  |-----------------REAL VBLANK---------------------| display of chunk *1* (lines 0-80)---------------->| display of chunk *2* (lines 81-160)--------------->| display of chunk *3* (lines 161-240)------------->||
emu core (>back buf)     ->  | emu chunk *1* (lines 0-80) -> pause+blit+wait-->| emu chunk *2* (lines 81-160) -> pause+blit+wait-->| emu chunk *3* (lines 161-240)-> pause+blit+wait--->| ---------> wrap when next frame------------------>||
D3DRASTER_STATUS         ->  | poll/wait for real display line 0 ------------->| poll/wait for real display line 80--------------->| poll/wait for real display line 160--------------->| poll/wait for real display line 240 (vblank)----->||   
input polling            ->  |-------------------input enabled---------------->| --------------------input enabled---------------->| --------------------input enabled----------------->|--------------------input enabled----------------->||

This would definitely be awesome if it could be achieved. It reminds me of the method implemented by Ootake's author, however I think he didn't link the emulation of the different chunks to the actual scanlines. In any case it's a good thing if it serves to raise interest and awareness about this stuff among emulator writers.

Quote
In theory it should work indeed, but unfortunately I'm not finding any "hard" evidence on the topic. I wish there would be some official Microsoft spec sheets on how these things are actually implemented. So that we would know what the default rates are in the different versions of windows, and whether/how they apply to different HID's, like mouse, keyboard and joypad/sticks. Unfortunately this seems rather hard to come by.

Of course, for the above scheme to be worth the pain it should be paired with almost real-time reporting of input events. Don't expect many spec sheets; as with custom video modes, this stuff is just beyond what's considered orthodox PC usage. What amazes me is that there's not much official concern that I know of, given that gamers are one of the main targets of the PC industry.

Quote
Ah yes, I forgot to mention that the Radeon Pro utility -only- works with 32-bit applications. It uses some kind of "hook" system, that simply doesn't work for 64 bit applications. It's one of the reasons I'm compiling most of the emulator stuff as 32-bit applications.

Whether or not RP is really applying your settings is shown by the taskbar status icon, like in the image below.

I did some research on this "Flip Queue Size" thing. Well, it seems it's controlled by a registry value named FlipQueueSize, so it should work without the need of any utility. This value is read by the ati3duag.dll file. I dug into the disassembly and found this:

Code: [Select]
.text:00015ACB                 push    offset aFlipqueuesize ; "FlipQueueSize"
.text:00015AD0                 call    sub_39AA0
.text:00015AD5                 mov     eax, [esi+4]
.text:00015AD8                 cmp     eax, 0Ah
.text:00015ADB                 jbe     short loc_15AE6
.text:00015ADD                 mov     dword ptr [esi+4], 0Ah
.text:00015AE4                 jmp     short loc_15AF2
.text:00015AE6 ; ---------------------------------------------------------------------------
.text:00015AE6
.text:00015AE6 loc_15AE6:                              ; CODE XREF: .text:00015ADBj
.text:00015AE6                 cmp     eax, 2
.text:00015AE9                 jnb     short loc_15AF2
.text:00015AEB                 mov     dword ptr [esi+4], 2
.text:00015AF2
.text:00015AF2 loc_15AF2:                              ; CODE XREF: .text:00015AE4j
.text:00015AF2                                         ; .text:00015AE9j
.text:00015AF2                 mov     eax, [esi+4]

It's interesting because it shows that the minimum value allowed is 2 (and the maximum is 10, i.e. 0Ah). BTW this is from Catalyst 9.3.
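For readers who don't read x86, the snippet above boils down to a clamp. Here is my reading of it as a short Python sketch; the function and constant names are made up for illustration, and it assumes the value at [esi+4] is an unsigned 32-bit integer, which is what the unsigned jumps (jbe/jnb) suggest.

```python
FLIP_QUEUE_MIN = 2    # cmp eax, 2   / jnb: values below 2 are raised to 2
FLIP_QUEUE_MAX = 10   # cmp eax, 0Ah / jbe: values above 10 are lowered to 10


def clamp_flip_queue_size(value: int) -> int:
    """Mirror the range check the disassembly applies to FlipQueueSize."""
    value &= 0xFFFFFFFF  # treat as unsigned 32-bit, like the jbe/jnb compares
    if value > FLIP_QUEUE_MAX:
        return FLIP_QUEUE_MAX
    if value < FLIP_QUEUE_MIN:
        return FLIP_QUEUE_MIN
    return value
```

So in this driver build, writing 0 or 1 to the registry would silently behave like 2.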

Quote
Now I've been searching my head off  for this util, but I could only find a v1.23 version , which (unfortunately) is already updated for the new toolchain. Do you have any other tips regarding this, or could the v1.22 version be put up shortly somewhere? Otherwise I guess I'll have to wait until the new patch comes out. (Which isn't that big a problem, but I thought I'd just ask.)

Better wait for the new patch, that will work with the new toolchain, hopefully I have some time to put everything together soon.
« Last Edit: November 12, 2012, 05:44:30 pm by Calamity »
Important note: posts reporting GM issues without a log will be IGNORED.
Steps to create a log:
 - From command line, run: groovymame.exe -v romname >romname.txt
 - Attach resulting romname.txt file to your post, instead of pasting it.

CRT Emudriver, VMMaker & Arcade OSD downloads, documentation and discussion:  Eiusdemmodi

Dr.Venom

Quote
Well, it's just a rough calculation. Based on an average VBLANK duration of around 1.33 ms for a 15 kHz CRT, in order to fit one complete emulation cycle into the VBLANK period you need the emulator to run at least at:

16.67 / 1.33 = 12.50 times faster than the original machine did (x 100 = 1250% as MAME expresses it).

So unless a certain game runs at 1250% unthrottled on your hardware it won't be possible to get emulated inside VBLANK.

Ah OK, that makes sense. 12.5 times the original machine's speed is quite beefy. And it wouldn't surprise me if there's an average and a standard deviation to the time MAME takes to emulate individual frames. In other words, some frames will probably need less than 12.5 times the original speed and others more (beefing up the required specs even further to keep it running at full frame rate).

A sort of easy "patch" in the meantime, until PCs get more powerful, would be to "burst" emulate each frame starting at the middle of the real frame, instead of at the beginning of the frame (as MAME does it now, I understand). That would lower the required speed to 200% of the original machine's speed, something that most modern PCs should be able to handle for many of the emulated systems. But then again, until PCs get much more powerful it's probably safer to start at the beginning of the frame as MAME does now, so that the chance of a frame being skipped by missing vblank is lowest across a wide range of PCs.

Quote
Biggest question would be if it is currently possible to PAUSE/START the MAME emulation core at will -multiple times- during a frame with only a -modest- change to the code?

That's the point. I'm afraid I have no clue how the actual architecture of MAME works on the emulator side, or even if there's a general way of doing things common to all systems. However, I would be deeply surprised if this PAUSE/START thing could be implemented through modest changes.

OK. Well maybe someday we get some more information on the actual architecture on the emulator side. It would be interesting to know whether or not it could be implemented with moderate changes. Unfortunately I'm only starting out at programming, learning the basics, so for now (and probably the foreseeable future given the learning curve) I can't be much of a help in this regard.

Quote
Quote
Code: [Select]
real line nr.            -> 240---------------------------------------------262/0--------------------------------------------------80-------------------------------------------------160-------------------------------------------------240(wrap)         
real display (>front buf)->  |-----------------REAL VBLANK---------------------| display of chunk *1* (lines 0-80)---------------->| display of chunk *2* (lines 81-160)--------------->| display of chunk *3* (lines 161-240)------------->||
emu core (>back buf)     ->  | emu chunk *1* (lines 0-80) -> pause+blit+wait-->| emu chunk *2* (lines 81-160) -> pause+blit+wait-->| emu chunk *3* (lines 161-240)-> pause+blit+wait--->| ---------> wrap when next frame------------------>||
D3DRASTER_STATUS         ->  | poll/wait for real display line 0 ------------->| poll/wait for real display line 80--------------->| poll/wait for real display line 160--------------->| poll/wait for real display line 240 (vblank)----->||   
input polling            ->  |-------------------input enabled---------------->| --------------------input enabled---------------->| --------------------input enabled----------------->|--------------------input enabled----------------->||

This would definitely be awesome if it could be achieved. It reminds me of the method implemented by Ootake's author, however I think he didn't link the emulation of the different chunks to the actual scanlines. In any case it's a good thing if it serves to raise interest and awareness about this stuff among emulator writers.

Yes, definitely.

Quote
Quote
In theory it should work indeed, but unfortunately I'm not finding any "hard" evidence on the topic. I wish there would be some official Microsoft spec sheets on how these things are actually implemented. So that we would know what the default rates are in the different versions of windows, and whether/how they apply to different HID's, like mouse, keyboard and joypad/sticks. Unfortunately this seems rather hard to come by.

Of course, for the above scheme to be worth the pain it should be paired with almost real-time reporting of input events. Don't expect many spec sheets; as with custom video modes, this stuff is just beyond what's considered orthodox PC usage. What amazes me is that there's not much official concern that I know of, given that gamers are one of the main targets of the PC industry.

Indeed, getting to almost real-time video display/emulation would be much less effective if the input polling side could not keep up. There does seem to be some concern/demand from the FPS community about the USB polling rate, though, with some manufacturers bringing out dedicated mice with dedicated drivers in which the polling rate can be set, like this one from Corsair (http://www.corsair.com/vengeance-m60-performance-fps-laser-gaming-mouse.html ). It claims selectable polling rates of 1000Hz, 500Hz, 250Hz, or 125Hz (response times of 1ms, 2ms, 4ms or 8ms), but very unfortunately I haven't seen this kind of gamer-dedicated hardware for joypads and joysticks.
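To put those polling rates next to a 60 Hz frame, here is a small Python sketch. It is a simplification under my own assumptions: an input event can land anywhere within a polling interval with equal probability, and no other delay (driver, DirectInput, the emulator's own polling) is counted. `polling_latency_ms` is just an illustrative name.

```python
from typing import Tuple


def polling_latency_ms(rate_hz: float) -> Tuple[float, float]:
    """Return (worst-case, average) wait in ms before an input event
    is picked up by a device polled at `rate_hz`."""
    if rate_hz <= 0:
        raise ValueError("rate_hz must be positive")
    interval = 1000.0 / rate_hz
    return interval, interval / 2.0


frame_ms = 1000.0 / 60.0
for rate in (125, 250, 500, 1000):
    worst, average = polling_latency_ms(rate)
    print(f"{rate:>4} Hz: worst {worst:.1f} ms, average {average:.1f} ms, "
          f"worst case is {worst / frame_ms:.0%} of a 60 Hz frame")
```

At 125 Hz the worst case is close to half a frame, which would eat most of the gain from a 1/3-frame scheme; at 1000 Hz it becomes negligible.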

Quote
Quote
Ah yes, I forgot to mention that the Radeon Pro utility -only- works with 32-bit applications. It uses some kind of "hook" system, that simply doesn't work for 64 bit applications. It's one of the reasons I'm compiling most of the emulator stuff as 32-bit applications.

Whether or not RP is really applying your settings is shown by the taskbar status icon, like in the image below.

I did some research on this "Flip Queue Size" thing. Well, it seems it's controlled by a registry key named FlipQueueSize, so it should work without the need of any utility.

Great that you've been digging deeper into this. Could you post the registry key path in which you find this specific key? (A search in my Win7 64-bit registry didn't reveal the key.)

Quote
This key is read by the ati3duag.dll file. I dug in the disassembly and found this:

Code: [Select]
.text:00015ACB                 push    offset aFlipqueuesize ; "FlipQueueSize"
.text:00015AD0                 call    sub_39AA0
.text:00015AD5                 mov     eax, [esi+4]
.text:00015AD8                 cmp     eax, 0Ah
.text:00015ADB                 jbe     short loc_15AE6
.text:00015ADD                 mov     dword ptr [esi+4], 0Ah
.text:00015AE4                 jmp     short loc_15AF2
.text:00015AE6 ; ---------------------------------------------------------------------------
.text:00015AE6
.text:00015AE6 loc_15AE6:                              ; CODE XREF: .text:00015ADBj
.text:00015AE6                 cmp     eax, 2
.text:00015AE9                 jnb     short loc_15AF2
.text:00015AEB                 mov     dword ptr [esi+4], 2
.text:00015AF2
.text:00015AF2 loc_15AF2:                              ; CODE XREF: .text:00015AE4j
.text:00015AF2                                         ; .text:00015AE9j
.text:00015AF2                 mov     eax, [esi+4]

It's interesting because it shows that the minimum value allowed is 2. BTW this is from Catalyst 9.3

That's some very cool digging :) and definitely interesting. Could it be that the RadeonPro tool patches this value at runtime? I remember someone "proving" somewhere that the flipqueuesize got changed adequately by the RadeonPro tool, but I can't remember when/where I read this. I would also be very much interested if you get any additional findings on this matter.

Quote
Quote
Now I've been searching my head off  for this util, but I could only find a v1.23 version , which (unfortunately) is already updated for the new toolchain. Do you have any other tips regarding this, or could the v1.22 version be put up shortly somewhere? Otherwise I guess I'll have to wait until the new patch comes out. (Which isn't that big a problem, but I thought I'd just ask.)

Better wait for the new patch, that will work with the new toolchain, hopefully I have some time to put everything together soon.

Great, I will.

In the meantime I was wondering if there's a possibility that I could "lift out" only the 'changeres' functionality from your patch, and apply that to the official target. Purely for my personal use, to test how it would work with the rendering of the main build (which works for me with ddraw and syncrefresh). My first attempt worked in the sense that it did change the visible resolutions from within the game on the fly, but it doesn't call the accompanying realtime screenswitch. Could you possibly give me a pointer on what to look for regarding this?
« Last Edit: November 12, 2012, 08:06:15 pm by Dr.Venom »

Dr.Venom

Quote
Quote
Ah yes, I forgot to mention that the Radeon Pro utility -only- works with 32-bit applications. It uses some kind of "hook" system, that simply doesn't work for 64 bit applications. It's one of the reasons I'm compiling most of the emulator stuff as 32-bit applications.

Whether or not RP is really applying your settings is shown by the taskbar status icon, like in the image below.

I did some research on this "Flip Queue Size" thing. Well, it seems it's controlled by a registry key named FlipQueueSize, so it should work without the need of any utility. This key is read by the ati3duag.dll file. I dug in the disassembly and found this:

Code: [Select]
.text:00015ACB                 push    offset aFlipqueuesize ; "FlipQueueSize"
.text:00015AD0                 call    sub_39AA0
.text:00015AD5                 mov     eax, [esi+4]
.text:00015AD8                 cmp     eax, 0Ah
.text:00015ADB                 jbe     short loc_15AE6
.text:00015ADD                 mov     dword ptr [esi+4], 0Ah
.text:00015AE4                 jmp     short loc_15AF2
.text:00015AE6 ; ---------------------------------------------------------------------------
.text:00015AE6
.text:00015AE6 loc_15AE6:                              ; CODE XREF: .text:00015ADBj
.text:00015AE6                 cmp     eax, 2
.text:00015AE9                 jnb     short loc_15AF2
.text:00015AEB                 mov     dword ptr [esi+4], 2
.text:00015AF2
.text:00015AF2 loc_15AF2:                              ; CODE XREF: .text:00015AE4j
.text:00015AF2                                         ; .text:00015AE9j
.text:00015AF2                 mov     eax, [esi+4]

It's interesting because it shows that the minimum value allowed is 2. BTW this is from Catalyst 9.3

For my personal interest and for the sake of science (I'll post back here), I would like to check the above for my Windows 7 Catalyst 12_6 Legacy driver. Would this application http://www.reflector.net/ be suitable for decompiling the mentioned ati3duag.DLL? Or is there maybe another (hopefully free) application that you could recommend? Thanks..

Calamity

Quote
For my personal interest and for the sake of science (I'll post back here), I would like to check the above for my Windows 7 Catalyst 12_6 Legacy driver. Would this application http://www.reflector.net/ be suitable for decompiling the mentioned ati3duag.DLL? Or is there maybe another (hopefully free) application that you could recommend? Thanks..

Hex-rays made IDA 5.0 free for non-commercial use, it's the tool that I use:

http://www.hex-rays.com/products/ida/support/download_freeware.shtml

Keep in mind that a minimum value of 2 possibly makes sense as you need at least 2 elements in order to have a queue.

Dr.Venom

Quote
Hex-rays made IDA 5.0 free for non-commercial use, it's the tool that I use:

http://www.hex-rays.com/products/ida/support/download_freeware.shtml

Thanks.

I've been looking for the ati3duag.dll in the Windows7 (64-bit) 12_6 Catalyst drivers, but guess what, that file is not a part of the driver anymore. ati2edxx.dll and ati2erec.dll, are the only ati# dll files in the whole package. I guess some parts must have been changed from 9_3 to 12_6. Nonetheless thanks for pointing me to Hex-rays, I'm sure I'll be making use of it sooner or later.

Quote
Keep in mind that a minimum value of 2 possibly makes sense as you need at least 2 elements in order to have a queue.

It seems to be one of those undocumented areas (once again). Personally I'm not sure whether the minimum value of 2 makes sense. The equivalent of flipqueuesize in the NVidia driver (max frames to render ahead) is known to have officially supported values of 0-8 in the older drivers and 1-4 in the newer drivers.

In case you still have the time and energy (after our earlier discussion), I found some interesting bits on it from a trusted source.

The Anandtech article "Triple Buffering: Why We Love It", which I'm sure you know, has some interesting information on the flipqueuesize settings in the 'UPDATE' that was added at the end of the article, of which I highlighted the parts that I think are related to the things we discussed. Intuitively I'd say this seems quite close to the truth about the matter.

http://www.anandtech.com/show/2794/4

Quote
UPDATE: There has been a lot of discussion in the comments of the differences between the page flipping method we are discussing in this article and implementations of a render ahead queue. In render ahead, frames cannot be dropped. This means that when the queue is full, what is displayed can have a lot more lag. Microsoft doesn't implement triple buffering in DirectX, they implement render ahead (from 0 to 8 frames with 3 being the default).

The major difference in the technique we've described here is the ability to drop frames when they are outdated. Render ahead forces older frames to be displayed. Queues can help smoothness and stuttering as a few really quick frames followed by a slow frame end up being evened out and spread over more frames. But the price you pay is in lag (the more frames in the queue, the longer it takes to empty the queue and the older the frames are that are displayed).

In order to maintain smoothness and reduce lag, it is possible to hold on to a limited number of frames in case they are needed but to drop them if they are not (if they get too old). This requires a little more intelligent management of already rendered frames and goes a bit beyond the scope of this article.

Some game developers implement a short render ahead queue and call it triple buffering (because it uses three total buffers). They certainly cannot be faulted for this, as there has been a lot of confusion on the subject and under certain circumstances this setup will perform the same as triple buffering as we have described it (but definitely not when framerate is higher than refresh rate).

Both techniques allow the graphics card to continue doing work while waiting for a vertical refresh when one frame is already completed. When using double buffering (and no render queue), while vertical sync is enabled, after one frame is completed nothing else can be rendered out which can cause stalling and degrade actual performance.

When vsync is not enabled, nothing more than double buffering is needed for performance, but a render queue can still be used to smooth framerate if it requires a few old frames to be kept around. This can keep instantaneous framerate from dipping in some cases, but will (even with double buffering and vsync disabled) add lag and input latency.

Their conclusions confirm (IMO) that the flipqueuesize is useful for adding smoothness, but that it does come at the price of added latency. They also confirm that the Microsoft implementation allows a setting of 0-8, which seems at odds with the 9_3 driver limiting the value to a minimum of 2? (But maybe I'm missing something.)
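To illustrate the difference the article describes, here is a toy Python simulation of a render-ahead queue (frames are never dropped) versus a scheme that always shows the newest frame and drops the rest. It is a deliberately simplified model of my own, not how DirectX or any driver is implemented: time advances in whole refresh intervals, the renderer is assumed fast enough to refill the queue every refresh, and each frame is stamped with the refresh at which its input was sampled.

```python
from collections import deque
from typing import List


def simulate_lag(depth: int, refreshes: int, drop_stale: bool) -> List[int]:
    """Return the age (in refresh intervals) of the frame shown at each refresh."""
    queue: deque = deque()
    lags = []
    for now in range(refreshes):
        # The renderer runs ahead until the queue is full; every new frame
        # reflects the input state at the current refresh.
        while len(queue) < depth:
            queue.append(now)
        if drop_stale:
            shown = queue.pop()      # newest frame wins, older ones are dropped
            queue.clear()
        else:
            shown = queue.popleft()  # render ahead: the oldest frame must be shown
        lags.append(now - shown)
    return lags


print(simulate_lag(depth=3, refreshes=6, drop_stale=False))
print(simulate_lag(depth=3, refreshes=6, drop_stale=True))
```

In this model a full render-ahead queue of depth N settles at N-1 refreshes of added lag, while the drop-stale variant stays at zero, which matches the article's point that the queue trades latency for smoothness.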

Interesting stuff at least, especially when the goal is to achieve an (almost) lagless implementation for emulation.

Calamity

Quote
A sort of easy "patch" in the meantime, until PCs get more powerful, would be to "burst" emulate each frame starting at the middle of the real frame, instead of at the beginning of the frame (as MAME does it now, I understand). That would lower the required speed to 200% of the original machine's speed, something that most modern PCs should be able to handle for many of the emulated systems. But then again, until PCs get much more powerful it's probably safer to start at the beginning of the frame as MAME does now, so that the chance of a frame being skipped by missing vblank is lowest across a wide range of PCs.

Yeah, that would be quite feasible through some clever modification of the throttling function. I can imagine it could be cool to add a slider control to adjust how 'late' within the frame period you want the emulation to start, so the user could optimize this feature depending on the game and host CPU. As you pointed out, the time it takes to emulate individual frames of a game is very uneven, and unfortunately you cannot know in advance how long it will take to emulate a frame, so you need to find the safe point where no retrace is missed. I'm definitely going to try and implement this as soon as I can.
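Just to sketch the idea of finding that safe point, here is a rough Python example. Everything in it is an assumption for illustration only: the slider is modelled as 10 equal steps per frame, the emulation times are hypothetical measurements, and the safety margin is an arbitrary 1 ms. It is not meant to describe how any actual option ends up being implemented.

```python
from typing import Sequence


def latest_safe_delay(
    frame_ms: float,
    emu_times_ms: Sequence[float],
    margin_ms: float = 1.0,
    steps: int = 10,
) -> int:
    """Largest start delay, in steps of frame_ms / steps, that still leaves
    room for the slowest observed frame plus a safety margin."""
    if not emu_times_ms:
        raise ValueError("need at least one measured emulation time")
    budget = frame_ms - max(emu_times_ms) - margin_ms
    if budget <= 0:
        return 0  # no headroom: start at the beginning of the frame
    step_ms = frame_ms / steps
    return min(steps - 1, int(budget // step_ms))


frame_ms = 1000.0 / 60.0
measured = [4.0, 5.5, 7.0]  # hypothetical per-frame emulation times in ms
print(latest_safe_delay(frame_ms, measured))
```

With a worst case of 7 ms and a 1 ms margin there are about 8.7 ms to spare, so the emulation could start roughly halfway into the frame. A game with a more uneven load would need a smaller value.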

It's only that I find input lag a very elusive matter, so I'm probably not the best person to test it. You may create a very complicated piece of code just to find you can't notice any difference.

Quote
Great that you've been digging deeper into this. Could you post the registry key path in which you find this specific key? (A search in my Win7 64-bit registry didn't reveal the key.)

Oh, the value is not supposed to exist unless one of these tweaking apps adds it. It's named FlipQueueSize and should reside in the same key where the driver stores its variables (the same place where we add the modelines).

Quote
That's some very cool digging :) and definitely interesting. Could it be that the RadeonPro tool patches this value at runtime? I remember someone "proving" somewhere that the flipqueuesize got changed adequately by the RadeonPro tool, but I can't remember when/where I read this. I would also be very much interested if you get any additional findings on this matter.

I seriously doubt that RadeonPro is patching this at runtime. Maybe it's using a hook to intercept some stuff, but I bet that for the flip queue they just use the registry value. Anyway, I can't say that for sure.

Calamity

Quote
I've been looking for the ati3duag.dll in the Windows7 (64-bit) 12_6 Catalyst drivers, but guess what, that file is not a part of the driver anymore. ati2edxx.dll and ati2erec.dll, are the only ati# dll files in the whole package. I guess some parts must have been changed from 9_3 to 12_6. Nonetheless thanks for pointing me to Hex-rays, I'm sure I'll be making use of it sooner or later.

Apart from the Catalyst version, W7 uses a new driver model, different from XP's.

Quote
It seems to be one of those undocumented areas (once again). Personally I'm not sure whether the minimum value of 2 makes sense. The equivalent of flipqueuesize in the NVidia driver (max frames to render ahead) is known to have officially supported values of 0-8 in the older drivers and 1-4 in the newer drivers.

I'm not sure either :) Just thinking of some possibilities. We don't know whether that queue is simply appended to the one created by the programmer, which would be catastrophic (say I code a triple buffer which ends up being a 3+2 = 5 buffer chain!), or whether they're just forcing a minimum of 2 (double buffering), so 3 would be 3 after all.


Calamity

Quote
The Anandtech article "Triple Buffering: Why We Love It", which I'm sure you know, has some interesting information on the flipqueuesize settings in the 'UPDATE' that was added at the end of the article, of which I highlighted the parts that I think are related to the things we discussed. Intuitively I'd say this seems quite close to the truth about the matter.

I read that article long ago and honestly I'm not sure if that UPDATE was already there, but YES, definitely that's the answer, at least for the part concerning the fake nature of the triple buffering implementation by DirectX. That explanation completely matches my experience with DirectDraw's flipping functions. So the confirmation that DirectX's triple buffer is just a queue is enough to avoid using it.

It's funny how once you get a new direction you find hundreds of Google references to this fact  ;D

Actually they seem to have fixed this behaviour in newer versions of DirectX, so:

http://msdn.microsoft.com/en-us/library/windows/desktop/bb172585%28v=vs.85%29.aspx

Quote

D3DPRESENT_FORCEIMMEDIATE

D3DPRESENT_INTERVAL_IMMEDIATE is enforced on this Present call. This flag can only be specified when using D3DSWAPEFFECT_FLIPEX. Windowed and fullscreen presentation behaviors are the same. This is especially useful for media apps that want to discard frames that have been detected as late and present subsequent frames at composition time. An invalid parameter error will be returned if this flag is improperly specified. When multiple consecutive frames with D3DPRESENT_FORCEIMMEDIATEs are queued, only the last frame is displayed, for both windowed and fullscreen presentation. A sample application that uses D3DPRESENT_FORCEIMMEDIATE and D3DSWAPEFFECT_FLIPEX is the D3D9ExFlipEx sample on the MSDN Code Gallery.

This flag is available in Direct3D 9Ex on Windows 7 or later operating systems.

When using D3DSWAPEFFECT_FLIPEX, each frame presented using D3DPRESENT_INTERVAL_IMMEDIATE or D3DPRESENT_INTERVAL_FORCEIMMEDIATE will override the previous frame's present interval. For example, if you queue the following frames using the following swap effects: frame A (D3DPRESENT_INTERVAL_ONE), frame B(D3DPRESENT_INTERVAL_ONE), frame C(D3DPRESENT_INTERVAL_ONE), frame D(D3DPRESENT_INTERVAL_FORCEIMMEDIATE), frame D will override frame C's present interval. The displayed frames per present interval are frame A, frame B, (frame C overridden by) frame D.


Unfortunately we don't have this for Windows XP  :angry:
« Last Edit: November 15, 2012, 06:19:13 pm by Calamity »

Dr.Venom

Quote
Yeah that would be quite feasible, through some clever modification of the throttling function. I can imagine it could be cool adding a slider control to adjust how 'late' within the frame period you want the emulation to start, so the user could optimize this feature depending on the game and host cpu. As you pointed out, the time it takes to emulate individual frames of a game is very uneven, and unfortunately you cannot know beforehand how long it will take to emulate a frame, so you need to find the safe point where no retrace is missed. I'm definitely going to try and implement this as soon as I can.

That would be very cool :)

I can imagine it would be useful to have a configuration parameter that is not too granular, but provides, say, a few steps for lowering the "frame delay". Say a setting of 0 would equal emulate+blit in vblank (near holy grail), 1 a quarter frame delay, each next step adds a quarter frame delay, with setting 4 equal to "emulate at beginning of frame" (the safest setting, equal to the current implementation).

This would keep things simple, without triggering people to try and optimize for the millisecond. That could possibly result in a false sense of accuracy altogether, because of the nature of the multitasking OS.

Quote
It's only that I find input lag a very elusive matter, so I'm not probably the best to test. You may create a very complicated piece of code just to find you can't notice any difference.

It can be a very elusive matter. In my experience the bigger differences can be noticed by explicitly testing for them. You'll mostly notice those by going back and forth between new and old in a short period of time. But, IMHO, for noticing the more subtle improvements a different approach is needed. For those you mostly need to play a fast shoot 'em up, one that you know by heart, for a while a few times. Then let it rest. Then go back to the old method, and play for a longer time. Then let it rest. Mostly in the course of a day, or a few days, you'll get a sense of the "swiftness" of new versus old.

Of course, this presupposes that you have accurate material to test with, so using a wired joystick/joypad that is by itself accurate is essential. Using one of these Chinese joypad adapters (for connecting PS2/SNES joypads etc. to a PC) is a no go, as they all run at 100Hz or worse, causing a delay of up to 10ms by itself (added to the 8ms in Windows), and being a source of too much "noise" to do proper tests on the software. I myself am using a Suzo "The Arcade" digital joystick, with an adapter that runs at 1000Hz (1ms), which negates any (additional) delay from the hardware side.

Second, it presupposes that you have the software environment set up properly, so for example testing in a window on Vista/7 with "Aero" enabled is a no go. Or testing with the flipqueuesize at the video driver / Windows default is a no go. Etc. Once both hardware and software setup are appropriate, and as such the usual sources of lag have been eliminated, only then is it possible to do adequate testing.

I guess in addition to proper and extensive testing, it would help to get some statistics from the software/emulation itself. It would be extremely helpful to keep a counter running within the emulation that 1) logs the average time between the start of frame emulation and vblank and 2) logs the number of instances where vblank is missed / a frame has been lost (of course these should be near zero in a proper test). Combining these statistics with the above-mentioned testing methods should give enough accuracy and certainty on whether a new method provides an improvement.
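To make the two counters concrete, here is a minimal sketch in Python. It is not MAME code: the `FrameStats` class, its field names and the millisecond timestamps passed to `record` are all hypothetical, and it assumes the emulator can timestamp the start and end of emulation and the vblank it was aiming for.

```python
from dataclasses import dataclass


@dataclass
class FrameStats:
    """Running totals for one test session (hypothetical helper, not part of MAME)."""
    frames: int = 0
    missed: int = 0
    total_start_to_vblank_ms: float = 0.0

    def record(self, emulate_start_ms: float, emulate_end_ms: float, vblank_ms: float) -> None:
        """Log one frame, given timestamps in milliseconds on a common clock."""
        self.frames += 1
        # Counter 2: the frame is lost if emulation finished after the targeted vblank.
        if emulate_end_ms > vblank_ms:
            self.missed += 1
        # Counter 1: accumulate the time from start of emulation until vblank.
        self.total_start_to_vblank_ms += vblank_ms - emulate_start_ms

    @property
    def average_start_to_vblank_ms(self) -> float:
        """Average time between start of emulation and vblank, 0.0 if nothing logged."""
        return self.total_start_to_vblank_ms / self.frames if self.frames else 0.0
```

A lower average with the missed count still at zero would suggest the input is being sampled closer to the moment the frame is shown.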

Quote
Quote
It seems one of those undocumented areas (once again). Personally I'm not sure whether the minimum value of 2 makes sense. The equivalent of flipqueuesize on the NVidia driver (max frames to render ahead) is known to officially have been supporting values of 0-8 in the older drivers and 1-4 in the newer drivers.

I'm not sure either :) Just thinking of some possibilities. We don't know if that queue is just appended to the one created by the programmer, which would be catastrophic (say I code a triple buffer which ends up being a 3+2 = 5 buffer chain!), or on the other hand they're just forcing a minimum of 2 (double buffering), so 3 would be 3 after all.

True. You would imagine it's the last option. But on the other hand it's strange that a driver limits the value to a minimum of two, while Windows allows for a lower setting. But then again, the way this (the driver) works in Windows XP and Windows 7+ might be quite different altogether. At least, as you pointed out, they're already using a different driver model, so that might indeed include a different approach to the whole flipqueuesize thing.

In the Anandtech article "Triple Buffering: Why We Love It" [...]

I read that article long ago and honestly I'm not sure if that UPDATE was already there, but YES, definitely that's the answer, at least for the part concerning the fake nature of the triple buffering implementation by DirectX. That explanation completely matches my experience with DirectDraw's flipping functions. So the confirmation that DirectX's triple buffer is just a queue is enough to avoid using it.

It's funny how once you get a new direction you find hundreds of Google references to this fact  ;D

That sounds familiar ;D     

Quote
Actually they seem to have fixed this behaviour in newer versions of DirectX, so:

http://msdn.microsoft.com/en-us/library/windows/desktop/bb172585%28v=vs.85%29.aspx

Quote
D3DPRESENT_FORCEIMMEDIATE[...]

This flag is available in Direct3D 9Ex on Windows 7 or later operating systems.

Unfortunately we don't have this for Windows XP  :angry:

That's indeed unfortunate.

On the other hand, it's fortunate that Microsoft has been addressing and improving these issues in Windows Vista/7/8. In that regard I also noticed two other interesting things.

In Windows 7 you can call a function SetMaximumFrameLatency which

Quote
Sets the number of frames that the system is allowed to queue for rendering. [...] The maximum number of back buffer frames that a driver can queue. The value defaults to 3, but can range from 1 to 16.

SetMaximumFrameLatency: http://msdn.microsoft.com/en-us/library/windows/desktop/ff471334%28v=vs.85%29.aspx
GetMaximumFrameLatency: http://msdn.microsoft.com/en-us/library/windows/desktop/ff471332(v=vs.85).aspx

Notice how the lowest value that can be forced is 1.

Another improvement that has been added to Windows 7 is on the audio front:

Quote
The following features have been improved in Windows 7:

In Windows 7 share mode streams run in low-latency mode. The audio engine runs in pull mode with a significant reduction in latency. This is very useful for communication applications that require low audio stream latency for faster streaming.

http://msdn.microsoft.com/en-us/library/windows/desktop/dd756612%28v=vs.85%29.aspx

Even better, there's also the addition of exclusive-mode streaming, which allows for very low latency streams called "Pro-Audio", see : http://msdn.microsoft.com/en-us/library/windows/desktop/dd370844%28v=vs.85%29.aspx

which to me sounds ideal (pun intended ;) ) for the purpose of emulation and getting audio latency as low as possible.

All of the above says to me that Windows 7 is possibly as good an alternative (if not better?) to Windows XP as an emulation platform. The only things you have to be *very* aware of are knowing about Aero and how to disable it when running emulation in a window, lowering the flipqueuesize setting in general (which defaults to 3 in Win7 because of Aero), and - not least - using an emulator that actually makes use of these improved rendering possibilities...

A heated debate I know, but with the above "guidelines" in mind, we should probably be more open minded on Windows 7 as a good platform for emulation?  Well I'm already, so I'm biased...

Out of interest, would it be possible to develop CRT emudriver for the Win7+ platform, or are there specific things about WindowsXP that are needed for it?
« Last Edit: November 17, 2012, 06:51:16 am by Dr.Venom »

jimmy2x2x

  • Trade Count: (+1)
  • Full Member
  • ***
  • Offline Offline
  • Posts: 1215
  • Last login:December 19, 2018, 01:29:48 am
 :burgerking: Say Black Dynamite

 :afro: Hush now burgerking, don't interrupt my kung-fu

 :burgerking: Sorry to interrupt your kung-fu, but...

 :burgerking: Could Dr.Venom be Dr(i)Ve(r)m(an)

 :afro: DYNA-MITE, DYNA-MITE!


Dr.Venom

Quote
It's only that I find input lag a very elusive matter, so I'm not probably the best to test. You may create a very complicated piece of code just to find you can't notice any difference.

It can be a very elusive matter. In my experience the bigger differences can be noticed by explicitly testing for them. You'll mostly notice those by going back and forth between new and old in a short period of time. But, IMHO, for noticing the more subtle improvements a different approach is needed. For those you mostly need to play a fast shoot 'em up, one that you know by heart, for a while a few times. Then let it rest. Then go back to the old method, and play for a longer time. Then let it rest. Mostly in the course of a day, or a few days, you'll get a sense of the "swiftness" of new versus old.

Of course, this presupposes that you have accurate material to test with, so using a wired joystick/joypad that is by itself accurate is essential. Using one of these Chinese joypad adapters (for connecting PS2/SNES joypads etc. to a PC) is a no go, as they all run at 100Hz or worse, causing a delay of up to 10ms by itself (added to the 8ms in Windows), and being a source of too much "noise" to do proper tests on the software. I myself am using a Suzo "The Arcade" digital joystick, with an adapter that runs at 1000Hz (1ms), which negates any (additional) delay from the hardware side.

Second, it presupposes that you have the software environment set up properly, so for example testing in a window on Vista/7 with "Aero" enabled is a no go. Or testing with the flipqueuesize at the video driver / Windows default is a no go. Etc. Once both hardware and software setup are appropriate, and as such the usual sources of lag have been eliminated, only then is it possible to do adequate testing.

I guess in addition to proper and extensive testing, it would help to get some statistics from the software/emulation itself. It would be extremely helpful to keep a counter running within the emulation that 1) logs the average time between the start of frame emulation and vblank and 2) logs the number of instances where vblank is missed / a frame has been lost (of course these should be near zero in a proper test). Combining these statistics with the above-mentioned testing methods should give enough accuracy and certainty on whether a new method provides an improvement.

I guess  the above quote (from my previous post), to make a long story short, is simply saying that I'd be willing to test any improvements :D

I had another thought on the matter to fully objectively and accurately test for input latency. Not sure if and how it would exactly work, but the idea would be as follows.

Connect a Photodiode (http://en.wikipedia.org/wiki/Photodiode) to a joystick/joypad button. Preferably via wires, so that the photodiode can be attached directly to the glass of the CRT screen. Additionally it would require a sample test program running in the emulation that flashes a single frame from black to white and back (photodiode converts the light into current, triggering "active" button signal on joystick), allowing the test program to objectively and accurately measure the time between the flipping/blitting command for the single white frame, and the time the input signal is received. Could possibly be an interesting path to research if and when we would want to get to the bottom of input latency, in a scientific way.

Calamity

So I've been doing some tests these days, and the new option named -frame_delay is going to be ready for the next release. I've done it so that a frame time is divided into 10 parts, so a frame_delay value of 0 (default) means the emulation starts at the beginning of the frame time, as always. A value of 5 means the emulation is postponed to the middle of the frame, and so on (you have 1 tenth of a frame of granularity). I *think* I've done it right and it has been working for me, however I need to test it more thoroughly. I have to admit that I can't notice any difference myself.
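As a worked illustration of the "tenths of a frame" arithmetic, here is a small Python sketch. The function name and the accepted range of 0 to 10 are assumptions made for the example; this is not GroovyMAME's actual code.

```python
def frame_delay_wait_ms(frame_delay: int, refresh_hz: float) -> float:
    """Return how long emulation is postponed after the start of the frame, in ms.

    frame_delay is the number of tenths of a frame period to wait.
    The 0..10 bound is an assumption for this sketch, not a documented limit.
    """
    if not 0 <= frame_delay <= 10:
        raise ValueError("frame_delay is expressed in tenths of a frame (0..10)")
    if refresh_hz <= 0:
        raise ValueError("refresh_hz must be positive")
    frame_period_ms = 1000.0 / refresh_hz
    return frame_period_ms * frame_delay / 10.0
```

At 60 Hz a frame lasts about 16.7 ms, so `frame_delay_wait_ms(5, 60.0)` gives roughly 8.3 ms, i.e. emulation starts in the middle of the frame.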

Second, I've removed the third buffer in -triplebuffer, so now it can be used as an asynchronous implementation of double buffering removing the extra frame in the queue.

PS: I leave the Win7 vs XP / input latency measurement issues for later posts...

Dr.Venom

Quote
So I've been doing some tests these days, and the new option named -frame_delay is going to be ready for the next release. I've done it so that a frame time is divided into 10 parts, so a frame_delay value of 0 (default) means the emulation starts at the beginning of the frame time, as always. A value of 5 means the emulation is postponed to the middle of the frame, and so on (you have 1 tenth of a frame of granularity).

That's awesome, I'm looking forward to test driving it.

Quote
I *think* I've done it right and it has been working for me, however I need to test it more thoroughly. I have to admit that I can't notice any difference myself.

Did you manage to get a "frame missed" counter or logging in?  That would be your double check, as pushing the -frame_delay from 0 in steps upwards to 10, should (from a certain value) also see an increased number of frames being missed.

Quote
Second, I've removed the third buffer in -triplebuffer, so now it can be used as an asynchronous implementation of double buffering removing the extra frame in the queue.

That's great, it will be exciting to see how the updated version performs. For my understanding, this means it operates with 1 backbuffer and 1 frontbuffer (the one that's drawn to the screen)? And (not wanting to assume too much) how does this exactly differ from the non-asynchronous double buffer?

I was wondering if it would be possible, if time and energy permits, to add the methods of "flip" or "blit" as a separate configurable option to GroovyUME? Not sure how much work it would be, but it might influence the effectiveness of the -frame_delay on different hardware/software configurations and could possibly also be of general benefit when configuring/optimizing GroovyUME with different setups.

Quote
PS: I leave the Win7 vs XP / input latency measurement issues for later posts...

That's perfectly fine. First things first...   

krick

  • Trade Count: (+1)
  • Full Member
  • ***
  • Offline Offline
  • Posts: 2006
  • Last login:May 23, 2025, 03:48:36 am
  • Gotta have blue hair.
Regarding input lag from the hardware perspective...

If you use a keyboard encoder attached to the PC using the PS/2 interface, it is assigned IRQ1 at the hardware level, which is the IRQ with the highest priority.  So, in theory, this should have the lowest lag possible on the hardware side.   However, I'm not sure what the operating system, driver, etc... do afterwards, lag-wise.

Also, I'm not sure about keyboards, but other inputs like mice are handled in MAME using the RawInput API, which may be different than DirectInput from a lag standpoint.

Check out this info on PS/2 keyboards vs USB keyboards...
http://www.tomshardware.com/reviews/mechanical-switch-keyboard,2955-5.html
Hantarex Polo 15KHz
Sapphire Radeon HD 7750 2GB (GCN)
GroovyMAME 0.197.017h_d3d9ex
CRT Emudriver & CRT Tools 2.0 beta 13 (Crimson 16.2.1 for GCN cards)
Windows 7 Home Premium 64-bit
Intel Core i7-4790K @ 4.8GHz
ASUS Z87M-PLUS Motherboard

Calamity

Quote
Did you manage to get a "frame missed" counter or logging in?  That would be your double check, as pushing the -frame_delay from 0 in steps upwards to 10, should (from a certain value) also see an increased number of frames being missed.

That would be a bit tricky to implement as it would require accurate time measurements. Anyway I trust my eye more than anything else in this regard.


Quote
For my understanding, this means it operates with 1 backbuffer and 1 frontbuffer (the one that's drawn to the screen)? And (not wanting to assume too much) how does this exactly differ from the non-asynchronous double buffer?

Yes that's it. By asynchronous I mean that it can drop frames if required (i.e. the game loop can run as fast as it wants without caring about the video card's duties). This can be achieved either by triple buffering (the real one, not the DirectX crap) or by moving the double buffering code into a separate thread (this is what GM does).
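One way to picture the "separate thread that can drop frames" idea is a single-slot mailbox between the game loop and a presenter thread. The Python sketch below is only an illustration under that assumption: `LatestFrameSlot` is a hypothetical name, and nothing here is taken from GroovyMAME's source.

```python
import threading


class LatestFrameSlot:
    """Single-slot mailbox: the game loop overwrites, the presenter takes the newest frame."""

    def __init__(self):
        self._cond = threading.Condition()
        self._frame = None
        self.dropped = 0  # frames replaced before the presenter could show them

    def publish(self, frame):
        """Called by the game loop. Never blocks on the video side."""
        with self._cond:
            if self._frame is not None:
                self.dropped += 1  # the pending frame is discarded, not queued
            self._frame = frame
            self._cond.notify()

    def take(self, timeout=None):
        """Called by the presenter thread. Returns the newest frame, or None on timeout."""
        with self._cond:
            self._cond.wait_for(lambda: self._frame is not None, timeout)
            frame, self._frame = self._frame, None
            return frame
```

In such a design the presenter thread would loop on `take()` and then perform the v-synced flip, so only that thread ever waits on the display. Because nothing is queued, no extra frame of latency builds up.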

Quote
I was wondering if it would be possible, if time and energy permits, to add the methods of "flip" or "blit" as a separate configurable option to GroovyUME?

That would be possible, however you don't have one explicit way of "blitting" in Direct3D AFAIK, the most similar thing seems the D3DSWAPEFFECT_COPY option but I doubt it actually represents an advantage.

Well, there's some interesting stuff I found when testing this -frame_delay method. At first I was using the D3D's default method built in MAME: flip + D3DPRESENT_INTERVAL_ONE, for v-syncing. This method seems the most efficient for catching the vblank period. However there's something odd to it.

I'll try to explain the problem. The -frame_delay option is implemented by moving the throttling wait "loop" after the screen update code, instead of placing it before as it currently is in MAME. Then instead of waiting for a full frame period, we just wait for a fraction of that period, as defined by the -frame_delay option.
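The reordering described above can be sketched as a per-frame function in Python. The three callables are hypothetical stand-ins for the emulator's own routines; the only point of the example is the order of the calls and the length of the wait.

```python
from typing import Callable


def run_frame(
    update_screen: Callable[[], None],
    wait_ms: Callable[[float], None],
    emulate_frame: Callable[[], None],
    frame_period_ms: float,
    frame_delay: int,
) -> None:
    """One iteration of the loop with the throttling wait moved after the screen update.

    All three callables are placeholders (assumptions), not real MAME functions.
    """
    # 1. V-synced screen update: expected to return right at the vertical blank.
    update_screen()
    # 2. Throttling wait, now only a fraction of the frame period.
    wait_ms(frame_period_ms * frame_delay / 10.0)
    # 3. Emulate the next frame, so input is sampled later within the frame.
    emulate_frame()
```

With `frame_delay` set to 0 the wait is zero and emulation starts right after the update, which matches the default behaviour described for the option.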

One would expect the screen update code to return at exact periods of time forced by the v-sync code, but this is only true for DirectDraw. The Direct3D "present" method seems to take random amounts of time to exit. This is a problem, because it frustrates our efforts to add an accurate wait loop after the update screen code, as it leads to a completely uneven frame rate.

I was about to trash the whole thing but then tried using the GetRasterStatus method within a loop and, right after that, performing the flip operation with the D3DPRESENT_INTERVAL_IMMEDIATE flag. Surprisingly this worked like a charm, but only for CRT monitors!!! For some reason when testing this on LCD monitors GetRasterStatus seems too slow reporting the VBLANK and you can clearly see static tearing, usually on the upper part of the screen. I believe it's always equally slow, it's only that CRT monitors tend to have a longer blanking period so this issue gets masked. But the nice thing is that even if delayed, it flags at exact intervals like a Swiss clock, as opposed to the v-synced flipping method.
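The poll-then-flip approach can be sketched like this in Python. Both callables are hypothetical: `in_vblank` stands in for reading the InVBlank field of the D3DRASTER_STATUS structure filled in by GetRasterStatus, and `present_immediate` stands in for a present call made with the immediate interval. The timeout is an addition for the sketch so the loop cannot spin forever.

```python
import time
from typing import Callable


def present_on_vblank(
    in_vblank: Callable[[], bool],
    present_immediate: Callable[[], None],
    timeout_s: float = 0.1,
) -> bool:
    """Busy-poll the raster status, then flip as soon as vblank is reported.

    Returns True if the frame was presented, False if no vblank was seen in time.
    Both callables are placeholders (assumptions), not real Direct3D bindings.
    """
    deadline = time.perf_counter() + timeout_s
    while not in_vblank():
        if time.perf_counter() >= deadline:
            return False  # give up instead of hanging the emulation
    # Vblank reported: flip right away, without asking the driver to v-sync.
    present_immediate()
    return True
```

A busy loop like this costs CPU time, which is the price paid for the regular timing the post describes.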

As a side effect, adding this option has resolved a problem I was having creating a clean implementation of frequency scaling. Unfortunately the flags D3DPRESENT_INTERVAL_TWO, etc., which I expected to use for this, do not work under DirectX 9 according to my tests, and I didn't want to resort to the -redraw patch that was plaguing GM with deadlocks. Now setting the -frame_delay option to 4-5 works fantastically to force MAME into skipping one out of two vertical retraces, making it possible to achieve smooth scrolling for games running at scaled vertical frequencies.

Quote
While doing some first tests with the new frame_delay feature, I also found something interesting that fixes the issue with the <240 line modes running way too fast on my setup (win7+soft15khz), as we spoke about some time ago (see: http://forum.arcadecontrols.com/index.php/topic,120331.msg1313434.html#msg1313434). That speed issue for those specific screenmodes is completely fixed now when I set the frame_delay parameter to 1 (or higher)!

My understanding of this is that DirectDraw's WaitForVerticalBlank method works somewhat differently under Windows 7, so maybe if the emulation is too fast you might end up having a new frame ready *before* the previous vblank has actually ended, thus making the new WaitForVerticalBlank call return immediately. This should be avoided with the DDWAITVB_BLOCKBEGIN flag, but for some reason it might not be working (this could be totally wrong). So adding a delay would give enough time for the VBLANK to end and for MAME to catch the next blanking period instead.

« Last Edit: November 26, 2012, 07:16:04 pm by Calamity »

krick

In MAME, some frames take more time to render, depending on what's happening on screen.  Doesn't that affect what you guys are trying to do?

Calamity

Quote
In MAME, some frames take more time to render, depending on what's happening on screen.  Doesn't that affect what you guys are trying to do?

Definitely, but by placing the wait loop (throttling code) right after the v-synced draw operation, we make sure the difference is absorbed by the wait_for_vblank loop.

So that: emulation_time + wait_for_vblank_time = constant

What we try to achieve is to reduce as much as possible the time the emulator spends waiting for vertical blank, stopping just before the frames that take longer to emulate start overflowing the time slice provided by the vertical blank. So one needs to explore the right value for -frame_delay, which obviously is game and host system specific.
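Since emulation_time + wait_for_vblank_time is constant, the search for the right value can be sketched as simple arithmetic, assuming you have somehow measured the slowest frame of a game on your machine. The Python function below is a hypothetical helper: the cap at 9 and the optional safety margin are assumptions of the sketch, and in practice the value still has to be confirmed by testing.

```python
import math


def max_safe_frame_delay(
    frame_period_ms: float,
    worst_emulation_ms: float,
    margin_ms: float = 0.0,
) -> int:
    """Largest delay, in tenths of a frame, that still lets the slowest frame finish.

    Requires: delay_wait + worst_emulation + margin <= frame_period.
    The result is capped at 9 (an assumption: 10 would leave no time to emulate).
    """
    spare_ms = frame_period_ms - worst_emulation_ms - margin_ms
    if spare_ms <= 0:
        return 0  # the slowest frame already fills the whole period
    return min(9, math.floor(10.0 * spare_ms / frame_period_ms))
```

For example, with a 20 ms frame period and a slowest frame of 4 ms, 16 ms are spare, which allows a delay of 8 tenths of a frame.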

Dr.Venom

Quote
Yes that's it. By asynchronous I mean that it can drop frames if required (i.e. the game loop can run as fast as it wants without caring about the video card's duties). This can be achieved either by triple buffering (the real one, not the DirectX crap) or by moving the double buffering code into a separate thread (this is what GM does).

Thanks for explaining.

Quote
Well, there's some interesting stuff I found when testing this -frame_delay method. At first I was using the D3D's default method built in MAME: flip + D3DPRESENT_INTERVAL_ONE, for v-syncing. This method seems the most efficient for catching the vblank period. However there's something odd to it.
[...]
I was about to trash the whole thing but then tried using the GetRasterStatus method within a loop and, right after that, performing the flip operation with the D3DPRESENT_INTERVAL_IMMEDIATE flag. Surprisingly this worked like a charm, but only for CRT monitors!!!

I'm *really* happy that you didn't trash the whole thing, as the final implementation is working so well! Also great to see that you could make use of the GetRasterStatus method, and get such a rock solid implementation and performance.

I did some more extensive testing on the frame_delay and for some systems I can reliably run it at a setting of 8, making a *noticeable* improvement versus the default! On very close comparison, running MSX2 emulation (which reliably takes a frame_delay of 8 on my system) side by side with a real machine, and switching between the two, the input response is hardly distinguishable from the real thing anymore. It really feels like the holy grail for (frame based) emulation :)

Quote
For some reason when testing this on LCD monitors GetRasterStatus seems too slow reporting the VBLANK and you can clearly see static tearing, usually on the upper part of the screen. I believe it's always equally slow, it's only that CRT monitors tend to have a longer blanking period so this issue gets masked. But the nice thing is that even if delayed, it flags at exact intervals like a Swiss clock, as opposed to the v-synced flipping method.

LCDs apparently don't need a real blanking period anymore (which is understandable from a hardware perspective), but I once read that it got implemented for compatibility reasons. In any case, if it's a sort of artificial value on LCD, then that might influence how long the "blanking" takes for different manufacturers' LCD panels. Purely guessing though..

What you could possibly try is to check for the scanline number during active display (which I understand the GetRasterStatus should return when not in vblank) and if that's also as stable as the swiss clock, you could time the flip a few lines before "reported" vblank. That could possibly mask the issue for LCD while possibly also having the flip still in vblank for CRT. On the other hand, since GroovyMAME is meant for CRT, I guess you wouldn't want to fiddle too much with the current implementation..

Quote
Quote
While doing some first tests with the new frame_delay feature, I also found something interesting that fixes the issue with the <240 line modes running way too fast on my setup (win7+soft15khz), as we spoke about some time ago (see: http://forum.arcadecontrols.com/index.php/topic,120331.msg1313434.html#msg1313434). That speed issue for those specific screenmodes is completely fixed now when I set the frame_delay parameter to 1 (or higher)!

My understanding of this is that DirectDraw's WaitForVerticalBlank method works somewhat differently under Windows 7, so maybe if the emulation is too fast you might end up having a new frame ready *before* the previous vblank has actually ended, thus making the new WaitForVerticalBlank call return immediately. This should be avoided with the DDWAITVB_BLOCKBEGIN flag, but for some reason it might not be working (this could be totally wrong). So adding a delay would give enough time for the VBLANK to end and for MAME to catch the next blanking period instead.

Ah yes indeed, that could certainly be it. I'm extremely happy though that it's working as well as it does now with the delay.

With regards to the settings, would you performance wise recommend activating hardware stretch when using ddraw, or leave the hwstretch disabled?

Calamity

Hi Dr.Venom,

Thanks for testing the -frame_delay option. I'm glad to hear that it actually makes a difference. I do notice it feels very responsive but can't honestly say for sure that I feel a difference compared to GM's normal -syncrefresh method. But I don't have the best controls to test, my sticks are not the best ones for accurate control.

As a note, while doing tests for the refresh scaling feature, I played ddonpach by rotating my PC CRT monitor and using a 115 Hz mode for real hardware scanlines in combination with -syncrefresh and -frame_delay 5 in order to discard one out of two vblanks. Well this *really* felt like a different game. Probably I was influenced by the fact of running tated, but in fact I reached a level that I had never got to before without adding new credits.

And a month ago or so, I visited a friend whose cab was set up by me before GM existed. At the time I believe I was using -triplebuffer -nothrottle to achieve smooth scrolling, combined with 200 different modes to cover the main systems. Well, I was shocked by how unplayable the games felt, compared to what we have now. But anyway I was under the effects of alcohol so that's not of scientific value.

I think I should split this topic from the GroovyUME's thread into its own "GroovyMAME's lag discussion" so it has more visibility.
« Last Edit: December 01, 2012, 07:08:18 am by Calamity »
Important note: posts reporting GM issues without a log will be IGNORED.
Steps to create a log:
 - From command line, run: groovymame.exe -v romname >romname.txt
 - Attach resulting romname.txt file to your post, instead of pasting it.

CRT Emudriver, VMMaker & Arcade OSD downloads, documentation and discussion:  Eiusdemmodi

Dr.Venom

Hi krick,

Quote
Also, I'm not sure about keyboards, but other inputs like mice are handled in MAME using the RawInput API, which may be different than DirectInput from a lag standpoint.

Yes, the keyboard should also be using the RawInput API, as suggested by Microsoft here: msdn.microsoft.com/en-us/library/ee418864.aspx. See halfway down: "Overall, using DirectInput offers no advantages when reading data from mouse or keyboard devices, and the use of DirectInput in these scenarios is discouraged."

Also interesting, in the same paragraph of that link, is the remark "DirectInput creates a second thread to read WM_INPUT data, and using the DirectInput APIs will add more overhead than simply reading WM_INPUT directly." To me that says RawInput might, in theory at least, have lower latency than DirectInput (because of DirectInput's extra overhead). I stress the *in theory*, as there are usually more factors to take into account. That could be interesting, as RawInput can also be used as an API for joysticks and other devices. If I interpreted it correctly, MAME is using DirectInput 7 for joysticks/pads, which could then possibly be improved by switching to the RawInput API? That's a question mark though, and from what I know it's not that easy to implement the RawInput API in a way that works robustly with the many different joystick/joypad devices out there.

Quote
Check out this info on PS/2 keyboards vs USB keyboards...
http://www.tomshardware.com/reviews/mechanical-switch-keyboard,2955-5.html

Thanks for the link, that certainly made for an interesting read.

Dr.Venom

Quote
As a note, while doing tests for the refresh scaling feature, I played ddonpach by rotating my PC CRT monitor and using a 115 Hz mode for real hardware scanlines, in combination with -syncrefresh and -frame_delay 5 in order to discard one out of two vblanks. Well, this *really* felt like a different game. I was probably influenced by the fact of running tated, but the fact is that I reached a level that I had never got to before without adding new credits.

It could be a coincidence, but I also reached a level on Gradius 2 (MSX version in UME), which I never reached before on emulation...

Quote
And a month ago or so, I visited a friend whose cab I had set up before GM existed. At the time I believe I was using -triplebuffer -nothrottle to achieve smooth scrolling, combined with 200 different modes to cover the main systems. Well, I was shocked by how unplayable the games felt compared to what we have now.

That's the interesting part: once your frame of reference changes, you start to notice those things. That's why I generally try to compare directly against real hardware (which admittedly is easier to do for home consoles than for arcade hardware).

Quote
But anyway I was under the effects of alcohol so that's not of scientific value.
;D

Quote
I think I should split this topic from the GroovyUME's thread into its own "GroovyMAME's lag discussion" so it has more visibility.

Yes, that would be helpful.