
Topic: Enhance performance under Linux on modern CPU


Doozer

  • Trade Count: (0)
  • Full Member
  • ***
  • Offline Offline
  • Posts: 498
  • Last login:June 12, 2023, 09:19:49 am
  • Z80 ERROR
Enhance performance under Linux on modern CPU
« on: January 20, 2015, 04:06:09 am »
Hi,

For Linux enthusiasts (like groovyarcade users), here are some tips to achieve stable 100% emulation (no hiccups) with groovyume.

When using a multicore CPU, context switching and the ume multithreading option can degrade performance compared to a single-core system (like a P4 CPU). Mame/ume requires a clock frequency of at least 3 GHz to reach 100% emulation in all non-3D games. Newer CPUs have multiple cores but fewer GHz per core. Because mame/ume is not designed around multiple threads, it is better to concentrate the power on a single core and benefit from the turbo/max frequency all the time. I will describe how to disable the extra cores or hyperthreading to achieve the best performance. (Note: it is also possible to disable the extra cores in your BIOS menu if the option is available; nevertheless, step 3 is still required.)

1) From the bootloader

Inside the configuration file (e.g. syslinux.cfg, grub.conf, menu.lst, ...) add the following to the `append` line: maxcpus=1

You should have something like :
Code: [Select]
append root=/dev/disk/by-label/GA rw quiet splash maxcpus=1 vga=785 video=VGA-1:640x480ec

The line can differ; under groovyarcade the vga and video settings are system dependent.
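
After a reboot, it can be worth checking that the parameter really reached the kernel. Below is a minimal sketch; `has_boot_param` is a helper name made up for this example, and its optional second argument (a file to read instead of /proc/cmdline) exists only so it can be tried without rebooting.

```shell
#!/bin/sh
# has_boot_param PARAM [CMDLINE_FILE]   (hypothetical helper, not a standard tool)
# Succeeds when PARAM appears as a whole word on the kernel command line.
# CMDLINE_FILE defaults to /proc/cmdline.
has_boot_param() {
  param=$1
  file=${2:-/proc/cmdline}
  for word in $(cat "$file"); do
    [ "$word" = "$param" ] && return 0
  done
  return 1
}
```

Usage: `has_boot_param maxcpus=1 && echo "maxcpus=1 is active"`.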

2) From a Linux command line or a script executed as root

Code: [Select]
find /sys/devices/system/cpu/ -maxdepth 1 -name 'cpu[0-9]*' -not -name 'cpu0' -exec sh -c 'echo 0 > "$1/online"' _ {} \;

This disables every core except cpu0. The command performs an "echo 0 > /sys/devices/system/cpu/cpuN/online" for each CPU from cpu1 to the last one.
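
The same sysfs files can bring the cores back without a reboot. The following is a sketch under my own assumptions: `set_extra_cores` is a made-up helper name, and the optional second argument (an alternative root directory) is only there so the function can be tried on a fake tree. It must run as root on the real sysfs.

```shell
#!/bin/sh
# set_extra_cores STATE [CPU_ROOT]   (hypothetical helper)
# Writes STATE (0 = offline, 1 = online) to every cpuN/online file except cpu0.
set_extra_cores() {
  state=$1
  root=${2:-/sys/devices/system/cpu}
  for dir in "$root"/cpu[0-9]*; do
    [ -e "$dir/online" ] || continue        # skip entries without an online file
    [ "${dir##*/}" = "cpu0" ] && continue   # never touch cpu0
    echo "$state" > "$dir/online"
  done
}
```

Usage: `set_extra_cores 0` before starting the emulator, `set_extra_cores 1` to restore all cores afterwards.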

3) The last step, common to both methods, is to set the governor. This lets the system run the core at its maximum frequency (processors with a turbo mode will stay in turbo mode and let you use the extra MHz).

Code: [Select]
echo performance > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
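
To see what the steps above actually changed, a small status listing helps. This is only a sketch: `cpu_status` is a helper name invented for this example, the optional argument is an alternative sysfs root for testing, and a core without an `online` file is assumed to be online (which is typically the case for cpu0).

```shell
#!/bin/sh
# cpu_status [CPU_ROOT]   (hypothetical helper)
# Prints one line per core with its online state and cpufreq governor.
cpu_status() {
  root=${1:-/sys/devices/system/cpu}
  for dir in "$root"/cpu[0-9]*; do
    [ -d "$dir" ] || continue
    online=1    # assumption: no online file means the core cannot be unplugged
    [ -r "$dir/online" ] && online=$(cat "$dir/online")
    gov="-"     # printed when no governor file is readable for this core
    [ -r "$dir/cpufreq/scaling_governor" ] && gov=$(cat "$dir/cpufreq/scaling_governor")
    echo "${dir##*/} online=$online governor=$gov"
  done
}
```

Usage: run `cpu_status` before and after applying the changes and compare the two listings.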

There is no need to modify the ume.ini "multithreading" or "numprocessors" parameters. I now get the expected ume emulation performance and a stable 100% emulation speed.

Enjoy.
« Last Edit: January 21, 2015, 10:00:30 am by Doozer »

Doozer

Re: Enhance performance under Linux on modern CPU
« Reply #1 on: January 21, 2015, 09:42:56 am »
Another way to make use of the cores/HT of the system is to dedicate one core to groovyume and move as many other threads as possible to the remaining cores. To achieve this we will use the cpuset feature of recent kernels. I know that it is also possible to use isolcpus first (a boot parameter) to bind kernel processes to other cores, but leaving them on the dedicated core is not an issue. ACPI, interrupts and their vectors can still be optimized.

First we mount cpuset at a dedicated location (everything here is performed as root).

Code: [Select]
mkdir /cpuset
mount -t cpuset none /cpuset/

Next we isolate the second CPU core and disable hyperthreading on that core only. The loop ensures that all siblings (HT threads linked to the core) are disabled. The CPU/thread numbering is not constant; it depends on the processor family (number of cores and HT capability).

Code: [Select]
/bin/echo 0 > /sys/devices/system/cpu/cpu1/online
sleep 1
/bin/echo 1 > /sys/devices/system/cpu/cpu1/online
/bin/echo performance > /sys/devices/system/cpu/cpu1/cpufreq/scaling_governor

# take every hyperthread sibling of cpu1 offline (assumes a comma-separated sibling list)
for i in $(sed 's/,/ /g' /sys/devices/system/cpu/cpu1/topology/thread_siblings_list)
do
  if [ "$i" -ne 1 ]; then
    /bin/echo 0 > /sys/devices/system/cpu/cpu${i}/online
  fi
done
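
One caveat: the kernel may report the sibling list either with commas ("1,3") or as a range ("0-1"), and the loop above only splits on commas. A possible way to handle both forms is sketched below; `expand_cpu_list` is a hypothetical helper name, not an existing command.

```shell
#!/bin/sh
# expand_cpu_list LIST   (hypothetical helper)
# Turns a kernel CPU list such as "0,2-4" into "0 2 3 4".
expand_cpu_list() {
  out=""
  for part in $(echo "$1" | tr ',' ' '); do
    case $part in
      *-*)
        n=${part%-*}       # first CPU of the range
        last=${part#*-}    # last CPU of the range
        while [ "$n" -le "$last" ]; do
          out="$out $n"
          n=$((n + 1))
        done
        ;;
      *)
        out="$out $part"
        ;;
    esac
  done
  echo "${out# }"          # drop the leading space
}
```

Usage: `for i in $(expand_cpu_list "$(cat /sys/devices/system/cpu/cpu1/topology/thread_siblings_list)"); do ...; done`.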

Now we create a dedicated context for exclusive CPU execution. We dedicate core 1 (cpu1) exclusively, sharing global memory and with scheduler load balancing turned off, because we no longer need to balance the load.

Code: [Select]
mkdir /cpuset/single
/bin/echo 1 > /cpuset/single/cpuset.cpus
/bin/echo 1 > /cpuset/single/cpuset.cpu_exclusive
/bin/echo 0 > /cpuset/single/cpuset.mems
/bin/echo 0 > /cpuset/single/cpuset.sched_load_balance

The CPU is still available to the root context, so forked or new processes can still interfere with the new context. To lower the pressure from other processes, we create a second context containing the remaining CPUs (without core 1).

Code: [Select]
mkdir /cpuset/system
SYS_CPU="0"
for i in `find /sys/devices/system/cpu/cpu*/online`;
do
  if ( test "`cat $i`" -eq 1 ) then
    CPU_ID=`echo $i | sed 's/.*cpu\([0-9]*\)\/.*/\1/'`
    if ( test "$CPU_ID" -ne 1 ) then
      SYS_CPU=$SYS_CPU","$CPU_ID
    fi
  fi
done

/bin/echo $SYS_CPU > /cpuset/system/cpuset.cpus
/bin/echo 0 > /cpuset/system/cpuset.mems       

Now we migrate every thread we can to the system set. You will still see I/O warnings here because kernel threads cannot be moved to a subset. This is normal and not critical.

Code: [Select]
for i in `cat /cpuset/tasks`; do /bin/echo ${i} > /cpuset/system/tasks; done
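
If the warnings are too noisy, the loop can be wrapped so that it only reports how many tasks moved and how many were refused. This is a sketch; `migrate_tasks` is a name made up for this example, and it takes the source and destination tasks files as arguments.

```shell
#!/bin/sh
# migrate_tasks SRC_TASKS DST_TASKS   (hypothetical helper)
# Writes each PID listed in SRC_TASKS to DST_TASKS, one write per PID,
# and prints a summary instead of one error per refused task.
migrate_tasks() {
  src=$1
  dst=$2
  moved=0
  skipped=0
  for pid in $(cat "$src"); do
    # stderr is redirected first so that refused writes stay silent
    if /bin/echo "$pid" 2>/dev/null > "$dst"; then
      moved=$((moved + 1))
    else
      skipped=$((skipped + 1))
    fi
  done
  echo "moved=$moved skipped=$skipped"
}
```

Usage: `migrate_tasks /cpuset/tasks /cpuset/system/tasks`.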

Now core 1 is available at full speed (turbo mode) to run time-critical applications. We can start the frontend process or the groovyume process inside this set. Moving the frontend process has the advantage that all its child processes will run in the exclusive context. Otherwise, a launch script must be created for each emulator to move the process into this context. Replace advmenu with your frontend executable (like attrac...)

Code: [Select]
# pgrep avoids matching the grep process itself; each PID gets its own write
for pid in $(pgrep -x advmenu); do echo "$pid" > /cpuset/single/tasks; done

The major benefit compared to the previous solution is that system and service tasks are offloaded to the remaining CPUs. Even X will be handled outside the core allocated to the emulator.

Enjoy.




Doozer

Re: Enhance performance under Linux on modern CPU
« Reply #2 on: January 22, 2015, 04:48:22 am »
To let non-root users place tasks on the isolated CPU, change the access mode of the tasks special file to allow writes.

Code: [Select]
chmod a+w /cpuset/single/tasks

To complete the automation, in the /home/arcade/.xinitrc file, replace the line that starts advmenu with the following:

Code: [Select]
sh -c 'echo $$ > /cpuset/single/tasks;advmenu'

This ensures that advmenu and all its children run on the reserved core.
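
The same trick can be generalised into a small launcher for any command, which is handy for the per-emulator launch scripts mentioned earlier. A minimal sketch, assuming a helper name of my own choosing (`run_in_set`) that takes the tasks file as its first argument:

```shell
#!/bin/sh
# run_in_set TASKS_FILE COMMAND [ARGS...]   (hypothetical helper)
# Starts a fresh shell, writes that shell's PID to TASKS_FILE, then
# replaces the shell with COMMAND so the command keeps the same PID.
run_in_set() {
  tasks_file=$1
  shift
  sh -c 'echo $$ > "$0" && exec "$@"' "$tasks_file" "$@"
}
```

Usage: `run_in_set /cpuset/single/tasks advmenu`. Because of the `exec`, the command runs under the PID that was written to the tasks file, so its children inherit the set.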
« Last Edit: January 22, 2015, 08:48:12 am by Doozer »

bulbousbeard

  • Trade Count: (0)
  • Full Member
  • ***
  • Offline Offline
  • Posts: 522
  • Last login:August 25, 2015, 11:58:25 pm
  • I want to build my own arcade controls!
Re: Enhance performance under Linux on modern CPU
« Reply #3 on: January 22, 2015, 11:05:04 am »
Sounds like ---smurf-poo--- to me. I'm pretty sure that MAME uses multiple threads even on single core setups.

If software is working properly, you don't have to disable cores on a CPU. That's not an acceptable solution IMO.

Doozer

Re: Enhance performance under Linux on modern CPU
« Reply #4 on: January 23, 2015, 05:46:29 am »
Quote from: bulbousbeard on January 22, 2015, 11:05:04 am
"Sounds like ---saint's minion-poo--- to me. I'm pretty sure that MAME uses multiple threads even on single core setups.

If software is working properly, you don't have to disable cores on a CPU. That's not an acceptable solution IMO."

I agree that if everything is working properly, there is NO need for optimization. In that case spurious interrupts and latency should not be a concern, as long as the number of cycles is sufficient to do the job.

Remember that MAME is not multithreading oriented. I am targeting systems that need the `-mt` option enabled to reach 100% emulation speed on multi-CPU configurations. Without it, those systems reach only 50% emulation speed (the explanation is given in another thread on this forum; it mainly concerns balancing the load involving a sound thread).

Unless you have an unlocked CPU (K version), you need to disable core(s) if you want turbo boost to use the highest multiplier. For example, on a mid-range i5 CPU already suffering from the 50% barrier under the multi-core scheduler, CPU isolation brings the number back to 100%. On the other hand, Intel SpeedStep-enabled CPUs might not apply the same multiplier, depending on the number of active cores/hyperthreads.

Code: [Select]
i5 - 6xx CPU Max TURBO Multiplier (if Enabled) with 1/2/3/4 Cores is  27x/26x/0x/0x
Bus clock frequency (BCLK) 133.00 MHz
Base multiplier is 25x

formula is : BCLK x MULT = true frequency

In normal speed = 25 x 133 = ~3.3GHz
In turbo mode speed = 27 x 133 = ~3.6GHz

The extra ~270 MHz (2 x 133) might play a role, depending on the emulation pressure.
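
The formula is easy to replay for another CPU with shell integer arithmetic. A tiny sketch; `core_mhz` is a made-up function name, and the values are the ones from the listing above.

```shell
#!/bin/sh
# core_mhz BCLK_MHZ MULTIPLIER   (hypothetical helper)
# Applies the formula BCLK x MULT and prints the frequency in MHz.
core_mhz() {
  echo $(( $1 * $2 ))
}

core_mhz 133 25    # base multiplier: prints 3325
core_mhz 133 27    # single-core turbo multiplier: prints 3591
```

The difference between the two results is 266 MHz, the gain expected from keeping a single core in turbo mode on this CPU.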

Indeed, groovyume uses several threads, at least 4:

Code: [Select]
groovymame(1684)-+-{SDLAudioDev1}(1706)
                 |-{SDLTimer}(1701)
                 |-{groovymame}(1703)
                 `-{groovymame}(1704)  <--- 2nd thread is present only with ume.ini multithreading option enabled

A solution is always acceptable if it brings an enhancement ;-)
« Last Edit: January 24, 2015, 02:00:34 am by Doozer »

Doozer

Re: Enhance performance under Linux on modern CPU
« Reply #5 on: January 23, 2015, 12:24:38 pm »

Last step: how to offload interrupts from the isolated CPU. These steps are executed at boot time.

First, check that irqbalance is not active on the system (it is not active on groovyarcade).

Going back to the previous example with CPU1 isolated, we will change the smp_affinity for this CPU.

We have to compute the CPU bit position (CPUPOS) and the isolation mask (MASK).

Code: [Select]
SINGLE_CPU=1
MASK=`perl -e '$val=$ARGV[0];$new=0xFF & ~(1<<$val);printf "%02x\n",$new;' ${SINGLE_CPU}`
CPUPOS=`perl -e '$val=$ARGV[0];$new=1<<$val;printf "%02x\n",$new;' ${SINGLE_CPU}`
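
If perl is not installed, the two values can also be computed with plain shell arithmetic. This is a sketch with function names of my own (`isolation_mask`, `cpu_bit`); like the perl version it uses an 8-bit mask (0xFF), so it assumes at most 8 CPUs.

```shell
#!/bin/sh
# isolation_mask CPU   (hypothetical helper)
# Prints the 8-bit hex mask with the bit of CPU cleared.
isolation_mask() {
  printf '%02x\n' $(( 0xFF & ~(1 << $1) ))
}

# cpu_bit CPU   (hypothetical helper)
# Prints the hex value with only the bit of CPU set.
cpu_bit() {
  printf '%02x\n' $(( 1 << $1 ))
}
```

Usage: ``MASK=`isolation_mask 1` `` gives fd and ``CPUPOS=`cpu_bit 1` `` gives 02, the same values the perl one-liners produce for SINGLE_CPU=1.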

First we exclude the CPU from the default affinity.

Code: [Select]
echo $MASK > /proc/irq/default_smp_affinity

Now for each interrupt (some interrupts, like 0,2... are not movable) we remove cpu1 from the set and, if it was the only CPU, we move the interrupt to cpu0. It is easier to create a script and apply the changes using xargs.

Code: [Select]
find /proc/irq/ -name 'smp_affinity' -print0 | xargs -0 -I'{}' ./your_irq_script.sh ${SINGLE_CPU} "{}"

Code: [Select]
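#!/bin/sh
# usage: your_irq_script.sh CPU_NUMBER /proc/irq/N/smp_affinity
SINGLE_CPU=$1    # CPU to isolate, passed as the first argument by xargs
OLD=`cat "$2"`   # current affinity mask of this interrupt
MASK=`perl -e '$val=$ARGV[0];$new=0xFF & ~(1<<$val);printf "%02x\n",$new;' ${SINGLE_CPU}`
CPUPOS=`perl -e '$val=$ARGV[0];$new=1<<$val;printf "%02x\n",$new;' ${SINGLE_CPU}`

# NEW is empty when nothing changes and "01" (cpu0) when the isolated CPU was the only one
NEW=`perl -e '$val=hex($ARGV[0]);$mask=hex($ARGV[1]);$new=$val & $mask;if($new == 0 ) { print "01\n"; } elsif ($val != $new) {printf "%02x\n",$new;}' $OLD $MASK`

if ( test -n "$NEW" ) then
  ID=`echo $2 | sed 's+.*irq/\([0-9]*\)/smp.*.*+\1+'`
  # interrupts 0 to 2 are left alone
  if ( test "$ID" -gt 2 ) then
    echo $NEW > $2
  fi
fi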

I have intentionally not optimized the previous script and command sequences.

You can check the interrupts handled by each CPU with cat /proc/interrupts
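
The raw table is long, so summing each CPU column gives a quicker picture of whether the isolated core still receives interrupts. A rough sketch using awk; `irq_totals` is a made-up helper name and the optional argument (a file to read instead of /proc/interrupts) is only there for testing.

```shell
#!/bin/sh
# irq_totals [FILE]   (hypothetical helper)
# Sums the per-CPU counters of /proc/interrupts, one output line per CPU column.
irq_totals() {
  awk '
    NR == 1 { n = NF; for (i = 1; i <= n; i++) name[i] = $i; next }   # header row: CPU0 CPU1 ...
    NF > n  { for (i = 1; i <= n; i++)                                # skip short rows such as ERR
                if ($(i + 1) ~ /^[0-9]+$/) sum[i] += $(i + 1) }
    END     { for (i = 1; i <= n; i++) printf "%s %d\n", name[i], sum[i] }
  ' "${1:-/proc/interrupts}"
}
```

Usage: run `irq_totals` twice a few seconds apart; the total of the isolated CPU should grow much more slowly than the others.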

Doozer

Re: Enhance performance under Linux on modern CPU
« Reply #6 on: January 23, 2015, 12:31:41 pm »

For information, cpu1 is an arbitrary choice. The CPU architecture defines how the siblings (hyperthreads) are wired. Sometimes the second core starts at another position.

Sibling case 1 (used previously). Here cpu1 is linked with cpu3

Code: [Select]
      Core Hyperthread_cpu
Cpu0  0    2
Cpu1  1    3
Cpu2  0    0
Cpu3  1    1

Sibling case 2. In that case cpu2 is the second core and cpu3 the hyperthreaded cpu.

Code: [Select]
      Core Hyperthread_cpu
Cpu0  0    0
Cpu1  0    1
Cpu2  1    2
Cpu3  1    3
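
To find out which case applies to a given machine, the sibling lists can be read directly from sysfs. A minimal sketch; `show_siblings` is a name made up for this example and the optional argument is an alternative sysfs root for testing.

```shell
#!/bin/sh
# show_siblings [CPU_ROOT]   (hypothetical helper)
# Prints the hyperthread sibling list of every online core.
show_siblings() {
  root=${1:-/sys/devices/system/cpu}
  for f in "$root"/cpu[0-9]*/topology/thread_siblings_list; do
    [ -r "$f" ] || continue
    cpu=${f#"$root"/}     # strip the root prefix
    cpu=${cpu%%/*}        # keep only the cpuN part
    echo "$cpu: $(cat "$f")"
  done
}
```

Usage: run `show_siblings` before taking any core offline; a line such as `cpu1: 1,3` corresponds to sibling case 1.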