Just to qualify: I haven't looked at the source and I have no idea how MAME is architected, but I can make some educated guesses. The issue is that multi-core CPUs have only come into vogue in the past couple of years. Prior to that, multi-processor systems were somewhat common in the Apple space, but not really on PCs. As such, I don't think much focus was put into architecting the system to take advantage of the extra cores. However, as we move forward, there are serious physical limitations to how far we can scale frequency. Extra performance is only going to come from going wider (more ALUs) and from going multi-core. MAME cannot simply rely on single-CPU frequency continuing to climb; it has to take advantage of the newer architectures to extract maximum performance.
As for your syncing argument, I agree that there is a penalty associated with it, but if the operations are properly pipelined, the benefit far outweighs the synchronization cost. This is just like any modern-day superscalar CPU, where you can have many operations in flight simultaneously but they still need to sync up at the end of the pipe. At the end of the day, you still get a huge performance win from better utilization of the underlying resources. Running an emulator is really not all that different from running other types of code. With some creative thinking, it's often possible to parallelize a problem, or at the very least to pipeline it or break the code down so that different parts can run concurrently.
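To make the pipelining idea concrete, here is a rough sketch of what I have in mind. It is not how MAME works (again, I haven't read the source). The names `emulate_cpu`, `render`, and `run_pipeline` are made up for illustration, and the "work" in each stage is just placeholder arithmetic. One thread emulates frame N+1 while another renders frame N, and the only sync point is the bounded queue between them.

```python
import queue
import threading


def emulate_cpu(frame):
    # Hypothetical stand-in for running the emulated CPU for one frame.
    return {"frame": frame, "state": frame * 2}


def render(snapshot):
    # Hypothetical stand-in for producing video/audio from a state snapshot.
    return (snapshot["frame"], snapshot["state"] + 1)


def run_pipeline(num_frames, depth=2):
    # Bounded queue: the CPU stage can run at most `depth` frames ahead of
    # the render stage, which is where the synchronization cost is paid.
    handoff = queue.Queue(maxsize=depth)
    output = []

    def cpu_stage():
        for frame in range(num_frames):
            handoff.put(emulate_cpu(frame))  # blocks if the renderer falls behind
        handoff.put(None)  # sentinel: no more frames

    def render_stage():
        while True:
            snapshot = handoff.get()  # blocks until a frame is ready
            if snapshot is None:
                break
            output.append(render(snapshot))

    threads = [
        threading.Thread(target=cpu_stage),
        threading.Thread(target=render_stage),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return output
```

Calling `run_pipeline(4)` returns the rendered frames in order. Note that this only shows the structure: in standard CPython the global interpreter lock keeps two CPU-bound threads from actually running in parallel, so a real emulator would do this with native threads in C or C++. The shape of the handoff would be the same, though.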
I am sure it's not a trivial problem to solve, but I don't think it's an impossible one either.