It also depends on the complexity of the pipeline and how "smart" your compiler is. Intel's C compiler will out-optimize almost anybody on their modern CPUs on tight inner-loop kernel performance simply because the pipeline is so complex that it's very difficult for a human to 1) fully know/understand all the interactions, and 2) keep it full. People who are really good with compilers and know the pipeline extremely well have sat down and automated the process. Even gcc does amazingly well on many x86 targets.
On smaller MCUs with much simpler pipelines and less mature compilers, yeah, a human can often out-optimize the compiler without too much effort. Of course, the real speed gains usually come from things the compiler doesn't know how to do, like improving the runtime complexity of the algorithm. You can do this in C or assembly, but the people willing to write tight assembly are usually also the ones willing to put forth that kind of effort, and they tend to be better at it. In general, I've found that if you have a fixed time budget on a project, you do much better to write the whole thing in C, profile it, figure out where the hotspots are, and give those manual attention. Inspecting the output of the compiler (e.g. a disassembly with symbols) is often a very useful thing.
Also, figure you can now get a ~50MHz 32-bit MCU (ARM Cortex-M0) for like $1-2. That's so much CPU time that you'll get far better gains from using the thing properly (e.g. taking advantage of DMA) than from trying to count instructions. I do still use 8-bit MCUs occasionally, mostly when I want a small package (e.g. 8 pins) and something really cheap in low quantity. My go-to part is usually something in the tinyAVR series. I've recently used e.g. a tiny25, which is about 50 cents in low quantity and can run at 16MHz off the internal oscillator if you don't care too much about timing accuracy and jitter. Anything where I need USB, at this point, I'm going to reach for an ARM.