Will the dual-microcontroller UART solution offer low enough latency? Hardware interrupts are probably the way to go, if I know my interrupts. Simple shift registers will do if they are fast enough, I suppose.
I don't know. I'm in over my head a bit, here.
Depends on how fast you run it. Without a transceiver in the middle (just running TTL levels), you can usually get 1 MBaud in async mode out of the UARTs on modern 8-bit MCUs (faster yet on some parts, e.g. ARMs). If you've got, say, 48 inputs, that's 6 bytes; at 10 bits per byte on the wire (start bit, 8 data bits, stop bit), it works out to about 60 microseconds to move the data over the wire, plus some time for message-level framing. That corresponds to a sample rate of about 16.7 kHz (one snapshot every 60 microseconds). A 16 MHz CPU can probably keep up with this.
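The wire-time arithmetic can be sketched in a few lines of Python. This is a back-of-the-envelope model only, assuming 8N1 framing (10 bits per byte), inputs packed 8 per byte, and no idle gaps between bytes; `uart_transfer_us`, `sample_rate_hz`, and the `framing_bytes` parameter are hypothetical names for illustration.

```python
import math

def uart_transfer_us(num_inputs, baud=1_000_000, bits_per_char=10, framing_bytes=0):
    """Wire time in microseconds for one snapshot of num_inputs, packed 8 per byte.
    bits_per_char=10 assumes 8N1: 1 start + 8 data + 1 stop bit.
    framing_bytes models optional message-level overhead (hypothetical)."""
    payload_bytes = math.ceil(num_inputs / 8)
    total_bits = (payload_bytes + framing_bytes) * bits_per_char
    return total_bits * 1_000_000 / baud

def sample_rate_hz(transfer_us):
    """Snapshot rate if frames are sent back-to-back."""
    return 1_000_000 / transfer_us

t = uart_transfer_us(48)
print(t, round(sample_rate_hz(t)))  # 60.0 16667
```

Passing `framing_bytes=1` (say, a hypothetical sync byte) pushes the transfer to 70 microseconds, or roughly 14.3 kHz.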
Most games only sample once per frame, but some do sample once every scanline (about 15.7 kHz for NTSC video), so you're just good enough. You could potentially sync the sampling to the horizontal sync (i.e. trigger off the video sync line) to keep things consistent rather than free-running, but I doubt you'd notice either way. Even extremely demanding applications like music rhythm games or fighting games are fine at 500-1000 Hz (and arguably lower).
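To put the scanline comparison in numbers, here is a rough Python sketch of the worst-case added latency for a free-running sampler. It assumes frames go out back-to-back, so the sample period equals the 60-microsecond UART transfer time, and it uses an approximate NTSC line rate of 15,734 Hz; `worst_case_latency_us` is a hypothetical helper.

```python
NTSC_LINE_RATE_HZ = 15_734  # approximate NTSC horizontal scan rate

def worst_case_latency_us(sample_period_us, transfer_us):
    """An input that changes just after a snapshot is latched waits one
    full sample period for the next latch, then one transfer on the wire."""
    return sample_period_us + transfer_us

line_period_us = 1_000_000 / NTSC_LINE_RATE_HZ  # ~63.6 us per scanline

# Back-to-back UART frames: sample period == transfer time == 60 us
sample_period_us = 60.0
transfer_us = 60.0

print(sample_period_us <= line_period_us)  # True: at least one snapshot per scanline
latency = worst_case_latency_us(sample_period_us, transfer_us)
print(latency)            # 120.0
print(latency < 1_000)    # True: well under a 1 kHz polling interval
```

Syncing the latch to horizontal sync would not shrink this bound much; it mainly makes the sampling instant repeatable from line to line.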
Most SPI controllers can go a fair bit faster. 8 MHz is common on the AVRs used in Arduinos, though the CPU can barely service the SPI peripheral that fast, and you do need to move the data from the IO port to the SPI controller or vice versa. The ARMs I use in a lot of designs these days can often do 25-50 MHz on their SPI ports, but you essentially need DMA to actually keep up with that. I'd expect you could get about 200-250 kHz effective sampling rate using SPI and two back-to-back 8-bit micros if you really wanted to, but it's wholly unnecessary.
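The "barely service" point is easy to quantify: with the bus running back-to-back, the CPU gets a fixed budget of clock cycles per byte. A minimal Python sketch, assuming a 16 MHz AVR with its SPI clock at 8 MHz (f_cpu/2); `cpu_cycles_per_byte` is a hypothetical helper.

```python
def cpu_cycles_per_byte(f_cpu_hz, f_sck_hz, bits_per_byte=8):
    """CPU clock cycles available between consecutive SPI bytes when the
    bus runs with no gaps between bytes."""
    return f_cpu_hz * bits_per_byte / f_sck_hz

# 16 MHz AVR with the SPI clock at f_cpu/2 = 8 MHz
print(cpu_cycles_per_byte(16_000_000, 8_000_000))  # 16.0
```

Those 16 cycles have to cover reading the input port, writing the SPI data register, and checking the transfer-complete flag, which is why a tight polling loop (or DMA, on parts that have it) tends to be needed rather than an interrupt per byte.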
Shift registers can usually be clocked at 5-10 MHz at the very least, depending on the logic family. AHC-series parts can usually go upwards of 25-50 MHz.
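A shift-register front end works by latching every input at once and then clocking the bits out serially. The Python model below is a toy that captures the logic only, not electrical timing. It assumes daisy-chained 8-bit parallel-in/serial-out parts in the style of the 74xx165, with the serial input of the first chip tied low; `PisoChain`, `read_all`, and `read_time_us` are hypothetical names.

```python
class PisoChain:
    """Toy logic model of daisy-chained 8-bit parallel-in/serial-out
    shift registers (74xx165 style), serial input of the first chip tied low."""

    def __init__(self, num_chips):
        self.num_bits = 8 * num_chips
        self.pins = [False] * self.num_bits  # levels on the parallel inputs
        self.reg = [False] * self.num_bits   # internal shift-register state

    def load(self):
        # Parallel-load strobe: snapshot every input at the same instant.
        self.reg = list(self.pins)

    def serial_out(self):
        # Bit currently presented at the end of the chain.
        return self.reg[-1]

    def clock(self):
        # One clock edge: shift toward the output; a 0 enters at the far end.
        self.reg = [False] + self.reg[:-1]


def read_all(chain):
    """Latch once, then clock out every bit; returns bits in pin order."""
    chain.load()
    bits = []
    for _ in range(chain.num_bits):
        bits.append(chain.serial_out())
        chain.clock()
    return bits[::-1]  # the last pin in the chain comes out first


def read_time_us(num_bits, f_clk_hz):
    """Clock-out time only; ignores the load pulse and software overhead."""
    return num_bits * 1_000_000 / f_clk_hz


chain = PisoChain(6)                          # 6 chips = 48 inputs
chain.pins = [i % 3 == 0 for i in range(48)]  # arbitrary test pattern
print(read_all(chain) == chain.pins)          # True
print(read_time_us(48, 5_000_000))            # 9.6
```

At 5 MHz the 48 bits take 9.6 microseconds to clock out, and at 25 MHz about 1.9; in practice the limit is often how fast the MCU can generate the clock (for example with its SPI peripheral) rather than the shift registers themselves.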
Note that at speeds above roughly 1 MHz, so-called "high-speed" or "transmission line" effects can become significant. Above 10 MHz, you're almost certainly going to run into problems if your cabling is more than a foot or two long, especially with a parallel bus. Good power bypassing (0.1 uF ceramic capacitors) on the ICs is imperative.