Looks like I'm late to the party!
As far as I know you can read the controller port at any time, especially this way, since we treat the D0 line as a plain digital input. There's no latching involved, which as a bonus means it's immune to the DMC false read problems. That external NSF thing I did worked pretty much the way you described. It was simple for me since I didn't have to worry about the NMI: I didn't have to update the display or really worry about the engine. I just polled constantly in an infinite loop, and the only thing in my NMI was an RTI.
The code pretty much went like this:
InfLoop:
  LDA $4017     ; Read the port; only the first bit (A button, D0) matters - it's basically a digital in
  AND #$01      ; Keep D0 only; the upper bits are open bus and usually aren't zero
  CMP oldState
  BEQ +         ; Same as the old state (0->0, 1->1), so nothing changed - skip
  STA oldState
  CMP #$01      ; Being here implies a change; only 0->1 (a rising edge) plays a "frame"
  BNE +
  JSR PlayNSF
+ JMP InfLoop
So on any rising edge the NSF steps to the next frame; in the PR8 case a rising edge would trigger a step. If we were using 24ppqn we could divide it down pretty simply using some 7400 logic as suggested earlier. The big gotcha is polling that line often enough that we don't miss a transition.
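To sanity-check the edge detect and the divide-down idea away from the hardware, here's a rough Python model of the polling loop. It's only a sketch: `count_steps`, `square_wave` and the `divide_by` parameter are names I made up for illustration, the divider is done in software here rather than in 7400 logic, and the divide-by-6 figure is just an example ratio (24 pulses in, 4 steps out), not anything the PR8 requires.

```python
def count_steps(samples, divide_by=1):
    """Return how many times PlayNSF would be called for a list of port reads."""
    old_state = 0
    edges = 0
    steps = 0
    for raw in samples:
        state = raw & 0x01          # keep only D0, ignoring open-bus bits
        if state == old_state:
            continue                # 0->0 or 1->1: nothing changed
        old_state = state
        if state == 1:              # 0->1: rising edge
            edges += 1
            if edges % divide_by == 0:
                steps += 1          # this is where JSR PlayNSF would go
    return steps


def square_wave(pulses, high_len, low_len):
    """Hypothetical clock input: each pulse is low_len low reads, then high_len high reads."""
    out = []
    for _ in range(pulses):
        out += [0] * low_len + [1] * high_len
    return out


clock = square_wave(24, high_len=2, low_len=2)   # one quarter note at 24ppqn

every_edge = count_steps(clock)                  # step on every rising edge
divided = count_steps(clock, divide_by=6)        # step on every 6th rising edge
too_slow = count_steps(clock[::4])               # poll only once per clock period
```

With these inputs `every_edge` is 24 and `divided` is 4, while `too_slow` is 0: polling once per clock period happens to land on the low half every time, so every transition is missed. That's the gotcha in a nutshell, so the loop has to come back around at least once during each high and each low phase.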
I'll mull it over when I get home from work and post back. Excellent job as always, Neil.