The first step was to disable all the debug output from the compatibility library; that halved the rendering time to a shade over 24s. I found that rather disappointing; I was expecting the debug output to take up at least two-thirds of the time.
I turned on optimisation when compiling the library; -O4 reduced the render time to under 18s.
Two optimisation suggestions from Jake Waskett were to ensure that jump targets were on 16-byte boundaries, and to fix up the calls to scan from a fixed location so that they jump straight to the returned address, meaning that on the second attempt there isn't a relatively expensive scan call. The former shaved about 0.1s off the render time, but the latter gave a significant improvement, taking the render time down to just over 15s.
During all this, I noticed the SETcc operations and realised that I could use SETO %al ; LAHF to get all the necessary x86 flags into %ax (previously, I'd been using pushf/popf). Of course, when I googled for that combination I found a description of someone writing an ARM JIT compiler using pushf, with a comment recommending the seto/lahf combination; it's all about knowing what to look for! Anyway, once the flags are in %ax, it's fairly easy to get the four flags we want into the bottom nibble of %al by rotating %ah, masking and or'ing with %al: ( ror $3, %ah ; and $0xe, %ah ; or %ah, %al ). The flags aren't in the same order used by ARM, but a 16-byte lookup table can translate between the two in one instruction.
Obviously, there are more flag reads than flag writes (there's no point in setting the flags if they aren't going to be read at least once), so I added a new global variable, set at the same time as the flags, containing a 16-entry bitmap with one bit per condition code (EQ, NE, GT, etc.); a conditional instruction then just has to test a known bit in a known variable and use the ZF to behave appropriately.
That change was fairly major (but only affected three files), and took the render time down to 13 seconds.
The next thing to try was to eliminate the extra code for each load or store that checks for non-aligned accesses. The idea is to set the x86 flag that causes a SIGBUS signal to be generated for unaligned accesses (the AC bit in EFLAGS) and to load the registers as necessary in the signal handler before moving on to the next instruction. Unaligned accesses in ARM code will probably be relatively rare, and the speedup in the normal memory accesses should more than make up for the slower signal handling. Since the only routines called from emulated ARM code are scan_arm_code and (when debugging is enabled) dump_regs, those routines would reset the flag on entry and restore its state on exit.
That "optimisation" slowed the render time down to 15s again.
Since the only unaligned access from scan_arm_code is likely to be when setting a 32-bit constant in an instruction, I stopped manipulating the flag in scan_arm_code and instead tried modifying cache_32bit to write its four bytes one at a time; the time improved again, to a little over 12s. However, the code I was testing didn't include any unaligned accesses, and since I hadn't written the signal handler anyway, I've decided to call it a day for the time being and leave that optimisation out.
Future optimisation possibilities:
Finish the SIGBUS solution
Improve the hash table lookup
Use mov $constant,arm_emulator_regs[n] for constant loads into registers
Combine consecutive ARM instructions that load a constant into a register
Remember whether the flags (or a register's contents) are already in a register from last time
All of these things have a chance of making the scan_arm_code routine slower and negating their speed improvements, but they're probably worth a try.
The other thing to do is to profile the ARM code somewhat by generating code to increment counters when, for example, flags are set or read, or scan_arm_code or get_hash_entry is called. At the moment, I notice that a sequence of ARM instructions leading up to a decision point (a conditional jump, SWI, etc.) is rarely much more than ten instructions.
The instruction emulator file stands at 1761 lines (55158 bytes).