In JamVM 1.5.0 I released the "inlining interpreter" which copies code blocks together in a similar way to a simple JIT (but the code is compiled by gcc, rather than being generated natively as in a JIT). This achieved an impressive speed improvement and I've been keen to optimise it further.
The major thing which has been in my sights is the remaining dispatches between adjacent basic blocks. For instance the fallthrough edge of an "if" and across the edge caused by a jump target.
Currently, the unit of inlining is a basic block because this has only one entry point and one exit point. Blocks which contain instructions which need to be rewritten (quickening, e.g. after symbolic resolution) must be executed first using threaded dispatch. If we inlined across blocks we will end up with blocks with multiple entry and multiple exit points. In this case, depending on control flow, we may complete the block before all the block has been executed, or we may never reach the end of the block because a side exit is always taken. In the first case, inlining can't be done, and in the second inlining will never occur.
My first approach to solve this was simplistic, and was more of an experiment to "test the water". So I wasn't too surprised when it didn't yield any speed improvement.
My second attempt was much more complex (after inlining check the edges and create a longer block if the blocks across the edge are both inlined, and fix up internal targets). But this showed no significant general speed improvement (order of 2%) although specific microbenchmarks showed over 100%.
So for the last week I've been trying to explain the results. After some experiments I've come to the conclusion that it is due to instruction cache locality. Basically, the merged block (which may be merged many times) ends up in a different location to the non-mergeable blocks which remain in the initial position. Previously, inlining exhibited good cache locality due to blocks being allocated in execution order. This was destroyed by block merging. The effects of this counteracted the speed improvements leading to no change overall.
This was the position I was in on Friday (which added to my despondency). However, on the weekend I rethought the problem and came up with a third approach. I've partially implemented it and hopefully should be able to test it in a few days. Fingers crossed!