Could you break each verse into three images: bass clef, verse, treble clef?
Then replace the verse image with a much smaller image containing only the linking bar.
Then simply put the verse text, as real text, right after the bar image.
So it'd be something like this:
[image: BASS CLEF with NOTATION]
[image: LINKING BAR]Text of Verse
[image: TREBLE CLEF with NOTATION]
Then it's just a matter of figuring out how to tell whatever tool/format you're using to place the pieces close enough together that no gap shows.
More work, but the best of both worlds -- and the image slicing is probably scriptable.
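As a rough sketch of the scripting side, here is one way to work out the three crop rectangles for a single verse image. Everything named here is hypothetical: `slice_boxes`, the `Box` type, and the parameters `verse_top`, `verse_bottom`, and `bar_width` are my own, and it assumes the linking bar sits at the left edge of the verse band and that you already know the pixel rows where the verse band starts and ends.

```python
from typing import Dict, NamedTuple


class Box(NamedTuple):
    """A crop rectangle in pixels: (left, upper, right, lower)."""
    left: int
    upper: int
    right: int
    lower: int


def slice_boxes(width: int, height: int,
                verse_top: int, verse_bottom: int,
                bar_width: int) -> Dict[str, Box]:
    """Return crop boxes for the top staff, the linking bar, and the bottom staff.

    Assumption: the verse band spans rows verse_top..verse_bottom and the
    linking bar occupies the leftmost bar_width pixels of that band.
    """
    if not (0 < verse_top < verse_bottom < height):
        raise ValueError("verse band must lie strictly inside the image")
    if not (0 < bar_width <= width):
        raise ValueError("bar_width must be between 1 and the image width")
    return {
        # Everything above the verse band, full width.
        "top": Box(0, 0, width, verse_top),
        # Only the narrow strip holding the linking bar.
        "bar": Box(0, verse_top, bar_width, verse_bottom),
        # Everything below the verse band, full width.
        "bottom": Box(0, verse_bottom, width, height),
    }


boxes = slice_boxes(width=800, height=300,
                    verse_top=120, verse_bottom=180, bar_width=20)
for name, box in boxes.items():
    print(name, tuple(box))
```

If you use Pillow, each of these tuples is in the order that `Image.crop()` expects, so cropping and saving the three pieces should be a short loop over the dictionary.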
m a r