From your post, I understand that you want to take an image X and a caption C and produce a new image Y that shows the caption C beneath the image X.
Do you have a strong reason for preferring this over the following simple markup?
Code:
<div>
  <img ... />
  <p>Caption text here</p>
</div>
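If plain markup turns out to be enough, a slightly more semantic variant could look like the sketch below. It assumes HTML5 elements are acceptable for your page; the file name `photo.jpg`, the alt text, and the inline styles are placeholders I made up for illustration, not anything taken from your post.

```html
<!-- Hypothetical file name and alt text; substitute your own. -->
<!-- inline-block keeps the figure from stretching to the full page width; -->
<!-- text-align centers the caption line beneath the image. -->
<figure style="display: inline-block; margin: 0; text-align: center;">
  <img src="photo.jpg" alt="Description of the image" />
  <figcaption>Caption text here</figcaption>
</figure>
```

The `<figure>`/`<figcaption>` pair ties the caption to the image for browsers and assistive tools, which a bare `<div>` and `<p>` do not. If you really need a single image file containing both, this will not help, and the caption would have to be drawn into the image itself.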