When I retrieve a Project Gutenberg ebook in HTML form, I usually leave the page-number anchors (the href targets) in, but remove the visible page numbers with a regex, as in the Perl example below:
Code:
#Remove page numbering (non-greedy, so several pagenum spans on one line are each handled separately)
$html =~ s#<span class='pagenum'><([^>]+)>.*?</span>#<$1>#gi ;
$html =~ s#<span class="pagenum"><([^>]+)>.*?</span>#<$1>#gi ;
It leaves just the opening <a name/id> tag, i.e. <$1>.
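
For anyone who wants to try this on a sample line, here is a minimal sketch of a variant, not the exact code above. It assumes the element inside the span is always an <a> tag, handles both quote styles in one pattern, and puts the closing </a> back so the anchor stays well-formed. The helper name strip_page_numbers and the sample line are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: drop the visible page number from each pagenum span,
# keeping the anchor inside it as an empty, properly closed <a> element.
# Assumption: the span always wraps a single <a ...>...</a>.
sub strip_page_numbers {
    my ($html) = @_;
    # (['"]) matches either quote style; \1 requires the same closing quote.
    # (<a\b[^>]*>) captures the opening anchor tag with its name/id attributes.
    # .*? is non-greedy, so each span is matched on its own.
    # /s lets the match cross a line break inside the span.
    $html =~ s#<span class=(['"])pagenum\1>(<a\b[^>]*>).*?</a>\s*</span>#$2</a>#gis;
    return $html;
}

# Made-up sample line in the usual Project Gutenberg style.
my $sample = q{<p><span class="pagenum"><a name="Page_12" id="Page_12">[Pg 12]</a></span>It was a dark night.</p>};

print strip_page_numbers($sample), "\n";
# prints: <p><a name="Page_12" id="Page_12"></a>It was a dark night.</p>
```

If the books you process nest something other than an <a> inside the span, the pattern simply won't match and the span is left untouched, which is easier to spot than a half-eaten line.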