Getting rid of the footer text and page count

MiB · 12-13-2015, 06:53 PM

I'm converting a book from PDF, so I have a page footer with a page count that I want to lose.

I came up with:

Code:

<p class="calibre1"> <i class="calibre4">Title of the Book by Author</i> <i class="calibre4">[0-9]*</i></p>

and in my test app RegExRX this matches any number in the i tag, like 1, 10, 22, 150, 2354 and so on in my source example text from the book.

It would seem Calibre didn't remove even one of these. Why? Doesn't Calibre support the regex above?
How can I make sure every occurrence throughout the book is removed?

eschwartz · 12-13-2015, 07:43 PM

Are you sure you are matching the right stuff?
You need to match against the input content, not the post-processed result.

In the S&R tab of the conversion dialog, click the wand button to get a preview of the pdftohtml result, which is what the regex will operate on.

Or in the Editor, you can S&R the EPUB (?) with more granularity.

MiB · 12-14-2015, 02:22 PM

Quote:

Originally Posted by eschwartz

Are you sure you are matching the right stuff?
You need to match against the input content, not the post-processed result.

In the S&R tab of the conversion dialog, click the wand button to get a preview of the pdftohtml result, which is what the regex will operate on.

Or in the Editor, you can S&R the EPUB (?) with more granularity.

The input content is pdf text, not html, no? I tried with

Code:

Title of the Book by Author [0-9]*

but this doesn't work either. I'd prefer it to work on the pdf text like this as Calibre is marking up the same string in different ways. This makes it hard to write a regex to find all variations of the tags and remove them along with the text.

eschwartz · 12-14-2015, 02:28 PM

Well, once again check the pdftohtml intermediate content. Use the Regex Builder wizard to make sure you match the right stuff.

There will be HTML, not just text. pdftohtml is a third-party utility that comes from poppler, and it should be predictable enough -- calibre performs the S&R before stomping all over the markup with its CSS-flattening algorithm.

Normally the regex is applied to the raw contents of the input format, i.e. unzipped EPUB/AZW3 (X)HTML. But PDF is, ah, complicated, so it has to be turned into HTML before you can convert that HTML to something else.
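As a rough illustration of the point above, here is a minimal Python sketch. Both HTML snippets are invented for this example; the real pdftohtml output for a given book will look different and should be checked in the wizard preview. The sketch only shows that a pattern written against the converted book's markup (the `calibre1`/`calibre4` classes) finds nothing in input that lacks those classes.

```python
import re

# Pattern from the first post, written against the converted book's markup.
pattern = re.compile(
    r'<p class="calibre1"> <i class="calibre4">Title of the Book by Author</i> '
    r'<i class="calibre4">[0-9]*</i></p>'
)

# Hypothetical snippets, for illustration only (not real pdftohtml output).
converted_output = (
    '<p class="calibre1"> <i class="calibre4">Title of the Book by Author</i> '
    '<i class="calibre4">22</i></p>'
)
intermediate_input = '<i>Title of the Book by Author</i> <i>22</i><br/>'

# The pattern fits the post-conversion markup but not the intermediate input.
matches_converted = bool(pattern.search(converted_output))
matches_input = bool(pattern.search(intermediate_input))
print(matches_converted, matches_input)
```

If the second value is false for the real intermediate content, the S&R expression has nothing to remove during conversion, which would be consistent with what was reported in the first post.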

MiB · 12-18-2015, 05:03 AM

Quote:

Originally Posted by eschwartz

Normally the regex is applied to the raw contents of the input format, i.e. unzipped EPUB/AZW3 (X)HTML. But PDF is, ah, complicated, so it has to be turned into HTML before you can convert that HTML to something else.

Thanks for these clarifications. In the end I gave up and cut and pasted the html into Sublime Text and RegexRX where I made a more open regex and managed to get rid of all footers. Some of the issues stemmed from the fact that Calibre used different HTML classes for the same footer.

This was quite hard to discover in Calibre. Hopefully I learned something for my next title.
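The "more open regex" itself is not shown in the thread. One possible shape, offered only as a sketch: allow any attributes on the tags and any whitespace between the pieces, so that differing class names on the same footer no longer matter. The sample markup and the name `FOOTER` are assumptions made up for this example.

```python
import re

# Tolerate any attributes on <p>/<i> and flexible whitespace; require at
# least one digit for the page number ([0-9]+ rather than [0-9]*).
FOOTER = re.compile(
    r'<p\b[^>]*>\s*<i\b[^>]*>Title of the Book by Author</i>\s*'
    r'<i\b[^>]*>[0-9]+</i>\s*</p>'
)

# Hypothetical sample: the same footer marked up with two different classes.
sample = (
    '<p class="calibre1"> <i class="calibre4">Title of the Book by Author</i> '
    '<i class="calibre4">10</i></p>\n'
    '<p class="calibre7">Body text stays.</p>\n'
    '<p class="calibre2"><i class="calibre9">Title of the Book by Author</i> '
    '<i class="calibre9">2354</i></p>\n'
)

# subn returns the cleaned text and the number of replacements made.
cleaned, removed = FOOTER.subn('', sample)
print(removed)
print(cleaned)
```

The replacement count from `subn` gives a quick check that every footer variant was caught.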

eschwartz · 12-19-2015, 09:43 PM

That last bit I said was just an explanation as to why you cannot merely open the book in calibre's Editor to see what needs changing, etc.

You ALWAYS need to use the Regex Builder in calibre in order to find out what to change, or do something else to examine the SOURCE INPUT.
PDF is slightly more complicated because the source input comes from pdftohtml.
But you still need to look at the SOURCE INPUT.

I don't know how many times I need to say this, but calibre's S&R operates on the SOURCE INPUT!
Looking at the converted book in the Editor is worthless, because the book has already been converted and therefore it is no longer the SOURCE INPUT.

Here is a picture, in case it helps....

MiB · 12-25-2015, 05:15 AM

Quote:

Originally Posted by eschwartz

You ALWAYS need to use the Regex Builder in calibre in order to find out what to change, or do something else to examine the SOURCE INPUT.
PDF is slightly more complicated because the source input comes from pdftohtml.
But you still need to look at the SOURCE INPUT.

I don't know how many times I need to say this, but calibre's S&R operates on the SOURCE INPUT!
Looking at the converted book in the Editor is worthless, because the book has already been converted and therefore it is no longer the SOURCE INPUT.

Here is a picture, in case it helps....

I did find this wizard, but as I found positive results to be absent from the results it was easier to see what you were doing. What I lack is a way to see the results and adjust the regex until I get the matches I want. It would be useful if regex patterns could be shared easily between users, so that if I make a pattern or need some insight on a specific problem, there is a pattern library I can look at, learn from, and share my results to.

I'll try the wizard again for my next title. It's wonderful to get my fav pdfs as e-books just as with the books that came also in e-book format from the beginning. Calibre is a useful tool.

eschwartz · 12-25-2015, 02:44 PM

Oh, well if you just need help with using the right regex to match specific text, why didn't you say so?

Post some code examples and we can help you write a regex.

theducks · 12-25-2015, 03:48 PM

MIB
I find it Oh So Much easier to do these kinds of tasks with the E-book Editor, where you can see exactly what you get (and what works) when searching.
Also, you can do multiple passes, rather than trying for a catchall on a single pass.

S&R in Sigil has a right click: tokenize in the search box that will help tame the search instead of a wild wildcard
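The multiple-pass idea mentioned above can be sketched outside any editor in a few lines of Python. This is only an illustration under assumptions: the sample markup, the `passes` list, and the `run_passes` helper are all hypothetical, and the patterns would need adjusting to the actual book.

```python
import re

# Each pass is a (pattern, replacement) pair applied in order.
passes = [
    (re.compile(r'<i\b[^>]*>Title of the Book by Author</i>'), ''),  # footer title
    (re.compile(r'<i\b[^>]*>[0-9]+</i>'), ''),                        # page number
    (re.compile(r'<p\b[^>]*>\s*</p>\n?'), ''),                        # now-empty paragraphs
]

def run_passes(text):
    """Apply every pass in sequence and report how many hits each one had."""
    counts = []
    for pattern, replacement in passes:
        text, n = pattern.subn(replacement, text)
        counts.append(n)
    return text, counts

# Hypothetical sample: one footer paragraph followed by real content.
sample = (
    '<p class="calibre1"> <i class="calibre4">Title of the Book by Author</i> '
    '<i class="calibre4">22</i></p>\n'
    '<p class="calibre1">Chapter text.</p>\n'
)
result, counts = run_passes(sample)
print(counts)
print(result)
```

Note that the second pass, as written, would also delete any italicised bare number in the body text, so small passes like this still need checking one at a time.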

eschwartz · 12-25-2015, 03:52 PM

@theducks

You cannot edit PDF in the Editor. Or in Sigil.

You can dump the PDF with pdftohtml and import that into the Editor I guess...

theducks · 12-25-2015, 03:54 PM

Quote:

Originally Posted by eschwartz

@theducks

You cannot edit PDF in the Editor. Or in Sigil.

You can dump the PDF with pdftohtml and import that into the Editor I guess...

Convert first

Then clean

eschwartz · 12-25-2015, 04:07 PM

Yeah, after CSS flattening.

Conversion is ideally the last of all possible steps...

MiB · 01-01-2016, 05:01 AM

Quote:

Originally Posted by eschwartz

Oh, well if you just need help with using the right regex to match specific text, why didn't you say so?

Post some code examples and we can help you write a regex.

I need this help in Calibre. As it is now to just get a matching pattern I can use RegExRX.

The real problem is that after this I have to find out why it didn't work. It would be better if I interactively could see how many matches I'd get for a pattern already in pre-conversion in Calibre.
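Outside Calibre, a match count for a candidate pattern can be had with a few lines of Python, assuming the intermediate HTML has been copied out of the wizard preview. This is a sketch only: the `count_matches` helper and the sample markup are made up for illustration and are not real pdftohtml output.

```python
import re

def count_matches(pattern, html):
    """Return the number of non-overlapping matches of pattern in html."""
    return sum(1 for _ in re.finditer(pattern, html))

# Hypothetical sample in which the same footer is marked up in two ways.
html = (
    '<i>Title of the Book by Author</i> <i>1</i><br/>\n'
    'Some body text<br/>\n'
    '<b>Title of the Book by Author</b> <b>2</b><br/>\n'
)

# A strict pattern and a looser one that accepts either <i> or <b>.
strict = r'<i>Title of the Book by Author</i> <i>[0-9]+</i>'
loose = r'<(i|b)>Title of the Book by Author</\1> <(i|b)>[0-9]+</\2>'

strict_count = count_matches(strict, html)
loose_count = count_matches(loose, html)
print(strict_count, loose_count)
```

Comparing the two counts shows at a glance whether a pattern is missing some variants of the footer.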

It was way easier to clean the converted results in Sublime, but this does feel like a lot of work with copy and paste.

A side issue is how you can influence the HTML generation, as much of it uses non-optimal tag structures. For example, I'd love it if you could help the conversion by suggesting what in the current title signifies a heading, what is a list, and so on.
