Old 12-13-2015, 06:53 PM   #1
MiB
Junior Member
MiB began at the beginning.
 
Posts: 8
Karma: 10
Join Date: Dec 2015
Device: iPhone
Getting rid of the footer text and page count

I'm converting a book from PDF, so I have a page footer with a page count that I want to lose.

I came up with:

Code:
<p class="calibre1"> <i class="calibre4">Title of the Book by Author</i> <i class="calibre4">[0-9]*</i></p>
and in my test app, RegExRX, this matches any number in the i tag, like 1, 10, 22, 150, 2354 and so on, in my sample text from the book.

It would seem Calibre didn't remove even one of these. Why? Doesn't Calibre support the regex above?
How can I make sure every occurrence throughout the book is removed?
Old 12-13-2015, 07:43 PM   #2
eschwartz
Ex-Helpdesk Junkie
eschwartz ought to be getting tired of karma fortunes by now.
Posts: 19,421
Karma: 85400180
Join Date: Nov 2012
Location: The Beaten Path, USA, Roundworld, This Side of Infinity
Device: Kindle Touch fw5.3.7 (Wifi only)
Are you sure you are matching the right stuff?
You need to match against the input content, not the post-processed result.

In the S&R tab of the conversion dialog, click the wand button to get a preview of the pdftohtml result, which is what the regex will operate on.
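To illustrate why a pattern copied from the converted book can fail, here is a minimal Python sketch (calibre's search and replace uses Python regular expressions). Both sample strings are assumptions made up for illustration: `converted` mimics the markup from the first post, while `intermediate` is only a guess at what pdftohtml-style output might look like. The wand preview shows the real thing.

```python
import re

# Pattern from the first post, written against the converted book's markup.
pattern = re.compile(
    r'<p class="calibre1"> <i class="calibre4">Title of the Book by Author</i> '
    r'<i class="calibre4">[0-9]*</i></p>'
)

# Hypothetical samples: the real intermediate HTML will differ per PDF.
converted = ('<p class="calibre1"> <i class="calibre4">Title of the Book by Author</i> '
             '<i class="calibre4">150</i></p>')
intermediate = '<i>Title of the Book by Author</i> <i>150</i><br/>'

matches_converted = pattern.search(converted) is not None
matches_intermediate = pattern.search(intermediate) is not None

print(matches_converted)
print(matches_intermediate)
```

The pattern is valid and matches the converted markup, but it finds nothing in the made-up intermediate text, because the `calibre1`/`calibre4` classes only exist after conversion.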



Or in the Editor, you can S&R the EPUB (?) with more granularity.

Last edited by eschwartz; 12-13-2015 at 07:45 PM.
Old 12-14-2015, 02:22 PM   #3
MiB
Quote:
Originally Posted by eschwartz
Are you sure you are matching the right stuff?
You need to match against the input content, not the post-processed result.

In the S&R tab of the conversion dialog, click the wand button to get a preview of the pdftohtml result, which is what the regex will operate on.



Or in the Editor, you can S&R the EPUB (?) with more granularity.
The input content is pdf text, not html, no? I tried with
Code:
Title of the Book by Author [0-9]*
but this doesn't work either. I'd prefer it to work on the PDF text like this, as Calibre marks up the same string in different ways. This makes it hard to write a regex that finds all variations of the tags and removes them along with the text.
Old 12-14-2015, 02:28 PM   #4
eschwartz
Well, once again check the pdftohtml intermediate content. Use the Regex Builder wizard to make sure you match the right stuff.

There will be HTML, not just text. pdftohtml is a third-party utility that comes from poppler, and it should be predictable enough -- calibre performs the S&R before stomping all over the markup with its CSS-flattening algorithm.



Normally the regex is applied to the raw contents of the input format, i.e. unzipped EPUB/AZW3 (X)HTML. But PDF is, ah, complicated, so it has to be turned into HTML before you can convert that HTML to something else.

Last edited by eschwartz; 12-14-2015 at 02:31 PM.
Old 12-18-2015, 05:03 AM   #5
MiB
Quote:
Originally Posted by eschwartz
Normally the regex is applied to the raw contents of the input format, i.e. unzipped EPUB/AZW3 (X)HTML. But PDF is, ah, complicated, so it has to be turned into HTML before you can convert that HTML to something else.
Thanks for these clarifications. In the end I gave up and cut and pasted the html into Sublime Text and RegexRX where I made a more open regex and managed to get rid of all footers. Some of the issues stemmed from the fact that Calibre used different HTML classes for the same footer.
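A more open pattern of the kind described here might look like the sketch below. It is not the regex MiB actually used (that was never posted), and the sample lines and class names are invented for illustration. The idea is to allow any tags between the fixed footer text and the page number instead of spelling out one exact class.

```python
import re

# Lazily match any run of tags and whitespace.
TAG = r'(?:<[^>]+>|\s)*?'

footer = re.compile(
    r'<p\b[^>]*>' + TAG
    + r'Title of the Book by Author'
    + TAG + r'[0-9]+' + TAG
    + r'</p>'
)

# Hypothetical variants of the same footer, plus one ordinary paragraph.
samples = [
    '<p class="calibre1"> <i class="calibre4">Title of the Book by Author</i> <i class="calibre4">22</i></p>',
    '<p class="calibre7"><i class="calibre9">Title of the Book by Author</i> <i class="calibre9">2354</i></p>',
    '<p class="calibre1">An ordinary paragraph that mentions 1984.</p>',
]

cleaned = [footer.sub('', s) for s in samples]
for line in cleaned:
    print(repr(line))
```

Both footer variants are removed regardless of class name, while the ordinary paragraph is left alone because it does not contain the footer text.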

This was quite hard to discover in Calibre. Hopefully I learned something for my next title.
Old 12-19-2015, 09:43 PM   #6
eschwartz
That last bit I said was just an explanation as to why you cannot merely open the book in calibre's Editor to see what needs changing, etc.


You ALWAYS need to use the Regex Builder in calibre in order to find out what to change, or do something else to examine the SOURCE INPUT.
PDF is slightly more complicated because the source input comes from pdftohtml.
But you still need to look at the SOURCE INPUT.


I don't know how many times I need to say this, but calibre's S&R operates on the SOURCE INPUT!
Looking at the converted book in the Editor is worthless, because the book has already been converted and therefore it is no longer the SOURCE INPUT.

Here is a picture, in case it helps....
[Attached image: calibre-convert-s&r.png]
Old 12-25-2015, 05:15 AM   #7
MiB
Quote:
Originally Posted by eschwartz

You ALWAYS need to use the Regex Builder in calibre in order to find out what to change, or do something else to examine the SOURCE INPUT.
PDF is slightly more complicated because the source input comes from pdftohtml.
But you still need to look at the SOURCE INPUT.


I don't know how many times I need to say this, but calibre's S&R operates on the SOURCE INPUT!
Looking at the converted book in the Editor is worthless, because the book has already been converted and therefore it is no longer the SOURCE INPUT.

Here is a picture, in case it helps....
I did find this wizard, but since the matches I expected were absent from its results, it was hard to see what I was doing wrong. What I lack is a way to see the results and adjust the regex until I get the results I want. It would be useful if regex patterns could be shared easily between users, so that if I need some insight into a specific problem there's a pattern library I can look at and learn from, as well as share my results to.

I'll try the wizard again for my next title. It's wonderful to get my fav pdfs as e-books just as with the books that came also in e-book format from the beginning. Calibre is a useful tool.
Old 12-25-2015, 02:44 PM   #8
eschwartz
Oh, well if you just need help with using the right regex to match specific text, why didn't you say so?

Post some code examples and we can help you write a regex.
Old 12-25-2015, 03:48 PM   #9
theducks
Well trained by Cats
theducks ought to be getting tired of karma fortunes by now.
Posts: 30,959
Karma: 60358908
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
MiB
I find it Oh So Much easier to do these kinds of tasks with the E-book Editor, where you can see exactly what you get (and what works) when searching.
Also, you can do multiple passes, rather than trying for a catchall on a single pass.

S&R in Sigil has a right-click "tokenize" option in the search box that will help tame the search instead of using a wild wildcard.
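The multiple-pass idea can be sketched in Python as an ordered list of small, specific substitutions instead of one catch-all. The patterns and the sample text are hypothetical. In the Editor you would run each search by hand and check the matches before replacing.

```python
import re

# Each pass targets one known variant of the footer (hypothetical patterns).
passes = [
    (r'<p class="calibre1"> <i class="calibre4">Title of the Book by Author</i> '
     r'<i class="calibre4">[0-9]+</i></p>', ''),
    (r'<p class="calibre2">Title of the Book by Author [0-9]+</p>', ''),
]

def run_passes(text, passes):
    """Apply each (pattern, replacement) in order and report the hit count per pass."""
    counts = []
    for pattern, replacement in passes:
        text, n = re.subn(pattern, replacement, text)
        counts.append(n)
    return text, counts

sample = (
    '<p class="calibre3">Body text.</p>'
    '<p class="calibre1"> <i class="calibre4">Title of the Book by Author</i> <i class="calibre4">10</i></p>'
    '<p class="calibre3">More body text.</p>'
    '<p class="calibre2">Title of the Book by Author 11</p>'
)

result, counts = run_passes(sample, passes)
print(counts)
print(result)
```

A pass that reports zero hits is a quick signal that its pattern does not fit the markup and needs another look.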
Old 12-25-2015, 03:52 PM   #10
eschwartz
@theducks

You cannot edit PDF in the Editor. Or in Sigil.

You can dump the PDF with pdftohtml and import that into the Editor I guess...
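A rough sketch of that route in Python, assuming poppler's `pdftohtml` is installed and on the PATH. The flags used here (`-noframes`, `-i`, `-enc`) are standard poppler options, but they are not necessarily the ones calibre passes internally, so the output may differ from what the conversion sees. The file names are placeholders.

```python
import subprocess
from pathlib import Path

def pdftohtml_command(pdf_path, out_base):
    """Build a pdftohtml (poppler) command line; out_base is the output name without '.html'."""
    return [
        'pdftohtml',
        '-noframes',       # write a single HTML file instead of a frameset
        '-i',              # ignore images
        '-enc', 'UTF-8',   # output text encoding
        str(pdf_path),
        str(out_base),
    ]

def dump_pdf(pdf_path, out_base):
    """Run pdftohtml; raises CalledProcessError if the tool reports failure."""
    subprocess.run(pdftohtml_command(pdf_path, out_base), check=True)
    # With -noframes the HTML is expected at <out_base>.html
    return Path(str(out_base) + '.html')

cmd = pdftohtml_command('book.pdf', 'book_dump')
print(' '.join(cmd))
# To actually run it: html_file = dump_pdf('book.pdf', 'book_dump')
```

The resulting HTML file could then be imported into the Editor or searched with any regex tool.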
Old 12-25-2015, 03:54 PM   #11
theducks
Quote:
Originally Posted by eschwartz
@theducks

You cannot edit PDF in the Editor. Or in Sigil.

You can dump the PDF with pdftohtml and import that into the Editor I guess...
Convert first
Then clean
Old 12-25-2015, 04:07 PM   #12
eschwartz
Yeah, after CSS flattening.

Conversion is ideally the last of all possible steps...
Old 01-01-2016, 05:01 AM   #13
MiB
Quote:
Originally Posted by eschwartz
Oh, well if you just need help with using the right regex to match specific text, why didn't you say so?

Post some code examples and we can help you write a regex.
I need this help in Calibre. As it is now, I can use RegExRX just to get a matching pattern.

The real problem is that after this I have to find out why it didn't work. It would be better if I could see interactively, before conversion, how many matches a pattern would get in Calibre.
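Outside calibre, a match count can be had with a few lines of Python, since calibre's regexes are Python regexes. This is only a sketch: it assumes the intermediate HTML has been saved to a file by hand (the name `book_dump.html` in the comment is a placeholder), and the inline sample stands in for that file.

```python
import re

def count_matches(pattern, html):
    """Return the number of non-overlapping matches and the first few matched strings."""
    found = [m.group(0) for m in re.finditer(pattern, html)]
    return len(found), found[:5]

# Stand-in for: html = open('book_dump.html', encoding='utf-8').read()
html = (
    'Some text<br/>\n'
    'Title of the Book by Author 9<br/>\n'
    'More text<br/>\n'
    'Title of the Book by Author 10<br/>\n'
)

total, preview = count_matches(r'Title of the Book by Author [0-9]+', html)
print(total, preview)
```

Comparing the count against the number of pages in the PDF gives a quick sanity check on whether the pattern catches every footer.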

It was way easier to clean the converted results in Sublime, but this does feel like a lot of work with copy and paste.

A side issue is how you can influence the HTML generation, as much of it uses non-optimal tag structures. For example, I'd love it if you could help the conversion by suggesting what in the current title signifies a heading, what's a list, and so on.