Old 11-23-2009, 07:33 PM   #16
user_none
Sigil & calibre developer
Posts: 2,487
Karma: 1063785
Join Date: Jan 2009
Location: Florida, USA
Device: Nook STR
Quote:
Originally Posted by matthias View Post
which seems a little strange to me, since it looks like this in neither the input nor the parsed folder.
To me it looks like the regex is applied somewhere between converting the input and producing the parsed output, but certainly before the tags are closed (which in my eyes isn't very user-friendly, or is at least confusing, since no one seems to know it and it's shown in neither the input nor the parsed folder)
Honestly, it's gone through so many revisions I've pretty much lost track of it. For some reason I have the feeling it's matching input after all of the internal regexes are applied. I'll get it cleaned up and working properly soonish.

Quote:
Originally Posted by matthias View Post
Also, I think the preconfigured regex (the default immediately after installation) should be adjusted to the new behaviour, because I don't think the standard regex will remove any header or footer anymore.
Another change I have on my todo list. I plan on removing the option to have it applied. The default will be removed and it will automatically apply the regex if one is present.

Eventually the new PDF input engine will be complete and removing the header and footer will be automatic. This regex based system will be renamed to "remove content".
Old 11-24-2009, 03:13 AM   #17
matthias
Enthusiast
Posts: 25
Karma: 4212
Join Date: Nov 2009
Location: South Tyrol, Italy
Device: Sony Reader PRS-505
It is just a bit confusing since it's not shown anywhere in the debug folder (and probably not well documented or explained).
An (easy) example with explanations somewhere near the place where the regex is entered would probably help people find the right regex (I had to check various sites to find the right regex syntax, and probably not everyone has the patience or time to do so).

I actually like the possibility to check or uncheck the "remove footer" checkbox (or even a "remove content" checkbox), since I don't want to apply the regex to every book (some have page numbers, others don't), and the way it is now I don't have to copy or delete the regex I generally use; I just have to check or uncheck that box.
Old 01-12-2010, 06:26 PM   #18
ac4lt
Connoisseur
Posts: 61
Karma: 36
Join Date: Jan 2010
Location: Reston, Virginia, US
Device: ipad
I'm having problems with this as well. Here are the details:

Using calibre 0.6.33, I'm trying to convert a pdf to an epub.

In the pdf the last line of the page is a line number. I'm trying to write a regex to remove this.

Setting the debug for conversion I've been able to look at the input, parsed and processed directories.

An example from the input directory shows the last couple of lines of a page:
Code:
The stranger had clambered through the ditch and up the bank,<br>
8<br>
Looking at the parsed directory, I see this:
Code:
The stranger had clambered through the ditch and up the bank, 8</p><p>
My headers have also already been removed, even though we're only at the parsed step, and the PDF line unwrapping appears to have been done.

The processed directory shows:
Code:
The stranger had clambered through the ditch and up the bank, 8</p><p class="calibre1">
Though the header was no problem, I can't find a regex to remove the footer. I've tried:

Code:
\d+<br>
and
Code:
\d+</p><p>
and lots of other variations that don't work.

It's not clear to me when the regex processing is done. That is to say, whether it is before or after the conversion to xhtml and line unwrapping have occurred. Speculating, I'd say it's after.

The problem is that it appears impossible to refer to P tags in the regex. They never work.

I've tried everything suggested in this thread and so far nothing has worked.

Anyone have any ideas?
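
The three debug snapshots quoted above can be checked against the attempted patterns outside calibre. This is a minimal sketch using Python's standard `re` module. The sample strings are copied from this post; the helper `try_patterns` and the third pattern are illustrative assumptions, not anything calibre provides. It only shows which snapshot each pattern could match and says nothing about the stage at which calibre actually applies the footer regex.

```python
import re

# Sample text copied from the input, parsed and processed debug directories.
SAMPLES = {
    "input": "The stranger had clambered through the ditch and up the bank,<br>\n8<br>\n",
    "parsed": "The stranger had clambered through the ditch and up the bank, 8</p><p>",
    "processed": 'The stranger had clambered through the ditch and up the bank, 8</p><p class="calibre1">',
}

# The first two patterns are the ones tried in this post. The third is a
# hypothetical variant that tolerates attributes on the opening <p> tag.
PATTERNS = [r"\d+<br>", r"\d+</p><p>", r"\d+\s*</p><p[^>]*>"]


def try_patterns(samples, patterns):
    """Return {pattern: [names of the snapshots in which the pattern matches]}."""
    return {
        pattern: [stage for stage, text in samples.items() if re.search(pattern, text)]
        for pattern in patterns
    }


if __name__ == "__main__":
    for pattern, stages in try_patterns(SAMPLES, PATTERNS).items():
        print(f"{pattern!r:28} matches: {stages}")
```

Each of the two original patterns matches exactly one snapshot, so if neither works during conversion, the text the regex sees is presumably in some other intermediate form.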
Old 01-13-2010, 04:25 PM   #19
poodlemama
Member
Posts: 10
Karma: 10
Join Date: Jan 2010
Device: Sony PRS 600
Wow, I am so glad to have found this thread! Thank y'all!
Old 01-14-2010, 02:58 AM   #20
matthias
Enthusiast
Posts: 25
Karma: 4212
Join Date: Nov 2009
Location: South Tyrol, Italy
Device: Sony Reader PRS-505
Hello ac4lt,

Try this one. I had a similar problem, and it worked for me when I tried it (that was in a previous version, but I don't think it has changed yet).

Code:
\d+<p>
or even
Code:
\s*\d+<p>
depending on how the result looks (the second one also removes double blanks).
I hope it still works, but I can't test it at the moment because I don't have calibre installed at work.
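
To see how the two suggested patterns differ, here is a small sketch with Python's `re` module. The `INTERMEDIATE` string is a made-up assumption about what the text might look like if the page number sits directly before an opening `<p>` tag; it is not taken from calibre's debug output. `strip_footer` is a hypothetical helper that deletes every match.

```python
import re

# Hypothetical intermediate text: the page number sits directly before an
# opening <p> tag, with no closing </p> in between (an assumption).
INTERMEDIATE = "the bank, 8<p> and over the wall"

# Abbreviated from the parsed-stage text quoted earlier in the thread.
PARSED = "the bank, 8</p><p>"


def strip_footer(pattern, text):
    """Delete every match of pattern from text."""
    return re.sub(pattern, "", text)


if __name__ == "__main__":
    for pattern in (r"\d+<p>", r"\s*\d+<p>"):
        print(repr(pattern), "->", repr(strip_footer(pattern, INTERMEDIATE)))
        print("  matches parsed text:", bool(re.search(pattern, PARSED)))
```

On the made-up string, the first pattern leaves a double space behind and the second does not. Neither matches the parsed-stage text, because there the digits are followed by `</p>` and not by `<p>`.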

Last edited by matthias; 01-14-2010 at 03:01 AM.
Old 01-14-2010, 11:19 AM   #21
poodlemama
Member
Posts: 10
Karma: 10
Join Date: Jan 2010
Device: Sony PRS 600
Question

Quote:
Originally Posted by matthias View Post
so now the regexp is applied to the "parsed" output of the debug folder?

in my case, the new regexp should be
Code:
(\d+\s*</p><p>)
since the file shows

Code:
11  </p><p>
but it still won't work.
I have been applying the above to one of my epubs, and it starts off highlighting the appropriate areas in the book, but when I convert it still doesn't work... I have "Remove Footer" checked, so what am I skipping?
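
One thing worth checking with a pattern like this is what else it would remove. The sketch below lists every match in a sample, roughly like a highlight list. Only the `11  </p><p>` fragment comes from the thread; the rest of `SAMPLE` and the `preview` helper are made up for illustration, and this is not how calibre's wizard is implemented.

```python
import re

# Pattern quoted in this post.
FOOTER = re.compile(r"(\d+\s*</p><p>)")

# Hypothetical parsed-style sample. Only the "11  </p><p>" fragment is
# taken from the thread; the sentence ending in a year is invented.
SAMPLE = "end of page eleven 11  </p><p>It happened in 1984</p><p>next"


def preview(regex, text):
    """Return (start, end, matched_text) for each match of regex in text."""
    return [(m.start(), m.end(), m.group(0)) for m in regex.finditer(text)]


if __name__ == "__main__":
    for start, end, matched in preview(FOOTER, SAMPLE):
        print(f"{start:3}-{end:3}: {matched!r}")
```

Here the pattern also picks up the year that happens to end a paragraph, so any number at the end of a paragraph would be removed along with the page numbers.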
Old 01-14-2010, 12:53 PM   #22
ac4lt
Connoisseur
Posts: 61
Karma: 36
Join Date: Jan 2010
Location: Reston, Virginia, US
Device: ipad
As with poodlemama, neither works for me.
Old 01-14-2010, 04:59 PM   #23
poodlemama
Member
Posts: 10
Karma: 10
Join Date: Jan 2010
Device: Sony PRS 600
OK let me ask if this is the correct process...

I have a PDF book... I add it to my Calibre library... I go to convert ebook and update all metadata, then go to Page Setup... make sure input and output are correct... go to Structure Detection... click on "Remove Footer" and click on the wizard tool... type in (\d+\s*</p><p>) and see the highlighted page numbers and codes... push OK and then click OK to start the conversion... what am I missing?
Old 01-14-2010, 06:07 PM   #24
Sydney's Mom
Wizard
Posts: 2,899
Karma: 6995721
Join Date: Dec 2008
Location: Idaho, on the side of a mountain
Device: Kindle Oasis, Fire 3d Gen and 5th Gen and Samsung Tab S
I have not been able to get PDF conversion to work with headers and footers. I convert to PRC with Mobipocket Creator, and then import into Calibre.
Old 01-15-2010, 05:36 AM   #25
matthias
Enthusiast
Posts: 25
Karma: 4212
Join Date: Nov 2009
Location: South Tyrol, Italy
Device: Sony Reader PRS-505
Try leaving out the </p>; it won't be highlighted in the preview, but for me it worked in the conversion.
Old 01-15-2010, 06:32 AM   #26
ac4lt
Connoisseur
Posts: 61
Karma: 36
Join Date: Jan 2010
Location: Reston, Virginia, US
Device: ipad
Quote:
Originally Posted by matthias View Post
Try leaving out the </p>; it won't be highlighted in the preview, but for me it worked in the conversion.
I did leave out the "</p>" when I tried your suggestions. Unfortunately, it didn't help.
Old 01-15-2010, 07:21 AM   #27
DoctorOhh
US Navy, Retired
Posts: 9,897
Karma: 13806776
Join Date: Feb 2009
Location: North Carolina
Device: Icarus Illumina XL HD, Kindle PaperWhite SE 11th Gen
Quote:
Originally Posted by ac4lt View Post
Unfortunately, it didn't help.
I can't help you get it to work in Calibre; I use a workaround.

I import the PDF into the free download of Mobipocket Creator. Importing it strips out some trashy headers and creates an HTML file.


UPDATE: It was too long ago for my memory to be reliable so I just grabbed a couple of horrible PDF files and reinstalled Mobipocket Creator. I discovered I was wrong.

Mobipocket Creator removed many atrocious headers for me in the past so I fooled my old memory into thinking it worked all the time. Even when it removed a bad text header I had to manually remove page numbers or other junk using wordpad or MS Word. I primarily used Mobipocket Creator to remove some trash headers and change a PDF to HTML so I could do a quick clean up of the html file.

Sorry for the confusion.... now where else did I post this error?

Last edited by DoctorOhh; 01-17-2010 at 09:49 PM. Reason: My Info was wrong
Old 01-15-2010, 07:24 AM   #28
ac4lt
Connoisseur
Posts: 61
Karma: 36
Join Date: Jan 2010
Location: Reston, Virginia, US
Device: ipad
I'll take a look at that option. Thanks for the info!
Old 01-17-2010, 08:49 PM   #29
mago55
Junior Member
Posts: 6
Karma: 10
Join Date: Nov 2009
Device: sony ebook PR-505
An easy solution at last?

Quote:
Originally Posted by dwanthny View Post
I can't help you get it to work in Calibre; I use a workaround.

I import the PDF into the free download of Mobipocket Creator. Importing it strips out headers and footers and creates an HTML file. Now you can either drag the HTML to Calibre or you can have it build an ebook. Building an ebook will create a PRC file. The PRC file can be dragged to Calibre and converted to the format of your choice.

The HTML and the PRC file end up in My Documents\My Publications by default.

It is quick and easy and does what I need while keeping my sanity intact.
Hi everyone! I've been going mad over those headers and footers. English is not my first language and I have to work hard to understand all those terms.
I have no idea what a "header regular expression" or a "regex" is, or how to work the wand button, although I have tried many times.
Like everyone else, I'm trying to convert a PDF, but in my case it is to LRF, the Sony ebook format.
I've read what dwanthny said about Mobipocket Creator, and I think it sounds much easier than removing headers and footers with Calibre.
My questions are: what exactly does it mean to "build an ebook"? Once I create an HTML file from a PDF with Mobipocket, can I just drag that HTML into Calibre and convert it into an LRF for my Sony ebook, or do I have to convert the HTML into a PRC first and then to LRF?
I apologize if my questions are not very clear, and I thank you for any help with this!
Old 01-17-2010, 09:45 PM   #30
DoctorOhh
US Navy, Retired
Posts: 9,897
Karma: 13806776
Join Date: Feb 2009
Location: North Carolina
Device: Icarus Illumina XL HD, Kindle PaperWhite SE 11th Gen
Quote:
Originally Posted by mago55 View Post
What exactly does it mean to "build an ebook"? Once I create an HTML file from a PDF with Mobipocket, can I just drag that HTML into Calibre and convert it into an LRF for my Sony ebook, or do I have to convert the HTML into a PRC first and then to LRF?
I was wrong. It was too long ago for my memory to be reliable so I just grabbed a couple of horrible PDF files and reinstalled Mobipocket Creator.

Mobipocket Creator removed many atrocious headers for me in the past so I fooled my old memory into thinking it worked all the time. Even when it removed a bad text header I had to manually remove page numbers or other junk using wordpad or MS Word. I primarily used Mobipocket Creator to remove some trash headers and change a PDF to HTML so I could do a quick clean up of the html file.

Sorry for the confusion.... now where else did I post this error?
Tags
calibre pdf footer remove