Old 10-14-2009, 03:44 AM   #1
cdecaf
Junior Member
cdecaf began at the beginning.
 
Posts: 9
Karma: 10
Join Date: Oct 2009
Device: prs 505
Remove Footer

Hi,

I have been tinkering with this for two days now and couldn't find a clue.

My problem is that I want to convert a purchased PDF to EPUB (so sorry, no sample, but I'll try to find one). The page numbers of the PDF show up in the middle of the text, so I want them gone.

In the regex window (the one that comes up after clicking the magic wand next to "remove footer") the page numbers show up as
Code:
45 </p><p>
so I tried
Code:
\d+ *</p><p>
Code:
[0-9]+ *</p><p>
and
Code:
\d+ *\<\/p\>\<p\>
All of these mark the correct regions in the window, but none of them has any effect on the produced EPUB.

Just as a test I tried
Code:
\d+
- and yes it removes ALL numbers from the converted text.
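
For anyone following along, the difference between the bare \d+ and the tag-anchored pattern can be checked outside calibre with Python's re module. This is only a sketch: the sample string is made up to resemble what the regex window shows, and it does not reproduce calibre's conversion pipeline.

```python
import re

# Hypothetical sample: a number in the body text, then a page number
# followed by a paragraph break, as shown in the regex window.
sample = "It was 1984 when he left. 45 </p><p>The next page starts here."

# Bare \d+ removes every run of digits, including the year in the body text.
all_digits_removed = re.sub(r"\d+", "", sample)

# Anchoring on the following tags only hits digits directly before </p><p>.
page_number_removed = re.sub(r"\d+ *</p><p>", "", sample)

print(all_digits_removed)
print(page_number_removed)
```

Note that the anchored pattern also removes the </p><p> itself, so the two paragraphs are joined, which is usually what you want when a page break split a paragraph.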

Also I tried
Code:
ebook-convert book.pdf .epub --debug-pipeline
but it seems the option is no longer available; all I get is the help text for ebook-convert.

So does anybody have an idea what I could try?
Thanks.

By the way, I'm using calibre 0.6.17 on OS X Snow Leopard.
Old 10-14-2009, 05:59 AM   #2
user_none
Sigil & calibre developer
Posts: 2,488
Karma: 1063785
Join Date: Jan 2009
Location: Florida, USA
Device: Nook STR
The header removal does appear to be broken. Can you open a ticket so I don't forget to fix it?

Also, --debug-pipeline has been renamed to --debug.
Old 10-14-2009, 06:47 AM   #3
user_none
I take that back, it is working. Make sure you have the remove footer checkbox checked above the regex.
Old 10-14-2009, 10:08 AM   #4
cdecaf
Ah, that --debug helps a lot.

Looking at the directories it produces, I think the regex window shows the contents of the "parsed" subdirectory, but the regexp acts on the "input".
I also noticed that in the regexp I need to replace every "&nbsp;" with "\xA0".
Old 10-14-2009, 10:48 AM   #5
cdecaf
OK, now I'm happy. I found it in the "input" produced by --debug.

The regexp that worked for me:
Code:
\d+\xA0*<br>
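
A quick way to see why the plain-space patterns from the first post miss while this one hits is to try both in Python's re module. This is a sketch with a made-up sample; it assumes the text the regex sees contains the actual no-break space character (U+00A0) rather than the literal "&nbsp;", which is what the \xA0 escape matches.

```python
import re

# Hypothetical sample: page number followed by two no-break spaces and <br>.
text = "end of page.<br>\n45\xa0\xa0<br>\nstart of next page.<br>"

# " *" only matches the ordinary space character, so this finds nothing.
plain_space = re.compile(r"\d+ *<br>")

# \xA0 in a regex stands for the no-break space character itself.
nbsp_escape = re.compile(r"\d+\xA0*<br>")

print(plain_space.search(text))
print(nbsp_escape.sub("", text))
```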
Old 10-14-2009, 05:18 PM   #6
user_none
Quote:
Originally Posted by cdecaf View Post
If I look at the produced directories i think the regex window shows the "parsed" subdirectory. But the regexp acts on the "input".
I will correct this.
Old 10-14-2009, 05:33 PM   #7
cdecaf
That would be great.

Thanks for the quick pointers also.
Old 11-19-2009, 03:21 PM   #8
matthias
Enthusiast
Posts: 25
Karma: 4212
Join Date: Nov 2009
Location: South Tyrol, Italy
Device: Sony Reader PRS-505
Hello,

I found this topic while looking for help with the same problem.

I tried it yesterday at work with an older version (I think it was 0.6.22) and it worked without any problems, but today, on another PC, the same regexp won't work (even with the same PDF file).

In the "input" folder produced by debug it shows up like this:
Code:
11&nbsp;&nbsp;<br>
The regexp I'm trying to use to remove the page numbers is
Code:
(\d+\xA0*<br>)
The exact same regexp worked with the older version, but it won't today. Have there been any changes since that older version, or is something wrong in the regexp? I've now tried everything I can think of, but I can't find the answer (although I'm not very comfortable with regexps).
Old 11-19-2009, 04:28 PM   #9
user_none
Quote:
Originally Posted by matthias View Post
In the "input" folder produced by debug it shows up like this:
Code:
11&nbsp;&nbsp;<br>
The regexp I'm trying to use to remove the page numbers is
Code:
(\d+\xA0*<br>)
The exact same regexp worked with the older version, but it won't today. Have there been any changes since that older version, or is something wrong in the regexp?
In newer versions the regex is applied later in the conversion pipeline; 0.6.22 sounds about right for when the change was made. Entities such as &nbsp; are now converted to the character they represent before the regex is applied.
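
To illustrate the ordering issue described above, here is a rough sketch in plain Python. It is not calibre's code: html.unescape just stands in for whatever entity conversion the pipeline does, and both patterns are examples only. A pattern written against the literal entity text stops matching once entities have been converted, while one written against the character itself starts matching.

```python
import html
import re

raw = "11&nbsp;&nbsp;<br>"        # as it appears in the debug "input" folder
converted = html.unescape(raw)    # &nbsp; becomes the character U+00A0

# Example pattern written against the literal entity text.
entity_pattern = re.compile(r"\d+(?:&nbsp;)*<br>")
# Example pattern written against the no-break space character.
char_pattern = re.compile(r"\d+\xA0*<br>")

for name, text in (("raw", raw), ("converted", converted)):
    print(name,
          "entity pattern:", bool(entity_pattern.search(text)),
          "char pattern:", bool(char_pattern.search(text)))
```

This only covers the entity part of the change; it says nothing about what happens to the tags by the time the regex runs.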
Old 11-19-2009, 04:38 PM   #10
matthias
So now the regexp is applied to the "parsed" output in the debug folder?

In my case, the new regexp should be
Code:
(\d+\s*</p><p>)
since the file shows

Code:
11  </p><p>
but it still won't work.
Old 11-19-2009, 04:48 PM   #11
kovidgoyal
creator of calibre
Posts: 43,826
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
EDIT: Never mind
Old 11-20-2009, 03:31 AM   #12
matthias
I still can't get it to work...

In the debug "input" folder it shows me the following string:
Code:
11&nbsp;&nbsp;<br>
In the debug "parsed" folder it then becomes
Code:
11  </p><p>
Earlier (now that I've checked, I was using 0.6.16 when it worked) I used
Code:
\d+\xA0*<br>
which doesn't work anymore, since it seems something has changed in the pipeline. But replacing the \xA0 with \s (which should match any whitespace character) doesn't work either.

I tried \d+\s+, which works and is applied somewhere between the input and the parsed stage, but it isn't quite what I wanted, since it removes every number in the file that is followed by whitespace.

When exactly is the regex applied to the file? Directly on the input file, or are there steps in the pipeline that change the code before the regex is applied?
Old 11-22-2009, 01:29 PM   #13
matthias
Isn't anyone else having trouble removing page numbers from their PDFs?

Before the regex is applied to the input, what happens to the "<br>" tag?
It seems it's no longer there when the regex is applied.

How do I have to change my previous regex (used in 0.6.16) to get it to work with the newer versions of calibre (which is a great program in my eyes)?

Last edited by matthias; 11-23-2009 at 05:55 AM.
Old 11-23-2009, 06:15 AM   #14
matthias
I have been able to resolve my problem using:
Code:
(\d+\s+<p>)
which seems a little strange to me, since the text doesn't look like this in either the input or the parsed folder.
To me it looks like the regex is applied somewhere in between converting the input to the parsed output, but definitely before the tags are closed (which in my eyes isn't very user-friendly, or is at least confusing, since no one seems to know about it, and it's shown in neither the input nor the parsed folder).
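
As a rough way to picture this, the three patterns tried in this thread can be run against a guessed intermediate form in Python. The intermediate string below is purely hypothetical: the thread never shows what the regex actually sees, only that "<br>" and "</p><p>" don't seem to be present at that point while an opening "<p>" is.

```python
import re

# Purely hypothetical intermediate text: page number, no-break spaces,
# then an opening <p> with no closing tag before it yet.
intermediate = "last words of the page.\n11\xa0\xa0\n<p>First words of the next page."

# The patterns tried in this thread, oldest first.
candidates = [r"\d+\xA0*<br>", r"\d+\s*</p><p>", r"(\d+\s+<p>)"]

# Keep only the patterns that find something in the guessed text.
matching = [p for p in candidates if re.search(p, intermediate)]
print(matching)
```

The parentheses in the last pattern are just a capturing group and don't change what it matches. In Python 3, \s also matches the no-break space in str patterns; older Python versions may behave differently without the UNICODE flag.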

Also, I think the preconfigured regex (the one set immediately after installation) should be adjusted to the new behavior, because I don't think the standard regex will remove any footers or headers anymore.

I hope one of the developers will take care of this problem, or at least explain the thinking behind it, so we can better understand when exactly the regex is applied.
Old 11-23-2009, 03:02 PM   #15
Ydieh
Junior Member
Ydieh began at the beginning.
 
Posts: 1
Karma: 10
Join Date: Nov 2009
Device: iPod touch
Quote:
Originally Posted by matthias View Post
I have been able to resolve my problem using:
Code:
(\d+\s+<p>)
which seems a little strange to me, since the text doesn't look like this in either the input or the parsed folder.
To me it looks like the regex is applied somewhere in between converting the input to the parsed output, but definitely before the tags are closed (which in my eyes isn't very user-friendly, or is at least confusing, since no one seems to know about it, and it's shown in neither the input nor the parsed folder).

Also, I think the preconfigured regex (the one set immediately after installation) should be adjusted to the new behavior, because I don't think the standard regex will remove any footers or headers anymore.

I hope one of the developers will take care of this problem, or at least explain the thinking behind it, so we can better understand when exactly the regex is applied.
Thanks for the solution to my problem.
I tried "<br>" and "</p><p>" in my expression and neither of them worked.
Using only "<p>" works.