Old 10-18-2009, 11:21 PM   #1
robertlc
Junior Member
robertlc began at the beginning.
 
Posts: 4
Karma: 10
Join Date: Oct 2009
Device: Kindle 2
[Old Thread] Removing ABBYY header in a PDF

I have a few PDF files someone else converted using ABBYY PDF Transformer.

Each page has a graphic in both top corners.

Looking at the page in Calibre's header wizard shows the encoding behind the graphic as this:

<a href="http://www.abbyy.com/buy"><b>PDF Transform</b></a><p>
<a href="http://www.abbyy.com/buy"><b>PDF Transform</b></a></p><p>
<a href="http://www.abbyy.com/buy"><b>Y</b></a></p><p>
<a href="http://www.abbyy.com/buy"><b>Y</b></a></p><p>
<a href="http://www.abbyy.com/buy"><b>Y</b></a></p><p>
<a href="http://www.abbyy.com/buy"><b>er</b></a></p><p>
<a href="http://www.abbyy.com/buy"><b>Y</b></a></p><p>
<a href="http://www.abbyy.com/buy"><b>er</b></a></p><p>
<a href="http://www.abbyy.com/buy"><b>B</b></a></p><p>
<a href="http://www.abbyy.com/buy"><b>2</b></a></p><p>
<a href="http://www.abbyy.com/buy"><b>B</b></a></p><p>
<a href="http://www.abbyy.com/buy"><b>2</b></a></p><p>
<a href="http://www.abbyy.com/buy"><b>B</b></a></p><p>
<a href="http://www.abbyy.com/buy"><b>.0</b></a></p><p>
<a href="http://www.abbyy.com/buy"><b>B</b></a></p><p>
<a href="http://www.abbyy.com/buy"><b>.0</b></a></p><p>
<a href="http://www.abbyy.com/buy"><b>A</b></a></p><p>
<a href="http://www.abbyy.com/buy"><b>A</b></a></p><p>
<a href="http://www.abbyy.com/buy"><b>Click here to buy</b></a></p><p>
<a href="http://www.abbyy.com/buy"><b>Click here to buy</b></a></p><p>
<a href="http://www.abbyy.com/buy"><b>w</b></a></p><p>
<a href="http://www.abbyy.com/buy"><b>w</b></a></p><p>
<a href="http://www.abbyy.com/buy"><b>w</b></a></p><p>
<a href="http://www.abbyy.com/buy"><b>w</b></a></p><p>
<a href="http://www.abbyy.com/buy"><b>w . </b></a></p><p>
<a href="http://www.abbyy.com/buy"><b>w</b></a></p><p>
<a href="http://www.abbyy.com/buy"><b>A B B YY.com</b></a></p><p>
<a href="http://www.abbyy.com/buy"><b>.A B BYY.com</b></a></p><p>

When converted to a mobi file, I get a bunch of lines that start with "PDF Transform" and then several letters, a couple of "Click here to buy," some more letters and then the "A B B YY.com" and ".A B BYY.com".

On my Kindle 2, this makes up about 2.5 pages I have to skip through every 3 or so pages and is annoying.

Can someone please tell me what I need to enter in the "Header Regular Expression" box?

Thanks in advance!
Old 10-19-2009, 05:03 AM   #2
hairybiker
Banned
hairybiker began at the beginning.
 
Posts: 82
Karma: 10
Join Date: Aug 2009
Device: Tolino Shine 3
I have the same issue, so if anybody has a fix I would be grateful as well.
Old 10-19-2009, 08:36 PM   #3
charleski
Wizard
charleski ought to be getting tired of karma fortunes by now.
 
Posts: 1,196
Karma: 1281258
Join Date: Sep 2009
Device: PRS-505
Have you tried just using Notepad on the html and doing a global search/replace on the offending lines? (replace field left blank).
Old 10-20-2009, 04:51 AM   #4
hairybiker
Banned
hairybiker began at the beginning.
 
Posts: 82
Karma: 10
Join Date: Aug 2009
Device: Tolino Shine 3
That's what I have been doing until now, except using TextCrawler and a macro. If it could be removed on import it would be easier, especially since I have to run the app in a VM.
Old 10-20-2009, 10:21 AM   #5
DerSchwarzePrinz
Member
DerSchwarzePrinz began at the beginning.
 
Posts: 24
Karma: 16
Join Date: Aug 2009
Device: Pocketbook 360, Sony PRS-T1
Quote:
Originally Posted by robertlc View Post
I have a few PDF files someone else converted using ABBYY PDF Transformer.

Each page has a graphic in both top corners.

Looking at the page in Calibre's header wizard shows the encoding behind the graphic as this:

<...>

When converted to a mobi file, I get a bunch of lines that start with "PDF Transform" and then several letters, a couple of "Click here to buy," some more letters and then the "A B B YY.com" and ".A B BYY.com".

On my Kindle 2, this makes up about 2.5 pages I have to skip through every 3 or so pages and is annoying.

Can someone please tell me what I need to enter in the "Header Regular Expression" box?

Thanks in advance!
Have you tried the following regular expression?

PDF Transform .+ \.com

PDF Transform = start of the text that should be removed
.+ = one or more characters
\.com = end of the text that should be removed
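
As a rough illustration of how these three parts behave, here is a small Python sketch. It assumes the header text has been flattened onto a single line (the sample string is made up from the fragments quoted in this thread) and uses Python's `re` module; how calibre itself applies the expression is not shown here. The spaces in the expression are literal, so ` \.com` only matches when a space comes directly before `.com`.

```python
import re

# Hypothetical single-line sample built from the fragments quoted in this thread.
sample = "PDF Transform Y er B 2 .0 A Click here to buy w w . A B B YY.com Chapter 1"

with_spaces = re.compile(r"PDF Transform .+ \.com")  # spaces are literal characters
without_spaces = re.compile(r"PDF.+\.com")           # no literal spaces required

# " \.com" needs a space directly before ".com"; the sample ends in "YY.com".
print(with_spaces.search(sample))        # None
print(without_spaces.sub("", sample))    # " Chapter 1"
```

Because `.+` is greedy, the second expression removes everything from the first "PDF" to the last ".com" on the line, so it is worth checking in the wizard that no body text sits between them.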
Old 10-20-2009, 11:31 AM   #6
hairybiker
Banned
hairybiker began at the beginning.
 
Posts: 82
Karma: 10
Join Date: Aug 2009
Device: Tolino Shine 3
Interestingly, that didn't do anything, but changing it to PDF.+\.com does remove all the junk.

Cheers, must learn regex.
Old 10-20-2009, 12:57 PM   #7
robertlc
Junior Member
robertlc began at the beginning.
 
Posts: 4
Karma: 10
Join Date: Oct 2009
Device: Kindle 2
Quote:
Originally Posted by DerSchwarzePrinz View Post
Have you tried the following regular expression?

PDF Transform .+ \.com

PDF Transform = start of the text that should be removed
.+ = one or more characters
\.com = end of the text that should be removed

Doesn't work when I put it in Calibre. The text is still there.

How did you do it HairyBiker?
Old 10-20-2009, 01:02 PM   #8
robertlc
Junior Member
robertlc began at the beginning.
 
Posts: 4
Karma: 10
Join Date: Oct 2009
Device: Kindle 2
Quote:
Originally Posted by hairybiker View Post
Interestingly, that didn't do anything, but changing it to PDF.+\.com does remove all the junk.

Cheers, must learn regex.
Just tried it your way HairyBiker, and it doesn't remove the junk for me.
Old 10-20-2009, 01:07 PM   #9
hairybiker
Banned
hairybiker began at the beginning.
 
Posts: 82
Karma: 10
Join Date: Aug 2009
Device: Tolino Shine 3
I just selected the PDF, chose Convert, then under Structure Detection clicked Remove Header and put "PDF.+\.com" into the box, replacing the default expression. If you click on the wizard, it will show you what is being removed.
Old 10-20-2009, 02:43 PM   #10
robertlc
Junior Member
robertlc began at the beginning.
 
Posts: 4
Karma: 10
Join Date: Oct 2009
Device: Kindle 2
Did all that. Can't get rid of it all.
Old 10-20-2009, 04:32 PM   #11
hairybiker
Banned
hairybiker began at the beginning.
 
Posts: 82
Karma: 10
Join Date: Aug 2009
Device: Tolino Shine 3
Strange. Could you send me a copy of one where the header doesn't get removed?
Old 10-20-2009, 07:59 PM   #12
charleski
Wizard
charleski ought to be getting tired of karma fortunes by now.
 
Posts: 1,196
Karma: 1281258
Join Date: Sep 2009
Device: PRS-505
Quote:
Originally Posted by hairybiker View Post
That's what I have been doing until now, except using TextCrawler and a macro. If it could be removed on import it would be easier, especially since I have to run the app in a VM.
But then calibre would be turning into an app for editing the files rather than simply converting them.

If you have a large number of files that need identical editing, it might be more efficient to write a script to pass them through grep and then pipe the output to calibre's command-line converter.
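
A script along those lines might look roughly like the sketch below. It uses Python instead of grep, assumes the books are already available as HTML files in one directory, and assumes the watermark always has the link-plus-bold shape shown in the first post. `strip_watermark` and `clean_directory` are hypothetical helper names, not part of calibre.

```python
import re
from pathlib import Path

# Assumed pattern: one ABBYY "buy" link wrapping a bold fragment, as in the first post.
WATERMARK = re.compile(
    r'<a href="http://www\.abbyy\.com/buy"><b>[^<]*</b></a>', re.IGNORECASE
)


def strip_watermark(html: str) -> str:
    """Return html with every watermark link removed."""
    return WATERMARK.sub("", html)


def clean_directory(src: Path, dst: Path) -> list[Path]:
    """Write a cleaned copy of each .html file in src to dst; return the new paths."""
    dst.mkdir(parents=True, exist_ok=True)
    written = []
    for path in sorted(src.glob("*.html")):
        cleaned = strip_watermark(path.read_text(encoding="utf-8"))
        out = dst / path.name
        out.write_text(cleaned, encoding="utf-8")
        written.append(out)
    return written
```

The originals are left untouched; the cleaned copies in the output directory could then be handed to calibre's command-line converter one by one.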
Old 10-21-2009, 09:24 AM   #13
hairybiker
Banned
hairybiker began at the beginning.
 
Posts: 82
Karma: 10
Join Date: Aug 2009
Device: Tolino Shine 3
If I was better at Linux command scripting then that is what I would do, but since I am still learning it ...
Old 10-21-2009, 09:55 AM   #14
cdecaf
Junior Member
cdecaf began at the beginning.
 
Posts: 9
Karma: 10
Join Date: Oct 2009
Device: prs 505
Hi, I had a similar problem recently. (see https://www.mobileread.com/forums/showthread.php?t=59282 )

My problem was the regexp editor doesn't show you the text the regex acts on.

Try something like:
Code:
(?ism)<a href="http://www.abbyy.com/buy"><b>(\w|\s)*</b></a>(<br>)?
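
To see what this expression does and does not catch, here is a small Python check against a few of the lines from the first post. It is only a sketch using Python's `re` module, and the `[^<]*` variant is an assumption added for illustration, not something calibre or this thread prescribes. `(\w|\s)*` accepts only letters, digits, underscores and whitespace, so fragments containing a dot, such as `.0` or `A B B YY.com`, are not matched.

```python
import re

# A few lines copied from the markup in the first post.
sample = '''<a href="http://www.abbyy.com/buy"><b>PDF Transform</b></a><p>
<a href="http://www.abbyy.com/buy"><b>Y</b></a></p><p>
<a href="http://www.abbyy.com/buy"><b>.0</b></a></p><p>
<a href="http://www.abbyy.com/buy"><b>Click here to buy</b></a></p><p>
<a href="http://www.abbyy.com/buy"><b>w . </b></a></p><p>
<a href="http://www.abbyy.com/buy"><b>A B B YY.com</b></a></p><p>
'''

# The expression as posted: (\w|\s)* cannot match a "." inside the <b> element.
posted = r'(?ism)<a href="http://www.abbyy.com/buy"><b>(\w|\s)*</b></a>(<br>)?'

# Assumed variant: [^<]* accepts any text up to the next tag, dots included.
variant = r'(?i)<a href="http://www\.abbyy\.com/buy"><b>[^<]*</b></a>(<br>)?'

_, posted_hits = re.subn(posted, "", sample)
cleaned, variant_hits = re.subn(variant, "", sample)

print(posted_hits, variant_hits)   # 3 6
print("abbyy" in cleaned)          # False
```

Either way the surrounding `<p>` and `</p>` tags are left behind, so empty paragraphs may remain after the links are gone.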
Old 10-22-2009, 09:43 AM   #15
JvdW
Zealot
JvdW doesn't litter
 
Posts: 115
Karma: 150
Join Date: Jul 2008
Location: Netherlands Veenendaal
Device: Palm T5, Sony PRS-505, Nook Color
Quote:
Originally Posted by cdecaf View Post
Hi, I had a similar problem recently. (see https://www.mobileread.com/forums/showthread.php?t=59282 )

My problem was the regexp editor doesn't show you the text the regex acts on.

Try something like:
Code:
(?ism)<a href="http://www.abbyy.com/buy"><b>(\w|\s)*</b></a>(<br>)?
Calibre 0.6.19 does show you the correct text, whether from an EPUB file or, in my case, from a PDF.
The problem I have right now is that it doesn't highlight the text the regexp acts on. The regexp does work when I convert the PDF to EPUB.

I wanted to show what is displayed and what I think should happen, but now I'm getting an error:
Code:
ERROR: ERROR: Unhandled exception: <b>WindowsError</b>:[Error 6] The handle is invalid

Traceback (most recent call last):
  File "site-packages\calibre\gui2\convert\regex_builder.py", line 101, in button_clicked
  File "site-packages\calibre\gui2\convert\regex_builder.py", line 90, in open_book
  File "site-packages\calibre\ebooks\oeb\iterator.py", line 141, in __enter__
  File "site-packages\calibre\customize\conversion.py", line 208, in __call__
  File "site-packages\calibre\ebooks\pdf\input.py", line 33, in convert
  File "site-packages\calibre\ebooks\pdf\pdftohtml.py", line 49, in pdftohtml
  File "subprocess.py", line 614, in __init__
  File "subprocess.py", line 735, in _get_handles
  File "subprocess.py", line 761, in _make_inheritable
WindowsError: [Error 6] The handle is invalid
I'll see what I can do when I'm at home tonight.

Regards,

Joop
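
The traceback above ends inside `subprocess` while it sets up the standard handles. A frequently reported trigger for `[Error 6] The handle is invalid` is starting a child process from a windowed (console-less) Windows program while leaving some of stdin, stdout or stderr to be inherited. That is only a guess about this case, not a confirmed diagnosis of calibre. The sketch below shows the general workaround of passing all three streams explicitly; `run_quietly` is a hypothetical helper, not calibre code.

```python
import subprocess
import sys


def run_quietly(args):
    """Run a command with all three standard streams set explicitly."""
    proc = subprocess.run(
        args,
        stdin=subprocess.DEVNULL,   # do not inherit a possibly missing stdin
        stdout=subprocess.PIPE,     # capture output instead of inheriting stdout
        stderr=subprocess.PIPE,     # same for stderr
        text=True,
    )
    return proc.returncode, proc.stdout


# Demo with the current Python interpreter as the child process.
print(run_quietly([sys.executable, "-c", "print('ok')"]))
```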
All times are GMT -4.