[Old Thread] Removing ABBYY header in a PDF

robertlc · 10-18-2009, 11:21 PM

I have a few PDF files someone else converted using ABBYY PDF Transformer.

Each page has a graphic in both top corners.

Looking at the page in Calibre's header wizard shows the encoding behind the graphic as this:

<a href="http://www.abbyy.com/buy"><b>PDF Transform</b></a><p>
<a href="http://www.abbyy.com/buy"><b>PDF Transform</b></a></p><p>
<a href="http://www.abbyy.com/buy"><b>Y</b></a></p><p>
<a href="http://www.abbyy.com/buy"><b>Y</b></a></p><p>
<a href="http://www.abbyy.com/buy"><b>Y</b></a></p><p>
<a href="http://www.abbyy.com/buy"><b>er</b></a></p><p>
<a href="http://www.abbyy.com/buy"><b>Y</b></a></p><p>
<a href="http://www.abbyy.com/buy"><b>er</b></a></p><p>
<a href="http://www.abbyy.com/buy"><b>B</b></a></p><p>
<a href="http://www.abbyy.com/buy"><b>2</b></a></p><p>
<a href="http://www.abbyy.com/buy"><b>B</b></a></p><p>
<a href="http://www.abbyy.com/buy"><b>2</b></a></p><p>
<a href="http://www.abbyy.com/buy"><b>B</b></a></p><p>
<a href="http://www.abbyy.com/buy"><b>.0</b></a></p><p>
<a href="http://www.abbyy.com/buy"><b>B</b></a></p><p>
<a href="http://www.abbyy.com/buy"><b>.0</b></a></p><p>
<a href="http://www.abbyy.com/buy"><b>A</b></a></p><p>
<a href="http://www.abbyy.com/buy"><b>A</b></a></p><p>
<a href="http://www.abbyy.com/buy"><b>Click here to buy</b></a></p><p>
<a href="http://www.abbyy.com/buy"><b>Click here to buy</b></a></p><p>
<a href="http://www.abbyy.com/buy"><b>w</b></a></p><p>
<a href="http://www.abbyy.com/buy"><b>w</b></a></p><p>
<a href="http://www.abbyy.com/buy"><b>w</b></a></p><p>
<a href="http://www.abbyy.com/buy"><b>w</b></a></p><p>
<a href="http://www.abbyy.com/buy"><b>w . </b></a></p><p>
<a href="http://www.abbyy.com/buy"><b>w</b></a></p><p>
<a href="http://www.abbyy.com/buy"><b>A B B YY.com</b></a></p><p>
<a href="http://www.abbyy.com/buy"><b>.A B BYY.com</b></a></p><p>

When converted to a mobi file, I get a bunch of lines that start with "PDF Transform" and then several letters, a couple of "Click here to buy," some more letters and then the "A B B YY.com" and ".A B BYY.com".

On my Kindle 2, this makes up about 2.5 pages I have to skip through every 3 or so pages and is annoying.

Can someone please tell me what I need to enter in the "Header Regular Expression" box?

Thanks in advance!

hairybiker · 10-19-2009, 05:03 AM

I have the same issue, so if anybody has a fix I would be grateful as well.

charleski · 10-19-2009, 08:36 PM

Have you tried just using Notepad on the html and doing a global search/replace on the offending lines? (replace field left blank).

hairybiker · 10-20-2009, 04:51 AM

That's what I have been doing until now, except using TextCrawler and a macro. If it could be removed on import it would be easier, especially since I have to run the app in a VM.

DerSchwarzePrinz · 10-20-2009, 10:21 AM

Quote:

Originally Posted by robertlc

I have a few PDF files someone else converted using ABBYY PDF Transformer.

Each page has a graphic in both top corners.

Looking at the page in Calibre's header wizard shows the encoding behind the graphic as this:

<...>

When converted to a mobi file, I get a bunch of lines that start with "PDF Transform" and then several letters, a couple of "Click here to buy," some more letters and then the "A B B YY.com" and ".A B BYY.com".

On my Kindle 2, this makes up about 2.5 pages I have to skip through every 3 or so pages and is annoying.

Can someone please tell me what I need to enter in the "Header Regular Expression" box?

Thanks in advance!

Have you tried the following regular expression?

PDF Transform .+ \.com

PDF Transform = start of the text that should be removed
.+ = one or more characters
\.com = end of the text that should be removed

hairybiker · 10-20-2009, 11:31 AM

Interestingly, that one didn't do anything, but changing it to PDF.+\.com does remove all the junk.

Cheers, must learn regex.
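
For anyone following along, here is a minimal sketch of why the two patterns can behave differently. It uses Python's `re` module, which is what calibre's regex boxes are built on. The `sample` string is a hypothetical one-line version of the header text, not the actual converter output. The first pattern contains literal spaces, so it needs a space directly before ".com", and the sample ("YY.com") has none.

```python
import re

# Hypothetical sample: the header text flattened onto one line.
sample = "PDF Transform Y Y er B 2 .0 A Click here to buy w w w . A B B YY.com"

# Pattern as first suggested: the spaces around ".+" are matched literally.
strict = re.compile(r"PDF Transform .+ \.com")

# Shorter pattern that was reported to work.
loose = re.compile(r"PDF.+\.com")

# No space precedes ".com" in the sample, so the strict pattern finds nothing.
print(strict.search(sample))

# The loose pattern covers the whole header, so nothing is left after removal.
print(repr(loose.sub("", sample)))
```

Note that `.` does not match a newline unless the DOTALL flag `(?s)` is set, so if the header is spread over several lines in the converted HTML, either pattern may match less than expected.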

robertlc · 10-20-2009, 12:57 PM

Quote:

Originally Posted by DerSchwarzePrinz

Have you tried the following regular expression?

PDF Transform .+ \.com

PDF Transform = start of the text that should be removed
.+ = one or more characters
\.com = end of the text that should be removed

Doesn't work when I put it in Calibre. The text is still there.

How did you do it, hairybiker?

robertlc · 10-20-2009, 01:02 PM

Quote:

Originally Posted by hairybiker

Interestingly, that one didn't do anything, but changing it to PDF.+\.com does remove all the junk.

Cheers, must learn regex.

Just tried it your way, hairybiker, and it doesn't remove the junk for me.

hairybiker · 10-20-2009, 01:07 PM

I just selected the PDF, chose Convert, then under Structure Detection ticked Remove Header and put "PDF.+\.com" into the box, replacing the default expression. If you click on the wizard, it will show you what is being removed.

robertlc · 10-20-2009, 02:43 PM

Did all that. Can't get rid of it all.

hairybiker · 10-20-2009, 04:32 PM

Strange. Could you send me a copy of one where it doesn't get removed?

charleski · 10-20-2009, 07:59 PM

Quote:

Originally Posted by hairybiker

That's what I have been doing until now, except using TextCrawler and a macro. If it could be removed on import it would be easier, especially since I have to run the app in a VM.

But then calibre would be turning into an app for editing the files rather than simply converting them.

If you have a large number of files that need identical editing, it might be more efficient to write a script to pass them through grep and then pipe the output to calibre's command-line converter.
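
As a rough illustration of that batch approach, here is a sketch in Python rather than grep, so it runs the same way on Windows and Linux. It assumes the junk always appears as links to the ABBYY "buy" page, as in the first post. The names `strip_abbyy_links` and `clean_folder` are made up for this example.

```python
import re
from pathlib import Path

# Assumption: every junk fragment is a link to the ABBYY "buy" page,
# optionally followed by a <br> tag.
ABBYY_LINK = re.compile(
    r'<a href="http://www\.abbyy\.com/buy">.*?</a>(?:\s*<br\s*/?>)?',
    re.IGNORECASE | re.DOTALL,
)


def strip_abbyy_links(html: str) -> str:
    """Return html with every ABBYY buy-link removed."""
    return ABBYY_LINK.sub("", html)


def clean_folder(src: Path, dst: Path) -> list[Path]:
    """Write a cleaned copy of each .html file in src into dst."""
    dst.mkdir(parents=True, exist_ok=True)
    cleaned = []
    for page in sorted(src.glob("*.html")):
        out = dst / page.name
        text = page.read_text(encoding="utf-8")
        out.write_text(strip_abbyy_links(text), encoding="utf-8")
        cleaned.append(out)
    return cleaned
```

Each cleaned file could then be handed to calibre's `ebook-convert` command (input file first, output file second). This only helps if you already have the intermediate HTML, and the file encoding is assumed to be UTF-8.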

hairybiker · 10-21-2009, 09:24 AM

If I was better at Linux command scripting then that is what I would do, but since I am still learning it ...

cdecaf · 10-21-2009, 09:55 AM

Hi, I had a similar problem recently. (see https://www.mobileread.com/forums/showthread.php?t=59282 )

My problem was the regexp editor doesn't show you the text the regex acts on.

Try something like:

Code:

(?ism)<a href="http://www.abbyy.com/buy"><b>(\w|\s)*</b></a>(<br>)?
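
A quick way to check a pattern like this outside calibre is a few lines of Python, since calibre uses Python's `re` syntax. This is only a sketch: the `sample` markup is a hypothetical excerpt modelled on the first post, and the wider `[^<]*` variant is a suggestion of mine, not something tested against the real files. The point it shows is that `(\w|\s)*` does not match a period, so fragments such as ".0" or "A B B YY.com" would be left behind.

```python
import re

LINK = '<a href="http://www.abbyy.com/buy">'

# Hypothetical excerpt of the converted markup, followed by real book text.
sample = (
    LINK + "<b>PDF Transform</b></a><br>"
    + LINK + "<b>.0</b></a><br>"
    + LINK + "<b>A B B YY.com</b></a><br>"
    + "Real text"
)

# Pattern as posted: only word characters and whitespace allowed inside <b>.
posted = re.compile(
    r'(?ism)<a href="http://www.abbyy.com/buy"><b>(\w|\s)*</b></a>(<br>)?'
)

# Wider variant (an assumption): anything up to the next tag inside <b>.
wider = re.compile(
    r'(?ism)<a href="http://www.abbyy.com/buy"><b>[^<]*</b></a>(<br>)?'
)

posted_result = posted.sub("", sample)
wider_result = wider.sub("", sample)

print(posted_result)  # fragments containing "." are still present
print(wider_result)   # only the book text remains
```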

JvdW · 10-22-2009, 09:43 AM

Quote:

Originally Posted by cdecaf

Hi, I had a similar problem recently. (see https://www.mobileread.com/forums/showthread.php?t=59282 )

My problem was the regexp editor doesn't show you the text the regex acts on.

Try something like:

Code:

(?ism)<a href="http://www.abbyy.com/buy"><b>(\w|\s)*</b></a>(<br>)?

Calibre 0.6.19 does show you the correct text whether from an epub file or in my case from a pdf.
The problem I have right now is that it doesn't highlight the text the regexp matches. The regexp does work when I convert the PDF to EPUB.

I wanted to show what is displayed and what I think should happen but now I'm getting an error:

Code:

ERROR: ERROR: Unhandled exception: <b>WindowsError</b>:[Error 6] The handle is invalid

Traceback (most recent call last):
  File "site-packages\calibre\gui2\convert\regex_builder.py", line 101, in button_clicked
  File "site-packages\calibre\gui2\convert\regex_builder.py", line 90, in open_book
  File "site-packages\calibre\ebooks\oeb\iterator.py", line 141, in __enter__
  File "site-packages\calibre\customize\conversion.py", line 208, in __call__
  File "site-packages\calibre\ebooks\pdf\input.py", line 33, in convert
  File "site-packages\calibre\ebooks\pdf\pdftohtml.py", line 49, in pdftohtml
  File "subprocess.py", line 614, in __init__
  File "subprocess.py", line 735, in _get_handles
  File "subprocess.py", line 761, in _make_inheritable
WindowsError: [Error 6] The handle is invalid

I'll see what I can do when I'm at home tonight.

Regards,

Joop

hairybiker Banned Posts: 82 Karma: 10 Join Date: Aug 2009 Device: Tolino Shine 3	If I was better at Linux command scripting then that is what I would do, but since I am still learning it ...