10-18-2009, 11:21 PM | #1 |
Junior Member
Posts: 4
Karma: 10
Join Date: Oct 2009
Device: Kindle 2
|
[Old Thread] Removing ABBYY header in a PDF
I have a few PDF files someone else converted using ABBYY PDF Transformer.
Each page has a graphic in both top corners. Looking at the page in Calibre's header wizard shows the encoding behind the graphic as this: <a href="http://www.abbyy.com/buy"><b>PDF Transform</b></a><p> <a href="http://www.abbyy.com/buy"><b>PDF Transform</b></a></p><p> <a href="http://www.abbyy.com/buy"><b>Y</b></a></p><p> <a href="http://www.abbyy.com/buy"><b>Y</b></a></p><p> <a href="http://www.abbyy.com/buy"><b>Y</b></a></p><p> <a href="http://www.abbyy.com/buy"><b>er</b></a></p><p> <a href="http://www.abbyy.com/buy"><b>Y</b></a></p><p> <a href="http://www.abbyy.com/buy"><b>er</b></a></p><p> <a href="http://www.abbyy.com/buy"><b>B</b></a></p><p> <a href="http://www.abbyy.com/buy"><b>2</b></a></p><p> <a href="http://www.abbyy.com/buy"><b>B</b></a></p><p> <a href="http://www.abbyy.com/buy"><b>2</b></a></p><p> <a href="http://www.abbyy.com/buy"><b>B</b></a></p><p> <a href="http://www.abbyy.com/buy"><b>.0</b></a></p><p> <a href="http://www.abbyy.com/buy"><b>B</b></a></p><p> <a href="http://www.abbyy.com/buy"><b>.0</b></a></p><p> <a href="http://www.abbyy.com/buy"><b>A</b></a></p><p> <a href="http://www.abbyy.com/buy"><b>A</b></a></p><p> <a href="http://www.abbyy.com/buy"><b>Click here to buy</b></a></p><p> <a href="http://www.abbyy.com/buy"><b>Click here to buy</b></a></p><p> <a href="http://www.abbyy.com/buy"><b>w</b></a></p><p> <a href="http://www.abbyy.com/buy"><b>w</b></a></p><p> <a href="http://www.abbyy.com/buy"><b>w</b></a></p><p> <a href="http://www.abbyy.com/buy"><b>w</b></a></p><p> <a href="http://www.abbyy.com/buy"><b>w . </b></a></p><p> <a href="http://www.abbyy.com/buy"><b>w</b></a></p><p> <a href="http://www.abbyy.com/buy"><b>A B B YY.com</b></a></p><p> <a href="http://www.abbyy.com/buy"><b>.A B BYY.com</b></a></p><p> When converted to a mobi file, I get a bunch of lines that start with "PDF Transform" and then several letters, a couple of "Click here to buy," some more letters and then the "A B B YY.com" and ".A B BYY.com". 
On my Kindle 2, this adds up to about 2.5 pages of junk that I have to skip through every 3 or so pages, which is annoying. Can someone please tell me what I need to enter in the "Header Regular Expression" box? Thanks in advance! |
10-19-2009, 05:03 AM | #2 |
Banned
Posts: 82
Karma: 10
Join Date: Aug 2009
Device: Tolino Shine 3
|
I have the same issue, so if anybody has a fix I would be grateful as well.
|
10-19-2009, 08:36 PM | #3 |
Wizard
Posts: 1,196
Karma: 1281258
Join Date: Sep 2009
Device: PRS-505
|
Have you tried just using Notepad on the html and doing a global search/replace on the offending lines? (replace field left blank).
|
10-20-2009, 04:51 AM | #4 |
Banned
Posts: 82
Karma: 10
Join Date: Aug 2009
Device: Tolino Shine 3
|
That's what I have been doing until now, except using TextCrawler and a macro. If it could be removed on import it would be easier, especially since I have to run the app in a VM.
|
10-20-2009, 10:21 AM | #5 | |
Member
Posts: 24
Karma: 16
Join Date: Aug 2009
Device: Pocketbook 360, Sony PRS-T1
|
Quote:
PDF Transform .+ \.com
PDF Transform = start of the text that should be removed
.+ = one or more characters
\.com = end of the text that should be removed |
|
10-20-2009, 11:31 AM | #6 |
Banned
Posts: 82
Karma: 10
Join Date: Aug 2009
Device: Tolino Shine 3
|
Interestingly, that didn't do anything for me, but changing it to PDF.+\.com does remove all the junk.
Cheers, must learn regex |
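One possible reason for the difference, as a rough sketch only: if the first pattern is entered with the literal spaces around .+, it needs a space right after "PDF Transform" and another right before ".com", and the markup in the first post has neither. The sample string below is a shortened copy of that markup, assumed to sit on a single line; Python's re module stands in for whatever Calibre does internally.

```python
import re

# Shortened sample of the markup quoted in the first post (assumed to be one line).
sample = (
    '<a href="http://www.abbyy.com/buy"><b>PDF Transform</b></a><p> '
    '<a href="http://www.abbyy.com/buy"><b>Click here to buy</b></a></p><p> '
    '<a href="http://www.abbyy.com/buy"><b>A B B YY.com</b></a></p>'
)

# Pattern as posted, with literal spaces around ".+" (assumption about how it was typed).
with_spaces = re.compile(r"PDF Transform .+ \.com")
# Looser pattern that worked: greedy match from the first "PDF" to the last ".com".
without_spaces = re.compile(r"PDF.+\.com")

spaced_match = with_spaces.search(sample)
loose_match = without_spaces.search(sample)

# Text that remains once the loose pattern's match is removed.
cleaned = without_spaces.sub("", sample)

print("with spaces matched:", spaced_match is not None)
print("without spaces matched:", loose_match is not None)
print("remaining text:", cleaned)
```

Note that .+ does not cross line breaks by default, so if the junk is spread over several lines in the converted text, the loose pattern would only strip it one line at a time.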
10-20-2009, 12:57 PM | #7 | |
Junior Member
Posts: 4
Karma: 10
Join Date: Oct 2009
Device: Kindle 2
|
Quote:
Doesn't work when I put it in Calibre. The text is still there. How did you do it HairyBiker? |
|
10-20-2009, 01:02 PM | #8 |
Junior Member
Posts: 4
Karma: 10
Join Date: Oct 2009
Device: Kindle 2
|
|
10-20-2009, 01:07 PM | #9 |
Banned
Posts: 82
Karma: 10
Join Date: Aug 2009
Device: Tolino Shine 3
|
I just selected the PDF, chose Convert, then under Structure Detection ticked Remove Header and put "PDF.+\.com" into the box, replacing the default one. If you click on the wizard it will show you what is being removed.
|
10-20-2009, 02:43 PM | #10 |
Junior Member
Posts: 4
Karma: 10
Join Date: Oct 2009
Device: Kindle 2
|
Did all that. Can't get rid of it all.
|
10-20-2009, 04:32 PM | #11 |
Banned
Posts: 82
Karma: 10
Join Date: Aug 2009
Device: Tolino Shine 3
|
Strange. Could you send me a copy of one where the header doesn't get removed?
|
10-20-2009, 07:59 PM | #12 | |
Wizard
Posts: 1,196
Karma: 1281258
Join Date: Sep 2009
Device: PRS-505
|
Quote:
If you have a large number of files that need identical editing, it might be more efficient to write a script to pass them through grep and then pipe the output to calibre's command-line converter. |
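As a rough sketch of that kind of batch clean-up, here is a small Python script instead of grep. It assumes the PDFs have already been turned into HTML files and that every watermark fragment is an anchor pointing at the ABBYY buy page, as in the first post. The function names and the folder layout are made up for illustration.

```python
import re
from pathlib import Path

# Assumption: each watermark fragment looks like the anchors quoted in the first post.
ABBYY_LINK = re.compile(
    r'<a href="http://www\.abbyy\.com/buy"><b>[^<]*</b></a>', re.IGNORECASE
)


def strip_abbyy_links(html):
    """Return the HTML text with the ABBYY watermark anchors removed."""
    return ABBYY_LINK.sub("", html)


def clean_directory(src, dst):
    """Write a cleaned copy of every .html file in src into dst (hypothetical layout)."""
    dst.mkdir(parents=True, exist_ok=True)
    written = []
    for path in sorted(src.glob("*.html")):
        text = path.read_text(encoding="utf-8", errors="replace")
        out = dst / path.name
        out.write_text(strip_abbyy_links(text), encoding="utf-8")
        written.append(out)
    return written


if __name__ == "__main__":
    # Hypothetical folder names; adjust to wherever the HTML files live.
    for cleaned_file in clean_directory(Path("html_in"), Path("html_clean")):
        print("cleaned", cleaned_file)
```

Each cleaned file could then be handed to calibre's command-line converter, for example `ebook-convert cleaned.html book.mobi`.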
|
10-21-2009, 09:24 AM | #13 |
Banned
Posts: 82
Karma: 10
Join Date: Aug 2009
Device: Tolino Shine 3
|
If I was better at Linux command scripting then that is what I would do, but since I am still learning it ...
|
10-21-2009, 09:55 AM | #14 |
Junior Member
Posts: 9
Karma: 10
Join Date: Oct 2009
Device: prs 505
|
Hi, I had a similar problem recently. (see https://www.mobileread.com/forums/showthread.php?t=59282 )
My problem was the regexp editor doesn't show you the text the regex acts on. Try something like: Code:
(?ism)<a href="http://www.abbyy.com/buy"><b>(\w|\s)*</b></a>(<br>)?
|
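A quick way to check a pattern like this outside Calibre is to run it over the fragments from the first post. The sketch below uses Python's re module, which may not behave exactly like the header wizard, and the widened [^<]* variant is only a suggestion. The point is that (\w|\s)* allows only word characters and whitespace, so fragments containing a period might be left behind.

```python
import re

# Pattern as posted above.
posted = re.compile(r'(?ism)<a href="http://www.abbyy.com/buy"><b>(\w|\s)*</b></a>(<br>)?')
# Suggested variant (an assumption, not from the thread): allow anything except "<".
widened = re.compile(r'(?ism)<a href="http://www.abbyy.com/buy"><b>[^<]*</b></a>(<br>)?')

# Text fragments taken from the markup in the first post.
fragments = ["PDF Transform", "Y", "er", ".0", "Click here to buy", "w . ", "A B B YY.com"]


def missed_by(pattern):
    """Return the fragments whose anchor the pattern fails to match."""
    missed = []
    for frag in fragments:
        html = '<a href="http://www.abbyy.com/buy"><b>' + frag + '</b></a>'
        if pattern.search(html) is None:
            missed.append(frag)
    return missed


missed_posted = missed_by(posted)
missed_widened = missed_by(widened)
print("posted pattern misses:", missed_posted)
print("widened pattern misses:", missed_widened)
```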
10-22-2009, 09:43 AM | #15 | |
Zealot
Posts: 115
Karma: 150
Join Date: Jul 2008
Location: Netherlands Veenendaal
Device: Palm T5, Sony PRS-505, Nook Color
|
Quote:
The problem I have right now is that it doesn't highlight the text the regexp acts on. The regexp does work when I convert the PDF to EPUB. I wanted to show what is displayed and what I think should happen, but now I'm getting an error: Code:
ERROR: ERROR: Unhandled exception: <b>WindowsError</b>:[Error 6] The handle is invalid
Traceback (most recent call last):
  File "site-packages\calibre\gui2\convert\regex_builder.py", line 101, in button_clicked
  File "site-packages\calibre\gui2\convert\regex_builder.py", line 90, in open_book
  File "site-packages\calibre\ebooks\oeb\iterator.py", line 141, in __enter__
  File "site-packages\calibre\customize\conversion.py", line 208, in __call__
  File "site-packages\calibre\ebooks\pdf\input.py", line 33, in convert
  File "site-packages\calibre\ebooks\pdf\pdftohtml.py", line 49, in pdftohtml
  File "subprocess.py", line 614, in __init__
  File "subprocess.py", line 735, in _get_handles
  File "subprocess.py", line 761, in _make_inheritable
WindowsError: [Error 6] The handle is invalid
Regards, Joop |
|