major problems converting pdf

dapex · 01-08-2011, 07:55 AM

Hi all, I have to admit I am pretty new to ebook readers. My wife bought me a Kindle for Xmas and it's great; however, most of my reading material is in PDF format, and when I use Calibre to convert books I always get a messed-up conversion. I end up with the name of the book randomly inserted into the pages, or text appears in the converted book that isn't in the PDF.

Can anyone advise why this is happening, or am I simply expecting too much?

I thought I should be able to take a perfectly formatted pdf and convert it to epub or mobi and have the same output?

If it helps I have uploaded a PDF and the mobi conversion in a zip file here
http://www.fileserve.com/file/EQPxZHd

If anyone can help then please do, as it's really ruining the reading experience at present.

Cheers

Edited to add: I've just found out what's causing one particular issue, I just don't know how to resolve it. I have a few PDFs where the top of each page has the page number and the title of the book. When I convert these PDFs into either epub or mobi (I'm doing an epub conversion for a friend with a Samsung ereader), the page number and book title are made bold and larger and then inserted into the middle of a sentence, so the conversion isn't able to tell that it's the start of a new page. I have no idea how to tell it that this is the start of a new page. Ideally I want it to ignore the page number and the book title, unless it can add them as they are in the PDF.

Any thoughts?

ldolse · 01-08-2011, 09:14 AM

You need to write a regular expression to remove the header/footer. Go to Structure Detection under the conversion options, enable either 'remove header' or 'remove footer', and then enter the appropriate regular expression. You can click the magic wand button to pull up a wizard to help you write/test it. There's a tutorial in the Calibre manual and several tutorials online for regular expressions/regex if you're not familiar with them.

dapex · 01-08-2011, 10:39 AM

Cheers for that. I tried the remove header and footer options but that didn't seem to do anything. I'll google for the tutorials and see if that helps.

ldolse · 01-08-2011, 10:44 AM

Found the Calibre tutorial - couldn't find it when I posted before:
https://www.calibre-ebook.com/user_manual/regexp.html

dapex · 01-12-2011, 09:23 AM

OK, read the tutorial you showed me and to be honest it's way over my head. I have had a look at the Structure Detection section in Calibre and found out where the problem is. Below is a section of the PDF file I am currently working on:

Code:

When you're just about to be really<br>
mean to someone you love, you could stop and do this. And with<br>
<hr>
<A name=28></a><i>26 Using Your Brain</i><br>
the look that's on your faces right now, who knows what you<br>
could get into . . . .all kinds of fun trouble!<br>

Basically the whole line <A name=28></a><i>26 Using Your Brain</i><br> is at the top of a page, and it's the page number and chapter title. This is at the top of every page, but the A name= number goes up by one each time.

Because the software doesn't realise this is the page number and chapter title, it is adding it into the text of the book, which is obviously a tad annoying.

Can anyone tell me how I can tell Calibre to either ignore the <A name=28></a><i>26 Using Your Brain</i><br> or tell it that this is a page header, so it just puts it at the top of the page in smaller text instead of in the middle of a sentence?

Please help, as this is a problem on many of the PDFs I have and it's really bugging me that I can't fix it. (I can fix it by going into a PDF editor and manually removing each page number etc., but as you can imagine this is a painfully slow process, and when I have loads of PDFs to do it's not really practical.)

Cheers

Dave

itimpi · 01-12-2011, 10:01 AM

You should use the option to remove headers (and/or footers) in the Structure Detection part of PDF input. Note that despite their names these are really just generic string removal options; header/footer removal is simply their commonest usage.

You have to construct a regular expression that is specific to the file in question. However, it is quite easy to do in most cases if you take advantage of the wizard. The steps I use are:
- Press the Wizard button alongside the input text box for one of the above options, and select the PDF file.
- When the window opens up, find an example of the text you want to remove, and then copy/paste it into the regex box at the top, replacing what is already there.
- Replace anywhere there is a number with \d* to allow for a number of any length. This handles things like the page number varying.
- Replace anywhere there is white space with \s*. This also handles tabs, newlines, etc.
- Press the Test button to make sure the text you want removed is highlighted. If not, you probably got one of the \d* or \s* replacements wrong.
- If the correct text was highlighted, scroll down to the next occurrence of a similar string to check it was also highlighted, so that you know you have generalised the expression correctly.
- Press OK.
- Make sure the checkbox to use the expression just created is ticked.
- Repeat if necessary for the footer box, as footers typically need a different regex to the header.
- Press OK to actually do the conversion.
- When the conversion completes you can view the results to check they are what you want.

It sounds more complicated than it actually turns out to be, and you do not have to really understand regex to carry out the above steps.

The settings you used for this particular book will be remembered, so if you need to tweak them, the settings you last used will be the new starting point.
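For anyone who wants to see what the "replace numbers with \d*, replace white space with \s*" steps amount to, here is a rough Python sketch. It has nothing to do with Calibre's own code: the function name `generalise_header` is made up for illustration, and the test lines are invented stand-ins for headers on later pages. It also escapes the remaining characters so they are matched literally, which the steps above do not mention.

```python
import re

def generalise_header(sample: str) -> str:
    """Turn one literal header sample into a regex (hypothetical helper)."""
    parts = []
    # Split the sample into runs of digits, runs of whitespace, and everything else.
    for token in re.findall(r"\d+|\s+|[^\d\s]+", sample):
        if token.isdigit():
            parts.append(r"\d*")   # any number, of any length
        elif token.isspace():
            parts.append(r"\s*")   # spaces, tabs, newlines
        else:
            parts.append(re.escape(token))  # keep the rest literal
    return "".join(parts)

# The header line quoted in this thread is used as the sample.
pattern = generalise_header("<i>26 Using Your Brain</i><br>")
header = re.compile(pattern)

# Made-up lines standing in for headers on other pages.
for line in ["<i>27 Using Your Brain</i><br>", "<i>103  Using\nYour Brain</i><br>"]:
    print(bool(header.search(line)))
```

Note that \d* also matches when there is no number at all; \d+ is a stricter choice if every header is known to carry a page number.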

cybmole · 01-12-2011, 10:39 AM

There are other ways to remove the header & footer without learning/using regex. Google pdfscissors or search for it in this forum.

ldolse · 01-12-2011, 11:13 AM

All you need to do is delete the

Code:

<i>26 Using Your Brain</i><br>

references.

The stuff with <A name=....> gets deleted as part of the default processing, so you don't need to particularly worry about that.

The regex should be something like:

Code:

<i>\d+\s*Using\s*Your\s*Brain\s*</i>\s*<br>

There are alternative tools like the one cybmole mentioned, Mobipocket Creator, etc., which should be able to do a basic conversion as well, perhaps with less pain and suffering on your part. That said, I haven't tried them, so can't really comment.
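If you want to check the expression outside the wizard, a quick Python sketch like the one below does the job. It only assumes Python's standard `re` module. The text is the sample quoted in this thread, with a second header (page 27, anchor 29) invented to check that the expression generalises. It is not how Calibre applies the expression internally, so the <A name=...> anchors are left in place here.

```python
import re

# Sample from the thread, plus one made-up header for the following page.
html = (
    "mean to someone you love, you could stop and do this. And with<br>\n"
    "<hr>\n"
    "<A name=28></a><i>26 Using Your Brain</i><br>\n"
    "the look that's on your faces right now, who knows what you<br>\n"
    "<hr>\n"
    "<A name=29></a><i>27 Using Your Brain</i><br>\n"
    "could get into . . . .all kinds of fun trouble!<br>\n"
)

# The expression suggested above.
header = re.compile(r"<i>\d+\s*Using\s*Your\s*Brain\s*</i>\s*<br>")

matches = header.findall(html)   # every header the expression catches
cleaned = header.sub("", html)   # the text with those headers removed

print(matches)
print(cleaned)
```

If `matches` lists every page header and nothing else, the same expression should be a reasonable candidate for the 'remove header' box.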

Sydney's Mom · 01-12-2011, 07:09 PM

Unfortunately, this is over my head. I use Mobipocket Creator Pro to convert PDFs. Calibre does epub flawlessly, but fine-tuning PDF is too much for me. MPC does a really good job: just import, then click build.

vulcan_girl · 01-12-2011, 08:58 PM

Like Sydney's Mom, this is way over my head as well. I'm perfectly happy using Briss to crop and not converting the file, especially if it's a book that I'm not planning on reading more than once. If it's one I'd like to keep, I'd probably try to get it in another format.



