MobileRead Forums - View Single Post

itimpi · 01-12-2011, 09:01 AM

You should use the option to remove headers (and/or footers) in the Structure Detection part of PDF input. Note that, despite their names, these are really just generic string removal options - header/footer removal is simply their commonest usage.

You have to construct a regular expression (regex) that is specific to the file in question. However, it is quite easy to do in most cases if you take advantage of the wizard. The steps I use are:
- Press the Wizard button alongside the input text box for one of the above options, and select the PDF file
- When the window opens up, find an example of the text you want to remove, and then copy/paste it into the regex box at the top, replacing what is already there.
- Replace every number with \d* to allow for a number of any length. This handles things like the page number varying.
- Replace every run of white space with \s*. This also handles tabs, newlines, etc.
- Press the Test button to make sure the text you want removed is highlighted - if not, you probably got one of the \d* or \s* replacements wrong
- If the correct text was highlighted, scroll down to the next occurrence of a similar string and check that it is highlighted too, which confirms you have generalised the expression correctly
- Press OK
- Make sure the checkbox to use the expression just created is ticked.
- Repeat if necessary for the footer box, as footers typically need a different regex from the header.
- Press OK to actually do the conversion
- When conversion completes you can view the results to check they are what you want.
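
For anyone curious what the \d* and \s* substitutions amount to, here is a rough Python sketch of the same generalisation done by hand. It is only an illustration: the generalise helper and the sample header text are made up for this example, and the conversion tool's own removal logic may differ in detail.

```python
import re


def generalise(sample):
    """Turn one literal header/footer string into a regex.

    Hypothetical helper: it mimics the manual steps by swapping each run
    of digits for \\d* and each run of white space for \\s*, and escaping
    everything else so it is matched literally.
    """
    parts = []
    # Split the sample into runs of digits, runs of white space, and other text.
    for token in re.findall(r"\d+|\s+|[^\d\s]+", sample):
        if re.fullmatch(r"\d+", token):
            parts.append(r"\d*")   # any number of any length
        elif re.fullmatch(r"\s+", token):
            parts.append(r"\s*")   # spaces, tabs, newlines
        else:
            parts.append(re.escape(token))
    return "".join(parts)


# Made-up header copied from one page of an imaginary book.
pattern = generalise("My Book Title  Page 12")

# Made-up extracted text with the header repeated on two pages.
text = (
    "My Book Title Page 12\n"
    "Some body text.\n"
    "My Book Title\tPage 137\n"
    "More body text."
)

# Removing every match is roughly what the header/footer option does.
cleaned = re.sub(pattern, "", text)
print(pattern)
print(cleaned)
```

Both headers are removed even though the spacing and page numbers differ, which is the point of the Test step: one expression has to cover every occurrence. Bear in mind that \d* also matches when there is no number at all, so check that the pattern does not catch ordinary body text.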

It sounds more complicated than it actually turns out to be, and you do not have to really understand regex to carry out the above steps.

The settings you used for this particular book will be remembered, so if you need to tweak them, the settings you last used will be the new starting point.
