MobileRead Forums - View Single Post

ac4lt · 01-12-2010, 07:26 PM

I'm having problems with this as well. Here are the details:

Using calibre 0.6.33, I'm trying to convert a pdf to an epub.

In the pdf the last line of the page is a line number. I'm trying to write a regex to remove this.

With the conversion debug option set, I've been able to look at the input, parsed and processed directories.

An example from the input directory shows the last couple of lines of a page:

Code:

The stranger had clambered through the ditch and up the bank,<br>
8<br>

Looking at the parsed directory, I see this:

Code:

The stranger had clambered through the ditch and up the bank, 8</p><p>

My headers have also been removed by this point, even though we're only at the parsed step, and the pdf line unwrapping appears to have been done.

The processed directory shows:

Code:

The stranger had clambered through the ditch and up the bank, 8</p><p class="calibre1">

Though the header was no problem, I can't find a regex to remove the footer. I've tried:

Code:

\d+<br>

and

Code:

\d+</p><p>

and lots of other variations that don't work.
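For reference, here is a minimal sketch for checking, outside calibre, which stage's text each of those two patterns could match at all. It uses plain Python `re` and the snippets quoted above. The newline after each `<br>` in the input snippet and the helper name `matching_stages` are my assumptions, and it says nothing about the point at which calibre itself applies the regex.

```python
import re

# Snippets copied from the debug directories described in this post.
# Assumption: each <br> in the input file is followed by a newline.
STAGES = {
    "input": "The stranger had clambered through the ditch and up the bank,<br>\n8<br>\n",
    "parsed": "The stranger had clambered through the ditch and up the bank, 8</p><p>",
    "processed": 'The stranger had clambered through the ditch and up the bank, 8</p><p class="calibre1">',
}

# The two patterns tried so far.
PATTERNS = [r"\d+<br>", r"\d+</p><p>"]


def matching_stages(pattern):
    """Return the names of the stages whose snippet the pattern matches."""
    return [name for name, text in STAGES.items() if re.search(pattern, text)]


for pattern in PATTERNS:
    print(pattern, "->", matching_stages(pattern))
```

On these snippets the first pattern only matches the input text and the second only matches the parsed text, so each one can only work if the regex runs at exactly that stage.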

It's not clear to me when the regex processing is done. That is to say, whether it is before or after the conversion to xhtml and line unwrapping have occurred. Speculating, I'd say it's after.

The problem is that it appears impossible to match <p> tags in the regex. Patterns that contain them never work.
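One thing worth ruling out is that the pattern is simply too strict about whitespace and attributes. The sketch below, again plain Python `re` run outside calibre, uses a hypothetical pattern that tolerates spaces around the number and any attributes on the opening `<p>`. The names `FOOTER` and `strip_footer` are made up for illustration, and whether calibre ever sees the text in this form when it applies the regex is exactly the open question.

```python
import re

# Hypothetical pattern: optional whitespace, the trailing number,
# then </p> followed by an opening <p> with any attributes.
FOOTER = re.compile(r"\s*\d+\s*</p>\s*<p\b[^>]*>")


def strip_footer(text):
    """Replace the number and the paragraph break after it with one space."""
    return FOOTER.sub(" ", text)


parsed = "up the bank, 8</p><p>and stood"
processed = 'up the bank, 8</p><p class="calibre1">and stood'

print(strip_footer(parsed))
print(strip_footer(processed))
```

Note that this would also remove a legitimate number that happens to end a paragraph, so it is only a starting point.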

I've tried everything suggested in this thread and so far nothing has worked.

Anyone have any ideas?
