Regular Expression Help - Page 4

vinco · 05-25-2010, 11:14 PM

Quote:

Originally Posted by chaley

With apologies in advance for asking, did you remember to check the 'Remove footer' checkbox?

In your example, there are two spaces between 'Page' and '1'. If that is an accurate copy, then you need to match more than 1 space there. Try

Code:

<b>Page +\d+</b></p><p>

If that doesn't work, then it wouldn't surprise me if there are newlines buried in the middle of the text you are trying to match. Try

Code:

<b>\s*Page +\d+\s*</b>\s*</p>\s*<p>

Remove footer is indeed checked. I tried your code, and it still fails to remove the offending sections. Anyone willing to take a stab at the PDF and try to figure out what I'm still doing wrong?

tonyx3 · 05-26-2010, 12:12 AM

Sure, vinco,

Gimme a d/l link and I'll take a stab at it...

tonyx3 · 05-26-2010, 02:53 AM

Ok, so vinco sent me the pdf he's having trouble with, and I can confirm that the previously mentioned regexes, which highlight the proper matches in the tester, don't remove those matches when converting.

I have no idea why, but it's definitely a bug.

Oddly, when I used the resultant epub (which still had the page numbers) as the input, and adjusted the regex to match the page numbers and surrounding tags in the epub, it correctly removed them in the output. (vinco, this is your temporary workaround solution).

So why is the syntax highlighter showing the regex matches, but the converter not removing them?

vinco · 05-26-2010, 09:58 AM

Thanks for the assist, tonyx3. I'll put that workaround into force for now.

kovidgoyal · 05-26-2010, 11:37 AM

Because the wizard doesn't operate on exactly the same HTML as the conversion pipeline. Typically there are whitespace differences between the two.

vinco · 05-26-2010, 12:14 PM

Quote:

Originally Posted by kovidgoyal

Because the wizard doesn't operate on exactly the same HTML as the conversion pipeline. Typically there are whitespace differences between the two.

Other than doing a series of conversions, do you have any workaround suggestions? I can get a copy of the PDF to you as well if interested.

kovidgoyal · 05-26-2010, 12:27 PM

Use the debug option to look at the actual intermediate HTML generated by the conversion process.

vinco · 05-26-2010, 07:53 PM

Quote:

Originally Posted by kovidgoyal

Use the debug option to look at the actual intermediate HTML generated by the conversion process.

For anyone interested, using the debug route, a sample section becomes

Code:

&nbsp; &nbsp; &nbsp;Since nothing material was destroyed when the Eddorians were forced into the next plane of<br>
existence, their historical records also have become available. Those records-folios and tapes and<br>
playable discs of platinum alloy, resistant indefinitely even to Eddore's noxious atmosphere agree with<br>
those of the Arisians upon this point. Immediately before the Coalescence began there was one, and only<br>
<b>Page&nbsp;&nbsp;1</b><br>

<hr>
<A name=2></a>one, planetary solar system in the Second Galaxy; and, until the advent of Eddore, the Second Galaxy<br>
was entirely devoid of intelligent life.&nbsp;<br>
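
A possible reading of the sample above: the footer is written with `&nbsp;` entities rather than plain spaces, and it is followed by `<br>` rather than `</p><p>`, so the patterns suggested earlier in the thread have nothing to match. The following is only a minimal Python sketch against the fragment posted here, not calibre's own code; the names `GAP`, `original` and `adjusted` are made up for illustration, and the adjusted pattern is an assumption about what would fit this one sample.

```python
import re

# Fragment copied from the debug output posted above.
intermediate_html = (
    "those of the Arisians upon this point. Immediately before the "
    "Coalescence began there was one, and only<br>\n"
    "<b>Page&nbsp;&nbsp;1</b><br>\n"
    "\n"
    "<hr>\n"
    "<A name=2></a>one, planetary solar system in the Second Galaxy;"
)

# Pattern suggested earlier in the thread: expects literal spaces and </p><p>.
original = re.compile(r"<b>\s*Page +\d+\s*</b>\s*</p>\s*<p>")

# Hypothetical variant: a "gap" is either whitespace or an &nbsp; entity,
# and the footer ends with the <br> seen in this sample.
GAP = r"(?:\s|&nbsp;)"
adjusted = re.compile(
    r"<b>" + GAP + r"*Page" + GAP + r"+\d+" + GAP + r"*</b>\s*<br\s*/?>",
    re.IGNORECASE,
)

# The original pattern finds nothing in the intermediate HTML.
print(original.search(intermediate_html))

# The adjusted pattern removes the footer line and leaves the rest alone.
cleaned = adjusted.sub("", intermediate_html)
print(cleaned)
```

Whether the same pattern works inside calibre depends on the HTML the conversion pipeline actually produces for a given PDF, so it is worth checking against the debug output each time.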

tonyx3 · 05-26-2010, 11:19 PM

So in this example, it looks like the problem is that the wizard can't tell the difference between a regular space and a non-breaking space, right?
That would be a problem. A 'white space difference' as Kovid said.

Quote:

Originally Posted by kovidgoyal

Because the wizard doesn't operate on exactly the same HTML as the conversion pipeline. Typically there are whitespace differences between the two.

Is there some reason for this?

I mean, I'm sure there's some reason, but is it absolutely necessary?
It seems like it would be better if we were able to write and test our regexes based on the code that the conversion pipeline actually uses, to avoid errors like this one.

kovidgoyal · 05-27-2010, 12:18 AM

Well yeah, but the conversion pipeline can't be run (for various technical reasons) inside the GUI, so the GUI basically uses a trick to use an approximation of the conversion pipeline. It works fine in most cases, where you don't have unusual input files, but in some cases, like this, the approximation isn't good enough.

I could of course run the conversion pipeline in a separate process and then take the output of that into the GUI, but that is too much work. I prefer to spend the time just improving the PDF engine so it removes headers and footers automatically.

tonyx3 · 05-27-2010, 03:07 AM

I see. So calibre uses two different pdf-to-html engines?

The one used in the conversion pipeline is obviously returning different results from the one used in the regex wizard.

Quote:

I prefer to spend the time just improving the PDF engine so it removes headers and footers automatically.

That would be amazing.

Unfortunately, I've never once had the defaults work on removing headers or footers from PDFs. I've always had to write my own regex. And on multiple occasions I've had them match perfectly in the preview, and then not get removed in the conversion (which is one reason I wish the preview HTML matched the conversion HTML).

I'm sure PDF conversion, given the format's nature, must be one of the bigger headaches in developing the conversion system.

DoctorOhh · 05-27-2010, 04:54 AM

Quote:

Originally Posted by tonyx3

Quote:

Originally Posted by kovidgoyal

I prefer to spend the time just improving the PDF engine so it removes headers and footers automatically.

That would be amazing.

Unfortunately, I've never once had the defaults work on removing headers or footers from PDF's.

I believe he is referring to improving a not-yet-released PDF engine, one which none of us has had a chance to try yet because it isn't finished.

tonyx3 · 05-27-2010, 06:02 AM

Quote:

Originally Posted by dwanthny

I believe he is referring to improving a not-yet-released PDF engine, one which none of us has had a chance to try yet because it isn't finished.

Awesome. I look forward to it!

TreborPugly · 06-11-2010, 11:54 PM

Hi. I've been using Calibre for a few weeks and I'm really enjoying it.

I adopted a regular expression for Adding from this thread that does a great job for my files:

Code:

^((?P<author>([^\-_0-9]+)(?=\s*-\s*)(?!\s*-\s*[0-9.]+)|\b))(\s*-\s*)?((?P<series>[^0-9\-]+)(\s*-\s*)?(?P<series_index>[0-9.]+)\s*-\s*)?(?P<title>[^\-_0-9]+)

However, I have some files that are zipped, and the file name includes the format of the file inside the zip, like so:

Author name - Book title (htm).zip

So this will get imported with "Book title (htm)" as the title, rather than just "Book title". Then I have to manually merge things. The parenthesis might be (htm), or (rtf), or (txt), etc... I can get it to ignore the parenthesis by adding a [(] to the end of my regex, but then that breaks the adding for files that don't have a ( in them.

I'm new to regex, and I've done some reading of the reference suggested from inside Calibre (which is how I learned enough to put my little addition on), but I've been trying unsuccessfully to figure out a way to use the | operator.

I'd be pleased with any solution that works, and if you have the time a brief description of why it works.

My expectation is that I want to match ( or nothing, but I'm not sure how to do the nothing, i.e., is there some way to tell it to start over if a match fails?

Thanks in advance.
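
One common way to express "( or nothing" is an alternation between a literal `(` and the end-of-string anchor `$`, combined with a lazy (non-greedy) title so that the title stops at the first point where either alternative fits. The Python sketch below uses a deliberately simplified, hypothetical pattern (`NAME_RE` and `split_name` are illustrative names), not the full add-books regex from this post, and it assumes the file extension has already been stripped.

```python
import re

# Simplified, hypothetical pattern: "author - title", where the title ends
# either at an opening parenthesis or at the end of the name.
NAME_RE = re.compile(r"^(?P<author>[^-]+?)\s*-\s*(?P<title>.+?)\s*(?:\(|$)")

def split_name(name):
    """Return (author, title) or None if the name does not fit the pattern."""
    m = NAME_RE.match(name)
    if m is None:
        return None
    return m.group("author"), m.group("title")

for name in ("Author name - Book title (htm)", "Author name - Book title"):
    print(split_name(name))
```

The lazy `.+?` is what makes the alternation useful: a greedy title would run to the end of the name and always satisfy `$`, keeping the `(htm)` suffix.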

Starson17 · 06-12-2010, 07:50 AM

Quote:

Originally Posted by TreborPugly

I have some files that are zipped, and the file name includes the format of the file inside the zip, like so:

Author name - Book title (htm).zip

So this will get imported with "Book title (htm)" as the title, rather than just "Book title". Then I have to manually merge things. The parenthesis might be (htm), or (rtf), or (txt), etc... I can get it to ignore the parenthesis by adding a [(] to the end of my regex, but then that breaks the adding for files that don't have a ( in them.

Try this:

Code:

^((?P<author>([^\_0-9]+)(?=\s*-\s*)(?!\s*-\s*[0-9.]+)|\b))(\s*-\s*)?((?P<series>[^0-9\-]+) ([-#] ?)?(?P<series_index>[0-9.]+)?\s*-\s*)?(?P<title>[^(]+)

It excludes the open paren from the title.
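
As a rough check of why this works: the title group is now `[^(]+`, which matches any run of characters that are not an opening parenthesis, so it stops just before `(htm)` when one is present and runs to the end of the name otherwise. The Python sketch below tries the pattern from this post on two sample names; `FILENAME_RE` and `parse` are illustrative names, the inputs are assumed to have the extension already removed, and the `.strip()` calls are an addition here because the captured groups can carry a trailing space.

```python
import re

# Pattern copied from the post above, split across lines for readability.
FILENAME_RE = re.compile(
    r"^((?P<author>([^\_0-9]+)(?=\s*-\s*)(?!\s*-\s*[0-9.]+)|\b))"
    r"(\s*-\s*)?"
    r"((?P<series>[^0-9\-]+) ([-#] ?)?(?P<series_index>[0-9.]+)?\s*-\s*)?"
    r"(?P<title>[^(]+)"
)

def parse(name):
    """Return (author, title) with surrounding whitespace removed, or None."""
    m = FILENAME_RE.match(name)
    if m is None:
        return None
    return m.group("author").strip(), m.group("title").strip()

for name in ("Author name - Book title (htm)", "Author name - Book title"):
    print(parse(name))
```

A title that legitimately contains an opening parenthesis would be cut short by this pattern, which may or may not matter for a given library.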


