Old 05-25-2010, 11:14 PM   #46
vinco
Junior Member
vinco began at the beginning.
 
Posts: 7
Karma: 10
Join Date: May 2010
Device: Nook
Quote:
Originally Posted by chaley View Post
With apologies in advance for asking, did you remember to check the 'Remove footer' checkbox?

In your example, there are two spaces between 'Page' and '1'. If that is an accurate copy, then you need to match more than 1 space there. Try
Code:
<b>Page +\d+</b></p><p>
If that doesn't work, then it wouldn't surprise me if there are newlines buried in the middle of the text you are trying to match. Try
Code:
<b>\s*Page +\d+\s*</b>\s*</p>\s*<p>
Remove footer is indeed checked. I tried your code, and it still fails to remove the offending sections. Anyone willing to take a stab at the PDF and try to figure out what I'm still doing wrong?
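
For reference, here is a minimal Python sketch of how the two quoted patterns differ. The sample markup is hypothetical (the wizard's actual HTML is not shown in this thread); it only illustrates why the second pattern tolerates newlines around the footer text while the first does not.

```python
import re

# Hypothetical footer markup with two spaces after "Page" and embedded
# newlines; this is NOT the real HTML from the PDF in question.
sample = "<b>\nPage  1\n</b>\n</p>\n<p>"

# First suggestion: allows several spaces, but no other whitespace.
strict = re.compile(r"<b>Page +\d+</b></p><p>")

# Second suggestion: \s* also absorbs newlines between the tags.
tolerant = re.compile(r"<b>\s*Page +\d+\s*</b>\s*</p>\s*<p>")

strict_hit = strict.search(sample) is not None
tolerant_hit = tolerant.search(sample) is not None
print(strict_hit, tolerant_hit)
```

Neither pattern helps if the markup seen during conversion uses different tags or entities than the markup shown in the wizard.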
Old 05-26-2010, 12:12 AM   #47
tonyx3
Connoisseur
tonyx3 began at the beginning.
 
Posts: 55
Karma: 10
Join Date: Jan 2010
Device: Nexus One
Sure, vinco,


Gimme a d/l link and I'll take a stab at it...
Old 05-26-2010, 02:53 AM   #48
tonyx3
Connoisseur
tonyx3 began at the beginning.
 
Posts: 55
Karma: 10
Join Date: Jan 2010
Device: Nexus One
Ok, so vinco sent me the pdf he's having trouble with, and I can confirm that the previously mentioned regexes, which highlight the proper matches in the tester, don't remove those matches when converting.

I have no idea why, but it's definitely a bug.

Oddly, when I used the resulting epub (which still had the page numbers) as the input, and adjusted the regex to match the page numbers and surrounding tags in the epub, it correctly removed them in the output. (vinco, this is your temporary workaround.)



So why is the syntax highlighter showing the regex matches, but the converter not removing them?
Old 05-26-2010, 09:58 AM   #49
vinco
Junior Member
vinco began at the beginning.
 
Posts: 7
Karma: 10
Join Date: May 2010
Device: Nook
Thanks for the assist, tonyx3. I'll put that workaround into force for now.
Old 05-26-2010, 11:37 AM   #50
kovidgoyal
creator of calibre
kovidgoyal ought to be getting tired of karma fortunes by now.
 
 
Posts: 43,869
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
Because the wizard doesn't operate on exactly the same HTML as the conversion pipeline; typically there may be white space differences between the two.
Old 05-26-2010, 12:14 PM   #51
vinco
Junior Member
vinco began at the beginning.
 
Posts: 7
Karma: 10
Join Date: May 2010
Device: Nook
Quote:
Originally Posted by kovidgoyal View Post
Because the wizard doesn't operate on exactly the same HTML as the conversion pipeline; typically there may be white space differences between the two.
Other than doing a series of conversions, do you have any workaround suggestions? I can get a copy of the PDF to you as well if interested.
Old 05-26-2010, 12:27 PM   #52
kovidgoyal
creator of calibre
kovidgoyal ought to be getting tired of karma fortunes by now.
 
 
Posts: 43,869
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
Use the debug option to look at the actual intermediate HTML generated by the conversion process.
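
A minimal sketch of one way to reach that intermediate HTML from a script, assuming calibre's `ebook-convert` command-line tool is on the PATH and that your version accepts the `--debug-pipeline` option. The file and folder names are placeholders, and the helper name is made up for this example.

```python
import shlex


def debug_convert_command(source, target, debug_dir):
    """Build an ebook-convert command that also saves intermediate stages.

    Assumes the installed calibre supports --debug-pipeline; check
    `ebook-convert --help` if in doubt.
    """
    return ["ebook-convert", source, target, "--debug-pipeline", debug_dir]


# Placeholder names, not files from this thread.
cmd = debug_convert_command("book.pdf", "book.epub", "debug_out")
print(shlex.join(cmd))
```

Passing `cmd` to `subprocess.run(cmd, check=True)` would run the conversion; the HTML to write the regex against is then whatever ends up in the debug folder, not what the wizard displays.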
Old 05-26-2010, 07:53 PM   #53
vinco
Junior Member
vinco began at the beginning.
 
Posts: 7
Karma: 10
Join Date: May 2010
Device: Nook
Quote:
Originally Posted by kovidgoyal View Post
Use the debug option to look at the actual intermediate HTML generated by the conversion process.
For anyone interested, using the debug route, a sample section becomes
Code:
&nbsp; &nbsp; &nbsp;Since nothing material was destroyed when the Eddorians were forced into the next plane of<br>
existence, their historical records also have become available. Those records-folios and tapes and<br>
playable discs of platinum alloy, resistant indefinitely even to Eddore's noxious atmosphere agree with<br>
those of the Arisians upon this point. Immediately before the Coalescence began there was one, and only<br>
<b>Page&nbsp;&nbsp;1</b><br>

<hr>
<A name=2></a>one, planetary solar system in the Second Galaxy; and, until the advent of Eddore, the Second Galaxy<br>
was entirely devoid of intelligent life.&nbsp;<br>
Old 05-26-2010, 11:19 PM   #54
tonyx3
Connoisseur
tonyx3 began at the beginning.
 
Posts: 55
Karma: 10
Join Date: Jan 2010
Device: Nexus One
So in this example, it looks like the problem is that the wizard can't tell the difference between a regular space and a non-breaking space, right?
That would be a problem. A 'white space difference' as Kovid said.
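
A minimal Python sketch of that mismatch, using a fragment of the intermediate HTML posted above. The earlier pattern expects literal spaces and `</p><p>`, while the debug output has `&nbsp;` entities and `<br>`. The adjusted pattern is only an assumption based on this one fragment; it has not been tried against the full PDF.

```python
import re

# Fragment copied from the intermediate HTML in the previous post.
debug_html = (
    "those of the Arisians upon this point. Immediately before the "
    "Coalescence began there was one, and only<br>\n"
    "<b>Page&nbsp;&nbsp;1</b><br>\n"
    "\n"
    "<hr>\n"
    "<A name=2></a>one, planetary solar system in the Second Galaxy;"
)

# Pattern suggested earlier in the thread: literal spaces, </p><p> tags.
earlier = re.compile(r"<b>\s*Page +\d+\s*</b>\s*</p>\s*<p>")

# Hypothetical adjustment: accept &nbsp; entities or whitespace between
# "Page" and the number, and expect <br> after the closing </b>.
footer = re.compile(r"<b>\s*Page(?:&nbsp;|\s)+\d+\s*</b>\s*<br>")

earlier_matches = earlier.search(debug_html) is not None
cleaned = footer.sub("", debug_html)
print(earlier_matches)
print("Page" in cleaned)
```

The point is less the exact pattern than the workflow: write the regex against the debug output rather than the wizard preview.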


Quote:
Originally Posted by kovidgoyal View Post
Because the wizard doesn't operate on exactly the same HTML as the conversion pipeline; typically there may be white space differences between the two.

Is there some reason for this?

I mean, I'm sure there's some reason, but is it absolutely necessary?
It seems like it would be better if we were able to write and test our regexes based on the code that the conversion pipeline actually uses, to avoid errors like this one.
Old 05-27-2010, 12:18 AM   #55
kovidgoyal
creator of calibre
kovidgoyal ought to be getting tired of karma fortunes by now.
 
 
Posts: 43,869
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
Well yeah, but the conversion pipeline can't be run (for various technical reasons) inside the GUI, so the GUI basically uses a trick to get an approximation of the conversion pipeline. It works fine in most cases, where you don't have unusual input files, but in some cases, like this, the approximation isn't good enough.

I could of course run the conversion pipeline in a separate process and then take the output of that into the GUI, but that is too much work. I prefer to spend the time just improving the PDF engine so it removes headers and footers automatically.
Old 05-27-2010, 03:07 AM   #56
tonyx3
Connoisseur
tonyx3 began at the beginning.
 
Posts: 55
Karma: 10
Join Date: Jan 2010
Device: Nexus One
I see. So calibre uses two different pdf-to-html engines?

The one used in the conversion pipeline is obviously returning different results from the one used in the regex wizard.



Quote:
I prefer to spend the time just improving the PDF engine so it removes headers and footers automatically.
That would be amazing.

Unfortunately, I've never once had the defaults work on removing headers or footers from PDFs. I've always had to write my own regex. And on multiple occasions I've had them match perfectly in the preview, and then not get removed in the conversion (which is one reason I wish the preview html matched the conversion html).

I'm sure PDF conversion, given the format's nature, must be one of the bigger headaches in developing the conversion system.
Old 05-27-2010, 04:54 AM   #57
DoctorOhh
US Navy, Retired
DoctorOhh ought to be getting tired of karma fortunes by now.
 
 
Posts: 9,864
Karma: 13806776
Join Date: Feb 2009
Location: North Carolina
Device: Icarus Illumina XL HD, Nexus 7
Quote:
Originally Posted by tonyx3 View Post
Quote:
Originally Posted by kovidgoyal View Post
I prefer to spend the time just improving the PDF engine so it removes headers and footers automatically.
That would be amazing.

Unfortunately, I've never once had the defaults work on removing headers or footers from PDFs.
I believe he is referring to improving a not-yet-released PDF engine, one which none of us has had a chance to try yet because it isn't finished.

Last edited by DoctorOhh; 05-27-2010 at 06:47 PM.
Old 05-27-2010, 06:02 AM   #58
tonyx3
Connoisseur
tonyx3 began at the beginning.
 
Posts: 55
Karma: 10
Join Date: Jan 2010
Device: Nexus One
Quote:
Originally Posted by dwanthny View Post
I believe he is referring to improving a not-yet-released PDF engine, one which none of us has had a chance to try yet because it isn't finished.
Awesome. I look forward to it!
Old 06-11-2010, 11:54 PM   #59
TreborPugly
Member
TreborPugly has learned how to read e-books
 
Posts: 13
Karma: 954
Join Date: Jun 2010
Device: Mobipocket reader on Blackberry, XO using FBreader, Kindle

Hi. I've been using Calibre for a few weeks and I'm really enjoying it.

I adopted a regular expression for Adding from this thread that does a great job for my files:

Code:
^((?P<author>([^\-_0-9]+)(?=\s*-\s*)(?!\s*-\s*[0-9.]+)|\b))(\s*-\s*)?((?P<series>[^0-9\-]+)(\s*-\s*)?(?P<series_index>[0-9.]+)\s*-\s*)?(?P<title>[^\-_0-9]+)
However, I have some files that are zipped, and the file name includes the format of the file inside the zip, like so:

Author name - Book title (htm).zip

So this will get imported with "Book title (htm)" as the title, rather than just "Book title". Then I have to manually merge things. The parenthesis might be (htm), or (rtf), or (txt), etc... I can get it to ignore the parenthesis by adding a [(] to the end of my regex, but then that breaks the adding for files that don't have a ( in them.

I'm new to regex, and I've done some reading of the reference suggested from inside of Calibre (which is how I learned enough to put my little addition on), but I've been trying unsuccessfully to figure out a way to use the | operator.

I'd be pleased with any solution that works, and if you have the time a brief description of why it works.

My expectation is that I want to match ( or nothing, but I'm not sure how to do the nothing, i.e., is there some way to tell it to start over if a match fails?

Thanks in advance.
Old 06-12-2010, 07:50 AM   #60
Starson17
Wizard
Starson17 can program the VCR without an owner's manual.
 
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by TreborPugly View Post
I have some files that are zipped, and the file name includes the format of the file inside the zip, like so:

Author name - Book title (htm).zip

So this will get imported with "Book title (htm)" as the title, rather than just "Book title". Then I have to manually merge things. The parenthesis might be (htm), or (rtf), or (txt), etc... I can get it to ignore the parenthesis by adding a [(] to the end of my regex, but then that breaks the adding for files that don't have a ( in them.
Try this:
Code:
^((?P<author>([^\_0-9]+)(?=\s*-\s*)(?!\s*-\s*[0-9.]+)|\b))(\s*-\s*)?((?P<series>[^0-9\-]+) ([-#] ?)?(?P<series_index>[0-9.]+)?\s*-\s*)?(?P<title>[^(]+)
It excludes the open paren from the title.
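
A minimal Python sketch comparing the two patterns on the filenames from the question. It assumes the `.zip` extension has already been stripped before matching and that surrounding spaces are trimmed afterwards; the `parse` helper is made up for this example and is not how calibre itself applies the pattern.

```python
import re

# Pattern from the original question: the title stops at '-', '_' or a digit.
ORIGINAL = re.compile(
    r"^((?P<author>([^\-_0-9]+)(?=\s*-\s*)(?!\s*-\s*[0-9.]+)|\b))(\s*-\s*)?"
    r"((?P<series>[^0-9\-]+)(\s*-\s*)?(?P<series_index>[0-9.]+)\s*-\s*)?"
    r"(?P<title>[^\-_0-9]+)"
)

# Suggested pattern: the title stops at the first '(' instead.
SUGGESTED = re.compile(
    r"^((?P<author>([^\_0-9]+)(?=\s*-\s*)(?!\s*-\s*[0-9.]+)|\b))(\s*-\s*)?"
    r"((?P<series>[^0-9\-]+) ([-#] ?)?(?P<series_index>[0-9.]+)?\s*-\s*)?"
    r"(?P<title>[^(]+)"
)


def parse(pattern, stem):
    """Return (author, title) with surrounding spaces trimmed."""
    match = pattern.search(stem)
    return match.group("author").strip(), match.group("title").strip()


before = parse(ORIGINAL, "Author name - Book title (htm)")
after = parse(SUGGESTED, "Author name - Book title (htm)")
plain = parse(SUGGESTED, "Author name - Book title")
print(before, after, plain)
```

As for why it works for both kinds of filename: `[^(]+` means "one or more characters that are not `(`", so the title simply runs until the first `(` or the end of the name, whichever comes first. No `|` alternative for "nothing" is needed.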

Last edited by Starson17; 06-12-2010 at 08:12 AM.