|  05-25-2010, 11:14 PM | #46 | |
| Junior Member  Posts: 7 Karma: 10 Join Date: May 2010 Device: Nook | Quote: 
 | |
|   | 
|  05-26-2010, 12:12 AM | #47 | 
| Connoisseur  Posts: 55 Karma: 10 Join Date: Jan 2010 Device: Nexus One | 
			
			Sure, vinco, Gimme a d/l link and I'll take a stab at it... | 
|   | 
|  05-26-2010, 02:53 AM | #48 | 
| Connoisseur  Posts: 55 Karma: 10 Join Date: Jan 2010 Device: Nexus One | 
			
			Ok, so vinco sent me the PDF he's having trouble with, and I can confirm that the previously mentioned regexes, which highlight the proper matches in the tester, don't remove those matches when converting. I have no idea why, but it's definitely a bug. Oddly, when I used the resultant epub (which still had the page numbers) as the input, and adjusted the regex to match the page numbers and surrounding tags in the epub, it correctly removed them in the output. (vinco, this is your temporary workaround.) So why is the syntax highlighter showing the regex matches, but the converter not removing them? | 
|   | 
|  05-26-2010, 09:58 AM | #49 | 
| Junior Member  Posts: 7 Karma: 10 Join Date: May 2010 Device: Nook | 
			
			Thanks for the assist, tonyx3.  I'll put that workaround into force for now.
		 | 
|   | 
|  05-26-2010, 11:37 AM | #50 | 
| creator of calibre            Posts: 45,598 Karma: 28548962 Join Date: Oct 2006 Location: Mumbai, India Device: Various | 
			
			Because the wizard doesn't operate on exactly the same HTML as the conversion pipeline. Typically there are whitespace differences between the two.
		 | 
|   | 
|  05-26-2010, 12:14 PM | #51 | 
| Junior Member  Posts: 7 Karma: 10 Join Date: May 2010 Device: Nook | 
			
			Other than doing a series of conversions, do you have any workaround suggestions?  I can get a copy of the PDF to you as well if interested.
		 | 
|   | 
|  05-26-2010, 12:27 PM | #52 | 
| creator of calibre            Posts: 45,598 Karma: 28548962 Join Date: Oct 2006 Location: Mumbai, India Device: Various | 
			
			Use the debug option to look at the actual intermediate HTML generated by the conversion process.
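
Once the intermediate HTML has been saved, the whitespace around a stubborn match can be made visible with `repr()`. The following is a minimal Python sketch, not part of calibre; the helper name `show_context` and the file path in the usage note are hypothetical. As far as I know the debug option is exposed on the command line as `ebook-convert --debug-pipeline`, but check your version's help output before relying on that.

```python
import re

def show_context(html, needle, width=20):
    """Return repr() snippets around each occurrence of `needle`.

    repr() prints invisible characters as escapes, so a non-breaking
    space shows up as \\xa0 instead of looking like an ordinary space.
    """
    snippets = []
    for m in re.finditer(re.escape(needle), html):
        start = max(0, m.start() - width)
        end = min(len(html), m.end() + width)
        snippets.append(repr(html[start:end]))
    return snippets
```

Usage, assuming the intermediate file was saved somewhere like `debug/input/index.html` (a made-up path): read it with `open(path, encoding="utf-8").read()`, then print each entry of `show_context(text, "Page")` and write the regex against what you actually see there rather than against the wizard preview.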
		 | 
|   | 
|  05-26-2010, 07:53 PM | #53 | |
| Junior Member  Posts: 7 Karma: 10 Join Date: May 2010 Device: Nook | Quote: 
 Code:      Since nothing material was destroyed when the Eddorians were forced into the next plane of<br> existence, their historical records also have become available. Those records-folios and tapes and<br> playable discs of platinum alloy, resistant indefinitely even to Eddore's noxious atmosphere agree with<br> those of the Arisians upon this point. Immediately before the Coalescence began there was one, and only<br> <b>Page  1</b><br> <hr> <A name=2></a>one, planetary solar system in the Second Galaxy; and, until the advent of Eddore, the Second Galaxy<br> was entirely devoid of intelligent life. <br> | |
|   | 
|  05-26-2010, 11:19 PM | #54 | |
| Connoisseur  Posts: 55 Karma: 10 Join Date: Jan 2010 Device: Nexus One | 
			
			So in this example, it looks like the problem is that the wizard can't tell the difference between a regular space and a non-breaking space, right?   That would be a problem. A 'white space difference' as Kovid said. Quote: 
 Is there some reason for this? I mean, I'm sure there's some reason, but is it absolutely necessary? It seems like it would be better if we were able to write and test our regexes based on the code that the conversion pipeline actually uses, to avoid errors like this one. | |
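
The space versus non-breaking space mismatch can be reproduced outside calibre. This is a small Python 3 sketch under stated assumptions: the sample string is modelled on the footer quoted above, and the placement of the non-breaking space (U+00A0) after "Page" is a guess made for illustration. Neither pattern is the one vinco actually used.

```python
import re

# Sample modelled on the quoted footer. ASSUMPTION: the second character
# after "Page" is a non-breaking space (U+00A0), not an ordinary space.
html = "only<br> <b>Page \u00a01</b><br> <hr> <A name=2></a>one, planetary"

# Written with two literal spaces, as it would look when copied from a preview.
literal = re.compile(r"<b>Page  \d+</b><br> <hr> <A name=\d+></a>")

# Tolerates any mix of ordinary whitespace and U+00A0 between the tokens.
tolerant = re.compile(
    r"<b>Page[\s\u00a0]+\d+</b><br>[\s\u00a0]*<hr>[\s\u00a0]*<A name=\d+></a>"
)

def strip_footer(pattern, text):
    """Remove every match of `pattern` from `text`."""
    return pattern.sub("", text)

unchanged = strip_footer(literal, html)   # literal spaces miss the U+00A0
cleaned = strip_footer(tolerant, html)    # footer and anchor are removed
```

In Python 3, `\s` on a `str` pattern already covers U+00A0; the explicit `\u00a0` is there for engines or modes where `\s` is ASCII-only, which may have been the case for the Python version calibre used at the time.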
|   | 
|  05-27-2010, 12:18 AM | #55 | 
| creator of calibre            Posts: 45,598 Karma: 28548962 Join Date: Oct 2006 Location: Mumbai, India Device: Various | 
			
			Well yeah, but the conversion pipeline can't be run (for various technical reasons) inside the GUI, so the GUI basically uses a trick to get an approximation of the conversion pipeline. It works fine in most cases, where you don't have unusual input files, but in some cases, like this, the approximation isn't good enough. I could of course run the conversion pipeline in a separate process and then take the output of that into the GUI, but that is too much work. I prefer to spend the time just improving the PDF engine so it removes headers and footers automatically. | 
|   | 
|  05-27-2010, 03:07 AM | #56 | |
| Connoisseur  Posts: 55 Karma: 10 Join Date: Jan 2010 Device: Nexus One | 
			
			I see.  So calibre uses two different pdf-to-html engines?   The one used in the conversion pipeline is obviously returning different results from the one used in the regex wizard. Quote: 
 Unfortunately, I've never once had the defaults work on removing headers or footers from PDFs. I've always had to write my own regex. And on multiple occasions I've had them match perfectly in the preview, and then not get removed in the conversion (which is one reason I wish the preview HTML matched the conversion HTML). I'm sure PDF conversion, given the format's nature, must be one of the bigger headaches in developing the conversion system. | |
|   | 
|  05-27-2010, 04:54 AM | #57 | 
| US Navy, Retired            Posts: 9,897 Karma: 13806776 Join Date: Feb 2009 Location: North Carolina Device: Icarus Illumina XL HD, Kindle PaperWhite SE 11th Gen | 
			
			I believe he is referring to improving a not yet released PDF engine, one which none of us has had a chance to try yet because it isn't finished.
		 Last edited by DoctorOhh; 05-27-2010 at 06:47 PM. | 
|   | 
|  05-27-2010, 06:02 AM | #58 | 
| Connoisseur  Posts: 55 Karma: 10 Join Date: Jan 2010 Device: Nexus One | |
|   | 
|  06-11-2010, 11:54 PM | #59 | 
| Member         Posts: 13 Karma: 954 Join Date: Jun 2010 Device: Mobipocket reader on Blackberry, XO using FBreader, Kindle |   
			
			Hi. I've been using Calibre for a few weeks and I'm really enjoying it. I adopted a regular expression for Adding from this thread that does a great job for my files:
			Code: ^((?P<author>([^\-_0-9]+)(?=\s*-\s*)(?!\s*-\s*[0-9.]+)|\b))(\s*-\s*)?((?P<series>[^0-9\-]+)(\s*-\s*)?(?P<series_index>[0-9.]+)\s*-\s*)?(?P<title>[^\-_0-9]+)
			Author name - Book title (htm).zip
			So this will get imported with "Book title (htm)" as the title, rather than just "Book title". Then I have to manually merge things. The text in parentheses might be (htm), or (rtf), or (txt), etc. I can get it to ignore the parentheses by adding a [(] to the end of my regex, but then that breaks the adding for files that don't have a ( in them. I'm new to regex, and I've done some reading of the references suggested from inside Calibre (which is how I learned enough to put my little addition on), but I've been trying unsuccessfully to figure out a way to use the | operator. I'd be pleased with any solution that works, and, if you have the time, a brief description of why it works. My expectation is that I want to match ( or nothing, but I'm not sure how to do the nothing. I.e., is there some way to tell it to start over if a match fails? Thanks in advance. | 
|   | 
|  06-12-2010, 07:50 AM | #60 | |
| Wizard            Posts: 4,004 Karma: 177841 Join Date: Dec 2009 Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T | Quote: 
 Code: ^((?P<author>([^\_0-9]+)(?=\s*-\s*)(?!\s*-\s*[0-9.]+)|\b))(\s*-\s*)?((?P<series>[^0-9\-]+) ([-#] ?)?(?P<series_index>[0-9.]+)?\s*-\s*)?(?P<title>[^(]+) Last edited by Starson17; 06-12-2010 at 08:12 AM. | |
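
Why the revised title group works can be checked with Python's `re` module. This is only a sketch: both patterns are copied from the two posts above (including the spacing in the revised one as it appears here), the helper `parse` is made up for the example, and it assumes the regex is applied to the file name without its extension. Calibre's own handling of the match may differ.

```python
import re

# Pattern from post #59: the title group stops only at '-', '_' or a digit.
ORIGINAL = re.compile(
    r"^((?P<author>([^\-_0-9]+)(?=\s*-\s*)(?!\s*-\s*[0-9.]+)|\b))(\s*-\s*)?"
    r"((?P<series>[^0-9\-]+)(\s*-\s*)?(?P<series_index>[0-9.]+)\s*-\s*)?"
    r"(?P<title>[^\-_0-9]+)"
)

# Pattern from post #60: the title group [^(]+ stops at the first '('.
REVISED = re.compile(
    r"^((?P<author>([^\_0-9]+)(?=\s*-\s*)(?!\s*-\s*[0-9.]+)|\b))(\s*-\s*)?"
    r"((?P<series>[^0-9\-]+) ([-#] ?)?(?P<series_index>[0-9.]+)?\s*-\s*)?"
    r"(?P<title>[^(]+)"
)

def parse(pattern, name):
    """Hypothetical helper: named groups with surrounding spaces trimmed."""
    m = pattern.match(name)
    if m is None:
        return None
    return {k: (v.strip() if v else v) for k, v in m.groupdict().items()}

with_parens = "Author name - Book title (htm)"
without_parens = "Author name - Book title"

old = parse(ORIGINAL, with_parens)     # title keeps the "(htm)" suffix
new = parse(REVISED, with_parens)      # title ends before the "("
plain = parse(REVISED, without_parens) # no "(" present, title still matches
```

The "match ( or nothing" behaviour comes from the negated class: `[^(]+` consumes characters until it reaches a `(` or runs out of input, so no alternation with `|` is needed. The trade-off is that a title which legitimately contains a parenthesis would be cut short.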
|   | 
|  | 