| 
			
			 | 
		#16 | 
| 
			
			
			
			 Wizard 
			
			Posts: 1,337 
				Karma: 123457 
				Join Date: Apr 2009 
				Location: Malaysia 
				
				
				Device: PRS-650, iPhone 
				
				
				 | 
	
	
	
		
		
		
		
		 
			
			Did you miss the post about how Calibre does this already today?  You use the document as a dictionary to see whether the word already exists without a hyphen.  This technique automatically handles all languages and made-up/obscure words.
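
			A minimal sketch of the document-as-dictionary idea, for illustration only. This is not Calibre's actual code: the function name `dehyphenate` and both regular expressions are assumptions made for this example. It joins a word split across a line break only when the joined form appears unbroken elsewhere in the same text, and otherwise leaves the hyphen alone.

```python
import re

# A run of letters in any script (no digits, no underscore).
WORD_RE = re.compile(r"[^\W\d_]+")
# Word fragment, hyphen, line break, continuation fragment.
BROKEN_RE = re.compile(r"([^\W\d_]+)-[ \t]*\n[ \t]*([^\W\d_]+)")


def dehyphenate(text):
    """Join line-break hyphenations whose joined form occurs elsewhere in text.

    Hypothetical helper for illustration; not taken from Calibre.
    """
    # The document itself is the dictionary: every unbroken word it contains.
    vocabulary = {word.lower() for word in WORD_RE.findall(text)}

    def join_if_known(match):
        head, tail = match.group(1), match.group(2)
        if (head + tail).lower() in vocabulary:
            # The joined word is used elsewhere, so the hyphen was a line-break artifact.
            return head + tail
        # Unknown joined form: leave the text untouched (favor false negatives).
        return match.group(0)

    return BROKEN_RE.sub(join_if_known, text)


if __name__ == "__main__":
    sample = "An example of text.\nThis is another exam-\nple, and a well-\nknown one.\n"
    print(dehyphenate(sample))
```

			Because the vocabulary comes from the document, nothing here is language-specific, and a word that never appears unbroken is simply left hyphenated.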
		 
		
	
		
		
		
		
		
		
		
		
		
		
	
	 | 
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
| 
			
			 | 
		#17 | 
| 
			
			
			
			 Grand Sorcerer 
			
			Posts: 13,698 
				Karma: 79983758 
				Join Date: Nov 2007 
				Location: Toronto 
				
				
				Device: Libra H2O, Libra Colour 
				
				
				 | 
	
	
	
		
		
		
		
		 
			
			Yes... but I hate to say there are still cases where a dictionary approach will fail, and JS wants perfection...
		 
		
	
		
		
		
		
		
		
		
		
		
		
	
	 | 
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
| 
			
			 | 
		#18 | 
| 
			
			
			
			 Wizard 
			
			Posts: 1,337 
				Karma: 123457 
				Join Date: Apr 2009 
				Location: Malaysia 
				
				
				Device: PRS-650, iPhone 
				
				
				 | 
	
	
	
		
		
		
		
		 
			
			Ah, I interpreted JS's comment as another vote for false negatives over false positives.  Using the document as a dictionary can't guarantee you'll remove every hyphen that should be removed, but it's an excellent technique to ensure that all the ones which are supposed to stay will stay. 
		
	
		
		
		
		
		
		
		
		
		
		
	
	Implementing proper multi-language stemming and adding an optional external dictionary would increase the detection rate even more, but it's debatable whether that's worth the effort.  | 
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
| 
			
			 | 
		#19 | |
| 
			
			
			
			 US Navy, Retired 
			
			Posts: 9,897 
				Karma: 13806776 
				Join Date: Feb 2009 
				Location: North Carolina 
				
				
				Device: Icarus Illumina XL HD, Kindle PaperWhite SE 11th Gen 
				
				
				 | 
	
	
	
		
		
		
		
		 Quote: 
	
  | 
|
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
| 
			
			 | 
		#20 | 
| 
			
			
			
			 Zealot 
			
			Posts: 133 
				Karma: 2142 
				Join Date: Oct 2011 
				Location: Spain 
				
				
				Device: I'm an iRex man: 8x DR1000S, 4x DR800SG, 4x DR800S 
				
				
				 | 
	
	
	
		
		
		
		
		 
			
			I'd be very interested in giving this a try if it generated clean HTML. Does it? What's the current status?
		 
		
	
		
		
		
		
		
		
		
		
		
		
	
	 | 
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
| 
			
			 | 
		#21 | 
| 
			
			
			
			 Member 
			
			Posts: 10 
				Karma: 1538 
				Join Date: Sep 2011 
				Location: Sweden 
				
				
				Device: Sony PRS-350 
				
				
				 | 
	
	
	
		
		
		
		
		 
			
			Of course. The current implementation still outperforms my solution greatly both in speed and quality for some cases. Looks like I will need more development time than expected.
		 
		
	
		
		
		
		
		
		
		
		
		
		
	
	 | 
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
| 
			
			 | 
		#22 | 
| 
			
			
			
			 Zealot 
			
			Posts: 133 
				Karma: 2142 
				Join Date: Oct 2011 
				Location: Spain 
				
				
				Device: I'm an iRex man: 8x DR1000S, 4x DR800SG, 4x DR800S 
				
				
				 | 
	
	
	
		
		
		
		
		 
			
			I don't get it. Your solution is still greatly outperformed by the current implementation of what? Calibre? 
		
	
		
		
		
		
		
		
		
		
		
		
	
	With regard to speed, I'd say interpreted languages like Python are not your friends, but anyway... Maybe adding to what outperforms you is a better option than trying to replace it, unless you have a good reason to code independently.  | 
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
| 
			
			 | 
		#23 | |
| 
			
			
			
			 US Navy, Retired 
			
			Posts: 9,897 
				Karma: 13806776 
				Join Date: Feb 2009 
				Location: North Carolina 
				
				
				Device: Icarus Illumina XL HD, Kindle PaperWhite SE 11th Gen 
				
				
				 | 
	
	
	
		
		
		
		
		 
			
			There isn't anything to get.  roffLOL is developing this, and in his opinion calibre's current implementation is still a little better and a little faster in some cases. 
		
	
		
		
		
		
		
		
		
		
		
		
	
	Quote: 
	
  | 
|
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
| 
			
			 | 
		#24 | ||
| 
			
			
			
			 Member 
			
			Posts: 10 
				Karma: 1538 
				Join Date: Sep 2011 
				Location: Sweden 
				
				
				Device: Sony PRS-350 
				
				
				 | 
	
	
	
		
		
		
		
		 Quote: 
	
 Quote: 
	
 Besides, I'm not beaten yet. Some cases are not all cases, and some cases may be fixed. If I cannot match calibre's current implementation, I will work on it instead. To be honest I haven't even looked at it yet, but it has shown some weird errors (like dropping doubles of tightly spaced l:s (L)), which makes me suspect that our implementation approaches differ at quite a low level. There is value in trying different approaches too. Are double columns even in use? I have found only a single book laid out that way.  | 
||
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
| 
			
			 | 
		#25 | |
| 
			
			
			
			 Wizard 
			
			Posts: 4,553 
				Karma: 950151 
				Join Date: Nov 2008 
				
				
				
				Device: Sony PRS-950, iphone/ipad (Marvin/iBooks/QuickReader) 
				
				
				 | 
	
	
	
		
		
		
		
		 Quote: 
	
 Of course the real solution is to not start with PDF, but often this is the only format available.  | 
|
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
| 
			
			 | 
		#26 | 
| 
			
			
			
			 Member 
			
			Posts: 10 
				Karma: 1538 
				Join Date: Sep 2011 
				Location: Sweden 
				
				
				Device: Sony PRS-350 
				
				
				 | 
	
	
	
		
		
		
		
		 
			
			But for magazines you can't really expect a double column either; it might as well be three or more. Magazines also often follow a weird logical structure, so even if the columns were identified, appending them in the correct order would be error-prone.
		 
		
	
		
		
		
		
		
		
		
		
		
		
	
	 | 
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
| 
			
			 | 
		#27 | |
| 
			
			
			
			 Wizard 
			
			Posts: 1,337 
				Karma: 123457 
				Join Date: Apr 2009 
				Location: Malaysia 
				
				
				Device: PRS-650, iPhone 
				
				
				 | 
	
	
	
		
		
		
		
		 Quote: 
	
 Not sure which part of your progress is hitting snags, but the new pdf engine in Calibre does an initial conversion from pdf to xml using compiled code. The XML retains all the critical formatting information. The output Calibre produces today does not use the XML I'm talking about. You need to use calibre from the CLI with debug enabled - add the argument --new-pdf-engine if you want to see what I'm talking about.
 Last edited by ldolse; 10-21-2011 at 11:07 AM.  | 
|
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
| 
			
			 | 
		#28 | 
| 
			
			
			
			 Member 
			
			Posts: 10 
				Karma: 1538 
				Join Date: Sep 2011 
				Location: Sweden 
				
				
				Device: Sony PRS-350 
				
				
				 | 
	
	
	
		
		
		
		
		 
			
			Thanks! I shall try it. If it is for the benefit of academics and sci-fi readers, it should certainly be supported, no matter the cost  
		
	
		
		
		
		
		
		
		
		
		
		
	
	Any source for such sci-fi mags?  | 
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
| 
			
			 | 
		#29 | 
| 
			
			
			
			 Wizard 
			
			Posts: 1,337 
				Karma: 123457 
				Join Date: Apr 2009 
				Location: Malaysia 
				
				
				Device: PRS-650, iPhone 
				
				
				 | 
	
	
	
		
		
		
		
		 
			
			Well the example I was thinking of is here: 
		
	
		
		
		
		
		
		
		
		
		
		
		
			http://www.starshipsofa.com/anthology/ebook/
			Not sure of other good sources; I just know that I've seen the two-column format used in print for this type of content.
			Edit - I don't think these use two columns, but since you seem to be interested in other sci-fi sources:
			http://www.hubfiction.com/
			http://www.heliotropemag.com/category/heliotrope-issue/  
		Last edited by ldolse; 10-21-2011 at 03:11 PM.  | 
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
| 
			
			 | 
		#30 | 
| 
			
			
			
			 Zealot 
			
			Posts: 133 
				Karma: 2142 
				Join Date: Oct 2011 
				Location: Spain 
				
				
				Device: I'm an iRex man: 8x DR1000S, 4x DR800SG, 4x DR800S 
				
				
				 | 
	
	
	
		
		
			
			 
				
				The "real thing".
			 
			
			
			If you need ideas, I'd have a look at PDF.js. After all, I doubt conversion from PDF to HTML can go beyond that  
		
	
		
		
		
		
		
		
		
		
		
		
		
			 
		Last edited by MrWarper; 10-30-2011 at 03:32 PM. Reason: title, typo  | 
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 