#16 |
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
|
Did you miss the post about how Calibre does this already today? You use the document as a dictionary to see if the word already exists without a hyphen. This technique automatically handles all languages and made-up/obscure words.
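A minimal sketch of the document-as-dictionary idea, for illustration only. This is not Calibre's actual code; the function name `dehyphenate` and both regular expressions are assumptions made up for this example. It collects every word in the text, then joins a word split by a hyphen at a line break only when the joined form already occurs somewhere else in the same document.

```python
import re

# Hypothetical patterns for this sketch (not taken from Calibre).
# [^\W\d_]+ matches a run of letters in any script, so no per-language setup is needed.
LINE_BREAK_HYPHEN = re.compile(r"([^\W\d_]+)-\s*\n\s*([^\W\d_]+)")
WORD = re.compile(r"[^\W\d_]+")


def dehyphenate(text):
    """Join words hyphenated across a line break when the joined form
    is already used elsewhere in the same text."""
    # The document itself is the dictionary: every word it contains.
    vocabulary = {word.lower() for word in WORD.findall(text)}

    def join_if_known(match):
        head, tail = match.group(1), match.group(2)
        joined = head + tail
        if joined.lower() in vocabulary:
            # The unhyphenated word exists in the document: drop the hyphen.
            return joined
        # Unknown joined form: keep the hyphen, remove only the line break.
        return head + "-" + tail

    return LINE_BREAK_HYPHEN.sub(join_if_known, text)


if __name__ == "__main__":
    sample = "The conversion works.\nEach con-\nversion keeps well-\nknown hyphens.\n"
    print(dehyphenate(sample))
```

With this sample, "con-/version" is joined because "conversion" appears earlier in the text, while "well-/known" keeps its hyphen because "wellknown" never appears. A split word that occurs only once in the document would also keep its hyphen, so the sketch errs toward leaving hyphens in rather than removing ones that belong.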
|
#17 |
Grand Sorcerer
Posts: 13,299
Karma: 78876004
Join Date: Nov 2007
Location: Toronto
Device: Libra H2O, Libra Colour
|
Yes... but I hate to say there are still cases where a dictionary approach will fail, and JS wants perfection....
|
#18 |
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
|
Ah, I interpreted JS's comment as another vote for false negatives vs. false positives. Using the document as a dict can't guarantee you'll remove every hyphen that should be removed, but it's an excellent technique to ensure that all the ones which are supposed to stay will stay.
Implementing proper multi-language stemming and adding an optional external dictionary would increase the detection rate even more, but it's debatable whether that's worth the effort. |
#19 | |
US Navy, Retired
Posts: 9,889
Karma: 13806776
Join Date: Feb 2009
Location: North Carolina
Device: Icarus Illumina XL HD, Kindle PaperWhite SE 11th Gen
|
Quote:
|
|
#20 |
Zealot
Posts: 133
Karma: 2142
Join Date: Oct 2011
Location: Spain
Device: I'm an iRex man: 8x DR1000S, 4x DR800SG, 4x DR800S
|
I'd be very interested in giving this a try if it generated clean HTML. Does it? What's the current status?
|
#21 |
Member
Posts: 10
Karma: 1538
Join Date: Sep 2011
Location: Sweden
Device: Sony PRS-350
|
Of course. The current implementation still outperforms my solution greatly both in speed and quality for some cases. Looks like I will need more development time than expected.
|
#22 |
Zealot
Posts: 133
Karma: 2142
Join Date: Oct 2011
Location: Spain
Device: I'm an iRex man: 8x DR1000S, 4x DR800SG, 4x DR800S
|
I don't get it. Your solution is still greatly outperformed by the current implementation of what? Calibre?
Wrt speed, I'd say interpreted languages like Python are not your friends, but anyway... Maybe adding to what outperforms you is a better option than trying to replace it, unless you have a good reason to code independently. |
#23 | |
US Navy, Retired
Posts: 9,889
Karma: 13806776
Join Date: Feb 2009
Location: North Carolina
Device: Icarus Illumina XL HD, Kindle PaperWhite SE 11th Gen
|
There isn't anything to get. roffLOL is developing this, and in his opinion calibre's current implementation is still a little better and a little faster in some cases.
Quote:
|
|
#24 | ||
Member
Posts: 10
Karma: 1538
Join Date: Sep 2011
Location: Sweden
Device: Sony PRS-350
|
Quote:
Quote:
Besides, I'm not beaten yet. Some cases is not all cases, and some cases may be fixed. If I cannot match calibre's current implementation, I will work on it instead. To be honest I haven't even looked at it yet, but it has shown some weird errors (like dropping doubles of tightly spaced l:s (L)), which makes me suspect that our implementation approaches differ on quite a low level. There is value in trying different approaches too. Are double columns even in use? I have found a single book with a layout in that manner. |
||
#25 | |
Wizard
Posts: 4,553
Karma: 950151
Join Date: Nov 2008
Device: Sony PRS-950, iphone/ipad (Marvin/iBooks/QuickReader)
|
Quote:
Of course the real solution is to not start with PDF, but often this is the only format available. |
|
#26 |
Member
Posts: 10
Karma: 1538
Join Date: Sep 2011
Location: Sweden
Device: Sony PRS-350
|
But for magazines you can't really expect a double column either; it might as well be three or more, and magazines often follow a weird logical structure, so even if the columns were identified, appending them in the correct order would be error-prone.
|
#27 | |
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
|
Quote:
Not sure which part of your progress is hitting snags, but the new PDF engine in Calibre does an initial conversion from PDF to XML using compiled code. The XML retains all the critical formatting information. The output Calibre produces today does not use the XML I'm talking about. You need to use Calibre from the CLI with debug enabled - add the argument --new-pdf-engine if you want to see what I'm talking about.
Last edited by ldolse; 10-21-2011 at 10:07 AM. |
|
#28 |
Member
Posts: 10
Karma: 1538
Join Date: Sep 2011
Location: Sweden
Device: Sony PRS-350
|
Thanks! I shall try it. If it is for the benefit of academics and sci-fi readers, it should certainly be supported, no matter the cost.
Any source for such sci-fi-magz? |
#29 |
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
|
Well, the example I was thinking of is here:
http://www.starshipsofa.com/anthology/ebook/
Not sure of other good sources, just know that I've seen the two-column format used in print for this type of content.
Edit - I don't think these use two columns, but since you seem to be interested in other sci-fi sources:
http://www.hubfiction.com/
http://www.heliotropemag.com/category/heliotrope-issue/
Last edited by ldolse; 10-21-2011 at 02:11 PM. |
#30 |
Zealot
Posts: 133
Karma: 2142
Join Date: Oct 2011
Location: Spain
Device: I'm an iRex man: 8x DR1000S, 4x DR800SG, 4x DR800S
|
The "real thing".
If you need ideas, I'd have a look at PDF.js. After all, I doubt conversion from PDF to HTML can go beyond that.
Last edited by MrWarper; 10-30-2011 at 02:32 PM. Reason: title, typo |
Tags |
conversion, pdf |
|