PDF -> HTML conversion - Page 2

ldolse · 10-03-2011, 09:32 PM

Quote:

Originally Posted by PeterT

How on earth do you expect a program to tell whether or not the hyphen SHOULD be there!

Did you miss the post about how Calibre does this already today? You use the document as a dictionary to see if the word already exists without a hyphen. This technique automatically handles all languages and made-up/obscure words.
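A minimal sketch of the document-as-dictionary idea described above, in Python. This is not calibre's actual code; the function names `build_vocabulary` and `dehyphenate` and both regular expressions are assumptions made for illustration. A word split by a hyphen at a line break is joined only if the joined form appears elsewhere in the same document; otherwise the hyphen is kept.

```python
import re

# A run of letters in any script (no digits, no underscore).
WORD_RE = re.compile(r"[^\W\d_]+")
# A word fragment ending in a hyphen at a line break, plus its continuation.
LINE_BREAK_HYPHEN_RE = re.compile(r"([^\W\d_]+)-\s*\n\s*([^\W\d_]+)")


def build_vocabulary(text):
    """Collect the lowercase words of the document itself (hypothetical helper)."""
    # Drop the split words first so their fragments do not pollute the vocabulary.
    unsplit_text = LINE_BREAK_HYPHEN_RE.sub(" ", text)
    return {word.lower() for word in WORD_RE.findall(unsplit_text)}


def dehyphenate(text):
    """Join line-break hyphenations only when the joined word occurs in the document."""
    vocabulary = build_vocabulary(text)

    def join_if_known(match):
        head, tail = match.group(1), match.group(2)
        joined = head + tail
        if joined.lower() in vocabulary:
            return joined
        # Unknown joined form: keep the hyphen and remove only the line break.
        # This may leave a hyphen that should have gone (a false negative),
        # but it never deletes one that belongs there.
        return head + "-" + tail

    return LINE_BREAK_HYPHEN_RE.sub(join_if_known, text)


sample = (
    "The conversion is slow.\n"
    "A well-known con-\n"
    "version tool is a well-\n"
    "known thing, but the hyph-\n"
    "enation stays.\n"
)
print(dehyphenate(sample))
```

In the sample, "con-/version" is joined because "conversion" occurs elsewhere in the text, "well-/known" keeps its hyphen because "wellknown" never occurs, and "hyph-/enation" is left hyphenated for manual clean-up because the document offers no evidence either way.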

PeterT · 10-03-2011, 09:39 PM

Yes... but I hate to say there are still cases where a dictionary approach will fail, and JS wants perfection...

ldolse · 10-03-2011, 09:47 PM

Ah, I interpreted JS's comment as another vote for false negatives vs. false positives. Using the document as a dict can't guarantee you'll remove every hyphen that should be removed, but it's an excellent technique to ensure that all the ones which are supposed to stay will stay.

Implementing proper multi-language stemming and adding an optional external dictionary would increase the detection rate even more, but it's debatable whether that's worth the effort.

DoctorOhh · 10-03-2011, 10:50 PM

Quote:

Originally Posted by ldolse

Quote:

Originally Posted by PeterT

Yes... but I hate to say there are still cases where a dictionary approach will fail, and JS wants perfection...

Ah, I interpreted JS's comment as another vote for false negatives vs. false positives. Using the document as a dict can't guarantee you'll remove every hyphen that should be removed, but it's an excellent technique to ensure that all the ones which are supposed to stay will stay.

@PeterT - While JS is often a perfectionist in certain things (like aspect ratio) saying JS wants perfection from his small input into this conversation seems akin to mind reading to me. We would all like perfection, but I interpreted JS's comment as another vote for false negatives vs. false positives too. As long as the wrong hyphens are not deleted, afterwards individual editing can deliver perfection.

MrWarper · 10-14-2011, 03:00 AM

I'd be very interested in giving this a try if it generated clean HTML. Does it? What's the current status?

roffLOL · 10-16-2011, 04:11 PM

Quote:

Originally Posted by MrWarper

I'd be very interested in giving this a try if it generated clean HTML. Does it? What's the current status?

Of course. The current implementation still outperforms my solution greatly both in speed and quality for some cases. Looks like I will need more development time than expected.

MrWarper · 10-17-2011, 08:48 AM

I don't get it. Your solution is still greatly outperformed by the current implementation of what? Calibre?

With respect to speed, I'd say interpreted languages like Python are not your friends, but anyway... Maybe adding to what outperforms you is a better option than trying to replace it, unless you have a good reason to code independently.

DoctorOhh · 10-17-2011, 08:54 AM

Quote:

Originally Posted by MrWarper

I don't get it.

There isn't anything to get. roffLOL is developing this, and in his opinion calibre's current implementation is still a little better and a little faster in some cases.

Quote:

"The current implementation still outperforms my solution greatly both in speed and quality for some cases"

I would like to see roffLOL take a look at calibre's "in development" PDF conversion engine that handles 2 columns and see if he could add the finishing touches to that conversion engine.

roffLOL · 10-21-2011, 05:13 AM

Quote:

Originally Posted by MrWarper

With respect to speed, I'd say interpreted languages like Python are not your friends, but anyway.

Sure, but I feel that speed comes more often from good design, and slowness from the lack of it, than from a language's benchmarks =)

Quote:

Originally Posted by MrWarper

Maybe adding to what outperforms you is a better option than trying to replace it, unless you have a good reason to code independently.

Good reason? I felt like it. Competition and whatnot. Recreation after a sucky workday at a sucky workplace producing sucky code. It's not about doing the optimal thing, but about having fun. Even if my work were to be wasted, at least my time wasn't.
Besides, I'm not beaten yet. Some cases is not all cases, and some cases may be fixed.

If I cannot match calibre's current implementation, I will work on it instead.
To be honest I haven't even looked at it yet, but it has shown some weird errors (like dropping doubles of tightly spaced l's (L)), which makes me suspect that our implementation approaches differ at quite a low level. There is value in trying different approaches too.

Quote:

Originally Posted by dwanthny

I would like to see roffLOL take a look at calibre's "in development" PDF conversion engine that handles 2 columns and see if he could add the finishing touches to that conversion engine.

Are double columns even in use? I have found only a single book laid out that way.

itimpi · 10-21-2011, 06:10 AM

Quote:

Originally Posted by roffLOL

Are double columns even in use? I have found only a single book laid out that way.

Probably not very common in fiction-oriented books. It is very common in magazines, and users may often want to be able to convert them for a reader device. I have also seen it in books that are technical in nature, and these are again likely conversion candidates.

Of course the real solution is to not start with PDF, but often this is the only format available.

roffLOL · 10-21-2011, 06:27 AM

But for magazines you can't really expect a double column either; it might as well be three or more, and magazines often follow a weird logical structure, so even if the columns were identified, appending them in the correct order would be error-prone.
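To make the ordering problem concrete, here is a naive two-column reading-order sketch in Python. It is not the approach of calibre's new PDF engine or of roffLOL's converter; the `TextBlock` type, its fields, and `two_column_reading_order` are all hypothetical. It assumes exactly two columns split at the middle of the page, which is precisely the assumption that three-column pages, full-width headings, and magazine sidebars would break.

```python
from dataclasses import dataclass


@dataclass
class TextBlock:
    """A positioned piece of text on a page (hypothetical structure)."""
    text: str
    x0: float   # left edge
    x1: float   # right edge
    top: float  # distance from the top of the page


def two_column_reading_order(blocks, page_width):
    """Order blocks as left column top-to-bottom, then right column top-to-bottom."""
    middle = page_width / 2.0

    def in_left_column(block):
        # Assign a block to a column by the horizontal centre of its box.
        return (block.x0 + block.x1) / 2.0 < middle

    left = [b for b in blocks if in_left_column(b)]
    right = [b for b in blocks if not in_left_column(b)]
    return sorted(left, key=lambda b: b.top) + sorted(right, key=lambda b: b.top)


page = [
    TextBlock("R1", 320, 580, 50),
    TextBlock("L2", 20, 280, 300),
    TextBlock("L1", 20, 280, 50),
    TextBlock("R2", 320, 580, 300),
]
print([b.text for b in two_column_reading_order(page, page_width=600)])
```

A heading that spans both columns has its centre near the middle of the page, so this sketch would push it into the right column, where it sorts ahead of the right-column text but after the whole left column. That is one small instance of the ordering errors mentioned above.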

ldolse · 10-21-2011, 10:04 AM

Quote:

Originally Posted by roffLOL

But for magazines you can't really expect a double column either; it might as well be three or more, and magazines often follow a weird logical structure, so even if the columns were identified, appending them in the correct order would be error-prone.

For the most popular magazines this is true, but there are classes of mags where the two-column layout is quite common: science fiction and other writing-oriented mags, for example. It's also a very common format in academic journals. I believe that the only thing the new PDF engine attempts to support is two-column.

Not sure which part of your progress is hitting snags, but the new pdf engine in Calibre does an initial conversion from pdf to xml using compiled code. The XML retains all the critical formatting information. The output Calibre produces today does not use the XML I'm talking about. You need to use calibre from the CLI with debug enabled - add the argument --new-pdf-engine if you want to see what I'm talking about.

roffLOL · 10-21-2011, 01:38 PM

Thanks! I shall try it. If it is for the benefit of academics and sci-fi readers, it should certainly be supported, no matter the cost.

Any source for such sci-fi-magz?

ldolse · 10-21-2011, 01:56 PM

Well the example I was thinking of is here:
http://www.starshipsofa.com/anthology/ebook/

Not sure of other good sources, just know that I've seen the two column format used in print for this type of content.

Edit - I don't think these use two columns, but since you seem to be interested in other sci-fi sources:
http://www.hubfiction.com/
http://www.heliotropemag.com/category/heliotrope-issue/

MrWarper · 10-30-2011, 02:32 PM

If you need ideas, I'd have a look at PDF.js. After all, I doubt conversion from PDF to HTML can go beyond that.
