#1 |
Junior Member
Posts: 3
Karma: 10
Join Date: Jan 2013
Device: sony
|
Any plans to upgrade pdftohtml.exe?
While trying to convert a PDF, I came across a lot of the listed issues.
Additional searching made me realize that the heart of that conversion relies on pdftohtml release 0.36 from 24 June 2003, which in turn relied on a pretty old version of the underlying xpdf library. Searching further, I found that someone has published a modified version of the source that claims to fix the paragraph issue: http://minnie.tuhs.org/Programs/Pdftohtml/index.html . That version still seems to rely on poppler 0.8, while poppler now seems to be at 0.22. Has anyone had a chance to look at these? Are there any plans to repackage and build pdftohtml against the latest library, to get the benefit of its evolution: the claimed fix for line breaks and paragraphs, and likely others?
#2 |
creator of calibre
Posts: 45,195
Karma: 27110894
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
The version of pdftohtml in calibre comes from the latest release of poppler.
|
#3 |
Junior Member
Posts: 3
Karma: 10
Join Date: Jan 2013
Device: sony
|
OK, thanks. Then it likely means that the person modified the original source of pdftohtml itself.
I will have a further look after work, but it seems that /HtmlOutputDev.cc is where some of the changes occur, with the introduction of reflow to better handle paragraph and <br> generation:

    // Heuristic: if the last character in str1 is a hyphen,
    // turn off addNewline. This will "glue" hyphenated words
    // that have been split over multiple lines.
    if (reFlow && str1->text[str1->len -1] == '-') {
      addNewline=0;
      // Also remove the hyphen
      str1->len--;
      str1->htext->del(str1->htext->getLength() - 1, 1);
    }
    //printf("coalesce %d %d %f? ", str1->dir, str2->dir, d);
    // Is str2 a new paragraph?
    if (nextLine && (
    ====== and after
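For readers without a C++ toolchain, the hyphen-gluing idea in the quoted fragment can be tried on already-extracted text lines. The following is a minimal sketch in Python rather than the C++ of HtmlOutputDev.cc; the function name `glue_lines` and its `reflow` parameter are hypothetical and are not part of pdftohtml or calibre.

```python
def glue_lines(lines, reflow=True):
    """Join visual lines into one string, gluing words split by a hyphen.

    Hypothetical helper: it only mirrors the idea of the quoted heuristic
    (a trailing '-' suppresses the separator and is removed).
    """
    out = ""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore empty visual lines
        if reflow and out.endswith("-"):
            # Drop the trailing hyphen and append with no separator.
            out = out[:-1] + line
        elif out:
            out += " " + line
        else:
            out = line
    return out


print(glue_lines(["was an En-", "glish author"]))
```

Like the original heuristic, this also removes hyphens that are genuinely part of a compound word when the compound happens to be split at a line end, so the result may still need a manual pass.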
#4 | |
US Navy, Retired
Posts: 9,889
Karma: 13806776
Join Date: Feb 2009
Location: North Carolina
Device: Icarus Illumina XL HD, Kindle PaperWhite SE 11th Gen
|
Quote:
I think it is safe to say the changes have long ago been incorporated into calibre. |
|
#5 |
Junior Member
Posts: 3
Karma: 10
Join Date: Jan 2013
Device: sony
|
Let's go back to the origin. The reason for my question isn't to argue about versions but to try to fix an issue, and to share some research results in case they benefit others.
I have a PDF file that I wanted to convert to EPUB to make it easier to read. I face three issues and some additional minor ones. Two can be fixed through regexp replacement rules, and many of the minor ones are handled pretty well by calibre's heuristic mode. Unfortunately, the paragraph issue remains. That is why I started with some hypotheses and a direction, but if the creator of the solution states that he builds with the latest version, it means the fix has to be found somewhere else. That is why I made the second post: it seems the change isn't deep in the library but rather in another file of pdftohtml itself, one which hasn't been changed for a long time.
The paragraph issue mentioned in the article does exist for me, and if I use the PDF from that article, I get the output below. You will see that each visual line ends with a <br>, while in reality the paragraph ends later. So I believe this confirms that calibre processes it the same way, and so if someone has made an improvement it could be beneficial to have a look.
Now, I'm a newbie to all of this, but truly the end result could be improved, and if that is the solution ... I don't have a developer workstation or a Microsoft compiler. I started downloading Cygwin, gcc, and several make tools (make, imake and cmake), but this does not seem very productive given the number of files and the need to identify the library dependencies.

    <i><b>From Wikipedia, the free encyclopedia</b></i><br>
    Douglas Noël Adams (11 March 1952 – 11 May 2001) was an En-<br>
    glish author, comic radio dramatist, and musician. He is best<br>
    known as the author of the <i>Hitchhiker’s Guide to the Galaxy </i>series.<br>
    <i>Hitchhiker’s </i>began on radio, and developed into a “trilogy” of five<br>
    books (which sold more than fifteen million copies during his life-<br>
    time) as well as a television series, a comic book series, a computer<br>
    game, and a feature film that was completed after Adams’ death.<br>
    The series has also been adapted for live theatre using various<br>
    scripts; the earliest such productions used material newly written<br>
    by Adams. He was known to some fans as <i>Bop Ad </i>(after his illegi-<br>
    ble signature), or by his initials ‘DNA’; he was born the year before<br>
    the elucidation of the structure of “<i>the meaning of life</i>” or D.N.A. by<br>
    Francis Crick and James Watson in Cambridge i.e. where he was<br>
    born.<br>
    In addition to <i>The Hitchhiker’s Guide to the Galaxy</i>, Douglas Adams<br>
    wrote or co-wrote three stories of the science fiction television se-<br>
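Since every visual line in this kind of output ends with a <br>, one possible workaround that needs no compiler is to post-process the HTML. The sketch below is in Python (not the C++ of pdftohtml) and rests on an assumption that is not taken from calibre or poppler: a line whose visible text is much shorter than the widest line is treated as the end of a paragraph. The name `br_to_paragraphs` and the `short_ratio` threshold are hypothetical.

```python
import re

BR = re.compile(r"<br\s*/?>", re.IGNORECASE)  # line separators in the output
TAG = re.compile(r"<[^>]+>")                  # any inline tag, e.g. <i>, </b>


def br_to_paragraphs(html, short_ratio=0.6):
    """Group <br>-terminated visual lines into paragraphs.

    Assumed heuristic: a line shorter than short_ratio * widest line
    closes the current paragraph. A trailing '-' glues a split word.
    """
    lines = [s.strip() for s in BR.split(html)]
    lines = [s for s in lines if s]
    if not lines:
        return []
    # Measure visible text only, so inline tags do not distort widths.
    widest = max(len(TAG.sub("", s)) for s in lines)
    paragraphs, current = [], ""
    for s in lines:
        if current.endswith("-"):
            current = current[:-1] + s   # glue hyphenated word
        elif current:
            current += " " + s
        else:
            current = s
        if len(TAG.sub("", s)) < short_ratio * widest:
            paragraphs.append(current)   # short line: paragraph ends here
            current = ""
    if current:
        paragraphs.append(current)
    return paragraphs
```

Each returned string could then be wrapped in <p>...</p>. The threshold will misfire on short lines that occur in the middle of a paragraph and on paragraphs whose last line happens to be full width, so it is a starting point for experiments rather than a fix.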