Old 01-22-2013, 04:22 PM   #1
nymano
Junior Member
Posts: 3
Karma: 10
Join Date: Jan 2013
Device: sony
Any plans to upgrade pdftohtml.exe?

While converting a PDF, I came across a lot of the listed elements.
Additional searching made me realize that the heart of that conversion relies on pdftohtml release 0.36 from 24 June 2003, which in turn relied on a pretty old version of the underlying xpdf library.
Searching further, I found that someone has published a modified version of the source that claims to fix the paragraph issue:
http://minnie.tuhs.org/Programs/Pdftohtml/index.html

Still, this seems to rely on poppler 0.8, while they now seem to be at 0.22.
Has anyone had a chance to look at these? Are there any plans to repackage and build pdftohtml against the latest library, to benefit from its evolution: the claimed line break and paragraph fixes, and likely others?
Old 01-22-2013, 10:05 PM   #2
kovidgoyal
creator of calibre
Posts: 24,816
Karma: 4369673
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
The version of pdftohtml in calibre comes from the latest release of poppler.
Old 01-23-2013, 02:42 AM   #3
nymano
OK, thanks. Then it likely means that the person modified the original pdftohtml source.

I will have a further look after work, but it seems that /HtmlOutputDev.cc
is where some of the changes occur, with the introduction of reflow to better handle paragraph and <br> generation.
// Heuristic: if the last character in str1 is a hyphen,
// turn off addNewline. This will "glue" hyphenated words
// that have been split over multiple lines.
if (reFlow && str1->text[str1->len - 1] == '-') {
    addNewline = 0;
    // Also remove the hyphen
    str1->len--;
    str1->htext->del(str1->htext->getLength() - 1, 1);
}

//printf("coalesce %d %d %f? ", str1->dir, str2->dir, d);

// Is str2 a new paragraph?
if (nextLine && (
======
and after
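
To make the quoted heuristic easier to follow outside of poppler, here is a minimal standalone sketch of the same idea. The function name glueLines and its signature are hypothetical and do not come from HtmlOutputDev.cc; joining with a plain newline when no hyphen is found is also an assumption, since the excerpt does not show what the real code emits in that case.

```cpp
#include <string>

// Hypothetical helper, not part of poppler or calibre.
// Joins two consecutive visual lines of text.
std::string glueLines(std::string first, const std::string& second, bool reFlow) {
    // With reflow enabled, a trailing hyphen is treated as a word split
    // across lines: drop the hyphen and join with no separator.
    if (reFlow && !first.empty() && first.back() == '-') {
        first.pop_back();
        return first + second;
    }
    // Assumption: otherwise the line break is kept as a newline.
    return first + "\n" + second;
}
```

For example, glueLines("was an En-", "glish author", true) gives "was an English author". Note that a hyphen that really belongs to the word (such as the one in "co-wrote") would also be removed if it happens to fall at the end of a line.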
Old 01-23-2013, 03:27 AM   #4
DoctorOhh
US Navy, Retired
Posts: 8,570
Karma: 12369681
Join Date: Feb 2009
Location: North Carolina
Device: Nexus 7
Quote:
Originally Posted by nymano View Post
While further searching i found someone published a modified version of source claiming to fix the paragraph issue
http://minnie.tuhs.org/Programs/Pdftohtml/index.html . ..
Interesting, but the "paragraph issue" described does not exist in calibre. The version of poppler referred to was 0.8.3, and it looks like the changes were added in v0.8.5. The latest version, 0.22, is 73 versions removed from 0.8.5.

I think it is safe to say the changes have long since been incorporated into calibre.
Old 01-23-2013, 03:25 PM   #5
nymano
Let's go back to the origin. The reason for the question isn't to argue about versions but to try to fix an issue, and to share some research results in case they benefit others.

I have a PDF file that I wanted to convert to EPUB to make it easier to read.
I face 3 issues and some additional minor ones.
Two can be fixed through regexp replacement rules, and many of the minor ones are handled pretty well by calibre's heuristic mode.

Unfortunately, the paragraph issue does exist.
That is why I started with some hypotheses and directions, but if the creator of the solution states that he builds with the latest version, it means the fix has to be looked for somewhere else.

That is why I made the second post: it seems the changes aren't deep in the library but rather in another file of pdftohtml itself, which hasn't been changed for a long time.

The paragraph issue mentioned in the article does exist for me, and if I use the PDF of that article, the output looks like the excerpt below.
You will see that each visual line ends with a <br>, but in reality the paragraph ends later. So I believe this confirms that calibre processes it the same way, and as such, if someone has made an improvement, it could be beneficial to have a look.
Now, I'm a newbie to all this, but truly the end result could be improved, and if that is the solution...

I don't have a developer workstation or a Microsoft compiler. I started to download Cygwin, gcc, and several make, imake and cmake tools, but this doesn't seem very productive given the number of files plus the need to identify the library dependencies.


<i><b>From Wikipedia, the free encyclopedia</b></i><br>
Douglas Noël Adams (11 March 1952 – 11 May 2001) was an En-<br>
glish author, comic radio dramatist, and musician. He is best<br>
known as the author of the <i>Hitchhiker’s Guide to the Galaxy </i>series.<br>
<i>Hitchhiker’s </i>began on radio, and developed into a “trilogy” of five<br>
books (which sold more than fifteen million copies during his life-<br>
time) as well as a television series, a comic book series, a computer<br>
game, and a feature film that was completed after Adams’ death.<br>
The series has also been adapted for live theatre using various<br>
scripts; the earliest such productions used material newly written<br>
by Adams. He was known to some fans as <i>Bop Ad </i>(after his illegi-<br>
ble signature), or by his initials ‘DNA’; he was born the year before<br>
the elucidation of the structure of “<i>the meaning of life</i>” or D.N.A. by<br>
Francis Crick and James Watson in Cambridge i.e. where he was<br>
born.<br>
In addition to <i>The Hitchhiker’s Guide to the Galaxy</i>, Douglas Adams<br>
wrote or co-wrote three stories of the science fiction television se-<br>
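
As a rough illustration of how output like the excerpt above could be merged back into paragraphs after conversion, here is a small standalone sketch. It is not calibre or poppler code: the function name reflowLines, the shortRatio parameter, and the rule "a paragraph ends on a line that is clearly shorter than the longest line and ends in sentence punctuation" are all assumptions made for the example. Line lengths are counted in bytes, so accented characters and curly quotes skew the measure slightly.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical post-processing helper (an assumption, not an existing API).
// Input: one string per visual line, each possibly ending in "<br>".
// Output: one string per guessed paragraph.
std::vector<std::string> reflowLines(const std::vector<std::string>& lines,
                                     double shortRatio = 0.75) {
    const std::string br = "<br>";
    std::vector<std::string> stripped;
    std::size_t longest = 0;
    for (std::string line : lines) {
        // Drop a trailing <br> if present.
        if (line.size() >= br.size() &&
            line.compare(line.size() - br.size(), br.size(), br) == 0) {
            line.erase(line.size() - br.size());
        }
        if (line.size() > longest) longest = line.size();
        stripped.push_back(line);
    }

    std::vector<std::string> paragraphs;
    std::string current;
    bool glued = false;  // true when the previous line ended in a hyphen
    for (const std::string& line : stripped) {
        if (line.empty()) continue;
        // Separate lines with a space unless a split word is being rejoined.
        if (!current.empty() && !glued) current += ' ';
        current += line;
        glued = false;
        const char last = line.back();
        const bool isShort =
            static_cast<double>(line.size()) < shortRatio * static_cast<double>(longest);
        if (last == '-') {
            // Word split across lines: remove the hyphen, join without a space.
            current.pop_back();
            glued = true;
        } else if (isShort && (last == '.' || last == '!' || last == '?')) {
            // Short line ending a sentence: treat it as the end of a paragraph.
            paragraphs.push_back(current);
            current.clear();
        }
    }
    if (!current.empty()) paragraphs.push_back(current);
    return paragraphs;
}
```

On the excerpt above, the short line "born." would close the first paragraph under this rule, and "En-" plus "glish" would be rejoined as "English". A paragraph whose last line happens to be long would be missed, which is why this is only a sketch of the idea.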