Old 05-18-2010, 03:43 PM   #1
miquel
Junior Member
Posts: 7
Karma: 10
Join Date: May 2010
Location: Heidelberg, Germany
Device: Amazon Kindle 2
Question PDF line unwrap

Hi,
I wanted to give a hand in auto-detecting line breaks, headers and footers in PDFs, so I've been tinkering with the code.

Now, am I a bit slow, or is opt_unwrap_factor picked up in the gui, and never carried over for conversion?

Should this be happening at ./ebooks/conversion/preprocess.py:252 ?

Thanks!
Miquel
Old 05-18-2010, 03:57 PM   #2
Starson17
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by miquel View Post
Now, am I a bit slow, or is opt_unwrap_factor picked up in the gui, and never carried over for conversion?

Should this be happening at ./ebooks/conversion/preprocess.py:252 ?
I'm not 100% sure what you're asking. FYI, here is where unwrap_factor and opt_unwrap_factor are used in the code (courtesy of UltraEdit; recent code, but not the latest).

Code:
Find 'unwrap_factor' in ' src\calibre\ebooks\conversion\preprocess.py' :
 src\calibre\ebooks\conversion\preprocess.py/252:         if getattr(self.extra_opts, 'unwrap_factor', 0.0) > 0.01:
 src\calibre\ebooks\conversion\preprocess.py/253:             length = line_length(html, getattr(self.extra_opts, 'unwrap_factor'))
Found 'unwrap_factor' 2 time(s).
----------------------------------------
Find 'unwrap_factor' in ' src\calibre\ebooks\html\input.py' :
 src\calibre\ebooks\html\input.py/266:         OptionRecommendation(name='unwrap_factor', recommended_value=0.0,
Found 'unwrap_factor' 1 time(s).
----------------------------------------
Find 'unwrap_factor' in ' src\calibre\ebooks\pdb\pdf\reader.py' :
 src\calibre\ebooks\pdb\pdf\reader.py/24:         setattr(self.options, 'unwrap_factor', 0.5)
Found 'unwrap_factor' 1 time(s).
----------------------------------------
Find 'unwrap_factor' in ' src\calibre\ebooks\pdf\input.py' :
 src\calibre\ebooks\pdf\input.py/25:         OptionRecommendation(name='unwrap_factor', recommended_value=0.5,
Found 'unwrap_factor' 1 time(s).
----------------------------------------
Find 'unwrap_factor' in ' src\calibre\gui2\convert\pdf_input.py' :
 src\calibre\gui2\convert\pdf_input.py/17:             ['no_images', 'unwrap_factor'])
Found 'unwrap_factor' 1 time(s).
----------------------------------------
Find 'unwrap_factor' in ' src\calibre\gui2\convert\pdf_input.ui' :
 src\calibre\gui2\convert\pdf_input.ui/23:       <cstring>opt_unwrap_factor</cstring>
 src\calibre\gui2\convert\pdf_input.ui/41:     <widget class="QDoubleSpinBox" name="opt_unwrap_factor">
Found 'unwrap_factor' 2 time(s).
----------------------------------------
Find 'unwrap_factor' in ' src\calibre\gui2\convert\pdf_input_ui.py' :
 src\calibre\gui2\convert\pdf_input_ui.py/23:         self.opt_unwrap_factor = QtGui.QDoubleSpinBox(Form)
 src\calibre\gui2\convert\pdf_input_ui.py/24:         self.opt_unwrap_factor.setMaximum(1.0)
 src\calibre\gui2\convert\pdf_input_ui.py/25:         self.opt_unwrap_factor.setSingleStep(0.01)
 src\calibre\gui2\convert\pdf_input_ui.py/26:         self.opt_unwrap_factor.setProperty("value", 0.5)
 src\calibre\gui2\convert\pdf_input_ui.py/27:         self.opt_unwrap_factor.setObjectName("opt_unwrap_factor")
 src\calibre\gui2\convert\pdf_input_ui.py/28:         self.gridLayout.addWidget(self.opt_unwrap_factor, 0, 1, 1, 1)
 src\calibre\gui2\convert\pdf_input_ui.py/32:         self.label_2.setBuddy(self.opt_unwrap_factor)
Found 'unwrap_factor' 8 time(s).
Search complete, found 'unwrap_factor' 16 time(s). (7 files.)
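
For readers following the listing, here is a minimal sketch of how the guard at preprocess.py line 252 behaves for the defaults shown above. The `should_unwrap` helper and the `SimpleNamespace` stand-ins for `extra_opts` are hypothetical, written only for illustration; they are not part of calibre.

```python
from types import SimpleNamespace


def should_unwrap(extra_opts):
    # Same test as the listed line 252: a missing attribute falls back
    # to 0.0, so unwrapping is skipped unless the factor exceeds 0.01.
    return getattr(extra_opts, 'unwrap_factor', 0.0) > 0.01


# Stand-ins for the options object (hypothetical, for illustration only).
pdf_opts = SimpleNamespace(unwrap_factor=0.5)   # PDF input default in the listing
html_opts = SimpleNamespace(unwrap_factor=0.0)  # HTML input default in the listing
no_opts = SimpleNamespace()                     # option never set

print(should_unwrap(pdf_opts), should_unwrap(html_opts), should_unwrap(no_opts))
```

So with the listed defaults the unwrap branch should only be entered for PDF input, provided the option actually reaches `extra_opts`.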
Old 05-19-2010, 10:48 AM   #3
miquel
Hey, thanks for checking!
Yes, I grepped with the same results. The only place where the unwrap_factor property seems to be read is in preprocess.py. Problem is, I've added print statements around that and they don't get shown. That's why I'm asking if it's used in practice.

I'm going to do some more homework then, and see why it never gets to the print statements when converting from PDF. I just wanted to make sure unwrapping hadn't been disabled for some reason, and I was on a wild goose chase!

Miquel
Old 05-19-2010, 02:14 PM   #4
Starson17
Quote:
Originally Posted by miquel View Post
I've added print statements around that and they don't get shown. That's why I'm asking if it's used in practice.
AFAIK, it's still used, although I can't confirm whether you've got the right variable name. I assume you're using calibre-debug -g to start Calibre and see your print statements?
Old 05-21-2010, 04:37 PM   #5
miquel
Oh darn... Mea culpa... I wasn't starting with calibre-debug. The unwrap code is indeed called, and I have to do a better job of reading the documentation :S

Thanks a lot!
Old 05-23-2010, 06:05 PM   #6
miquel
Hi again,
Starson17, thanks a lot for your help on this thread. I've submitted a patch with a different approach to line unwrapping, let's see what people think!
Patch's here: http://bugs.calibre-ebook.com/ticket/5597
Old 05-23-2010, 06:38 PM   #7
Starson17
Quote:
Originally Posted by miquel View Post
Hi again,
Starson17, thanks a lot for your help on this thread. I've submitted a patch with a different approach to line unwrapping, let's see what people think!
Patch's here: http://bugs.calibre-ebook.com/ticket/5597
Read Kovid's comments on your ticket regarding the new pdf conversion engine.
Old 05-23-2010, 07:25 PM   #8
DoctorOhh
US Navy, Retired
Posts: 9,864
Karma: 13806776
Join Date: Feb 2009
Location: North Carolina
Device: Icarus Illumina XL HD, Nexus 7
Quote:
Originally Posted by miquel View Post
Hi again,
Starson17, thanks a lot for your help on this thread. I've submitted a patch with a different approach to line unwrapping, let's see what people think!
Patch's here: http://bugs.calibre-ebook.com/ticket/5597
I have no knowledge in this area, but could the method this fellow took in creating this extension for OpenOffice.org's Writer be applied to cleaning up PDF file conversions?

From his page:

Quote:
Do you have problems with
texts having unwanted
line breaks like
this one?

This happens because there are some unwanted paragraph marks along the text. If we take the text from a PDF, inevitably we will get a paragraph mark at each end of line.

Now, or you delete them one by one with a lot of patience, or you can use the macro MyTXTcleaner that will do the work for you.
Old 05-25-2010, 08:51 AM   #9
miquel
Hey there,
Yup, I got Kovid's comment, but haven't had a chance to look into the new pdf conversion engine yet. I'll try and port this functionality there, if the new engine doesn't already support it. Also, there might be lessons to be learned or reused from MyTXTcleaner (thx dwanthny).

The other thing I was looking into was autodetecting the regex for headers and footers, btw, which is another match for this engine.
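
To make the header/footer idea concrete, here is a rough sketch of one way such a regex could be autodetected. It is an assumption for illustration, not the approach in the patch: it takes each page as a list of text lines, and the names `normalize`, `detect_repeated_edge` and `min_share` are hypothetical.

```python
import re
from collections import Counter


def normalize(line):
    # Replace digit runs so "Page 12" and "Page 13" compare equal.
    return re.sub(r'\d+', '#', line.strip())


def detect_repeated_edge(pages, index, min_share=0.6):
    """Return a regex for the line at `index` (0 = first, -1 = last) if its
    normalized form repeats on at least `min_share` of the pages."""
    candidates = [normalize(page[index]) for page in pages if page]
    if not candidates:
        return None
    text, count = Counter(candidates).most_common(1)[0]
    if count / len(candidates) < min_share:
        return None
    # Escape the literal text, then let the '#' placeholder match any number.
    return '^' + re.escape(text).replace(re.escape('#'), r'\d+') + '$'


pages = [
    ["My Book", "some body text", "Page 1"],
    ["My Book", "more body text", "Page 2"],
    ["My Book", "even more text", "Page 3"],
]
header_re = detect_repeated_edge(pages, 0)
footer_re = detect_repeated_edge(pages, -1)
```

A literal '#' in a real header would also be turned into a number match here, so this is only a starting point.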

Talk to you soon with more news!
Miquel
Old 05-25-2010, 09:55 AM   #10
Starson17
Quote:
Originally Posted by miquel View Post
Talk to you soon with more news!
Miquel
Contributions are always very welcome. PDF conversion is one area that certainly needs work. There are many users looking forward to the new PDF conversion engine (although I think the new 0.7.x release is even more important).
Old 05-25-2010, 11:00 AM   #11
kovidgoyal
creator of calibre
Posts: 43,826
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
As it's going to be a little while before I can work on the new engine, if you want to work on it, feel free.
Old 05-25-2010, 11:18 AM   #12
Starson17
Quote:
Originally Posted by kovidgoyal View Post
As it's going to be a little while before i can work on the new engine, if you want to work on it, feel free.
I was hoping to encourage Miquel to work on it
Look at how well it worked with Charles - the fence got whitewashed and all I had to do was point to where it needed a few touchups.

I don't have much confidence that PDF conversion will ever be very good. I'd personally choose to work on something that I thought would have a decent chance of being successful, rather than something that will always have problems. Perhaps I'm wrong about how successful a PDF conversion can be, but I've disliked PDFs for a very long time.
Old 05-25-2010, 11:23 AM   #13
kovidgoyal
Yeah, PDF conversions will never get to the point of, say, LIT or MOBI conversions. But the existing level can be improved significantly by doing context-aware line unwrapping, which means using things like the statistics for line lengths on the whole page, font size changes, spacing between lines, etc. to detect whether a line should be unwrapped or not.

And by doing all this you get header and footer and multicolumn support for free.
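
As a rough illustration of the line-length part of that idea only (font size and line spacing are left out), here is a minimal sketch. The function name `unwrap_lines`, the percentile rule and the punctuation check are assumptions made for this example, not calibre's implementation.

```python
def unwrap_lines(lines, factor=0.5):
    """Join lines that look soft-wrapped.

    A line counts as wrapped when it is at least as long as the length at
    the `factor` percentile of all non-empty lines and it does not end in
    sentence-final punctuation.
    """
    stripped = [line.strip() for line in lines]
    lengths = sorted(len(text) for text in stripped if text)
    if not lengths:
        return []
    threshold = lengths[min(int(len(lengths) * factor), len(lengths) - 1)]

    paragraphs, current = [], []
    for text in stripped:
        if not text:
            # A blank line always closes the paragraph in progress.
            if current:
                paragraphs.append(' '.join(current))
                current = []
            continue
        current.append(text)
        wrapped = len(text) >= threshold and not text.endswith(('.', '!', '?', ':'))
        if not wrapped:
            paragraphs.append(' '.join(current))
            current = []
    if current:
        paragraphs.append(' '.join(current))
    return paragraphs


sample = [
    "This is a long line of text that was wrapped by the",
    "PDF layout engine and should be joined with the next",
    "line.",
    "Short heading",
    "Another long paragraph line which also continues onto",
    "the following line here.",
]
result = unwrap_lines(sample)
```

Computing the threshold per page rather than per document would be closer to the "statistics on the whole page" wording, at the cost of noisier numbers on short pages.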
Old 05-26-2010, 04:25 PM   #14
miquel
Hello? Right here guys
Like it or not, there's plenty of pdf books out there, and conversion is really not up to par today, so, who cares whose fence it is?
Anyway, I already said I'd have a look
Old 05-26-2010, 05:18 PM   #15
miquel
OK Kovid, I'd like to confirm a couple of things with you, please
The new pdf engine:

1. Takes the pdf file, and passes it to the C plugin implementation of PDF reflow. That returns an xml with the pdf's draw commands (a pdf in xml if you will)

2. PDFDocument takes the xml and generates the html that's used as a base for conversion

3. The rest of ebook conversion takes the html into whatever other format is needed

My plan would then be to hack into PDFDocument, take the xml, do the unwrapping and header+footer detection, and end up making the html there.

Is that what you had in mind? Or did you intend the reflow plugin to, you know, reflow (i.e. unwrap) the pdf? I personally prefer pdfreflow being a pdf-to-xml-that-we-can-work-on-in-python converter.
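
To show the shape of steps 2 and 3 as I understand them, here is a toy sketch of turning such an xml into html in Python. The xml layout (`document`, `page`, `text`, the `top` attribute), the `xml_to_html` function and the `max_gap` rule are all made up for illustration; the real pdfreflow output and PDFDocument API may look quite different.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical xml; NOT the actual format produced by the reflow plugin.
SAMPLE_XML = """<document>
  <page number="1">
    <text top="100" left="50">First line of a</text>
    <text top="112" left="50">paragraph.</text>
    <text top="150" left="50">Second paragraph.</text>
  </page>
</document>"""


def xml_to_html(xml_string, max_gap=20):
    """Join consecutive text nodes into one paragraph unless the vertical
    gap between them exceeds `max_gap` (an assumed threshold)."""
    root = ET.fromstring(xml_string)
    paragraphs = []
    for page in root.iter('page'):
        current, last_top = [], None
        for node in page.iter('text'):
            top = float(node.get('top'))
            if last_top is not None and top - last_top > max_gap:
                paragraphs.append(' '.join(current))
                current = []
            current.append((node.text or '').strip())
            last_top = top
        if current:
            paragraphs.append(' '.join(current))
    body = '\n'.join('<p>%s</p>' % escape(p) for p in paragraphs)
    return '<html><body>\n%s\n</body></html>' % body


html_out = xml_to_html(SAMPLE_XML)
```

The unwrapping and header+footer detection would slot in between parsing the xml and emitting the `<p>` elements.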

Did I get it right? What did you have in mind?
Thanks!
Tags
conversion, linebreak, pdf, unwrap
