MobileRead Forums - View Single Post

kiwidude · 03-30-2011, 04:17 AM

@user_none - thanks for taking the time to take a look and explain. I really know zero about PDF conversion, so I appreciate the information you have given here and on the ticket. I fully expected that if it was anything other than a trivial change there would not be any interest in changing the code. I will try Kovid's suggestion of taking a look at the reflow.py stuff for processing PDF files.

@drMerry - to be honest I don't have a massive interest in trying to support really badly OCR'd documents. I would much rather support the majority of what users are after which is an ISBN from valid documents (that they are less likely to be binning!). As performance is already an issue I don't intend to compound that.

I have a suggestion from Kovid about some alternative code in Calibre to use for handling PDFs (part of the new in-progress PDF engine), so I will give that a go and see what options, if any, I could introduce around it. I haven't looked at it yet, but from what Kovid mentioned, scanning only the first 10 pages will be easy to support. I would guess, however, that scanning the last few pages might not be possible without scanning the whole thing. Still, at least offering the page limit as a config option could help performance for the majority of docs where the ISBN is at the front.
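
To make the "first 10 pages only" idea concrete, here is a rough sketch of what I have in mind. It is not Calibre code: `find_isbn`, `is_valid_isbn`, `ISBN_CANDIDATE` and the `max_pages` option are all names I made up for illustration, and it assumes the PDF engine (whichever one ends up being used) can hand back the text one page at a time.

```python
import re
from itertools import islice

# Hypothetical candidate pattern: 10 or 13 digits with optional hyphens or
# whitespace between them. Whitespace includes linefeeds.
ISBN_CANDIDATE = re.compile(r'(?<!\d)(?:97[89][-\s]?)?(?:\d[-\s]?){9}[\dXx](?!\d)')


def is_valid_isbn(digits):
    """Check the ISBN-10 or ISBN-13 check digit of a separator-free string."""
    if len(digits) == 10 and digits[:9].isdigit() and digits[9] in '0123456789X':
        values = [int(c) for c in digits[:9]] + [10 if digits[9] == 'X' else int(digits[9])]
        return sum((10 - i) * v for i, v in enumerate(values)) % 11 == 0
    if len(digits) == 13 and digits.isdigit():
        return sum(int(c) * (3 if i % 2 else 1) for i, c in enumerate(digits)) % 10 == 0
    return False


def find_isbn(pages, max_pages=10):
    """Return the first valid ISBN in the first max_pages pages, else None.

    pages is any iterable of per-page text. If it is a generator, pages past
    the limit are never requested. max_pages=None scans everything.
    """
    for text in islice(pages, max_pages):
        for match in ISBN_CANDIDATE.finditer(text):
            digits = re.sub(r'[-\s]', '', match.group()).upper()
            if is_valid_isbn(digits):
                return digits
    return None
```

Because the limit is applied with `islice`, passing a generator of pages means the later pages never need to be extracted at all, which is where I would expect the performance gain to come from. Scanning the last few pages would need the page count up front, which this sketch does not attempt.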

As for your case (1) above, the ISBN immediately followed by a linefeed: if you can PM me a link to where I can download the doc, I will give it a spin once I have changed the PDF handling code and see if we can handle that case.
