MobileRead Forums - View Single Post - k2pdfopt: optimizes PDFs for viewing on e-readers

Psyny · 01-14-2015, 10:13 AM

Quote:

Originally Posted by willus

It's not too much--I'm just not sure I understand entirely what you want this option to do, but I get the idea that it involves using the OCR layer to detect the columns. I suppose it would also be nice if I could figure out a way to ignore background graphics. There is probably a way to do that using the MuPDF API that isn't too difficult.

Hi Willus, sorry for the delay.

Indeed, the idea is to use the OCR layer to detect columns/regions.
That could help in cases where graphics mess up region detection.

For example, take this text:

Using the command suggested earlier:
-corc[+] [i|t] <inches>

Where:
<inches> : maximum distance from an OCR/marking area within which k2pdfopt looks for other text to treat as part of the same column.
+ : also process areas that have no OCR/markings
i : include in the column the area around the markings defined by <inches>
t : do not include in the column the area around the markings defined by <inches>
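
To make the proposed syntax concrete, here is a minimal Python sketch of how the three parts could be read. It is only an illustration of the suggestion above: -corc is not an existing k2pdfopt option, and the names CorcOption and parse_corc are made up for this example.

```python
from dataclasses import dataclass


@dataclass
class CorcOption:
    """Hypothetical container for the proposed -corc[+] [i|t] <inches> option."""
    allow_unmarked: bool   # True when "+" is given: areas without OCR are processed too
    include_offset: bool   # True for "i", False for "t"
    inches: float          # search distance around each OCR area


def parse_corc(tokens):
    """Parse tokens such as ["-corc+", "i", "1.0"] into a CorcOption."""
    if len(tokens) != 3:
        raise ValueError("expected: -corc[+] [i|t] <inches>")
    flag, mode, inches = tokens
    if flag not in ("-corc", "-corc+"):
        raise ValueError("unknown flag: " + flag)
    if mode not in ("i", "t"):
        raise ValueError("mode must be 'i' or 't'")
    distance = float(inches)
    if distance < 0:
        raise ValueError("<inches> must not be negative")
    return CorcOption(flag.endswith("+"), mode == "i", distance)
```

For instance, parse_corc(["-corc", "i", "1.0"]) gives an option with the offset included and a 1.0 inch search distance.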

k2pdfopt could use the OCR layer ( in green ) and add an offset around it ( in purple ) to detect columns/regions ( in red ):

The command for this result, based on the suggestion, would look like:
-corc i 1.0

The offset is used to detect nearby OCR areas: if one OCR area touches the offset of another OCR area, the two are considered the same column/region.

The "i" parameter would add the offset area to the final region slice.
If we use "t" instead, the offset area is still used to detect nearby areas, but is not included in the final region. Like: -corc t 1.0

If the offset is increased to the point where it reaches another OCR area, both areas should become the same region, like: -corc i 2.0
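
The grouping rule described above can be sketched in a few lines of Python. This is just my rough idea of the logic, not how k2pdfopt or MuPDF actually work: the boxes are assumed to be (left, top, right, bottom) rectangles already converted to inches, and the function names are made up.

```python
def expand(box, offset):
    """Grow a (left, top, right, bottom) box by offset on every side."""
    left, top, right, bottom = box
    return (left - offset, top - offset, right + offset, bottom + offset)


def touches(a, b):
    """True if rectangles a and b overlap or share an edge."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def detect_regions(ocr_boxes, offset, include_offset):
    """Group OCR boxes whose offset area reaches another box.

    include_offset=True behaves like the proposed "i" mode,
    include_offset=False like the proposed "t" mode.
    """
    count = len(ocr_boxes)
    parent = list(range(count))  # simple union-find over box indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    expanded = [expand(box, offset) for box in ocr_boxes]
    for i in range(count):
        for j in range(i + 1, count):
            # Box j lies within reach of box i's offset area: same region.
            if touches(expanded[i], ocr_boxes[j]):
                parent[find(i)] = find(j)

    groups = {}
    for i in range(count):
        groups.setdefault(find(i), []).append(i)

    # The offset always drives the grouping, but only "i" keeps it in the result.
    source = expanded if include_offset else ocr_boxes
    regions = []
    for members in groups.values():
        regions.append((
            min(source[k][0] for k in members),
            min(source[k][1] for k in members),
            max(source[k][2] for k in members),
            max(source[k][3] for k in members),
        ))
    return sorted(regions)
```

With two text blocks stacked 0.5 inch apart and a third block 2 inches to the right, an offset of 1.0 would give two regions, and an offset of 2.0 would merge everything into one.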

This should make region detection work better in files with heavy background graphics but good OCR layers.

Don't know if it's possible... =P