MobileRead Forums - View Single Post - k2pdfopt: optimizes PDFs for viewing on e-readers

Steven630 · 12-16-2017, 03:56 AM

Thank you for your great tool.

This is the best software I've ever used for PDF optimization.

While converting a Chinese document, I found some unnecessary line breaks. I am using v2.42 on Windows, and "smart line breaks" is unchecked for this conversion (when it is checked, even more unnecessary line breaks appear).

Command line: -dev kp2 -fs 16.5 -col 1 -ws -0
Additional Options: -bp[-] -om 0.2 -y

For example, the underlined part belongs to one single sentence, ending at a comma, but k2pdfopt broke it into two lines (see images). I have more examples if you need them.

[Attached image: before.png]

[Attached image: after.png]

Here is the PDF: line break.pdf. To save space, I have kept only the page in question. If you convert this file with the settings above, the problematic lines are the last line of result page 1 and the first line of result page 2 (they should be on the same line).

I don't know if it has something to do with the fact that the words are in Chinese. In Chinese, there is no space between words. In a sentence, there are only characters and punctuation marks. Non-native speakers can think of it roughly as numbers plus punctuation marks.

For example

Code:

234946543，************。

Suppose that each digit is a Chinese character, and some words consist of multiple characters. Let's say that "23" stands for "students", 4 "don’t", 94 "like", 65 "that", 43 "teacher". So "234946543" means "The students don’t like that teacher". There are no spaces between words or characters. We know how to separate the words (23-4-94-65-43) just by reading the segment 234946543.
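To make the digit analogy concrete, here is a minimal Python sketch that recovers 23-4-94-65-43 from the unspaced string. It uses greedy longest-match lookup against a tiny word list, which is only a stand-in for what a reader does (real readers rely on context). The names LEXICON and segment are made up for this illustration.

Code:

```python
# Toy word list taken from the digit analogy in this post.
LEXICON = {
    "23": "students",
    "4": "don't",
    "94": "like",
    "65": "that",
    "43": "teacher",
}


def segment(text, lexicon):
    """Split an unspaced string into known words, longest match first."""
    max_len = max(len(word) for word in lexicon)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shorter ones.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in lexicon:
                words.append(piece)
                i += size
                break
        else:
            raise ValueError(f"no lexicon entry matches at position {i}")
    return words


print("-".join(segment("234946543", LEXICON)))  # 23-4-94-65-43
```

The word boundaries come entirely from the word list. Nothing in the string itself marks them.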

And a Chinese word that has multiple characters can be separated between lines. Normally, a single line has a (relatively) fixed number of characters. If a line has, say, a width of eight characters, this sentence would be

Code:

23494654
3，******
******。

Code:

学生不喜欢那个老
师，◎◎◎◎◎◎
◎◎◎◎◎。

Even though 43（老师） is the word for "teacher", the two characters that make up the word—4 老 and 3 师—are still on different lines, and this is the norm. It’s a bit like

Quote:

The students don’t like that tea-
cher, because she always assigns
a lot of homework.

Since we don’t have spaces between words, there’s no need for hyphens to divide words at the end of lines either.
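The fixed-width wrapping described above can be sketched in a few lines of Python. This is a simplification under my own assumptions: every character counts as one unit, and the line is cut purely by count. Real typesetting usually also avoids starting a line with a punctuation mark, which this sketch ignores. The function name wrap_by_character is made up for this illustration.

Code:

```python
def wrap_by_character(text, width):
    """Cut text into lines of at most `width` characters, ignoring word boundaries."""
    if width < 1:
        raise ValueError("width must be at least 1")
    return [text[i:i + width] for i in range(0, len(text), width)]


sentence = "学生不喜欢那个老师，◎◎◎◎◎◎◎◎◎◎◎。"
for line in wrap_by_character(sentence, 8):
    print(line)
# 学生不喜欢那个老
# 师，◎◎◎◎◎◎
# ◎◎◎◎◎。
```

The word 老师 ends up split across the first two lines, and no hyphen is needed.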

k2pdfopt seems to treat everything between two punctuation marks as one long word, since there is no noticeable space between two Chinese characters.

Chinese is one of the few languages written without spaces between words. (In fact, ancient Chinese books don’t even have punctuation marks, so children would first learn how to divide sentences.)

Would it be possible for k2pdfopt to adopt a different approach for Chinese (perhaps by adding a "source text is in Chinese" option to the interactive menu)? If the option is ticked, the software would look for characters instead of words.
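To show what I mean by "characters instead of words", here is a rough text-only sketch in Python. It is only an analogy: it works on strings, it says nothing about how k2pdfopt actually detects words on a page, and both function names are made up for this illustration. The first function keeps together everything between gaps (whitespace or Chinese punctuation); the second allows a break after every character and only keeps a punctuation mark attached to the character before it.

Code:

```python
import re

# Full-width punctuation marks treated as visible gaps (illustrative, not exhaustive).
PUNCTUATION = "，。、；：！？"


def units_between_gaps(text):
    """Unbreakable units when breaks are only allowed at spaces or punctuation."""
    pattern = rf"[^\s{PUNCTUATION}]+[{PUNCTUATION}]*|[{PUNCTUATION}]+"
    return re.findall(pattern, text)


def units_per_character(text):
    """Unbreakable units when a break is allowed after every character."""
    units = []
    for ch in text:
        if ch.isspace():
            continue
        if ch in PUNCTUATION and units:
            units[-1] += ch  # keep the mark with the preceding character
        else:
            units.append(ch)
    return units


chinese = "学生不喜欢那个老师，◎◎◎◎◎◎◎◎◎◎◎。"
print([len(u) for u in units_between_gaps(chinese)])   # [10, 12]
print(len(units_per_character(chinese)))               # 20
print(units_between_gaps("The students don't like that teacher."))
```

With gap-based units, the Chinese sentence yields just two long blocks, while the English sentence yields six short words. With per-character units, the same Chinese sentence yields twenty small pieces that can be distributed across lines freely.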

Even if the issue in the attached images turns out to be unrelated to Chinese, I would still recommend an improved mode for handling Chinese documents. Thank you.