Any good Perl scripters out there?


Goshzilla
10-09-2007, 04:31 PM
I have the software installed on my computer, but I have never written a Perl script before. I have done programming in Java and C++, so I naively thought that a program to hyphenate every word in a paragraph with a soft hyphen (one that does not show up unless the word falls at the end of a line) would be easy. Turns out it isn't. But luckily it was thoroughly researched by Donald Knuth when he wrote his TeX program.

There is a Perl command equivalent that uses the TeX hyphenation scheme:

http://www.gemjack.com/gems/text-hyphen-1.0.0/classes/Text/Hyphen.html

I would like to write a simple script that takes an entire HTML file, hyphenates the body text with soft hyphens, and saves it back to the HTML file. This interests me because a lot of the ebooks I make for the Gemstar have large gaps between words due to full justification.

Note: I would prefer to work in HTML, since the texts I use are Project Gutenberg texts and I convert them to HTML using the GutenMark tool.

kovidgoyal
10-09-2007, 05:46 PM
Use BeautifulSoup + Python hyphenate; it shouldn't need more than a 100-line script.

JSWolf
10-09-2007, 05:47 PM
If we put soft hyphens in an LRS, would they still work when it is converted to LRF?

kovidgoyal
10-09-2007, 05:54 PM
It might. It depends on whether Sony's reader software supports soft hyphens.

kovidgoyal
10-09-2007, 06:04 PM
I just checked: it doesn't. In fact it's worse than that. It converts them to real hyphens, and doesn't line break on them.

NatCh
10-09-2007, 06:11 PM
It converts them to real hyphens. And doesn't line break on them.

Ouch.

JSWolf
10-09-2007, 06:32 PM
That's odd, as I've seen it break on a real hyphen if the real hyphen is already in the text. It doesn't break on em dashes for some idiotic reason.

NatCh
10-09-2007, 06:41 PM
Doesn't break on em dashes for some idiotic reason.

Do the em dashes in question have spaces on either side of them?

I.e. "word — word" vs. "word—word"

The reason I ask is that the absence of spaces may make the parser think it's supposed to be a single word — I think the first is the "technically" proper usage, anyway ....

JSWolf
10-09-2007, 06:45 PM
I've seen "word— word" vs. "word—word" vs. "word — word"

If you have "word—word" it's treated as one long word. "word — word" might be wrapped after the first word, leaving "— word" on the next line. If it fits, "word—" stays on the previous line and the second word starts the next line, which looks best.

NatCh
10-09-2007, 06:58 PM
Any idea what it does with non-breaking spaces?

jbenny
10-09-2007, 07:13 PM
Do the em dashes in question have spaces on either side of them?

I.e. "word — word" vs. "word—word"

The reason I ask is that the absence of spaces may make the parser think it's supposed to be a single word — I think the first is the "technically" proper usage, anyway ....

Actually, I am pretty sure that the correct usage of an em-dash has no spaces on either side, although some people use them. Of course, the typical ASCII representation of an em-dash is -- (two dashes).

Goshzilla
10-09-2007, 09:13 PM
Use BeautifulSoup + Python hyphenate; it shouldn't need more than a 100-line script.

How exactly would this work? I want to preserve the original tags, styles, etc., except that all the text within the body of the work will be hyphenated using soft hyphens.

The way I read hyphenate.py, it merely returns an array with each substring of text where a hyphen can go. I suppose I would then have to write a for loop over that array to build a string with the soft hyphen appended, something like:

if, say, I took a=hyphenate_word('perfect')
a=['per', 'fect']
string=a[0]
then I would want a for loop iterating i from 1 to length(a)-1
## something that appends "$$softhyphen$$"+a[i] to string
(writing it that way so that the soft hyphen is never inserted at the end of a word)

Now I just need to figure out how to use this BeautifulSoup script to get the text in between the <body></body> tags while preserving formatting tags like <p> and <br>. I don't want to drop the formatting completely; I only want the words to be modified and then placed back into the HTML file.

kovidgoyal
10-09-2007, 09:20 PM
u'\u00ad'.join(result_of_call_to_hyphenate)

You'd need to recursively process the root tag of the BeautifulSoup tree. All text is a node of class NavigableString; use isinstance to test for it. I'm not sure how easy it is to replace strings in the soup, but once you figure that out, all you need to do is print unicode(soup).encode('utf-8').
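The idea of visiting every text node while leaving the tags alone can be sketched with nothing but the Python 3 standard library. This is not BeautifulSoup: it swaps in html.parser as a stand-in, and the names SoftHyphenator, soft_hyphenate_html and fake_hyphenate are made up for illustration. fake_hyphenate only cuts every four letters so the example runs on its own; the real hyphenate_word would go in its place.

```python
import re
from html.parser import HTMLParser

SOFT_HYPHEN = "\u00ad"


def fake_hyphenate(word):
    """Stand-in for hyphenate_word: cuts every 4 letters, NOT real hyphenation."""
    return [word[i:i + 4] for i in range(0, len(word), 4)]


class SoftHyphenator(HTMLParser):
    """Re-emits the markup as it was read and soft-hyphenates only text nodes."""

    def __init__(self, hyphenate=fake_hyphenate):
        # convert_charrefs=False keeps entities such as &rdquo; as separate events.
        super().__init__(convert_charrefs=False)
        self.hyphenate = hyphenate
        self.out = []
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1
        self.out.append(self.get_starttag_text())  # original tag text, attributes intact

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1
        self.out.append("</%s>" % tag)  # note: the tag name comes back lowercased

    def handle_data(self, data):
        if not self.skip_depth:
            # Only runs of letters reach the hyphenator; punctuation stays put.
            data = re.sub(r"[A-Za-z]+",
                          lambda m: SOFT_HYPHEN.join(self.hyphenate(m.group())),
                          data)
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)

    def handle_pi(self, data):
        self.out.append("<?%s>" % data)


def soft_hyphenate_html(html, hyphenate=fake_hyphenate):
    parser = SoftHyphenator(hyphenate)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Calling soft_hyphenate_html(text) on the contents of a file returns the same markup with U+00AD inserted inside the words. The same walk in BeautifulSoup would test each node with isinstance and replace the strings in the tree instead of rebuilding the output by hand.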

Goshzilla
10-09-2007, 10:02 PM
Maybe I should just edit the original Gutenberg txt file first and hyphenate that; for a txt file I at least have a rough idea of how to write the script.

kovidgoyal
10-09-2007, 10:11 PM
For a txt file it's as simple as


import re
from hyphenate import hyphenate_word as hyphenate

# Read the whole file, then join the pieces of every word with a soft hyphen (U+00AD).
src = open('file', 'rb').read()
result = re.sub(r'\S+', lambda match: u'\u00ad'.join(hyphenate(match.group())), src)


Actually, if you can come up with a regexp that matches only the text between tags, you can use this technique for HTML as well.

kovidgoyal
10-09-2007, 10:24 PM
Since I'm on a roll, here's one for HTML; it might need a little adjustment.


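import re
from hyphenate import hyphenate_word as hyphenate

def process_text(match):
    # Hyphenate only the text between two tags. The '>' and '<' are part of
    # the match, so they have to be put back around the result.
    src = match.group(1)
    return '>' + re.sub(r'\S+', lambda m: u'\u00ad'.join(hyphenate(m.group())), src) + '<'

src = open('file', 'rb').read()
result = re.sub(r'>([^><]+)<', process_text, src)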

Goshzilla
10-09-2007, 11:29 PM
Okay, I can definitely see the advantages of hyphenating the HTML file, because otherwise I have to deal with GutenMark, a program that was written specifically under the assumption that Gutenberg text is formatted in a specific way.

DaleDe
10-10-2007, 01:26 AM
Okay, I can definitely see the advantages of hyphenating the HTML file, because otherwise I have to deal with GutenMark, a program that was written specifically under the assumption that Gutenberg text is formatted in a specific way.

The simplest way to add soft hyphens doesn't even require any programming. You could use a sed script to do a global substitution of each word with its soft-hyphen equivalent. This would be a quick and dirty way to accomplish the goal, particularly if you really only want to hyphenate longer words.
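The same table-driven substitution can be sketched in Python rather than sed. The table below is a made-up two-word example, not a real hyphenation dictionary, and the names BREAKS and substitute are illustrative only.

```python
import re

SOFT_HYPHEN = "\u00ad"

# Hypothetical lookup table: each long word mapped to its pieces.
BREAKS = {
    "hyphenation": ["hy", "phen", "ation"],
    "justification": ["jus", "ti", "fi", "ca", "tion"],
}


def substitute(text, table=BREAKS):
    """Replace every whole-word occurrence of a table entry with its soft-hyphenated form."""
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, table)) + r")\b")
    return pattern.sub(lambda m: SOFT_HYPHEN.join(table[m.group(1)]), text)
```

The match is case-sensitive and only covers words that are in the table, which is what makes it quick and dirty: short or unlisted words pass through unchanged.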

Goshzilla
10-10-2007, 11:22 PM
I've noticed some bizarre hyphenations coming out of the program. For instance, words sometimes get hyphenated as "y-our", "s-mall" and "u-sual", and in some instances plural forms come out as "word-s" or "travel-s". When I use the hyphenation Python script on individual words like these, it doesn't return erroneous results, but when I take an entire line and use

a=f.readline()
b=a.split()
(while loop that goes through the length of b)
a=re.sub(b[i],'^'.join(hyphenate(b[i])), a)


the line hyphenates words incorrectly. I think this has something to do with the way .sub works by finding a "pattern" in the string, so once a line has been altered, sub doesn't quite work as well because I'm looking at a new string.


Update: It's the dang punctuation that messes it up. If I don't remove the punctuation marks before calling the hyphenation method, the word gets hyphenated incorrectly. I could either alter the simple script I wrote, or alter the hyphenate code.

I could use a line like a=re.sub("[,;:.!]", '', a) to remove the punctuation marks, but when there is an apostrophe I would like to cut out the word preceding the apostrophe, so that something like "Washington's" becomes "Washington".
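One way around both problems, sketched here in Python 3 under some assumptions, is to never strip anything: match only runs of letters in a single pass and replace each match in place. Punctuation and the "'s" never reach the hyphenator, and an earlier replacement cannot be matched again as part of a later word. fake_hyphenate is a stand-in that cuts every four letters so the example runs without hyphenate.py, and mark_line is a made-up name.

```python
import re


def fake_hyphenate(word):
    """Stand-in for hyphenate_word: cuts every 4 letters, NOT real hyphenation."""
    return [word[i:i + 4] for i in range(0, len(word), 4)]


def mark_line(line, hyphenate=fake_hyphenate, mark="^"):
    """Insert `mark` at the break points of every run of letters in `line`."""
    # Each match is a bare word; commas, quotes and apostrophes are left where they are.
    return re.sub(r"[A-Za-z]+", lambda m: mark.join(hyphenate(m.group())), line)
```

With this pattern "Washington's" is seen as "Washington" followed by a separate "s", so the part before the apostrophe is hyphenated on its own. Accented letters are not covered by [A-Za-z]; that range is an assumption that fits plain English text.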

wallcraft
10-10-2007, 11:49 PM
Do the em dashes in question have spaces on either side of them?

I.e. "word — word" vs. "word—word"

The reason I ask is that the absence of spaces may make the parser think it's supposed to be a single word — I think the first is the "technically" proper usage, anyway ....

Spaces are usually included (or they used to be) in Britain, but in the US the spaces are usually omitted. See the Wikipedia article on Dash (http://en.wikipedia.org/wiki/Dash).

NatCh
10-11-2007, 10:35 AM
Thanks for the primer, wallcraft — all this time I thought I was being non-standard, and I just learned I've been developing my own "house style!" :grin:

kovidgoyal
10-11-2007, 12:28 PM
If you do modify hyphenate, send me the modified code.

Goshzilla
10-13-2007, 08:18 PM
I never did modify the existing hyphenation algorithm. I only know basic coding and basic objects, I am still learning about writing my own search trees and hash tables, and I still haven't figured out all the little nuances of Python just yet.

So here is what I wrote:
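# process.py (the module imported by the script below)
import re
from hyphenate import hyphenate_word as hyphenate

def process(y):
    # Leave blank lines untouched.
    if y == '\n':
        return y
    # Strip punctuation so that only bare words reach the hyphenator.
    b = re.sub("[,()*;:!?.]", '', y)
    b = re.sub('"', '', b)
    b = re.sub(r'[\[\]{}<>]', '', b)
    k = b.split()
    i = 0
    while i < len(k):
        # re.escape keeps leftover special characters from being read as regex syntax.
        y = re.sub(re.escape(k[i]), '^'.join(hyphenate(k[i])), y)
        i = i + 1
    return y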


from process import process

f = open('gltrv10.txt', 'rb')
g = open('r23.txt', 'w')
# Run every line of the source text through process() and write it out.
a = f.readline()
while a != '':
    g.write(process(a))
    a = f.readline()
f.close()
g.close()

kovidgoyal
10-13-2007, 10:12 PM
Cool, glad you could fix your problem. Keep learning Python, it's a rewarding experience.

Goshzilla
04-08-2010, 12:47 AM
I just thought I would revive this really old thread, because I discovered that the Nook does support soft hyphens. Only sort of, though: it breaks the word at the point where the soft hyphen is, but it does not insert a "-" at the point where it breaks the word. Still, that's better than nothing.

Now I would like to edit the contents of epub files to include hyphenation for every word found within the body of the text. I imagine this "BeautifulSoup" program will be what I need to accomplish such a task?

kovidgoyal
04-08-2010, 01:05 AM
Yes, BeautifulSoup can be used for that.

Goshzilla
04-08-2010, 05:56 PM
So I've been working on this all morning and I have come across a stumbling block.

I can extract text from anything within the <p> tags, but I have a problem trying to write a loop to hyphenate a sentence. The issue is that there are specially defined characters like "&rdquo;", which is the entity for a right double quotation mark. I can strip out the text from a paragraph, but it still leaves symbols like that intact.
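One possible way to keep entities such as &rdquo; out of the hyphenator, sketched in Python 3, is to split the paragraph text on them and only process the pieces in between. The names ENTITY and hyphenate_keeping_entities are made up, and fake_hyphenate is a stand-in that cuts every four letters so the example is self-contained.

```python
import re

SOFT_HYPHEN = "\u00ad"

# Named (&rdquo;), decimal (&#8221;) and hex (&#x201D;) character references.
ENTITY = re.compile(r"(&#?\w+;)")


def fake_hyphenate(word):
    """Stand-in for hyphenate_word: cuts every 4 letters, NOT real hyphenation."""
    return [word[i:i + 4] for i in range(0, len(word), 4)]


def hyphenate_keeping_entities(text, hyphenate=fake_hyphenate):
    # Splitting on a capturing group keeps the entities in the list:
    # even indexes hold ordinary text, odd indexes hold the entities.
    parts = ENTITY.split(text)
    for i in range(0, len(parts), 2):
        parts[i] = re.sub(r"[A-Za-z]+",
                          lambda m: SOFT_HYPHEN.join(hyphenate(m.group())),
                          parts[i])
    return "".join(parts)
```

An alternative would be to turn the entities into real characters first with html.unescape from the standard library, at the cost of writing them back as characters rather than as entities.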


Update:

I got it done, finally. It isn't the most efficient code in the world, but it can now scan through an entire directory for HTML files and modify them one by one, inserting hyphens. There is one issue I'd like to get around, however: I hit a stumbling block when I try to edit an HTML file that is encoded in something other than ASCII.

kovidgoyal
04-08-2010, 10:23 PM
Use the xml_to_unicode function in calibre and pass BeautifulSoup the unicode.
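Without calibre at hand, the "decode first, then parse" step might be approximated as below in Python 3. This is not calibre's xml_to_unicode and does no real charset detection; it just tries a list of candidate encodings in order, and both the list and the name decode_html are assumptions for the sake of the sketch.

```python
def decode_html(raw, candidates=("utf-8", "cp1252")):
    """Return (text, encoding_used) for a bytes object read from an HTML file."""
    for enc in candidates:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a character, so this last resort cannot fail.
    return raw.decode("latin-1"), "latin-1"
```

The decoded string can then be hyphenated and written back with an explicit encoding, so that the soft hyphen (which is not an ASCII character) is stored correctly.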

Goshzilla
04-09-2010, 02:55 PM
Okay, I've done as much as I could with it. I've uploaded an epub of Hound of the Baskervilles to demonstrate the soft-hyphen reflow throughout the entire text.

Update: I'm perfecting the code step by step. With the latest iteration, I can have the program crawl through a directory with subfolders, each containing HTML files, and modify them all subfolder by subfolder. Now I can do bulk editing quickly.
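The code itself was not posted, so the following is only a guess at the shape of such a crawler in Python 3, using os.walk from the standard library. The name process_tree, the list of extensions and the assumption that every file is UTF-8 are all illustrative; transform is whatever function turns one HTML string into its hyphenated version.

```python
import os


def process_tree(root, transform, extensions=(".html", ".htm", ".xhtml")):
    """Apply `transform` to every matching file under `root`; return the paths rewritten."""
    changed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.lower().endswith(extensions):
                continue
            path = os.path.join(dirpath, name)
            # newline="" keeps the file's own line endings untouched.
            with open(path, encoding="utf-8", newline="") as f:
                original = f.read()
            updated = transform(original)
            if updated != original:
                with open(path, "w", encoding="utf-8", newline="") as f:
                    f.write(updated)
                changed.append(path)
    return changed
```

Because it rewrites files in place, it is safer to run it on a copy of the unpacked epub folder first.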