08-10-2016, 12:36 AM | #1 |
Wizard
Posts: 3,413
Karma: 13369310
Join Date: May 2008
Location: Launceston, Tasmania
Device: Sony PRS T3, Kobo Glo, Kindle Touch, iPad, Samsung SB 2 tablet
|
How can one remove excess carriage returns?
I've been sent a doc file to turn into an ebook. My usual practice is to run the doc file through Atlantis Word Processor to turn it into an HTML file, and go on from there.
But the author, for reasons best known to her, has ended every line of text with a carriage return, so Atlantis turns every line of text into a paragraph. Also, the author has not separated 'real' paragraphs in the text. Can anyone suggest a way to get rid of the excess carriage returns/paragraphs? I vaguely remember a tool with which one could select a block of text, then press a key combination to remove all the <p> and </p> tags except the ones starting and finishing the block. But I can't remember which software it was in. Any suggestions? |
08-10-2016, 01:00 AM | #2 | |
Bookmaker & Cat Slave
Posts: 11,466
Karma: 158448243
Join Date: Apr 2010
Location: Phoenix, AZ
Device: K2, iPad, KFire, PPW, Voyage, NookColor. 2 Droid, Oasis, Boox Note2
|
Quote:
We see that quite often. I just saw several like that. By the time I finish explaining about broken paragraphs, what it takes to clean them, etc., the prospective client's eyes have glazed over, and they usually leave to find some other bookmaker that doesn't bother them with that codswallop (to quote one rather infamous near-miss client). nb: I'm actually afraid to ask you about the nature/topic of the book. I'm afraid one of those that came through my door has ended up at yours!!!
We use in-house regex. It's the best way. Do one pass for those that have two in a row (the last line of a real para, then an empty para), and then for one, and then, sadly, you have to do the rest by hand/eye. Particularly surrounding those that break across pages, of course (if this was a scan). Can't really be done automatically.
On a commercial note, I hope that a) you asked them what crappy "auto-convert" program they used to give you this utterly FAKE Word file ($5 says it is an export from Adobe Acrobat--"save as Word"), or it's the output from a scan, or some bollocks like that, and b) that you are CHARGING to do all this extra work. That stuff is total nonsense. Your rates, presumably, are like ours--from a CLEAN source file, if using a word-processing file, right? Seriously--if you're like us, you charge one rate for "from Word" and something a lot more expensive "from PDF" and so on. We frequently get these "faux-Word files," with prospectives thinking that we can't TELL that it was a PDF five minutes ago. Ask for the actual source--probably easier for you, and more expensive for them, but you should be paid for the actual time you're putting in. Sheesh.
Hitch |
|
08-10-2016, 01:35 AM | #3 | |
Wizard
Posts: 3,413
Karma: 13369310
Join Date: May 2008
Location: Launceston, Tasmania
Device: Sony PRS T3, Kobo Glo, Kindle Touch, iPad, Samsung SB 2 tablet
|
Quote:
Thanks anyway. |
|
08-10-2016, 02:19 AM | #4 |
Wizard
Posts: 4,520
Karma: 121692313
Join Date: Oct 2009
Location: Heemskerk, NL
Device: PRS-T1, Kobo Touch, Kobo Aura
|
The Post-OCR procedure of my add-in handles this, and it tries to be smart about it as well. It can happen that a line ends with a period but is not the last line of the paragraph. In those cases the procedure checks whether the first word of the next line would have fit behind the period without overrunning the line (I hope I make sense here). If it fits, the line is probably the end of a paragraph. If not, then the paragraph continues on the next line.
After this, a couple of unknowns usually remain (e.g. a heading usually does not end with a period). With the Search&Replace procedure the last remaining dubious line endings are investigated and fixed. That step is manual (it asks whether each replacement should be made). This saves me a lot of time. For an average book the Post-OCR and these specific S&R commands take no more than 2-5 minutes. |
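A rough sketch of the "would the next word have fit?" test described above, in Python. This is not the add-in's actual code: the function name `join_wrapped_lines`, the `width` parameter, the use of the longest line as the wrap width, and treating `!` and `?` like a period are all assumptions made for illustration.

```python
def join_wrapped_lines(lines, width=None):
    """Merge hard-wrapped lines into paragraphs (hypothetical helper)."""
    # Drop blank lines and trailing whitespace before measuring lengths.
    lines = [ln.rstrip() for ln in lines if ln.strip()]
    if not lines:
        return []
    if width is None:
        # Assumption: the longest line approximates the original wrap width.
        width = max(len(ln) for ln in lines)

    paragraphs = []
    current = lines[0]
    prev_len = len(lines[0])
    for ln in lines[1:]:
        first_word = ln.split()[0]
        ends_sentence = current.endswith((".", "!", "?"))
        # Would the next line's first word have fit on the previous line?
        fits = prev_len + 1 + len(first_word) <= width
        if ends_sentence and fits:
            # The writer broke the line early, so this is probably a real paragraph end.
            paragraphs.append(current)
            current = ln
        else:
            # Otherwise treat the break as a plain line wrap.
            current = current + " " + ln
        prev_len = len(ln)
    paragraphs.append(current)
    return paragraphs
```

Headings and other lines that do not end in punctuation get merged into the following text by this sketch, which is why a manual review pass is still needed afterwards.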
08-10-2016, 02:24 AM | #5 | |
Grand Sorcerer
Posts: 5,611
Karma: 23187563
Join Date: Dec 2010
Device: Kindle PW2
|
Quote:
1. Replace all consecutive line-breaks (^p^p) with a dummy character, e.g. ###.
2. Replace all remaining hard line-breaks with spaces.
3. Replace the dummy character (###) with hard line-breaks (^p). (You might have to replace ###### with ### first.) |
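The same three steps can be sketched in Python for text that has been exported from Word. The function name `unwrap_with_placeholder` is made up for this example, and it assumes the text uses `\n` line endings. One deliberate difference from the manual steps: a regex collapses any run of two or more breaks into a single marker, which stands in for the "replace ###### with ###" clean-up.

```python
import re

def unwrap_with_placeholder(text, marker="###"):
    """Three-pass unwrap using a placeholder (hypothetical helper)."""
    if marker in text:
        # The placeholder must not already occur in the text.
        raise ValueError("marker already present in text; choose another")
    # Step 1: runs of two or more line breaks mark real paragraph ends.
    text = re.sub(r"\n{2,}", marker, text)
    # Step 2: every remaining break is just a line wrap, so make it a space.
    text = text.replace("\n", " ")
    # Step 3: turn the placeholders back into single line breaks.
    return text.replace(marker, "\n")
```

This only works when the source really has an empty line between paragraphs; if it does not, step 1 finds nothing and step 2 merges the whole text into one paragraph.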
|
08-10-2016, 02:58 AM | #6 | |
Wizard
Posts: 3,450
Karma: 10484861
Join Date: May 2006
Device: PocketBook 360, before it was Sony Reader, cassiopeia A-20
|
Quote:
I personally would use regular expressions. If there is something like an empty line between real paragraphs, I would do a quick solution as Doitsu suggested in the previous post. If there is no empty line between paragraphs, there might be a tab character at the beginning of the paragraph or, if you are lucky, a few spaces, or the line might have a different indent. I would try to use that.
If all else fails, I would find all lines that end with a dot followed by a CRLF and replace that CRLF with something like ### real paragraph here ###, then do the same thing for question mark, exclamation point, and also dot followed by a [closing] quote mark ... you get the idea. Then I would replace all CRLF with a space, replace all the ### real paragraph here ### markers with CRLF and then check for two consecutive spaces (repeating the replacement until there are no more to replace). Or, you could craft a regular expression that would replace any letter followed by a CRLF (end of line/paragraph) with the same letter followed by a space.
Another trick would be to use the elaborate algorithms that OCR programs use. Just print the text into a PDF and run that through an OCR program ;-) OCR programs use the tricks described above, plus they look at the number of characters on a line, whether the text is fully justified, and many other clever tricks.
It also depends on how much of the original formatting from Word you want to preserve. I might just use search and replace in Word to insert formatting markup, looking for specific formatting (such as a style) and placing marks like {H1} at the beginning of the text where the formatting changes, and then export the text to a *.txt file and massage that with a powerful editor with real regular expressions (Gvim is my choice). |
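The punctuation-based fallback above could look roughly like this in Python. It is a sketch under assumptions: the function name `unwrap_by_punctuation` is invented, the set of closing quote characters is a guess at what a given manuscript uses, and the regex handles all sentence-ending marks in one pass instead of one search per mark.

```python
import re

MARK = "### real paragraph here ###"

def unwrap_by_punctuation(text):
    """Keep breaks after sentence-ending punctuation, unwrap the rest (hypothetical helper)."""
    # Normalise CRLF so a single pattern covers both kinds of line ending.
    text = text.replace("\r\n", "\n")
    # A line ending in . ? or ! (optionally followed by a closing quote)
    # is assumed to end a real paragraph; protect it with the marker.
    text = re.sub(r"([.?!][\"'\u201d\u2019]?)[ \t]*\n", r"\1" + MARK, text)
    # Every other break is treated as a line wrap.
    text = text.replace("\n", " ")
    # Restore the protected paragraph breaks.
    text = text.replace(MARK, "\n")
    # Collapse runs of spaces left behind by the joins.
    return re.sub(r" {2,}", " ", text).strip()
```

A wrapped line that happens to end with a full stop in mid-paragraph is split incorrectly by this approach, so the result still needs to be checked by eye.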
|
08-10-2016, 03:59 AM | #7 | |
null operator (he/him)
Posts: 20,691
Karma: 26966376
Join Date: Mar 2012
Location: Sydney Australia
Device: none
|
Quote:
The same technique can be used in plain text files - in that case you look for \n (sometimes \r\n) rather than ^p
BR
Last edited by BetterRed; 08-10-2016 at 04:04 AM. |
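Because a plain text file may use either `\n` or `\r\n`, it can help to normalise the line endings before doing any search and replace. A minimal Python sketch, with `normalize_newlines` as a made-up name:

```python
import re

def normalize_newlines(text):
    """Convert \\r\\n and lone \\r line endings to \\n (hypothetical helper)."""
    return re.sub(r"\r\n?", "\n", text)
```

After this step, a search for `\n\n` finds paragraph gaps regardless of which system produced the file.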
|
08-10-2016, 04:17 AM | #8 | |
Wizard
Posts: 4,520
Karma: 121692313
Join Date: Oct 2009
Location: Heemskerk, NL
Device: PRS-T1, Kobo Touch, Kobo Aura
|
Quote:
|
|
08-10-2016, 08:14 AM | #9 | |
null operator (he/him)
Posts: 20,691
Karma: 26966376
Join Date: Mar 2012
Location: Sydney Australia
Device: none
|
Quote:
Select a contemporary current affairs program on your TV or YouTube, squelch the audio, and turn on subtitles - and count how many times the presenter and guests are alleged to have said things like 'breaks it', 'queue easy', 'helicoptor mummy' etc. They're a few of the common mistakes auto transcribers are making today. A few months ago 'Lou house bombast you crane' popped up a lot Ψ²
In part it's why I love the Mark feature in ebook-tools. After a while one gets to be proficient at reading the mind of the machine.
BR
Last edited by BetterRed; 08-10-2016 at 08:22 AM. |
|
08-10-2016, 10:20 PM | #10 | |
Bookmaker & Cat Slave
Posts: 11,466
Karma: 158448243
Join Date: Apr 2010
Location: Phoenix, AZ
Device: K2, iPad, KFire, PPW, Voyage, NookColor. 2 Droid, Oasis, Boox Note2
|
Quote:
I have wondered if there's some way for me to donate my time for subtitling. Mr. Hitch needs the subtitles for all the UK and Aussie stuff that we watch, and the subtitles/closed captioning are simply DREADFUL. I mean, dreadful. I don't know how the hell anyone can manage, if they can't fill in the blanks through hearing. It's unbelievable. It's not any better for US TV; it's pretty much as awful/worse. I know that some of the services are PAID, so that's the worst part. If, like DP and PG, it was all donated time, okay...I could wince and ignore it, but a commercial service? Appalling. </rant> Hitch |
|
08-11-2016, 04:49 AM | #11 |
Wizard
Posts: 3,413
Karma: 13369310
Join Date: May 2008
Location: Launceston, Tasmania
Device: Sony PRS T3, Kobo Glo, Kindle Touch, iPad, Samsung SB 2 tablet
|
Thanks for all the suggestions.
I had a little chat with the co-author, and she's in the process of manually removing the excess carriage returns, chapter by chapter. She misses a few, but it's no problem to find them when I'm proofing. And it makes a tremendous difference to the ease of setting up the HTML. |
08-11-2016, 01:31 PM | #12 | |
Bookmaker & Cat Slave
Posts: 11,466
Karma: 158448243
Join Date: Apr 2010
Location: Phoenix, AZ
Device: K2, iPad, KFire, PPW, Voyage, NookColor. 2 Droid, Oasis, Boox Note2
|
Quote:
I always try to prepare them, and explain that generally, the misspelt words or misread words (fiat=hat type of thing) aren't going to be the really hard bits; the hard bits are exactly what you're dealing with. The pilcrows that have landed utterly out of place. The paras that break at the end of one page--at the end of a sentence, and a new sentence, flush to the left, is at the top of the next page. New paragraph, after a scene break? Or just a continuation of the previous? They also get freaked out when they can't figure out how to remove the section breaks.
I tell them that our automated clips will find, depending on the book, between 80% and 97% of the broken paras. But we see books that our in-house analyses tell us have 600, 800 or more broken paragraphs. When you start figuring how many possible errors there are, if you only find 80%--man, that adds up. They almost always ask us to do it (usually until I mention the fees involved, to do it by hand/eye), but the part that they don't get is how much of it has to be READ, to get those that can only be corrected with context.
It's frustrating. There are some things, over the years, that I've found good everyday exemplars and analogies for; but broken paragraphs seem to be nearly impossible to explain to someone who doesn't really "get" paragraph codes, styles, outlines, headings, and all that good stuff. Even with screenshots to explain, and quick-n-dirty exports to HTML, viewed in a browser resized to screen size. They just don't understand the WHY. Why their sentences are breaking in half. You know, until someone understands the role of a paragraph, as a fundamental component of a word-processing document (you know, character, word, paragraph, etc.), trying to explain it is usually hopeless. That, or, I'm just a really crappy explainer.
At least your story has a reasonably happy ending, Alex!
Hitch |
|
08-12-2016, 04:37 AM | #13 |
Wizard
Posts: 3,413
Karma: 13369310
Join Date: May 2008
Location: Launceston, Tasmania
Device: Sony PRS T3, Kobo Glo, Kindle Touch, iPad, Samsung SB 2 tablet
|
That's interesting. Could you give us the URL for your site please so I could read the article?
|
08-12-2016, 04:40 PM | #14 |
Bookmaker & Cat Slave
Posts: 11,466
Karma: 158448243
Join Date: Apr 2010
Location: Phoenix, AZ
Device: K2, iPad, KFire, PPW, Voyage, NookColor. 2 Droid, Oasis, Boox Note2
|
|
08-12-2016, 11:58 PM | #15 | |
Curmudgeon
Posts: 629
Karma: 1623086
Join Date: Jan 2012
Device: iPad, iPhone, Nook Simple Touch
|
Quote:
|
|
|