How can one remove excess carriage returns?

AlexBell · 08-10-2016, 12:36 AM

I've been sent a doc file to turn into an ebook. My usual practice is to run the doc file through Atlantis Word Processor to turn it into an HTML file, and go on from there.

But the author, for reasons best known to her, has ended every line of text with a carriage return, so Atlantis turns every line of text into a paragraph.

Also, the author has not separated 'real' paragraphs in the text.

Can anyone suggest a way to get rid of the excess carriage returns/paragraphs? I vaguely remember a tool with which one could select a block of text, then press a key combination to remove all the <p> and </p> tags except the ones starting and finishing the block. But I can't remember which software it was in. Any suggestions?

Hitch · 08-10-2016, 01:00 AM

Quote:

Originally Posted by AlexBell

I've been sent a doc file to turn into an ebook. My usual practice is to run the doc file through Atlantis Word Processor to turn it into an HTML file, and go on from there.

But the author, for reasons best known to her, has ended every line of text with a carriage return, so Atlantis turns every line of text into a paragraph.

Also, the author has not separated 'real' paragraphs in the text.

Can anyone suggest a way to get rid of the excess carriage returns/paragraphs? I vaguely remember a tool with which one could select a block of text, then press a key combination to remove all the <p> and </p> tags except the ones starting and finishing the block. But I can't remember which software it was in. Any suggestions?

Oh, dear.

We see that quite often. I just saw several like that. By the time I finish explaining about broken paragraphs, what it takes to clean them, etc., the prospective client's eyes have glazed over, and they usually leave to find some other bookmaker that doesn't bother them with that codswallop (to quote one rather infamous near-miss client).

nb: I'm actually afraid to ask you about the nature/topic of the book. I'm afraid one of those that came through my door has ended up at yours!!!

We use in-house regex. It's the best way. Do one pass for those that have two in a row (the last line of a real para, and an empty para), and then for one, and then, sadly, you have to do the rest by hand/eye. Particularly surrounding those that break across pages, of course (if this was a scan). Can't really be done automatically.

On a commercial note, I hope that a) you asked them what crappy "auto-convert" program they used to give you this utterly FAKE Word file ($5 says it is an export from Adobe Acrobat--"save as Word"), or it's the output from a scan, or some bollocks like that, and b) that you are CHARGING to do all this extra work. That stuff is total nonsense. Your rates, presumably, are like ours--from a CLEAN source file, if using a word-processing file, right?

Seriously--if you're like us, you charge one rate for "from Word" and something a lot more expensive "from PDF" and so on. We frequently get these "faux-Word files," with prospectives thinking that we can't TELL that it was a PDF five minutes ago. Ask for the actual source--probably easier for you, and more expensive for them, but you should be paid for the actual time you're putting in.

Sheesh.

Hitch

AlexBell · 08-10-2016, 01:35 AM

Quote:

Originally Posted by Hitch

Oh, dear.

On a commercial note, I hope that a) you asked them what crappy "auto-convert" program they used to give you this utterly FAKE Word file ($5 says it is an export from Adobe Acrobat--"save as Word"), or it's the output from a scan, or some bollocks like that, and b) that you are CHARGING to do all this extra work.

Hitch

No, I've done other work from her and have a pdf of the original ornate book. I'm sure she did the carriage returns to get the text, illustrations, and poetry neatly lined up on the page - the doc file text when opened looks quite like the page on the pdf.

Thanks anyway.

Toxaris · 08-10-2016, 02:19 AM

The Post-OCR procedure of my add-in handles this, and it tries to be smart about it as well. It can happen that a line ends with a period but is not the last line of the paragraph. In those cases the procedure checks whether the first word of the next line would fit behind the period without overrunning the line (I hope I make sense here). If it fits, the line is probably the end of a paragraph. If not, then the paragraph continues on the next line.

After this, a couple of unknowns are usually still present (e.g. a heading usually does not end with a period). With the Search&Replace procedure, the last remaining dubious ends of lines are investigated and fixed. That is manual (it asks whether each replacement should be made).

This saves me a lot of time. For an average book the Post-OCR and these specific S&R commands take no more than 2-5 minutes.
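
For readers who want to experiment with the "would the next word have fit?" test described above, here is a minimal Python sketch. It is not the add-in's actual code: the function name `join_wrapped_lines`, the set of sentence endings, and the choice of the longest line as the wrap width are all assumptions made for illustration.

```python
# Hypothetical sketch of the "would the next word have fit?" heuristic.
# Assumption: the wrap width is the length of the longest line in the input.
SENTENCE_ENDINGS = (".", "!", "?", '"', "\u201d")


def join_wrapped_lines(lines, width=None):
    """Merge hard-wrapped lines into paragraphs."""
    lines = [ln.strip() for ln in lines if ln.strip()]
    if not lines:
        return []
    if width is None:
        width = max(len(ln) for ln in lines)

    paragraphs = []
    current = lines[0]          # paragraph being built
    prev_len = len(lines[0])    # length of the most recent physical line
    for nxt in lines[1:]:
        first_word = nxt.split()[0]
        ends_sentence = current.endswith(SENTENCE_ENDINGS)
        # If the next word would have fitted on the previous line, the break
        # after the sentence was probably deliberate: a paragraph end.
        would_fit = prev_len + 1 + len(first_word) <= width
        if ends_sentence and would_fit:
            paragraphs.append(current)
            current = nxt
        else:
            current = current + " " + nxt
        prev_len = len(nxt)
    paragraphs.append(current)
    return paragraphs
```

Headings and other lines that do not end in punctuation would still be merged into the following text by this sketch, so a manual review pass remains necessary.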

Doitsu · 08-10-2016, 02:24 AM

Quote:

Originally Posted by AlexBell

Can anyone suggest a way to get rid of the excess carriage returns/paragraphs?

Since MS Word allows you to search for hard returns with ^p, the easiest solution would be to:

1. Replace all consecutive line-breaks (^p^p) with a dummy character, e.g. ###.
2. Replace all remaining hard line-breaks with spaces.
3. Replace the dummy character (###) with hard line-breaks (^p). (You might have to replace ###### with ### first.)
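
The same three steps can be tried outside Word on a plain-text export. The following is only a sketch under stated assumptions: the function name `unwrap_with_placeholder` is made up, `\n` stands in for Word's `^p`, and the placeholder `###` is assumed not to occur in the text (the function refuses to run if it does).

```python
import re

PLACEHOLDER = "###"  # assumed not to occur anywhere in the text


def unwrap_with_placeholder(text, placeholder=PLACEHOLDER):
    """Keep blank-line paragraph breaks, turn single line breaks into spaces."""
    if placeholder in text:
        raise ValueError("placeholder already occurs in the text; pick another")
    text = text.replace("\r\n", "\n")
    # Step 1: any run of two or more line breaks becomes one placeholder.
    # Matching the whole run avoids a separate '######' -> '###' pass.
    text = re.sub(r"\n{2,}", placeholder, text)
    # Step 2: every remaining (single) line break becomes a space.
    text = text.replace("\n", " ")
    # Step 3: the placeholder becomes a real line break again.
    return text.replace(placeholder, "\n")
```

This only works when real paragraphs are separated by an empty line; if they are not, every line in the file ends up in one paragraph.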

kacir · 08-10-2016, 02:58 AM

Quote:

Originally Posted by AlexBell

... Also, the author has not separated 'real' paragraphs in the text.

Can anyone suggest a way to get rid of the excess carriage returns/paragraphs?

well ... it depends on how the book is formatted.

I personally would use Regular expressions.
If there is something like an empty line between real paragraphs, I would use the quick solution Doitsu suggested in the previous post.
If there is no empty line between paragraphs, there might be a tab character at the beginning of the paragraph or, if you are lucky, a few spaces, or the line might have a different indent. I would try to use that.

If all else fails I would find all lines that end with a dot followed by a CRLF and replace it with something like ### real paragraph here ###, then do the same thing for question mark, exclamation point, and also dot followed by a [closing] quote mark ... you get the idea.
Then I would replace all CRLF with a space, replace all the ### real paragraph here ### markers with CRLF and then check for two consecutive spaces (repeating until there are no more to replace).
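
A rough Python sketch of that marker approach follows. The function name `unwrap_by_punctuation` and the exact set of closing quote characters are assumptions for illustration, and line breaks are normalised to `\n` first, so the result uses `\n` rather than CRLF.

```python
import re

MARKER = "### real paragraph here ###"  # assumed not to occur in the text

# Sentence-ending punctuation, optionally followed by a closing quote,
# immediately before a line break.
SENTENCE_END = re.compile("([.?!][\"'\u201d\u2019]?)\n")


def unwrap_by_punctuation(text, marker=MARKER):
    """Treat only lines ending in sentence punctuation as paragraph ends."""
    text = text.replace("\r\n", "\n")
    # Mark every line break that follows sentence-ending punctuation.
    text = SENTENCE_END.sub(lambda m: m.group(1) + marker, text)
    # All remaining line breaks are mid-sentence wraps: join with a space.
    text = text.replace("\n", " ")
    # Restore the marked breaks.
    text = text.replace(marker, "\n")
    # Collapse runs of spaces in one pass instead of repeating the replace.
    return re.sub(r" {2,}", " ", text)
```

A wrapped line that merely happens to end with a full stop in the middle of a paragraph will be split here, so the output still needs checking by eye.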

Or, you could craft a regular expression that would replace any letter followed by a CRLF (end of line/paragraph) with the same letter followed by a space.
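
As a sketch of such an expression in Python (the function name `join_after_letters` is hypothetical, and "letter" is taken to mean any Unicode letter):

```python
import re

# A letter (word character that is not a digit or underscore) followed by
# a line break, with or without a carriage return.
LETTER_THEN_BREAK = re.compile(r"([^\W\d_])\r?\n")


def join_after_letters(text):
    """Replace 'letter + line break' with 'letter + space'."""
    return LETTER_THEN_BREAK.sub(r"\1 ", text)
```

Line breaks after punctuation are left alone, so lines ending in a comma or semicolon stay broken and need a further pass.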

Another trick would be to use the elaborate algorithms that OCR programs use. Just print the text into a pdf and run that through an OCR program ;-)
OCR programs use the tricks described above, plus they look at the number of characters on line, they look at the justification, if the text is fully justified and many other clever tricks.

It also depends on how much of the original formatting from the word you want to preserve.
I might just use search and replace from Word to insert formatting markup looking for specific formatting (such as style) and placing marks like {H1} at the beginning of the text where formatting changes and then export the text to a *.txt file and massage that with a powerful editor with real regular expressions (Gvim is my choice).

BetterRed · 08-10-2016, 03:59 AM

Quote:

Originally Posted by Doitsu

Since MS Word allows you to search for hard returns with ^p, the easiest solution would be to:

1. Replace all consecutive line-breaks (^p^p) with a dummy character, e.g. ###.
2. Replace all remaining hard line-breaks with spaces.
3. Replace the dummy character (###) with hard line-breaks (^p). (You might have to replace ###### with ### first.)

- if original uses # as a bullet point marker I use @@@ (three snails)

Same technique can be used in plain text files - in that case you look for \n (sometimes \r\n) rather than ^p
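
To find out which line ending a plain-text file actually uses before searching for it, a small Python sketch along these lines can help. The function names are made up for illustration; when reading the file, open it with `newline=""` so Python does not translate the line endings first.

```python
def count_line_endings(text):
    """Count Windows (CRLF), Unix (LF) and bare CR line endings."""
    crlf = text.count("\r\n")
    lf = text.count("\n") - crlf   # LFs that are not part of a CRLF
    cr = text.count("\r") - crlf   # CRs that are not part of a CRLF
    return {"crlf": crlf, "lf": lf, "cr": cr}


def normalize_newlines(text):
    """Convert every line ending to a plain LF."""
    return text.replace("\r\n", "\n").replace("\r", "\n")
```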

BR

Toxaris · 08-10-2016, 04:17 AM

Quote:

Originally Posted by Doitsu

Since MS Word allows you to search for hard returns with ^p, the easiest solution would be to:

1. Replace all consecutive line-breaks (^p^p) with a dummy character, e.g. ###.
2. Replace all remaining hard line-breaks with spaces.
3. Replace the dummy character (###) with hard line-breaks (^p). (You might have to replace ###### with ### first.)

That would cause a lot of false hits and replacements depending on the formatting. I would never recommend this.

BetterRed · 08-10-2016, 08:14 AM

Quote:

Originally Posted by Toxaris

That would cause a lot of false hits and replacements depending on the formatting. I would never recommend this.

Works good enough on the dozen or so TV and radio interview transcriptions I bash into shape every week, 'misplaced' line endings are the least of the difficulties of dealing with automated transcription text -

Select a contemporary current affairs program on your TV or youtube, squelch the audio, and turn on subtitles - and count how many times the presenter and guests are alleged to have said things like 'breaks it', 'queue easy', 'helicoptor mummy' etc

They're a few of the common mistakes auto transcribers are making today. A few months ago 'Lou house bombast you crane' popped up a lot Ψ²

In part it's why I love the Mark feature in ebook-tools. After a while one gets to be proficient at reading the mind of the machine

BR

Hitch · 08-10-2016, 10:20 PM

Quote:

Originally Posted by BetterRed

Works good enough on the dozen or so TV and radio interview transcriptions I bash into shape every week, 'misplaced' line endings are the least of the difficulties of dealing with automated transcription text -

Select a contemporary current affairs program on your TV or youtube, squelch the audio, and turn on subtitles - and count how many times the presenter and guests are alleged to have said things like 'breaks it', 'queue easy', 'helicoptor mummy' etc

They're a few of the common mistakes auto transcribers are making today. A few months ago 'Lou house bombast you crane' popped up a lot Ψ²

In part it's why I love the Mark feature in ebook-tools. After a while one gets to be proficient at reading the mind of the machine

BR

BUT, I gotta comment on this.

I have wondered if there's some way for me to donate my time for subtitling. Mr. Hitch needs the subtitles for all the UK and Aussie stuff that we watch, and the subtitles/closed captioning are simply DREADFUL. I mean, dreadful. I don't know how the hell anyone can manage, if they can't fill in the blanks through hearing. It's unbelievable.

It's not any better for US TV; it's pretty much as awful/worse. I know that some of the services are PAID, so that's the worst part. If, like DP and PG, it was all donated time, okay...I could wince and ignore it, but a commercial service? Appalling.

</rant>

Hitch

AlexBell · 08-11-2016, 04:49 AM

Thanks for all the suggestions.

I had a little chat with the co-author, and she's in the process of manually removing the excess carriage returns, chapter by chapter. She misses a few, but it's no problem to find them when I'm proofing. And it makes a tremendous difference to the ease of setting up the HTML.

Hitch · 08-11-2016, 01:31 PM

Quote:

Originally Posted by AlexBell

Thanks for all the suggestions.

I had a little chat with the co-author, and she's in the process of manually removing the excess carriage returns, chapter by chapter. She misses a few, but it's no problem to find them when I'm proofing. And it makes a tremendous difference to the ease of setting up the HTML.

You know, it's funny. We have an article that sets out all the steps it takes for an author-pub to go from physical book to eBook using scanning, on our site and in our canned responses. I realize that's not what you're doing, but...similar enough. No matter how many times I explain it, they never get it. Until afterward, when they get the scan, whether it's raw, or the scanning company has done a basic word-match edit (from the PDF proof to the Word file, I mean).

I always try to prepare them, and explain that generally, the misspelt words or misread words (fiat=hat type of thing) aren't going to be the really hard bits; the hard bits are exactly what you're dealing with. The pilcrows that have landed utterly out of place. The paras that break at the end of one page--at the end of a sentence, and a new sentence, flush to the left, is at the top of the next page. New paragraph, after a scene break? Or just a continuation of the previous?

They also get freaked out when they can't figure out how to remove the section breaks.

I tell them that our automated clips will find, depending on the book, between 80% and 97% of the broken paras. But we see books that our in-house analyses tell us have 600, 800 or more broken paragraphs. When you start figuring how many possible errors there are, if you only find 80%--man, that adds up. They almost always ask us to do it (usually until I mention the fees involved, to do it by hand/eye), but the part that they don't get is how much of it has to be READ, to get those that can only be corrected with context.

It's frustrating. There are some things, over the years, that I've found good everyday exemplars and analogies for; but broken paragraphs seem to be nearly impossible to explain to someone who doesn't really "get" paragraph codes, styles, outlines, headings, and all that good stuff. Even with screenshots to explain, and quick-n-dirty exports to HTML, viewed in a browser resized to screen size. They just don't understand the WHY. Why their sentences are breaking in half. You know, until someone understands the role of a paragraph, as a fundamental component of a word-processing document (you know, character, word, paragraph, etc.), trying to explain it is usually hopeless. That, or, I'm just a really crappy explainer.

At least your story has a reasonably happy ending, Alex!

Hitch

AlexBell · 08-12-2016, 04:37 AM

That's interesting. Could you give us the URL for your site please so I could read the article?

Hitch · 08-12-2016, 04:40 PM

Quote:

Originally Posted by AlexBell

That's interesting. Could you give us the URL for your site please so I could read the article?

Via PM. I'm never 100% sure about the rules, so...

FYI: it's JUST about the steps, but you're welcome to it.

Hitch

dgatwood · 08-12-2016, 11:58 PM

Quote:

Originally Posted by Hitch

BUT, I gotta comment on this.

I have wondered if there's some way for me to donate my time for subtitling. Mr. Hitch needs the subtitles for all the UK and Aussie stuff that we watch, and the subtitles/closed captioning are simply DREADFUL. I mean, dreadful. I don't know how the hell anyone can manage, if they can't fill in the blanks through hearing. It's unbelievable.

It's not any better for US TV; it's pretty much as awful/worse. I know that some of the services are PAID, so that's the worst part. If, like DP and PG, it was all donated time, okay...I could wince and ignore it, but a commercial service? Appalling.

</rant>

Equally off-topic reply: These days, most subtitling is done automatically by speech recognition software. That's why it is so bad. Just imagine saying, "Siri, type 'It was the best of times, it was the worst of times'" and you'll get the idea.
