utility to eliminate unwanted line breaks in txt


profnachos
11-20-2007, 12:18 AM
I have written a little program to eliminate unwanted line breaks from txt files converted from PDF (I am a freelance programmer). I wrote it after getting unsatisfying responses to this thread I posted (http://www.mobileread.com/forums/showthread.php?t=16224). Nothing personal against those who helped, but none of the options looked attractive.

So what you will do is open a PDF file in Adobe Reader and choose the option to "Save as Text." Once the PDF file has been converted to txt, you'd use my utility to get rid of the unwanted line breaks before converting it to a format of your choice.

I'd love to make it available to the public and get feedback, but the only problem is that I do not have a public website to host the program. So let me know if you wish to collaborate with me on this.

Also, if the interest level is high, I would like to write a utility that does the same for HTML files.

If somebody has done this already, please let me know.

Thanks. My email is profnachos@gmail.com

NatCh
11-20-2007, 02:11 AM
Hey, profnachos, you're probably not going to get much attention right now, not because you don't deserve it, but because everyone is in a frenzy over the Kindle Launch. I'd suggest you give it a few days to die down and try it then. :wink:

Also, a number of folks have used this forum to collaborate on apps, so you might want to consider that option. :nice:

HarryT
11-20-2007, 02:30 AM
What makes your newline remover better than any of the dozen or so similar ones that we already have?

profnachos
11-20-2007, 02:40 AM
I never made that claim, and I specifically asked for a useful tool to do the same in this post (http://www.mobileread.com/forums/showthread.php?t=16224), but none of the responses in the thread pointed to "the dozen or so similar ones we already have."

Perhaps you can list the dozen or so similar ones, so I don't have to continue to work on this.

If you are looking for an online piss fight, then move on. Not interested.

HarryT
11-20-2007, 03:34 AM
There's no need to be impolite. I'm certainly not criticising your work; I was wondering what benefits your tool offered over those which already exist?

If you have "Word" on your PC, an excellent alternative is Stingo's Word Macro (search the forum for it) which, in addition to removing newlines, does a number of other tidying-up operations too.

A tool that I've used myself is a little command-line freeware app called "textify" which offers a range of nice formatting options for text files (e.g. leaving blank lines between paragraphs, no blank lines but indentation, or wrapping up the text file in "<P>" HTML paragraph markers). A Google search will find it.
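
For anyone curious what those three output styles amount to, here is a rough Python sketch. It is not textify's code and makes no claim about how textify works internally; the function name `reflow` and the style names are made up for illustration, and it assumes paragraphs in the input are already separated by blank lines.

```python
import html
import re


def reflow(text: str, style: str = "blank") -> str:
    """Join hard-wrapped lines into paragraphs and emit them in one style.

    Assumption: paragraphs in the input are separated by blank lines.
    """
    # Split wherever a blank (or whitespace-only) line separates two blocks.
    blocks = re.split(r"\n\s*\n", text.strip())
    # Inside each block, replace line breaks and runs of spaces with one space.
    paragraphs = [" ".join(block.split()) for block in blocks if block.strip()]

    if style == "blank":    # blank line between paragraphs
        return "\n\n".join(paragraphs) + "\n"
    if style == "indent":   # no blank lines, indented first line instead
        return "\n".join("    " + p for p in paragraphs) + "\n"
    if style == "html":     # each paragraph wrapped in <p> markers
        return "\n".join(f"<p>{html.escape(p)}</p>" for p in paragraphs) + "\n"
    raise ValueError(f"unknown style: {style!r}")


if __name__ == "__main__":
    sample = "It was a dark\nand stormy night.\n\nThe rain fell\nin torrents.\n"
    print(reflow(sample, "html"))
```

This only helps when the converted text keeps blank lines between paragraphs; if the PDF export drops them, some other paragraph heuristic is needed.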

Another much more sophisticated tool is "Gutenmark" (http://www.sandroid.org/GutenMark/) which has all sorts of facilities for converting plain text to marked-up HTML.

Doing a Google search for "freeware newline remover" will show many more.

What facilities does your tool offer?

profnachos
11-20-2007, 11:58 AM
I apologize for the tone of my response. I thought your response read, "What makes you THINK," which was not the case. I need a reading comprehension course :o

I will look them up. No, there is nothing special about my tool, and your suggestions look great.

HarryT
11-20-2007, 12:02 PM
No problem - this is a very easy medium in which to misunderstand the tone of a post.

profnachos
11-23-2007, 03:08 AM
Harry, are there tools that rework an HTML file that has <br> tags all over the place? That is what happens when you use pdftohtml. It does not differentiate between paragraph breaks and line breaks.

All the tools you mentioned are for txt files. Thanks.

HarryT
11-23-2007, 03:13 AM
Could you not do it the same way that the text file clean-up tools work - treat two consecutive <br>'s as a paragraph break, and then delete all the others? That's all that springs to mind at present, I'm afraid!
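
A minimal Python sketch of that idea, offered as an illustration only. The function name `br_pairs_to_paragraphs` is hypothetical, the input is assumed to be a fragment of body HTML rather than a full document, and single <br> tags are replaced by a space instead of being deleted outright so that words on adjacent lines do not run together.

```python
import re

# Matches <br>, <br/>, <br /> in any letter case.
BR = r"<br\s*/?>"


def br_pairs_to_paragraphs(html_text: str) -> str:
    """Treat runs of two or more <br> tags as paragraph breaks; drop the rest."""
    # Two or more <br> tags in a row (whitespace allowed between them)
    # are taken to mark a paragraph boundary.
    chunks = re.split(rf"(?:{BR}\s*){{2,}}", html_text, flags=re.IGNORECASE)
    paragraphs = []
    for chunk in chunks:
        # Any <br> still left is a soft line break: turn it into a space.
        joined = re.sub(BR, " ", chunk, flags=re.IGNORECASE)
        joined = " ".join(joined.split())  # collapse leftover whitespace
        if joined:
            paragraphs.append(f"<p>{joined}</p>")
    return "\n".join(paragraphs)


if __name__ == "__main__":
    sample = "Line one<br>\nline two<br><br>\nNext para<br/>\nends here."
    print(br_pairs_to_paragraphs(sample))
```

If the converter never emits two <br> tags in a row, this returns the whole input as one paragraph, so it depends entirely on how the HTML was produced.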

kacir
11-23-2007, 08:37 AM
I have written a little program to eliminate unwanted line breaks from txt files converted from PDF ... I'd love to make it available to the public and get feedback, but the only problem is, I do not have a public website to make the program available to you.

Yes, I would like to try/test your program.

You can attach a file to your post here. Many people do.
You can register on one of many servers like SourceForge.
You can register on some freehosting server.
You can upload your file to rapidshare server.

By the way.
Have you seen par?
http://www.nicemice.net/par/

profnachos
11-27-2007, 06:23 PM
Could you not do it the same way that the text file clean-up tools work - treat two consecutive <br>'s as a paragraph break, and then delete all the others? That's all that springs to mind at present, I'm afraid!

Well, not all paragraphs are handled with two consecutive br tags. I converted Crime and Punishment from PDF to HTML with pdftohtml. I don't see two consecutive <br>'s anywhere.

I am thinking that if there is a period right before the <br> tag, that is the end of the paragraph. Of course it won't always be right, but that seems to be the best "guess."
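
Here is a rough Python sketch of that guess, under the assumption that the HTML body is just text separated by <br> tags. The function name `br_period_heuristic` is made up for this example, and the sample text is invented rather than taken from the converted book.

```python
import re

# Matches <br>, <br/>, <br /> in any letter case.
BR_SPLIT = re.compile(r"<br\s*/?>", re.IGNORECASE)


def br_period_heuristic(html_text: str) -> str:
    """Guess paragraph ends: a line ending in a period right before a <br>."""
    paragraphs, current = [], []
    for piece in BR_SPLIT.split(html_text):
        line = " ".join(piece.split())  # trim and collapse whitespace
        if not line:
            continue
        current.append(line)
        # A period immediately before the <br> is taken as the paragraph end.
        if line.endswith("."):
            paragraphs.append(" ".join(current))
            current = []
    if current:  # trailing text that never reached a period
        paragraphs.append(" ".join(current))
    return "\n".join(f"<p>{p}</p>" for p in paragraphs)


if __name__ == "__main__":
    sample = (
        "The first line wraps<br>\nand ends here.<br>\n"
        "A second paragraph<br>\nfollows.<br>"
    )
    print(br_period_heuristic(sample))
```

As expected, this guesses wrong whenever a sentence happens to end at the end of a line in the middle of a paragraph, and it misses paragraphs that end in a question mark, exclamation mark, or closing quote.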

profnachos
11-27-2007, 06:24 PM
I think I want to focus on a tool to clean up HTML. It looks like Gutenmark, which gets a mention above, already does a decent job of cleaning up text files, so there is no need to reinvent the wheel.