Remove new lines from a text file

I want to copy some text from a PDF file into a word processor. Copying works, but the line breaks from the PDF layout are copied as well, so the pasted text ends up fragmented across many short lines. I decided to use the command line to fix this automatically. After a bit of searching, it seems that two steps are needed.

First, I copied the text from the PDF and saved it in a plain text file; let’s call it lines.txt.

As described in Post 5 of this discussion, let’s add a space at the end of each line and write the result to a new file:

sed 's/$/ /' ./lines.txt > linesSpace.txt

Then, let’s remove the newlines from that file, as suggested in this post:

tr -d '\n' < linesSpace.txt > noNewLines.txt
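
As a quick check, here is a minimal sketch of the two steps run on a small made-up sample. The sample sentence and the temporary directory are assumptions for illustration only; the file names follow the ones used above. The last part shows, for comparison, what tr alone produces when the space is not added first.

```shell
# Hypothetical sample standing in for text copied from a PDF
tmpdir=$(mktemp -d)
printf 'The quick brown\nfox jumps over\nthe lazy dog.\n' > "$tmpdir/lines.txt"

# Step 1: append a space to the end of every line
sed 's/$/ /' "$tmpdir/lines.txt" > "$tmpdir/linesSpace.txt"

# Step 2: delete every newline character
tr -d '\n' < "$tmpdir/linesSpace.txt" > "$tmpdir/noNewLines.txt"

joined=$(cat "$tmpdir/noNewLines.txt")
printf '[%s]\n' "$joined"
# prints: [The quick brown fox jumps over the lazy dog. ]

# For comparison: skipping step 1 glues neighbouring words together
glued=$(tr -d '\n' < "$tmpdir/lines.txt")
printf '[%s]\n' "$glued"
# prints: [The quick brownfox jumps overthe lazy dog.]

rm -r "$tmpdir"
```

Note that the joined text ends with a trailing space and has no final newline, which should not matter when pasting into a word processor.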

If no space were inserted after the last word of each line, removing the line breaks would merge that word with the first word of the line below. I decided to write a simple script that combines both steps:

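#!/bin/bash

echo "Enter the full path to your input file:"
read -r INPUT

echo "Enter the full path to your output file:"
read -r OUTPUT

# Add a space to the end of each line, then remove the newlines.
# The variables are quoted so that paths containing spaces still work.
sed 's/$/ /' "$INPUT" | tr -d '\n' > "$OUTPUT"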
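
If typing the paths at a prompt gets tedious, the same pipeline could also be wrapped in a small function that takes the two paths as arguments. This is only a sketch: the function name join_lines and the sample file names are made up for illustration.

```shell
# Hypothetical helper: join_lines INPUT OUTPUT
join_lines() {
    # Refuse to run without exactly two arguments
    if [ "$#" -ne 2 ]; then
        echo "usage: join_lines INPUT OUTPUT" >&2
        return 1
    fi
    # Same pipeline as in the script: add a space to each line, then drop the newlines
    sed 's/$/ /' "$1" | tr -d '\n' > "$2"
}

# Usage with made-up file names that contain spaces
workdir=$(mktemp -d)
printf 'first line\nsecond line\n' > "$workdir/my input.txt"
join_lines "$workdir/my input.txt" "$workdir/my output.txt"
result=$(cat "$workdir/my output.txt")
printf '[%s]\n' "$result"
# prints: [first line second line ]
rm -r "$workdir"
```

Quoting "$1" and "$2" is what lets the paths contain spaces.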

That’s it!


