I want to copy some text from a PDF file into a word processor. Copying works, but the line breaks from the PDF layout are copied too, which leaves the text broken into many short, disconnected lines. I decided to use the command line to fix this automatically. After a bit of searching, it seems that two steps are needed.
First, I copied the text from the PDF and saved it in a plain text file; let's name it something like lines.txt. Next, add a space to the end of each line:
sed 's/$/ /' ./lines.txt > linesSpace.txt
Then, let's remove the newlines from that file, as suggested in this post:
tr -d '\n' < linesSpace.txt > noNewLines.txt
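A quick way to check what the two commands do is to run them on a small sample. This is only a sketch: sample.txt and its two lines are made up for illustration and stand in for text copied from a PDF.

```shell
# Hypothetical two-line sample standing in for text copied from a PDF.
printf 'first line\nsecond line\n' > sample.txt

# Removing the newlines alone glues the words at the line break together.
tr -d '\n' < sample.txt
# prints: first linesecond line

# Adding a space to each line first keeps the words apart.
sed 's/$/ /' sample.txt | tr -d '\n'
# prints: first line second line (with one trailing space)
```

Note that the result has no final newline, so the shell prompt may appear directly after the output.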
If no space were inserted after the last word of each line, removing the line breaks would merge the last word of each line with the first word of the line below. I decided to write a simple script that combines the two steps:
#!/bin/bash

echo "Enter full path to your input file:"
read -r INPUT
echo "What will your output file be? Output:"
read -r OUTPUT

# Add a space to the end of each row, then remove the newlines
sed 's/$/ /' "$INPUT" | tr -d '\n' > "$OUTPUT"
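As a possible shortcut, tr can also translate characters instead of deleting them, so replacing each newline with a space should do both steps at once. The sketch below assumes the input ends with a newline, in which case it gives the same result as the sed and tr pipeline. The function name join_lines is made up for this example.

```shell
#!/bin/bash

# join_lines is a hypothetical helper name, not part of the script above.
# It reads text on standard input and replaces every newline with a space.
join_lines() {
    tr '\n' ' '
}

# Example usage with the file names from the steps above:
#   join_lines < lines.txt > noNewLines.txt
```

Because it reads from standard input and writes to standard output, it can be used with redirection or in a pipe rather than prompting for file names.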