Linux Informatics – Cleaning Data With ‘grep’

Say you have a large text file with a bunch of lines that you don’t need or want in there. How do you remove them? When working with multiple large files, such as log files, doing this by hand can be a major waste of time.

If you are lucky enough that the lines you need to filter out of your data share a common pattern (which happens to be the case with most properly formatted log files), then learning the ‘grep’ command properly can save you massive amounts of time. It’s also really not that hard, as you only need to learn how to use a few options and special characters with ‘grep’.

Here’s a small snippet from a larger ‘.bash_history’ file I was working with in CentOS 7.

#1537917992
whoami
#1538087893
ls -l
#1538087953
cat zone*
#1538087974
cat zone*| more

The .bash_history file is a history log that each user has in their home directory. It records the commands that the user ran in the bash shell.

This specific bash history file has lines that show the epoch timestamp before each command that was actually entered by the user.
This could be very useful in certain situations where you need the time that a command was entered. Yet these timestamps can be problematic if you have a script trying to access this file’s information and your script isn’t configured to read information with this very specific format.
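
If you ever need to read one of these timestamps, it can be converted into a human-readable date. Here is a minimal sketch, assuming GNU ‘date’ (the ‘-d @…’ syntax is a GNU extension and may not exist in other ‘date’ implementations), using the first timestamp from the snippet above:

```shell
# Convert the epoch timestamp from the first history entry into a UTC date.
# Assumes GNU date; "-d @SECONDS" is a GNU extension.
stamp=1537917992
readable=$(date -u -d "@$stamp" '+%Y-%m-%d %H:%M:%S')
echo "$readable"   # 2018-09-25 23:26:32
```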

How can we remove these unwanted lines?

Well, this small snippet is only 8 lines, so it would be easy enough to remove the unwanted lines by hand. The full file I was working with had 254 lines, which is small as log files go, but still a huge pain to clean up by hand!

One utility that fits this job perfectly is ‘grep’, which searches through a file line by line and prints the lines that contain the pattern you are searching for. You can also use the ‘-v’ option to “invert” the match, so instead of showing you the lines that DO match, grep shows you the lines that DON’T match.

Removing lines with the 'grep' utility

The ‘-v’ option with grep makes a great tool for removing lines that have a common pattern!

[simterm] $ grep -v pattern filename.txt
[/simterm] (this outputs all lines from a file named ‘filename.txt’ that don’t contain the text ‘pattern’)

This example would still output lines containing any variant of ‘pattern’ that isn’t written entirely in lower-case letters, such as ‘Pattern’, ‘PATTERN’ or ‘pattErn’. This is because grep does case-sensitive searches by default.
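
To try the inverted match without touching any real data, here is a small sketch that builds a throwaway file first. The sample lines and the temporary directory are made up for illustration; only the grep call itself comes from the example above.

```shell
# Hypothetical sample data, written to a throwaway directory.
workdir=$(mktemp -d)
printf '%s\n' \
  'keep this line' \
  'a pattern appears here' \
  'Pattern with a capital P' \
  'another keeper' > "$workdir/filename.txt"

# Print every line that does NOT contain the lower-case text "pattern".
inverted=$(grep -v pattern "$workdir/filename.txt")
printf '%s\n' "$inverted"
# keep this line
# Pattern with a capital P
# another keeper
```

Notice that the line with the capital ‘P’ survives, because the default search is case-sensitive.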

Case-insensitive searches with 'grep'

Case-insensitive searches are done with the ‘-i’ option.

[simterm] $ grep -v -i pattern filename.txt
[/simterm]

Easier combination of multiple options

In most environments it’s completely acceptable, and easier, to combine short options like this:

[simterm] grep -vi pattern filename.txt
[/simterm] (note: you typically can’t combine ‘long options’ this way, which are the ones that start with a double dash ‘--’)
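
If you want to convince yourself that the separate, combined and long forms behave the same, this sketch compares all three on a made-up sample file. The long option names ‘--invert-match’ and ‘--ignore-case’ are the ones documented for GNU grep; other grep implementations may differ.

```shell
# Hypothetical sample data, written to a throwaway directory.
workdir=$(mktemp -d)
printf '%s\n' 'keep me' 'a pattern here' 'a PATTERN here' 'keep me too' \
  > "$workdir/filename.txt"

separate=$(grep -v -i pattern "$workdir/filename.txt")   # two short options
combined=$(grep -vi pattern "$workdir/filename.txt")     # short options combined
# Long options (GNU grep names) each need their own double dash.
long=$(grep --invert-match --ignore-case pattern "$workdir/filename.txt")

printf '%s\n' "$combined"
# keep me
# keep me too

[ "$separate" = "$combined" ] && [ "$combined" = "$long" ] && echo "all three forms agree"
```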

The specific log file that I was working with didn’t call for a case-insensitive search though, as I’m simply trying to remove lines that start with a pound sign.

Removing lines that start with '#'

This is how you would remove all lines starting with ‘#’ in a file named ‘.bash_history’:

[simterm] grep -v ^\# .bash_history
[/simterm]

… This can be quite confusing to understand if you’re just starting out. Let’s break it down!

Command Breakdown : grep -v ^\# filename

-v is the option to invert the search results, so grep will only output lines that don’t have the pattern you are searching for

^\# is what we are actually telling grep to search for on each line… But you might be wondering right now “I thought we would only be searching for #… what does ^\# even mean?!”

^ is a special character which grep reads as “the start of the line”, so the pattern only matches lines that begin with what follows. If we left it out, grep would match every line that had a pound sign anywhere on it, not just at the beginning

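\ is called an escape character. The character following the backslash won’t be read by the shell as a special character, but as plain text. Here it makes sure bash passes # along literally instead of treating it as the start of a comment; bash removes the backslash, so grep receives the pattern ^#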

# is what we are actually searching for on the lines!

filename is of course the name or path to the file we want grep to work with and search through

read more on ‘escape characters’: http://tldp.org/LDP/Bash-Beginners-Guide/html/sect_03_03.html
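
The breakdown above can be checked on a small sample. This is only a sketch: the file name ‘history_sample’ and the ‘echo "ticket #42"’ entry are invented here, to show what happens to a command that contains a pound sign in the middle of the line.

```shell
# Hypothetical history sample, written to a throwaway directory.
workdir=$(mktemp -d)
hist="$workdir/history_sample"
printf '%s\n' \
  '#1537917992' 'whoami' \
  '#1538087893' 'ls -l' \
  '#1538087953' 'echo "ticket #42"' > "$hist"

# The shell strips the backslash, so grep is handed the two characters ^#.
seen_by_grep=$(printf '%s' ^\#)
echo "$seen_by_grep"   # ^#

# With ^ : only lines that START with # are dropped.
anchored=$(grep -v ^\# "$hist")
printf '%s\n' "$anchored"
# whoami
# ls -l
# echo "ticket #42"

# Without ^ : every line containing # anywhere is dropped, including a real command.
unanchored=$(grep -v \# "$hist")
printf '%s\n' "$unanchored"
# whoami
# ls -l
```

Writing the pattern in single quotes, as in grep -v '^#', should hand grep the same two characters and is a common alternative to the backslash.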

Now that you know some useful grep basics, let’s work on using the output that ‘grep’ gives us in various ways, such as viewing it in a better format or saving it for later, by using pipes or redirects.

Piping the output into 'more'

One option is piping the output into another command. Here’s an example that pipes the output into ‘more’, which lets you page through large amounts of output in a terminal.

[simterm] grep -v ^\# .bash_history | more
[/simterm] a simple pipe, made by using ‘ | ’ between commands
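
A pager is not the only thing you can pipe into. As a sketch on a made-up history sample, the cleaned output below is piped into ‘sort -u’ to list each distinct command once and into ‘wc -l’ to count the commands that are left. ‘LC_ALL=C’ is set only to keep the sort order predictable.

```shell
# Hypothetical history sample, written to a throwaway directory.
workdir=$(mktemp -d)
hist="$workdir/history_sample"
printf '%s\n' \
  '#1537917992' 'whoami' \
  '#1538087893' 'ls -l' \
  '#1538087953' 'ls -l' \
  '#1538087974' 'cat zone*' > "$hist"

# Drop the timestamp lines, then list each distinct command once.
unique=$(grep -v ^\# "$hist" | LC_ALL=C sort -u)
printf '%s\n' "$unique"
# cat zone*
# ls -l
# whoami

# Drop the timestamp lines, then count what is left.
count=$(grep -v ^\# "$hist" | wc -l)
echo "commands left: $((count))"   # commands left: 4
```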

Redirecting the output into a new file

Another option is saving the cleaned up data into a new file, by redirecting our grep output into another file.
(Don’t redirect into the same file!!… The shell empties the file before grep gets to read it, which nukes the file and leaves it blank!)

[simterm] grep -v ^\# .bash_history > .bash_history_cleaned
[/simterm] a simple redirect, using ‘ > ’ and then the new file’s name
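
If you would like to see the warning above for yourself without risking real data, here is a sketch that runs both the safe and the unsafe redirect on throwaway copies. All file names are invented for the demonstration.

```shell
# Hypothetical history sample plus a sacrificial copy, in a throwaway directory.
workdir=$(mktemp -d)
printf '%s\n' '#1537917992' 'whoami' '#1538087893' 'ls -l' > "$workdir/history_sample"
cp "$workdir/history_sample" "$workdir/history_copy"

# Safe: read one file, write a different one.
grep -v ^\# "$workdir/history_sample" > "$workdir/history_cleaned"
cat "$workdir/history_cleaned"
# whoami
# ls -l

# Unsafe: the shell truncates history_copy before grep starts reading it.
# Some grep versions also print an error here, so stderr is silenced.
grep -v ^\# "$workdir/history_copy" > "$workdir/history_copy" 2>/dev/null || true

[ -s "$workdir/history_copy" ] || echo "history_copy is now empty"
```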

Redirecting the output into the same file (overwriting the original!)

In less frequent cases, you may just want to quickly clean up a file and permanently trash unwanted lines from it.

[simterm] grep -v ^\# .bash_history > temp.file && cat temp.file > .bash_history && rm temp.file
[/simterm] This example redirects the cleaned data into a temporary file, then writes that temporary file back into the original file, overwriting all of the old file’s data with the cleaned-up data!
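
Here is a slightly more careful sketch of the same idea, run against a throwaway copy instead of a real history file. It swaps the fixed name ‘temp.file’ for one generated by ‘mktemp’, so an existing file with that name can’t be clobbered by accident. The file names are made up for illustration.

```shell
# Hypothetical file to clean in place, inside a throwaway directory.
workdir=$(mktemp -d)
target="$workdir/history_copy"
printf '%s\n' '#1537917992' 'whoami' '#1538087893' 'ls -l' > "$target"

# Let mktemp pick a unique temporary file name.
tmp=$(mktemp "$workdir/cleaned.XXXXXX")

# Clean into the temporary file, copy it back over the original, then tidy up.
grep -v ^\# "$target" > "$tmp" && cat "$tmp" > "$target" && rm "$tmp"

cat "$target"
# whoami
# ls -l
```

Writing back with ‘cat’ keeps the original file in place and only replaces its contents.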

What exactly is the ‘&&’ from that last command?
&& is an ‘operator’ used to string multiple commands together. With the ‘&&’ operator, each command depends on the one before it: the next command won’t execute unless the command before it ran successfully.
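
A tiny sketch of that behaviour, using the built-in ‘true’ and ‘false’ commands plus a made-up file that contains nothing but timestamp lines. One thing worth knowing: grep reports failure (exit status 1) when it selects no lines at all, so in that case anything chained after it with ‘&&’ is skipped.

```shell
# Hypothetical file containing only timestamp lines.
workdir=$(mktemp -d)
printf '%s\n' '#1537917992' '#1538087893' > "$workdir/only_timestamps"

log=""
true  && log="$log after-true"    # runs: "true" succeeded
false && log="$log after-false"   # skipped: "false" failed

# grep -v selects no lines here, so it exits with status 1 and the
# command after && is skipped as well.
grep -v ^\# "$workdir/only_timestamps" > "$workdir/cleaned" && log="$log after-grep"

echo "steps that ran:$log"   # steps that ran: after-true
```

For the cleanup chain shown earlier, this means a file made up entirely of ‘#’ lines would be left untouched instead of being emptied.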

(more on ‘operators’ https://www.gnu.org/software/bash/manual/bashref.html#Lists)

— Hope this has helped you on your learning journey, thanks for reading! —
