These past few weeks, I have been tasked to do some optimization and load testing in a transaction table in the system I am working on. With this, I do learn new stuff from it. It was a challenging and satisfying experience overall. In this article we will focus on the large SQL file that I received to start the load testing.

These are the new Linux commands that I learned when working on that task:

Unix sed utility
Unix find command

Personally, this article works more a notebook for me. But I think it would be nice to share my learning journey publicly.

Unix `sed` Utility

The sed or stream editor in Unix is a CLI utility used to stream and manipulate a text file.

I used this command to process a database dump of tables, data, and everything from QAT server to an SQL file. This SQL file will be restored to my local database instance to enable testing locally with the same load from the QAT server without touching the data from QAT server. The problem with the dumped SQL file is that it is very large – like 4GB large! Sure postgres can handle all of that, but the problem lies with the generated SQL code: the schema and owner names are hard-coded in the file, therefore it doesn't work in my local setup, and I have to edit it. Using Find and Replace feature in VS Code crashes it and IntelliJ IDEA opens the file in read-only mode because of the file size.

With those problem, the a solution that I thought is to manipulate the file content without opening the entire file. First thing that came to my mind is the sed command.

Again, the problems that need to be solved are:

Remove the ownership setting code from the file for each object.
Change the schema name to use the schema in my local machine.

Problem #1 - Delete all set owner command

This is done by deleting all the lines that contains ALTER <something> OWNER TO owner_name; statement in the code.

The sed command to do it is:

sed -i 'EXTENSION' '/PATTERN/d' FILENAME

The sed delete line command that I used in MacOS (BSD sed).

Command breakdown:

-i – Edit the input file.
EXTENSION – The filename extension of the backup file. If there is no need for backup, and empty string can be provided as a value.
PATTERN – The string pattern to search in the file.
d – Flags that the line should be deleted.
FILENAME – The input file.

In my case, the command I run somewhat looks like this:

sed -i .bak '/OWNER TO qat_db_owner/d' core-database_march-2020.sql

In GNU sed, the (almost) similar delete line command is:

sed '/OWNER TO qat_db_owner/d' core-database_march-2020.sql > reduced.sql

The result of the sed command will be printed to stdout and will be streamed into reduced.sql.

Problem #2 - Change the schema name

The schema used in the QAT server is different from the schema name used in my machine. This is solved by using Find and Replace functionality in the text editors. The similar sed command for this is:

sed -i EXTENSION 's/OLD_TEXT/NEW_TEXT/g' FILENAME

The sed replace all text command that I used in MacOS.

Command breakdown:

-i – Edit the input file.
EXTENSION – The backup file extension.
s – Substitute text flag.
OLD_TEXT – The pattern to search and be replaced.
NEW_TEXT – The text to replace text matched the OLD_TEXT.
g – Global replace flag. This flag will replace all occurrence of OLD_TEXT; not just the first one.
FILENAME – The input file.

In my case, the command I run looks something like this:

sed -i .bak 's/_qat/_dev/g' core-database_march-2020.sql

This command will replace project_qat schema name to project_dev. Note that I only searched for _qat string and replaced it with _dev with reasons I forgot.

The similar replace text command in GNU sed is:

sed 's/_qat/_dev/g' core-database_march-2020.sql > replaced.sql

The result of this sed command will be printed in stdout and will be streamed to replaced.sql.

Unix `find` Command

The Unix find command is command to, you know, find files in a directory. This command prints the absoulte filename of the matched files from the arguments passed in the command. What made find better than running grep from ls is that the find command not only searches with a matching filename but has more flexibility when searching files.

I used find command since I need to archive and delete files that is older the given date. This is done by using find's -mtime (modified time) argument. Another feature that I need from find is the -exec argument that runs a command for each found file.

Requirement #1 - Archive all files at least 3 months old

I need to delete a specific set of files. But before that, the files must be backed-up first. To do this, I fed the result of a find command to a tar command. The command I run looks like this:

tar czvf /home/produser/backup_mar-2020.tar.gz $(find /files-directory -mtime +120)

A simple tar command that compresses files more than 120 days old.

There are 2 commands run in this small script:

find
- /files-directory – The top-level directory to search from.
- -mtime +120 – Matches all files older than 120 days.
tar
- c – Creates a tar file.
- z – Zips the tar file.
- v – Print verbose report to stdout.
- f /home/produser/backup_mar-2020.tar.gz – The output filename.
- $(...) – Evaluates the command inside and uses its result as program argument. In the case of tar command, this will be the list of files to be tarballed.

After running this command, a tar file will be created at /home/produser directory. This tarball can now be moved to a proper storage and the deletion can proceed.

Requirement #2 - Delete all files at least 3 months old

Now that that the files are backed-up, they can now be deleted safely. To ensure the same results from the previous find command, the same arguments will be passed with additional delete command. The command I run looks like this:

find /files-directory -mtime +120 -exec rm {} \;

find command breakdown:

/files-directory -mtime +120 – Same as above. Shows all files older than 120 days from /files-directory top-level directory.
-exec – Execute a command for each file found.
rm {} – The command to execute. Here the basic rm command is used to delete a file and the {} will be replaced with the filename of the matching file.
\; – The ; symbol is used to terminate the statement, in this case rm {}. The ; symbol is escaped to prevent misinterpretation by the shell that runs the find command.

To emphasize why I used $(...) syntax from the archive operation and used -exec ... \; in the second operation depends on how tar and rm accepts files as input.

In the case of rm command, I used the -exec argument to to delete files because the syntax is easier – no need to nest bash commands using the $(...) syntax. Additionally, using rm command to each files matched is no different from deleting all the matched files at once – all the files I want gone will be gone.

rm text_1.txt
rm text_2.txt
rm text_3.txt

# is similar to
rm text_1.txt text_2.txt text_3.txt

As far as I'm concerned, those 3 files will be deleted. This is why -exec rm {} \; is enough to solve this problem.

In the case of tar command, the list of files must be sent to the command at once. This is because we are supposed to group the files to a single tar file.

While researching a solution to my problem, I have found another way to do this by using the -delete flag. The command is less complex from using -exec argument but it doesn't seem to work in the server, not sure why. Might be the OS or bash version? I didn't really dug onto it.

find /files-directory -mtime +120 -delete

The -delete argument looks more intuitive than -exec rm {} \;. Too bad it doesn't work on the machine I am working with.

Conclusion

Unix offers powerful command-line utilities and commands to work with large files without using much of you computer resources: sed and find. Since these commands are from Unix, these commands are also available from derived systems such as Linux and MacOS.

The sed utility is a powerful command-line application that enables streaming and editing text files without using large amount of memory when working with large files.

I used sed to delete lines in a file that contains a specific pattern. The command I used to delete the set owner SQL statements in MacOS is:

sed -i '.bak' '/OWNER TO qat_db_user/d` input_file.sql

Remember that d flag at the end of the pattern string tells the program that I want to delete the entire line that matches the pattern.

I also used sed to change the schema from project_qat to project_dev by using this command:

sed -i '.bak' 's/_qat/_dev/g' input_file.sql

The s flag before the patterns tells the program that I want to do text substitution. The g flag at the end of the patterns mean I want to do the substition globally; not to replace only the first match of the pattern.

Another powerful command that I used is the find command. This command can search files not only by filename but also with other file metadata. In my case, I used find to search for files older than 120 days.

find /files-directory -mtime +120

The -mtime argument means modified time. This command searches files that is last modifed more than 120 days ago. Therefore using the negative symbol, such as -mtime -N, will match the file with less than N days of last modified time.

Since the find command lists all the absolute filename of matched files, these list of files can be used in another commands, such as tar using the $(...) syntax. When it is possible to process that matched files one-by-one, the -exec ... \; can be used. In my task, I used -exec rm {} \; to delete the matched file. Tho, in my research, there is a -delete flag that delete each files but it didn't work in my setup.

Unix sed Utility