These past few weeks, I have been tasked with doing some optimization and load testing on a transaction table in the system I am working on. I learned new things along the way, and it was a challenging and satisfying experience overall. In this article, we will focus on the large SQL file that I received to start the load testing.
These are the new Linux commands that I learned while working on that task: `sed` and `find`.
Personally, this article works more as a notebook for me, but I think it would be nice to share my learning journey publicly.
Unix sed Utility
The `sed`, or stream editor, in Unix is a CLI utility used to stream and manipulate text files.
I used this command to process a database dump of tables, data, and everything from the QAT server into an SQL file. This SQL file would be restored to my local database instance to enable testing locally with the same load as the QAT server, without touching the data on the QAT server. The problem with the dumped SQL file is that it is very large – around 4 GB! Sure, Postgres can handle all of that, but the problem lies with the generated SQL code: the schema and owner names are hard-coded in the file, so it doesn't work in my local setup, and I have to edit it. Because of the file size, the Find and Replace feature in VS Code crashes the editor, and IntelliJ IDEA opens the file in read-only mode.
With those problems, the solution that I thought of was to manipulate the file content without opening the entire file. The first thing that came to my mind was the `sed` command.
Again, the problems that need to be solved are:
- Remove the ownership setting code from the file for each object.
- Change the schema name to use the schema in my local machine.
Problem #1 - Delete all set owner commands
This is done by deleting all the lines that contain an `ALTER <something> OWNER TO owner_name;` statement in the file.
The `sed` command to do it is:
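```sh
sed -i EXTENSION '/PATTERN/d' FILENAME
```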
Command breakdown:
- `-i` – Edit the input file in place.
- `EXTENSION` – The filename extension of the backup file. If there is no need for a backup, an empty string can be provided as a value.
- `PATTERN` – The string pattern to search for in the file.
- `d` – Flags that the matching line should be deleted.
- `FILENAME` – The input file.
In my case, the command I ran looks somewhat like this:

```sh
sed -i .bak '/OWNER TO qat_db_owner/d' core-database_march-2020.sql
```
In GNU `sed`, the (almost) similar delete line command is:
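```sh
sed -i.bak '/PATTERN/d' FILENAME
```

The difference is that GNU `sed` expects the backup extension to be attached directly to the `-i` flag, and it can be omitted entirely if no backup is needed.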
Problem #2 - Change the schema name
The schema used in the QAT server is different from the schema name used on my machine. This is normally solved using the Find and Replace functionality in text editors. The similar `sed` command for this is:
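```sh
sed -i EXTENSION 's/OLD_TEXT/NEW_TEXT/g' FILENAME
```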
Command breakdown:
- `-i` – Edit the input file in place.
- `EXTENSION` – The backup file extension.
- `s` – Substitute text flag.
- `OLD_TEXT` – The pattern to search for and be replaced.
- `NEW_TEXT` – The text that replaces the text matched by `OLD_TEXT`.
- `g` – Global replace flag. This flag replaces all occurrences of `OLD_TEXT`, not just the first one.
- `FILENAME` – The input file.
In my case, the command I ran looks something like this:
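```sh
sed -i .bak 's/project_qat/project_dev/g' core-database_march-2020.sql
```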
The similar replace text command in GNU `sed` is:
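```sh
sed -i.bak 's/OLD_TEXT/NEW_TEXT/g' FILENAME
```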
Unix find Command
The Unix `find` command is a command to, you know, find files in a directory. This command prints the absolute filename of each file matching the arguments passed to the command. What makes `find` better than running `grep` over the output of `ls` is that the `find` command not only searches by matching filename but offers more flexibility when searching for files.
I used the `find` command since I needed to archive and delete files older than a given date. This is done using `find`'s `-mtime` (modified time) argument. Another feature that I needed from `find` is the `-exec` argument, which runs a command for each file found.
Requirement #1 - Archive all files at least 3 months old
I needed to delete a specific set of files, but before that, the files had to be backed up first. To do this, I fed the result of a `find` command to a `tar` command. The command I ran looks like this:
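```sh
tar czvf /home/produser/backup_mar-2020.tar.gz $(find /files-directory -mtime +120)
```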
There are 2 commands run in this small script.

`find` breakdown:
- `/files-directory` – The top-level directory to search from.
- `-mtime +120` – Matches all files older than 120 days.

`tar` breakdown:
- `c` – Creates a tar file.
- `z` – Compresses the tar file with gzip.
- `v` – Prints a verbose report to stdout.
- `f /home/produser/backup_mar-2020.tar.gz` – The output filename.
- `$(...)` – Evaluates the command inside and uses its result as program arguments. In the case of the `tar` command, this will be the list of files to be tarballed.
After running this command, a tar file will be created in the `/home/produser` directory. This tarball can now be moved to proper storage, and the deletion can proceed.
Requirement #2 - Delete all files at least 3 months old
Now that the files are backed up, they can be deleted safely. To ensure the same results as the previous `find` command, the same arguments will be passed, with an additional delete command. The command I ran looks like this:

```sh
find /files-directory -mtime +120 -exec rm {} \;
```
`find` command breakdown:
- `/files-directory -mtime +120` – Same as above; matches all files older than 120 days under the `/files-directory` top-level directory.
- `-exec` – Execute a command for each file found.
- `rm {}` – The command to execute. Here the basic `rm` command is used to delete a file, and `{}` will be replaced with the filename of the matching file.
- `\;` – The `;` symbol is used to terminate the statement, in this case `rm {}`. The `;` symbol is escaped to prevent misinterpretation by the shell that runs the `find` command.
To be clear, the reason I used the `$(...)` syntax in the archive operation and `-exec ... \;` in the second operation comes down to how `tar` and `rm` accept files as input.
In the case of the `rm` command, I used the `-exec` argument to delete files because the syntax is easier – there is no need to nest shell commands using the `$(...)` syntax. Additionally, running the `rm` command on each matched file is no different from deleting all the matched files at once – all the files I want gone will be gone.
In the case of the `tar` command, the list of files must be sent to the command all at once. This is because we are supposed to group the files into a single tar file.
While researching a solution to my problem, I found another way to do this by using the `-delete` flag. The command is less complex than using the `-exec` argument, but it doesn't seem to work on the server, and I am not sure why. It might be the OS or the bash version? I didn't really dig into it.
Conclusion
Unix offers powerful command-line utilities for working with large files without using much of your computer's resources: `sed` and `find`. Since these commands come from Unix, they are also available on derived systems such as Linux and macOS.
The `sed` utility is a powerful command-line application that enables streaming and editing text files without using a large amount of memory, even when working with large files.
I used `sed` to delete lines in a file that contain a specific pattern. The command I used to delete the set owner SQL statements on macOS is:
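```sh
sed -i .bak '/OWNER TO qat_db_owner/d' core-database_march-2020.sql
```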
I also used `sed` to change the schema from `project_qat` to `project_dev` by using this command:
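```sh
sed -i .bak 's/project_qat/project_dev/g' core-database_march-2020.sql
```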
Another powerful command that I used is the `find` command. This command can search for files not only by filename but also by other file metadata. In my case, I used `find` to search for files older than 120 days.
Since the `find` command lists the absolute filenames of all matched files, this list of files can be used in other commands, such as `tar`, using the `$(...)` syntax. When it is possible to process the matched files one by one, `-exec ... \;` can be used. In my task, I used `-exec rm {} \;` to delete the matched files. Though, in my research, I found a `-delete` flag that deletes each matched file, but it didn't work in my setup.