Embarrassingly GNU parallel
GNU parallel is a great tool for solving embarrassingly parallel problems. Despite the name of the problem class, parallelization is not embarrassing at all and can be hard to implement. In this post I would like to show some simple examples of parallelizing ETL and ML tasks with GNU parallel.
Example requirements
To run the examples in this post you will need to install:
- GNU parallel itself
- curl and jq for the ETL part
- ImageMagick for image preprocessing
- Python 3 (with tensorflow, keras, and Pillow for the ML part)
Basics
In its essence, parallel takes a list of arguments and runs a command for each of them in parallel. Here is a simple example with the echo command:
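```bash
# run echo once per argument; ::: supplies the argument list
parallel echo ::: {1..10}
```

The output is simply the numbers, one per line (the order may vary between runs, since each job's output is printed as soon as it finishes):

```
1
2
3
4
5
6
7
8
9
10
```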
That's it! It also has a lot of options and features, but the basic usage is very simple.
In a case like the one above you could just use xargs, but parallel has some nice features like combining arguments, a progress bar, retries, and more.
In the next sections I will show some examples of using parallel for more practical tasks, along with some of the options that are useful for them.
ETL pipeline
Let's say we want to load a bunch of currency exchange rates from the web, from 2024-04-01 until 2024-09-30, and store them in a single CSV file for each of the currencies. Each date requires a separate HTTP request to the very strange exchange-api and may take some time to complete, so we can parallelize this task over currencies and dates.
Single date downloading
First let's see how we can download exchange rates for a single date and a single currency (EUR). Something like this should work, assuming the jsDelivr-hosted endpoint of exchange-api:
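```bash
# the date goes into the version part of the URL
# (an assumption about this particular API)
curl -s "https://cdn.jsdelivr.net/npm/@fawazahmed0/currency-api@2024-04-01/v1/currencies/eur.json"
```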
The response is JSON, so let's pipe it to jq:
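```bash
# assuming the response looks like {"date": ..., "eur": {<currency>: <rate>, ...}}
curl -s "https://cdn.jsdelivr.net/npm/@fawazahmed0/currency-api@2024-04-01/v1/currencies/eur.json" \
  | jq '.date, .eur.usd'
```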
jq will simply extract the values from the JSON response at the provided path, and we will format them as a CSV string:
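```bash
curl -s "https://cdn.jsdelivr.net/npm/@fawazahmed0/currency-api@2024-04-01/v1/currencies/eur.json" \
  | jq -r '[.date, .eur.usd] | @csv'
```

The -r flag makes jq output raw strings, and @csv joins the array into a quoted CSV row.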
That's quite a big command, so let's put it inside a simple bash script get_cur.sh, with currency and date as arguments:
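```bash
#!/bin/bash
# get_cur.sh — print a CSV line with the date and the $1 -> USD rate for date $2
# (tracking each currency against USD is an assumption of this sketch)
curl -s "https://cdn.jsdelivr.net/npm/@fawazahmed0/currency-api@$2/v1/currencies/$1.json" \
  | jq -r --arg cur "$1" '[.date, .[$cur].usd] | @csv'
```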
Don't forget to make the script executable and test it:
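```bash
chmod +x get_cur.sh
./get_cur.sh eur 2024-04-01
```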
This will allow us to simplify the parallel
command later.
Generate a list of dates
Let's generate the list of dates we want to download, with a bizarre Python one-liner along these lines:
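```bash
# 183 days: 2024-04-01 up to and including 2024-09-30
python3 -c "from datetime import date, timedelta; print('\n'.join(str(date(2024, 4, 1) + timedelta(days=i)) for i in range(183)))" > dates.txt
```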
Check the content of the dates.txt
file:
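```bash
head dates.txt
```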
Check the future parallel pipeline
First we want to see how the parallelization works on a simple level again (let's also limit the number of parallel jobs to 4 with the -j (--jobs) option). We use the --dry-run option to see the commands that would be run, and ::: to provide the input as a list of arguments. This command will just show us the first 10 dates from the dates.txt file:
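```bash
parallel -j 4 --dry-run echo ::: $(head dates.txt)
```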
Let's check the output:
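```
echo 2024-04-02
echo 2024-04-01
echo 2024-04-04
echo 2024-04-03
echo 2024-04-06
echo 2024-04-05
echo 2024-04-08
echo 2024-04-07
echo 2024-04-10
echo 2024-04-09
```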
Notice that the lines are not necessarily in the input order: parallel prints each job's output as soon as the job finishes. We can use the -k (--keep-order) option to fix this:
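```bash
parallel -j 4 -k --dry-run echo ::: $(head dates.txt)
```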
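```
echo 2024-04-01
echo 2024-04-02
echo 2024-04-03
echo 2024-04-04
echo 2024-04-05
echo 2024-04-06
echo 2024-04-07
echo 2024-04-08
echo 2024-04-09
echo 2024-04-10
```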
Awesome! Let's increase the job count to 8 and combine currency and date arguments (I'll use eur, usd, gbp, and jpy as an example set of currencies); this will do it for the first 4 lines of the dates.txt file:
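```bash
# two input sources produce all currency/date combinations
parallel -j 8 -k --dry-run echo ::: eur usd gbp jpy ::: $(head -n 4 dates.txt)
```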
The output should be:
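```
echo eur 2024-04-01
echo eur 2024-04-02
echo eur 2024-04-03
echo eur 2024-04-04
echo usd 2024-04-01
echo usd 2024-04-02
echo usd 2024-04-03
echo usd 2024-04-04
echo gbp 2024-04-01
echo gbp 2024-04-02
echo gbp 2024-04-03
echo gbp 2024-04-04
echo jpy 2024-04-01
echo jpy 2024-04-02
echo jpy 2024-04-03
echo jpy 2024-04-04
```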
It works! Now let's try to run it with our get_cur.sh script. The {1} and {2} placeholders stand for the first and second input arguments, i.e. the currency and the date (like eur and 2024-04-01):
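```bash
parallel -j 8 -k ./get_cur.sh {1} {2} ::: eur usd gbp jpy ::: $(head -n 4 dates.txt)
```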
If you can see exchange rates for the first 4 days of April 2024, then you've done everything correctly!
Full pipeline
Now we can run the full pipeline with all dates (the --bar option will show a progress bar). The :::: is used to provide input from a file, and here we use the dates.txt file. The {1} is a placeholder for the first input argument, which we also use to name the per-currency CSV files:
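```bash
# one CSV per currency; the {1}.csv naming is just a convention chosen here,
# and row order inside each file follows job completion order
parallel -j 8 -k --bar \
  "./get_cur.sh {1} {2} >> {1}.csv" ::: eur usd gbp jpy :::: dates.txt
```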
Check the output:
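```bash
wc -l *.csv  # expect 183 lines (one per date) in each per-currency CSV
```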
Ta-da! We have all CSVs with exchange rates and have loaded them in parallel.
Serial variant comparison
I encourage you to compare our parallel pipeline with a naive serial one by creating get_cur_serial.sh:
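```bash
#!/bin/bash
# get_cur_serial.sh — the same work, one request at a time
# (same illustrative currency set as above)
for cur in eur usd gbp jpy; do
  while read -r day; do
    ./get_cur.sh "$cur" "$day" >> "${cur}_serial.csv"
  done < dates.txt
done
```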
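Make it executable and time it:

```bash
chmod +x get_cur_serial.sh
time ./get_cur_serial.sh
```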
ML batch inference pipeline
Let's say, for the sake of example, we have a simple ML inference pipeline that consists of 3 steps:
- Download the images.
- Resize images to 256x256 pixels.
- Run inference with a pre-trained MobileNetV2 model.
This pipeline is very much a toy example, but it is simple and demonstrates the parallelization capabilities and external image preprocessing well.
Images downloading
We will use the Stanford Dogs dataset:
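```bash
# the Stanford-hosted mirror of the dataset images (verify the URL is still live)
curl -O http://vision.stanford.edu/aditya86/ImageNetDogs/images.tar
```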
Unpacking images.tar will create an Images directory with the images in it:
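```bash
tar -xf images.tar
```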
Images resizing
This dataset contains images of different sizes, so we need to resize them to 256x256, since the MobileNetV2 model requires a fixed input size. We also want to strip metadata from the images and put them in a separate prepro_images directory. We can do this with ImageMagick and our friend parallel:
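```bash
mkdir -p prepro_images
find Images -name '*.jpg' \
  | parallel --bar \
    'convert {} -resize 256x256! -strip prepro_images/{/}'
```

The ! in 256x256! tells ImageMagick to force the exact size regardless of aspect ratio, -strip drops the metadata, and {/} is parallel's "basename of the input" replacement string.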
This may take some time to finish, but after that you should have all images resized.
Inference script
Install tensorflow, keras, and Pillow:
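```bash
pip install tensorflow keras Pillow
```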
Now we need to write a script that will run inference on the images. Calling a separate script for each image is not efficient, so we will use parallel again, and we will write the script so that it accepts multiple image paths as arguments:
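A minimal sketch of such a script (I'll call it infer.py; the center crop from 256x256 down to MobileNetV2's default 224x224 input is an implementation choice of this sketch):

```python
#!/usr/bin/env python3
# infer.py — classify a batch of images with MobileNetV2 and print CSV lines:
# image path, class name, probability
import sys

import numpy as np
from PIL import Image
from tensorflow.keras.applications.mobilenet_v2 import (
    MobileNetV2,
    decode_predictions,
    preprocess_input,
)

IMG_SIZE = 224  # MobileNetV2's default input size


def load_image(path):
    # images are already 256x256, so center-crop them down to 224x224
    img = Image.open(path).convert("RGB")
    left = (img.width - IMG_SIZE) // 2
    top = (img.height - IMG_SIZE) // 2
    img = img.crop((left, top, left + IMG_SIZE, top + IMG_SIZE))
    return np.asarray(img, dtype=np.float32)


def main():
    paths = sys.argv[1:]
    if not paths:
        sys.exit("usage: infer.py IMAGE [IMAGE ...]")
    model = MobileNetV2(weights="imagenet")
    # one forward pass for the whole batch of arguments
    batch = preprocess_input(np.stack([load_image(p) for p in paths]))
    preds = model.predict(batch, verbose=0)
    for path, decoded in zip(paths, decode_predictions(preds, top=1)):
        _, class_name, prob = decoded[0]
        print(f"{path},{class_name},{prob:.4f}")


if __name__ == "__main__":
    main()
```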
Let's check if the script works on some files:
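```bash
python3 infer.py $(find prepro_images -name '*.jpg' | head -n 2)
```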
After a bit of time you should see the output with image names, class names, and probabilities:
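```
prepro_images/<image_1>.jpg,<class_name>,<probability>
prepro_images/<image_2>.jpg,<class_name>,<probability>
```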
Yay! Now we can start parallelizing the inference pipeline.
Prepare the pipeline
Let's generate the list of image paths:
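```bash
find prepro_images -name '*.jpg' > images.txt
```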
And check with the --dry-run option how the pipeline will work (the -N 2 option tells parallel to pass arguments to the script in chunks of 2):
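```bash
# limit the dry run to the first 10 paths; -N 2 groups them in pairs
head images.txt | parallel -N 2 --dry-run python3 infer.py
```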
The output should be something like:
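```
python3 infer.py prepro_images/<image_1>.jpg prepro_images/<image_2>.jpg
python3 infer.py prepro_images/<image_3>.jpg prepro_images/<image_4>.jpg
python3 infer.py prepro_images/<image_5>.jpg prepro_images/<image_6>.jpg
python3 infer.py prepro_images/<image_7>.jpg prepro_images/<image_8>.jpg
python3 infer.py prepro_images/<image_9>.jpg prepro_images/<image_10>.jpg
```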
Run the pipeline
Now we can run the full pipeline with all images and save the results to a CSV file (--bar for the progress bar and a chunk size of 500 images):
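```bash
# predictions.csv is just the output name chosen here
parallel -N 500 --bar \
  python3 infer.py \
  :::: images.txt \
  > predictions.csv
```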
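Check the resulting file:

```bash
head predictions.csv
```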
And that's it! We have run the ML pipeline in parallel and saved the results to a CSV file.
Conclusion
In this post I wanted to show you that CLI parallelization can go beyond simple xargs cases, and that with parallel we can build quite complex parallel pipelines without writing multiprocessing or threading code.
Options like --retries, --halt, --timeout, and --delay can be useful for more advanced tasks. And parallel even allows you to resume failed tasks with the --resume-failed option.
Yes, it is quite old school and quirky, and you should not use it for very complex pipelines or performance-critical tasks. However, for many everyday tasks that benefit from parallel execution, it offers a practical and efficient solution without the overhead of writing additional code.