Using xargs to do parallel processing
- April 29th, 2010
- Posted in Computers
Going through some log files the other day I came to a realization. Most modern machines are multi-processor machines and they are rarely used as such. I had a boat-load of archived log files I had to go through. It was taking forever to uncompress them one by one. At first I thought I would just make a loop, send all the gunzip processes to the background and wait for them to be done.
for i in *.gz; do gunzip "$i" & done; wait
The problem with this approach is that it starts one gunzip per file, all at once, so you can very quickly overwhelm the system if the number of compressed files is much larger than the number of cores on the machine.
My next idea was to add a counter to the loop and then count to the number of cores (4 in this case) and then wait for all processes to be done.
COUNT=0
for i in *.gz; do
  gunzip "$i" &
  COUNT=$((COUNT + 1))
  if [ "$COUNT" -eq 4 ]; then
    COUNT=0
    wait   # pause until the whole batch of 4 has finished
  fi
done
wait   # catch the final, partial batch
It does the trick. It only lets 4 processes run at the same time. It does have its problems though. All four of the gunzip processes have to finish before the next batch begins. This means that if there is one big file in a batch, the other cores will sit idle until it’s done. That’s no good, plus it just seems like a mess of code.
Yes, the next step would be to capture the PIDs for the background processes and hold them in an array and then keep track of who’s still running but now this has turned into a CS 101 project.
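A rough sketch of that bookkeeping, without a PID array, is below. It assumes bash 4.3 or newer, where `wait -n` blocks until any one background job exits. `run_pool` and `MAX_JOBS` are names made up for this illustration, and the counting is deliberately simple rather than robust.

```shell
#!/usr/bin/env bash
# Rolling pool: keep at most MAX_JOBS background jobs alive at once.
# Assumption: bash 4.3+ (for `wait -n`). run_pool and MAX_JOBS are illustrative names.
MAX_JOBS=4

run_pool() {
    local cmd=$1; shift
    local running=0 arg
    for arg in "$@"; do
        if [ "$running" -ge "$MAX_JOBS" ]; then
            wait -n                      # block until any one job exits
            running=$((running - 1))
        fi
        "$cmd" "$arg" &                  # start the next job in the background
        running=$((running + 1))
    done
    wait                                 # drain whatever is still running
}

# Hypothetical usage: run_pool gunzip *.gz
```

Unlike the batch loop, a slot is refilled as soon as any job finishes, so one big file no longer holds up the other cores. It is still more code than the job deserves.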
Can we do this with one line and 2 commands? I believe so, but first a little history. xargs is a command that was designed to overcome a limitation that UNIX system programs have. The limitation is that there is a fixed upper bound on the total size of the arguments (plus the environment) a program can be started with. This is set by the OS and there’s not much you can say or do about it. To get the value, in bytes, run `getconf ARG_MAX`.
- On my 32 bit Linux machine the limit = 131072
- On my 32 bit OSX 10.6.3 Macbook Pro the limit = 262144
- On my 64 bit FreeBSD machine the limit = 262144
These limits were much lower in the “good-old-days” so that’s when xargs came in handy. I really can’t think of an example nowadays that is not contrived to need that many arguments, so I will leave that as an exercise to the reader.
xargs takes the parameters you want to feed into a command and chops them up into pieces that fit within the limit set by the system. This means the command may be invoked several times, once per piece; by default those invocations run one after another. The user can control the size of the pieces used by xargs, and how many invocations run at once. That’s where this tool comes in handy.
xargs can take the -P flag to tell it how many processes to run at the same time. It can also take -n, which sets the maximum number of arguments from the input that go to each invocation.
So now we can do what we were trying to do with the previous scripts in one line. All that is needed is to tell xargs to run as many processes as there are cores in the system and to pass one argument (here, one file name) to each process.
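The chopping is easy to see with a harmless command. A minimal sketch, using `echo` as a stand-in for the real program and `-n` to force small pieces:

```shell
# Upper bound, in bytes, on arguments plus environment for one program start.
getconf ARG_MAX

# Seven arguments, at most three per invocation: echo is run three times.
printf '%s\n' 1 2 3 4 5 6 7 | xargs -n 3 echo
# 1 2 3
# 4 5 6
# 7
```

Each output line comes from a separate `echo` process, which is the same splitting xargs does on its own when the argument list gets too big for the system limit.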
ls *.gz | xargs -P 4 -n 1 gunzip
The beauty here is xargs takes care of the scheduling and everything, so the entire system is loaded, but not overloaded, while it uncompresses all the files. That’s very cool!
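One way to convince yourself the scheduling really happens is to swap gunzip for `sleep`. This is only a sketch: it assumes an xargs that supports -P (GNU and BSD both do), and `seq_s` and `par_s` are throwaway variable names.

```shell
# Four one-second sleeps, first one at a time, then up to four at once.
start=$(date +%s)
printf '%s\n' 1 1 1 1 | xargs -n 1 sleep
seq_s=$(( $(date +%s) - start ))

start=$(date +%s)
printf '%s\n' 1 1 1 1 | xargs -P 4 -n 1 sleep
par_s=$(( $(date +%s) - start ))

echo "sequential: ${seq_s}s, parallel: ${par_s}s"
```

The first run should take roughly four seconds and the second roughly one, give or take the one-second resolution of `date +%s`.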
After finding this out I keep finding uses for this command. If there is a sed transformation I need to do on a long list of files, or anything computationally expensive, I do it this way. Or if you need to grep through a large directory structure, it can be parallelized! Run up to 4 processes at a time, each one looking at 10 files.
find /home/CVSROOT -type f | xargs -P 4 -n 10 grep -H 'string-to-search'
The possibilities are really endless here.
So ideally you would calculate the number of cores for the -P argument.
A trivial implementation would be:
`cat /proc/cpuinfo | grep processor | wc -l`
Very cool. I’ll keep this in mind
That’s the way to get the number of procs in Linux.
To do the same in BSD you do:
`sysctl -a | grep hw.ncpu | cut -d":" -f2`
Interesting. Thanks!
It would be interesting to try this with the -exec option of find(1) and /proc just in the name of really abusive hacks
GNU Parallel http://www.gnu.org/software/parallel/ supports the detection of number of cpus, so you could do:
ls *.gz | parallel -j+0 gunzip
find /home/CVSROOT -type f | parallel -j+0 -n 10 grep -H 'string-to-search'
GNU Parallel makes sure the output from different processes is not mixed. To see the difference compare these:
find . -type f | xargs -P 40 -n 10 grep -H '.'
find . -type f | parallel -P 40 -n 10 grep -H '.'
If you have filenames including ', " or space, xargs can give you a nasty surprise. See http://en.wikipedia.org/wiki/Xargs#The_separator_problem
Watch the intro video for GNU Parallel: http://www.youtube.com/watch?v=OpaiGYxkSuQ
xargs comes with every UNIX/Linux install I’ve ever been on and it really isn’t that hard to find the number of procs/cores the hardware has. I’m lazy and having to go and install new software from the net just seems like hard work with very little benefit.
I do agree with you that there are always nasty surprises when it comes to strings with spaces, but that’s a problem even with `cp` and `mv` so I don’t think it’s that big a factor.
I like your plug though, I’ll allow it.
@labrat
…or simply:
sysctl -n hw.ncpu
@Ole Tange
Yes, but who would ever want to remember this complicated array of command-line options, even with the notion of linux being for the ‘power users’
Anshul, there’s a place for everything. I use `parallel` in certain areas where xargs just won’t cut it like running a command with ssh into a cluster of machines. All *nix CLI tools are hard at first, then it’s hard to remember a time when the flags to `ps` were cryptic.
Nice article, dude. But dude, peaces != pieces, and mutliple != multiple, among others.
Thanks, corrections have been made.
Ciao,
I’m wondering if there is a way to use xargs and give each process the power of 2 CPUs.
find /home/CVSROOT -type f | xargs -P 4 -XP 2 -n 10 grep -H 'string-to-search'
where -XP 2 would mean two processors for each task? Any idea?
ciao
@Adam
Instead of
`cat /proc/cpuinfo | grep processor | wc -l`
I would use
`grep -c processor /proc/cpuinfo`
Only one process instead of three
Giuseppe,
Sadly xargs doesn’t have that kind of power but I don’t know if it’s necessary. The -P argument tells xargs how many parallel processes to run at the same time. If you want to use more cores in your system you can just double the value you give -P.
I do see an instance where this is not what you would want. If the process that is getting called by xargs can be multi-threaded/multi-proc then you’d like to not oversubscribe the system. But in this case as well, it would just be a matter of decreasing the value you give -P.
Hi, I use this procedure to create tar of files
find {path_of_dir} | tail -n +2 | xargs -P 4 -n 1 tar -cvf new.tar
However, if I untar it, it creates the directory structure but just a single file. If I don’t use the -n option, it creates the directory structure with all the files in it, which is what I am looking for. Could you please tell me what the issue is with the -n 1 option? Without -n it invokes only a single process while creating the tar.
Thanks
Hello Sourav:
From the xargs man page:
” -n number
Set the maximum number of arguments taken from standard input for
each invocation of utility. An invocation of utility will use
less than number standard input arguments if the number of bytes
accumulated (see the -s option) exceeds the specified size or
there are fewer than number arguments remaining for the last
invocation of utility. The current default value for number is
5000.
”
The -n 1 makes xargs take the arguments from stdin and run the command on them one at a time. In your case that is each one of the files you are listing with the find. On top of that, each tar that xargs starts is going to create a file called “new.tar”, overwriting the previous file that was created. So you are correct, you should get a single file. I’m also going to guess the single file you have is the last file listed by the `find` command.
Not quite sure what you are trying to do here, but creating a tar file is not something that can easily be done in parallel.
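Putting `echo` in front of the command is a cheap way to see what xargs would actually run. A small sketch, with made-up file names standing in for the `find` output:

```shell
# With -n 1, tar is started once per file, each time targeting the same archive.
printf '%s\n' a.txt b.txt c.txt | xargs -n 1 echo tar -cvf new.tar
# tar -cvf new.tar a.txt
# tar -cvf new.tar b.txt
# tar -cvf new.tar c.txt

# Without -n, everything lands in a single tar invocation.
printf '%s\n' a.txt b.txt c.txt | xargs echo tar -cvf new.tar
# tar -cvf new.tar a.txt b.txt c.txt
```

Since `-c` creates the archive from scratch every time, only the contents of whichever invocation finishes last survive.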
Please see `nproc` from GNU coreutils, which determines the number of CPUs while taking account of details like process pinning and CPU offline states.
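The suggestions in this thread can be folded into one helper. This is a sketch only: `ncores` is a made-up name, and the `getconf _NPROCESSORS_ONLN` fallback is an addition that works on Linux and macOS but is not guaranteed everywhere.

```shell
# Print the number of available CPUs, trying the most specific tool first.
# ncores is an illustrative name, not a standard command.
ncores() {
    if command -v nproc >/dev/null 2>&1; then
        nproc                              # GNU coreutils
    elif getconf _NPROCESSORS_ONLN >/dev/null 2>&1; then
        getconf _NPROCESSORS_ONLN          # assumption: supported by this libc
    elif sysctl -n hw.ncpu >/dev/null 2>&1; then
        sysctl -n hw.ncpu                  # BSD and OS X
    else
        echo 1                             # safe fallback: no parallelism
    fi
}

# Hypothetical usage: ls *.gz | xargs -P "$(ncores)" -n 1 gunzip
```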
Use find -print0 | xargs -0 to support filenames with spaces.
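A throwaway demonstration of the difference, run in a temporary directory so nothing real is touched. The file names and the `split_count` and `nul_count` variables are invented for the example.

```shell
dir=$(mktemp -d)
: > "$dir/my file.txt"
: > "$dir/plain.txt"

# Default splitting on whitespace: "my file.txt" is handed to ls as two
# separate, nonexistent names, so ls normally lists only plain.txt.
split_count=$(find "$dir" -type f | xargs -n 1 ls 2>/dev/null | wc -l | tr -d ' ')

# NUL-delimited: each file name arrives intact, so both files are listed.
nul_count=$(find "$dir" -type f -print0 | xargs -0 -n 1 ls | wc -l | tr -d ' ')

echo "whitespace split: $split_count listed, NUL split: $nul_count listed"
rm -rf "$dir"
```

The same pair of flags drops straight into the parallel form, for example `find . -name '*.gz' -print0 | xargs -0 -P 4 -n 1 gunzip`.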
GNU Make can also do parallel processing, at the expense of having to write Makefiles.
Thanks for a great blog post!
If you have a LOT of smaller files to process, you’d want to increase the ‘-n’ parameter to xargs (if your command supports multiple files). This will reduce the overhead of spawning a new process very often. For a smaller number of files, this will be at the expense of concurrency.
Concrete example:
ls *.gz | xargs -P 4 -n 1 gunzip
can be rewritten to
ls *.gz | xargs -P 4 -n 1000 gunzip
since gunzip can be given multiple files to decompress for every invocation. This will be much faster if you have a lot of smaller files to decompress.