Lam-MPI

While I was in the ADA lab, we had all sorts of problems running lam-mpi across our nodes. Still, it was worth it, and working with it was quite fun at times.

MPI Cloud

You can find good tutorials on their website, but even after reading them you may run into unexpected problems. How do you solve these problems? Search Google for the exact error message to find the solution.

As you know if you have used lam-mpi, you have to “wipe” the MPI cloud clean whenever you want to stop or restart it. It uses RSH, which is sometimes not easy to deal with. I wrote some scripts to ease the start-up and cleaning process.

The following scripts first ping the broadcast address to get the currently available nodes, then update the lamhosts file, and finally use the updated lamhosts file to wipe the network clean and start lam-mpi again. You can optionally instruct them not to update the lamhosts file.

This is the probe script:

#!/bin/bash
#clear
# Ping the broadcast address and print each replying IP address once.
RES=`ping -b -c 10 192.168.0.255 | awk '{print $4}' | grep 192 | uniq | awk -F: '{print $1}' | sort -n | uniq`
for i in $RES; do
    echo $i;
done
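The parsing half of the probe pipeline can be tried without touching the network. The sketch below wraps it in a function, parse_ping_replies (a name made up for this example), that reads ping output on standard input. It assumes the Linux reply format "64 bytes from ADDRESS: ...", and it anchors the grep at the start of the field so that summary lines are less likely to be picked up by accident.

```shell
#!/bin/bash
# Hypothetical helper: extract the unique replying addresses from ping output.
# Assumes reply lines look like "64 bytes from 192.168.0.101: icmp_seq=1 ...".
parse_ping_replies() {
    awk '{print $4}' |          # 4th field is "ADDRESS:" on reply lines
        grep '^192\.' |         # keep only fields starting with our prefix
        awk -F: '{print $1}' |  # strip the trailing colon
        sort -u |               # drop duplicate addresses
        sort -t. -k4,4n         # order by the last octet
}

# Usage (needs the network):
#   ping -b -c 10 192.168.0.255 | parse_ping_replies
```

Feeding it a saved ping transcript is a quick way to check the parsing before running it against the real broadcast address.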

This one is the lam-run script, which uses the previous one:

#!/bin/bash
OPTION="-c";
LAMHOSTS="/home/user01/lamhosts"
export LAMRSH="rsh"

if [ -z "$1" ]; then
    echo "usage: $0 {-c:update config and restart | -d: restart only}"
    exit
fi;

cat > $LAMHOSTS <<EoF
EoF

if [ "$1" = "$OPTION" ]; then
    echo -e "updating lamhosts:\n";
    for i in `/bin/bash /usr/local/bin/ada/probe`; do
        echo -e "$i\t user01" >> $LAMHOSTS;
    done;
fi;

echo -e "\nStopping previous run, if any\n"
/usr/bin/lamhalt
/usr/bin/wipe $LAMHOSTS
echo "DONE"
echo -e "\nrunning lamboot"
lamboot -d $LAMHOSTS
echo "DONE"
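One possible refinement of the lamhosts update step, sketched under my own assumptions: build the new file in a temporary location and only move it into place if the probe returned at least one host, so a failed probe does not leave an empty lamhosts behind. The function name write_lamhosts and its arguments are invented for this example; the line format simply mirrors what the script above writes.

```shell
#!/bin/bash
# Hypothetical helper: build a lamhosts file from addresses read on stdin.
# Usage: /usr/local/bin/ada/probe | write_lamhosts /home/user01/lamhosts user01
write_lamhosts() {
    local target="$1" user="$2" tmp host
    tmp="$(mktemp "${target}.XXXXXX")" || return 1
    while read -r host; do
        if [ -n "$host" ]; then
            # Same "ADDRESS<TAB> USER" layout as the lam-run script.
            printf '%s\t %s\n' "$host" "$user" >> "$tmp"
        fi
    done
    # Refuse to replace an existing file with an empty one.
    if [ ! -s "$tmp" ]; then
        rm -f "$tmp"
        return 1
    fi
    mv "$tmp" "$target"
}
```

The function returns a non-zero status when no hosts were found, so a caller could skip the wipe and lamboot steps in that case.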

Also, I had a copier script to copy the necessary files to all clients using rcp:

#!/bin/bash
if [ -z "$1" ] || [ -z "$2" ]; then
    echo -e "This script lets you copy a file to every client\n";
    echo -e "usage: $0 PATH-TO-LOCAL-FILE PATH-ON-EACH-CLIENT\n"
    echo -e "example: $0 /tmp/file /home/user01/remote"
    exit;
fi;

for i in `/usr/local/bin/ada/probe`; do
    if [ "$i" = "192.168.0.100" ]; then
        continue;
    fi;
    echo "copying file to $i"
    /usr/bin/rcp "$1" "$i:$2";
done;
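Before copying to a whole lab of machines, it can help to see which commands would run. The following is a minimal dry-run sketch, not part of the original scripts: plan_copies (a hypothetical name) reads host addresses on standard input, skips one address given as its third argument, and prints the rcp command for each remaining host instead of executing it.

```shell
#!/bin/bash
# Hypothetical helper: print the rcp command for every host read on stdin,
# skipping one address (the copier script skips 192.168.0.100).
plan_copies() {
    local src="$1" dest="$2" skip="$3" host
    while read -r host; do
        [ -z "$host" ] && continue
        [ "$host" = "$skip" ] && continue
        printf '/usr/bin/rcp %s %s:%s\n' "$src" "$host" "$dest"
    done
}

# Usage (needs the network):
#   /usr/local/bin/ada/probe | plan_copies /tmp/file /home/user01/remote 192.168.0.100
```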

These are not pretty scripts, I know, but they were quite handy and helped us a lot. I kept them in /usr/local/bin/ada/, and our clients had IP addresses in the 192.168.0.100/24 range. The scripts are written around these facts, but they are easy to adapt to your local network if you are interested.

We had some other problems that I will add here whenever I remember them, but for the moment:

  • poll: protocol failure in circuit setup

    If you see

    poll: protocol failure in circuit setup

    when you try to run a simple rsh command like

    rsh 192.168.0.101 -n -d -l user01 echo $SHELL

    but you can log in to the remote client using

    rsh 192.168.0.101

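    try turning off your firewall on both sides first, and then try again. If that fixes it, the firewall is the cause: unlike a plain login, running a command over rsh typically makes the remote side open a second connection back to your machine for stderr, which a firewall can block. In that case, try something like this in your firewall rules: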

    -A INPUT -i eth0 -j ACCEPT

    where eth0 is connected to your internal network of lam nodes.