Havat tzabbim (turtle farm): the Martin group PBS cluster
In its present configuration, our turtle farm consists of twelve Intel IA32 boxes, one quad-CPU AMD Opteron 846, and two legacy Alpha machines. One of the Linux boxes (harriet) acts as a file server and as a bridge between the Weizmann LAN and a private network (192.168.1.x). The Alphas (and, for now, the quad Opteron) sit on the Weizmann LAN; the remaining Linux machines sit on the private network. Connectivity between the latter and harriet is provided by a 24-port 100baseT Hewlett-Packard Procurve network switch with a Gbit-copper transceiver module, which is connected to the "private" network card of harriet (itself obviously Gbit-copper).
All our Linux machines originally ran RedHat 7.2, and later RedHat 8.0 with kernel 2.4.20-18. (The standard kernel shipping with RedHat 8.0, i.e. 2.4.18-14, is a piece of bugware.) Recently, all Linux machines were brought to SuSE Linux Enterprise Server 9 (which at the time was all we had that ran 64-bit native on an AMD Opteron).
Overview of hardware
Aside from a 24-port (100baseT) Hewlett-Packard network switch with an added Gigabit-copper transceiver, our cluster at present consists of the following hardware:
harriet
Our server machine, custom-built by Access Technologies: dual Pentium Xeon 2 GHz/512 KB L2 cache, 1 GB RAM, with four 36 GB disks configured as a RAID-5 array holding the group's data and three 36 GB disks in a RAID-0 (striped) array acting as scratch space for I/O intensive jobs. Harriet also acts as our mail server (POP and IMAP) and runs the PBS server and scheduler. It has two network cards: standard 100baseT connected to the Institute network, and copper gigabit connected to a gigabit transceiver in a Hewlett-Packard 24-port 100baseT switch. tzav1 (Turtle One) is the intranet name of harriet. (Our server is named after the world's oldest known living resident.)
tzav2 to tzav5
Single-CPU Pentium IV (2 GHz, 512 KB L2 cache) machines with 512 MB memory and 40 GB IDE hard disks.
tzav6
Dual-CPU Pentium Xeon (2.4 GHz, 512 KB L2 cache; formerly 1.7 GHz, 256 KB cache) with two 18 GB SCSI hard disks and 1 GB memory (formerly 512 MB).
tzav7
Dual-CPU Pentium Xeon (2.0 GHz, 512 KB L2 cache) with a RAID-0 scratch array of four 36 GB SCSI hard disks and 1 GB memory.
tzav8
Ditto (same as tzav7).
tzav9
Dual-CPU Pentium Xeon (2.8 GHz, 512 KB L2 cache, 533 MHz FSB), with 3 GB memory and an Ultra320-SCSI RAID-0 array of 5x36 GB.
tzav10
Dual-CPU Pentium Xeon (3.06 GHz, 512 KB L2 cache, 1 MB L3 cache, 533 MHz FSB), with 3 GB memory and an Ultra320-SCSI RAID-0 array of 4x72 GB.
tzav11
Dual-CPU Pentium Xeon (3.06 GHz, 512 KB L2 cache, 1 MB L3 cache, 533 MHz FSB), with 3 GB memory and an Ultra320-SCSI RAID-0 array of 4x72 GB.
tzav12
Ditto (same as tzav11).
wigner
4-CPU AMD Opteron 846 server, with 8 GB memory and two Ultra320-SCSI RAID-0 arrays of 4x72 GB each, aggregated in software. Sustained streaming I/O bandwidth is above 300 MB/s, as measured by iozone for files larger than twice physical memory. Custom-built for us by Access Technologies. [DISCLAIMER: we have no relationship with this company other than as a satisfied customer.]
feynman
Legacy DEC/Compaq ES40 (4 x EV67, 667 MHz, 4 GB RAM, 6x18 GB striped scratch disk array). Formerly the main compute and file server of the group, and still recommended for highly memory-intensive jobs. Warning: the RAID-0 (striped) file system is done in software (via LSM). Experience has taught us that a single very I/O intensive job will be handled beautifully, but that performance drops like a stone for concurrent I/O intensive jobs. Keep this in mind before you submit three I/O intensive jobs at once to feynman: you will just be shooting yourself in the foot!
winston
DEC/Compaq XP1000 workstation (EV6, 500 MHz, 1.1 GB RAM). Limited scratch disk space: NOT recommended for I/O intensive jobs. It also used to do double duty as the intranet web server (now migrated to harriet).
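For orientation, the nominal capacity of the disk arrays listed above can be worked out from the disk counts and sizes. The sketch below is only arithmetic on the figures quoted in this overview: it assumes the standard rules that RAID-0 uses all disks for data and RAID-5 gives up one disk's worth of space to parity, and it ignores file system overhead, so the space actually available will be somewhat smaller. The function and table names are illustrative, not part of any cluster tooling.

```python
# Nominal capacity (in GB) of the arrays described in the hardware overview.
# Assumption: vendor-quoted disk sizes, no file system overhead.

def raid0_capacity(n_disks: int, disk_gb: int) -> int:
    """RAID-0 stripes over all disks with no redundancy."""
    return n_disks * disk_gb


def raid5_capacity(n_disks: int, disk_gb: int) -> int:
    """RAID-5 spends the equivalent of one disk on parity."""
    if n_disks < 3:
        raise ValueError("RAID-5 needs at least three disks")
    return (n_disks - 1) * disk_gb


# (host, purpose) -> nominal GB, using the disk counts listed above.
ARRAYS = {
    ("harriet", "data, RAID-5"): raid5_capacity(4, 36),
    ("harriet", "scratch, RAID-0"): raid0_capacity(3, 36),
    ("tzav7", "scratch, RAID-0"): raid0_capacity(4, 36),
    ("tzav9", "scratch, RAID-0"): raid0_capacity(5, 36),
    ("tzav10", "scratch, RAID-0"): raid0_capacity(4, 72),
    # wigner: two 4x72 GB arrays aggregated in software
    ("wigner", "scratch, RAID-0"): 2 * raid0_capacity(4, 72),
    ("feynman", "scratch, RAID-0"): raid0_capacity(6, 18),
}

if __name__ == "__main__":
    for (host, purpose), gb in ARRAYS.items():
        print(f"{host:8s} {purpose:16s} {gb:4d} GB")
```

Note that harriet's RAID-5 data array and its RAID-0 scratch array come out at the same nominal size, because the fourth disk of the RAID-5 set is consumed by parity.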
Installed software
- Resource management and batch scheduling were originally carried out by means of OpenPBS 2.3.16, with the Sandia CPLANT fault recovery patch installed as well as all of Ben Webb's patches. The default FIFO scheduler was used. Presently we are running PBSPro, the "commercial" version, which is considerably more robust.
- The usual GCC (GNU compiler collection) is available; in addition, Portland Group High Performance Fortran and the Intel compilers have been installed on harriet.
- Gaussian 98 rev. A11. File sizes are limited to 2 GB on the tzabbim: larger files can be achieved only by salami-slicing the RWF file as follows:
  %RWF=file1,1.9GB,file2,1.9GB,file3,1.9GB,...
- Gaussian 03 rev. C1 with local modifications [Intranet documentation here]; not recommended except for very large problem sizes. File sizes are limited to 16 GB on the tzabbim; on wigner and feynman there is no effective limit. (Make sure you run the 64-bit version on wigner!)
- MOLPRO 2002.6 with the latest patch set, compiled and linked using the Intel compilers and Intel MKL (Math Kernel Library). Both serial and SMP (shared memory parallel, or symmetric multiprocessing) versions are available; all multi-CPU machines support parallel running. Large files are supported on the tzabbim.
- CADPAC 6.5. The version on the tzabbim was linked with ATLAS and the PGF large files library and does not suffer from a 2 GB limit. Serial running only.
- Various less-common quantum chemical codes
- LyX, a WYSIWYM (What You See Is What You Mean) word processor that basically acts as a front-end to the LaTeX typesetting system.
- Too many Linux utilities to list
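As an illustration of the RWF salami-slicing mentioned under Gaussian 98 above, the following sketch builds a %RWF line in the format shown there for a given total scratch requirement. It is a convenience sketch only: the function name `rwf_line`, the `stem` parameter, and the choice to give every slice the same 1.9 GB size are assumptions made for the example, and the file names should be replaced by real paths on the scratch disk.

```python
import math


def rwf_line(total_gb: float, chunk_gb: float = 1.9, stem: str = "file") -> str:
    """Return a %RWF line splitting total_gb into slices of chunk_gb each.

    Every slice stays below the 2 GB file size limit; the last slice is
    simply given the same size as the others (an assumption of this sketch).
    """
    if total_gb <= 0:
        raise ValueError("total_gb must be positive")
    if not 0 < chunk_gb < 2.0:
        raise ValueError("chunk_gb must stay below the 2 GB limit")
    n_chunks = math.ceil(total_gb / chunk_gb)
    parts = [f"{stem}{i},{chunk_gb:g}GB" for i in range(1, n_chunks + 1)]
    return "%RWF=" + ",".join(parts)


if __name__ == "__main__":
    # 5 GB of scratch needs three 1.9 GB slices.
    print(rwf_line(5))  # prints %RWF=file1,1.9GB,file2,1.9GB,file3,1.9GB
```

The resulting line goes at the top of the Gaussian input file, in place of a single-file %RWF specification.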