<?xml version='1.0' encoding="UTF-8"?>

<!-- $Header: /var/cvsroot/gentoo/xml/htdocs/doc/en/hpc-howto.xml,v 1.3 2005/05/13 20:15:50 neysx Exp $ -->

<!DOCTYPE guide SYSTEM "/dtd/guide.dtd">
<guide link="/doc/en/hpc-howto.xml">

<title>High Performance Computing on Gentoo Linux</title>

<author title="Author">
<mail link="marc@adelielinux.com">Marc St-Pierre</mail>
</author>
<author title="Author">
<mail link="benoit@adelielinux.com">Benoit Morin</mail>
</author>
<author title="Assistant/Research">
<mail link="jean-francois@adelielinux.com">Jean-Francois Richard</mail>
</author>
<author title="Assistant/Research">
<mail link="olivier@adelielinux.com">Olivier Crete</mail>
</author>
<author title="Reviewer">
<mail link="spyderous@gentoo.org">Donnie Berkholz</mail>
</author>

<!-- No licensing information; this document has been written by a third-party
organisation without additional licensing information.

In other words, this is copyright adelielinux R&D; Gentoo only has
permission to distribute this document as-is and update it when appropriate
as long as the adelie linux R&D notice stays
-->

<abstract>
This document was written by people at the Adelie Linux R&amp;D Center
&lt;http://www.adelielinux.com&gt; as a step-by-step guide to turning a Gentoo
system into a High Performance Computing (HPC) system.
</abstract>

<version>1.1</version>
<date>2003-08-01</date>

<chapter>
<title>Introduction</title>
<section>
<body>

<p>
Gentoo Linux is a special flavor of Linux that can be automatically optimized
and customized for just about any application or need. Extreme performance,
configurability and a top-notch user and developer community are all hallmarks
of the Gentoo experience.
</p>

<p>
Thanks to a technology called Portage, Gentoo Linux can become an ideal secure
server, development workstation, professional desktop, gaming system, embedded
solution or... a High Performance Computing system. Because of its
near-unlimited adaptability, we call Gentoo Linux a metadistribution.
</p>

<p>
This document explains how to turn a Gentoo system into a High Performance
Computing system. Step by step, it explains what packages one may want to
install and helps configure them.
</p>

<p>
Obtain Gentoo Linux from the website <uri>http://www.gentoo.org</uri>, and
refer to the <uri link="/doc/en/">documentation</uri> at the same location to
install it.
</p>

</body>
</section>
</chapter>

<chapter>
<title>Configuring Gentoo Linux for Clustering</title>
<section>
<title>Recommended Optimizations</title>
<body>

<note>
We refer to the <uri link="/doc/en/handbook/">Gentoo Linux Handbooks</uri> in
this section.
</note>

<p>
During the installation process, you will have to set your USE variables in
<path>/etc/make.conf</path>. We recommend that you deactivate all the
defaults (see <path>/etc/make.profile/make.defaults</path>) by negating them
in make.conf. However, you may want to keep such USE flags as x86, 3dnow,
gpm, mmx, sse, ncurses, pam and tcpd. Refer to the USE documentation for more
information.
</p>

<pre caption="USE Flags">
USE="-oss 3dnow -apm -arts -avi -berkdb -crypt -cups -encode -gdbm
-gif gpm -gtk -imlib -java -jpeg -kde -gnome -libg++ -libwww -mikmod
mmx -motif -mpeg ncurses -nls -oggvorbis -opengl pam -pdflib -png
-python -qt -qtmt -quicktime -readline -sdl -slang -spell -ssl
-svga tcpd -truetype -X -xml2 -xmms -xv -zlib"
</pre>

<p>
Or simply:
</p>

<pre caption="USE Flags - simplified version">
USE="-* 3dnow gpm mmx ncurses pam sse tcpd"
</pre>

<note>
The <e>tcpd</e> USE flag increases security for packages such as xinetd.
</note>

<p>
In step 15 ("Installing the kernel and a System Logger"), for stability
reasons, we recommend the vanilla-sources, the official kernel sources
released on <uri>http://www.kernel.org/</uri>, unless you require special
support such as xfs.
</p>

<pre caption="Installing vanilla-sources">
# <i>emerge -p syslog-ng vanilla-sources</i>
</pre>

<p>
When you install miscellaneous packages, we recommend installing the
following:
</p>

<pre caption="Installing necessary packages">
# <i>emerge -p nfs-utils portmap tcpdump ssmtp iptables xinetd</i>
</pre>

</body>
</section>
<section>
<title>Communication Layer (TCP/IP Network)</title>
<body>

<p>
A cluster requires a communication layer to interconnect the slave nodes to
the master node. Typically, a Fast Ethernet or Gigabit Ethernet LAN can be
used since they have a good price/performance ratio. Other possibilities
include use of products like <uri link="http://www.myricom.com/">Myrinet</uri>,
<uri link="http://quadrics.com/">QsNet</uri> or others.
</p>

<p>
A cluster is composed of two node types: master and slave. Typically, your
cluster will have one master node and several slave nodes.
</p>

<p>
The master node is the cluster's server. It is responsible for telling the
slave nodes what to do. This server will typically run such daemons as dhcpd,
nfs, pbs-server, and pbs-sched. Your master node will allow interactive
sessions for users, and accept job executions.
</p>

<p>
The slave nodes listen for instructions (via ssh/rsh perhaps) from the master
node. They should be dedicated to crunching results and therefore should not
run any unnecessary services.
</p>

<p>
The rest of this documentation will assume a cluster configuration as per the
hosts file below. You should maintain on every node such a hosts file
(<path>/etc/hosts</path>) with an entry for each node participating in the
cluster.
</p>

<pre caption="/etc/hosts">
# Adelie Linux Research &amp; Development Center
# /etc/hosts

127.0.0.1       localhost

192.168.1.100   master.adelie master

192.168.1.1     node01.adelie node01
192.168.1.2     node02.adelie node02
</pre>

<p>
To set up your cluster's dedicated LAN, edit your <path>/etc/conf.d/net</path>
file on the master node.
</p>

<pre caption="/etc/conf.d/net">
# Copyright 1999-2002 Gentoo Technologies, Inc.
# Distributed under the terms of the GNU General Public License, v2 or later

# Global config file for net.* rc-scripts

# This is basically the ifconfig argument without the ifconfig $iface
#

iface_eth0="192.168.1.100 broadcast 192.168.1.255 netmask 255.255.255.0"
# Network connection to the outside world using dhcp -- configure as required for your network
iface_eth1="dhcp"
</pre>

<p>
Finally, set up a DHCP daemon on the master node to avoid having to maintain a
network configuration on each slave node.
</p>

<pre caption="/etc/dhcp/dhcpd.conf">
# Adelie Linux Research &amp; Development Center
# /etc/dhcp/dhcpd.conf

log-facility local7;
ddns-update-style none;
use-host-decl-names on;

subnet 192.168.1.0 netmask 255.255.255.0 {
    option domain-name "adelie";
    range 192.168.1.10 192.168.1.99;
    option routers 192.168.1.100;

    host node01.adelie {
        # MAC address of network card on node 01
        hardware ethernet 00:07:e9:0f:e2:d4;
        fixed-address 192.168.1.1;
    }
    host node02.adelie {
        # MAC address of network card on node 02
        hardware ethernet 00:07:e9:0f:e2:6b;
        fixed-address 192.168.1.2;
    }
}
</pre>
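
<p>
You will also want the DHCP server to start at boot on the master node. A
minimal sketch, assuming the dhcp package installed an init script named
<c>dhcp</c> (on some versions it is called <c>dhcpd</c>):
</p>

<pre caption="Adding dhcp to the default runlevel">
# <i>rc-update add dhcp default</i>
</pre>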

</body>
</section>
<section>
<title>NFS/NIS</title>
<body>

<p>
The Network File System (NFS) was developed to allow machines to mount a disk
partition on a remote machine as if it were on a local hard drive. This allows
for fast, seamless sharing of files across a network.
</p>

<p>
There are other systems that provide similar functionality to NFS which could
be used in a cluster environment. The <uri
link="http://www.transarc.com/Product/EFS/AFS/index.html">Andrew File System
from IBM</uri>, recently open-sourced, provides a file sharing mechanism with
some additional security and performance features. The <uri
link="http://www.coda.cs.cmu.edu/">Coda File System</uri> is still in
development, but is designed to work well with disconnected clients. Many
of the features of the Andrew and Coda file systems are slated for inclusion
in the next version of <uri link="http://www.nfsv4.org">NFS (Version 4)</uri>.
The advantage of NFS today is that it is mature, standard, well understood,
and supported robustly across a variety of platforms.
</p>

<pre caption="Ebuilds for NFS-support">
# <i>emerge -p nfs-utils portmap</i>
# <i>emerge nfs-utils portmap</i>
</pre>

<p>
Configure and install a kernel to support NFS v3 on all nodes:
</p>

<pre caption="Required Kernel Configurations for NFS">
CONFIG_NFS_FS=y
CONFIG_NFSD=y
CONFIG_SUNRPC=y
CONFIG_LOCKD=y
CONFIG_NFSD_V3=y
CONFIG_LOCKD_V4=y
</pre>

<p>
On the master node, edit your <path>/etc/hosts.allow</path> file to allow
connections from slave nodes. If your cluster LAN is on 192.168.1.0/24,
your <path>hosts.allow</path> will look like:
</p>

<pre caption="hosts.allow">
portmap:192.168.1.0/255.255.255.0
</pre>

<p>
Edit the <path>/etc/exports</path> file of the master node to export a work
directory structure (/home is good for this).
</p>

<pre caption="/etc/exports">
/home/       *(rw)
</pre>
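
<p>
If the NFS server is already running when you edit <path>/etc/exports</path>,
you can re-export the list without restarting it. A minimal sketch using the
standard <c>exportfs</c> tool from nfs-utils:
</p>

<pre caption="Re-exporting /etc/exports">
# <i>exportfs -ra</i>
</pre>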

<p>
Add nfs to your master node's default runlevel:
</p>

<pre caption="Adding NFS to the default runlevel">
# <i>rc-update add nfs default</i>
</pre>

<p>
To mount the nfs exported filesystem from the master, you also have to
configure your slave nodes' <path>/etc/fstab</path>. Add a line like this
one:
</p>

<pre caption="/etc/fstab">
master:/home/   /home   nfs     rw,exec,noauto,nouser,async     0 0
</pre>

<p>
You'll also need to set up your nodes so that they mount the nfs filesystem by
issuing this command:
</p>

<pre caption="Adding nfsmount to the default runlevel">
# <i>rc-update add nfsmount default</i>
</pre>

</body>
</section>
<section>
<title>RSH/SSH</title>
<body>

<p>
SSH is a protocol for secure remote login and other secure network services
over an insecure network. OpenSSH uses public key cryptography to provide
secure authorization. The first step in configuring OpenSSH for the cluster
is to generate the public key, which is shared with remote systems, and the
private key, which is kept on the local system.
</p>

<p>
For transparent cluster usage, private/public keys may be used. This process
has two steps:
</p>

<ul>
<li>Generate public and private keys</li>
<li>Copy public key to slave nodes</li>
</ul>

<p>
For user-based authentication, generate and copy as follows:
</p>

<pre caption="SSH key authentication">
# <i>ssh-keygen -t dsa</i>
Generating public/private dsa key pair.
Enter file in which to save the key (/root/.ssh/id_dsa): /root/.ssh/id_dsa
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /root/.ssh/id_dsa.
Your public key has been saved in /root/.ssh/id_dsa.pub.
The key fingerprint is:
f1:45:15:40:fd:3c:2d:f7:9f:ea:55:df:76:2f:a4:1f root@master

<comment>WARNING! If you already have an "authorized_keys" file,
please append to it, do not use the following command.</comment>

# <i>scp /root/.ssh/id_dsa.pub node01:/root/.ssh/authorized_keys</i>
root@master's password:
id_dsa.pub   100%  234   2.0MB/s   00:00

# <i>scp /root/.ssh/id_dsa.pub node02:/root/.ssh/authorized_keys</i>
root@master's password:
id_dsa.pub   100%  234   2.0MB/s   00:00
</pre>
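
<p>
If an <path>authorized_keys</path> file already exists on a node, a safe way to
append rather than overwrite is to pipe the key through ssh. A minimal sketch:
</p>

<pre caption="Appending to an existing authorized_keys">
# <i>cat /root/.ssh/id_dsa.pub | ssh node01 "cat >> /root/.ssh/authorized_keys"</i>
</pre>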

<note>
Host keys must have an empty passphrase. RSA is required for host-based
authentication.
</note>

<p>
For host-based authentication, you will also need to edit your
<path>/etc/ssh/shosts.equiv</path>.
</p>

<pre caption="/etc/ssh/shosts.equiv">
node01.adelie
node02.adelie
master.adelie
</pre>

<p>
You will also need to make a few modifications to the
<path>/etc/ssh/sshd_config</path> file:
</p>

<pre caption="sshd configurations">
# $OpenBSD: sshd_config,v 1.42 2001/09/20 20:57:51 mouring Exp $
# This sshd was compiled with PATH=/usr/bin:/bin:/usr/sbin:/sbin

# This is the sshd server system-wide configuration file. See sshd(8)
# for more information.

# HostKeys for protocol version 2
HostKey /etc/ssh/ssh_host_rsa_key
</pre>
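
<p>
For host-based authentication to actually be accepted, sshd must also have it
enabled. A minimal sketch of the relevant directive (an assumption on our
part; it is not shown in the configuration excerpt above):
</p>

<pre caption="Enabling host-based authentication in sshd_config">
<comment># Allow authentication based on shosts.equiv and host keys</comment>
HostbasedAuthentication yes
</pre>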

<p>
If your applications require RSH communications, you will need to emerge
net-misc/netkit-rsh and sys-apps/xinetd.
</p>

<pre caption="Installing necessary applications">
# <i>emerge -p xinetd</i>
# <i>emerge xinetd</i>
# <i>emerge -p netkit-rsh</i>
# <i>emerge netkit-rsh</i>
</pre>

<p>
Then configure the rsh daemon. Edit your <path>/etc/xinetd.d/rsh</path> file.
</p>

<pre caption="rsh">
# Adelie Linux Research &amp; Development Center
# /etc/xinetd.d/rsh

service shell
{
    socket_type     = stream
    protocol        = tcp
    wait            = no
    user            = root
    group           = tty
    server          = /usr/sbin/in.rshd
    log_type        = FILE /var/log/rsh
    log_on_success  = PID HOST USERID EXIT DURATION
    log_on_failure  = USERID ATTEMPT
    disable         = no
}
</pre>

<p>
Edit your <path>/etc/hosts.allow</path> to permit rsh connections:
</p>

<pre caption="hosts.allow">
# Adelie Linux Research &amp; Development Center
# /etc/hosts.allow

in.rshd:192.168.1.0/255.255.255.0
</pre>

<p>
Or you can simply trust your cluster LAN:
</p>

<pre caption="hosts.allow">
# Adelie Linux Research &amp; Development Center
# /etc/hosts.allow

ALL:192.168.1.0/255.255.255.0
</pre>

<p>
Finally, configure host authentication in <path>/etc/hosts.equiv</path>.
</p>

<pre caption="hosts.equiv">
# Adelie Linux Research &amp; Development Center
# /etc/hosts.equiv

master
node01
node02
</pre>

<p>
And, add xinetd to your default runlevel:
</p>

<pre caption="Adding xinetd to the default runlevel">
# <i>rc-update add xinetd default</i>
</pre>

</body>
</section>
<section>
<title>NTP</title>
<body>

<p>
The Network Time Protocol (NTP) is used to synchronize the time of a computer
client or server to another server or reference time source, such as a radio
or satellite receiver or modem. It provides accuracies typically within a
millisecond on LANs and up to a few tens of milliseconds on WANs relative to
Coordinated Universal Time (UTC) via a Global Positioning System (GPS)
receiver, for example. Typical NTP configurations utilize multiple redundant
servers and diverse network paths in order to achieve high accuracy and
reliability.
</p>

<p>
Select an NTP server geographically close to you from <uri
link="http://www.eecis.udel.edu/~mills/ntp/servers.html">Public NTP Time
Servers</uri>, and configure your <path>/etc/conf.d/ntp</path> and
<path>/etc/ntp.conf</path> files on the master node.
</p>

<pre caption="Master /etc/conf.d/ntp">
# Copyright 1999-2002 Gentoo Technologies, Inc.
# Distributed under the terms of the GNU General Public License v2
# /etc/conf.d/ntpd

# NOTES:
# - NTPDATE variables below are used if you wish to set your
#   clock when you start the ntp init.d script
# - make sure that the NTPDATE_CMD will close by itself ...
#   the init.d script will not attempt to kill/stop it
# - ntpd will be used to maintain synchronization with a time
#   server regardless of what NTPDATE is set to
# - read each of the comments above each of the variable

# Comment this out if you dont want the init script to warn
# about not having ntpdate setup
NTPDATE_WARN="n"

# Command to run to set the clock initially
# Most people should just uncomment this line ...
# however, if you know what you're doing, and you
# want to use ntpd to set the clock, change this to 'ntpd'
NTPDATE_CMD="ntpdate"

# Options to pass to the above command
# Most people should just uncomment this variable and
# change 'someserver' to a valid hostname which you
# can aquire from the URL's below
NTPDATE_OPTS="-b ntp1.cmc.ec.gc.ca"

##
# A list of available servers is available here:
# http://www.eecis.udel.edu/~mills/ntp/servers.html
# Please follow the rules of engagement and use a
# Stratum 2 server (unless you qualify for Stratum 1)
##

# Options to pass to the ntpd process that will *always* be run
# Most people should not uncomment this line ...
# however, if you know what you're doing, feel free to tweak
#NTPD_OPTS=""

</pre>

<p>
Edit your <path>/etc/ntp.conf</path> file on the master to set up an external
synchronization source:
</p>

<pre caption="Master ntp.conf">
# Adelie Linux Research &amp; Development Center
# /etc/ntp.conf

# Synchronization source #1
server ntp1.cmc.ec.gc.ca
restrict ntp1.cmc.ec.gc.ca
# Synchronization source #2
server ntp2.cmc.ec.gc.ca
restrict ntp2.cmc.ec.gc.ca
stratum 10
driftfile /etc/ntp.drift.server
logfile /var/log/ntp
broadcast 192.168.1.255
restrict default kod
restrict 127.0.0.1
restrict 192.168.1.0 mask 255.255.255.0
</pre>

<p>
And on all your slave nodes, set up your synchronization source to be your
master node.
</p>

<pre caption="Node /etc/conf.d/ntp">
# Copyright 1999-2002 Gentoo Technologies, Inc.
# Distributed under the terms of the GNU General Public License v2
# /etc/conf.d/ntpd

NTPDATE_WARN="n"
NTPDATE_CMD="ntpdate"
NTPDATE_OPTS="-b master"
</pre>

<pre caption="Node ntp.conf">
# Adelie Linux Research &amp; Development Center
# /etc/ntp.conf

# Synchronization source #1
server master
restrict master
stratum 11
driftfile /etc/ntp.drift.server
logfile /var/log/ntp
restrict default kod
restrict 127.0.0.1
</pre>

<p>
Then add ntpd to the default runlevel of all your nodes:
</p>

<pre caption="Adding ntpd to the default runlevel">
# <i>rc-update add ntpd default</i>
</pre>

<note>
NTP will not update the local clock if the time difference between your
synchronization source and the local clock is too great.
</note>
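
<p>
If a node's clock is too far off for ntpd to correct it, you can usually set
it once by hand and let ntpd take over from there. A minimal sketch, run on a
slave node:
</p>

<pre caption="Forcing an initial synchronization">
# <i>ntpdate -b master</i>
</pre>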

</body>
</section>
<section>
<title>IPTABLES</title>
<body>

<p>
To set up a firewall on your cluster, you will need iptables.
</p>

<pre caption="Installing iptables">
# <i>emerge -p iptables</i>
# <i>emerge iptables</i>
</pre>

<p>
Required kernel configuration:
</p>

<pre caption="IPtables kernel configuration">
CONFIG_NETFILTER=y
CONFIG_IP_NF_CONNTRACK=y
CONFIG_IP_NF_IPTABLES=y
CONFIG_IP_NF_MATCH_STATE=y
CONFIG_IP_NF_FILTER=y
CONFIG_IP_NF_TARGET_REJECT=y
CONFIG_IP_NF_NAT=y
CONFIG_IP_NF_NAT_NEEDED=y
CONFIG_IP_NF_TARGET_MASQUERADE=y
CONFIG_IP_NF_TARGET_LOG=y
</pre>

<p>
And the rules required for this firewall:
</p>

<pre caption="rule-save">
# Adelie Linux Research &amp; Development Center
# /var/lib/iptables/rule-save

*filter
:INPUT ACCEPT [0:0]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
-A INPUT -m state --state RELATED,ESTABLISHED -j ACCEPT
-A INPUT -p tcp -m tcp --dport 22 -j ACCEPT
-A INPUT -s 192.168.1.0/255.255.255.0 -i eth0 -j ACCEPT
-A INPUT -s 127.0.0.1 -i lo -j ACCEPT
-A INPUT -p icmp -j ACCEPT
-A INPUT -j LOG
-A INPUT -j REJECT --reject-with icmp-port-unreachable
COMMIT
*nat
:PREROUTING ACCEPT [0:0]
:POSTROUTING ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
-A POSTROUTING -s 192.168.1.0/255.255.255.0 -j MASQUERADE
COMMIT
</pre>
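
<p>
For the masquerading rule to actually route traffic from the slave nodes to
the outside world, IP forwarding must be enabled on the master node. A
minimal sketch:
</p>

<pre caption="Enabling IP forwarding">
# <i>echo 1 > /proc/sys/net/ipv4/ip_forward</i>
</pre>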

<p>
Then add iptables to the default runlevel of all your nodes:
</p>

<pre caption="Adding iptables to the default runlevel">
# <i>rc-update add iptables default</i>
</pre>
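
<p>
The init script loads its rules from the save file shown above. If you prefer
to build the rules interactively with <c>iptables</c> commands, the Gentoo
init script can typically write them out for you:
</p>

<pre caption="Saving the current rules">
# <i>/etc/init.d/iptables save</i>
</pre>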

</body>
</section>
</chapter>

<chapter>
<title>HPC Tools</title>
<section>
<title>OpenPBS</title>
<body>

<p>
The Portable Batch System (PBS) is a flexible batch queueing and workload
management system originally developed for NASA. It operates on networked,
multi-platform UNIX environments, including heterogeneous clusters of
workstations, supercomputers, and massively parallel systems. Development of
PBS is provided by Altair Grid Technologies.
</p>

<pre caption="Installing openpbs">
# <i>emerge -p openpbs</i>
</pre>

<note>
The OpenPBS ebuild does not currently set proper permissions on the
var-directories used by OpenPBS.
</note>

<p>
Before you start using OpenPBS, some configuration is required. The files
you will need to personalize for your system are:
</p>

<ul>
<li>/etc/pbs_environment</li>
<li>/var/spool/PBS/server_name</li>
<li>/var/spool/PBS/server_priv/nodes (a sample is shown below)</li>
<li>/var/spool/PBS/mom_priv/config</li>
<li>/var/spool/PBS/sched_priv/sched_config</li>
</ul>
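
<p>
As an illustration, the <path>server_priv/nodes</path> file simply lists the
machines that may run jobs, one per line. A minimal sketch for the cluster
used throughout this guide (np being the number of processors on each node;
the exact attributes supported depend on your OpenPBS version):
</p>

<pre caption="/var/spool/PBS/server_priv/nodes">
node01 np=1
node02 np=1
</pre>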

<p>
Here is a sample queue and server configuration. Note that these are not the
contents of a file, but commands to feed to the <c>qmgr</c> utility:
</p>

<pre caption="Sample queue and server configuration">
#
# Create queues and set their attributes.
#
#
# Create and define queue upto4nodes
#
create queue upto4nodes
set queue upto4nodes queue_type = Execution
set queue upto4nodes Priority = 100
set queue upto4nodes resources_max.nodect = 4
set queue upto4nodes resources_min.nodect = 1
set queue upto4nodes enabled = True
set queue upto4nodes started = True
#
# Create and define queue default
#
create queue default
set queue default queue_type = Route
set queue default route_destinations = upto4nodes
set queue default enabled = True
set queue default started = True
#
# Set server attributes.
#
set server scheduling = True
set server acl_host_enable = True
set server default_queue = default
set server log_events = 511
set server mail_from = adm
set server query_other_jobs = True
set server resources_default.neednodes = 1
set server resources_default.nodect = 1
set server resources_default.nodes = 1
set server scheduler_iteration = 60
</pre>
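
<p>
Assuming you have saved the commands above in a file (the file name here is
just an example), they can be loaded into the server in one go, since
<c>qmgr</c> reads commands from standard input:
</p>

<pre caption="Loading the configuration with qmgr">
# <i>qmgr &lt; cluster-queues.conf</i>
</pre>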

<p>
To submit a task to OpenPBS, the command <c>qsub</c> is used with some
optional parameters. In the example below, "-l" allows you to specify
the resources required, "-j" provides for redirection of standard out and
standard error, and the "-m" will e-mail the user at beginning (b), end (e)
and on abort (a) of the job.
</p>

<pre caption="Submitting a task">
<comment>(submit and request from OpenPBS that myscript be executed on 2 nodes)</comment>
# <i>qsub -l nodes=2 -j oe -m abe myscript</i>
</pre>

<p>
Normally jobs submitted to OpenPBS are in the form of scripts. Sometimes, you
may want to try a task manually. To request an interactive shell from OpenPBS,
use the "-I" parameter.
</p>

<pre caption="Requesting an interactive shell">
# <i>qsub -I</i>
</pre>

<p>
To check the status of your jobs, use the qstat command:
</p>

<pre caption="Checking the status of the jobs">
# <i>qstat</i>
Job id           Name   User     Time Use S Queue
---------------- ------ -------- -------- - ----------
2.geist          STDIN  adelie   0        R upto4nodes
</pre>

</body>
</section>
<section>
<title>MPICH</title>
<body>

<p>
Message passing is a paradigm used widely on certain classes of parallel
machines, especially those with distributed memory. MPICH is a freely
available, portable implementation of MPI, the Standard for message-passing
libraries.
</p>

<p>
The mpich ebuild provided by Adelie Linux allows for two USE flags:
<e>doc</e> and <e>crypt</e>. <e>doc</e> will cause documentation to be
installed, while <e>crypt</e> will configure MPICH to use <c>ssh</c> instead
of <c>rsh</c>.
</p>

<pre caption="Installing the mpich application">
# <i>emerge -p mpich</i>
# <i>emerge mpich</i>
</pre>

<p>
You may need to export an MPICH work directory to all your slave nodes in
<path>/etc/exports</path>:
</p>

<pre caption="/etc/exports">
/home       *(rw)
</pre>

<p>
Most massively parallel processors (MPPs) provide a way to start a program on
a requested number of processors; <c>mpirun</c> makes use of the appropriate
command whenever possible. In contrast, workstation clusters require that each
process in a parallel job be started individually, though programs to help
start these processes exist. Because workstation clusters are not already
organized as an MPP, additional information is required to make use of them.
Mpich should be installed with a list of participating workstations in the
file <path>machines.LINUX</path> in the directory
<path>/usr/share/mpich/</path>. This file is used by <c>mpirun</c> to choose
processors to run on.
</p>

<p>
Edit this file to reflect your cluster LAN configuration:
</p>

<pre caption="/usr/share/mpich/machines.LINUX">
# Change this file to contain the machines that you want to use
# to run MPI jobs on. The format is one host name per line, with either
#    hostname
# or
#    hostname:n
# where n is the number of processors in an SMP. The hostname should
# be the same as the result from the command "hostname"
master
node01
node02
# node03
# node04
# ...
</pre>

<p>
Use the script <c>tstmachines</c> in <path>/usr/sbin/</path> to ensure that
you can use all of the machines that you have listed. This script performs
an <c>rsh</c> and a short directory listing; this tests that you both have
access to the node and that a program in the current directory is visible on
the remote node. If there are any problems, they will be listed. These
problems must be fixed before proceeding.
</p>

<p>
The only argument to <c>tstmachines</c> is the name of the architecture; this
is the same name as the extension on the machines file. For example, the
following tests that a program in the current directory can be executed by
all of the machines in the LINUX machines list.
</p>

<pre caption="Running a test">
# <i>/usr/sbin/tstmachines LINUX</i>
</pre>

<note>
This program is silent if all is well; if you want to see what it is doing,
use the -v (for verbose) argument:
</note>

<pre caption="Running a test verbosely">
# <i>/usr/sbin/tstmachines -v LINUX</i>
</pre>

<p>
The output from this command might look like:
</p>

<pre caption="Output of the above command">
Trying true on host1.uoffoo.edu ...
Trying true on host2.uoffoo.edu ...
Trying ls on host1.uoffoo.edu ...
Trying ls on host2.uoffoo.edu ...
Trying user program on host1.uoffoo.edu ...
Trying user program on host2.uoffoo.edu ...
</pre>

<p>
If <c>tstmachines</c> finds a problem, it will suggest possible reasons and
solutions. In brief, there are three tests:
</p>

<ul>
<li>
  <e>Can processes be started on remote machines?</e> tstmachines attempts
  to run the shell command true on each machine in the machines file by
  using the remote shell command.
</li>
<li>
  <e>Is the current working directory available to all machines?</e> This
  attempts to ls a file that tstmachines creates by running ls using the
  remote shell command.
</li>
<li>
  <e>Can user programs be run on remote systems?</e> This checks that shared
  libraries and other components have been properly installed on all
  machines.
</li>
</ul>

<p>
And the required test for every development tool:
</p>

<pre caption="Testing a development tool">
# <i>cd ~</i>
# <i>cp /usr/share/mpich/examples1/hello++.c ~</i>
# <i>make hello++</i>
# <i>mpirun -machinefile /usr/share/mpich/machines.LINUX -np 1 hello++</i>
</pre>
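
<p>
If there is no Makefile at hand, the program can also be built directly with
MPICH's C++ compiler wrapper. A minimal sketch, assuming the wrapper is
installed as <c>mpiCC</c>:
</p>

<pre caption="Building hello++ with the compiler wrapper">
# <i>mpiCC -o hello++ hello++.c</i>
# <i>mpirun -machinefile /usr/share/mpich/machines.LINUX -np 2 hello++</i>
</pre>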

<p>
For further information on MPICH, consult the documentation at <uri
link="http://www-unix.mcs.anl.gov/mpi/mpich/docs/mpichman-chp4/mpichman-chp4.htm">http://www-unix.mcs.anl.gov/mpi/mpich/docs/mpichman-chp4/mpichman-chp4.htm</uri>.
</p>

</body>
</section>
<section>
<title>LAM</title>
<body>

<p>
(Coming Soon!)
</p>

</body>
</section>
<section>
<title>OMNI</title>
<body>

<p>
(Coming Soon!)
</p>

</body>
</section>
</chapter>

<chapter>
<title>Bibliography</title>
<section>
<body>

<p>
The original document is published at the <uri
link="http://www.adelielinux.com">Adelie Linux R&amp;D Centre</uri> web site,
and is reproduced here with the permission of the authors and <uri
link="http://www.cyberlogic.ca">Cyberlogic</uri>'s Adelie Linux R&amp;D
Centre.
</p>

<ul>
<li><uri>http://www.gentoo.org</uri>, Gentoo Technologies, Inc.</li>
<li>
  <uri link="http://www.adelielinux.com">http://www.adelielinux.com</uri>,
  Adelie Linux Research and Development Centre
</li>
<li>
  <uri link="http://nfs.sourceforge.net/">http://nfs.sourceforge.net</uri>,
  Linux NFS Project
</li>
<li>
  <uri link="http://www-unix.mcs.anl.gov/mpi/mpich/">http://www-unix.mcs.anl.gov/mpi/mpich/</uri>,
  Mathematics and Computer Science Division, Argonne National Laboratory
</li>
<li>
  <uri link="http://www.ntp.org/">http://ntp.org</uri>
</li>
<li>
  <uri link="http://www.eecis.udel.edu/~mills/">http://www.eecis.udel.edu/~mills/</uri>,
  David L. Mills, University of Delaware
</li>
<li>
  <uri link="http://www.ietf.org/html.charters/secsh-charter.html">http://www.ietf.org/html.charters/secsh-charter.html</uri>,
  Secure Shell Working Group, IETF, Internet Society
</li>
<li>
  <uri link="http://www.linuxsecurity.com/">http://www.linuxsecurity.com/</uri>,
  Guardian Digital
</li>
<li>
  <uri link="http://www.openpbs.org/">http://www.openpbs.org/</uri>,
  Altair Grid Technologies, LLC.
</li>
</ul>

</body>
</section>
</chapter>

</guide>
