1 <?xml version='1.0' encoding="UTF-8"?>
2 <!-- $Header: /var/cvsroot/gentoo/xml/htdocs/doc/en/hpc-howto.xml,v 1.15 2010/06/07 09:08:37 nightmorph Exp $ -->
3 <!DOCTYPE guide SYSTEM "/dtd/guide.dtd">
4
5 <guide>
6 <title>High Performance Computing on Gentoo Linux</title>
7
8 <author title="Author">
9 <mail link="marc@adelielinux.com">Marc St-Pierre</mail>
10 </author>
11 <author title="Author">
12 <mail link="benoit@adelielinux.com">Benoit Morin</mail>
13 </author>
14 <author title="Assistant/Research">
15 <mail link="jean-francois@adelielinux.com">Jean-Francois Richard</mail>
16 </author>
17 <author title="Assistant/Research">
18 <mail link="olivier@adelielinux.com">Olivier Crete</mail>
19 </author>
20 <author title="Reviewer">
21 <mail link="dberkholz@gentoo.org">Donnie Berkholz</mail>
22 </author>
23 <author title="Editor">
24 <mail link="nightmorph"/>
25 </author>
26
27 <!-- No licensing information; this document has been written by a third-party
28 organisation without additional licensing information.
29
30 In other words, this is copyright adelielinux R&D; Gentoo only has
31 permission to distribute this document as-is and update it when appropriate
32 as long as the adelie linux R&D notice stays
33 -->
34
35 <abstract>
36 This document was written by people at the Adelie Linux R&amp;D Center
37 &lt;http://www.adelielinux.com&gt; as a step-by-step guide to turn a Gentoo
38 System into a High Performance Computing (HPC) system.
39 </abstract>
40
41 <version>2</version>
42 <date>2012-07-24</date>
43
44 <chapter>
45 <title>Introduction</title>
46 <section>
47 <body>
48
49 <p>
50 Gentoo Linux is a special flavor of Linux that can be automatically optimized
51 and customized for just about any application or need. Extreme performance,
52 configurability and a top-notch user and developer community are all hallmarks
53 of the Gentoo experience.
54 </p>
55
56 <p>
57 Thanks to a technology called Portage, Gentoo Linux can become an ideal secure
58 server, development workstation, professional desktop, gaming system, embedded
59 solution or... a High Performance Computing system. Because of its
60 near-unlimited adaptability, we call Gentoo Linux a metadistribution.
61 </p>
62
63 <p>
64 This document explains how to turn a Gentoo system into a High Performance
65 Computing system. Step by step, it explains what packages one may want to
66 install and helps configure them.
67 </p>
68
69 <p>
70 Obtain Gentoo Linux from the website <uri>http://www.gentoo.org</uri>, and
71 refer to the <uri link="/doc/en/">documentation</uri> at the same location to
72 install it.
73 </p>
74
75 </body>
76 </section>
77 </chapter>
78
79 <chapter>
80 <title>Configuring Gentoo Linux for Clustering</title>
81 <section>
82 <title>Recommended Optimizations</title>
83 <body>
84
85 <note>
86 We refer to the <uri link="/doc/en/handbook/">Gentoo Linux Handbooks</uri> in
87 this section.
88 </note>
89
90 <p>
91 During the installation process, you will have to set your USE variables in
92 <path>/etc/portage/make.conf</path>. We recommend that you deactivate all the
93 defaults (see <path>/etc/portage/make.profile/make.defaults</path>) by negating them in
94 make.conf. However, you may want to keep USE flags such as 3dnow, gpm,
95 mmx, nptl, nptlonly, sse, ncurses, pam and tcpd. Refer to the USE documentation
96 for more information.
97 </p>
98
99 <pre caption="USE Flags">
100 USE="-oss 3dnow -apm -avi -berkdb -crypt -cups -encode -gdbm -gif gpm -gtk
101 -imlib -java -jpeg -kde -gnome -libg++ -libwww -mikmod mmx -motif -mpeg ncurses
102 -nls nptl nptlonly -ogg -opengl pam -pdflib -png -python -qt4 -qtmt
103 -quicktime -readline -sdl -slang -spell -ssl -svga tcpd -truetype -vorbis -X
104 -xml2 -xv -zlib"
105 </pre>
106
107 <p>
108 Or simply:
109 </p>
110
111 <pre caption="USE Flags - simplified version">
112 USE="-* 3dnow gpm mmx ncurses pam sse tcpd"
113 </pre>
114
115 <note>
116 The <e>tcpd</e> USE flag increases security for packages such as xinetd.
117 </note>
118
119 <p>
120 In step 15 ("Installing the kernel and a System Logger"), we recommend
121 vanilla-sources, the official kernel sources released on
122 <uri>http://www.kernel.org/</uri>, for stability reasons, unless you require
123 special support such as XFS.
124 </p>
125
126 <pre caption="Installing vanilla-sources">
127 # <i>emerge -a syslog-ng vanilla-sources</i>
128 </pre>
129
130 <p>
131 When you install miscellaneous packages, we recommend installing the
132 following:
133 </p>
134
135 <pre caption="Installing necessary packages">
136 # <i>emerge -a nfs-utils portmap tcpdump ssmtp iptables xinetd</i>
137 </pre>
138
139 </body>
140 </section>
141 <section>
142 <title>Communication Layer (TCP/IP Network)</title>
143 <body>
144
145 <p>
146 A cluster requires a communication layer to interconnect the slave nodes to
147 the master node. Typically, a Fast Ethernet or Gigabit Ethernet LAN is used,
148 since these offer a good price/performance ratio. Other possibilities include
149 use of products like <uri link="http://www.myricom.com/">Myrinet</uri>, <uri
150 link="http://quadrics.com/">QsNet</uri> or others.
151 </p>
152
153 <p>
154 A cluster is composed of two node types: master and slave. Typically, your
155 cluster will have one master node and several slave nodes.
156 </p>
157
158 <p>
159 The master node is the cluster's server. It is responsible for telling the
160 slave nodes what to do. This server will typically run such daemons as dhcpd,
161 nfs, pbs-server, and pbs-sched. Your master node will allow interactive
162 sessions for users, and accept job executions.
163 </p>
164
165 <p>
166 The slave nodes listen for instructions (via ssh/rsh perhaps) from the master
167 node. They should be dedicated to crunching results and therefore should not
168 run any unnecessary services.
169 </p>
170
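<p>
As an illustrative check (not part of the original setup), you can list the
services enabled on a slave node with <c>rc-update</c> and drop anything that
is not needed for computation:
</p>

<pre caption="Reviewing services on a slave node (illustrative)">
<comment>(List the services assigned to each runlevel)</comment>
# <i>rc-update show</i>
<comment>("someservice" is a placeholder for anything a compute node does not need)</comment>
# <i>rc-update del someservice default</i>
</pre>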
171 <p>
172 The rest of this documentation will assume a cluster configuration as per the
173 hosts file below. You should maintain such a hosts file
174 (<path>/etc/hosts</path>) on every node, with entries for each node participating in the
175 cluster.
176 </p>
177
178 <pre caption="/etc/hosts">
179 # Adelie Linux Research &amp; Development Center
180 # /etc/hosts
181
182 127.0.0.1 localhost
183
184 192.168.1.100 master.adelie master
185
186 192.168.1.1 node01.adelie node01
187 192.168.1.2 node02.adelie node02
188 </pre>
189
190 <p>
191 To set up your cluster's dedicated LAN, edit your <path>/etc/conf.d/net</path>
192 file on the master node.
193 </p>
194
195 <pre caption="/etc/conf.d/net">
196 # Global config file for net.* rc-scripts
197
198 # This is basically the ifconfig argument without the ifconfig $iface
199 #
200
201 iface_eth0="192.168.1.100 broadcast 192.168.1.255 netmask 255.255.255.0"
202 # Network Connection to the outside world using dhcp -- configure as required for your network
203 iface_eth1="dhcp"
204 </pre>
205
206
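<p>
A hedged example of activating this configuration follows; the exact init
script names and the way <path>net.eth1</path> is created depend on your
baselayout/OpenRC version, so adjust as needed.
</p>

<pre caption="Activating the network configuration (illustrative)">
<comment>(On OpenRC systems, interface scripts are symlinks to net.lo;
create net.eth1 if it does not exist yet)</comment>
# <i>ln -s net.lo /etc/init.d/net.eth1</i>
# <i>/etc/init.d/net.eth0 restart</i>
# <i>/etc/init.d/net.eth1 start</i>
# <i>rc-update add net.eth0 default</i>
# <i>rc-update add net.eth1 default</i>
</pre>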
207 <p>
208 Finally, set up a DHCP daemon on the master node to avoid having to maintain a
209 network configuration on each slave node.
210 </p>
211
212 <pre caption="/etc/dhcp/dhcpd.conf">
213 # Adelie Linux Research &amp; Development Center
214 # /etc/dhcp/dhcpd.conf
215
216 log-facility local7;
217 ddns-update-style none;
218 use-host-decl-names on;
219
220 subnet 192.168.1.0 netmask 255.255.255.0 {
221 option domain-name "adelie";
222 range 192.168.1.10 192.168.1.99;
223 option routers 192.168.1.100;
224
225 host node01.adelie {
226 # MAC address of network card on node 01
227 hardware ethernet 00:07:e9:0f:e2:d4;
228 fixed-address 192.168.1.1;
229 }
230 host node02.adelie {
231 # MAC address of network card on node 02
232 hardware ethernet 00:07:e9:0f:e2:6b;
233 fixed-address 192.168.1.2;
234 }
235 }
236 </pre>
237
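<p>
Once <path>dhcpd.conf</path> is in place, install the DHCP server and add it
to the default runlevel so the slave nodes receive their addresses at boot.
A minimal sketch, assuming the ISC DHCP server from <c>net-misc/dhcp</c>
(which provides the <c>dhcpd</c> init script); you may also want to restrict
the daemon to the cluster interface in <path>/etc/conf.d/dhcpd</path>.
</p>

<pre caption="Starting the DHCP server on the master node (illustrative)">
# <i>emerge -a dhcp</i>
# <i>/etc/init.d/dhcpd start</i>
# <i>rc-update add dhcpd default</i>
</pre>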
238 </body>
239 </section>
240 <section>
241 <title>NFS/NIS</title>
242 <body>
243
244 <p>
245 The Network File System (NFS) was developed to allow machines to mount a disk
246 partition on a remote machine as if it were on a local hard drive. This allows
247 for fast, seamless sharing of files across a network.
248 </p>
249
250 <p>
251 There are other systems that provide similar functionality to NFS which could
252 be used in a cluster environment. The <uri
253 link="http://www.openafs.org">Andrew File System
254 from IBM</uri>, recently open-sourced, provides a file sharing mechanism with
255 some additional security and performance features. The <uri
256 link="http://www.coda.cs.cmu.edu/">Coda File System</uri> is still in
257 development, but is designed to work well with disconnected clients. Many
258 of the features of the Andrew and Coda file systems are slated for inclusion
259 in the next version of <uri link="http://www.nfsv4.org">NFS (Version 4)</uri>.
260 The advantage of NFS today is that it is mature, standard, well understood,
261 and supported robustly across a variety of platforms.
262 </p>
263
264 <pre caption="Ebuilds for NFS-support">
265 # <i>emerge -a nfs-utils portmap</i>
266 </pre>
267
268 <p>
269 Configure and install a kernel to support NFS v3 on all nodes:
270 </p>
271
272 <pre caption="Required Kernel Configurations for NFS">
273 CONFIG_NFS_FS=y
274 CONFIG_NFSD=y
275 CONFIG_SUNRPC=y
276 CONFIG_LOCKD=y
277 CONFIG_NFSD_V3=y
278 CONFIG_LOCKD_V4=y
279 </pre>
280
281 <p>
282 On the master node, edit your <path>/etc/hosts.allow</path> file to allow
283 connections from slave nodes. If your cluster LAN is on 192.168.1.0/24,
284 your <path>hosts.allow</path> will look like:
285 </p>
286
287 <pre caption="hosts.allow">
288 portmap:192.168.1.0/255.255.255.0
289 </pre>
290
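<p>
NFS relies on several helper daemons besides portmap. If they are built with
tcp wrapper support on your system, you may want to allow them from the
cluster LAN as well; an illustrative extension:
</p>

<pre caption="hosts.allow - other NFS-related daemons (illustrative)">
lockd:192.168.1.0/255.255.255.0
mountd:192.168.1.0/255.255.255.0
rquotad:192.168.1.0/255.255.255.0
statd:192.168.1.0/255.255.255.0
</pre>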
291 <p>
292 Edit the <path>/etc/exports</path> file of the master node to export a work
293 directory structure (/home is good for this).
294 </p>
295
296 <pre caption="/etc/exports">
297 /home/ *(rw)
298 </pre>
299
300 <p>
301 Add nfs to your master node's default runlevel:
302 </p>
303
304 <pre caption="Adding NFS to the default runlevel">
305 # <i>rc-update add nfs default</i>
306 </pre>
307
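<p>
After starting the service, you can re-export and verify the share from the
master. <c>exportfs</c> and <c>showmount</c> are part of <c>nfs-utils</c>;
the output below is only indicative.
</p>

<pre caption="Verifying the NFS export (illustrative)">
# <i>/etc/init.d/nfs start</i>
# <i>exportfs -ra</i>
# <i>showmount -e master</i>
Export list for master:
/home *
</pre>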
308 <p>
309 To mount the NFS-exported filesystem from the master, you also have to
310 configure your slave nodes' <path>/etc/fstab</path>. Add a line like this
311 one:
312 </p>
313
314 <pre caption="/etc/fstab">
315 master:/home/ /home nfs rw,exec,noauto,nouser,async 0 0
316 </pre>
317
318 <p>
319 You'll also need to set up your nodes so that they mount the NFS filesystem at
320 boot by issuing this command:
321 </p>
322
323 <pre caption="Adding nfsmount to the default runlevel">
324 # <i>rc-update add nfsmount default</i>
325 </pre>
326
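<p>
Because the example <path>fstab</path> entry uses <c>noauto</c>, you can also
test the mount by hand on a slave node:
</p>

<pre caption="Testing the NFS mount on a slave node">
# <i>mount /home</i>
# <i>df -h /home</i>
</pre>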
327 </body>
328 </section>
329 <section>
330 <title>RSH/SSH</title>
331 <body>
332
333 <p>
334 SSH is a protocol for secure remote login and other secure network services
335 over an insecure network. OpenSSH uses public key cryptography to provide
336 secure authentication. The first step in configuring OpenSSH on the cluster
337 is to generate a key pair: a public key, which is shared with remote systems,
338 and a private key, which is kept on the local system.
339 </p>
340
341 <p>
342 For transparent cluster usage, private/public keys may be used. This process
343 has two steps:
344 </p>
345
346 <ul>
347 <li>Generate public and private keys</li>
348 <li>Copy public key to slave nodes</li>
349 </ul>
350
351 <p>
352 For user based authentication, generate and copy as follows:
353 </p>
354
355 <pre caption="SSH key authentication">
356 # <i>ssh-keygen -t dsa</i>
357 Generating public/private dsa key pair.
358 Enter file in which to save the key (/root/.ssh/id_dsa): /root/.ssh/id_dsa
359 Enter passphrase (empty for no passphrase):
360 Enter same passphrase again:
361 Your identification has been saved in /root/.ssh/id_dsa.
362 Your public key has been saved in /root/.ssh/id_dsa.pub.
363 The key fingerprint is:
364 f1:45:15:40:fd:3c:2d:f7:9f:ea:55:df:76:2f:a4:1f root@master
365
366 <comment>WARNING! If you already have an "authorized_keys" file,
367 please append to it, do not use the following command.</comment>
368
369 # <i>scp /root/.ssh/id_dsa.pub node01:/root/.ssh/authorized_keys</i>
370 root@master's password:
371 id_dsa.pub 100% 234 2.0MB/s 00:00
372
373 # <i>scp /root/.ssh/id_dsa.pub node02:/root/.ssh/authorized_keys</i>
374 root@master's password:
375 id_dsa.pub 100% 234 2.0MB/s 00:00
376 </pre>
377
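<p>
Once the public key is in place on the slaves, logging in from the master
should no longer ask for a password (only for the key passphrase, if you set
one). A quick test:
</p>

<pre caption="Testing key-based logins">
# <i>ssh node01 hostname</i>
# <i>ssh node02 uptime</i>
</pre>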
378 <note>
379 Host keys must have an empty passphrase. RSA is required for host-based
380 authentication.
381 </note>
382
383 <p>
384 For host based authentication, you will also need to edit your
385 <path>/etc/ssh/shosts.equiv</path>.
386 </p>
387
388 <pre caption="/etc/ssh/shosts.equiv">
389 node01.adelie
390 node02.adelie
391 master.adelie
392 </pre>
393
394 <p>
395 And a few modifications to the <path>/etc/ssh/sshd_config</path> file:
396 </p>
397
398 <pre caption="sshd configurations">
399 # $OpenBSD: sshd_config,v 1.42 2001/09/20 20:57:51 mouring Exp $
400 # This sshd was compiled with PATH=/usr/bin:/bin:/usr/sbin:/sbin
401
402 # This is the sshd server system-wide configuration file. See sshd(8)
403 # for more information.
404
405 # HostKeys for protocol version 2
406 HostKey /etc/ssh/ssh_host_rsa_key
407 </pre>
408
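<p>
The snippet above only shows the host key. For host-based authentication to
work, sshd must also allow it and the ssh client must offer it; a minimal
sketch of the relevant OpenSSH options (check <c>man sshd_config</c> and
<c>man ssh_config</c> for your version):
</p>

<pre caption="Enabling host-based authentication (illustrative)">
<comment>(/etc/ssh/sshd_config on every node)</comment>
HostbasedAuthentication yes

<comment>(/etc/ssh/ssh_config on every node)</comment>
HostbasedAuthentication yes
EnableSSHKeysign yes
</pre>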
409 <p>
410 If your application requires RSH communications, you will need to emerge
411 <c>net-misc/netkit-rsh</c> and <c>sys-apps/xinetd</c>.
412 </p>
413
414 <pre caption="Installing necessary applicaitons">
415 # <i>emerge -a xinetd</i>
416 # <i>emerge -a netkit-rsh</i>
417 </pre>
418
419 <p>
420 Then configure the rsh daemon. Edit your <path>/etc/xinetd.d/rsh</path> file.
421 </p>
422
423 <pre caption="rsh">
424 # Adelie Linux Research &amp; Development Center
425 # /etc/xinetd.d/rsh
426
427 service shell
428 {
429 socket_type = stream
430 protocol = tcp
431 wait = no
432 user = root
433 group = tty
434 server = /usr/sbin/in.rshd
435 log_type = FILE /var/log/rsh
436 log_on_success = PID HOST USERID EXIT DURATION
437 log_on_failure = USERID ATTEMPT
438 disable = no
439 }
440 </pre>
441
442 <p>
443 Edit your <path>/etc/hosts.allow</path> to permit rsh connections:
444 </p>
445
446 <pre caption="hosts.allow">
447 # Adelie Linux Research &amp; Development Center
448 # /etc/hosts.allow
449
450 in.rshd:192.168.1.0/255.255.255.0
451 </pre>
452
453 <p>
454 Or you can simply trust your cluster LAN:
455 </p>
456
457 <pre caption="hosts.allow">
458 # Adelie Linux Research &amp; Development Center
459 # /etc/hosts.allow
460
461 ALL:192.168.1.0/255.255.255.0
462 </pre>
463
464 <p>
465 Finally, configure host authentication from <path>/etc/hosts.equiv</path>.
466 </p>
467
468 <pre caption="hosts.equiv">
469 # Adelie Linux Research &amp; Development Center
470 # /etc/hosts.equiv
471
472 master
473 node01
474 node02
475 </pre>
476
477 <p>
478 And, add xinetd to your default runlevel:
479 </p>
480
481 <pre caption="Adding xinetd to the default runlevel">
482 # <i>rc-update add xinetd default</i>
483 </pre>
484
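<p>
With xinetd running, you can quickly confirm that the slave nodes accept rsh
from the master:
</p>

<pre caption="Testing rsh from the master node">
# <i>/etc/init.d/xinetd start</i>
# <i>rsh node01 hostname</i>
# <i>rsh node02 uptime</i>
</pre>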
485 </body>
486 </section>
487 <section>
488 <title>NTP</title>
489 <body>
490
491 <p>
492 The Network Time Protocol (NTP) is used to synchronize the time of a computer
493 client or server to another server or reference time source, such as a radio
494 or satellite receiver or modem. It provides accuracies typically within a
495 millisecond on LANs and up to a few tens of milliseconds on WANs relative to
496 Coordinated Universal Time (UTC) via a Global Positioning Service (GPS)
497 receiver, for example. Typical NTP configurations utilize multiple redundant
498 servers and diverse network paths in order to achieve high accuracy and
499 reliability.
500 </p>
501
502 <p>
503 Select an NTP server geographically close to you from <uri
504 link="http://www.eecis.udel.edu/~mills/ntp/servers.html">Public NTP Time
505 Servers</uri>, and configure your <path>/etc/conf.d/ntp</path> and
506 <path>/etc/ntp.conf</path> files on the master node.
507 </p>
508
509 <pre caption="Master /etc/conf.d/ntp">
510 # /etc/conf.d/ntpd
511
512 # NOTES:
513 # - NTPDATE variables below are used if you wish to set your
514 # clock when you start the ntp init.d script
515 # - make sure that the NTPDATE_CMD will close by itself ...
516 # the init.d script will not attempt to kill/stop it
517 # - ntpd will be used to maintain synchronization with a time
518 # server regardless of what NTPDATE is set to
519 # - read each of the comments above each of the variable
520
521 # Comment this out if you dont want the init script to warn
522 # about not having ntpdate setup
523 NTPDATE_WARN="n"
524
525 # Command to run to set the clock initially
526 # Most people should just uncomment this line ...
527 # however, if you know what you're doing, and you
528 # want to use ntpd to set the clock, change this to 'ntpd'
529 NTPDATE_CMD="ntpdate"
530
531 # Options to pass to the above command
532 # Most people should just uncomment this variable and
533 # change 'someserver' to a valid hostname which you
534 # can acquire from the URL's below
535 NTPDATE_OPTS="-b ntp1.cmc.ec.gc.ca"
536
537 ##
538 # A list of available servers is available here:
539 # http://www.eecis.udel.edu/~mills/ntp/servers.html
540 # Please follow the rules of engagement and use a
541 # Stratum 2 server (unless you qualify for Stratum 1)
542 ##
543
544 # Options to pass to the ntpd process that will *always* be run
545 # Most people should not uncomment this line ...
546 # however, if you know what you're doing, feel free to tweak
547 #NTPD_OPTS=""
548
549 </pre>
550
551 <p>
552 Edit your <path>/etc/ntp.conf</path> file on the master to set up an external
553 synchronization source:
554 </p>
555
556 <pre caption="Master ntp.conf">
557 # Adelie Linux Research &amp; Development Center
558 # /etc/ntp.conf
559
560 # Synchronization source #1
561 server ntp1.cmc.ec.gc.ca
562 restrict ntp1.cmc.ec.gc.ca
563 # Synchronization source #2
564 server ntp2.cmc.ec.gc.ca
565 restrict ntp2.cmc.ec.gc.ca
566 stratum 10
567 driftfile /etc/ntp.drift.server
568 logfile /var/log/ntp
569 broadcast 192.168.1.255
570 restrict default kod
571 restrict 127.0.0.1
572 restrict 192.168.1.0 mask 255.255.255.0
573 </pre>
574
575 <p>
576 On all your slave nodes, set up your master node as the synchronization
577 source.
578 </p>
579
580 <pre caption="Node /etc/conf.d/ntp">
581 # /etc/conf.d/ntpd
582
583 NTPDATE_WARN="n"
584 NTPDATE_CMD="ntpdate"
585 NTPDATE_OPTS="-b master"
586 </pre>
587
588 <pre caption="Node ntp.conf">
589 # Adelie Linux Research &amp; Development Center
590 # /etc/ntp.conf
591
592 # Synchronization source #1
593 server master
594 restrict master
595 stratum 11
596 driftfile /etc/ntp.drift.server
597 logfile /var/log/ntp
598 restrict default kod
599 restrict 127.0.0.1
600 </pre>
601
602 <p>
603 Then add ntpd to the default runlevel of all your nodes:
604 </p>
605
606 <pre caption="Adding ntpd to the default runlevel">
607 # <i>rc-update add ntpd default</i>
608 </pre>
609
610 <note>
611 NTP will not update the local clock if the time difference between your
612 synchronization source and the local clock is too great.
613 </note>
614
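<p>
To confirm that a node is actually synchronizing, query the running daemon
with <c>ntpq</c>; the peer currently used for synchronization is marked with
an asterisk in the first column.
</p>

<pre caption="Checking NTP synchronization">
# <i>ntpq -p</i>
</pre>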
615 </body>
616 </section>
617 <section>
618 <title>IPTABLES</title>
619 <body>
620
621 <p>
622 To set up a firewall on your cluster, you will need iptables.
623 </p>
624
625 <pre caption="Installing iptables">
626 # <i>emerge -a iptables</i>
627 </pre>
628
629 <p>
630 Required kernel configuration:
631 </p>
632
633 <pre caption="IPtables kernel configuration">
634 CONFIG_NETFILTER=y
635 CONFIG_IP_NF_CONNTRACK=y
636 CONFIG_IP_NF_IPTABLES=y
637 CONFIG_IP_NF_MATCH_STATE=y
638 CONFIG_IP_NF_FILTER=y
639 CONFIG_IP_NF_TARGET_REJECT=y
640 CONFIG_IP_NF_NAT=y
641 CONFIG_IP_NF_NAT_NEEDED=y
642 CONFIG_IP_NF_TARGET_MASQUERADE=y
643 CONFIG_IP_NF_TARGET_LOG=y
644 </pre>
645
646 <p>
647 And the rules required for this firewall:
648 </p>
649
650 <pre caption="rule-save">
651 # Adelie Linux Research &amp; Development Center
652 # /var/lib/iptables/rule-save
653
654 *filter
655 :INPUT ACCEPT [0:0]
656 :FORWARD ACCEPT [0:0]
657 :OUTPUT ACCEPT [0:0]
658 -A INPUT -m state --state RELATED,ESTABLISHED -j ACCEPT
659 -A INPUT -p tcp -m tcp --dport 22 -j ACCEPT
660 -A INPUT -s 192.168.1.0/255.255.255.0 -i eth1 -j ACCEPT
661 -A INPUT -s 127.0.0.1 -i lo -j ACCEPT
662 -A INPUT -p icmp -j ACCEPT
663 -A INPUT -j LOG
664 -A INPUT -j REJECT --reject-with icmp-port-unreachable
665 COMMIT
666 *nat
667 :PREROUTING ACCEPT [0:0]
668 :POSTROUTING ACCEPT [0:0]
669 :OUTPUT ACCEPT [0:0]
670 -A POSTROUTING -s 192.168.1.0/255.255.255.0 -j MASQUERADE
671 COMMIT
672 </pre>
673
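<p>
The rules above are in <c>iptables-save</c> format. You can load them by hand
with <c>iptables-restore</c>, or enter the rules with the <c>iptables</c>
command and let the init script write them out; on Gentoo the iptables init
script normally offers a <c>save</c> command and reads its rule file path from
<path>/etc/conf.d/iptables</path>. A hedged example; note that masquerading
also requires IP forwarding on the master.
</p>

<pre caption="Loading and saving the firewall rules (illustrative)">
# <i>iptables-restore &lt; /var/lib/iptables/rule-save</i>
# <i>/etc/init.d/iptables save</i>
# <i>echo 1 > /proc/sys/net/ipv4/ip_forward</i>
</pre>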
674 <p>
675 Then add iptables to the default runlevel of all your nodes:
676 </p>
677
678 <pre caption="Adding iptables to the default runlevel">
679 # <i>rc-update add iptables default</i>
680 </pre>
681
682 </body>
683 </section>
684 </chapter>
685
686 <chapter>
687 <title>HPC Tools</title>
688 <section>
689 <title>OpenPBS</title>
690 <body>
691
692 <p>
693 The Portable Batch System (PBS) is a flexible batch queueing and workload
694 management system originally developed for NASA. It operates on networked,
695 multi-platform UNIX environments, including heterogeneous clusters of
696 workstations, supercomputers, and massively parallel systems. Development of
697 PBS is provided by Altair Grid Technologies.
698 </p>
699
700 <pre caption="Installing openpbs">
701 # <i>emerge -a openpbs</i>
702 </pre>
703
704 <note>
705 The OpenPBS ebuild does not currently set proper permissions on the var
706 directories used by OpenPBS.
707 </note>
708
709 <p>
710 Before you start using OpenPBS, some configuration is required. The files
711 you will need to personalize for your system are:
712 </p>
713
714 <ul>
715 <li>/etc/pbs_environment</li>
716 <li>/var/spool/PBS/server_name</li>
717 <li>/var/spool/PBS/server_priv/nodes</li>
718 <li>/var/spool/PBS/mom_priv/config</li>
719 <li>/var/spool/PBS/sched_priv/sched_config</li>
720 </ul>
721
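<p>
As an illustration only (the values below are guesses based on the cluster
layout used throughout this guide, so adapt them to your site), the server
name and node list files could look like this:
</p>

<pre caption="Sample OpenPBS configuration files (illustrative)">
<comment>(/var/spool/PBS/server_name)</comment>
master

<comment>(/var/spool/PBS/server_priv/nodes)</comment>
node01 np=1
node02 np=1

<comment>(/var/spool/PBS/mom_priv/config)</comment>
$clienthost master
</pre>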
722 <p>
723 Here is a sample sched_config:
724 </p>
725
726 <pre caption="/var/spool/PBS/sched_priv/sched_config">
727 #
728 # Create queues and set their attributes.
729 #
730 #
731 # Create and define queue upto4nodes
732 #
733 create queue upto4nodes
734 set queue upto4nodes queue_type = Execution
735 set queue upto4nodes Priority = 100
736 set queue upto4nodes resources_max.nodect = 4
737 set queue upto4nodes resources_min.nodect = 1
738 set queue upto4nodes enabled = True
739 set queue upto4nodes started = True
740 #
741 # Create and define queue default
742 #
743 create queue default
744 set queue default queue_type = Route
745 set queue default route_destinations = upto4nodes
746 set queue default enabled = True
747 set queue default started = True
748 #
749 # Set server attributes.
750 #
751 set server scheduling = True
752 set server acl_host_enable = True
753 set server default_queue = default
754 set server log_events = 511
755 set server mail_from = adm
756 set server query_other_jobs = True
757 set server resources_default.neednodes = 1
758 set server resources_default.nodect = 1
759 set server resources_default.nodes = 1
760 set server scheduler_iteration = 60
761 </pre>
762
763 <p>
764 To submit a task to OpenPBS, the command <c>qsub</c> is used with some
765 optional parameters. In the example below, "-l" allows you to specify
766 the resources required, "-j" provides for redirection of standard out and
767 standard error, and the "-m" will e-mail the user at beginning (b), end (e)
768 and on abort (a) of the job.
769 </p>
770
771 <pre caption="Submitting a task">
772 <comment>(submit and request from OpenPBS that myscript be executed on 2 nodes)</comment>
773 # <i>qsub -l nodes=2 -j oe -m abe myscript</i>
774 </pre>
775
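<p>
The same options can also be embedded in the submitted script as <c>#PBS</c>
directives. A minimal illustrative job script (the resource values are only
examples):
</p>

<pre caption="Sample PBS job script (illustrative)">
#!/bin/sh
#PBS -l nodes=2
#PBS -j oe
#PBS -m abe

cd $PBS_O_WORKDIR
echo "Running on the following nodes:"
cat $PBS_NODEFILE
</pre>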
776 <p>
777 Normally jobs submitted to OpenPBS are in the form of scripts. Sometimes, you
778 may want to try a task manually. To request an interactive shell from OpenPBS,
779 use the "-I" parameter.
780 </p>
781
782 <pre caption="Requesting an interactive shell">
783 # <i>qsub -I</i>
784 </pre>
785
786 <p>
787 To check the status of your jobs, use the qstat command:
788 </p>
789
790 <pre caption="Checking the status of the jobs">
791 # <i>qstat</i>
792 Job id Name User Time Use S Queue
793 ------ ---- ---- -------- - -----
794 2.geist STDIN adelie 0 R upto1nodes
795 </pre>
796
797 </body>
798 </section>
799 <section>
800 <title>MPICH</title>
801 <body>
802
803 <p>
804 Message passing is a paradigm used widely on certain classes of parallel
805 machines, especially those with distributed memory. MPICH is a freely
806 available, portable implementation of MPI, the Standard for message-passing
807 libraries.
808 </p>
809
810 <p>
811 The mpich ebuild provided by Adelie Linux allows for two USE flags:
812 <e>doc</e> and <e>crypt</e>. <e>doc</e> will cause documentation to be
813 installed, while <e>crypt</e> will configure MPICH to use <c>ssh</c> instead
814 of <c>rsh</c>.
815 </p>
816
817 <pre caption="Installing the mpich application">
818 # <i>emerge -a mpich</i>
819 </pre>
820
821 <p>
822 You may need to export an MPICH work directory to all your slave nodes in
823 <path>/etc/exports</path>:
824 </p>
825
826 <pre caption="/etc/exports">
827 /home *(rw)
828 </pre>
829
830 <p>
831 Most massively parallel processors (MPPs) provide a way to start a program on
832 a requested number of processors; <c>mpirun</c> makes use of the appropriate
833 command whenever possible. In contrast, workstation clusters require that each
834 process in a parallel job be started individually, though programs to help
835 start these processes exist. Because workstation clusters are not already
836 organized as an MPP, additional information is required to make use of them.
837 Mpich should be installed with a list of participating workstations in the
838 file <path>machines.LINUX</path> in the directory
839 <path>/usr/share/mpich/</path>. This file is used by <c>mpirun</c> to choose
840 processors to run on.
841 </p>
842
843 <p>
844 Edit this file to reflect your cluster LAN configuration:
845 </p>
846
847 <pre caption="/usr/share/mpich/machines.LINUX">
848 # Change this file to contain the machines that you want to use
849 # to run MPI jobs on. The format is one host name per line, with either
850 # hostname
851 # or
852 # hostname:n
853 # where n is the number of processors in an SMP. The hostname should
854 # be the same as the result from the command "hostname"
855 master
856 node01
857 node02
858 # node03
859 # node04
860 # ...
861 </pre>
862
863 <p>
864 Use the script <c>tstmachines</c> in <path>/usr/sbin/</path> to ensure that
865 you can use all of the machines that you have listed. This script performs
866 an <c>rsh</c> and a short directory listing; this tests both that you have
867 access to the node and that a program in the current directory is visible on
868 the remote node. If there are any problems, they will be listed. These
869 problems must be fixed before proceeding.
870 </p>
871
872 <p>
873 The only argument to <c>tstmachines</c> is the name of the architecture; this
874 is the same name as the extension on the machines file. For example, the
875 following tests that a program in the current directory can be executed by
876 all of the machines in the LINUX machines list.
877 </p>
878
879 <pre caption="Running a test">
880 # <i>/usr/local/mpich/sbin/tstmachines LINUX</i>
881 </pre>
882
883 <note>
884 This program is silent if all is well; if you want to see what it is doing,
885 use the -v (for verbose) argument:
886 </note>
887
888 <pre caption="Running a test verbosively">
889 # <i>/usr/local/mpich/sbin/tstmachines -v LINUX</i>
890 </pre>
891
892 <p>
893 The output from this command might look like:
894 </p>
895
896 <pre caption="Output of the above command">
897 Trying true on host1.uoffoo.edu ...
898 Trying true on host2.uoffoo.edu ...
899 Trying ls on host1.uoffoo.edu ...
900 Trying ls on host2.uoffoo.edu ...
901 Trying user program on host1.uoffoo.edu ...
902 Trying user program on host2.uoffoo.edu ...
903 </pre>
904
905 <p>
906 If <c>tstmachines</c> finds a problem, it will suggest possible reasons and
907 solutions. In brief, there are three tests:
908 </p>
909
910 <ul>
911 <li>
912 <e>Can processes be started on remote machines?</e> tstmachines attempts
913 to run the shell command true on each machine in the machines file by
914 using the remote shell command.
915 </li>
916 <li>
917 <e>Is the current working directory available to all machines?</e> This
918 attempts to list a file that tstmachines creates, by running ls via the
919 remote shell command.
920 </li>
921 <li>
922 <e>Can user programs be run on remote systems?</e> This checks that shared
923 libraries and other components have been properly installed on all
924 machines.
925 </li>
926 </ul>
927
928 <p>
929 And the required test for every development tool:
930 </p>
931
932 <pre caption="Testing a development tool">
933 # <i>cd ~</i>
934 # <i>cp /usr/share/mpich/examples1/hello++.c ~</i>
935 # <i>make hello++</i>
936 # <i>mpirun -machinefile /usr/share/mpich/machines.LINUX -np 1 hello++</i>
937 </pre>
938
939 <p>
940 For further information on MPICH, consult the documentation at <uri
941 link="http://www-unix.mcs.anl.gov/mpi/mpich/docs/mpichman-chp4/mpichman-chp4.htm">http://www-unix.mcs.anl.gov/mpi/mpich/docs/mpichman-chp4/mpichman-chp4.htm</uri>.
942 </p>
943
944 </body>
945 </section>
946 <section>
947 <title>LAM</title>
948 <body>
949
950 <p>
951 (Coming Soon!)
952 </p>
953
954 </body>
955 </section>
956 <section>
957 <title>OMNI</title>
958 <body>
959
960 <p>
961 (Coming Soon!)
962 </p>
963
964 </body>
965 </section>
966 </chapter>
967
968 <chapter>
969 <title>Bibliography</title>
970 <section>
971 <body>
972
973 <p>
974 The original document is published at the <uri
975 link="http://www.adelielinux.com">Adelie Linux R&amp;D Centre</uri> web site,
976 and is reproduced here with the permission of the authors and <uri
977 link="http://www.cyberlogic.ca">Cyberlogic</uri>'s Adelie Linux R&amp;D
978 Centre.
979 </p>
980
981 <ul>
982 <li><uri>http://www.gentoo.org</uri>, Gentoo Foundation, Inc.</li>
983 <li>
984 <uri link="http://www.adelielinux.com">http://www.adelielinux.com</uri>,
985 Adelie Linux Research and Development Centre
986 </li>
987 <li>
988 <uri link="http://nfs.sourceforge.net/">http://nfs.sourceforge.net</uri>,
989 Linux NFS Project
990 </li>
991 <li>
992 <uri link="http://www-unix.mcs.anl.gov/mpi/mpich/">http://www-unix.mcs.anl.gov/mpi/mpich/</uri>,
993 Mathematics and Computer Science Division, Argonne National Laboratory
994 </li>
995 <li>
996 <uri link="http://www.ntp.org/">http://ntp.org</uri>
997 </li>
998 <li>
999 <uri link="http://www.eecis.udel.edu/~mills/">http://www.eecis.udel.edu/~mills/</uri>,
1000 David L. Mills, University of Delaware
1001 </li>
1002 <li>
1003 <uri link="http://www.ietf.org/html.charters/secsh-charter.html">http://www.ietf.org/html.charters/secsh-charter.html</uri>,
1004 Secure Shell Working Group, IETF, Internet Society
1005 </li>
1006 <li>
1007 <uri link="http://www.linuxsecurity.com/">http://www.linuxsecurity.com/</uri>,
1008 Guardian Digital
1009 </li>
1010 <li>
1011 <uri link="http://www.openpbs.org/">http://www.openpbs.org/</uri>,
1012 Altair Grid Technologies, LLC.
1013 </li>
1014 </ul>
1015
1016 </body>
1017 </section>
1018 </chapter>
1019
1020 </guide>
