Bull HPC BAS4 Administration Guide

Category
Software
Type
Administration Guide
HPC BAS4
Administrator's Guide
HPC
REFERENCE
86 A2 30ER 12
HPC
HPC BAS4
Administrator's Guide
Hardware and Software
April 2008
BULL CEDOC
357 AVENUE PATTON
B.P.20845
49008 ANGERS CEDEX 01
FRANCE
REFERENCE
86 A2 30ER 12
The following copyright notice protects this book under Copyright laws which prohibit such actions as, but not limited
to, copying, distributing, modifying, and making derivative works.
Copyright
Bull SAS 2005, 2008
Printed in France
Suggestions and criticisms concerning the form, content, and presentation of this
book are invited. A form is provided at the end of this book for this purpose.
To order additional copies of this book or other Bull Technical Publications, you
are invited to use the Ordering Form also provided at the end of this book.
Trademarks and Acknowledgements
We acknowledge the rights of the proprietors of the trademarks mentioned in this manual.
All brand names and software and hardware product names are subject to trademark and/or patent protection.
Quoting of brand and product names is for information purposes only and does not represent trademark misuse.
The information in this document is subject to change without notice. Bull will not be liable for errors
contained herein, or for incidental or consequential damages in connection with the use of this material.
Preface i
Preface
Scope and Objectives
The purpose of this guide is to explain how to configure and manage Bull High
Performance Computing (HPC) clusters, using the administration tools recommended by
Bull.
It is not in the scope of this guide to describe in depth the Linux administration functions.
For this information, please refer to the standard Linux distribution documentation.
Intended Readers
This guide is for HPC cluster system administrators.
Prerequisites
The installation of all hardware and software HPC components must have been completed.
Structure
This guide is organized as follows:
Chapter 1. Explains some General Concepts for Bull Linux HPC systems.
Chapter 2. HPC Configuration
Describes some basic configuration tasks including password
management, security settings. It also describes how to run parallel
commands, and kernel tuning parameters.
Chapter 3. Cluster Database Management
Describes the commands and the tools which enable the administrator
to display and to change the Cluster Database.
Chapter 4. Parallel File Systems
Explains how these file systems operate on a Bull HPC system. It
describes in detail how to install, configure and manage the Lustre file
system.
Chapter 5. Software Deployment
Describes how to use KSIS to deploy, manage, modify and check
software images.
Chapter 6. Resource Management
Explains how Quadrics RMS and SLURM Resource Managers work in
order to have an optimal management of the resources.
Chapter 7. Batch Management
Explains how to optimize the execution of programs using TORQUE.
ii HPC BAS4 - Administrator's Guide
Chapter 8. Monitoring
Describes NovaScale Master - HPC Edition monitoring tool for Bull HPC
systems.
Chapter 9. Storage Devices Management
Explains how to setup the management environment for storage
devices, and how to use storage management services.
Chapter 10. Kerberos – Network Authentication Protocol
Describes how to set up and use Kerberos.
Chapter 11. Cluster High Availability
Explains the main concepts for implementing High Availability on a Bull
HPC system.
Chapter 12. Management Node High Availability
Explains how to implement Management Node High Availability using
Cluster Suite and specific HA scripts.
Chapter 13. I/O Node and Lustre File System High Availability
Explains how to implement High Availability for I/O Nodes and Lustre
file system.
Bibliography
• Bull HPC BAS4 Installation and Configuration Guide (86 A2 28ER).
• Bull HPC BAS4 User's Guide (86 A2 29ER).
• Bull HPC BAS4 Application Tuning Guide (86 A2 19ER).
• Bull HPC BAS4 Maintenance Guide (86 A2 46ER).
• Bull NovaScale Master Remote Hardware Management CLI Reference Manual
(86 A2 88EM).
• The Software Release Bulletin (SRB) provides release-specific information, and details
of restrictions resulting from known problems.
• Bull Voltaire Switches Documentation CD (86 A2 79ET)
Highlighting
• Commands entered by the user are in a frame in "Courier" font. Example:
mkdir /var/lib/newdir
• Commands, files, directories and other items whose names are predefined by the
system are in "Bold". Example:
The /etc/sysconfig/dump file.
• Text and messages displayed by the system to illustrate explanations are in "Courier
New" font. Example:
BIOS Intel
Preface iii
• Text for values to be entered in by the user is in "Courier New". Example:
COM1
• Italics Identifies referenced publications, chapters, sections, figures, and tables.
• < > identifies parameters to be supplied by the user. Example:
<node_name>
Warning:
A Warning notice indicates an action that could cause damage to a program, device,
system, or data.
iv HPC BAS4 - Administrator's Guide
Table of Contents
v
Table of Contents
Chapter 1. General Concepts .......................................................................... 1-1
1.1 Cluster Architecture ......................................................................................................... 1-1
1.2 Management Functions and Corresponding Products........................................................... 1-3
Chapter 2. HPC Configuration ......................................................................... 2-1
2.1 Configuring Services ....................................................................................................... 2-1
2.2 Modifying Passwords and Creating Users .......................................................................... 2-1
2.3 Managing Partitions ........................................................................................................ 2-2
2.4 Creating Swap Partitions.................................................................................................. 2-3
2.5 Configuring Security........................................................................................................ 2-4
2.5.1 Setting up SSH ...................................................................................................... 2-4
2.6 Running Parallel Commands with pdsh .............................................................................. 2-6
2.6.1 Using pdsh............................................................................................................ 2-6
2.6.2 Using pdcp ........................................................................................................... 2-9
2.6.3 Using dshbak ........................................................................................................ 2-9
2.7 Day to Day Maintenance Operations .............................................................................. 2-11
Chapter 3. Cluster Database Management......................................................... 3-1
3.1 Architecture of ClusterDB.................................................................................................. 3-1
3.2 ClusterDB Administrator ................................................................................................... 3-2
3.3 Using Commands............................................................................................................ 3-3
3.3.1 ChangeOwnerProperties......................................................................................... 3-4
3.3.2 dbmConfig............................................................................................................ 3-7
3.3.3 dbmCluster.......................................................................................................... 3-10
3.3.4 dbmNode ........................................................................................................... 3-11
3.3.5 dbmHwManager ................................................................................................. 3-14
3.3.6 dbmGroup .......................................................................................................... 3-15
3.3.7 dbmEthernet ........................................................................................................ 3-18
3.3.8 dbmIconnect........................................................................................................ 3-20
3.3.9 dbmTalim............................................................................................................ 3-22
3.3.10 dbmSerial ........................................................................................................... 3-24
3.3.11 dbmFiberChannel ................................................................................................ 3-26
3.3.12 dbmServices........................................................................................................ 3-28
3.3.13 dbmDiskArray ..................................................................................................... 3-29
3.4 Managing the ClusterDB ................................................................................................ 3-31
3.4.1 Saving and Restoring the Database........................................................................ 3-31
3.4.2 Starting and Stopping PostgreSQL ......................................................................... 3-33
3.4.3 Viewing the PostgreSQL Alert Log .......................................................................... 3-33
3.5 ClusterDB Modeling....................................................................................................... 3-34
3.5.1 Physical View of the Cluster Networks .................................................................... 3-34
3.5.2 Physical View of the Storage ................................................................................. 3-42
vi HPC BAS4 - Administrator's Guide
3.5.3
Machine View ..................................................................................................... 3-49
3.5.4 HWMANAGER View ........................................................................................... 3-55
3.5.5 Complementary Tables ......................................................................................... 3-57
3.5.6 NsDoctor View .................................................................................................... 3-59
3.5.7 Nagios View ....................................................................................................... 3-61
3.5.8 Lustre View.......................................................................................................... 3-62
Chapter 4. Parallel File Systems ........................................................................4-1
4.1 Parallel File Systems Overview .......................................................................................... 4-1
4.2 Lustre Overview............................................................................................................... 4-2
4.3 Lustre Administrator’s Role ................................................................................................ 4-3
4.4 Planning a Lustre System .................................................................................................. 4-4
4.4.1 Data Pipelines........................................................................................................ 4-4
4.4.2 OSS / OST Distribution .......................................................................................... 4-4
4.4.3 MDS / MDT Distribution ......................................................................................... 4-4
4.4.4 File Striping ........................................................................................................... 4-5
4.4.5 Lustre File System Limitations.................................................................................... 4-5
4.5 Lustre System Management............................................................................................... 4-6
4.5.1 The Lustre Database ............................................................................................... 4-6
4.5.2 /etc/lustre/storage.conf for Lustre Tools without ClusterDB.......................................... 4-8
4.5.3 Lustre Management Configuration File: /etc/lustre/lustre.cfg .................................... 4-13
4.5.4 Lustre Services Definition ....................................................................................... 4-15
4.5.5 Creating Lustre File Systems................................................................................... 4-17
4.6 Installing and Managing Lustre File Systems...................................................................... 4-21
4.6.1 Installing Lustre File Systems using lustre_util ............................................................ 4-21
4.6.2 Removing Lustre File Systems using lustre_util........................................................... 4-21
4.6.3 lustre_util Actions and Options............................................................................... 4-21
4.6.4 lustre_util Configuration File /etc/lustre/lustre_util.conf............................................. 4-33
4.6.5 Lustre Tuning File /etc/lustre/tuning.conf ................................................................ 4-34
4.6.6 Lustre Filesystem Reconfiguration............................................................................ 4-35
4.6.7 Using Quotas with Lustre File Systems..................................................................... 4-36
4.7 Monitoring Lustre System................................................................................................ 4-39
4.7.1 Lustre System Health Supervision............................................................................ 4-39
4.7.2 Lustre Filesystem Indicator...................................................................................... 4-41
4.7.3 Lustre System Performance Supervision ................................................................... 4-43
Chapter 5. Software Deployment (KSIS) .............................................................5-1
5.1 Overview ....................................................................................................................... 5-2
5.2 Configuring and Verifying a Reference Node ..................................................................... 5-3
5.3 Main Steps for Deployment............................................................................................... 5-4
5.4 Modifying Images and Managing their Release .................................................................. 5-5
5.4.1 Methods................................................................................................................ 5-5
5.4.2 Naming Images or Patches with the Workon Mechanism............................................ 5-6
5.4.3 Image Types .......................................................................................................... 5-7
5.5 Checking Deployed Images .............................................................................................. 5-8
Table of Contents
vii
5.5.1 Checking Principles................................................................................................ 5-8
5.5.2 Check Groups ....................................................................................................... 5-8
5.5.3 Modifying the Checks Database .............................................................................. 5-9
5.5.4 Examining the Results ........................................................................................... 5-10
5.5.5 Looking at the Discrepancies ................................................................................. 5-10
5.6 Importing and Exporting an Image .................................................................................. 5-11
5.7 Working with Secondary Images .................................................................................... 5-12
5.7.1 Disk Partitioning Constraints .................................................................................. 5-12
5.7.2 Managing Primary/Secondary Images................................................................... 5-13
5.8 Ksis Commands ............................................................................................................ 5-15
5.8.1 Syntax ................................................................................................................ 5-15
5.8.2 Advanced ksis create options ................................................................................ 5-15
5.8.3 Creating the Image of the Reference Node ............................................................. 5-16
5.8.4 Deleting an Image or a Patch ................................................................................ 5-16
5.8.5 Deploying an Image or a Patch ............................................................................. 5-17
5.8.6 Removing a Patch ................................................................................................ 5-17
5.8.7 Getting Information about an Image or a Node....................................................... 5-17
5.8.8 Listing Images on the Image Server ........................................................................ 5-17
5.8.9 Listing Images by Nodes....................................................................................... 5-18
5.9 Modifying an Image...................................................................................................... 5-19
5.9.1 Creating a Working Patch Image .......................................................................... 5-19
5.9.2 Creating a Patch Image ........................................................................................ 5-19
5.9.3 Creating a Patched Golden Image......................................................................... 5-20
5.9.4 Building a Patch................................................................................................... 5-20
5.10 Checking Images .......................................................................................................... 5-21
5.11 Importing and Exporting Images ..................................................................................... 5-21
5.12 Setting Boot Mode......................................................................................................... 5-21
Chapter 6. Resource Management.................................................................... 6-1
6.1 Managing Resources with RMS......................................................................................... 6-1
6.1.1 Managing Partitions............................................................................................... 6-2
6.1.2 Managing Limits .................................................................................................... 6-2
6.1.3 Synchronizing Time................................................................................................ 6-6
6.1.4 Getting Topology Information with rinfo.................................................................... 6-6
6.1.5 Viewing processor information in /proc/cpuinfo ....................................................... 6-6
6.1.6 Day to Day RMS Operations ................................................................................... 6-8
6.1.7 More Information ................................................................................................. 6-14
6.2 BLBS: Bull Load-Balancing System and RMS...................................................................... 6-15
6.2.1 Overview............................................................................................................ 6-15
6.2.2 Using BLBS.......................................................................................................... 6-15
6.2.3 Configuring BLBS................................................................................................. 6-15
6.2.4 BLBS Administration ............................................................................................. 6-16
6.3 Resource Management with SLURM................................................................................. 6-17
6.3.1 SLURM Key Functions ........................................................................................... 6-17
6.3.2 SLURM Components............................................................................................. 6-18
6.3.3 SLURM Daemons ................................................................................................. 6-18
viii HPC BAS4 - Administrator's Guide
6.3.4
Scheduler Types................................................................................................... 6-20
6.4 SLURM Configuration..................................................................................................... 6-22
6.4.1 Configuration Parameters...................................................................................... 6-23
6.4.2 slurm.conf Example Files ....................................................................................... 6-36
6.4.3 SCONTROL – Managing the SLURM Configuration.................................................. 6-38
6.4.4 Pam_Slurm Module Configuration .......................................................................... 6-45
6.5 Administrating Cluster Activity with SLURM....................................................................... 6-47
6.5.1 Starting the Daemons............................................................................................ 6-47
6.5.2 Starting and Stopping the SLURM Daemons with slurm.sh ......................................... 6-48
6.5.3 SLURMCTLD (Controller Daemon)........................................................................... 6-49
6.5.4 SLURMD (Compute Node Daemon) ........................................................................ 6-50
6.5.5 Scheduler Support................................................................................................ 6-51
6.5.6 Node Selection.................................................................................................... 6-52
6.5.7 Logging .............................................................................................................. 6-52
6.5.8 Corefile Format .................................................................................................... 6-52
6.5.9 Security............................................................................................................... 6-52
6.5.10 SLURM Cluster Administration Examples ................................................................. 6-52
Chapter 7. Batch Management (TORQUE)..........................................................7-1
7.1 TORQUE Features ........................................................................................................... 7-2
7.2 TORQUE Architecture ...................................................................................................... 7-3
7.3 Configuration Files .......................................................................................................... 7-4
7.4 Configuring Passwordless Access for TORQUE ................................................................... 7-4
7.5 Configuring TORQUE ...................................................................................................... 7-5
7.5.1 On the Compute Nodes.......................................................................................... 7-5
7.5.2 On the Server ........................................................................................................ 7-5
7.6 Configuring TORQUE and RMS ........................................................................................ 7-7
7.6.1 Architecture ........................................................................................................... 7-7
7.6.2 Configuration ........................................................................................................ 7-7
7.7 Common Commands ....................................................................................................... 7-9
7.7.1 TORQUE Commands for Job Management................................................................ 7-9
7.7.2 Commands to start a Program inside a Job ............................................................... 7-9
7.7.3 Commands to Display Status ................................................................................... 7-9
7.7.4 Administrator’s Commands and Configuration Files.................................................... 7-9
Chapter 8. Monitoring with NovaScale Master - HPC Edition ................................8-1
8.1 Launching NovaScale Master - HPC Edition........................................................................ 8-2
8.2 Access Rights .................................................................................................................. 8-3
8.2.1 Administrator Access Rights..................................................................................... 8-3
8.2.2 Standard User Access Rights ................................................................................... 8-3
8.2.3 Adding Users and Changing Passwords ................................................................... 8-3
8.3 Hosts, Services and Contacts for Nagios............................................................................ 8-4
8.4 Using NovaScale Master - HPC Edition.............................................................................. 8-5
8.4.1 NovaScale Master - HPC Edition – View Levels.......................................................... 8-5
Table of Contents
ix
8.5 Map Button .................................................................................................................... 8-6
8.5.1 All Status Map View............................................................................................... 8-6
8.5.2 Rack View............................................................................................................. 8-7
8.5.3 Host Services detailed View .................................................................................... 8-7
8.5.4 Ping Map View...................................................................................................... 8-8
8.6 Status Button...................................................................................................................8-9
8.7 Alerts Button ................................................................................................................. 8-10
8.7.1 Active Checks...................................................................................................... 8-11
8.7.2 Passive Checks .................................................................................................... 8-11
8.7.3 Notifications........................................................................................................ 8-12
8.7.4 Acknowledgments................................................................................................ 8-12
8.7.5 Comments........................................................................................................... 8-13
8.7.6 Logs ................................................................................................................... 8-14
8.7.7 Alert Definition .................................................................................................... 8-14
8.7.8 Running a Script .................................................................................................. 8-15
8.7.9 Generating SNMP Alerts ...................................................................................... 8-16
8.7.10 Resetting an Alert Back to OK................................................................................ 8-16
8.7.11 nsmhpc.conf configuration file ............................................................................... 8-16
8.8 Storage Overview ......................................................................................................... 8-17
8.9 Shell ............................................................................................................................ 8-18
8.10 Monitoring the Performance - Ganglia Statistics ................................................................ 8-18
8.11 Group Performance View............................................................................................... 8-19
8.12 Global Performance View .............................................................................................. 8-20
8.12.1 Modifying the Performance Graphics Views............................................................ 8-21
8.12.2 Refresh Period for the Performance View Web Pages ............................................... 8-22
8.13 Configuring and Modifying Nagios Services .................................................................... 8-23
8.13.1 Configuring Using the Database............................................................................ 8-23
8.13.2 Modifying Nagios Services................................................................................... 8-23
8.13.3 Changing the Verification Frequency...................................................................... 8-24
8.14 General Nagios Services ............................................................................................... 8-25
8.14.1 Ethernet Interfaces................................................................................................ 8-25
8.14.2 RMS Status.......................................................................................................... 8-25
8.14.3 Hardware Status.................................................................................................. 8-25
8.14.4 Temperature ........................................................................................................ 8-25
8.14.5 Alert Log ............................................................................................................. 8-26
8.14.6 I/O Status........................................................................................................... 8-26
8.14.7 Postbootchecker................................................................................................... 8-26
8.15 Management Node Nagios Services............................................................................... 8-27
8.15.1 MiniSQL Daemon ................................................................................................ 8-27
8.15.2 RMS Daemon ...................................................................................................... 8-27
8.15.3 Quadrics Switch Manager .................................................................................... 8-27
8.15.4 ClusterDB............................................................................................................ 8-27
8.15.5 Cron Daemon...................................................................................................... 8-27
8.15.6 Compute Power Available..................................................................................... 8-27
8.15.7 Global Filesystems bandwidth available ................................................................. 8-28
8.15.8 Storage Arrays available ...................................................................................... 8-28
x HPC BAS4 - Administrator's Guide
8.15.9
Global Filesystem Usage....................................................................................... 8-28
8.15.10 I/O pairs Migration Alert...................................................................................... 8-28
8.15.11 Backbone Ports Available...................................................................................... 8-28
8.15.12 Quadrics Ports Available....................................................................................... 8-28
8.15.13 HA System Status ................................................................................................. 8-29
8.15.14 Kerberos KDC Daemon......................................................................................... 8-29
8.15.15 Kerberos Admin Daemon ...................................................................................... 8-29
8.15.16 LDAP Daemon...................................................................................................... 8-29
8.15.17 Lustre filesystems access ........................................................................................ 8-29
8.15.18 NFS filesystems access.......................................................................................... 8-30
8.15.19 InfiniBand Links available...................................................................................... 8-30
8.16 Ethernet Switch Services ................................................................................................. 8-31
8.16.1 Ethernet Interface ................................................................................................. 8-31
8.16.2 Power supply....................................................................................................... 8-31
8.16.3 Temperature ........................................................................................................ 8-31
8.16.4 Fans ................................................................................................................... 8-32
8.16.5 Ports ................................................................................................................... 8-32
8.17 Portserver Services......................................................................................................... 8-33
8.17.1 Temperature Threshold Service .............................................................................. 8-33
8.17.2 Current Threshold Service...................................................................................... 8-33
8.18 More Nagios Information ............................................................................................... 8-34
Chapter 9. Storage Device Management............................................................9-1
9.1 Overview of Storage Device Management for Bull HPC Clusters ........................................... 9-2
9.2 Monitoring Node I/O Status............................................................................................. 9-4
9.2.1 Managing I/O Reference Counters .......................................................................... 9-4
9.2.2 I/O Counters Definitions ......................................................................................... 9-6
9.2.3 Managing I/O Resources........................................................................................ 9-7
9.3 Monitoring Storage Devices.............................................................................................. 9-8
9.3.1 NovaScale Master - HPC Edition: Host and Service Monitoring for Storage Devices ...... 9-8
9.3.2 NovaScale Master - HPC Edition: Storage & I/O Information .................................... 9-13
9.3.3 Querying the Cluster Management Data Base ......................................................... 9-18
9.3.4 Checking the Logs for Storage Devices ................................................................... 9-19
9.4 Monitoring Brocade Switch Status ................................................................................... 9-20
9.5 Managing Storage Devices with Bull CLI .......................................................................... 9-23
9.5.1 Bull FDA Storage Systems...................................................................................... 9-23
9.5.2 DataDirect Networks Systems - DDN Commands ..................................................... 9-24
9.6 Using Management Tools ............................................................................................... 9-26
9.7 Configuring Storage Devices .......................................................................................... 9-27
9.7.1 Planning Tasks ..................................................................................................... 9-27
9.7.2 Deployment Service for Storage Systems................................................................. 9-28
9.7.3 Understanding the Configuration Deployment Service............................................... 9-28
9.7.4 Alternate Storage Configuration Process for Lustre I/O Nodes................................... 9-34
9.8 User Rights and Security Levels for the Storage Commands................................................. 9-38
9.8.1 Management Node.............................................................................................. 9-38
9.8.2 Other Node Types ............................................................................................... 9-39
Table of Contents
xi
9.8.3 Configuration Files............................................................................................... 9-39
Chapter 5. Kerberos - Network Authentication Protocol........................................ 5-1
5.1 Environment.................................................................................................................... 5-1
5.1.1 Kerberos Infrastructure............................................................................................ 5-1
5.1.2 Authentication of the SSHv2 Connections.................................................................. 5-1
5.2 KERBEROS Infrastructure Configuration.............................................................................. 5-2
5.2.1 secu0 Server including KDC Server and Administration Server .................................... 5-2
5.2.2 Configuration Files ................................................................................................. 5-2
5.2.3 Creating the Kerberos Database.............................................................................. 5-3
5.2.4 Creating the Kerberos Administrator......................................................................... 5-3
5.2.5 Starting the KDC Server.......................................................................................... 5-4
5.2.6 Adding Access Control List (ACL) Rights for the Kerberos Administrator Created............ 5-4
5.2.7 Starting the Administration Daemon ......................................................................... 5-4
5.2.8 Creating Principals Associated with Users................................................................. 5-5
5.2.9 Creating Principals Associated with Remote Kerberized Services ................................. 5-5
5.3 Configuring the secu1 Machine that hosts the Host Principal remote service........................... 5-6
5.3.1 Generating the key associated with the Host Principal remote service .......................... 5-6
5.4 Kerberos Authentication and SSH...................................................................................... 5-7
5.4.1 Configuring the SSH Server on the secu1machine...................................................... 5-7
5.4.2 SSH Client ............................................................................................................ 5-8
5.5 Troubleshooting Errors ..................................................................................................... 5-9
5.6 Generating Associated Keys for Nodes of a Cluster .......................................................... 5-10
5.7 Modifying the Lifespan and Renewal Period for TGT Tickets ............................................... 5-11
5.8 Including Addresses with Tickets ..................................................................................... 5-11
Chapter 11. Cluster High Availability................................................................ 11-1
Chapter 12. Management Node High Availability .............................................. 12-1
12.1 Introduction .................................................................................................................. 12-1
12.1.1 Possible Management Node failures ...................................................................... 12-2
12.2 Configuration ............................................................................................................... 12-3
12.2.1 Hardware ........................................................................................................... 12-3
12.2.2 Software ............................................................................................................. 12-4
12.2.3 Network ............................................................................................................. 12-5
12.3 Understanding High Availability Scripts for the Management Node .................................... 12-7
12.3.1 /etc/HA/bin/haservices script.............................................................................. 12-8
12.3.2 /etc/HA/start file ................................................................................................ 12-8
12.3.3 /etc/HA/status file .............................................................................................. 12-9
12.3.4 /etc/HA/stop file .............................................................................................. 12-10
12.3.5 hacron script ..................................................................................................... 12-11
12.3.6 haip script......................................................................................................... 12-11
12.3.7 halinks script ..................................................................................................... 12-12
12.3.8 hamount script................................................................................................... 12-13
12.3.9 hapidcleaner script ............................................................................................ 12-14
12.3.10 harms script....................................................................................................... 12-14
xii HPC BAS4 - Administrator's Guide
12.3.11
hasynchro script................................................................................................. 12-14
12.3.12 hasyslogng script ............................................................................................... 12-15
12.3.13 haunwantedfiles script ........................................................................................ 12-16
12.4 Understanding Channel Bonding (Optional) ................................................................... 12-17
12.5 Installing the Primary and Secondary Management Nodes............................................... 12-18
12.5.1 Prerequisites ...................................................................................................... 12-18
12.5.2 Configuring Channel Bonding ............................................................................. 12-19
12.5.3 Configuring Services .......................................................................................... 12-21
12.5.4 Implementing the Primary Node........................................................................... 12-24
12.5.5 Start HA on the Primary Node ............................................................................. 12-26
12.5.6 Implementing the Secondary Node ...................................................................... 12-26
12.6 Managing Cluster Suite................................................................................................ 12-28
12.6.1 /etc/HA/bin/cs ................................................................................................ 12-28
12.6.2 cs -config-admin ................................................................................................. 12-28
12.6.3 clustat ............................................................................................................... 12-29
12.6.4 clusvcadm ......................................................................................................... 12-29
12.7 Patching a Node / Updating an Application .................................................................. 12-30
12.7.1 Introduction ....................................................................................................... 12-30
12.7.2 Example of Updating RMS .................................................................................. 12-30
12.7.3 Example of Updating Lustre ................................................................................. 12-33
Chapter 13. I/O Node and Lustre File System High Availability.............................13-1
13.1 Introduction to Lustre File System...................................................................................... 13-1
13.2 Lustre Failover Mechanism.............................................................................................. 13-2
13.3 Hardware Architecture................................................................................................... 13-4
13.4 High Availability Policy .................................................................................................. 13-7
13.5 High Availability Management........................................................................................ 13-8
13.6 Error Detection and Prevention Mechanisms ................................................................... 13-10
13.7 Analysis of Failure Modes ............................................................................................ 13-11
13.7.1 I/O Node and Metadata Failures ........................................................................ 13-11
13.7.2 Storage Failures................................................................................................. 13-11
13.7.3 Ethernet Network Failures.................................................................................... 13-12
13.8 Using Cluster Suite (Cluster with Management Node) ...................................................... 13-13
13.8.1 Distributing the cluster.conf file on the I/O Node ................................................... 13-13
13.8.2 Starting / Stopping Cluster Suite's Daemons ......................................................... 13-14
13.8.3 Checking the Cluster Suite Status.......................................................................... 13-14
13.9 Managing Lustre High Availability (Cluster with Management Node) ................................ 13-15
13.9.1 ClusterDB Information ......................................................................................... 13-15
13.9.2 LDAP Directory – the lustre_ldap Utility.................................................................. 13-15
13.9.3 Failover Tools Configuration – the /etc/lustre/lustre.cfg File.................................... 13-17
13.9.4 Managing Lustre Failover Services on I/O and Metadata Nodes – the lustre_migrate
Tool.................................................................................................................. 13-17
13.9.5 Configuring File Systems for Failover .................................................................... 13-18
13.10 High Availability Management without a Management Node........................................... 13-19
Table of Contents
xiii
13.10.1 I/O Nodes Pair Configuration............................................................................. 13-20
13.10.2 Central Point Configuration: MDS Pair.................................................................. 13-21
13.10.3 High Availability Operations............................................................................... 13-22
13.11 Lustre High Availability Operations ............................................................................... 13-23
13.11.1 Service Migration triggered by Cluster Suite.......................................................... 13-23
13.11.2 Service Migration triggered by Administrator ........................................................ 13-23
13.12 Monitoring Lustre High Availability................................................................................ 13-25
13.12.1 Command Line Monitoring.................................................................................. 13-25
13.12.2 Graphic Monitoring ........................................................................................... 13-26
13.12.3 Traces and Debug.............................................................................................. 13-27
Glossary and Acronyms........................................................................................G-1
Index .................................................................................................................. I-1
xiv HPC BAS4 - Administrator's Guide
List of Figures
Figure 1-1.
A typical HPC architecture ........................................................................................... 1-2
Figure 1-2. Management Functions................................................................................................ 1-3
Figure 3-1. ClusterDB architecture ................................................................................................. 3-1
Figure 3-2. Cluster Network – diagram 1..................................................................................... 3-34
Figure 3-3. Cluster Network – diagram 2..................................................................................... 3-35
Figure 3-4. Storage physical view ............................................................................................... 3-42
Figure 3-5. Cluster Database – Machine view 1............................................................................ 3-49
Figure 3-6. Cluster Database – Machine view 2............................................................................ 3-50
Figure 3-7. HWManager view.................................................................................................... 3-55
Figure 3-8. Cluster Database – Complementary tables................................................................... 3-57
Figure 3-9. Cluster Database – NsDoctor view.............................................................................. 3-59
Figure 3-10. ClusterDB –Nagios View ........................................................................................... 3-61
Figure 3-11. Cluster Database – Lustre view ................................................................................... 3-62
Figure 4-1. NovaScale Master Map view..................................................................................... 4-39
Figure 4-2. NovaScale Nagios file system indicator ...................................................................... 4-41
Figure 4-3. Lustre Management Node web interface...................................................................... 4-42
Figure 4-4. Detailed view of Lustre file systems.............................................................................. 4-43
Figure 4-5. Group performance global view pop up window ......................................................... 4-44
Figure 4-6. Dispatched performance view pop up window............................................................. 4-44
Figure 4-7. Global performance view pop up window................................................................... 4-45
Figure 5-1. Main steps for deployment........................................................................................... 5-4
Figure 5-2. Image modification (workon, store, deploy, detach)........................................................ 5-6
Figure 5-3. Names of derived images or patches ............................................................................ 5-7
Figure 5-4. Primary system node partitions ................................................................................... 5-12
Figure 5-5. Secondary system node partitions............................................................................... 5-12
Figure 6-1. SLURM Simplified Architecture.................................................................................... 6-18
Figure 6-2. SLURM Architecture - Subsystems ................................................................................ 6-19
Figure 7-1. TORQUE Resource Manager Architecture ...................................................................... 7-3
Figure 7-2. TORQUE resource allocation........................................................................................ 7-7
Figure 8-1. NovaScale Master - HPC Edition opening view .............................................................. 8-5
Figure 8-2. Map button all status opening view............................................................................... 8-6
Figure 8-3. Rack view with the problems window at the bottom ........................................................ 8-7
Figure 8-4. Host Service details..................................................................................................... 8-8
Figure 8-5. Status overview screen ................................................................................................ 8-9
Figure 8-6. Alert Window showing the different alert states............................................................ 8-11
Figure 8-7. Hostgroups Reporting Notifications Window showing the Notification Levels................... 8-12
Figure 8-8. Status Monitoring Control window showing the links to add and delete comments............ 8-13
Figure 8-9. Monitoring Service Status window for a host................................................................ 8-14
Figure 8-10. Storage overview window ......................................................................................... 8-17
Figure 8-11. Group Performance view ........................................................................................... 8-19
Figure 8-12. Global overview for a host (top screen) ....................................................................... 8-20
Figure 8-13. Detailed monitoring view for a host (bottom half of screen displayed in Figure 8-12) ........ 8-21
Figure 8-14. Ethernet Switch services............................................................................................. 8-31
Figure 9-1. I/O Status – initial screen ............................................................................................ 9-4
Figure 9-2. NovaScale Master HPC Edition - I/O Status Details ........................................................ 9-5
Figure 9-3. NovaScale Master –HPC Edition – I/O Resources of a node............................................ 9-7
Figure 9-4. Detailed service status for a storage host ....................................................................... 9-9
Table of Contents
xv
Figure 9-5. Nova Scale Master opening console window with the Storage overview icon circled ....... 9-13
Figure 9-6. Storage overview ..................................................................................................... 9-15
Figure 9-7. Inventory view of faulty storage systems and components .............................................. 9-16
Figure 9-8. Storage detailed view ............................................................................................... 9-17
Figure 9-9. Nodes I/O Overview................................................................................................ 9-18
Figure 9-10. Detailed Service status of a brocade switch ................................................................. 9-22
Figure 9-11. Deployment service for storage configurations ............................................................. 9-30
Figure 12-1. High Availability Management Node configuration...................................................... 12-3
Figure 12-2. High Availability management network configuration ................................................... 12-3
Figure 13-1. Lustre interactions ..................................................................................................... 13-1
Figure 13-2. OST takeover and client recovery ............................................................................... 13-2
Figure 13-3. MDT takeover and client recovery .............................................................................. 13-2
Figure 13-4. I/O Cell diagram ..................................................................................................... 13-4
Figure 13-5. High Availability/Cluster Suite on Novascale 50xx nodes............................................. 13-5
Figure 13-6. High Availability/Cluster Suite on Novascale 40xx nodes............................................. 13-5
Figure 13-7. MDT/OST Dispatching on two nodes.......................................................................... 13-6
Figure 13-8 Lustre High-Availability Management architecture......................................................... 13-8
Figure 13-9 HA for a small cluster.............................................................................................. 13-19
Figure 13-10 Service migration triggered by Cluster Suite............................................................. 13-23
Figure 13-11 Service migration triggered by the Administrator...................................................... 13-24
Figure 13-12 NovaScale Master Map all status screen................................................................. 13-26
Figure 13-13 Lustre filesystem status indicator in the Host service status window ................................ 13-27
xvi HPC BAS4 - Administrator's Guide
List of Tables
Table 2-1.
Maintenance Tools.................................................................................................... 2-11
Table 3-1. Cluster Table ............................................................................................................ 3-36
Table 3-2. IP_NW table ............................................................................................................ 3-36
Table 3-3. ETH_SWITCH Table .................................................................................................. 3-37
Table 3-4. IC_NW Table........................................................................................................... 3-37
Table 3-5. IC_Switch Table........................................................................................................ 3-38
Table 3-6. Serial_NW Table...................................................................................................... 3-38
Table 3-7. PORTSERVER Table ................................................................................................... 3-39
Table 3-8. ETH_VLAN table....................................................................................................... 3-39
Table 3-9. FC_NW table........................................................................................................... 3-40
Table 3-10. FC_SWITCH table..................................................................................................... 3-41
Table 3-11. TALIM table.............................................................................................................. 3-41
Table 3-12. Storage – disk_array table......................................................................................... 3-44
Table 3-13. Storage – da_enclosure table ..................................................................................... 3-44
Table 3-14. Storage – da_disk_slot table ...................................................................................... 3-45
Table 3-15. Storage – da_controller table ..................................................................................... 3-45
Table 3-16. Storage – da_fc_port.table......................................................................................... 3-45
Table 3-17. Storage – da_serial_port table.................................................................................... 3-46
Table 3-18. Storage – da_ethernet_port Table ............................................................................... 3-46
Table 3-19. Storage – da_power_supply table............................................................................... 3-47
Table 3-20. Storage – da_fan table.............................................................................................. 3-47
Table 3-21. Storage – da_power_fan table ................................................................................... 3-47
Table 3-22. Storage – da_temperature_sensor table ....................................................................... 3-48
Table 3-23. da_io_path table ...................................................................................................... 3-48
Table 3-24. Storage – da_iocell_component table.......................................................................... 3-48
Table 3-25. Storage – da_cfg_model table.................................................................................... 3-49
Table 3-26. Storage – da_power_port table .................................................................................. 3-49
Table 3-27. Machine view – node table........................................................................................ 3-52
Table 3-28. Machine view – Node_image table ............................................................................ 3-52
Table 3-29. Machine view – Node_Profile table ............................................................................ 3-53
Table 3-30. Machine view – IC_BOARD table ............................................................................... 3-53
Table 3-31. Machine view – IPOIB Table ...................................................................................... 3-54
Table 3-32. Machine view – SDPOIB table.................................................................................... 3-54
Table 3-33. Machine view – FC_BOARD table .............................................................................. 3-54
Table 3-34. HWManager Table................................................................................................... 3-56
Table 3-35. Cluster Database – Admin table ................................................................................. 3-57
Table 3-36. Cluster Database – Rack table.................................................................................... 3-58
Table 3-37. Cluster Database – Config Candidate table ................................................................. 3-58
Table 3-38. Cluster database – Config_Status table ....................................................................... 3-58
Table 3-39. Cluster Database Group_Node table .......................................................................... 3-58
Table 3-40. Cluster Database NsDoctor – Test_Groups table........................................................... 3-59
Table 3-41. Cluster Database NsDoctor – Tests table...................................................................... 3-59
Table 3-42. Cluster Database NsDoctor – Test_Depencencies table.................................................. 3-60
Table 3-43. Cluster Database NsDoctor – Test Results table............................................................. 3-60
Table 3-44. Nagios Services Table............................................................................................... 3-61
Table 3-45. Nagios Availability Table .......................................................................................... 3-61
Table 3-46. Cluster Database – Lustre View – Lustre_fs table ........................................................... 3-63
  • Page 1 1
  • Page 2 2
  • Page 3 3
  • Page 4 4
  • Page 5 5
  • Page 6 6
  • Page 7 7
  • Page 8 8
  • Page 9 9
  • Page 10 10
  • Page 11 11
  • Page 12 12
  • Page 13 13
  • Page 14 14
  • Page 15 15
  • Page 16 16
  • Page 17 17
  • Page 18 18
  • Page 19 19
  • Page 20 20
  • Page 21 21
  • Page 22 22
  • Page 23 23
  • Page 24 24
  • Page 25 25
  • Page 26 26
  • Page 27 27
  • Page 28 28
  • Page 29 29
  • Page 30 30
  • Page 31 31
  • Page 32 32
  • Page 33 33
  • Page 34 34
  • Page 35 35
  • Page 36 36
  • Page 37 37
  • Page 38 38
  • Page 39 39
  • Page 40 40
  • Page 41 41
  • Page 42 42
  • Page 43 43
  • Page 44 44
  • Page 45 45
  • Page 46 46
  • Page 47 47
  • Page 48 48
  • Page 49 49
  • Page 50 50
  • Page 51 51
  • Page 52 52
  • Page 53 53
  • Page 54 54
  • Page 55 55
  • Page 56 56
  • Page 57 57
  • Page 58 58
  • Page 59 59
  • Page 60 60
  • Page 61 61
  • Page 62 62
  • Page 63 63
  • Page 64 64
  • Page 65 65
  • Page 66 66
  • Page 67 67
  • Page 68 68
  • Page 69 69
  • Page 70 70
  • Page 71 71
  • Page 72 72
  • Page 73 73
  • Page 74 74
  • Page 75 75
  • Page 76 76
  • Page 77 77
  • Page 78 78
  • Page 79 79
  • Page 80 80
  • Page 81 81
  • Page 82 82
  • Page 83 83
  • Page 84 84
  • Page 85 85
  • Page 86 86
  • Page 87 87
  • Page 88 88
  • Page 89 89
  • Page 90 90
  • Page 91 91
  • Page 92 92
  • Page 93 93
  • Page 94 94
  • Page 95 95
  • Page 96 96
  • Page 97 97
  • Page 98 98
  • Page 99 99
  • Page 100 100
  • Page 101 101
  • Page 102 102
  • Page 103 103
  • Page 104 104
  • Page 105 105
  • Page 106 106
  • Page 107 107
  • Page 108 108
  • Page 109 109
  • Page 110 110
  • Page 111 111
  • Page 112 112
  • Page 113 113
  • Page 114 114
  • Page 115 115
  • Page 116 116
  • Page 117 117
  • Page 118 118
  • Page 119 119
  • Page 120 120
  • Page 121 121
  • Page 122 122
  • Page 123 123
  • Page 124 124
  • Page 125 125
  • Page 126 126
  • Page 127 127
  • Page 128 128
  • Page 129 129
  • Page 130 130
  • Page 131 131
  • Page 132 132
  • Page 133 133
  • Page 134 134
  • Page 135 135
  • Page 136 136
  • Page 137 137
  • Page 138 138
  • Page 139 139
  • Page 140 140
  • Page 141 141
  • Page 142 142
  • Page 143 143
  • Page 144 144
  • Page 145 145
  • Page 146 146
  • Page 147 147
  • Page 148 148
  • Page 149 149
  • Page 150 150
  • Page 151 151
  • Page 152 152
  • Page 153 153
  • Page 154 154
  • Page 155 155
  • Page 156 156
  • Page 157 157
  • Page 158 158
  • Page 159 159
  • Page 160 160
  • Page 161 161
  • Page 162 162
  • Page 163 163
  • Page 164 164
  • Page 165 165
  • Page 166 166
  • Page 167 167
  • Page 168 168
  • Page 169 169
  • Page 170 170
  • Page 171 171
  • Page 172 172
  • Page 173 173
  • Page 174 174
  • Page 175 175
  • Page 176 176
  • Page 177 177
  • Page 178 178
  • Page 179 179
  • Page 180 180
  • Page 181 181
  • Page 182 182
  • Page 183 183
  • Page 184 184
  • Page 185 185
  • Page 186 186
  • Page 187 187
  • Page 188 188
  • Page 189 189
  • Page 190 190
  • Page 191 191
  • Page 192 192
  • Page 193 193
  • Page 194 194
  • Page 195 195
  • Page 196 196
  • Page 197 197
  • Page 198 198
  • Page 199 199
  • Page 200 200
  • Page 201 201
  • Page 202 202
  • Page 203 203
  • Page 204 204
  • Page 205 205
  • Page 206 206
  • Page 207 207
  • Page 208 208
  • Page 209 209
  • Page 210 210
  • Page 211 211
  • Page 212 212
  • Page 213 213
  • Page 214 214
  • Page 215 215
  • Page 216 216
  • Page 217 217
  • Page 218 218
  • Page 219 219
  • Page 220 220
  • Page 221 221
  • Page 222 222
  • Page 223 223
  • Page 224 224
  • Page 225 225
  • Page 226 226
  • Page 227 227
  • Page 228 228
  • Page 229 229
  • Page 230 230
  • Page 231 231
  • Page 232 232
  • Page 233 233
  • Page 234 234
  • Page 235 235
  • Page 236 236
  • Page 237 237
  • Page 238 238
  • Page 239 239
  • Page 240 240
  • Page 241 241
  • Page 242 242
  • Page 243 243
  • Page 244 244
  • Page 245 245
  • Page 246 246
  • Page 247 247
  • Page 248 248
  • Page 249 249
  • Page 250 250
  • Page 251 251
  • Page 252 252
  • Page 253 253
  • Page 254 254
  • Page 255 255
  • Page 256 256
  • Page 257 257
  • Page 258 258
  • Page 259 259
  • Page 260 260
  • Page 261 261
  • Page 262 262
  • Page 263 263
  • Page 264 264
  • Page 265 265
  • Page 266 266
  • Page 267 267
  • Page 268 268
  • Page 269 269
  • Page 270 270
  • Page 271 271
  • Page 272 272
  • Page 273 273
  • Page 274 274
  • Page 275 275
  • Page 276 276
  • Page 277 277
  • Page 278 278
  • Page 279 279
  • Page 280 280
  • Page 281 281
  • Page 282 282
  • Page 283 283
  • Page 284 284
  • Page 285 285
  • Page 286 286
  • Page 287 287
  • Page 288 288
  • Page 289 289
  • Page 290 290
  • Page 291 291
  • Page 292 292
  • Page 293 293
  • Page 294 294
  • Page 295 295
  • Page 296 296
  • Page 297 297
  • Page 298 298
  • Page 299 299
  • Page 300 300
  • Page 301 301
  • Page 302 302
  • Page 303 303
  • Page 304 304
  • Page 305 305
  • Page 306 306
  • Page 307 307
  • Page 308 308
  • Page 309 309
  • Page 310 310
  • Page 311 311
  • Page 312 312
  • Page 313 313
  • Page 314 314
  • Page 315 315
  • Page 316 316
  • Page 317 317
  • Page 318 318
  • Page 319 319
  • Page 320 320
  • Page 321 321
  • Page 322 322
  • Page 323 323
  • Page 324 324
  • Page 325 325
  • Page 326 326
  • Page 327 327
  • Page 328 328
  • Page 329 329
  • Page 330 330
  • Page 331 331
  • Page 332 332
  • Page 333 333
  • Page 334 334
  • Page 335 335
  • Page 336 336
  • Page 337 337
  • Page 338 338
  • Page 339 339
  • Page 340 340
  • Page 341 341
  • Page 342 342
  • Page 343 343
  • Page 344 344
  • Page 345 345
  • Page 346 346
  • Page 347 347
  • Page 348 348
  • Page 349 349
  • Page 350 350
  • Page 351 351
  • Page 352 352
  • Page 353 353
  • Page 354 354
  • Page 355 355
  • Page 356 356
  • Page 357 357
  • Page 358 358
  • Page 359 359
  • Page 360 360
  • Page 361 361
  • Page 362 362
  • Page 363 363
  • Page 364 364
  • Page 365 365
  • Page 366 366
  • Page 367 367
  • Page 368 368
  • Page 369 369
  • Page 370 370
  • Page 371 371
  • Page 372 372
  • Page 373 373
  • Page 374 374
  • Page 375 375
  • Page 376 376
  • Page 377 377
  • Page 378 378
  • Page 379 379
  • Page 380 380
  • Page 381 381
  • Page 382 382
  • Page 383 383
  • Page 384 384
  • Page 385 385
  • Page 386 386
  • Page 387 387
  • Page 388 388
  • Page 389 389
  • Page 390 390
  • Page 391 391
  • Page 392 392
  • Page 393 393
  • Page 394 394
  • Page 395 395
  • Page 396 396
  • Page 397 397
  • Page 398 398
  • Page 399 399
  • Page 400 400
  • Page 401 401
  • Page 402 402
  • Page 403 403
  • Page 404 404

Bull HPC BAS4 Administration Guide

Category
Software
Type
Administration Guide

Ask a question and I''ll find the answer in the document

Finding information in a document is now easier with AI