

WP 2.2 Checkpoint mechanisms and process migration for IA-64 processors

Leader » Maciej Stroiński, PhD, PSNC, Poznań
Co-executor » ACC Cyfronet AGH
Start date » 2 Jun, 2003
End date » 31 Oct, 2004
Short task description

Long-running tasks are susceptible to interruption by hardware or software failures. For economic reasons, it is undesirable to waste time repeating interrupted computations from the beginning.
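The idea of resuming instead of restarting can be illustrated at the smallest scale with application-level checkpointing. The C sketch below is only an illustration under stated assumptions: the names `ckpt_state`, `ckpt_save`, `ckpt_load` and `run_sum` are hypothetical, the "computation" is a simple sum, and the state is saved explicitly by the program itself, unlike a kernel-level mechanism, which saves process state without the application's cooperation.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical state of a long-running computation:
 * everything needed to resume it after an interruption. */
struct ckpt_state {
    unsigned long next_step;  /* first step that has not been executed yet */
    double partial_sum;       /* result accumulated so far */
};

/* Write the state to a temporary file, then rename it over the checkpoint.
 * On POSIX systems rename() replaces the target atomically, so a failure
 * during the write does not destroy the previous checkpoint. */
static int ckpt_save(const char *path, const struct ckpt_state *s)
{
    char tmp[4096];
    int n = snprintf(tmp, sizeof tmp, "%s.tmp", path);
    if (n < 0 || (size_t)n >= sizeof tmp)
        return -1;

    FILE *f = fopen(tmp, "wb");
    if (!f)
        return -1;
    int ok = fwrite(s, sizeof *s, 1, f) == 1;
    if (fclose(f) != 0)
        ok = 0;
    if (!ok) {
        remove(tmp);
        return -1;
    }
    return rename(tmp, path) == 0 ? 0 : -1;
}

/* Load a checkpoint; returns 0 on success, -1 if none is usable. */
static int ckpt_load(const char *path, struct ckpt_state *s)
{
    FILE *f = fopen(path, "rb");
    if (!f)
        return -1;
    int ok = fread(s, sizeof *s, 1, f) == 1;
    fclose(f);
    return ok ? 0 : -1;
}

/* Sum the integers 1..total, resuming from the checkpoint at `path` if one
 * exists and saving a new checkpoint every `interval` steps.
 * `fail_at` simulates a failure: the function gives up and returns -1.0
 * just before executing that step; pass 0 to disable the simulation. */
static double run_sum(const char *path, unsigned long total,
                      unsigned long interval, unsigned long fail_at)
{
    assert(interval > 0);

    struct ckpt_state s;
    if (ckpt_load(path, &s) != 0) {
        /* No usable checkpoint: start from the beginning. */
        s.next_step = 1;
        s.partial_sum = 0.0;
    }

    while (s.next_step <= total) {
        if (fail_at != 0 && s.next_step == fail_at)
            return -1.0;  /* simulated hardware or software failure */
        s.partial_sum += (double)s.next_step;
        s.next_step++;
        if ((s.next_step - 1) % interval == 0)
            ckpt_save(path, &s);  /* best effort: a failed save is ignored */
    }
    return s.partial_sum;
}
```

Calling `run_sum("sum.ckpt", 100, 10, 57)` and then `run_sum("sum.ckpt", 100, 10, 0)` would make the second call continue from the checkpoint written after step 50 rather than from step 1. The cost of this approach is that every application must be written to save and restore its own state.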

Operating systems designed for the Intel x86 processor family do not provide checkpointing at the operating-system level; so far there has been no need for such functionality in the PC world. Intel's introduction of the new IA-64 processor family creates an opportunity to use these processors for HPC computations. The emergence of new processors, and of operating systems designed for them, is a good moment to prepare kernel-level checkpointing mechanisms and process migration to another computational node.

Drawing on the experience of the CRAK and EPCKPT projects, an implementation of a checkpoint mechanism for the Linux operating system on the new Intel IA-64 processors is planned. The task consists of implementing functions to stop, restart, and migrate tasks on computers equipped with Intel Itanium processors running operating systems of the Linux family. The checkpoint mechanism will be deployed as a supplementary module attached to the operating system and will operate at the kernel level. The module will support pipeline processing, IPC mechanisms (semaphores, shared memory, message queues), dynamically loaded libraries, saving of process groups on both single-processor and multiprocessor systems, and socket communication over TCP/IPv4. Additionally, research will be carried out on mechanisms for process migration between different systems of the same architecture, and tools for managing process migration will be implemented.

To carry out the task, a cluster of SGI 750 systems (Intel Itanium processors running the Linux operating system) will be used. The cluster will be distributed across several computer centers, which will communicate over a fast optical network (see the WP 1.2 description).