|
Short task description
Long-running tasks are susceptible to
interruption by hardware or software failures. For economic reasons,
it is undesirable to waste time repeating interrupted computations from
the beginning.
Operating systems designed for the Intel x86
processor family do not provide checkpointing at the operating
system level. So far there has been no need for such functionality in
the PC world. Intel's introduction of the new IA-64 processor family
creates an opportunity to use these processors for HPC. The arrival of
new processors, and of operating systems designed for them, is a
good moment to prepare mechanisms for kernel-level checkpointing
and for process migration to another computational node.
Based on experience from the
CRAK
and EPCKPT
projects, an implementation of a checkpoint mechanism for the Linux
operating system on the new Intel IA-64 processors is planned.
The task consists of implementing functions for stopping, restarting and
migrating tasks on computers equipped with Intel Itanium processors running
operating systems of the Linux family. The checkpoint mechanism will
be deployed as a supplementary module attached to the operating system and
will work at the kernel level. The module will support
pipeline processing, IPC mechanisms (semaphores, shared memory, message
queues), dynamically loaded libraries, saving of process groups on both
single-processor and multiprocessor systems, and socket communication over
TCP/IPv4. Additionally, research will be carried out on mechanisms allowing
process migration between different systems of the same architecture, and
tools for managing process migration will be implemented.
To carry out the task,
a cluster of SGI 750 systems (Linux operating system, Intel
Itanium processors) will be used. The cluster will be distributed across
several computer centers and will communicate over a fast optical network
(see the WP 1.2 description). |