checkpoint

A common need of users is to periodically backup an application. This practice provides resiliency and potential restart points in code. HPX utilizes the concept of a checkpoint to support this use case.

Found in hpx/util/checkpoint.hpp, checkpoints are defined as objects that hold a serialized version of an object or set of objects at a particular moment in time. This representation can be stored in memory for later use or it can be written to disk for storage and/or recovery at a later point. In order to create and fill this object with data, users must use a function called save_checkpoint. In code the function looks like this:

hpx::future<hpx::util::checkpoint> hpx::util::save_checkpoint(a, b, c, ...);

save_checkpoint takes arbitrary data containers, such as int, double, float, vector, and future, and serializes them into a newly created checkpoint object. This function returns a future to a checkpoint containing the data. Here’s an example of a simple use case:

using hpx::util::checkpoint;
using hpx::util::save_checkpoint;

std::vector<int> vec{1,2,3,4,5};
hpx::future<checkpoint> save_checkpoint(vec);

Once the future is ready, the checkpoint object will contain the vector vec and its five elements.

prepare_checkpoint takes arbitrary data containers (same as for save_checkpoint), , such as int, double, float, vector, and future, and calculates the necessary buffer space for the checkpoint that would be created if save_checkpoint was called with the same arguments. This function returns a future to a checkpoint that is appropriately initialized. Here’s an example of a simple use case:

using hpx::util::checkpoint;
using hpx::util::prepare_checkpoint;

std::vector<int> vec{1,2,3,4,5};
hpx::future<checkpoint> prepare_checkpoint(vec);

Once the future is ready, the checkpoint object will be initialized with an appropriately sized internal buffer.

It is also possible to modify the launch policy used by save_checkpoint. This is accomplished by passing a launch policy as the first argument. It is important to note that passing hpx::launch::sync will cause save_checkpoint to return a checkpoint instead of a future to a checkpoint. All other policies passed to save_checkpoint will return a future to a checkpoint.

Sometimes checkpoint s must be declared before they are used. save_checkpoint allows users to move pre-created checkpoint s into the function as long as they are the first container passing into the function (In the case where a launch policy is used, the checkpoint will immediately follow the launch policy). An example of these features can be found below:

    char character = 'd';
    int integer = 10;
    float flt = 10.01f;
    bool boolean = true;
    std::string str = "I am a string of characters";
    std::vector<char> vec(str.begin(), str.end());
    checkpoint archive;

    // Test 1
    //  test basic functionality
    hpx::shared_future<checkpoint> f_archive = save_checkpoint(
        std::move(archive), character, integer, flt, boolean, str, vec);

Once users can create checkpoints they must now be able to restore the objects they contain into memory. This is accomplished by the function restore_checkpoint. This function takes a checkpoint and fills its data into the containers it is provided. It is important to remember that the containers must be ordered in the same way they were placed into the checkpoint. For clarity see the example below:

    char character2;
    int integer2;
    float flt2;
    bool boolean2;
    std::string str2;
    std::vector<char> vec2;

    restore_checkpoint(data, character2, integer2, flt2, boolean2, str2, vec2);

The core utility of checkpoint is in its ability to make certain data persistent. Often, this means that the data needs to be stored in an object, such as a file, for later use. HPX has two solutions for these issues: stream operator overloads and access iterators.

HPX contains two stream overloads, operator<< and operator>>, to stream data out of and into checkpoint. Here is an example of the overloads in use below:

    double a9 = 1.0, b9 = 1.1, c9 = 1.2;
    std::ofstream test_file_9("test_file_9.txt");
    hpx::future<checkpoint> f_9 = save_checkpoint(a9, b9, c9);
    test_file_9 << f_9.get();
    test_file_9.close();

    double a9_1, b9_1, c9_1;
    std::ifstream test_file_9_1("test_file_9.txt");
    checkpoint archive9;
    test_file_9_1 >> archive9;
    restore_checkpoint(archive9, a9_1, b9_1, c9_1);

This is the primary way to move data into and out of a checkpoint. It is important to note, however, that users should be cautious when using a stream operator to load data and another function to remove it (or vice versa). Both operator<< and operator>> rely on a .write() and a .read() function respectively. In order to know how much data to read from the std::istream, the operator<< will write the size of the checkpoint before writing the checkpoint data. Correspondingly, the operator>> will read the size of the stored data before reading the data into a new instance of checkpoint. As long as the user employs the operator<< and operator>> to stream the data, this detail can be ignored.

Important

Be careful when mixing operator<< and operator>> with other facilities to read and write to a checkpoint. operator<< writes an extra variable, and operator>> reads this variable back separately. Used together the user will not encounter any issues and can safely ignore this detail.

Users may also move the data into and out of a checkpoint using the exposed .begin() and .end() iterators. An example of this use case is illustrated below.

    std::ofstream test_file_7("checkpoint_test_file.txt");
    std::vector<float> vec7{1.02f, 1.03f, 1.04f, 1.05f};
    hpx::future<checkpoint> fut_7 = save_checkpoint(vec7);
    checkpoint archive7 = fut_7.get();
    std::copy(archive7.begin(),    // Write data to ofstream
        archive7.end(),            // ie. the file
        std::ostream_iterator<char>(test_file_7));
    test_file_7.close();

    std::vector<float> vec7_1;
    std::vector<char> char_vec;
    std::ifstream test_file_7_1("checkpoint_test_file.txt");
    if (test_file_7_1)
    {
        test_file_7_1.seekg(0, test_file_7_1.end);
        auto length = test_file_7_1.tellg();
        test_file_7_1.seekg(0, test_file_7_1.beg);
        char_vec.resize(length);
        test_file_7_1.read(char_vec.data(), length);
    }
    checkpoint archive7_1(std::move(char_vec));    // Write data to checkpoint
    restore_checkpoint(archive7_1, vec7_1);

Checkpointing components

save_checkpoint and restore_checkpoint are also able to store components inside checkpoints. This can be done in one of two ways. First a client of the component can be passed to save_checkpoint. When the user wishes to resurrect the component she can pass a client instance to restore_checkpoint.

This technique is demonstrated below:

    // Try to checkpoint and restore a component with a client
    std::vector<int> vec3{10, 10, 10, 10, 10};

    // Create a component instance through client constructor
    data_client D(hpx::find_here(), std::move(vec3));
    hpx::future<checkpoint> f3 = save_checkpoint(D);

    // Create a new client
    data_client E;

    // Restore server inside client instance
    restore_checkpoint(f3.get(), E);

The second way a user can save a component is by passing a shared_ptr to the component to save_checkpoint. This component can be resurrected by creating a new instance of the component type and passing a shared_ptr to the new instance to restore_checkpoint.

This technique is demonstrated below:

    // test checkpoint a component using a shared_ptr
    std::vector<int> vec{1, 2, 3, 4, 5};
    data_client A(hpx::find_here(), std::move(vec));

    // Checkpoint Server
    hpx::id_type old_id = A.get_id();

    hpx::future<std::shared_ptr<data_server>> f_a_ptr =
        hpx::get_ptr<data_server>(A.get_id());
    std::shared_ptr<data_server> a_ptr = f_a_ptr.get();
    hpx::future<checkpoint> f = save_checkpoint(a_ptr);
    auto&& data = f.get();

    // test prepare_checkpoint API
    checkpoint c = prepare_checkpoint(hpx::launch::sync, a_ptr);
    HPX_TEST(c.size() == data.size());

    // Restore Server
    // Create a new server instance
    std::shared_ptr<data_server> b_server;
    restore_checkpoint(data, b_server);