FAQ

This page contains the answers to a few questions that we receive often. As more questions are asked, this page may be updated.

How do I get an account?

To request an account, follow the instructions and answer the questions on our Requesting an Account Page. We will reach out to you once your account is created, or if we have any questions for you.

I would like to log in from a new computer. Can I add a new ssh key?

If you have a new computer, or want to add keys for additional computers that you use, you can add your own key on our web portal. Instructions on how to generate a new ssh key and add it to your account are on our Account Request Page. In summary, log in with your credentials (for MIT and other educational institutions this is the middle option when you go to https://txe1-portal.mit.edu) and then click on the “sshkeys” link. Scroll to the bottom and paste your key in the box.

How much storage do I have for my account?

We do not impose storage limits. However, we recommend that users not use their accounts as primary storage. Further, we do not back up the storage on the system, so we strongly recommend transferring your code, data, and any other important files to another machine for backup.

How can I share files/code/data with my colleagues?

If you would like to share files with others you can request a shared group directory. Shared group directories are located at /home/gridsan/groups and we will put a symlink in your home directory to use as a shortcut to your shared group directory. To request one, send email to supercloud@mit.edu and let us know:

  1. What the group should be called. Short, descriptive names are best.
  2. Who should be the owner/approver for the group. We will ask this person for approval whenever we receive a request to join a group.
  3. Who should be in the group. Supercloud usernames are helpful, but not required.
  4. Whether you plan to store any non-public data in the group. If you do, let us know what requirements, restrictions, or agreements are associated with the data. See why we ask here.

To learn more about Shared Groups and best practices using them, see the page on Shared Group Directories.

How do I set/change my password?

You most likely do not need to set a password. If you have an active MIT Kerberos or login from another University, you can most likely log in using your institution's credentials. On the Supercloud Web Portal Login page, select the middle option "MIT Touchstone/InCommon Federation". You may have to select your institution from the dropdown list, which should take you to your institution's login page. After you log in, you should see the Portal main page. If you have trouble logging in this way, please contact us and we can help.

If you cannot log in using "MIT Touchstone/InCommon Federation", we may set you up with a password. If you have not yet reset your password, or remember your previous password, then follow the instructions on the Web Portal page. If you have previously set your password and cannot remember it, contact us and we will help you reset your password.

Are there any resource limits?

By default, all users have a limit of 24 Xeon-P8 CPU nodes (1152 cores) and 8 Xeon-G6 GPU nodes (320 cores and 16 GPUs). This includes using multiple slots or cores for a single process. For example, if you are using 2 slots or cores per process against the 320-core limit, you can run 160 processes at a time. If you have a deadline and need additional resources, you can request more by contacting us. If you are looking to request more GPUs, please read through this page on Optimizing your GPU Usage first. Please state the number of additional processors you need, the length of time for which you need them, and tell us about the jobs you are running and how you are submitting them. If you plan to run many independent jobs, we will ask you to convert your job to use Triples Mode before giving an increased allocation. Remember this is a shared system, so during busy times we may not be able to grant your request.
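The arithmetic behind that example can be sketched in shell. This is only an illustration: the function name max_processes is hypothetical, the core counts are the default limits quoted above, and the sketch assumes every process reserves the same number of slots.

```shell
# max_processes CORE_LIMIT SLOTS_PER_PROCESS
# Prints how many processes fit within a core limit when each process
# reserves a fixed number of slots (integer division, rounds down).
# The function name is hypothetical, for illustration only.
max_processes() {
    core_limit=$1
    slots_per_process=$2
    echo $(( core_limit / slots_per_process ))
}

# 320 cores (the default Xeon-G6 limit) at 2 slots per process:
max_processes 320 2     # prints 160
# 1152 cores (the default Xeon-P8 limit) at 2 slots per process:
max_processes 1152 2    # prints 576
```

If you request more slots per process, the number of processes you can run at once drops proportionally.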

It is also important to keep in mind what your fair share of memory is for each process and request additional resources if needed. For example, if there are 40 cores and 384GB of RAM on the machine you are using, each processor's fair share would be about 9GB. Check the Systems and Software page to see how many cores and how much memory each node type has. If you think your processes will go over this, request additional slots as needed. This ensures you have sufficient memory without killing your job or someone else's.
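As a rough sketch of that calculation, the following shell functions estimate the per-core fair share and the number of slots to request for a process with a known memory footprint. The function names are hypothetical, the 40-core and 384GB figures are the example values from the paragraph above, and the results use whole-number arithmetic, so treat them as estimates.

```shell
# fair_share_gb NODE_MEM_GB NODE_CORES
# Prints the per-core fair share of memory in whole GB (rounds down).
fair_share_gb() {
    echo $(( $1 / $2 ))
}

# slots_needed PROCESS_MEM_GB NODE_MEM_GB NODE_CORES
# Prints the smallest number of slots whose combined fair share of
# memory covers PROCESS_MEM_GB (ceiling division).
slots_needed() {
    process_mem_gb=$1
    node_mem_gb=$2
    node_cores=$3
    echo $(( (process_mem_gb * node_cores + node_mem_gb - 1) / node_mem_gb ))
}

fair_share_gb 384 40      # prints 9
slots_needed 20 384 40    # prints 3
```

In this example a process that needs 20GB should request 3 slots, since 2 slots would only cover about 19GB. How you request slots depends on how you submit your job and is not shown here.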

What do I do if my job won't be deleted?

Occasionally this will happen if the node where your job is running goes down, or your job does not exit gracefully. If this happens, contact us with the Job ID, and we'll delete the job and reboot the node if needed.

Why do I get an error when I try to install a package?

There are a few common reasons you get an error when you try to install a package. If you get a "Permission Denied" or similar error, it is because you are trying to install the package system-wide, rather than in your own home directory. See the Software and Package Management page for more information on how to install packages.

If you get a "Network Error" or similar, this is because we don't have an internet/network connection on the compute nodes. This includes Jupyter and any interactive jobs. You will have to install the package on one of the login nodes.

If you are on a login node and still get an error like "Could not install packages due to an EnvironmentError: [Errno 122] Disk quota exceeded" when installing a package with pip, or something like "ERROR: could not download https://pkg.julialang.org/registry/..." when installing a package with Julia, the install is filling up your quota in the /tmp directory. We have set quotas on this directory to prevent a single person from inadvertently filling it up, because a full /tmp causes issues for everyone using the node, including preventing anyone from installing packages. You can fix this by setting the TMPDIR environment variable like so:

mkdir -p /state/partition1/user/$USER
export TMPDIR=/state/partition1/user/$USER

After you have installed your package you can clean up any lingering files by removing the temporary directory you have created:

rm -rf /state/partition1/user/$USER
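If you do this often, the setup and cleanup steps above could be wrapped in two small shell functions. This is a sketch, not something provided on the system: the function names and the optional base-directory argument are hypothetical, and the default path is the one used in the commands above.

```shell
# use_local_tmp [BASE_DIR]
# Creates a per-user temporary directory under BASE_DIR and points
# TMPDIR at it. BASE_DIR defaults to /state/partition1/user.
# (Hypothetical helper, for illustration only.)
use_local_tmp() {
    base=${1:-/state/partition1/user}
    local_tmp_dir="$base/${USER:-$(id -un)}"
    mkdir -p "$local_tmp_dir" || return 1
    export TMPDIR="$local_tmp_dir"
}

# cleanup_local_tmp
# Removes only the directory created by use_local_tmp, then unsets
# TMPDIR. Does nothing if use_local_tmp has not been called.
cleanup_local_tmp() {
    [ -n "${local_tmp_dir:-}" ] || return 0
    rm -rf "$local_tmp_dir"
    unset TMPDIR local_tmp_dir
}

# Typical use on a login node:
#   use_local_tmp
#   ... run your pip or Julia package install here ...
#   cleanup_local_tmp
```

The cleanup function deliberately removes the directory it recorded rather than whatever TMPDIR currently points to, so it cannot delete a shared directory such as /tmp by accident.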

In an Interactive Job I get the error "bash: module: command not found" when I try to use a module command. How can I use modules in an interactive job?

If a module command is not recognized in an interactive job, run source /etc/profile at the command line, just as you would in a submission script. The module command will then be available.
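One way to avoid re-sourcing the profile when it is not needed is to check for the module command first. The sketch below assumes a Bourne-style shell such as bash; the function name and its optional argument are hypothetical and exist only for illustration.

```shell
# ensure_module_command [PROFILE]
# If `module` is not defined in the current shell, source PROFILE
# (default /etc/profile). Returns success only if `module` is
# available afterwards. In bash, `.` is equivalent to `source`.
# (Hypothetical helper, for illustration only.)
ensure_module_command() {
    profile=${1:-/etc/profile}
    if ! command -v module >/dev/null 2>&1; then
        . "$profile"
    fi
    command -v module >/dev/null 2>&1
}

# Typical use in an interactive job:
#   ensure_module_command && module list
```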

How can I set up VSCode to edit files remotely on Supercloud?

You can use VSCode to remotely connect to Supercloud via the Remote - SSH extension. The default settings in the extension will fail to connect. This is because the extension tries to lock files in your home directory, and file locking there is disabled for performance reasons.

The solution is to have it use the local filesystem. To get it to work, go to your VS Code settings, click “Extensions” and then “Remote - SSH”. Once you’re in the settings for Remote - SSH, check the box next to “Remote.SSH: Lockfiles in Tmp”. What this will do is put any lockfiles in /tmp, rather than your home directory.

A side note: we have seen VS Code clutter up /tmp, which we keep fairly small. Disconnecting occasionally should clean these files up, but we do not know for sure. If you can check once in a while and clean up any files that are yours in /tmp, that would be really helpful.
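To see which files in /tmp are yours, something like the following could be run on the node you connect to. This is a sketch with a hypothetical function name; it only lists entries and leaves it to you to decide what is safe to delete.

```shell
# list_my_tmp_files [DIR]
# Lists the top-level entries in DIR (default /tmp) that are owned by
# the current user, so they can be reviewed before deleting anything.
# (Hypothetical helper, for illustration only.)
list_my_tmp_files() {
    dir=${1:-/tmp}
    find "$dir" -mindepth 1 -maxdepth 1 -user "$(id -un)" 2>/dev/null
}

# Typical use:
#   list_my_tmp_files
```

Listing first, rather than deleting in one step, avoids removing files that a running session or job still needs.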

How can I use Tensorboard on Supercloud?

Take a look at this page on how to run Tensorboard in an interactive job.

How can I get more help?

If you have a question that is not answered here, send email to supercloud@mit.edu for more help.