Student Projects

About Us

Our team creates powerful supercomputers for modeling and analyzing the most complex problems and the largest data sets to enable revolutionary discoveries and capabilities.  Many of these capabilities have been developed and published in partnership with amazing students (see our Google Scholar page).

Here is a selection of videos describing some of our work:
• MIT Lincoln Laboratory Supercomputing Center
• Large Scale Parallel Sparse Matrix Streaming Graph/Network Analysis [ACM Symposium on Parallelism in Algorithms and Architectures (SPAA) Keynote Talk]
• Beyond Zero Botnets: Web3 Enabled Observe-Pursue-Counter Approach [TEDx Boston Studio MIT Imagination in Action]
• Statistical Cyber Characterization [IEEE High Performance Extreme Computing (HPEC) conference]
• HPC Server 3D Representation
• Building the Massachusetts Green High Performance Computing Center (the largest open research data center in the world)

Listed below are a wide range of potential projects in AI, Mathematics, Green Computing, Supercomputing Systems, Online Learning, and Connection Science.  If you are interested in any of these projects, please send us an email at supercloud@mit.edu.

AI Projects

  • Speeding up Architecture and Hyper-Parameter Searches
    Architecture and hyper-parameter searches are a major part of the AI development pipeline. This workflow typically consists of training a large number of models to identify architectures best suited for a given mission. These searches can take a significant amount of compute resources and time. Our project aims to develop new approaches to early identification of optimal architectures or hyper-parameters by modeling the loss curve trajectories during the training process. There are two approaches we are developing.
    1. Training performance estimation (TPE) – TPE has been shown to be very effective at estimating the converged training performance of graph neural networks. We are currently exploring the application of TPE to other models and domains.
    2. Loss curve gradient estimation (LCGA) – The LCGA approach aims to model the curvature of the training loss and has been shown to be effective at identifying optimal architectures across different model families while also maintaining the relative ordering of model losses. Next steps in this research involve:
    · Extending LCGA as an early stopping mechanism – By modeling the training loss curve, it may be possible to identify the optimal number of epochs required to train a model beyond which the model does not improve substantially.
    · Trainless architecture searches – By modeling the loss curves across a large family of deep neural network architectures, it may be possible to identify optimal new architectures without the need for training every possible architecture.
    References:
    • Energy-Aware Neural Architecture Selection and Hyperparameter Optimization
    • Loss Curve Approximations for Fast Neural Architecture Ranking & Training Elasticity Estimation
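    Both approaches rest on the same core idea, which can be illustrated with a small sketch: fit a parametric model to the first few epochs of a loss curve and extrapolate its asymptote. The power-law form and all names below are illustrative assumptions, not the actual TPE or LCGA implementation.

```python
import numpy as np

def fit_power_law(epochs, losses):
    """Fit loss ~ a * epoch**(-b) + c by grid-searching the asymptote c
    and fitting a, b with linear least squares in log-log space."""
    best_err, best_fit = np.inf, None
    for c in np.linspace(0.0, losses.min() * 0.99, 50):
        slope, intercept = np.polyfit(np.log(epochs), np.log(losses - c), 1)
        pred = np.exp(intercept) * epochs ** slope + c
        err = float(np.mean((pred - losses) ** 2))
        if err < best_err:
            best_err, best_fit = err, (np.exp(intercept), -slope, c)
    return best_fit  # (a, b, c); c estimates the converged loss

# Synthetic loss curve that converges to 0.1, observed for only 20 epochs
epochs = np.arange(1.0, 21.0)
losses = 2.0 * epochs ** -0.7 + 0.1
a, b, c = fit_power_law(epochs, losses)
print(f"estimated converged loss: {c:.3f}")  # close to the true 0.1
```

    With only 20 observed epochs, the fitted asymptote already approximates the converged loss, which is the kind of signal an early-stopping or architecture-ranking mechanism could act on.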

  • AI Analysis of User Interactions
    The SuperCloud team communicates with and assists users via email. What can we learn from email and Zoom communications to improve SuperCloud and the user experience? This project aims to build an infrastructure to collect, parse, and analyze these communications.

  • Hierarchical Anonymized AI
    The costs of adversarial activity on networks are growing at an alarming rate and have reached $1T per year.  90% of Americans are now concerned about cyber-attacks, a level of public concern greater than that for pandemics and nuclear war.  In the land, sea, undersea, air, and space operating domains, observe-pursue-counter (detect-handoff-intercept) walls-out architectures have proven cost-effective.  Our recent innovations in high performance privacy-preserving network sensing and analysis offer new opportunities for obtaining the required observations to enable such architectures in the cyber domain.  Using these network observations to pursue and counter adversarial activity requires the development of novel privacy-preserving hierarchical AI analytics techniques that explore connections both within and across the layers of the knowledge pyramid from low-level network traffic to high-level social media.
    References:   
    Zero Botnets: An Observe-Pursue-Counter Approach
    Realizing Forward Defense in the Cyber Domain
    GraphBLAS on the Edge: Anonymized High Performance Streaming of Network Traffic
    Large Scale Enrichment and Statistical Cyber Characterization of Network Traffic
    Temporal Correlation of Internet Observatories and Outposts
    Hypersparse Neural Network Analysis of Large-Scale Internet Traffic
    GraphBLAS

Mathematics Projects

  • Mathematics of Big Data & Machine Learning
    Big Data describes a new era in the digital age where the volume, velocity, and variety of data created across a wide range of fields is increasing at a rate well beyond our ability to analyze the data.  Machine Learning has emerged as a powerful tool for transforming this data into usable information.  Many technologies (e.g., spreadsheets, databases, graphs, matrices, deep neural networks, ...) have been developed to address these challenges.  The common theme amongst these technologies is the need to store and operate on data as tabular collections instead of as individual data elements.  This project explores the common mathematical foundation of these tabular collections (associative arrays), which applies across a wide range of applications and technologies.  Associative arrays unify and simplify Big Data and Machine Learning.  Understanding these mathematical foundations enables one to see past the differences that lie on the surface of Big Data and Machine Learning applications and technologies and to leverage their core mathematical similarities to solve the hardest Big Data and Machine Learning challenges.
    References:
    • Mathematics of Big Data
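    A minimal sketch of the associative array idea: data is a mapping from (row, column) keys to values, and element-wise addition and multiplication operate on the union and intersection of keys, respectively. This toy class is a hypothetical illustration, not the actual D4M API.

```python
class AssocArray:
    """Toy associative array: keys are (row, col) string pairs, values
    are numeric. Addition merges keys; multiplication intersects them."""
    def __init__(self, data=None):
        self.data = dict(data or {})  # {(row, col): value}

    def __add__(self, other):
        keys = set(self.data) | set(other.data)  # union of keys
        return AssocArray({k: self.data.get(k, 0) + other.data.get(k, 0)
                           for k in keys})

    def __mul__(self, other):
        keys = set(self.data) & set(other.data)  # intersection of keys
        return AssocArray({k: self.data[k] * other.data[k] for k in keys})

A = AssocArray({("alice", "likes"): 1, ("bob", "likes"): 2})
B = AssocArray({("bob", "likes"): 3, ("carol", "likes"): 4})
print((A + B).data)  # union of keys, values summed
print((A * B).data)  # intersection of keys, values multiplied
```

    Because the operations are defined on keys rather than on fixed integer indices, the same algebra covers spreadsheets, sparse matrices, and database tables.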

  • Catastrophe vs Conspiracy: Heavy Tail Statistics
    Heavy-tail distributions, where the probability decays more slowly than exp(-x), are a natural result of multiplicative processes and play an important role in many of today’s most important problems (pandemics, climate, weather, finance, wealth distribution, social media, …).  Computer networks are among the most notable examples of heavy-tail distributions, the celebrated discovery of which led to the creation of the new field of Network Science.  However, this observation brings with it the recognition that many cyber detection systems use light-tail statistical tests for which there may be no combination of thresholds that can result in acceptable operator probability-of-detection (Pd) and probability-of-false-alarm (Pfa). This Pd/Pfa paradox is consistent with the lived experience of many cyber operators, and a possible root cause is the incompatibility of light-tail statistical tests with heavy-tail data. The goal of this effort is to develop the necessary educational and training tools for effectively understanding and applying heavy-tail distributions in a cyber context.
    References:
    New Phenomena in Large-Scale Internet Traffic
    Large Scale Enrichment and Statistical Cyber Characterization of Network Traffic
    The Fundamentals of Heavy Tails: Properties, Emergence, and Estimation
    Temporal Correlation of Internet Observatories and Outposts
    Hybrid Power-Law Models of Network Traffic
    Hypersparse Neural Network Analysis of Large-Scale Internet Traffic
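    The Pd/Pfa paradox described above can be demonstrated numerically: a detection threshold calibrated for a light-tail (exponential) model produces far more false alarms when the data is actually heavy-tail (Pareto). The specific distributions and parameters below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Light-tail (exponential) vs heavy-tail (Pareto, alpha=1.5) samples,
# scaled to the same mean of 1 for a fair comparison.
light = rng.exponential(scale=1.0, size=n)
alpha = 1.5
heavy = (rng.pareto(alpha, size=n) + 1) * (alpha - 1) / alpha

# Threshold chosen for Pfa = 1e-4 under the light-tail model,
# since P(X > t) = exp(-t) for a unit exponential:
thresh = -np.log(1e-4)  # about 9.21
pfa_light = np.mean(light > thresh)
pfa_heavy = np.mean(heavy > thresh)
print(f"Pfa on light-tail data: {pfa_light:.1e}")
print(f"Pfa on heavy-tail data: {pfa_heavy:.1e}")  # orders of magnitude higher
```

    The same threshold that yields roughly one false alarm in ten thousand under the light-tail assumption fires dramatically more often on the heavy-tail data, with no threshold giving both acceptable Pd and Pfa.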

  • Abstract Algebra of Cyberspace
    Social media, e-commerce, streaming video, e-mail, cloud documents, web pages, traffic flows, and network packets fill vast digital lakes, rivers, and oceans that we each navigate daily. This digital hyperspace is an amorphous flow of data supported by continuous streams that stretch standard concepts of type and dimension. The unstructured data of digital hyperspace can be elegantly represented, traversed, and transformed via the mathematics of hypergraphs, hypersparse matrices, and associative array algebra. This work will explore a novel mathematical concept, the semilink, that combines pairs of semirings to provide the essential operations for network/graph analytics, database operations, and machine learning.
    References:
    Mathematics of Big Data
    Mathematics of Digital Hyperspace
    Visually Representing the Landscape of Mathematical Structures
    Polystore Mathematics of Relational Algebra

  • Mathematical Underpinnings of Associative Array Algebra
    Semirings have found success as an algebraic structure that can support the variety of data types and operations used by those working with graphs, matrices, spreadsheets, and databases, and they form the mathematical foundation of the associative array algebra of D4M and the matrix algebra of GraphBLAS. Mirroring the fact that module theory has many but not all of the structural guarantees of vector space theory, semimodule theory has some but not all of the structural guarantees of module theory. The added generality of semirings allows semimodule theory to consider structures wholly unlike rings and fields, such as Boolean algebras and the max-plus algebra. By focusing on these special cases, which are diametrically opposed to the traditional ring and field cases, one can develop analogs of standard linear algebra, such as eigenanalysis and the solution of linear systems. This work will further explore the theory of semirings in the form of solving linear systems, carrying out eigenanalysis, and developing graph algorithms.
    References:
    Mathematics of Big Data
    Linear Systems over Join-Blank Algebras
    Graphs, Dioids and Semirings
    Semirings and their Applications
    GraphBLAS
    D4M
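    The generality of semirings can be illustrated by matrix multiplication in which the ordinary (+, ×) pair is swapped for another (add, mul) pair. The sketch below uses the max-plus algebra mentioned above; it is an illustration of the concept, not the GraphBLAS API.

```python
import numpy as np

def semiring_matmul(A, B, add=np.maximum, mul=np.add, zero=-np.inf):
    """Matrix multiply over an (add, mul) semiring. With add=max and
    mul=+ (the max-plus algebra), C[i, j] is the maximum path weight
    from i to j through any intermediate vertex k."""
    n, m = A.shape[0], B.shape[1]
    C = np.full((n, m), zero)  # semiring additive identity
    for k in range(A.shape[1]):
        C = add(C, mul(A[:, [k]], B[[k], :]))
    return C

# Weighted adjacency matrix; -inf marks "no edge"
NEG = -np.inf
A = np.array([[NEG, 2.0, 5.0],
              [NEG, NEG, 3.0],
              [NEG, NEG, NEG]])
# Two-hop max-weight paths: A (x) A over max-plus
C = semiring_matmul(A, A)
print(C)  # C[0, 2] == 5.0: the path 0 -> 1 -> 2 with weight 2 + 3
```

    Swapping in add=min and mul=+ would instead compute shortest paths, which is why a single semiring-parameterized kernel can serve many graph algorithms.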

Green Computing Projects

  • Reducing the Carbon Footprint of AI
    AI is increasingly being used in a variety of domains, ranging from vision and speech to cyber and bio. With data and models getting ever larger, the associated carbon footprint of training and inference operations is also increasing. While there is growing awareness in the community, no significant effort to reduce the energy consumed by AI currently exists. This project aims to develop best practices and recommendations for reducing the carbon footprint of AI through several approaches, including but not limited to:
    · Hardware performance modulation
    · Optimized/energy efficient training and inference pipelines
    · Optimized network architecture searches (NAS) that require less compute
    · Trainless NAS
    · Quantization, distillation, low-precision compute
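    As a small illustration of the last item, symmetric int8 quantization replaces 32-bit weights with 8-bit integers plus one scale factor, cutting memory traffic (and thus energy) roughly 4x at the cost of a bounded rounding error. This is a generic, framework-agnostic sketch.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric linear quantization: map float weights onto the
    int8 range [-127, 127] with a single scale factor."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the int8 codes."""
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).normal(size=1000).astype(np.float32)
q, s = quantize_int8(w)
err = np.abs(dequantize(q, s) - w).max()
print(f"max reconstruction error: {err:.4f} (scale = {s:.4f})")
```

    The worst-case reconstruction error is half the scale factor, which is often small enough that model accuracy is essentially unchanged while storage and compute energy drop substantially.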

  • Optimizing Datacenter Operations
    Datacenters are a growing contributor to global energy consumption and the global carbon footprint. The energy consumed in a datacenter includes not only the electricity required to power the hardware, but also the energy expended cooling it. While some commercial cloud providers are able to locate datacenters in areas with abundant carbon-free, renewable energy, this is not always possible. Thus, there is a need to reduce the carbon footprint of running these massive datacenters, which provide compute for a wide variety of workloads ranging from AI/ML to scientific applications, as well as e-commerce.  By optimizing the use of compute inside the datacenter, it may be possible to reduce its power usage effectiveness (PUE), which can help reduce the carbon footprint significantly. Approaches to enable this goal include:
    · Matching workloads with compute capability – by understanding the compute characteristics of workloads being run at datacenter scale, it may be possible to run workloads on the hardware that is best suited for the task and is most energy efficient.
    · Reconfigurable supercomputing – by characterizing the diversity of datacenter workloads, operators may be able to take advantage of advanced hardware that allows optimal provisioning and partitioning on a weekly basis. This can accommodate diverse workloads optimally based on prior behavior, while reducing energy consumption and also improving hardware availability.
    · Energy modeling – the mix of clean/renewable energy varies by region as well as by the time of year. By modeling energy demand at the datacenter and the seasonal mix of renewables, it may be possible for supercomputing centers to reduce their carbon footprint by shifting compute to “clean” periods.
    · Climate aware scheduling – current workload schedulers in datacenters are climate agnostic and do not take environmental factors into account when running workloads. Additionally, schedulers do not consider the physical organization/layout of the hardware in the facility when running user-submitted jobs. Potential approaches to address this are as follows:

  • Scheduling jobs based on external temperature
    Any hardware operating at peak performance generates a significant amount of heat, which necessitates the use of large amounts of cooling in the datacenter. This is in addition to external temperatures, over which operators have no control. Thus, if the amount of heat generated inside the datacenter is reduced, this could have a measurable positive impact on the amount of cooling required. One way to achieve this is to schedule power-hungry compute, such as GPU jobs, to cooler parts of the day. Another approach is to enable the scheduler to take into account the physical proximity of hardware while scheduling compute jobs. This can help reduce hot-spots in the server racks and potentially lead to more efficient cooling operations.

  • Climate modeling
    Supercomputing center operators can build localized climate models to estimate events such as heatwaves and enable the scheduler to reduce hardware power in the event of anticipated extreme weather. This will have the effect of not only reducing energy consumption, but also ensuring safer and un-interrupted datacenter operations by avoiding potential cooling failures in extreme heat.

  • User Engagement
    Supercomputing operations can be further optimized by actively working with users who are running large scale compute in the datacenter. This can take the form of education for optimizing compute, policies that enable users to choose low-power compute, ability to defer compute to cooler times of the day, etc.
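    The power usage effectiveness (PUE) metric central to these approaches is defined as total facility power divided by IT equipment power, so any cooling saved by the scheduling and engagement strategies above moves PUE toward its ideal value of 1.0. A minimal sketch with hypothetical numbers:

```python
def pue(it_kw, cooling_kw, other_kw=0.0):
    """Power Usage Effectiveness: total facility power / IT equipment
    power. An ideal facility approaches PUE = 1.0."""
    return (it_kw + cooling_kw + other_kw) / it_kw

# Hypothetical facility with 1000 kW of IT load
before = pue(it_kw=1000, cooling_kw=400, other_kw=100)  # 1.5
# Suppose climate-aware scheduling cuts the cooling load by 25%
after = pue(it_kw=1000, cooling_kw=300, other_kw=100)   # 1.4
print(f"PUE before: {before:.2f}, after: {after:.2f}")
```

    In this hypothetical example, trimming a quarter of the cooling load lowers PUE from 1.5 to 1.4, i.e., 100 kW less overhead for the same delivered compute.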

Supercomputing Systems Projects

  • MIT/Stanford Next Generation Operating System
    The goal of the MIT/Stanford DBOS (the DBMS-oriented Operating System) is to build a completely new operating system (OS) stack for distributed systems. Currently, distributed systems are built on many instances of a single-node OS like Linux with entirely separate cluster schedulers, distributed file systems, and network managers. DBOS uses a distributed transactional DBMS as the basis for a scalable cluster OS. We have shown that such a database OS can do scheduling, file management, and inter-process communication with competitive performance to existing systems. It can additionally provide significantly better analytics and dramatically reduce code complexity by building core OS services from standard database queries, while implementing low-latency transactions and high availability only once. We are currently working on building a complete end-to-end prototype of DBOS.  This project will explore implementing next generation cyber analytics within DBOS.
    References:
    DBOS: a DBMS-Oriented Operating System
    DBOS
    GraphBLAS on the Edge: Anonymized High Performance Streaming of Network Traffic
    Hypersparse Network Flow Analysis of Packets with GraphBLAS
    Temporal Correlation of Internet Observatories and Outposts
    Python Implementation of the Dynamic Distributed Dimensional Data Model
    Large Scale Enrichment and Statistical Cyber Characterization of Network Traffic
    Hypersparse Neural Network Analysis of Large-Scale Internet Traffic
    D4M
    GraphBLAS
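    The DBOS idea of building core OS services from database queries can be sketched in miniature: task state lives in a table, and scheduling is a single transactional query. This sketch uses SQLite purely for illustration; the real DBOS is built on a distributed transactional DBMS.

```python
import sqlite3

# OS state as database tables: each task has a priority and a state.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE tasks (id INTEGER PRIMARY KEY,"
           " priority INT, state TEXT)")
db.executemany("INSERT INTO tasks (priority, state) VALUES (?, ?)",
               [(3, "ready"), (7, "ready"), (5, "running")])

def schedule_next(db):
    """Pick the highest-priority ready task and mark it running,
    all inside one transaction."""
    with db:  # commits on success, rolls back on error
        row = db.execute(
            "SELECT id FROM tasks WHERE state = 'ready' "
            "ORDER BY priority DESC LIMIT 1").fetchone()
        if row is None:
            return None
        db.execute("UPDATE tasks SET state = 'running' WHERE id = ?", row)
        return row[0]

print("scheduled task:", schedule_next(db))  # the priority-7 task
```

    Because the scheduler is just a transactional query, analytics over scheduling history come for free as more SQL, which is the code-complexity argument made for DBOS.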

  • Supercomputing and Cloud interoperability
    Shared supercomputing resources typically offer a limited set of hardware and software combinations that researchers can leverage. At the same time, the absolute number of resources offered is also physically limited. Commercial cloud providers can offer an avenue to leverage additional resources, as well as new or unique hardware, as needs arise. Thus, having the technology to seamlessly transition between a shared resource such as the MIT SuperCloud and a commercial cloud provider (AWS, Microsoft Azure, Google Cloud) can significantly increase user productivity and enable new research. This project aims to
    · Make the MIT SuperCloud software stack available as a deployable image on commercial cloud providers
    · Develop tools to seamlessly transition between SuperCloud and cloud as requirements change
    · Enable sponsors and funding agencies to provide a standard AI stack that can be leveraged by performers and the broader community

  • Performance Tuning of Large-Scale Cluster Management Systems
    Modern supercomputers rely on a collection of open-source and bespoke software to handle node and user provisioning, system configuration, configuration persistence, change management, monitoring, metrics gathering, and imaging.  The MIT SuperCloud system’s routine monthly maintenance includes a full reimage and reinstall of the operating system, all software, and all configuration files. This ensures the reliability of our imaging system and maintains a consistent state for our users, preventing the accumulation of incidental changes that could complicate troubleshooting and interfere with the running of user jobs.  The frequency with which we reimage nodes requires that the process be streamlined and optimized so that node reinstallation is as quick and reliable as possible.  This project would explore methods to refine our node installation procedures and search for new efficiencies, furthering our ability to manage and maintain very large systems.

  • Datacentric AI
    The Datacentric AI project aims to develop revolutionary data-centric systems that can enable edge-to-datacenter scale computing while also providing high performance and accuracy for AI tasks, high productivity for AI developers using the system, and self-driven resource management of the underlying complex systems. Rapidly evolving technologies such as new computing architectures, AI frameworks, supercomputing systems, cloud, and data management are the key enablers of AI, and the speed at which they develop is outpacing the ability of AI practitioners to leverage them optimally. As compute capability, AI frameworks, and data diversity have grown, AI models have also evolved from traditional feed-forward or convolutional networks that employ computationally simple layers to more complex networks that use differential equations to model physical phenomena. These new classes of algorithms and massive model architectures need new types of data-centric systems that can help map their novel computing requirements onto increasingly complex hardware platforms such as quantum processors, neuromorphic processors, and datacenter-scale chips. A data-centric system would need revolutionary operating systems, ML-enhanced data management, highly parallel algorithms, and workload-aware schedulers that can automatically map workloads to heterogeneous hardware platforms. By developing technologies to address these needs, this project aims to provide Lincoln and DoD researchers with the tools to address the needs of future AI systems.

  • Parallel Python Programming
    There is a plethora of libraries that enable parallel programming in the Python programming language, but little has been done with the partitioned global array semantics (PGAS) approach.  Using the PGAS approach, one can deploy a parallel capability that provides good speed-up without sacrificing the ease of programming in Python. This project will explore the scalability and performance of the preliminary implementation of PGAS in Python, compare its performance with other libraries available for parallel Python programming, and potentially seek further performance optimizations in the current PGAS implementation.
    References:
    pPython for Parallel Python Programming
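    The PGAS flavor of parallelism can be sketched as "owner computes": each worker derives its own slice of a conceptually global index space, computes locally, and the partial results are combined. The code below is a hypothetical illustration using Python's standard multiprocessing module, not the actual pPython API.

```python
import numpy as np
from multiprocessing import Pool

N, NP = 1_000_000, 4  # global array length, number of workers

def local_part(rank):
    """Each rank owns one contiguous slice of the global index space
    and performs its reduction locally ("owner computes")."""
    lo = rank * N // NP
    hi = (rank + 1) * N // NP
    x = np.arange(lo, hi, dtype=np.float64)
    return float(np.sum(np.sin(x) ** 2))  # local partial sum

if __name__ == "__main__":
    with Pool(NP) as pool:
        total = sum(pool.map(local_part, range(NP)))  # combine partials
    print(f"global sum of sin^2 over {N} points: {total:.1f}")  # about N/2
```

    The appeal of the PGAS style is that the global-to-local index mapping is explicit and simple, so the serial NumPy code inside each worker is nearly unchanged from the single-process version.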

  • 3D Visualization of Supercomputer Performance
    There are a number of data collection and visualization tools to assist in the real-time performance analysis of High Performance Computing (HPC) systems, but there is also a need to analyze past performance for systems troubleshooting and system behavior analysis. Optimizing HPC systems for processing speed, power consumption, and network utilization is difficult to do in real time, so a system that uses collected data to “rerun” system performance would be advantageous. Gaming engines, like Unity 3D, can be used to build virtual system representations and run scenarios using historical or manufactured data to identify system failures or bottlenecks, and can be fine-tuned to optimize performance metrics.
    References:
    3D Real-Time Supercomputer Monitoring
    Large Scale Network Situational Awareness via 3D Gaming Technology

  • Data Analytics and 3D Game Development
    The LLSC operates and maintains a large number of High Performance Computing clusters for general-purpose accelerated discovery across many research domains. The operation of these systems requires access to detailed information regarding the status of system schedulers, storage arrays, compute nodes, networks, and data center conditions. These collections represent over 80 million data points per day. Effectively correlating this volume of data into actionable information requires innovative approaches and tools. The LLSC has developed a 3D Monitoring and Management platform by leveraging Unity3D to render the physical data center space and assets into a virtual environment which strives to provide a holistic view of the HPC resources in a human digestible format. Our goal is to achieve a level of situational awareness that enables the operations team to identify and correct issues before they negatively impact the user experience. Some near-term goals are to fold the innovative Green Data Center challenge work and data into the M&M system to enable the identification of carbon impacts of different job profiles across a heterogeneous compute environment.
    References:
    Large Scale Network Situational Awareness via 3D Gaming Technology
    Big Data Strategies for Data Center Infrastructure Management using a 3D Gaming Platform
    Optimizing the Visualization Pipeline of a 3-D Monitoring and Management System
    3D Real-Time Supercomputer Monitoring
    A Green(er) World for AI
    Unity game development platform

Online Learning Projects

  • Evaluate the State of User Applications
    • Capture the most commonly used applications/workflows
    • Compare to existing set of teaching examples
    • Design a prioritized suite of new examples or updates to existing examples
    • Highlight topics that would make good micro-lessons

  • Expand the Suite of Teaching Examples
    • Use the prioritized list created via the project defined above, or start with code snippets that we currently have
    • Identify areas/scripts that would be beneficial to users:
    • Build a whole learning module, e.g., TensorBoard
    • documentation converted to scripts (where possible)
    • clean up scripts – (ascii art) to make it clearer for user
    • testing
    • include description of what is happening on systems
    • include information on impact to system and user
    • create mini-workshop
    • convert workshop to video & hands-on module

  • Evaluation of Educational Games
    • Explore the literature to understand methods for evaluating learning
    • Evaluate our HPC Games
    • What data should we collect? survey design?  is there any data that we should look for in the online version?
    • recommend a formal education plan for each game
  • Knowledge Graph from Video Scripts
    • NLP
    • align with content in course(s)

  • Learning Analytics (WPLA - workplace learning analytics)
    • review of  WPLA literature, best practices
    • using SLURM data and Edly Data to evaluate effectiveness of courseware
    • path from Jupyter NB to batch jobs
    • efficiency of jobs
    • email requests aligned with learning modules - do we lower the amount of simple email?

  • Exploring User Data
    • Prior experience
    • Departments, position, affiliation
    • How long are they using the system for? “Turnover”?
    • Combine with Slurm data- how much are people using the system? Correlations between system use and how long they’ve had their account?
     

Connection Science Projects