data science thesis

Analytics Insight

10 Best Research and Thesis Topic Ideas for Data Science in 2022


These research and thesis topics for data science will ensure more knowledge and skills for both students and scholars

  • Handling practical video analytics in a distributed cloud: With increased dependency on the internet, sharing videos has become a mode of data and information exchange. The Internet of Things (IoT), telecom infrastructure, and operators play a huge role in generating insights from video analytics. Several questions remain open here, such as how efficient existing analytics systems are and what will change once real-time analytics are integrated.
  • Smart healthcare systems using big data analytics: Big data analytics plays a significant role in making healthcare more efficient, accessible, and cost-effective. Big data analytics enhances the operational efficiency of smart healthcare providers by providing real-time analytics. It enhances the capabilities of the intelligent systems by using short-span data-driven insights, but there are still distinct challenges that are yet to be addressed in this field.
  • Identifying fake news using real-time analytics: The circulation of fake news has become a pressing issue in the modern era. Data gathered from social media networks might seem legitimate, but sometimes it is not. The sources that provide the data are unauthenticated most of the time, which makes this a crucial issue to address.
  • Secure federated learning with real-world applications: Federated learning is a technique that trains an algorithm across multiple decentralized edge devices and servers. It can be used to build models locally, but whether it can be deployed at scale, across multiple platforms and with high-level security, is still unclear.
  • Big data analytics and its impact on marketing strategy: The advent of data science and big data analytics has entirely redefined the marketing industry. It has helped enterprises by offering valuable insights into their existing and future customers. But several issues, such as surplus data, integrating complex data into customers’ journeys, and complete data privacy, remain largely unexplored and need immediate attention.
  • Impact of big data on business decision-making: Current studies indicate that big data has transformed the way managers and business leaders make critical decisions concerning the growth and development of the business. It allows them to access objective data and analyse the market environments, enabling companies to adapt rapidly and make decisions faster. Working on this topic will help students understand the present market and business conditions and help them analyse new solutions.
  • Implementing big data to understand consumer behaviour: In understanding consumer behaviour, big data is used to analyse the data points depicting a consumer’s journey after buying a product. Data gives a clearer picture in understanding specific scenarios. This topic will help students understand the problems that businesses face in utilizing the insights and develop new strategies to generate more ROI.
  • Applications of big data to predict future demand and forecasting: Predictive analytics in data science has emerged as an integral part of decision-making and demand forecasting. Working on this topic will enable students to determine the significance of high-quality historical data analysis and the factors that drive higher consumer demand.
  • The importance of data exploration over data analysis: Exploration enables a deeper understanding of the dataset, making it easier to navigate and use the data later. Intelligent analysts must understand the differences between data exploration and analysis and use each according to specific needs to fulfill organizational requirements.
  • Data science and software engineering: Software engineering and development are a major part of data science. Skilled data professionals should learn and explore the various technical and software skills needed to perform critical AI and big data tasks.
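
To make the federated learning topic above more concrete, here is a minimal sketch of federated averaging in plain Python. It is an illustration under stated assumptions, not a method taken from the article: the one-weight linear model, the function names (`local_update`, `federated_average`, `make_client`), and the synthetic client data are all hypothetical. It also leaves out the "secure" part entirely (no encryption or secure aggregation); it only shows that clients share model weights while their raw data stays local.

```python
import random


def local_update(w, data, lr=0.5, epochs=5):
    """Run a few gradient-descent steps on one client's private data.

    Hypothetical model: y is approximated by w * x, with squared-error loss.
    """
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


def federated_average(client_data, rounds=20, w0=0.0):
    """Federated averaging: only weights leave the clients, never raw data."""
    w = w0
    total = sum(len(d) for d in client_data)
    for _ in range(rounds):
        # Each client starts from the current global weight and trains locally.
        local_weights = [local_update(w, d) for d in client_data]
        # The server averages the results, weighted by each client's data size.
        w = sum(lw * len(d) for lw, d in zip(local_weights, client_data)) / total
    return w


def make_client(n, true_w=3.0, seed=0):
    """Synthetic (assumed) client dataset drawn from y = true_w * x + small noise."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1, 1) for _ in range(n)]
    return [(x, true_w * x + rng.gauss(0, 0.01)) for x in xs]


clients = [make_client(20, seed=1), make_client(50, seed=2), make_client(30, seed=3)]
w_global = federated_average(clients)
print(round(w_global, 3))  # should land close to the true weight of 3.0
```

A thesis on this topic would typically start where this sketch stops: non-identical client data, unreliable devices, and protecting the shared updates themselves.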


Thesis/Capstone for Master's in Data Science | Northwestern SPS - Northwestern School of Professional Studies


Data Science

Capstone and thesis overview.

Capstone and thesis are similar in that they both represent a culminating, scholarly effort of high quality. Both should clearly state a problem or issue to be addressed. Both will allow students to complete a larger project and produce a product or publication that can be highlighted on their resumes. Students should consider the factors below when deciding whether a capstone or thesis may be more appropriate to pursue.

A capstone is a practical or real-world project that can emphasize preparation for professional practice. A capstone is more appropriate if:

  • you don't necessarily need or want the experience of the research process or writing a big publication
  • you want more input on your project, from fellow students and instructors
  • you want more structure to your project, including assignment deadlines and due dates
  • you want to complete the project or graduate in a timely manner

A student can enroll in MSDS 498 Capstone in any term. However, capstone specialization courses can provide a unique student experience and may be offered only twice a year. 

A thesis is an academic-focused research project with broader applicability. A thesis is more appropriate if:

  • you want to get a PhD or other advanced degree and want the experience of the research process and writing for publication
  • you want to work individually with a specific faculty member who serves as your thesis adviser
  • you are more self-directed, are good at managing your own projects with very little supervision, and have a clear direction for your work
  • you have a project that requires more time to pursue

Students can enroll in MSDS 590 Thesis as long as there is an approved thesis project proposal, identified thesis adviser, and all other required documentation at least two weeks before the start of any term.

From Faculty Director, Thomas W. Miller, PhD


“Capstone projects and thesis research give students a chance to study topics of special interest to them. Students can highlight analytical skills developed in the program. Work on capstone and thesis research projects often leads to publications that students can highlight on their resumes.”

A thesis is an individual research project that usually takes two to four terms to complete. Capstone course sections, on the other hand, represent a one-term commitment.

Students need to evaluate their options prior to choosing a capstone course section because capstones vary widely from one instructor to the next. There are both general and specialization-focused capstone sections. Some capstone sections offer individual research projects, others offer team research projects, and a few give students a choice of individual or team projects.

Students should refer to the SPS Graduate Student Handbook for more information regarding registration for either MSDS 590 Thesis or MSDS 498 Capstone.

Capstone Experience

If students wish to engage with an outside organization to work on a project for capstone, they can refer to this checklist and lessons learned for some helpful tips.

Capstone Checklist

  • Start early — set aside a minimum of one to two months prior to the capstone quarter to determine the industry and modeling interests.
  • Networking — pitch your idea to potential organizations for projects and focus on the business benefits you can provide.
  • Permission request — make sure your final project can be shared with others in the course and the information can be made public.
  • Engagement — engage with the capstone professor prior to and immediately after getting the dataset to ensure appropriate scope for the 10 weeks.
  • Teambuilding — recruit team members who have similar interests for the type of project during the first week of the course.

Capstone Lessons Learned

  • Access to company data can take longer than expected; not having this access before or at the start of the term can severely delay progress
  • Project timeline should align with coursework timeline as closely as possible
  • Designate one point of contact (POC) for business-facing communication to ensure streamlined messages and more effective time management with the organization
  • Manage expectations on both sides: for the business, the work is pro bono; for students, it does not guarantee internship or job opportunities
  • Data security or masking that is not completed in time can put the whole opportunity at risk

Publication of Work

Northwestern University Libraries offers an option for students to publish their master’s thesis or capstone in Arch, Northwestern’s open access research and data repository.

Benefits for publishing your thesis:

  • Your work will be indexed by search engines and discoverable by researchers around the world, extending your work’s impact beyond Northwestern
  • Your work will be assigned a Digital Object Identifier (DOI) to ensure perpetual online access and to facilitate scholarly citation
  • Your work will help accelerate discovery and increase knowledge in your subject domain by adding to the global corpus of public scholarly information

Get started:

  • Visit Arch online
  • Log in with your NetID
  • Describe your thesis: title, author, date, keywords, rights, license, subject, etc.
  • Upload your thesis or capstone PDF and any related supplemental files (data, code, images, presentations, documentation, etc.)
  • Select a visibility: Public, Northwestern-only, Embargo (i.e. delayed release)
  • Save your work to the repository

Your thesis manuscript or capstone report will then be published on the MSDS page. You can view other published work here.

For questions or support in publishing your thesis or capstone, please contact [email protected].

Grad Coach

Research Topics & Ideas: Data Science

50 Topic Ideas To Kickstart Your Research Project


If you’re just starting out exploring data science-related topics for your dissertation, thesis or research project, you’ve come to the right place. In this post, we’ll help kickstart your research by providing a hearty list of data science and analytics-related research ideas, including examples from recent studies.

PS – This is just the start…

We know it’s exciting to run through a list of research topics, but please keep in mind that this list is just a starting point. The topic ideas provided here are intentionally broad and generic, so you will need to develop them further. Nevertheless, they should inspire some ideas for your project.

To develop a suitable research topic, you’ll need to identify a clear and convincing research gap, and a viable plan to fill that gap. If this sounds foreign to you, check out our free research topic webinar that explores how to find and refine a high-quality research topic, from scratch. Alternatively, consider our 1-on-1 coaching service.


Data Science-Related Research Topics

  • Developing machine learning models for real-time fraud detection in online transactions.
  • The use of big data analytics in predicting and managing urban traffic flow.
  • Investigating the effectiveness of data mining techniques in identifying early signs of mental health issues from social media usage.
  • The application of predictive analytics in personalizing cancer treatment plans.
  • Analyzing consumer behavior through big data to enhance retail marketing strategies.
  • The role of data science in optimizing renewable energy generation from wind farms.
  • Developing natural language processing algorithms for real-time news aggregation and summarization.
  • The application of big data in monitoring and predicting epidemic outbreaks.
  • Investigating the use of machine learning in automating credit scoring for microfinance.
  • The role of data analytics in improving patient care in telemedicine.
  • Developing AI-driven models for predictive maintenance in the manufacturing industry.
  • The use of big data analytics in enhancing cybersecurity threat intelligence.
  • Investigating the impact of sentiment analysis on brand reputation management.
  • The application of data science in optimizing logistics and supply chain operations.
  • Developing deep learning techniques for image recognition in medical diagnostics.
  • The role of big data in analyzing climate change impacts on agricultural productivity.
  • Investigating the use of data analytics in optimizing energy consumption in smart buildings.
  • The application of machine learning in detecting plagiarism in academic works.
  • Analyzing social media data for trends in political opinion and electoral predictions.
  • The role of big data in enhancing sports performance analytics.
  • Developing data-driven strategies for effective water resource management.
  • The use of big data in improving customer experience in the banking sector.
  • Investigating the application of data science in fraud detection in insurance claims.
  • The role of predictive analytics in financial market risk assessment.
  • Developing AI models for early detection of network vulnerabilities.


Data Science Research Ideas (Continued)

  • The application of big data in public transportation systems for route optimization.
  • Investigating the impact of big data analytics on e-commerce recommendation systems.
  • The use of data mining techniques in understanding consumer preferences in the entertainment industry.
  • Developing predictive models for real estate pricing and market trends.
  • The role of big data in tracking and managing environmental pollution.
  • Investigating the use of data analytics in improving airline operational efficiency.
  • The application of machine learning in optimizing pharmaceutical drug discovery.
  • Analyzing online customer reviews to inform product development in the tech industry.
  • The role of data science in crime prediction and prevention strategies.
  • Developing models for analyzing financial time series data for investment strategies.
  • The use of big data in assessing the impact of educational policies on student performance.
  • Investigating the effectiveness of data visualization techniques in business reporting.
  • The application of data analytics in human resource management and talent acquisition.
  • Developing algorithms for anomaly detection in network traffic data.
  • The role of machine learning in enhancing personalized online learning experiences.
  • Investigating the use of big data in urban planning and smart city development.
  • The application of predictive analytics in weather forecasting and disaster management.
  • Analyzing consumer data to drive innovations in the automotive industry.
  • The role of data science in optimizing content delivery networks for streaming services.
  • Developing machine learning models for automated text classification in legal documents.
  • The use of big data in tracking global supply chain disruptions.
  • Investigating the application of data analytics in personalized nutrition and fitness.
  • The role of big data in enhancing the accuracy of geological surveying for natural resource exploration.
  • Developing predictive models for customer churn in the telecommunications industry.
  • The application of data science in optimizing advertisement placement and reach.

Recent Data Science-Related Studies

While the ideas we’ve presented above are a decent starting point for finding a research topic, they are fairly generic and non-specific. So, it helps to look at actual studies in the data science and analytics space to see how this all comes together in practice.

Below, we’ve included a selection of recent studies to help refine your thinking. These are actual studies, so they can provide some useful insight as to what a research topic looks like in practice.

  • Data Science in Healthcare: COVID-19 and Beyond (Hulsen, 2022)
  • Auto-ML Web-application for Automated Machine Learning Algorithm Training and evaluation (Mukherjee & Rao, 2022)
  • Survey on Statistics and ML in Data Science and Effect in Businesses (Reddy et al., 2022)
  • Visualization in Data Science VDS @ KDD 2022 (Plant et al., 2022)
  • An Essay on How Data Science Can Strengthen Business (Santos, 2023)
  • A Deep study of Data science related problems, application and machine learning algorithms utilized in Data science (Ranjani et al., 2022)
  • You Teach WHAT in Your Data Science Course?!? (Posner & Kerby-Helm, 2022)
  • Statistical Analysis for the Traffic Police Activity: Nashville, Tennessee, USA (Tufail & Gul, 2022)
  • Data Management and Visual Information Processing in Financial Organization using Machine Learning (Balamurugan et al., 2022)
  • A Proposal of an Interactive Web Application Tool QuickViz: To Automate Exploratory Data Analysis (Pitroda, 2022)
  • Applications of Data Science in Respective Engineering Domains (Rasool & Chaudhary, 2022)
  • Jupyter Notebooks for Introducing Data Science to Novice Users (Fruchart et al., 2022)
  • Towards a Systematic Review of Data Science Programs: Themes, Courses, and Ethics (Nellore & Zimmer, 2022)
  • Application of data science and bioinformatics in healthcare technologies (Veeranki & Varshney, 2022)
  • TAPS Responsibility Matrix: A tool for responsible data science by design (Urovi et al., 2023)
  • Data Detectives: A Data Science Program for Middle Grade Learners (Thompson & Irgens, 2022)
  • MACHINE LEARNING FOR NON-MAJORS: A WHITE BOX APPROACH (Mike & Hazzan, 2022)
  • COMPONENTS OF DATA SCIENCE AND ITS APPLICATIONS (Paul et al., 2022)
  • Analysis on the Application of Data Science in Business Analytics (Wang, 2022)

As you can see, these research topics are a lot more focused than the generic topic ideas we presented earlier. So, to develop a high-quality research topic, you’ll need to get laser-focused on a specific context with specific variables of interest.


Chapman University Digital Commons


Computational and Data Sciences (PhD) Dissertations

Below is a selection of dissertations from the Doctor of Philosophy in Computational and Data Sciences program in Schmid College that have been included in Chapman University Digital Commons. Additional dissertations from years prior to 2019 are available through the Leatherby Libraries' print collection or in Proquest's Dissertations and Theses database.

Dissertations from 2023

Computational Analysis of Antibody Binding Mechanisms to the Omicron RBD of SARS-CoV-2 Spike Protein: Identification of Epitopes and Hotspots for Developing Effective Therapeutic Strategies , Mohammed Alshahrani

Integration of Computer Algebra Systems and Machine Learning in the Authoring of the SANYMS Intelligent Tutoring System , Sam Ford

Voluntary Action and Conscious Intention , Jake Gavenas

Random Variable Spaces: Mathematical Properties and an Extension to Programming Computable Functions , Mohammed Kurd-Misto

Computational Modeling of Superconductivity from the Set of Time-Dependent Ginzburg-Landau Equations for Advancements in Theory and Applications , Iris Mowgood

Application of Machine Learning Algorithms for Elucidation of Biological Networks from Time Series Gene Expression Data , Krupa Nagori

Stochastic Processes and Multi-Resolution Analysis: A Trigonometric Moment Problem Approach and an Analysis of the Expenditure Trends for Diabetic Patients , Isaac Nwi-Mozu

Applications of Causal Inference Methods for the Estimation of Effects of Bone Marrow Transplant and Prescription Drugs on Survival of Aplastic Anemia Patients , Yesha M. Patel

Causal Inference and Machine Learning Methods in Parkinson's Disease Data Analysis , Albert Pierce

Causal Inference Methods for Estimation of Survival and General Health Status Measures of Alzheimer’s Disease Patients , Ehsan Yaghmaei

Dissertations from 2022

Computational Approaches to Facilitate Automated Interchange between Music and Art , Rao Hamza Ali

Causal Inference in Psychology and Neuroscience: From Association to Causation , Dehua Liang

Advances in NLP Algorithms on Unstructured Medical Notes Data and Approaches to Handling Class Imbalance Issues , Hanna Lu

Novel Techniques for Quantifying Secondhand Smoke Diffusion into Children's Bedroom , Sunil Ramchandani

Probing the Boundaries of Human Agency , Sook Mun Wong

Dissertations from 2021

Predicting Eye Movement and Fixation Patterns on Scenic Images Using Machine Learning for Children with Autism Spectrum Disorder , Raymond Anden

Forecasting the Prices of Cryptocurrencies using a Novel Parameter Optimization of VARIMA Models , Alexander Barrett

Applications of Machine Learning to Facilitate Software Engineering and Scientific Computing , Natalie Best

Exploring Behaviors of Software Developers and Their Code Through Computational and Statistical Methods , Elia Eiroa Lledo

Assessing the Re-Identification Risk in ECG Datasets and an Application of Privacy Preserving Techniques in ECG Analysis , Arin Ghazarian

Multi-Modal Data Fusion, Image Segmentation, and Object Identification using Unsupervised Machine Learning: Conception, Validation, Applications, and a Basis for Multi-Modal Object Detection and Tracking , Nicholas LaHaye

Machine-Learning-Based Approach to Decoding Physiological and Neural Signals , Elnaz Lashgari

Learning-Based Modeling of Weather and Climate Events Related To El Niño Phenomenon via Differentiable Programming and Empirical Decompositions , Justin Le

Quantum State Estimation and Tracking for Superconducting Processors Using Machine Learning , Shiva Lotfallahzadeh Barzili

Novel Applications of Statistical and Machine Learning Methods to Analyze Trial-Level Data from Cognitive Measures , Chelsea Parlett

Optimal Analytical Methods for High Accuracy Cardiac Disease Classification and Treatment Based on ECG Data , Jianwei Zheng

Dissertations from 2020

Development of Integrated Machine Learning and Data Science Approaches for the Prediction of Cancer Mutation and Autonomous Drug Discovery of Anti-Cancer Therapeutic Agents , Steven Agajanian

Allocation of Public Resources: Bringing Order to Chaos , Lance Clifner

A Novel Correction for the Adjusted Box-Pierce Test — New Risk Factors for Emergency Department Return Visits within 72 hours for Children with Respiratory Conditions — General Pediatric Model for Understanding and Predicting Prolonged Length of Stay , Sidy Danioko

A Computational and Experimental Examination of the FCC Incentive Auction , Logan Gantner

Exploring the Employment Landscape for Individuals with Autism Spectrum Disorders using Supervised and Unsupervised Machine Learning , Kayleigh Hyde

Integrated Machine Learning and Bioinformatics Approaches for Prediction of Cancer-Driving Gene Mutations , Oluyemi Odeyemi

On Quantum Effects of Vector Potentials and Generalizations of Functional Analysis , Ismael L. Paiva

Long Term Ground Based Precipitation Data Analysis: Spatial and Temporal Variability , Luciano Rodriguez

Gaining Computational Insight into Psychological Data: Applications of Machine Learning with Eating Disorders and Autism Spectrum Disorder , Natalia Rosenfield

Connecting the Dots for People with Autism: A Data-driven Approach to Designing and Evaluating a Global Filter , Viseth Sean

Novel Statistical and Machine Learning Methods for the Forecasting and Analysis of Major League Baseball Player Performance , Christopher Watkins

Dissertations from 2019

Contributions to Variable Selection in Complexly Sampled Case-control Models, Epidemiology of 72-hour Emergency Department Readmission, and Out-of-site Migration Rate Estimation Using Pseudo-tagged Longitudinal Data , Kyle Anderson

Bias Reduction in Machine Learning Classifiers for Spatiotemporal Analysis of Coral Reefs using Remote Sensing Images , Justin J. Gapper

Estimating Auction Equilibria using Individual Evolutionary Learning , Kevin James

Employing Earth Observations and Artificial Intelligence to Address Key Global Environmental Challenges in Service of the SDGs , Wenzhao Li

Image Restoration using Automatic Damaged Regions Detection and Machine Learning-Based Inpainting Technique , Chloe Martin-King

Theses from 2017

Optimized Forecasting of Dominant U.S. Stock Market Equities Using Univariate and Multivariate Time Series Analysis Methods , Michael Schwartz


  • Thesis Option

Data Science master’s students can choose to satisfy the research experience requirement by selecting the thesis option. Students will spend the majority of their second year working on a substantial data science project that culminates in the submission and oral defense of a master’s thesis. While all thesis projects must be related to data science, students are given leeway in finding a project in a domain of study that fits with their background and interest.

All students choosing the thesis option must find a research advisor and submit a thesis proposal by mid-April of their first year of study. Thesis proposals will be evaluated by the Data Science faculty committee and only those students whose proposals are accepted will be allowed to continue with the thesis option.  

To account for the time spent on thesis research, students choosing the thesis option are able to substitute AC 302 for three required courses: the Capstone and two "free" elective courses (as defined in the final bullet point on the degree requirement page).


17 Compelling Machine Learning Ph.D. Dissertations


Machine Learning Modeling Research. Posted by Daniel Gutierrez, ODSC, August 12, 2021

Working in the field of data science, I’m always seeking ways to keep current, and there are a number of important resources available for this purpose: new book titles, blog articles, conference sessions, Meetups, webinars/podcasts, not to mention the gems floating around in social media. But to dig even deeper, I routinely look at what’s coming out of the world’s research labs. And one great way to keep a pulse on what the research community is working on is to monitor the flow of new machine learning Ph.D. dissertations. Admittedly, many such theses are laser-focused and narrow, but from previous experience reading these documents, you can learn an awful lot about new ways to solve difficult problems over a vast range of problem domains.

In this article, I present a number of hand-picked machine learning dissertations that I found compelling in terms of my own areas of interest and aligned with problems that I’m working on. I hope you’ll find a number of them that match your own interests. Each dissertation may be challenging to consume but the process will result in hours of satisfying summer reading. Enjoy!

Please check out my previous data science dissertation round-up article . 

1. Fitting Convex Sets to Data: Algorithms and Applications

This machine learning dissertation concerns the geometric problem of finding a convex set that best fits a given data set. The overarching question serves as an abstraction for data-analytical tasks arising in a range of scientific and engineering applications with a focus on two specific instances: (i) a key challenge that arises in solving inverse problems is ill-posedness due to a lack of measurements. A prominent family of methods for addressing such issues is based on augmenting optimization-based approaches with a convex penalty function so as to induce a desired structure in the solution. These functions are typically chosen using prior knowledge about the data. The thesis also studies the problem of learning convex penalty functions directly from data for settings in which we lack the domain expertise to choose a penalty function. The solution relies on suitably transforming the problem of learning a penalty function into a fitting task; and (ii) the problem of fitting tractably-described convex sets given the optimal value of linear functionals evaluated in different directions.

2. Structured Tensors and the Geometry of Data

This machine learning dissertation analyzes data to build a quantitative understanding of the world. Linear algebra is the foundation of algorithms, dating back one hundred years, for extracting structure from data. Modern technologies provide an abundance of multi-dimensional data, in which multiple variables or factors can be compared simultaneously. To organize and analyze such data sets we can use a tensor, the higher-order analogue of a matrix. However, many theoretical and practical challenges arise in extending linear algebra to the setting of tensors. The first part of the thesis studies and develops the algebraic theory of tensors. The second part of the thesis presents three algorithms for tensor data. The algorithms use algebraic and geometric structure to give guarantees of optimality.

3. Statistical approaches for spatial prediction and anomaly detection

This machine learning dissertation is primarily a description of three projects. It starts with a method for spatial prediction and parameter estimation for irregularly spaced and non-Gaussian data. It is shown that by judiciously replacing the likelihood with an empirical likelihood in the Bayesian hierarchical model, approximate posterior distributions for the mean and covariance parameters can be obtained. Due to the complex nature of the hierarchical model, standard Markov chain Monte Carlo methods cannot be applied to sample from the posterior distributions. To overcome this issue, a generalized sequential Monte Carlo algorithm is used. Finally, this method is applied to iron concentrations in California. The second project focuses on anomaly detection for functional data, specifically for functional data where the observed functions may lie over different domains. Each function is approximated as a low-rank sum of spline basis functions, and the coefficients are compared for each basis across each function. The idea is that if two functions are similar, then their respective coefficients should not be significantly different. This project concludes with an application of the proposed method to detect anomalous behavior of users of a supercomputer at NREL. The final project is an extension of the second project to two-dimensional data. This project aims to detect location and temporal anomalies from ground motion data from a fiber-optic cable using distributed acoustic sensing (DAS).

4. Sampling for Streaming Data

Advances in data acquisition technology pose challenges in analyzing large volumes of streaming data. Sampling is a natural yet powerful tool for analyzing such data sets due to its competitive estimation accuracy and low computational cost. Unfortunately, sampling methods and their statistical properties for streaming data, especially streaming time series data, are not well studied in the literature. Meanwhile, estimating the dependence structure of multidimensional streaming time-series data in real time is challenging. With large volumes of streaming data, the problem becomes more difficult when the multidimensional data are collected asynchronously across distributed nodes, which motivates sampling representative data points from streams. This machine learning dissertation proposes a series of leverage score-based sampling methods for streaming time series data. Simulation studies and real data analyses are conducted to validate the proposed methods. A theoretical analysis of the asymptotic behavior of the least-squares estimator based on the subsamples is also developed.
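
To make the idea of leverage score-based sampling concrete, here is a minimal sketch for the simplest possible setting: a batch (not streaming) simple linear regression with an intercept. It is not the dissertation's algorithm. The function names, the sample size `r`, and the simulated data are all assumptions made for illustration. Points are drawn with probability proportional to their leverage, and the subsample is then fitted by weighted least squares with weights 1 / (r * p_i) to undo the sampling bias.

```python
import random


def leverage_scores(x):
    """Leverage h_i of each point in a simple linear regression with intercept."""
    n = len(x)
    mean = sum(x) / n
    sxx = sum((v - mean) ** 2 for v in x)
    return [1.0 / n + (v - mean) ** 2 / sxx for v in x]


def leverage_subsample(x, y, r, seed=0):
    """Draw r points with replacement, with probability proportional to leverage.

    Returns (x_sub, y_sub, weights); weights = 1 / (r * p_i) correct for the
    unequal sampling probabilities in a weighted least-squares fit.
    """
    h = leverage_scores(x)
    total = sum(h)  # equals 2 here: one intercept plus one slope
    probs = [v / total for v in h]
    rng = random.Random(seed)
    idx = rng.choices(range(len(x)), weights=probs, k=r)
    x_sub = [x[i] for i in idx]
    y_sub = [y[i] for i in idx]
    weights = [1.0 / (r * probs[i]) for i in idx]
    return x_sub, y_sub, weights


def weighted_least_squares(x, y, w):
    """Closed-form weighted fit of y = a + b * x; returns (a, b)."""
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    b = sxy / sxx
    return ybar - b * xbar, b


if __name__ == "__main__":
    # Simulated data (made up for illustration): y = 2 + 3x + noise.
    rng = random.Random(1)
    x = [rng.gauss(0, 1) for _ in range(5000)]
    y = [2.0 + 3.0 * xi + rng.gauss(0, 0.5) for xi in x]
    xs, ys, ws = leverage_subsample(x, y, r=500)
    print(weighted_least_squares(xs, ys, ws))
```

High-leverage points sit far from the mean of the predictor and carry the most information about the slope, which is why sampling them more often can preserve estimation accuracy with far fewer observations.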

5.  Statistical Machine Learning Methods for Complex, Heterogeneous Data

This machine learning dissertation develops statistical machine learning methodology for three distinct tasks. Each method blends classical statistical approaches with machine learning methods to provide principled solutions to problems with complex, heterogeneous data sets. The first framework proposes two methods for high-dimensional shape-constrained regression and classification. These methods reshape pre-trained prediction rules to satisfy shape constraints like monotonicity and convexity. The second method provides a nonparametric approach to the econometric analysis of discrete choice. This method provides a scalable algorithm for estimating utility functions with random forests, and combines this with random effects to properly model preference heterogeneity. The final method draws inspiration from early work in statistical machine translation to construct embeddings for variable-length objects like mathematical equations.

6. Topics in Multivariate Statistics with Dependent Data

This machine learning dissertation comprises four chapters. The first is an introduction to the topics of the dissertation and the remaining chapters contain the main results. Chapter 2 gives new results for consistency of maximum likelihood estimators with a focus on multivariate mixed models. The presented theory builds on the idea of using subsets of the full data to establish consistency of estimators based on the full data. The theory is applied to two multivariate mixed models for which it was unknown whether maximum likelihood estimators are consistent. In Chapter 3 an algorithm is proposed for maximum likelihood estimation of a covariance matrix when the corresponding correlation matrix can be written as the Kronecker product of two lower-dimensional correlation matrices. The proposed method is fully likelihood-based. Some desirable properties of separable correlation in comparison to separable covariance are also discussed. Chapter 4 is concerned with Bayesian vector auto-regressions (VARs). A collapsed Gibbs sampler is proposed for Bayesian VARs with predictors and the convergence properties of the algorithm are studied. 

7.  Model Selection and Estimation for High-dimensional Data Analysis

In the era of big data, uncovering useful information and hidden patterns in the data is prevalent in different fields. However, it is challenging to effectively select input variables in data and estimate their effects. The goal of this machine learning dissertation is to develop reproducible statistical approaches that provide mechanistic explanations of the phenomena observed in big data analysis. The research contains two parts: variable selection and model estimation. The first part investigates how to measure and interpret the usefulness of an input variable using an approach called “variable importance learning” and builds tools (methodology and software) that can be widely applied. Two variable importance measures are proposed, a parametric measure SOIL and a non-parametric measure CVIL, using the ideas of model combining and cross-validation, respectively. The SOIL method is theoretically shown to have the inclusion/exclusion property: when the model weights are properly concentrated around the true model, the SOIL importance can well separate the variables in the true model from the rest. The CVIL method possesses desirable theoretical properties and enhances the interpretability of many mysterious but effective machine learning methods. The second part focuses on how to estimate the effect of a useful input variable in the case where an interaction between two input variables exists. It investigates the minimax rate of convergence for regression estimation in high-dimensional sparse linear models with two-way interactions, and constructs an adaptive estimator that achieves the minimax rate of convergence regardless of the true heredity condition and the sparsity indices.

8.  High-Dimensional Structured Regression Using Convex Optimization

While the term “Big Data” can have multiple meanings, this dissertation considers the type of data in which the number of features can be much greater than the number of observations (also known as high-dimensional data). High-dimensional data is abundant in contemporary scientific research due to the rapid advances in new data-measurement technologies and computing power. Recent advances in statistics have witnessed great development in the field of high-dimensional data analysis. This machine learning dissertation proposes three methods that study three different components of a general framework of the high-dimensional structured regression problem. A general theme of the proposed methods is that they cast a certain structured regression as a convex optimization problem. In so doing, the theoretical properties of each method can be well studied, and efficient computation is facilitated. Each method is accompanied by a thorough theoretical analysis of its performance, and also by an R package containing its practical implementation. It is shown that the proposed methods perform favorably (both theoretically and practically) compared with pre-existing methods.

9. Asymptotics and Interpretability of Decision Trees and Decision Tree Ensembles

Decision trees and decision tree ensembles are widely used nonparametric statistical models. A decision tree is a binary tree that recursively segments the covariate space along the coordinate directions to create hyper rectangles as basic prediction units for fitting constant values within each of them. A decision tree ensemble combines multiple decision trees, either in parallel or in sequence, in order to increase model flexibility and accuracy, as well as to reduce prediction variance. Despite the fact that tree models have been extensively used in practice, results on their asymptotic behaviors are scarce. This machine learning dissertation presents analyses of tree asymptotics from the perspectives of tree terminal nodes, tree ensembles, and models incorporating tree ensembles, respectively. The study introduces a few new tree-related learning frameworks that provide provable statistical guarantees and interpretations. A study of the Gini index used in the greedy tree building algorithm reveals its limiting distribution, leading to the development of a test of better splitting that helps to measure the uncertain optimality of a decision tree split. This test is combined with the concept of decision tree distillation, which implements a decision tree to mimic the behavior of a black box model, to generate stable interpretations by guaranteeing a unique distillation tree structure as long as there are sufficiently many random sample points. Mild modification and regularization are also applied to standard tree boosting to create a new boosting framework named Boulevard. Also included is the integration of two new mechanisms: honest trees, which isolate the tree terminal values from the tree structure, and adaptive shrinkage, which scales the boosting history to create an equally weighted ensemble. This theoretical development provides the prerequisite for the practice of statistical inference with boosted trees. Lastly, the thesis investigates the feasibility of incorporating existing semi-parametric models with tree boosting.
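
For readers unfamiliar with the Gini index mentioned above, the following is a minimal sketch of the standard Gini impurity and a greedy single-feature split search. It does not implement the dissertation's limiting-distribution result or its test of better splitting. The function names and the toy data are assumptions chosen for illustration.

```python
from collections import Counter


def gini(labels):
    """Gini impurity 1 - sum_k p_k^2 of a collection of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(x, y):
    """Greedy search for the threshold t (rule: x <= t) with the largest
    decrease in size-weighted Gini impurity.

    Returns (threshold, impurity_decrease); threshold is None when no split
    improves on the parent node.
    """
    n = len(x)
    parent = gini(y)
    pairs = sorted(zip(x, y))
    best_t, best_gain = None, 0.0
    for i in range(1, n):
        # Only split between two distinct feature values.
        if pairs[i - 1][0] == pairs[i][0]:
            continue
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        child = (len(left) * gini(left) + len(right) * gini(right)) / n
        gain = parent - child
        if gain > best_gain:
            best_t = (pairs[i - 1][0] + pairs[i][0]) / 2
            best_gain = gain
    return best_t, best_gain


if __name__ == "__main__":
    # Toy data: the classes separate perfectly between x = 2 and x = 3.
    print(best_split([1, 2, 3, 4], [0, 0, 1, 1]))
```

Because the chosen split is the maximizer of a noisy sample quantity, two candidate splits can have nearly equal impurity decreases, which is the kind of uncertainty a formal test on the Gini index is meant to quantify.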

10. Bayesian Models for Imputing Missing Data and Editing Erroneous Responses in Surveys

This dissertation develops Bayesian methods for handling unit nonresponse, item nonresponse, and erroneous responses in large-scale surveys and censuses containing categorical data. The focus is on applications of nested household data where individuals are nested within households and certain combinations of the variables are not allowed, such as the U.S. Decennial Census, as well as surveys subject to both unit and item nonresponse, such as the Current Population Survey.

11. Localized Variable Selection with Random Forest  

Due to recent advances in computer technology, the cost of collecting and storing data has dropped drastically. This makes it feasible to collect large amounts of information for each data point. This increasing trend in feature dimensionality justifies the need for research on variable selection. Random forest (RF) has demonstrated the ability to select important variables and model complex data. However, simulations confirm that in some cases it fails to detect less influential features in the presence of variables with large impacts. This dissertation proposes two algorithms for localized variable selection: clustering-based feature selection (CBFS) and locally adjusted feature importance (LAFI). Both methods aim to find regions where the effects of weaker features can be isolated and measured. CBFS combines RF variable selection with a two-stage clustering method to detect variables whose effect can be detected only in certain regions. LAFI, on the other hand, uses a binary tree approach to split data into bins based on response variable rankings, and implements RF to find important variables in each bin. A larger LAFI is assigned to variables that get selected in more bins. Simulations and real data sets are used to evaluate these variable selection methods.

12. Functional Principal Component Analysis and Sparse Functional Regression

The focus of this dissertation is on functional data which are sparsely and irregularly observed. Such data require special consideration, as classical functional data methods and theory were developed for densely observed data. As is the case in much of functional data analysis, the functional principal components (FPCs) play a key role in current sparse functional data methods via the Karhunen-Loève expansion. Thus, after a review of relevant background material, this dissertation is divided roughly into two parts, the first focusing specifically on theoretical properties of FPCs, and the second on regression for sparsely observed functional data.

13. Essays In Causal Inference: Addressing Bias In Observational And Randomized Studies Through Analysis And Design

In observational studies, identifying assumptions may fail, often quietly and without notice, leading to biased causal estimates. Although less of a concern in randomized trials where treatment is assigned at random, bias may still enter the equation through other means. This dissertation has three parts, each developing new methods to address a particular pattern or source of bias in the setting being studied. The first part extends the conventional sensitivity analysis methods for observational studies to better address patterns of heterogeneous confounding in matched-pair designs. The second part develops a modified difference-in-difference design for comparative interrupted time-series studies. The method permits partial identification of causal effects when the parallel trends assumption is violated by an interaction between group and history. The method is applied to a study of the repeal of Missouri’s permit-to-purchase handgun law and its effect on firearm homicide rates. The final part presents a study design to identify vaccine efficacy in randomized control trials when there is no gold standard case definition. The approach augments a two-arm randomized trial with natural variation of a genetic trait to produce a factorial experiment. 
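
As background for the second part, here is a minimal sketch of the conventional difference-in-differences estimate, which is only valid under the parallel trends assumption. It is not the modified design proposed in the dissertation. The function name and all numbers are made up for illustration.

```python
def did_estimate(treated_pre, treated_post, control_pre, control_post):
    """Conventional difference-in-differences estimate.

    (change in treated group mean) - (change in control group mean).
    Identifies the treatment effect only if both groups would have followed
    parallel trends in the absence of treatment.
    """
    def mean(values):
        return sum(values) / len(values)

    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change


if __name__ == "__main__":
    # Made-up outcome rates: the treated group rises by 5, the control by 1.
    effect = did_estimate(
        treated_pre=[10.0, 12.0],
        treated_post=[15.0, 17.0],
        control_pre=[8.0, 10.0],
        control_post=[9.0, 11.0],
    )
    print(effect)  # 4.0
```

If something other than the treatment affects the two groups differently over the same period, the control group's change no longer stands in for the treated group's counterfactual change, and this point estimate is biased. That is the situation in which partial identification becomes useful.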

14. Bayesian Shrinkage: Computation, Methods, and Theory

Sparsity is a standard structural assumption that is made while modeling high-dimensional statistical parameters. This assumption essentially entails a lower-dimensional embedding of the high-dimensional parameter, thus enabling sound statistical inference. Apart from this obvious statistical motivation, in many modern applications of statistics, such as genomics and neuroscience, parameters of interest are indeed of this nature. For almost two decades, spike-and-slab type priors have been the Bayesian gold standard for modeling sparsity. However, due to their computational bottlenecks, shrinkage priors have emerged as a powerful alternative. This family of priors can almost exclusively be represented as scale mixtures of Gaussian distributions, and posterior Markov chain Monte Carlo (MCMC) updates of related parameters are then relatively easy to design. Although shrinkage priors were tipped as having computational scalability in high dimensions, when the number of parameters is in the thousands or more, they do come with their own computational challenges. Standard MCMC algorithms implementing shrinkage priors generally scale cubically in the dimension of the parameter, severely limiting real-life application of these priors.

The first chapter of this dissertation addresses this computational issue and proposes an alternative exact posterior sampling algorithm whose complexity scales linearly in the ambient dimension. The algorithm developed in the first chapter is specifically designed for regression problems. The second chapter develops a Bayesian method based on shrinkage priors for high-dimensional multiple response regression. Chapter three chooses a specific member of the shrinkage family known as the horseshoe prior and studies its convergence rates in several high-dimensional models.

15.  Topics in Measurement Error Analysis and High-Dimensional Binary Classification

This dissertation proposes novel methods to tackle two problems: the misspecified model with measurement error and high-dimensional binary classification, both of which have a crucial impact on applications in public health. The first problem arises in epidemiological practice. Epidemiologists often categorize a continuous risk predictor since categorization is thought to be more robust and interpretable, even when the true risk model is not a categorical one. Thus, their goal is to fit the categorical model and interpret the categorical parameters. The second project considers the problem of high-dimensional classification between two groups with unequal covariance matrices. Rather than estimating the full quadratic discriminant rule, it is proposed to perform simultaneous variable selection and linear dimension reduction on the original data, with the subsequent application of quadratic discriminant analysis on the reduced space. Further, in order to support the proposed methodology, two R packages were developed, CCP and DAP, along with two vignettes as long-format illustrations of their usage.

16. Model-Based Penalized Regression

This dissertation contains three chapters that consider penalized regression from a model-based perspective, interpreting penalties as assumed prior distributions for unknown regression coefficients. The first chapter shows that treating a lasso penalty as a prior can facilitate the choice of tuning parameters when standard methods for choosing the tuning parameters are not available, and when it is necessary to choose multiple tuning parameters simultaneously. The second chapter considers a possible drawback of treating penalties as models, specifically possible misspecification. The third chapter introduces structured shrinkage priors for dependent regression coefficients which generalize popular independent shrinkage priors. These can be useful in various applied settings where many regression coefficients are not only expected to be nearly or exactly equal to zero, but also structured.
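
The reading of a penalty as a prior can be illustrated in the simplest case: a single coefficient observed with Gaussian noise under a Laplace prior. The sketch below is not taken from the dissertation. The function names and the parameters `noise_var` and `prior_scale` are assumptions for illustration. The negative log posterior is (z - beta)^2 / (2 * noise_var) + |beta| / prior_scale, which is a lasso objective with tuning parameter lam = noise_var / prior_scale, and its minimizer is the soft-thresholding rule.

```python
def soft_threshold(z, lam):
    """Minimizer of 0.5 * (z - beta)**2 + lam * abs(beta) over beta."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def laplace_prior_map(z, noise_var, prior_scale):
    """MAP estimate of beta when z ~ Normal(beta, noise_var) and beta has a
    Laplace(0, prior_scale) prior.

    Equivalent to a lasso fit with tuning parameter noise_var / prior_scale.
    """
    return soft_threshold(z, noise_var / prior_scale)


if __name__ == "__main__":
    # A tighter prior (smaller scale) implies a larger penalty and more shrinkage.
    print(laplace_prior_map(3.0, noise_var=1.0, prior_scale=1.0))  # 2.0
    print(laplace_prior_map(3.0, noise_var=1.0, prior_scale=0.25))  # 0.0
```

Under this view, choosing the lasso tuning parameter amounts to estimating a noise variance and a prior scale, which is one way a model-based perspective can help when cross-validation is impractical.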

17. Topics on Least Squares Estimation

This dissertation revisits and makes progress on some old but challenging problems concerning least squares estimation, the workhorse of supervised machine learning. Two major problems are addressed: (i) least squares estimation with heavy-tailed errors, and (ii) least squares estimation in non-Donsker classes. Problem (i) is studied both from a worst-case perspective and from a more refined envelope perspective. For (ii), two case studies are performed in the context of (a) estimation involving sets and (b) estimation of multivariate isotonic functions. Understanding these particular aspects of least squares estimation problems requires several new tools in empirical process theory, including a sharp multiplier inequality controlling the size of the multiplier empirical process, and matching upper and lower bounds for empirical processes indexed by non-Donsker classes.

How to Learn More about Machine Learning

At our upcoming event this November 16th-18th in San Francisco, ODSC West 2021 will feature a plethora of talks, workshops, and training sessions on machine learning and machine learning research. You can register now for 50% off all ticket types before the discount drops to 40% in a few weeks. Some highlighted sessions on machine learning include:

  • Towards More Energy-Efficient Neural Networks? Use Your Brain!: Olaf de Leeuw | Data Scientist | Dataworkz
  • Practical MLOps: Automation Journey: Evgenii Vinogradov, PhD | Head of DHW Development | YooMoney
  • Applications of Modern Survival Modeling with Python: Brian Kent, PhD | Data Scientist | Founder The Crosstab Kite
  • Using Change Detection Algorithms for Detecting Anomalous Behavior in Large Systems: Veena Mendiratta, PhD | Adjunct Faculty, Network Reliability and Analytics Researcher | Northwestern University

Sessions on MLOps:

  • Tuning Hyperparameters with Reproducible Experiments: Milecia McGregor | Senior Software Engineer | Iterative
  • MLOps… From Model to Production: Filipa Peleja, PhD | Lead Data Scientist | Levi Strauss & Co
  • Operationalization of Models Developed and Deployed in Heterogeneous Platforms: Sourav Mazumder | Data Scientist, Thought Leader, AI & ML Operationalization Leader | IBM
  • Develop and Deploy a Machine Learning Pipeline in 45 Minutes with Ploomber: Eduardo Blancas | Data Scientist | Fidelity Investments

Sessions on Deep Learning:

  • GANs: Theory and Practice, Image Synthesis With GANs Using TensorFlow: Ajay Baranwal | Center Director | Center for Deep Learning in Electronic Manufacturing, Inc
  • Machine Learning With Graphs: Going Beyond Tabular Data: Dr. Clair J. Sullivan | Data Science Advocate | Neo4j
  • Deep Dive into Reinforcement Learning with PPO using TF-Agents & TensorFlow 2.0: Oliver Zeigermann | Software Developer | embarc Software Consulting GmbH
  • Get Started with Time-Series Forecasting using the Google Cloud AI Platform: Karl Weinmeister | Developer Relations Engineering Manager | Google

Daniel Gutierrez, ODSC

Daniel D. Gutierrez is a practicing data scientist who has been working with data since long before the field came into vogue. As a technology journalist, he enjoys keeping a pulse on this fast-paced industry. Daniel is also an educator, having taught data science, machine learning and R classes at the university level. He has authored four computer industry books on database and data science technology, including his most recent title, “Machine Learning and Data Science: An Introduction to Statistical Learning Methods with R.” Daniel holds a BS in Mathematics and Computer Science from UCLA.

LIBRARIES | ARCH

Data science masters theses.

The Master of Science in Data Science program requires the successful completion of 12 courses to obtain a degree. These requirements cover six core courses, a leadership or project management course, two required courses corresponding to a declared specialization, two electives, and a capstone project or thesis. This collection contains a selection of masters theses or capstone projects by MSDS graduates.

Instructions for MSc Thesis

Before the thesis.

Before you start work on your thesis, it is important to put some thought into the choice of topic and familiarize yourself with the criteria and procedure. To do that, follow these steps, in this order:

Step 0: Read the university instructions.

Read the MSc thesis instructions and grading criteria on the university website. Computer Science Master's program: [link]. Data Science Master's program: [link].

Step 1: Choose a topic.

Choose a topic among the ones listed on the group's webpage [link].

You can also propose your own topic. In this case, you must explain what the main contribution of the thesis will be and identify at least one scientific publication that is related to the topic you propose.

Step 2: Contact us.

Submit the application form [link] to let us know of your interest in doing your thesis in the group. Note: If you contact us, then please be ready to start work on the thesis within one month.

Step 3: Agree on the topic.

We have a brief discussion about the topic and devise a high-level plan for thesis work and content. We also discuss a start date, when you start work on the thesis. In addition, you should contact a second evaluator for the thesis.

Thesis timeline

Below you find the milestones after you have started work on the thesis. In parentheses, you find an estimate of when each milestone occurs. The thesis work ends when you submit it for approval. The total duration from start to end of the thesis should be about four months.

Milestone #0: Thesis outline (at most 3 weeks from the start).

You create a first outline of the thesis. The outline should contain the titles of the chapters, along with a (tentative) list of sections and contents. An indicative template for the outline is shown below on this page.

Milestone #1: A draft with first results (about 2 months from start).

All chapters should contain some readable content (not necessarily polished). Most importantly, some results should already be described. Ideally, you should be able to complete and refine the results within one more month.

Milestone #2: A draft with all results (about 1 month before the end).

Most content should now be in the draft. Some polishing remains and some results may still be refined. Notify the second evaluator that you are near the end of the thesis work. Optionally, you may send the thesis draft and receive preliminary comments from the second evaluator.

Milestone #3: Submit the thesis for approval (end of thesis work).

You will receive a grade and comments after the next program board's meeting.

Supervision

What you can expect from the supervisor:

  • Comments for the thesis draft after each milestone (see timeline above) and, if necessary, a meeting.
  • Suggestions for how to proceed in cases when you encounter a major hurdle.

In addition, you are welcome to participate in the group meetings and discuss your thesis work with other group members.

Note however that one of the grading criteria for the thesis is whether you worked independently -- and in the end, the thesis should be your own work.

Template for Thesis Outline

Below you find a suggested template for the outline of the thesis. You may adapt it to your work, of course (e.g., change chapter titles or structure).

Abstract

A summary of the thesis that mentions the broader topic of the thesis and why it is important; the research question or technical problem addressed by the thesis; the main thesis contributions (e.g., data gathering, developed methods and algorithms, experimental evaluation) and results.

Chapter 1: Introduction

The introduction should motivate the thesis and give a longer summary. It should be written in a way that allows anyone in your program to understand it, even if they are not experts in the topic.

  • What is the broader topic of the thesis?
  • Why is it important?
  • What research question(s) or technical problems does the thesis address?
  • What are the most related works from the literature on the topic? How does the thesis differ from what has already been done?
  • What are the main thesis contributions (e.g., data gathering, developed methods and algorithms, experimental evaluation)?
  • What are the results?

Chapter 2: Related literature

Organize this chapter in sections, with one section for each research area that is related to your thesis. For each research area, cite all the publications that are related to your topic, and describe at least the most important of them.

Chapter 3: Preliminaries

In this chapter, place the information that is necessary for you to describe the contributions and results of the thesis. It may be different from thesis to thesis, but could include sections about:

  • Setting. Define the terms and notation you will be using. State any assumptions you make across the thesis.
  • Background on Methods. Describe existing methods from the literature (e.g., algorithms or ML models) that you use for your work.
  • Data (esp. for a Data Science thesis). If the main contribution is data analysis, then describe the data here, before the analysis.

Chapter 4: Methodological contribution

For a Computer Science thesis, this part typically describes the algorithm(s) developed for the thesis. For a Data Science thesis, this part typically describes the method for the analysis.

Chapter 5: Results

This chapter describes the results obtained when the methods of Chapter 4 are used on data.

For a Computer Science thesis, this part typically describes the performance of the developed algorithm(s) on various synthetic and real datasets. For a Data Science thesis, this part typically describes the findings of the analysis.

The chapter should also describe what insights are obtained from the results.

Chapter 6: Conclusion

  • Summarize the contribution of the thesis.
  • Provide an evaluation: are the results conclusive, are there limitations in the contribution?
  • How would you extend the thesis, what can be done next on the same topic?

MSc in Data Science, Project Guide, 2018-2019

NEW: List of project areas is available!

Introduction

The project is an essential component of the Masters course. It is a substantial piece of full-time independent research in some area of data science. You will carry out your project under the individual supervision of a member of CDT staff.

The project will occupy a large part of your time during the Spring semester, and 100% of your time from late May/early June — once your examinations have completed — until mid-August. A dissertation describing the work must be submitted by a deadline in mid-August.

Choosing a Project

You are expected to choose a project at the end of Semester 1. Students are expected to find their own projects in consultation with supervisors. To help with this, staff will post some project ideas in late October. These will be indicative of their areas of interest, but they shouldn't be interpreted as a fixed menu; they are simply the starting point for discussion. The procedure for project selection is:

  • You should identify some research areas that interest you, on the basis of your coursework so far, your independent reading, the guest lectures in IRDS, and the set of project ideas proposed by staff in late October.
  • Arrange meetings with supervisors in those research areas to discuss potential MSc projects. Often supervisors will have several potential project ideas in mind, but you should of course bring up any potential directions that you have been thinking about.
  • IMPORTANT: Once you have identified a project and supervisor who is willing to take you on, you will need to fill out a brief form identifying the topic and supervisor. The deadline for this is the 12th of December, 2018.
  • The project proposals will all be reviewed for suitability by the CDT project coordinator. However, your proposal is not a contract and we are not going to hold you to it. It should simply represent a good-faith attempt to identify a topic of mutual interest to you and your supervisor.

Schedule and Important Dates

The overall schedule is: You will meet with supervisors during Semester 1 and select a project shortly after Semester 1 classes end. Once you have selected a project, we recommend that you get a head start on your project over the winter break. During Semester 2, you will work approximately 50% on coursework and 50% on your project. After classes end in Semester 2, you will have a revision period for your exams — during this period we recommend that you focus on your exams. Once the exams complete, you should return to your project work, spending 100% time on it until the final deadline in mid-August.

Here are the important dates and deadlines for 2018-19:

  • November -- You should start meeting with potential MSc supervisors now (if you have not begun already)
  • 12 December 2018 -- MSc project selections due (RTDS students).
  • 11 January 2019, noon -- Interim Report due (RTDS+ students).
  • 1 March 2019, noon -- Interim Report due (RTDS students).
  • April - May 2019 -- Revision period and exams. During this period we would not expect you to be making much progress on your project.
  • late May 2019 -- Begin full time work on project.
  • mid-August 2019 (exact date TBD, probably 16 Aug) -- Deadline for submission of dissertation.
  • October 2019 -- Board of Examiners meets and marks announced.

Supervision

As part of choosing a project, you will also choose a supervisor. Your supervisor gives technical advice and also assists you in planning the research. Students should expect approximately weekly meetings with their supervisor. Backup supervisors may be allocated to cover periods of absence of the supervisor, if necessary.

Interim Report

At the beginning of March (or early January for RTDS+), you will submit an interim report about how your project has gone so far. This should be 6-8 pages. This report will not form part of the mark; it is solely for feedback, so it is in your best interest to complete it. The report should describe the research problem that you are considering, explain why it is important, what methods you expect to use, how you expect to evaluate your results, what results you have been able to obtain so far, and what your plans are for the summer. You should write this in such a way that you can re-use the text in your final MSc project report.

Relationship to Your PhD Project

The MSc project is designed to be a first research project that prepares you for the more extended work that you will do in your PhD. The project is intended to be novel research — we hope that in some cases the MSc projects will lead to publishable results, although this is not required and will not always be possible, depending on the nature of the project. Your supervisor should help you identify a topic that has the potential to lead into a larger PhD project, should you decide to continue research in the area.

That said, it is not required that your PhD research be in the same area as your MSc research. Some students will indeed continue their PhD work with the same research area and supervisor as their MSc. Others will choose a different PhD supervisor. Both of these outcomes are expected and are perfectly fine.

Of course if you do already have a good idea about your intended PhD topic, you will want to take this into account when selecting your MSc topic — whether it be to choose a topic in the same area, or to choose a topic that will provide you with complementary experience.

Projects with External Collaborators

Some students may wish to undertake a project which relates to the activities of one of our external partners. Alternatively, some projects that supervisors suggest to you may have a natural relationship with one of the CDT partners. This is encouraged. A student undertaking such a project will still need to find an academic supervisor who is willing to take on the project. During the project phase, students working on such projects have both an academic supervisor and a designated contact at the partner organization.

We strongly encourage you to discuss your projects with other students, talk informally about your progress, and get advice from your peers about any issues. Last year this happened as part of the CDT Tea meetings; this year, we will discuss whether to continue this or to have more formal tutorials.

The Dissertation

  • Title page with abstract.
  • Introduction: an introduction to the document, clearly stating the hypothesis or objective of the project, motivation for the work and the results achieved. The structure of the remainder of the document should also be outlined.
  • Background: background to the project, previous work, exposition of relevant literature, setting of the work in the proper context. This should contain sufficient information to allow the reader to appreciate the contribution you have made.
  • Description of the work undertaken: this may be divided into chapters describing the conceptual design work and the actual implementation separately. Any problems or difficulties and the suggested solutions should be mentioned. Alternative solutions and their evaluation should also be included.
  • Analysis or Evaluation: results and their critical analysis should be reported, whether the results conform to expectations or otherwise and how they compare with other related work. Where appropriate, evaluation of the work against the original objectives should be presented.
  • Conclusion: concluding remarks and observations, unsolved problems, suggestions for further work.
  • Bibliography.

In addition, the dissertation must be accompanied by a statement declaring that the student has read and understood the University's plagiarism guidelines.

In the acknowledgments section of your dissertation, in addition to thanking anyone that you wish, you should also acknowledge the funding sources that have supported you during the year. Please follow these instructions for acknowledging your funding sources. You should get to know them well as you will also need to follow them for every paper that you publish during your PhD.

Students should write as they go, but should also budget several weeks towards the end of the project to focus on writing. Where appropriate the dissertation may additionally contain appendices in which relevant program listings, experimental data, circuit diagrams, formal proofs, etc. may be included. However, students should keep in mind that they are marked on the quality of the dissertation, not its length.

The dissertation must be word-processed using either LaTeX or a system with similar capabilities. The LaTeX thesis template can be found via the local packages web page. You don't have to use these packages, but your thesis must match the style (i.e., font size, text width etc) shown in the sample output for an Informatics thesis.

Computing Resources

Many projects will require computing resources. Please see the CDT handbook for information about what computing resources are available to CDT students.

If a project requires anything more, this needs to be requested at the time of writing the proposal, and the supervisor needs to explicitly ask for additional resources if necessary (start by talking to the CDT projects organizer, below).

Technical problems during project work are only considered for resources we provide. No technical support, compensation for lost data, or extension for lost time will be given for technical problems with external hardware or software, except where this is explicitly stated as part of a project specification and adequately resourced at the start of the project.

Students must submit their project by the deadline in mid-August (see above). Students need to submit a hard copy, an electronic copy and an archive of their software, as detailed below.

  • Hard Copy. Two printed copies of the dissertation, bound with the soft covers provided by the School, must be submitted to the ITO before the deadline.
  • Electronic Copy. Students must follow the instructions for how to submit their project electronically. Please use the online submission form that is linked from there.
  • Software. Students are required to preserve any software they have generated (source, object and make files), together with any associated data that has been accumulated. When you submit the electronic copy of your thesis you will also be asked to provide an archive file (tar or zip) containing all the project materials. You should create a directory, for example named PROJECT, in your file space specifically for the purpose. Please follow the accepted practice of creating a README file which documents your files and their function. This directory should be compressed and then submitted, together with the electronic version of the thesis, via the online submission webpage.
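
The archiving step above can be scripted. The following is only a minimal sketch, assuming a directory called PROJECT that already contains a README; the function name archive_project and the output file name are illustrative and not part of the official submission instructions.

```python
import tarfile
from pathlib import Path


def archive_project(project_dir, archive_path):
    """Compress a project directory into a gzipped tar archive.

    Refuses to run if the directory has no README, since the guide asks
    for one that documents the files and their function.
    """
    project_dir = Path(project_dir)
    if not (project_dir / "README").exists():
        raise FileNotFoundError("add a README describing the files first")
    with tarfile.open(archive_path, "w:gz") as tar:
        # arcname keeps paths inside the archive relative to PROJECT/
        tar.add(project_dir, arcname=project_dir.name)
    return Path(archive_path)


# Example call (paths are hypothetical):
# archive_project("PROJECT", "PROJECT.tar.gz")
```

A zip archive would work equally well; tar is used here only because it is one of the two formats the guide mentions.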

Project Assessment

  • Understanding of the problem
  • Completion of the work
  • Quality of the work
  • Quality of the dissertation
  • Knowledge of the literature
  • Critical evaluation of previous work
  • Critical evaluation of own work
  • Justification of design decisions
  • Solution of conceptual problems
  • Amount of work
  • Evidence of outstanding merit e.g. originality
  • Inclusion of material worthy of publication

The project involves both the application of skills learned in the past and the acquisition of new skills. It allows students to demonstrate their ability to organise and carry out a major piece of work according to sound scientific and engineering principles. The types of activity involved in each project will vary but all will typically share the following features:

  • Research the literature and gather background information
  • Analyse requirements, compare alternatives and specify a solution
  • Design and implement the solution
  • Experiment and evaluate the solution
  • Develop written and oral presentation skills

You may have noticed that there is both a 90pt version of the project (RTDS) and a 120pt version (RTDS+). The 120pt version is for students who have a previous Master's degree in an area relating to data science along with a clear project and a supervisor in mind when they arrive, and therefore want to take fewer classes and a larger project. If you wish to choose this option, you must speak to the CDT Year 1 organizer during course registration; see the MSc by Research Course Handbook for more information.

The RTDS+ project works the same as the RTDS project, except that: (a) You are expected to have selected a supervisor by 21 September; (b) You should commence work on your project part-time in the autumn; (c) You should submit an interim report by 11 January; and (d) The markers will look to see evidence of more work or a more advanced project, commensurate to the additional amount of time you have had. For example, a larger project might make a larger research contribution, apply more advanced methodology, contain more extensive experimental evaluation, etc.

This page is currently maintained by Adam Lopez.

37 Research Topics In Data Science To Stay On Top Of

Stewart Kaplan

  • February 22, 2024

As a data scientist, staying on top of the latest research in your field is essential.

The data science landscape changes rapidly, and new techniques and tools are constantly being developed.

To keep up with the competition, you need to be aware of the latest trends and topics in data science research.

In this article, we will provide an overview of 37 hot research topics in data science.

We will discuss each topic in detail, including its significance and potential applications.

These topics could be an idea for a thesis or simply topics you can research independently.

Stay tuned – this is one blog post you don’t want to miss!

37 Research Topics in Data Science

1.) Predictive Modeling

Predictive modeling is a significant portion of data science and a topic you must be aware of.

Simply put, it is the process of using historical data to build models that can predict future outcomes.

Predictive modeling has many applications, from marketing and sales to financial forecasting and risk management.

As businesses increasingly rely on data to make decisions, predictive modeling is becoming more and more important.

While it can be complex, predictive modeling is a powerful tool that gives businesses a competitive advantage.
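
To make "using historical data to predict future outcomes" concrete, here is a minimal sketch that fits a straight line to past observations and extrapolates one step ahead. The sales figures and the fit_line helper are made up for illustration, and a real project would normally reach for a library and validate the model on held-out data.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Hypothetical history: month index and units sold (made-up numbers).
months = [1, 2, 3, 4, 5]
units = [12, 14, 16, 18, 20]

slope, intercept = fit_line(months, units)
forecast_month_6 = slope * 6 + intercept  # prediction for the next month
print(forecast_month_6)
```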

2.) Big Data Analytics

These days, it seems like everyone is talking about big data.

And with good reason – organizations of all sizes are sitting on mountains of data, and they’re increasingly turning to data scientists to help them make sense of it all.

But what exactly is big data? And what does it mean for data science?

Simply put, big data is a term used to describe datasets that are too large and complex for traditional data processing techniques.

Big data typically refers to datasets of a few terabytes or more.

But size isn’t the only defining characteristic – big data is also characterized by its high Velocity (the speed at which data is generated), Variety (the different types of data), and Volume (the amount of information).

Given the enormity of big data, it’s not surprising that organizations are struggling to make sense of it all.

That’s where data science comes in.

Data scientists use various methods to wrangle big data, including distributed computing and other decentralized technologies.
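
As a rough illustration of the distributed idea, the sketch below splits a dataset into chunks, counts words in each chunk independently (the "map" step), and then merges the partial counts (the "reduce" step). Everything runs on one machine here, and the chunk contents are invented; in a real cluster each chunk would live on a different worker.

```python
from collections import Counter
from functools import reduce


def map_chunk(lines):
    """Map step: count words within one chunk of the data."""
    return Counter(word.lower() for line in lines for word in line.split())


def merge_counts(left, right):
    """Reduce step: combine two partial counts into one."""
    return left + right


# Two made-up chunks standing in for data stored on separate machines.
chunks = [
    ["big data is big", "data science"],
    ["science needs data"],
]

partials = [map_chunk(chunk) for chunk in chunks]
total = reduce(merge_counts, partials, Counter())
print(total.most_common(3))
```

The point of the design is that map_chunk never needs to see the whole dataset, so the chunks can be processed in parallel.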

With the help of data science, organizations are beginning to unlock the hidden value in their big data.

By harnessing the power of big data analytics, they can improve their decision-making, better understand their customers, and develop new products and services.

3.) Auto Machine Learning

Auto machine learning is a research topic in data science concerned with developing algorithms that can automatically learn from data without human intervention.

This area of research is vital because it allows data scientists to automate the process of writing code for every dataset.

This allows us to focus on other tasks, such as model selection and validation.

Auto machine learning algorithms can learn from data in a hands-off way for the data scientist – while still providing incredible insights.

This makes them a valuable tool for data scientists who either lack the skills for a particular analysis or are struggling to make progress on one.

4.) Text Mining

Text mining is a research topic in data science that deals with extracting information from text data.

This area of research is important because it allows us to get as much information as possible from the vast amount of text data available today.

Text mining techniques can extract information from text data, such as keywords, sentiments, and relationships.

This information can be used for various purposes, such as model building and predictive analytics.
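
Keyword extraction is the easiest of these to sketch. The example below simply counts words after dropping a tiny, hand-picked stopword list; the review text, the stopword set, and the top_keywords helper are all illustrative assumptions, and production systems usually rely on weighting schemes such as TF-IDF instead of raw counts.

```python
import re
from collections import Counter

# A deliberately tiny stopword list, for illustration only.
STOPWORDS = {"the", "a", "is", "and", "of", "to", "in", "it", "was"}


def top_keywords(text, k=3):
    """Return the k most frequent non-stopword tokens in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS)
    return [word for word, _ in counts.most_common(k)]


review = "The battery is great and the battery life is long. Great screen."
keywords = top_keywords(review, k=2)
print(keywords)
```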

5.) Natural Language Processing

Natural language processing is a data science research topic that analyzes human language data.

This area of research is important because it allows us to understand and make sense of the vast amount of text data available today.

Natural language processing techniques can build predictive and interactive models from any language data.

Natural language processing is pretty broad, and recent advances like GPT-3 have pushed this topic to the forefront.

6.) Recommender Systems

Recommender systems are an exciting topic in data science because they allow us to make better products, services, and content recommendations.

Businesses can better understand their customers and their needs by using recommender systems.

This, in turn, allows them to develop better products and services that meet the needs of their customers.

Recommender systems are also used to recommend content to users.

This can be done on an individual level or at a group level.

Think about Netflix, for example, always knowing what you want to watch!

Recommender systems are a valuable tool for businesses and users alike.
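
One simple way to produce an individual-level recommendation is user-based collaborative filtering: find users with similar tastes and suggest what they liked. The sketch below is a toy version under that assumption, with invented users and ratings; it is not how any particular streaming service works.

```python
from math import sqrt

# Made-up ratings on a 1-5 scale.
ratings = {
    "ana": {"Alien": 5, "Heat": 4, "Up": 1},
    "ben": {"Alien": 5, "Heat": 5, "Blade Runner": 5},
    "chloe": {"Up": 5, "Coco": 5, "Alien": 1},
}


def cosine(u, v):
    """Cosine similarity between two users' rating dictionaries."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[item] * v[item] for item in shared)
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v)


def recommend(user, ratings):
    """Suggest the unseen item with the highest similarity-weighted score."""
    scores = {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], other_ratings)
        for item, rating in other_ratings.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * rating
    return max(scores, key=scores.get) if scores else None


print(recommend("ana", ratings))
```

Here "ana" rates films much like "ben" does, so the item only "ben" has seen scores highest for her.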

7.) Deep Learning

Deep learning is a research topic in data science that deals with artificial neural networks.

These networks are composed of multiple layers, and each layer is formed from various nodes.

Deep learning networks learn from examples in a way loosely inspired by how humans learn, and they can fit a wide range of data distributions.

This makes them a valuable tool for data scientists looking to build models that can learn from data independently.

Deep learning has become very popular in recent years because of its ability to achieve state-of-the-art results on various tasks.

There seems to be a new SOTA deep learning algorithm research paper on https://arxiv.org/ every single day!

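
The "layers made of nodes" idea can be sketched in a few lines. This is only a forward pass through a tiny network with hand-picked weights and a sigmoid activation, both chosen purely for illustration; no training happens here, and real work would use a framework such as PyTorch or JAX.

```python
from math import exp


def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + exp(-z))


def dense(inputs, weights, biases):
    """One layer: every node takes a weighted sum of the inputs, adds its
    bias, and applies the sigmoid activation."""
    return [
        sigmoid(sum(w * x for w, x in zip(node_weights, inputs)) + b)
        for node_weights, b in zip(weights, biases)
    ]


# Hand-picked (not learned) parameters: 2 inputs -> 2 hidden nodes -> 1 output.
hidden_w = [[0.5, -0.5], [1.0, 1.0]]
hidden_b = [0.0, -1.0]
out_w = [[1.0, -1.0]]
out_b = [0.0]

x = [1.0, 1.0]
hidden = dense(x, hidden_w, hidden_b)  # first layer
output = dense(hidden, out_w, out_b)   # second layer consumes the first
print(hidden, output)
```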
8.) Reinforcement Learning

Reinforcement learning is a research topic in data science that deals with algorithms that can learn on multiple levels from interactions with their environment.

This area of research is essential because it allows us to develop algorithms that can learn non-greedy approaches to decision-making, helping businesses optimize for long-term outcomes rather than short-term gains.
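
A toy example shows what "non-greedy" means in practice. In the sketch below, an agent can take a small reward right away or wait one step for a larger one; tabular Q-learning (one standard reinforcement learning algorithm, picked here for illustration) learns that waiting is worth more. The task, rewards, and hyperparameters are all invented.

```python
import random

# Hypothetical two-step task: (state, action) -> (reward, next_state).
# A next_state of None ends the episode.
TRANSITIONS = {
    ("start", "quick"): (1.0, None),       # small reward now
    ("start", "patient"): (0.0, "later"),  # nothing now...
    ("later", "collect"): (10.0, None),    # ...bigger reward next step
}
ACTIONS = {"start": ["quick", "patient"], "later": ["collect"]}


def q_learning(episodes=200, alpha=0.5, gamma=0.9, seed=0):
    """Learn action values by repeatedly interacting with the task."""
    rng = random.Random(seed)
    q = {key: 0.0 for key in TRANSITIONS}
    for _ in range(episodes):
        state = "start"
        while state is not None:
            action = rng.choice(ACTIONS[state])  # explore uniformly at random
            reward, next_state = TRANSITIONS[(state, action)]
            if next_state is None:
                future = 0.0
            else:
                future = max(q[(next_state, a)] for a in ACTIONS[next_state])
            target = reward + gamma * future
            q[(state, action)] += alpha * (target - q[(state, action)])
            state = next_state
    return q


q = q_learning()
best_first_move = max(ACTIONS["start"], key=lambda a: q[("start", a)])
print(best_first_move)
```

A purely greedy rule would grab the immediate reward; the learned values instead account for the discounted reward that follows.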

9.) Data Visualization

Data visualization is an excellent research topic in data science because it allows us to see our data in a way that is easy to understand.

Data visualization techniques can be used to create charts, graphs, and other visual representations of data.

This allows us to see the patterns and trends hidden in our data.

Data visualization is also used to communicate results to others.

This allows us to share our findings with others in a way that is easy to understand.

There are many ways to contribute to and learn about data visualization.

Some ways include attending conferences, reading papers, and contributing to open-source projects.

10.) Predictive Maintenance

Predictive maintenance is a hot topic in data science because it allows us to prevent failures before they happen.

This is done using data analytics to predict when a failure will occur.

This allows us to take corrective action before the failure actually happens.

While this sounds simple, avoiding false positives while keeping recall is challenging and an area wide open for advancement.
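
The trade-off between false positives and recall is easy to see with numbers. The sketch below scores six hypothetical machines with a made-up failure risk and compares a strict alert threshold with a lenient one; the data, thresholds, and helper name are illustrative assumptions.

```python
def precision_recall(actual, predicted):
    """Precision: share of alerts that were real failures.
    Recall: share of real failures that triggered an alert."""
    tp = sum(1 for a, p in zip(actual, predicted) if a and p)
    fp = sum(1 for a, p in zip(actual, predicted) if not a and p)
    fn = sum(1 for a, p in zip(actual, predicted) if a and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up data: did each machine fail, and what risk score did a model give it?
failed = [True, True, False, False, False, True]
risk = [0.9, 0.4, 0.6, 0.1, 0.2, 0.8]

strict = precision_recall(failed, [r >= 0.7 for r in risk])
lenient = precision_recall(failed, [r >= 0.3 for r in risk])
print(strict, lenient)
```

The strict threshold raises no false alarms but misses a failure, while the lenient one catches every failure at the cost of a false alarm.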

11.) Financial Analysis

Financial analysis is an older topic that has been around for a while but is still a great field where contributions can be felt.

Current researchers are focused on analyzing macroeconomic data to make better financial decisions.

This is done by analyzing the data to identify trends and patterns.

Financial analysts can use this information to make informed decisions about where to invest their money.

Financial analysis is also used to predict future economic trends.

This allows businesses and individuals to prepare for potential financial hardships and enables companies to build up cash reserves during good economic conditions.

Overall, financial analysis is a valuable tool for anyone looking to make better financial decisions.

12.) Image Recognition

Image recognition is one of the hottest topics in data science because it allows us to identify objects in images.

This is done using artificial intelligence algorithms that can learn from data and understand what objects you’re looking for.

This allows us to build models that can accurately recognize objects in images and video.

This is a valuable tool for businesses and individuals who want to be able to identify objects in images.

Think about security, identification, routing, traffic, etc.

Image Recognition has gained a ton of momentum recently – for a good reason.

13.) Fraud Detection

Fraud detection is a great topic in data science because it allows us to identify fraudulent activity before it happens.

This is done by analyzing data to look for patterns and trends that may be associated with fraud.

Once our machine learning model recognizes some of these patterns in real time, it can flag the activity as potentially fraudulent right away.

This allows us to take corrective action before the fraud actually happens.

Fraud detection is a valuable tool for anyone who wants to protect themselves from potential fraudulent activity.
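
One of the simplest pattern checks is to flag amounts that sit far outside a customer's usual range. The sketch below uses a z-score cutoff for that; the transaction history, the cutoff of 3, and the flag_unusual helper are assumptions for illustration, and real systems combine many more signals.

```python
from statistics import mean, stdev


def flag_unusual(history, new_amounts, z_cutoff=3.0):
    """Return the new amounts that lie more than z_cutoff standard
    deviations away from the mean of past transactions."""
    mu, sigma = mean(history), stdev(history)
    return [amt for amt in new_amounts if abs(amt - mu) / sigma > z_cutoff]


# Made-up spending history for one customer, then three new transactions.
history = [20, 25, 22, 30, 27, 24, 26, 23]
flagged = flag_unusual(history, [28, 500, 21])
print(flagged)
```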

14.) Web Scraping

Web scraping is a controversial topic in data science because it allows us to collect data from the web, which is usually data you do not own.

This is done by extracting data from websites using scraping tools that are usually custom-programmed.

This allows us to collect data that would otherwise be inaccessible.

For obvious reasons, web scraping is a unique tool – giving you data your competitors would have no chance of getting.

I think there is an excellent opportunity to create new and innovative ways to make scraping accessible for everyone, not just those who understand Selenium and Beautiful Soup.

15.) Social Media Analysis

Social media analysis is not new; many people have already created exciting and innovative algorithms to study this.

However, it is still a great data science research topic because it allows us to understand how people interact on social media.

This is done by analyzing data from social media platforms to look for insights, bots, and recent societal trends.

Once we understand these practices, we can use this information to improve our marketing efforts.

For example, if we know that a particular demographic prefers a specific type of content, we can create more content that appeals to them.

Social media analysis is also used to understand how people interact with brands on social media.

This allows businesses to understand better what their customers want and need.

Overall, social media analysis is valuable for anyone who wants to improve their marketing efforts or understand how customers interact with brands.

16.) GPU Computing

GPU computing is a fun new research topic in data science because it allows us to process data much faster than traditional CPUs.

Due to how GPUs are made, they’re incredibly proficient at intense matrix operations, outperforming traditional CPUs by very high margins.

While the computation is fast, the coding is still tricky.

There is an excellent research opportunity to bring these innovations to non-traditional modules, allowing data science to take advantage of GPU computing outside of deep learning.

17.) Quantum Computing

Quantum computing is a new research topic in data science and physics because it allows us to process data much faster than traditional computers.

It also opens the door to new types of data.

There are some problems that simply can’t be solved on a classical computer.

For example, if you wanted to understand how a single atom moved around, a classical computer couldn’t handle this problem.

You’ll need to utilize a quantum computer to handle quantum mechanics problems.

This may be the “hottest” research topic on the planet right now, with some of the top researchers in computer science and physics worldwide working on it.

You could be too.

18.) Genomics

Genomics may be the only research topic that can compete with quantum computing regarding the “number of top researchers working on it.”

Genomics is a fantastic intersection of data science because it allows us to understand how genes work.

This is done by sequencing the DNA of different organisms to look for insights into our own and other species.

Once we understand these patterns, we can use this information to improve our understanding of diseases and create new and innovative treatments for them.

Genomics is also used to study the evolution of different species.

Genomics is the future and a field begging for new and exciting research professionals to take it to the next step.

19.) Location-based services

Location-based services are an old and time-tested research topic in data science.

Since GPS and 4G cell phone reception became a thing, we’ve been trying to stay informed about how humans interact with their environment.

This is done by analyzing data from GPS tracking devices, cell phone towers, and Wi-Fi routers to look for insights into how humans interact.

Once we understand these practices, we can use this information to improve our geotargeting efforts, improve maps, find faster routes, and improve cohesion throughout a community.

Location-based services are used to understand the user, something every business could always use a little bit more of.

While a seemingly “stale” field, location-based services have seen a revival period with self-driving cars.

20.) Smart City Applications

Smart city applications are all the rage in data science research right now.

By harnessing the power of data, cities can become more efficient and sustainable.

But what exactly are smart city applications?

In short, they are systems that use data to improve city infrastructure and services.

This can include anything from traffic management and energy use to waste management and public safety.

Data is collected from various sources, including sensors, cameras, and social media.

It is then analyzed to identify tendencies and habits.

This information can be used to make predictions about future needs and optimize city resources.

As more and more cities strive to become “smart,” the demand for data scientists with expertise in smart city applications is only growing.

21.) Internet Of Things (IoT)

The Internet of Things, or IoT, is an exciting new data science and sustainability research topic.

IoT is a network of physical objects embedded with sensors and connected to the internet.

These objects can include everything from alarm clocks to refrigerators; they’re all connected to the internet.

That means that they can share data with computers.

And that’s where data science comes in.

Data scientists are using IoT data to learn everything from how people use energy to how traffic flows through a city.

They’re also using IoT data to predict when an appliance will break down or when a road will be congested.

Really, the possibilities are endless.

With such a wide-open field, it’s easy to see why IoT is being researched by some of the top professionals in the world.

22.) Cybersecurity

Cybersecurity is a relatively new research topic in data science and in general, but it’s already garnering a lot of attention from businesses and organizations.

After all, with the increasing number of cyber attacks in recent years, it’s clear that we need to find better ways to protect our data.

While most of cybersecurity focuses on infrastructure, data scientists can leverage historical events to find potential exploits to protect their companies.

Sometimes, looking at a problem from a different angle helps, and that’s what data science brings to cybersecurity.

Also, data science can help to develop new security technologies and protocols.

As a result, cybersecurity is a crucial data science research area and one that will only become more important in the years to come.

23.) Blockchain

Blockchain is an incredible new research topic in data science for several reasons.

First, it is a distributed database technology that enables secure, transparent, and tamper-proof transactions.

Did someone say transmitting data?

This makes it an ideal platform for tracking data and transactions in various industries.

Second, blockchain is powered by cryptography, which not only makes it highly secure – but is a familiar foe for data scientists.

Finally, blockchain is still in its early stages of development, so there is much room for research and innovation.

As a result, blockchain is a great new research topic in data science that vows to revolutionize how we store, transmit and manage data.
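
The "tamper-proof" property comes from each block storing a cryptographic hash of the one before it, which is more precisely tamper-evident: edits are detectable rather than impossible. The sketch below shows only that hash-chaining idea, with invented records; it leaves out consensus, networking, and signatures, so it is not a real blockchain.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first block


def block_hash(index, data, prev_hash):
    """SHA-256 over the block's contents, including the previous hash."""
    payload = json.dumps(
        {"index": index, "data": data, "prev": prev_hash}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def build_chain(records):
    """Link records so that each block commits to the one before it."""
    chain, prev = [], GENESIS
    for i, data in enumerate(records):
        digest = block_hash(i, data, prev)
        chain.append({"index": i, "data": data, "prev": prev, "hash": digest})
        prev = digest
    return chain


def is_valid(chain):
    """Recompute every hash and check every link."""
    prev = GENESIS
    for block in chain:
        expected = block_hash(block["index"], block["data"], block["prev"])
        if block["prev"] != prev or block["hash"] != expected:
            return False
        prev = block["hash"]
    return True


chain = build_chain(["alice pays bob 5", "bob pays carol 2"])
valid_before = is_valid(chain)
chain[0]["data"] = "alice pays bob 500"  # tamper with an old record
valid_after = is_valid(chain)
print(valid_before, valid_after)
```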

24.) Sustainability

Sustainability is a relatively new research topic in data science, but it is gaining traction quickly.

To keep up with this demand, The Wharton School of the University of Pennsylvania has started to offer an MBA in Sustainability.

This demand isn’t shocking, and some of the reasons include the following:

Sustainability is an important issue that is relevant to everyone.

Datasets on sustainability are constantly growing and changing, making it an exciting challenge for data scientists.

There hasn’t been a “set way” to approach sustainability from a data perspective, making it an excellent opportunity for interdisciplinary research.

As data science grows, sustainability will likely become an increasingly important research topic.

25.) Educational Data

Education has always been a great topic for research, and with the advent of big data, educational data has become an even richer source of information.

By studying educational data, researchers can gain insights into how students learn, what motivates them, and what barriers these students may face.

Besides, data science can be used to develop educational interventions tailored to individual students’ needs.

Imagine being the researcher that helps that high schooler pass mathematics; what an incredible feeling.

With the increasing availability of educational data, data science has enormous potential to improve the quality of education.

26.) Politics

As data science continues to evolve, so does the scope of its applications.

Originally used primarily for business intelligence and marketing, data science is now applied to various fields, including politics.

By analyzing large data sets, political scientists (data scientists with a cooler name) can gain valuable insights into voting patterns, campaign strategies, and more.

Further, data science can be used to forecast election results and understand the effects of political events on public opinion.

With the wealth of data available, there is no shortage of research opportunities in this field.

As data science evolves, so does our understanding of politics and its role in our world.

27.) Cloud Technologies

Cloud technologies are a great research topic.

They allow for the outsourcing and sharing of computer resources and applications all over the internet.

This lets organizations save money on hardware and maintenance costs while providing employees access to the latest and greatest software and applications.

I believe there is an argument that AWS could be the greatest and most technologically advanced business ever built (Yes, I know it’s only part of the company).

Besides, cloud technologies can help improve team members’ collaboration by allowing them to share files and work on projects together in real-time.

As more businesses adopt cloud technologies, data scientists must stay up-to-date on the latest trends in this area.

By researching cloud technologies, data scientists can help organizations to make the most of this new and exciting technology.

28.) Robotics

Robotics has recently become a household name, and it’s for a good reason.

First, robotics deals with controlling and planning physical systems, an inherently complex problem.

Second, robotics requires various sensors and actuators to interact with the world, making it an ideal application for machine learning techniques.

Finally, robotics is an interdisciplinary field that draws on various disciplines, such as computer science, mechanical engineering, and electrical engineering.

As a result, robotics is a rich source of research problems for data scientists.

29.) Healthcare

Healthcare is an industry that is ripe for data-driven innovation.

Hospitals, clinics, and health insurance companies generate a tremendous amount of data daily.

This data can be used to improve the quality of care and outcomes for patients.

This is perfect timing, as the healthcare industry is undergoing a significant shift towards value-based care, which means there is a greater need than ever for data-driven decision-making.

As a result, healthcare is an exciting new research topic for data scientists.

There are many different ways in which data can be used to improve healthcare, and there is a ton of room for newcomers to make discoveries.

30.) Remote Work

There’s no doubt that remote work is on the rise.

In today’s global economy, more and more businesses are allowing their employees to work from home or anywhere else they can get a stable internet connection.

But what does this mean for data science? Well, for one thing, it opens up a whole new field of research.

For example, how does remote work impact employee productivity?

What are the best ways to manage and collaborate on data science projects when team members are spread across the globe?

And what are the cybersecurity risks associated with working remotely?

These are just a few of the questions that data scientists will be able to answer with further research.

So if you’re looking for a new topic to sink your teeth into, remote work in data science is a great option.

31.) Data-Driven Journalism

Data-driven journalism is an exciting new field of research that combines the best of both worlds: the rigor of data science with the creativity of journalism.

By applying data analytics to large datasets, journalists can uncover stories that would otherwise be hidden.

And telling these stories compellingly can help people better understand the world around them.

Data-driven journalism is still in its infancy, but it has already had a major impact on how news is reported.

In the future, it will only become more important as journalists become increasingly fluent in working with data.

It is an exciting new topic and research field for data scientists to explore.

32.) Data Engineering

Data engineering is a staple in data science, focusing on efficiently managing data.

Data engineers are responsible for developing and maintaining the systems that collect, process, and store data.

In recent years, there has been an increasing demand for data engineers as the volume of data generated by businesses and organizations has grown exponentially.

Data engineers must be able to design and implement efficient data-processing pipelines and have the skills to optimize and troubleshoot existing systems.

If you are looking for a challenging research topic that could have an immediate worldwide impact, then improving an existing approach or developing a new one in data engineering would be a good start.

33.) Data Curation

Data curation has been a hot topic in the data science community for some time now.

Curating data involves organizing, managing, and preserving data so researchers can use it.

Data curation can help to ensure that data is accurate, reliable, and accessible.

It can also help to prevent research duplication and to facilitate the sharing of data between researchers.

Data curation is a vital part of data science. In recent years, there has been an increasing focus on data curation, as it has become clear that it is essential for ensuring data quality.

As a result, data curation is now a major research topic in data science.

There are numerous books and articles on the subject, and many universities offer courses on data curation.

Data curation is an integral part of data science and will only become more important in the future.

34.) Meta-Learning

Meta-learning is gaining a ton of steam in data science. It’s learning how to learn.

So, if you can learn how to learn, you can learn anything much faster.

Meta-learning is mainly used in deep learning, as applications outside of this are generally pretty hard.

In deep learning, many parameters need to be tuned for a good model, and there’s usually a lot of data.

You can save time and effort if you can automatically and quickly do this tuning.

In machine learning, meta-learning can improve models’ performance by sharing knowledge between different models.

For example, if you have several different models that all solve the same problem, you can use meta-learning to share knowledge between them and improve the group’s overall performance.

I don’t know how anyone looking for a research topic could stay away from this field; it’s what the  Terminator  warned us about!

35.) Data Warehousing

A data warehouse is a system used for data analysis and reporting.

It is a central data repository created by combining data from multiple sources.

Data warehouses are often used to store historical data, such as sales data, financial data, and customer data.

This type of data can be used to create reports and perform statistical analysis.

Data warehouses also store data that the organization is not currently using.

This type of data can be used for future research projects.

Data warehousing is an incredible research topic in data science because it offers a variety of benefits.

Data warehouses help organizations to save time and money by reducing the need for manual data entry.

They also help to improve the accuracy of reports and provide a complete picture of the organization’s performance.

Data warehousing feels like one of the weakest parts of the Data Science Technology Stack; if you want a research topic that could have a monumental impact – data warehousing is an excellent place to look.

36.) Business Intelligence

Business intelligence aims to collect, process, and analyze data to help businesses make better decisions.

Business intelligence can improve marketing, sales, customer service, and operations.

It can also be used to identify new business opportunities and track competition.

BI is another tool in your company’s toolbox for staying ahead in your market.

Data science is the perfect tool for business intelligence because it combines statistics, computer science, and machine learning.

Data scientists can use business intelligence to answer questions like, “What are our customers buying?” or “What are our competitors doing?” or “How can we increase sales?”

Business intelligence is a great way to improve your business’s bottom line and an excellent opportunity to dive deep into a well-respected research topic.

37.) Crowdsourcing

One of the newest areas of research in data science is crowdsourcing.

Crowdsourcing is a process of sourcing tasks or projects to a large group of people, typically via the internet.

This can be done for various purposes, such as gathering data, developing new algorithms, or even just for fun (think: online quizzes and surveys).

But what makes crowdsourcing so powerful is that it allows businesses and organizations to tap into a vast pool of talent and resources they wouldn’t otherwise have access to.

And with the rise of social media, it’s easier than ever to connect with potential crowdsource workers worldwide.

Imagine if you could affect that by finding innovative ways to improve how people work together.

That would have a huge effect.

Final Thoughts, Are These Research Topics In Data Science For You?

Thirty-seven different research topics in data science are a lot to take in, but we hope you found a research topic that interests you.

If not, don’t worry – there are plenty of other great topics to explore.

The important thing is to get started with your research and find ways to apply what you learn to real-world problems.

We wish you the best of luck as you begin your data science journey!


Eindhoven University of Technology research portal

Data Science

  • Mathematics and Computer Science

Student theses

3D face reconstruction using deep learning

Supervisor: Medeiros de Carvalho, R. (Supervisor 1), Gallucci, A. (Supervisor 2) & Vanschoren, J. (Supervisor 2)

Student thesis : Master

3D fingerprint detection in ancient museum sculptures from CT data

Supervisor: van Liere, R. (Supervisor 1) & Jalba, A. C. (Supervisor 2)

Achieving Long Term Fairness through Curiosity Driven Reinforcement Learning: How intrinsic motivation influences fairness in algorithmic decision making

Supervisor: Pechenizkiy, M. (Supervisor 1), Gajane, P. (Supervisor 2) & Kapodistria, S. (Supervisor 2)

A Coherent Temporal Visualization of Algorithm Dynamics over Large Graphs

Supervisor: van de Wetering, H. M. M. (Supervisor 1)

Student thesis : Bachelor

A comparative study for process mining approaches in a real-life environment

Supervisor: Reijers, H. A. (Supervisor 1), Eshuis, H. (Supervisor 2), Gonzalez Lopez de Murillas, E. (Supervisor 2) & Vos, P. (External person) (External coach)

A comparative study on Unsupervised Deep Learning Methods for X-Ray Image denoising with Multi-Image Self2Self and Single Frequency Denoising

Supervisor: Tavakol, M. (Supervisor 1), Zhaorui, Y. (External person) (External coach) & Vilanova, A. (Supervisor 2)

A Comparison of Quantitative Evaluation and Human Perception of Quality of Generated Images of Faces

Supervisor: de Campos, C. (Supervisor 1)

A computational biology framework: a data analysis tool to support biomedical engineers in their research

Supervisor: Bosnacki, D. (Supervisor 1), Cheplygina, V. (Supervisor 2), Hilbers, P. A. J. (Supervisor 2), Fletcher, G. (Supervisor 2) & Vanschoren, J. (Supervisor 2)

Active learning for text classification

Supervisor: Vanschoren, J. (Supervisor 1) & Schaefers, K. (External person) (External coach)

Active learning in VAE latent space

Supervisor: Menkovski, V. (Supervisor 1), Portegies, J. W. (Supervisor 2) & Holenderski, M. J. (Supervisor 2)

Activity Recognition Using Deep Learning in Videos under Clinical Setting

Supervisor: Duivesteijn, W. (Supervisor 1), Papapetrou, O. (Supervisor 2), Zhang, L. (External person) (External coach) & Vasu, J. D. (External coach)

A Dashboard for emulating LSTM-based Predictive Process Monitoring and its Qualitative Evaluation

Supervisor: Fahland, D. (Supervisor 1)

A Data Cleaning Assistant

Supervisor: Vanschoren, J. (Supervisor 1)

A Data Cleaning Assistant for Machine Learning

Adding formal specifications to a legacy code generator.

Supervisor: Kurtev, I. (Supervisor 1), Alberts, W. (External coach) & Sidorova, N. (Supervisor 2)

A Deep Learning Approach for Clustering a Multi-Class Dataset

Supervisor: Pei, Y. (Supervisor 1), Marczak, M. (External person) (External coach) & Groen, J. (External person) (External coach)

A Detailed Understanding of Actor Involvement in Business Processes

Supervisor: Fahland, D. (Supervisor 1) & Verbeek, H. M. W. (Supervisor 2)

Adopting the factorized model of execution in a graph database engine

Supervisor: Yakovets, N. (Supervisor 1) & van de Wall, A. A. G. (Supervisor 2)

Advances in Understanding and Initializing Einsum Networks

Adversarial attacks on deep dreams.

Supervisor: Quaeghebeur, E. (Supervisor 1), Gala, G. (Supervisor 2), Joosse, R. (External person) (External coach) & Stoelinga, E. (External person) (External coach)

Adversarial datasets through sentence length and conjunctions

Adversarial NLP benchmarks: data characteristics complicating automated generation of adversarial examples

Supervisor: van Cauter, Z. M. (Supervisor 1) & de Campos, C. (Supervisor 2)

Adversarial Noise Benchmarking On Image Caption

Supervisor: de Campos, C. (Supervisor 1) & van Cauter, Z. M. (Supervisor 2)

Aerial Imagery Pixel-level Segmentation

Aethra db: optimising analytical processing through query-tailored code generation.

Supervisor: Bonetta, D. (Supervisor 1)

A Feasibility Study on Automated Database Exercise Generation with Large Language Models

Supervisor: Fletcher, G. H. L. (Supervisor 1)

A Forecasting Framework for Recirculation in Baggage Handling Systems

Supervisor: Fahland, D. (Supervisor 1) & Bernard, H. F. (External coach)

A framework for understanding business process remaining time predictions

Supervisor: Pechenizkiy, M. (Supervisor 1) & Scheepens, R. J. (Supervisor 2)

Age(ing) in software development

Supervisor: Serebrenik, A. (Supervisor 1), Baltes, S. (External person) (External coach), Constantinou, E. (Supervisor 2) & Fletcher, G. H. L. (Supervisor 2)

Aggregated Information Visualization for Process Alignments

Supervisor: van den Elzen, S. J. (Supervisor 1), Scheepens, R. J. (External coach) & van Dongen, B. F. (Supervisor 2)

A Heuristic Approach for the VRPTW using dual information of its LP formulation

Supervisor: Firat, M. (Supervisor 1), Medeiros de Carvalho, R. (Supervisor 2) & Hurkens, C. A. J. (Supervisor 2)

A Hybrid Model for Pedestrian Motion Prediction

Supervisor: Pechenizkiy, M. (Supervisor 1), Muñoz Sánchez, M. (Supervisor 2), Silvas, E. (External coach) & Smit, R. M. B. (External coach)

Algorithms for center-based trajectory clustering

Supervisor: Buchin, K. (Supervisor 1) & Driemel, A. (Supervisor 2)

Allocation Decision-Making in Service Supply Chain with Deep Reinforcement Learning

Supervisor: Zhang, Y. (Supervisor 1), van Jaarsveld, W. L. (Supervisor 2), Menkovski, V. (Supervisor 2) & Lamghari-Idrissi, D. (Supervisor 2)

A method for identifying undesired medical treatment variants using process and data mining techniques

Supervisor: Vanderfeesten, I. T. P. (Supervisor 1), Medeiros de Carvalho, R. (Supervisor 2) & Pechenizkiy, M. (Supervisor 2)

A Method to determine Actual Time Worked from Event Logs

Supervisor: van Dongen, B. F. (Supervisor 1)

An adaptive and scrutable math tutoring system

Supervisor: Stash, N. (Supervisor 1), De Bra, P. M. E. (Supervisor 2) & Huizing, C. (Supervisor 2)

An adversarial analysis of inference capabilities acquired by state-of-the-art NLP models from the RTE dataset

Analysis and improvement of process models with respect to key performance indicators: a debt collection case study.

Supervisor: de Leoni, M. (Supervisor 1), Schouten, M. (External person) (External coach), Duivesteijn, W. (Supervisor 2) & Türetken, O. (Supervisor 2)

Analysis of the influence of routines on task execution performance

Supervisor: Fahland, D. (Supervisor 1) & Klijn, E. L. (Supervisor 2)

Analyzing application usage logs to understand the users

Supervisor: Sidorova, N. (Supervisor 1), Chituc, C. M. (Supervisor 2), Lövei, P. (Supervisor 2) & Marchese, M. (External person) (External coach)

Analyzing Causes of Outlier Cascade Behavior in Baggage Handling Systems

Analyzing collaborations and routines in event graphs using statistics and pattern mining

Analyzing complexity progression and complexity correlation of SQL questions on Stack Overflow

Analyzing customer journey with process mining: from discovery to recommendations

Supervisor: Hassani, M. (Supervisor 1), Vitali, M. (External person) (External coach) & Carrá, A. (External person) (External coach)

Analyzing data of operating rooms in hospitals to reduce rework

Supervisor: Medeiros de Carvalho, R. (Supervisor 1) & Broeren, J. (External person) (External coach)

Analyzing Policy Gradient approaches towards Rapid Policy Transfer

Analyzing routines and habits in event graphs using statistics and pattern mining.

Supervisor: Fahland, D. (Supervisor 1) & Klijn, E. (Supervisor 2)

Ramapo College of New Jersey

MS Thesis Archive

EXAMINING DISEASE THROUGH MICROBIOME DATA ANALYSIS

Brett Van Tassel, M.S. Data Science

The objective of this project is to examine the relationship between gut microbiomes of human subjects having different disease statuses by examining microbial diversity shifts. Read analysis and data cleaning is recorded from beginning to end so that the unfiltered and unfettered data can be reanalyzed and processed. Here we strive to create a tool that works for well curated data. Data is gathered from the database QIITA and the read data and metadata are queried via the tool redbiom. The initial exploratory analysis involved an examination of metadata attributes. A heat map of correlating attributes of the metadata using Cramer’s V algorithm allows visual correlation examination. Next, we train random forests based on metadata of interest. Due to the large quantity of attributes, many random forests are trained, and their respective significance values and Receiver Operating Characteristic curves (ROC) are generated. ROC curves are used to isolate optimal correlations. This process is built into a pipeline, ultimately allowing the efficient, automated analysis and assignment of disease susceptibility. Alpha and beta diversity metrics are generated and plotted for visual interpretation using QIIME2, a microbial analysis software platform. CLOUD, a tool for finding microbiome outliers, is used to identify markers of dysbiosis and contamination, and to measure rates of successful identification. CLOUD was found to identify positive diagnoses where Random Forests did not when examining positive samples and their predicted diagnosis status. SMOTE was found to perform similarly or slightly poorer compared to random sampling as a data balancing technique.
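The metadata correlation step mentioned above relies on Cramér's V, which measures association between two categorical attributes as V = sqrt(chi² / (n · (min(r, c) − 1))). The following is a minimal sketch of that statistic using only the Python standard library; the function name and the toy labels are illustrative assumptions, not code from the thesis.

```python
import math
from collections import Counter


def cramers_v(xs, ys):
    """Cramér's V between two equal-length sequences of categorical labels."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(xs)
    row_totals = Counter(xs)          # marginal counts of the first attribute
    col_totals = Counter(ys)          # marginal counts of the second attribute
    observed = Counter(zip(xs, ys))   # joint counts (the contingency table)
    k = min(len(row_totals), len(col_totals))
    if k < 2:
        return 0.0  # convention chosen here: a constant attribute carries no association
    chi2 = 0.0
    for r, r_total in row_totals.items():
        for c, c_total in col_totals.items():
            expected = r_total * c_total / n
            chi2 += (observed[(r, c)] - expected) ** 2 / expected
    return math.sqrt(chi2 / (n * (k - 1)))


# Hypothetical metadata columns: diet type and disease status.
diet = ["vegan", "vegan", "omnivore", "omnivore"]
status_linked = ["healthy", "healthy", "ill", "ill"]
status_unlinked = ["healthy", "ill", "healthy", "ill"]

v_linked = cramers_v(diet, status_linked)      # perfectly associated columns
v_unlinked = cramers_v(diet, status_unlinked)  # independent columns
```

Computing this value for every pair of metadata attributes yields the matrix that a correlation heat map would display.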

Summer 2023

EVALUATING HOW NHL PLAYER SHOT SELECTION IMPACTS EVEN-STRENGTH GOAL OUTPUT OVER THE COURSE OF A FULL SEASON

Elliott Barinberg, M.S. Data Science

Within this thesis work, the applications of data collection, machine learning, and data visualization were used on National Hockey League (NHL) shot data collected between the 2014-2015 season and the 2022-2023 season. Modeling sports data to better understand player evaluation has always been a goal of sports analytics. In the modern era of sports analytics the techniques used to quantify impacts on games have multiplied. However, when it comes to ice hockey all the most difficult challenges of sports data analysis present themselves in trying to understand the player impacts of such a continuously changing game-state. The methods developed and presented in this work serve to highlight those challenges and better explain a player’s impact on goal scoring for their team.

Throughout this work there are multiple kinds of modeling techniques used to try to best demonstrate a player’s impact on goal scoring as a factor of all the elements the player is capable of controlling. We try to understand which players have the best offensive process and impact on goal-scoring by caring about the merit of the offensive opportunities they create. It is important to note that these models are not intended to re-create the results seen in reality, although reality and true results are used to evaluate the outputs.

This process used data scraping to collect the data from the NHL public application programming interface (API). Data cleansing techniques were applied to the collected data, yielding custom data sets which were used for the corresponding models. Data transformation techniques were used to calculate additional factors based upon the data provided, thus creating additional data within the training and testing datasets. Techniques including but not limited to linear regression, logistic regression, random forests and extreme gradient boosted regression were all used to attempt to model the true possibility of any particular even-strength event being a goal in the NHL. Then, using formulaic approaches the individual event model was extrapolated upon to draw larger conclusions. Lastly, some unique data visualization techniques were used to best present the outputs of these models. In all, many experimental models were created which have yielded a reproducible methodology upon which to evaluate the results of any NHL player impact upon goal scoring over the course of a season.
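As a rough illustration of modeling the probability that a single shot event becomes a goal, the sketch below fits a small logistic regression by gradient descent and then sums the per-shot probabilities into an aggregate. The two features (distance and angle), the made-up shot data, and the function names are assumptions for illustration only; the thesis's actual features and models are not reproduced here.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def goal_probability(weights, shot):
    """Predicted goal probability; weights[0] is the intercept."""
    z = weights[0] + sum(w * x for w, x in zip(weights[1:], shot))
    return sigmoid(z)


def fit_logistic(shots, goals, lr=0.1, epochs=2000):
    """Fit logistic-regression weights by batch gradient descent on log-loss."""
    weights = [0.0] * (len(shots[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for shot, goal in zip(shots, goals):
            err = goal_probability(weights, shot) - goal
            grad[0] += err
            for j, x in enumerate(shot):
                grad[j + 1] += err * x
        weights = [w - lr * g / len(shots) for w, g in zip(weights, grad)]
    return weights


# Made-up shots: (distance in tens of feet, angle in radians) and whether a goal resulted.
shots = [(0.8, 0.1), (1.0, 0.3), (1.5, 0.2), (2.0, 0.3), (3.0, 0.4),
         (3.5, 0.9), (4.5, 0.6), (5.0, 1.0), (5.5, 0.8), (6.0, 1.2)]
goals = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

weights = fit_logistic(shots, goals)

# Summing per-shot probabilities gives an aggregate "expected goals" style figure.
expected_goals = sum(goal_probability(weights, s) for s in shots)
close_shot = goal_probability(weights, (1.0, 0.2))
far_shot = goal_probability(weights, (5.5, 1.0))
```

Grouping the per-shot probabilities by player rather than summing over all shots is one way such an event-level model could be extrapolated to player-level conclusions.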

Spring 2023

BUILDING A STATISTICAL LEARNING MODEL FOR EVALUATION OF NBA PLAYERS USING PLAYER TRACKING DATA

Matthew Byman, M.S. Data Science

This thesis aims to develop faster and more accurate methods for evaluating NBA player performances by leveraging publicly available player tracking data. The primary research question addresses whether player tracking data can improve existing performance evaluation metrics. The ultimate goal is to enable teams to make better-informed decisions in player acquisitions and evaluations.

To achieve this objective, the study first acquired player tracking data for all available NBA seasons from 2013 to 2021. Regularized Adjusted Plus-Minus (RAPM) was selected as the target variable, as it effectively ranks player value over the long term. Five statistical learning models were employed to estimate RAPM using player tracking data as features. Furthermore, the coefficients of each feature were ranked, and the models were rerun with only the 30 most important features.

Once the models were developed, they were tested on newly acquired player tracking data from the 2022 season to evaluate their effectiveness in estimating RAPM. The key findings revealed that Lasso Regression and Random Forest models performed the best in predicting RAPM values. These models enable the use of player tracking statistics that settle earlier, providing an accurate estimate of future RAPM. This early insight into player performance offers teams a competitive advantage in player evaluations and acquisitions.

In conclusion, this study demonstrates that combining statistical learning models with player tracking data can effectively estimate performance metrics, such as RAPM, earlier in the season. By obtaining accurate RAPM estimates before other teams, organizations can identify and acquire top-performing players, ultimately enhancing their competitive edge in the NBA.
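The feature-ranking step described in this abstract (rank the coefficients, then rerun with only the most important features) can be sketched as follows. This is a minimal illustration that assumes coefficients from an already fitted linear model on standardized features, so that magnitudes are comparable; the feature names and values are hypothetical, and the thesis kept 30 features rather than the 2 used here.

```python
def top_k_features(coefficients, k):
    """Return the k feature names with the largest absolute coefficients."""
    ranked = sorted(coefficients.items(), key=lambda item: abs(item[1]), reverse=True)
    return [name for name, _ in ranked[:k]]


def keep_columns(rows, selected):
    """Reduce each row (a dict of feature values) to the selected features."""
    return [{name: row[name] for name in selected} for row in rows]


# Hypothetical coefficients from a fitted model on standardized tracking features.
coefficients = {
    "drives_per_game": 0.42,
    "touches": -0.05,
    "avg_speed": 0.0,          # Lasso-style models often shrink weak features to zero
    "contested_rebounds": -0.31,
}

selected = top_k_features(coefficients, k=2)

rows = [{"drives_per_game": 1.2, "touches": 0.3, "avg_speed": -0.4, "contested_rebounds": 0.9}]
reduced_rows = keep_columns(rows, selected)  # input for refitting the model
```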

BUILDING AN ML DRIVEN SYSTEM FOR REAL-TIME CODE-PERFORMANCE MONITORING

Mikhail Delyusto, M.S. Data Science

This project is a part of a multidirectional attempt to increase quality of the software and data product that is being produced by Science and Engineering departments of Aetion Inc., the company that is transforming the healthcare industry by providing its partners (major healthcare industry players) with a real-world evidence generation platform, that helps to drive greater safety, effectiveness, and value of health treatments. Large datasets (up to 100Tb each) of healthcare market data (for example, insurance claims) get ingested into the platform and get transformed into Aetion’s proprietary longitudinal format.

This attempt is being led by the Quality Engineering Team and is envisioned to move away from conventional testing techniques by decoupling different moving parts and isolating them in separate, maintainable and reliable tools.

A subject of this thesis is a particular branch of a large quality initiative that will be helping to continuously monitor a number of metrics that are involved in execution of the two most common types of jobs that run on Aetion’s platform: cohorts and analyses. These jobs may take up to a few hours to generate depending on the size of a dataset and the complexity of an analysis.

Once implemented, this monitoring system would be supplied with a feed of logs that contain certain data points, such as timestamps. Enhanced with a built-in algorithm that sets a threshold on the metrics and notifies its users (stakeholders from Engineering and Science) when that threshold is exceeded, it would be a game-changing capability in Aetion’s quality space. Currently, there is no way to tell whether a given job is taking significantly more or less time than expected, and most defects are identified in upper environments (including production).

The issues identified in upper environments are the costliest of all types and, by various industry estimates, can cost $5000 – $10000 each.

As a result of implementing said system we would expect a steep decrease in a number of issues in upper environments, as well as an increase in release frequency, that the organization will greatly benefit from.
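One simple way to realize the thresholding idea described in this abstract is to compare each job's duration with a band around the historical mean. The sketch below assumes a mean ± k standard deviations rule and invented durations; the abstract does not say which algorithm the actual system uses, so treat the rule, the names, and the numbers as assumptions.

```python
from statistics import mean, stdev


def duration_bounds(history_seconds, k=3.0):
    """Lower and upper alert thresholds: mean ± k sample standard deviations."""
    mu = mean(history_seconds)
    sigma = stdev(history_seconds)
    return mu - k * sigma, mu + k * sigma


def check_job(duration_seconds, history_seconds, k=3.0):
    """Classify a job run as 'slow', 'fast', or 'ok' relative to its history."""
    low, high = duration_bounds(history_seconds, k)
    if duration_seconds > high:
        return "slow"   # taking significantly more time than usual
    if duration_seconds < low:
        return "fast"   # suspiciously quick, e.g. a job that skipped work
    return "ok"


# Hypothetical durations (seconds) of previous runs of the same job type,
# as they might be derived from timestamps in the log feed.
history = [600, 620, 610, 590, 605, 615]

statuses = [check_job(d, history) for d in (608, 900, 300)]
```

Flagging both directions matters here because the abstract notes that a job finishing in significantly less time than usual can also indicate a defect.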

OPTIMIZING PRODUCT RECOMMENDATION DECISIONS USING SPATIAL ANALYSIS

Raul A. Hincapie, M.S. Data Science

At a certain Consumer Packaged Goods (CPG) company, there was a need to coordinate between sales, geographic location, and demographic datasets to make better-informed business decisions. One area that required this type of coordination was the replacement process of a specific product being sold to a store. The need for this type of replacement arises when a product is not authorized to be sold at the store, out of stock, permanently discontinued, or not selling at the intended rate. Previously, the process at this company relied on instinctual decision-making when it came to product replacements, which showed a need for this protocol to be more data-driven.

The premise of this project is to create a data-driven product replacement process. It would be a type of system where the CPG company inputs a store and a product then it would output a product list with suitable replacement items. The replacement items would be based on stores similar to the input store using its sales, geographic location, and demographic portfolio. By identifying these similar stores, it is possible that the CPG company could also discover product opportunities or niches for a specific store or region. With a system like this, the company will increase their regional product knowledge based on geographical location as well as improve current and future sales. The system could also provide highly valuable information on its consumer preferences and behaviors, which could eventually help to understand future customers.
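The store-similarity idea behind this replacement process can be sketched with a nearest-neighbour lookup. The example below assumes each store is summarized by a small, already scaled feature vector and uses cosine similarity; the store IDs, features, product names, and the choice of similarity measure are all hypothetical and not taken from the thesis.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def replacement_candidates(store_id, product, profiles, top_sellers, n_similar=2):
    """Suggest replacements for `product` from the best sellers of similar stores."""
    target = profiles[store_id]
    neighbours = sorted(
        (s for s in profiles if s != store_id),
        key=lambda s: cosine_similarity(target, profiles[s]),
        reverse=True,
    )[:n_similar]
    candidates = []
    for s in neighbours:
        for item in top_sellers[s]:
            if item != product and item not in candidates:
                candidates.append(item)
    return candidates


# Hypothetical scaled profiles: (sales index, urban density, median income index).
profiles = {
    "A": [0.90, 0.80, 0.30],
    "B": [0.85, 0.75, 0.35],
    "C": [0.10, 0.20, 0.90],
    "D": [0.80, 0.90, 0.20],
}
top_sellers = {
    "A": ["cola", "chips"],
    "B": ["lemonade", "cola"],
    "C": ["tea"],
    "D": ["chips", "iced coffee"],
}

suggestions = replacement_candidates("A", "cola", profiles, top_sellers)
```

In this toy data, stores B and D are the closest matches to store A, so their best sellers (minus the product being replaced) form the suggestion list.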

PREDICTING AND ANALYZING STOCK MARKET BEHAVIOR USING MAGAZINE COVERS

Egor Isakson, M.S. Data Science

Financial magazines have been part of the financial industry right from the start. There has long been a debate whether a stock being featured in a magazine is a contrarian signal. The reasoning behind this is simple: any informational edge reaches the wide masses last, which means that by the time that happens, the bulk of the directional move of the financial instrument has long been completed. This paper puts this idea to the test by examining the behavior of the stock market and the stocks that are featured on the covers of various financial magazines and newspapers. By going through several stages of data extraction and processing using a series of up-to-date data science techniques, ticker symbols are derived from raw, colorful images of covers. The derivation results in a many-to-many relationship, where a single ticker shows up at different points in time, with a possibility of a single cover having many tickers at once. From then, several historic price and media-related features are created in preparation for the machine learning models. Several models are utilized to look at the behavior of the stock and the index at different points in the future. The results are better than random but insufficient as the sole determinant of the asset’s direction.

IDENTIFYING OUTLIER DATA POINTS IN NON-CLINICAL INVESTIGATIONAL NEW DRUG SUBMISSIONS

Cassandra O’Malley, M.S. Data Science

The Food and Drug Administration (FDA) uses a format known as SEND (Standard for Exchange of Nonclinical Data) to evaluate non-clinical (animal) studies for investigational new drug applications. Investigative drug sponsors currently use information from historical and control data to determine if drugs cause toxicity.

The goal of this study is to identify outlying data points that may indicate an investigative new drug could be toxic. Examples include a negative body weight gain over time, enlarged organ weights, or laboratory test abnormalities, especially in relation to a control group within the same study. Flagged records can be analyzed by a veterinarian or pathologist for potential signs of toxicity without looking at each individual data point.

Common domains within non-clinical pharmaceutical studies were evaluated using changes from baseline measurements, changes from the control group, percent change from the previous measurement with reference to the ethical guidelines, values outside the mean ± two standard deviations, and a measure of abnormal findings relative to unremarkable findings in pathology. A program was designed to analyze five of these domains and return a collection of possible outlying data points, which is simpler and faster than analysis of each individual data point by a study monitor and takes a fraction of the time. The resulting file is more easily read by someone unfamiliar with the SEND format.
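Two of the checks named above, percent change from baseline and values outside the mean ± two standard deviations, can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the body weights are invented, and using the control group as the reference distribution is an assumption rather than a documented detail of the study's program.

```python
from statistics import mean, stdev

def percent_change_from_baseline(baseline, value):
    """Percent change of a measurement relative to its baseline value."""
    return 100.0 * (value - baseline) / baseline

def flag_outside_two_sd(values, reference):
    """Return indices of `values` lying outside the reference mean ± 2 SD."""
    m, s = mean(reference), stdev(reference)  # sample standard deviation
    return [i for i, v in enumerate(values) if abs(v - m) > 2 * s]

# Hypothetical body weights (g): control group and treated group, same study day
control = [250, 252, 248, 251, 249, 253, 247, 250]
treated = [249, 251, 238, 252, 262]

flagged = flag_outside_two_sd(treated, control)
change = percent_change_from_baseline(250, 240)
```

Flagged indices would then be passed to a veterinarian or pathologist for review rather than treated as a toxicity finding on their own.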

With this program, analyzing a study for possible toxic effects during the study can save time, effort, and even animal lives by identifying the signs of toxicity early. Sponsors or CROs can determine if the product is safe enough to proceed with testing or should be stopped in the interest of safety and additional research.

CLIMATE CHANGE IMPACTS ON FOOD PRODUCTION: A BIBLIOMETRIC NETWORK ANALYSIS

Skylar Clawson, M.S. Data Science

Climate change is an environmental issue that affects many different sectors of society, such as terrestrial, freshwater and marine ecosystems, human health, and agriculture. With a growing population, food security is a serious issue exacerbated by climate change. Climate change not only impacts food production; food production also impacts climate change by emitting greenhouse gases during the different stages of the food supply chain. This project uses a bibliometric network analysis to identify the influence that the food supply chain has on climate change. We created four networks, one for each stage in the food supply chain (food processing, food transportation, food retail, food waste), to gauge how influential each stage is on climate change. The data for a bibliometric network comes from a scientific database, and the networks are built from a co-word analysis. Co-word analysis reveals words that frequently appear together in research publications, which indicates that they are related in some way. The second part of our analysis focuses on how climate change impacts the early growth and development stages of grains. We collected data on several grains as well as temperature and precipitation to see whether these climate stressors had any influence on production rates. The project’s main focus is to identify how climate change and food production could be influencing each other. The main findings indicate that all four stages of the food supply chain influence climate change, and that climate change affects grain production through climate variables such as temperature and precipitation variability.
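The co-word analysis described above amounts to counting how often pairs of keywords occur in the same publication record. A minimal sketch, assuming each record has already been reduced to a list of keywords; the records and the function name are hypothetical, and the project's actual tooling is not stated in the abstract.

```python
from collections import Counter
from itertools import combinations

def coword_counts(keyword_lists):
    """Count how often each unordered keyword pair appears in the same record."""
    counts = Counter()
    for keywords in keyword_lists:
        # Normalize case and drop duplicates so each pair counts once per record.
        unique = sorted({k.lower() for k in keywords})
        counts.update(combinations(unique, 2))
    return counts

# Hypothetical keyword lists, one per publication record
records = [
    ["food waste", "greenhouse gas", "climate change"],
    ["Food Waste", "climate change", "landfill"],
    ["food retail", "climate change"],
]
edges = coword_counts(records)
strongest = edges.most_common(1)[0]
```

Each counted pair is an edge in the network, with the count as its weight; pairs are stored in alphabetical order so that the same two words always map to the same key.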

EXPLORING VEHICLE SERVICE CONTRACT CANCELLATIONS

Josip Skunca, M.S. Data Science

The goal of this thesis is to propose the cancellation reserve requirement for ServiceContract.com, a start-up vehicle service contract administrator being formed by its parent company DOWC. DOWC is a vehicle service contract administrator who prides itself on offering customized financial products to large car dealerships. The creation of ServiceContract.com (referred to as ServiceContract) serves to offer no-chargeback products as a means of marketing to another portion of the automotive industry. No-chargeback means that if the contract cancels (after 90 days) the Dealership, Finance Manager, and Agent (account manager) of the account are not required to refund their profit from the insurance contract – the administrator refunds the prorated price of the contract. In other words, the administrator must refund the entirety of the contract’s price, prorated at the time of cancellation.

Therefore, the cancellation reserve is the price that must be collected per contract in order to cover all cancellation costs. This research was a requirement to determine the feasibility of the new company and determine the pricing requirements of its products. The pricing of the new company’s products would determine ServiceContract’s competitiveness in the market, and therefore provide an evaluation of the business model.

To find this reserve requirement, research first started by finding the total amount of money that DOWC has refunded, along with the total number of contracts sold. Adding specific information allowed the calculation of these requirements in the necessary form. Service contract administrators are required to file rate cards with each state that must clearly specify the dimensions of the contract and their corresponding price.
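The starting point described above, total refunds divided by total contracts sold, can be written as a short calculation. This is only a sketch: the figures are invented, the function names are hypothetical, and the linear time-based proration is an assumption, since the abstract does not state how refunds are prorated.

```python
def prorated_refund(price, term_months, months_elapsed):
    """Refund for the unused part of the term (assumes linear proration by time)."""
    remaining = max(term_months - months_elapsed, 0)
    return price * remaining / term_months

def reserve_per_contract(total_refunded, contracts_sold):
    """Average amount to collect on every contract to cover cancellation refunds."""
    if contracts_sold <= 0:
        raise ValueError("contracts_sold must be positive")
    return total_refunded / contracts_sold

# Hypothetical figures, for illustration only
refund = prorated_refund(price=2400.0, term_months=60, months_elapsed=15)
reserve = reserve_per_contract(total_refunded=150_000.0, contracts_sold=1_000)
```

Because the refund scales with the contract price, a lower maximum retail price lowers the refunds and therefore the reserve, which matches the relationship the thesis reports.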

The key result in the research was the realization that the Cancellation Reserve would be tied to the Maximum allowed retail price. If the maximum price dealerships can sell for is lowered, the required Cancellation Reserve will follow suit, and as a result lower the Coverage cost of the contract. This allowed for the dealership to have an opportunity to make their desired profit, while enabling ServiceContract to offer competitive pricing.

The most significant impact of these results is that ServiceContract was able to determine that the company had more competitive rates than both competitors and DOWC. This research opened the company’s eyes to the benefit of this kind of research, and will prompt further research in the future.

Spring 2022

A TOOL FOR WHO WILL DROP OUT OF SCHOOL

Colette Joelle Barca, M.S. Data Science

A student’s high school experience often forms the foundation of his or her postsecondary career. As the competition in our nation’s job market continues to increase, many businesses stipulate applicants need a college degree. However, recent studies show approximately one-third of the United States’ college students never obtain a degree. Although colleges have developed methods for identifying and supporting their struggling students, early intervention could be a more effective approach for combating postsecondary dropout rates. This project seeks to use anomaly detection techniques to create a holistic early detection tool that indicates which high school students are most at risk to drop out of college. An individual’s high school experience is not confined to the academic components. As such, an effective model should incorporate both environmental and educational factors, including various descriptive data on the student’s home area, the school’s area, and the school’s overall structure and performance. This project combined this information with data on students throughout their secondary educational careers (i.e., from ninth through twelfth grade) in an attempt to develop a model that could detect during high school which students have a higher probability of dropping out of college. The clustering-based and classification-based anomaly detection algorithms detail the situational and numeric circumstances, respectively, that most frequently result in a student dropping out of college. High school administrators could implement these models at the culmination of each school year to identify which students are most at risk for dropping out in college. Then, administrators could provide additional support to those students during the following school year to decrease that risk. College administrators could also follow this same process to minimize dropout rates.

COMPREHENSIVE ANALYSIS OF THE FUTURE PRICE OF NBA TOP SHOT MOMENTS

Miguel A. Esteban Diaz, M.S. Data Science

NBA Top Shot moments are NFTs built on the FLOW blockchain and created by Dapper Labs in collaboration with the NBA. These NFTs, commonly referred to as “moments”, consist of in-game highlights of an NBA or WNBA player. Using the different variables of a moment, such as the type of play made by the player appearing in the moment (dunk, assist, block, etc.), the number of listings of that moment in the marketplace, whether the player is a rookie, or the rarity tier of the moment (Common, Fandom, Rare or Legendary), this project aims to provide a statistical analysis that could reveal hidden correlations between the characteristics of a moment and its price, together with a prediction of the price of moments using machine learning regression models, including linear regression, random forest, and neural networks. As NFTs, and NBA Top Shot in particular, are a relatively recent area of research, little research has yet been done in this area. This research intends to extend the existing analysis of this topic and to serve as a foundation for future research. It also aims to provide practical information about the valuation of moments, the importance of their diverse characteristics and the impact of those characteristics on pricing, and the possible future application of this information to similar highlight-oriented sports NFTs such as NFL AllDay or UFC Strike, which are designed similarly to NBA Top Shot.

PREVENTING THE LOSS OF SKILLFUL TEACHERS: TEACHER TURNOVER PREDICTION USING MACHINE LEARNING TECHNIQUES

Nirusha Srishan, M.S. Data Science

Teacher turnover rate is an increasing problem in the United States. Each year, teachers leave their current teaching position to either move to a different school or to leave the profession entirely. In an effort to understand why teachers are leaving their current teaching positions and to help identify ways to increase teacher retention rate, I am exploring possible reasons that influence teacher turnover and creating a model to predict if a teacher will leave the teaching profession. The ongoing turnover of teachers has a vast impact on school district employees, the state, the country, and the student population. Therefore, exploring the variables that contribute to teacher turnover can ultimately lead to decreasing the rate of turnover.

This project compares those in the educational field, including general education teachers, special education teachers and other educational staff, who completed the 1999-2000 Schools and Staffing Survey (SASS) and Teacher Follow-up Survey (TFS) from the National Center for Education Statistics (NCES, n.d.). This data will be used to identify trends among teachers who have left the profession. Predictive modeling will include various machine learning techniques: Logistic Regression, Support Vector Machines (SVM), Decision Tree, Random Forest, and K-Nearest Neighbors. By finding the reasons for teacher turnover, a school district can identify ways to maximize its teacher retention rate, fostering a supportive learning environment for students and creating a positive work environment for educators.

FORECASTING AVERAGE SPEED OF CALL CENTER RESPONSES

Emmanuel Torres, M.S. Data Science

Organizations with multifaceted modern call centers are currently using antiquated forecasting technologies, leading to erroneous staffing during critical periods of unprecedented volume. Companies will experience financial hemorrhaging or provide an inadequate customer experience due to incorrect staffing when sporadic volume emerges. The forecasting models currently employed have known caveats, such as an inability to handle wait time without abandonment and the consideration of only a single call type when making the prediction.

This study aims to create a new forecasting model to predict the Average Speed of Answer (ASA) to obtain a more accurate prediction of the staffing requirements for a call center. The new model will anticipate historical volume of varying capacities to create the prediction. Both parametric and nonparametric methodologies will be used to forecast the ASA. An ARIMA (Autoregressive Integrated Moving Average) parametric model was used to create a baseline for the prediction. The application of machine learning techniques such as Recurrent Neural Networks (RNN) was used since it can process sequential data by utilizing previous outputs as inputs to create the neural network. Specifically, Long Short-Term Memory (LSTM) recurrent neural networks were used to create a forecasting model for the call center ASA.

With the LSTM neural network, both a univariate and a multivariate approach were used to forecast the ASA. The findings confirm that the univariate LSTM neural network produced the most accurate forecast, with the lowest Root Mean Squared Error (RMSE) of the three methods used to predict the call center ASA. The multivariate LSTM model was not far behind in accuracy but had a higher RMSE than the univariate model. ARIMA had the highest RMSE and forecasted the ASA inaccurately.
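The comparison above ranks models by RMSE, the square root of the mean squared difference between actual and forecast values. A minimal sketch of that ranking step follows; the ASA values and the model names `model_a` and `model_b` are placeholders, not the thesis's data or results.

```python
import math

def rmse(actual, predicted):
    """Root Mean Squared Error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("sequences must have the same length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical ASA values in seconds; forecasts are made up for illustration
actual = [30.0, 45.0, 60.0, 50.0]
forecasts = {
    "model_a": [32.0, 44.0, 57.0, 52.0],
    "model_b": [40.0, 35.0, 70.0, 40.0],
}
scores = {name: rmse(actual, pred) for name, pred in forecasts.items()}
best = min(scores, key=scores.get)  # lower RMSE means a closer forecast
```

RMSE is expressed in the same unit as the forecast quantity, so a score here reads directly as a typical error in seconds.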

A COMPREHENSIVE EVALUATION ON THE APPLICATIONS OF DATA AUGMENTATION, TRANSFER LEARNING AND IMAGE ENHANCEMENT IN DEVELOPING A ROBUST SPEECH EMOTION RECOGNITION SYSTEM

Kyle Philip Calabro, M.S. Data Science

Within this thesis work, the applications of data augmentation, transfer learning, and image enhancement techniques were explored in depth with respect to speech emotion recognition (SER) via convolutional neural networks and the classification of spectrogram images. Speech emotion recognition is a challenging subset of machine learning with a very active research community. One of the prominent challenges of SER is a lack of quality training data. The methods developed and presented in this work serve to alleviate this issue and improve upon the current state-of-the-art methodology. A novel unimodal approach was taken in which five transfer learning models pre-trained on the ImageNet data set were used with both the feature extraction and fine-tuning methods of transfer learning. These models are VGG-16, VGG-19, InceptionV3, Xception and ResNet-50. A modified version of the AlexNet deep neural network model was used as a baseline for non-pre-trained deep neural networks. Two speech corpora were used to develop these methods: the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) and the Crowd-sourced Emotional Multimodal Actors Dataset (CREMA-D). Data augmentation techniques were applied to the raw audio of each speech corpus to increase the amount of training data, yielding custom data sets. The raw audio augmentation techniques are the addition of Gaussian noise, stretching by two different factors, time shifting, and pitch shifting by three separate tones. Image enhancement techniques were implemented with the aim of improving classification accuracy by revealing more prominent features in the spectrograms. These techniques are conversion to grayscale, contrast stretching, and grayscale conversion followed by contrast stretching. In all, 176 experiments were conducted to provide a comprehensive overview of all proposed techniques as well as a definitive methodology. This methodology yields results that improve on or are comparable to the current state of the art when deployed on the RAVDESS and CREMA-D speech corpora.
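Three of the techniques listed above (Gaussian noise, time shifting, and contrast stretching) are simple enough to sketch on plain lists. This is an illustrative sketch only: the function names, the circular form of the time shift, the noise level, and the 0 to 255 output range are assumptions, and the thesis's own implementation is not shown in the abstract.

```python
import random

def add_gaussian_noise(signal, sigma, seed=0):
    """Add zero-mean Gaussian noise with standard deviation `sigma` to each sample."""
    rng = random.Random(seed)
    return [x + rng.gauss(0.0, sigma) for x in signal]

def time_shift(signal, shift):
    """Circularly shift samples to the right by `shift` positions."""
    shift %= len(signal)
    return signal[-shift:] + signal[:-shift] if shift else list(signal)

def contrast_stretch(pixels, lo=0, hi=255):
    """Linearly rescale pixel intensities so they span [lo, hi]."""
    p_min, p_max = min(pixels), max(pixels)
    if p_max == p_min:
        return [lo for _ in pixels]  # flat image: nothing to stretch
    return [lo + (p - p_min) * (hi - lo) / (p_max - p_min) for p in pixels]

# Hypothetical audio samples and one row of spectrogram pixel intensities
audio = [0.0, 0.5, 1.0, 0.5]
noisy = add_gaussian_noise(audio, sigma=0.01)
shifted = time_shift(audio, 1)
stretched = contrast_stretch([50, 100, 150])
```

Each augmented copy of a recording would be converted to its own spectrogram, which is how augmentation enlarges the image training set.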

  • Survey Paper
  • Open access
  • Published: 01 July 2020

Cybersecurity data science: an overview from machine learning perspective

  • Iqbal H. Sarker   ORCID: orcid.org/0000-0003-1740-5517 1 , 2 ,
  • A. S. M. Kayes 3 ,
  • Shahriar Badsha 4 ,
  • Hamed Alqahtani 5 ,
  • Paul Watters 3 &
  • Alex Ng 3  

Journal of Big Data, volume 7, Article number: 41 (2020)

In a computing context, cybersecurity is undergoing massive shifts in technology and its operations, and data science is driving the change. Extracting security incident patterns or insights from cybersecurity data and building a corresponding data-driven model is the key to making a security system automated and intelligent. To understand and analyze actual phenomena with data, various scientific methods, machine learning techniques, processes, and systems are used, which together are commonly known as data science. In this paper, we focus on and briefly discuss cybersecurity data science, where the data is gathered from relevant cybersecurity sources and the analytics complement the latest data-driven patterns to provide more effective security solutions. The concept of cybersecurity data science makes the computing process more actionable and intelligent than traditional approaches in the domain of cybersecurity. We then discuss and summarize a number of associated research issues and future directions. Furthermore, we provide a machine learning-based multi-layered framework for cybersecurity modeling. Overall, our goal is not only to discuss cybersecurity data science and relevant methods but also to focus on their applicability to data-driven intelligent decision making for protecting systems from cyber-attacks.

Introduction

Due to the increasing dependency on digitalization and the Internet-of-Things (IoT) [ 1 ], various security incidents such as unauthorized access [ 2 ], malware attack [ 3 ], zero-day attack [ 4 ], data breach [ 5 ], denial of service (DoS) [ 2 ], social engineering or phishing [ 6 ] etc. have grown at an exponential rate in recent years. For instance, in 2010, there were fewer than 50 million unique malware executables known to the security community. By 2012, that number had doubled to around 100 million, and in 2019 there were more than 900 million malicious executables known to the security community; this number is likely to grow, according to the statistics of the AV-TEST institute in Germany [ 7 ]. Cybercrime and attacks can cause devastating financial losses and affect both organizations and individuals. It is estimated that a data breach costs 8.19 million USD in the United States and 3.9 million USD on average [ 8 ], and that the annual cost of cybercrime to the global economy is 400 billion USD [ 9 ]. According to Juniper Research [ 10 ], the number of records breached each year is expected to nearly triple over the next 5 years. Thus, it is essential that organizations adopt and implement a strong cybersecurity approach to mitigate the loss. According to [ 11 ], the national security of a country depends on businesses, government, and individual citizens having access to applications and tools that are highly secure, and on the capability to detect and eliminate such cyber-threats in a timely way. Therefore, effectively identifying various cyber incidents, whether previously seen or unseen, and intelligently protecting the relevant systems from such cyber-attacks is a key issue to be solved urgently.

Figure 1: Popularity trends of data science, machine learning and cybersecurity over time, where the x-axis represents the timestamp information and the y-axis represents the corresponding popularity values

Cybersecurity is a set of technologies and processes designed to protect computers, networks, programs and data from attack, damage, or unauthorized access [ 12 ]. Cybersecurity has recently been undergoing massive shifts in technology and its operations in the context of computing, and data science (DS) is driving the change, where machine learning (ML), a core part of “Artificial Intelligence” (AI), can play a vital role in discovering insights from data. Machine learning can significantly change the cybersecurity landscape, and data science is leading a new scientific paradigm [ 13 , 14 ]. The popularity of these related technologies is increasing day by day, as shown in Fig. 1, based on data from the last five years collected from Google Trends [ 15 ]. The figure shows timestamp information, in terms of a particular date, on the x-axis and the corresponding popularity, in the range of 0 (minimum) to 100 (maximum), on the y-axis. As shown in Fig. 1, the popularity values of these areas were below 30 in 2014 and exceeded 70 in 2019, i.e., popularity more than doubled. In this paper, we focus on cybersecurity data science (CDS), which is broadly related to these areas in terms of security data processing techniques and intelligent decision making in real-world applications. Overall, CDS is security data-focused, applies machine learning methods to quantify cyber risks, and ultimately seeks to optimize cybersecurity operations. Thus, this paper is intended for those in academia and industry who want to study and develop a data-driven smart cybersecurity model based on machine learning techniques. Therefore, great emphasis is placed on a thorough description of various types of machine learning methods, and on their relations and usage in the context of cybersecurity. This paper does not describe all of the different techniques used in cybersecurity in detail; instead, it gives an overview of cybersecurity data science modeling based on artificial intelligence, particularly from a machine learning perspective.

The ultimate goal of cybersecurity data science is data-driven intelligent decision making from security data for smart cybersecurity solutions. CDS represents a partial paradigm shift from traditional well-known security solutions such as firewalls, user authentication and access control, cryptography systems etc. that might not be effective for today’s needs in the cyber industry [ 16 , 17 , 18 , 19 ]. The problem is that these solutions are typically handled statically by a few experienced security analysts, with data management done in an ad-hoc manner [ 20 , 21 ]. However, as an increasing number of cybersecurity incidents in the different forms mentioned above continuously appear over time, such conventional solutions have encountered limitations in mitigating such cyber risks. As a result, numerous advanced attacks are created and spread very quickly throughout the Internet. Although several researchers use various data analysis and learning techniques to build cybersecurity models, which are summarized in the “Machine learning tasks in cybersecurity” section, a comprehensive security model based on the effective discovery of security insights and the latest security patterns could be more useful. To address this issue, we need to develop more flexible and efficient security mechanisms that can respond to threats and update security policies to mitigate them intelligently in a timely manner. To achieve this goal, it is necessary to analyze a massive amount of relevant cybersecurity data generated from various sources, such as network and system sources, and to discover insights or proper security policies in an automated manner with minimal human intervention.

Analyzing cybersecurity data and building the right tools and processes to successfully protect against cybersecurity incidents goes beyond a simple set of functional requirements and knowledge about risks, threats or vulnerabilities. To effectively extract the insights or patterns of security incidents, several machine learning techniques, such as feature engineering, data clustering, classification, and association analysis, or neural network-based deep learning techniques, can be used; these are briefly discussed in the “Machine learning tasks in cybersecurity” section. These learning techniques are capable of finding anomalies or malicious behavior and data-driven patterns of associated security incidents to support intelligent decisions. Thus, based on the concept of data-driven decision making, we focus on cybersecurity data science, where the data is gathered from relevant cybersecurity sources such as network activity, database activity, application activity, or user activity, and the analytics complement the latest data-driven patterns to provide corresponding security solutions.

The contributions of this paper are summarized as follows.

  • We first briefly discuss the concept of cybersecurity data science and relevant methods to understand its applicability to data-driven intelligent decision making in the domain of cybersecurity. For this purpose, we also review and briefly discuss different machine learning tasks in cybersecurity, and summarize various cybersecurity datasets, highlighting their usage in different data-driven cyber applications.

  • We then discuss and summarize a number of associated research issues and future directions in the area of cybersecurity data science that could help both academia and industry pursue further research and development in relevant application areas.

  • Finally, we provide a generic multi-layered framework of the cybersecurity data science model based on machine learning techniques. In this framework, we briefly discuss how the cybersecurity data science model can be used to discover useful insights from security data and make data-driven intelligent decisions to build smart cybersecurity systems.

The remainder of the paper is organized as follows. The “Background” section summarizes the background of our study and gives an overview of the technologies related to cybersecurity data science. The “Cybersecurity data science” section defines and briefly discusses cybersecurity data science, including various categories of cyber incident data. In the “Machine learning tasks in cybersecurity” section, we briefly discuss various categories of machine learning techniques, including their relations with cybersecurity tasks, and summarize a number of machine learning-based cybersecurity models in the field. The “Research issues and future directions” section briefly discusses and highlights various research issues and future directions in the area of cybersecurity data science. In the “A multi-layered framework for smart cybersecurity services” section, we suggest a machine learning-based framework for building a cybersecurity data science model and discuss its various layers and their roles. In the “Discussion” section, we highlight several key points regarding our study. Finally, the “Conclusion” section concludes this paper.

Background

In this section, we give an overview of the technologies related to cybersecurity data science, including various types of cybersecurity incidents and defense strategies.

Cybersecurity

Over the last half-century, the information and communication technology (ICT) industry has evolved greatly and is now ubiquitous and closely integrated with our modern society. Thus, protecting ICT systems and applications from cyber-attacks has become a major concern for security policymakers in recent days [ 22 ]. The act of protecting ICT systems from various cyber-threats or attacks has come to be known as cybersecurity [ 9 ]. Several aspects are associated with cybersecurity: measures to protect information and communication technology; the raw data and information it contains and their processing and transmission; associated virtual and physical elements of the systems; the degree of protection resulting from the application of those measures; and eventually the associated field of professional endeavor [ 23 ]. Craigen et al. defined “cybersecurity as a set of tools, practices, and guidelines that can be used to protect computer networks, software programs, and data from attack, damage, or unauthorized access” [ 24 ]. According to Aftergood et al. [ 12 ], “cybersecurity is a set of technologies and processes designed to protect computers, networks, programs and data from attacks and unauthorized access, alteration, or destruction”. Overall, cybersecurity is concerned with understanding diverse cyber-attacks and devising corresponding defense strategies that preserve several properties, defined below [ 25 , 26 ].

Confidentiality is a property used to prevent the access and disclosure of information to unauthorized individuals, entities or systems.

Integrity is a property used to prevent any modification or destruction of information in an unauthorized manner.

Availability is a property used to ensure timely and reliable access of information assets and systems to an authorized entity.

The term cybersecurity applies in a variety of contexts, from business to mobile computing, and can be divided into several common categories: network security, which mainly focuses on securing a computer network from cyber attackers or intruders; application security, which is concerned with keeping software and devices free of risks or cyber-threats; information security, which mainly considers the security and privacy of relevant data; and operational security, which includes the processes of handling and protecting data assets. Typical cybersecurity systems are composed of network security systems and computer security systems containing a firewall, antivirus software, or an intrusion detection system [ 27 ].

Cyberattacks and security risks

The risk typically associated with any attack considers three security factors: threats, i.e., who is attacking; vulnerabilities, i.e., the weaknesses they are attacking; and impacts, i.e., what the attack does [ 9 ]. A security incident is an act that threatens the confidentiality, integrity, or availability of information assets and systems. Several types of cybersecurity incidents may result in security risks to an organization’s systems and networks or to an individual [ 2 ]. These are:

Unauthorized access that describes the act of accessing information to network, systems or data without authorization that results in a violation of a security policy [ 2 ];

Malware known as malicious software, is any program or software that intentionally designed to cause damage to a computer, client, server, or computer network, e.g., botnets. Examples of different types of malware including computer viruses, worms, Trojan horses, adware, ransomware, spyware, malicious bots, etc. [ 3 , 26 ]; Ransom malware, or ransomware , is an emerging form of malware that prevents users from accessing their systems or personal files, or the devices, then demands an anonymous online payment in order to restore access.

Denial-of-Service is an attack meant to shut down a machine or network, making it inaccessible to its intended users by flooding the target with traffic that triggers a crash. The Denial-of-Service (DoS) attack typically uses one computer with an Internet connection, while distributed denial-of-service (DDoS) attack uses multiple computers and Internet connections to flood the targeted resource [ 2 ];

Phishing a type of social engineering , used for a broad range of malicious activities accomplished through human interactions, in which the fraudulent attempt takes part to obtain sensitive information such as banking and credit card details, login credentials, or personally identifiable information by disguising oneself as a trusted individual or entity via an electronic communication such as email, text, or instant message, etc. [ 26 ];

Zero-day attack is the term used to describe the threat of an unknown security vulnerability for which either a patch has not been released or of which the application developers were unaware [ 4 , 28 ].

Besides the attacks mentioned above, privilege escalation [ 29 ], password attack [ 30 ], insider threat [ 31 ], man-in-the-middle [ 32 ], advanced persistent threat [ 33 ], SQL injection attack [ 34 ], cryptojacking attack [ 35 ], web application attack [ 30 ], etc. are well-known security incidents in the field of cybersecurity. A data breach, also known as a data leak, is another type of security incident, which involves the unauthorized access of data by an individual, application, or service [ 5 ]. Thus, all data breaches are security incidents, but not all security incidents are data breaches. Most data breaches occur in the banking industry, involving credit card numbers and personal information, followed by the healthcare sector and the public sector [ 36 ].

Cybersecurity defense strategies

Defense strategies are needed to protect data or information, information systems, and networks from cyber-attacks or intrusions. More granularly, they are responsible for preventing data breaches or security incidents and for monitoring and reacting to intrusions, which can be defined as any kind of unauthorized activity that causes damage to an information system [ 37 ]. An intrusion detection system (IDS) is typically represented as “a device or software application that monitors a computer network or systems for malicious activity or policy violations” [ 38 ]. The traditional well-known security solutions such as anti-virus, firewalls, user authentication, access control, data encryption and cryptography systems, however, might not be effective according to today’s needs in the cyber industry [ 16 , 17 , 18 , 19 ]. On the other hand, an IDS resolves these issues by analyzing security data from several key points in a computer network or system [ 39 , 40 ]. Moreover, intrusion detection systems can be used to detect both internal and external attacks.

Intrusion detection systems fall into different categories according to their usage scope. For instance, the host-based intrusion detection system (HIDS) and the network intrusion detection system (NIDS) are the most common types, with scopes ranging from single computers to large networks. A HIDS monitors important files on an individual system, while a NIDS analyzes and monitors network connections for suspicious traffic. Similarly, based on methodologies, the signature-based IDS and the anomaly-based IDS are the most well-known variants [ 37 ].

Signature-based IDS : A signature can be a predefined string, pattern, or rule that corresponds to a known attack. In a signature-based IDS, an attack is detected when the observed activity matches the corresponding pattern. Examples of signatures include known patterns or byte sequences in network traffic, or sequences used by malware. Anti-virus software uses such sequences or patterns as signatures while performing the matching operation to detect attacks. Signature-based IDS is also known as knowledge-based or misuse detection [ 41 ]. This technique can process a high volume of network traffic efficiently, but it is strictly limited to known attacks. Thus, detecting new or unseen attacks is one of the biggest challenges faced by signature-based systems.

Anomaly-based IDS : The concept of anomaly-based detection overcomes the issues of signature-based IDS discussed above. In an anomaly-based intrusion detection system, the behavior of the network is first examined to find dynamic patterns and to automatically create a data-driven model that profiles normal behavior; deviations from this profile are then detected as anomalies [ 41 ]. Thus, anomaly-based IDS can be treated as a dynamic approach, which follows behavior-oriented detection. The main advantage of anomaly-based IDS is the ability to identify unknown or zero-day attacks [ 42 ]. However, the issue is that an identified anomaly or abnormal behavior is not always an indicator of intrusion. It may sometimes arise from other factors such as policy changes or the offering of a new service.
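The contrast between the two detection styles can be illustrated with a minimal sketch. The signatures, the request-rate baseline, and the z-score threshold below are hypothetical values chosen for illustration, and the z-score profile is only one simple way to model normal behavior, not the method of any particular IDS.

```python
# Illustrative sketch only: signatures, traffic values, and threshold are hypothetical.
from statistics import mean, stdev

# Hypothetical byte-sequence signatures for known attacks.
SIGNATURES = {
    b"' OR '1'='1": "sql-injection",
    b"/etc/passwd": "path-traversal",
}

def signature_match(payload: bytes):
    """Return the names of all known signatures found in the payload."""
    return [name for pattern, name in SIGNATURES.items() if pattern in payload]

def build_profile(normal_values):
    """Profile normal behavior as the mean and standard deviation of one feature."""
    return mean(normal_values), stdev(normal_values)

def is_anomalous(value, profile, threshold=3.0):
    """Flag a value whose z-score against the normal profile exceeds the threshold."""
    mu, sigma = profile
    return abs(value - mu) / sigma > threshold

# Requests per minute observed during a period assumed to be attack-free.
baseline = [98, 102, 100, 97, 103, 101, 99, 100]
profile = build_profile(baseline)

known = signature_match(b"GET /index.php?id=1' OR '1'='1")          # matches a signature
unseen = signature_match(b"GET /index.php?id=never-seen-exploit")   # no signature exists
spike = is_anomalous(450, profile)   # far from the normal profile
usual = is_anomalous(101, profile)   # within the normal profile
```

The unseen payload passes the signature check untouched, whereas the traffic spike is flagged without any prior knowledge of the attack. The same spike could equally be caused by a newly launched service, which is the false-alarm risk noted above.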

In addition, a hybrid detection approach [ 43 , 44 ] that takes into account both the misuse and anomaly-based techniques discussed above can be used to detect intrusions. In a hybrid system, the misuse detection system is used for detecting known types of intrusions and the anomaly detection system is used for novel attacks [ 45 ]. Besides these approaches, stateful protocol analysis can also be used to detect intrusions. It identifies deviations of protocol state similarly to the anomaly-based method, but it uses predetermined universal profiles based on accepted definitions of benign activity [ 41 ]. In Table 1 , we have summarized these common approaches, highlighting their pros and cons. Once detection has been completed, an intrusion prevention system (IPS), which is intended to prevent malicious events, can be used to mitigate the risks in different ways, such as a manual process, a notification, or an automatic process [ 46 ]. Among these approaches, an automatic response system could be more effective as it does not involve a human interface between the detection and response systems.

Data science

We are living in the age of data, advanced analytics, and data science, which are related to data-driven intelligent decision making. Although the process of searching for patterns or discovering hidden and interesting knowledge from data is known as data mining [ 47 ], in this paper we use the broader term “data science” rather than data mining. The reason is that data science, in its most fundamental form, is all about understanding data. It involves studying, processing, and extracting valuable insights from a set of information. In addition to data mining, data analytics is also related to data science. The development of data mining, knowledge discovery, and machine learning, which refers to creating algorithms and programs that learn on their own, together with the original data analysis and descriptive analytics from the statistical perspective, forms the general concept of “data analytics” [ 47 ]. Nowadays, many researchers use the term “data science” to describe the interdisciplinary field of data collection, preprocessing, inferring, or making decisions by analyzing the data. To understand and analyze actual phenomena with data, various scientific methods, machine learning techniques, processes, and systems are used, which together are commonly known as data science. According to Cao et al. [ 47 ] “data science is a new interdisciplinary field that synthesizes and builds on statistics, informatics, computing, communication, management, and sociology to study data and its environments, to transform data to insights and decisions by following a data-to-knowledge-to-wisdom thinking and methodology”. As a high-level statement in the context of cybersecurity, we can conclude that it is the study of security data to provide data-driven solutions for the given security problems, also known as “the science of cybersecurity data”.
Figure 2 shows the typical data-to-insight-to-decision transfer at different periods and general analytic stages in data science, in terms of a variety of analytics goals (G) and approaches (A) to achieve the data-to-decision goal [ 47 ].

Fig. 2 Data-to-insight-to-decision analytic stages in data science [ 47 ]

Given the analytic power of data science, including machine learning techniques, it can be a viable component of security strategies. By using data science techniques, security analysts can manipulate and analyze security data more effectively and efficiently, uncovering valuable insights from data. Thus, data science methodologies including machine learning techniques can be well utilized in the context of cybersecurity, in terms of understanding the problem, gathering security data from diverse sources, preparing data to feed into the model, and building and updating data-driven models for providing smart security services. This motivates us to define cybersecurity data science and to work in this research area.

Cybersecurity data science

In this section, we briefly discuss cybersecurity data science, including various categories of cyber incident data and their usage in different application areas, as well as the key terms and areas related to our study.

Understanding cybersecurity data

Data science is largely driven by the availability of data [ 48 ]. Datasets typically represent a collection of information records that consist of several attributes or features and related facts, on which cybersecurity data science is based. Thus, it’s important to understand the nature of cybersecurity data containing various types of cyberattacks and relevant features. The reason is that raw security data collected from relevant cyber sources can be used to analyze the various patterns of security incidents or malicious behavior, and to build a data-driven security model to achieve our goal. Several datasets exist in the area of cybersecurity, covering intrusion analysis, malware analysis, and anomaly, fraud, or spam analysis, that are used for various purposes. In Table 2 , we summarize several such datasets that are accessible on the Internet, including their various features and attacks, and highlight their usage based on machine learning techniques in different cyber applications. Effectively analyzing and processing these security features, building a target machine learning-based security model according to the requirements, and eventually making data-driven decisions could play a role in providing intelligent cybersecurity services, which are discussed briefly in the “ A multi-layered framework for smart cybersecurity services ” section.

Defining cybersecurity data science

Data science is transforming the world’s industries. It is critically important for the future of intelligent cybersecurity systems and services because “security is all about data”. When we seek to detect cyber threats, we are analyzing the security data in the form of files, logs, network packets, or other relevant sources. Traditionally, security professionals didn’t use data science techniques to make detections based on these data sources. Instead, they used file hashes, custom-written rules like signatures, or manually defined heuristics [ 21 ]. Although these techniques have their own merits in several cases, they require too much manual work to keep up with the changing cyber threat landscape. In contrast, data science can bring a massive shift in technology and its operations, where machine learning algorithms can be used to learn or extract insight into security incident patterns from the training data for their detection and prevention. For instance, these techniques can be used to detect malware or suspicious trends, or to extract policy rules.

Recently, the entire security industry has been moving towards data science because of its capability to transform raw data into decision making. To do this, several data-driven tasks can be associated, such as: (i) data engineering, focusing on practical applications of data gathering and analysis; (ii) reducing data volume, which deals with filtering significant and relevant data for further analysis; (iii) discovery and detection, which focuses on extracting insight, incident patterns, or knowledge from data; (iv) automated models, which focus on building a data-driven intelligent security model; (v) targeted security alerts, focusing on the generation of remarkable security alerts based on discovered knowledge so that false alerts are minimized; and (vi) resource optimization, which deals with using the available resources to achieve the target goals in a security system. While making data-driven decisions, behavioral analysis could also play a significant role in the domain of cybersecurity [ 81 ].

Thus, the concept of cybersecurity data science incorporates the methods and techniques of data science and machine learning as well as the behavioral analytics of various security incidents. The combination of these technologies has given birth to the term “cybersecurity data science”, which refers to collecting a large amount of security event data from different sources and analyzing it using machine learning technologies to detect security risks or attacks, either through the discovery of useful insights or the latest data-driven patterns. It is, however, worth remembering that cybersecurity data science is not just a collection of machine learning algorithms but rather a process that can help security professionals or analysts to scale and automate their security activities in a smart way and in a timely manner. Therefore, the formal definition can be as follows: “Cybersecurity data science is a research or working area existing at the intersection of cybersecurity, data science, and machine learning or artificial intelligence, which is mainly security data-focused, applies machine learning methods, attempts to quantify cyber-risks or incidents, and promotes inferential techniques to analyze behavioral patterns in security data. It also focuses on generating security response alerts, and eventually seeks for optimizing cybersecurity solutions, to build automated and intelligent cybersecurity systems.”

Table  3 highlights some key terms associated with cybersecurity data science. Overall, the outputs of cybersecurity data science are typically security data products, which can be a data-driven security model, policy rule discovery, risk or attack prediction, potential security service and recommendation, or the corresponding security system depending on the given security problem in the domain of cybersecurity. In the next section, we briefly discuss various machine learning tasks with examples within the scope of our study.

Machine learning tasks in cybersecurity

Machine learning (ML) is typically considered a branch of “Artificial Intelligence”, which is closely related to computational statistics, data mining and analytics, and data science, and particularly focuses on making computers learn from data [ 82 , 83 ]. Thus, machine learning models typically comprise a set of rules, methods, or complex “transfer functions” that can be applied to find interesting data patterns, or to recognize or predict behavior [ 84 ], which could play an important role in the area of cybersecurity. In the following, we discuss different methods that can be used to solve machine learning tasks and how they are related to cybersecurity tasks.

Supervised learning

Supervised learning is performed when specific targets are defined to be reached from a certain set of inputs, i.e., a task-driven approach. In the area of machine learning, the most popular supervised learning techniques are known as classification and regression methods [ 129 ]. These techniques are widely used to classify or predict the future for a particular security problem. For instance, classification techniques can be used in the cybersecurity domain to predict a denial-of-service attack (yes, no) or to identify different classes of network attacks such as scanning and spoofing. ZeroR [ 83 ], OneR [ 130 ], Naive Bayes [ 131 ], Decision Tree [ 132 , 133 ], K-nearest neighbors [ 134 ], support vector machines [ 135 ], adaptive boosting [ 136 ], and logistic regression [ 137 ] are the well-known classification techniques. In addition, Sarker et al. have recently proposed the BehavDT [ 133 ] and IntruDtree [ 106 ] classification techniques, which are able to effectively build a data-driven predictive model. On the other hand, regression techniques are useful to predict a continuous or numeric value, e.g., the total number of phishing attacks in a certain period or the network packet parameters. Regression analyses can also be used to detect the root causes of cybercrime and other types of fraud [ 138 ]. Linear regression [ 82 ] and support vector regression [ 135 ] are the popular regression techniques. The main difference between classification and regression is that the output variable in regression is numerical or continuous, while the predicted output for classification is categorical or discrete. Ensemble learning is an extension of supervised learning that mixes different simple models, e.g., Random Forest learning [ 139 ], which generates multiple decision trees to solve a particular security task.
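As a minimal sketch of the classification task described above, the following K-nearest neighbors classifier assigns a connection to an attack class by majority vote. The two features, the training records, and their labels are hypothetical; a real model would be trained on one of the labeled datasets and would normally scale the features first.

```python
from collections import Counter
from math import dist

# Hypothetical training records: (packets per second, distinct ports contacted) -> label.
TRAINING = [
    ((900.0, 1.0), "dos"),
    ((850.0, 2.0), "dos"),
    ((950.0, 1.0), "dos"),
    ((20.0, 40.0), "scan"),
    ((25.0, 55.0), "scan"),
    ((15.0, 2.0), "normal"),
    ((30.0, 3.0), "normal"),
    ((10.0, 1.0), "normal"),
]

def knn_predict(sample, training=TRAINING, k=3):
    """Predict the majority label among the k nearest training records."""
    nearest = sorted(training, key=lambda rec: dist(rec[0], sample))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

pred_dos = knn_predict((880.0, 1.0))     # high packet rate, single port
pred_scan = knn_predict((22.0, 50.0))    # low rate, many ports
pred_normal = knn_predict((18.0, 2.0))   # low rate, few ports
```

The predicted output is categorical, which is what distinguishes this task from regression, where the same features would be mapped to a numeric value.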

Unsupervised learning

In unsupervised learning problems, the main task is to find patterns, structures, or knowledge in unlabeled data, i.e., a data-driven approach [ 140 ]. In the area of cybersecurity, cyber-attacks like malware stay hidden in several ways, including by changing their behavior dynamically and autonomously to avoid detection. Clustering techniques, a type of unsupervised learning, can help to uncover hidden patterns and structures in the datasets and to identify indicators of such sophisticated attacks. Similarly, clustering techniques can be useful in identifying anomalies and policy violations, and in detecting and eliminating noisy instances in data. K-means [ 141 ] and K-medoids [ 142 ] are the popular partitioning clustering algorithms, and single linkage [ 143 ] and complete linkage [ 144 ] are the well-known hierarchical clustering algorithms used in various application domains. Moreover, a bottom-up clustering approach proposed by Sarker et al. [ 145 ] can also be used by taking into account the data characteristics.
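To make the partitioning idea concrete, the following is a minimal K-means sketch on unlabeled host records. The features and values are hypothetical, and initializing the centroids with the first k points is a simplification made here to keep the example deterministic.

```python
from math import dist

def kmeans(points, k, iterations=10):
    """Lloyd's algorithm with the first k points as initial centroids (a simplification)."""
    centroids = [points[i] for i in range(k)]
    assignment = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignment = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]
        # Update step: move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, assignment

# Hypothetical unlabeled hosts: (outbound connections per hour, mean payload size in KB).
hosts = [(5.0, 1.0), (200.0, 0.2), (6.0, 1.2), (210.0, 0.1), (4.0, 0.9), (190.0, 0.3)]
centroids, assignment = kmeans(hosts, k=2)
```

No labels are supplied, yet the hosts with many small outbound connections end up in their own cluster, which an analyst could then inspect as a possible indicator of beaconing behavior.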

Besides, feature engineering tasks like optimal feature selection or extraction related to a particular security problem could be useful for further analysis [ 106 ]. Recently, Sarker et al. [ 106 ] have proposed an approach for selecting security features according to their importance score values. Moreover, principal component analysis, linear discriminant analysis, Pearson correlation analysis, and non-negative matrix factorization are the popular dimensionality reduction techniques to solve such issues [ 82 ]. Association rule learning is another example, where machine learning-based policy rules can prevent cyber-attacks. In an expert system, the rules are usually manually defined by a knowledge engineer working in collaboration with a domain expert [ 37 , 140 , 146 ]. Association rule learning, on the contrary, is the discovery of rules or relationships among a set of available security features or attributes in a given dataset [ 147 ]. To quantify the strength of relationships, correlation analysis can be used [ 138 ]. Many association rule mining algorithms have been proposed in the machine learning and data mining literature, such as logic-based [ 148 ], frequent pattern-based [ 149 , 150 , 151 ], tree-based [ 152 ], etc. Recently, Sarker et al. [ 153 ] have proposed an association rule learning approach considering non-redundant generation, which can be used to discover a set of useful security policy rules. Moreover, AIS [ 147 ], Apriori [ 149 ], Apriori-TID and Apriori-Hybrid [ 149 ], FP-Tree [ 152 ], RARM [ 154 ], and Eclat [ 155 ] are the well-known association rule learning algorithms that are capable of solving such problems by generating a set of policy rules in the domain of cybersecurity.
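The two standard measures behind association rule learning, support and confidence, can be sketched as follows. The event attributes are hypothetical, and the pair enumeration is brute force rather than the candidate pruning used by Apriori or the other algorithms cited above.

```python
from itertools import combinations

# Hypothetical security events, each a set of observed attribute values.
EVENTS = [
    {"port=22", "failed_login", "foreign_ip", "blocked"},
    {"port=22", "failed_login", "foreign_ip", "blocked"},
    {"port=22", "failed_login", "blocked"},
    {"port=443", "foreign_ip"},
    {"port=22", "failed_login", "foreign_ip"},
    {"port=443"},
]

def support(itemset, events=EVENTS):
    """Fraction of events that contain every item in the itemset."""
    return sum(itemset <= e for e in events) / len(events)

def confidence(antecedent, consequent, events=EVENTS):
    """Estimated P(consequent | antecedent) = support(A and C) / support(A)."""
    return support(antecedent | consequent, events) / support(antecedent, events)

def frequent_pairs(events=EVENTS, min_support=0.5):
    """Enumerate 2-itemsets meeting the minimum support (brute force, no pruning)."""
    items = sorted(set().union(*events))
    return [set(pair) for pair in combinations(items, 2)
            if support(set(pair), events) >= min_support]

rule_support = support({"failed_login", "blocked"})          # 3 of 6 events
rule_confidence = confidence({"failed_login"}, {"blocked"})  # 3 of the 4 failed logins
pairs = frequent_pairs()
```

A rule such as "failed_login implies blocked" with high confidence is the kind of relationship that could be turned into a candidate policy rule. Several of the frequent pairs overlap heavily, which hints at the redundancy problem that non-redundant rule generation aims to address.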

Neural networks and deep learning

Deep learning is a part of machine learning in the area of artificial intelligence, based on computational models inspired by the biological neural networks in the human brain [ 82 ]. The Artificial Neural Network (ANN) is frequently used in deep learning, and the most popular neural network algorithm is backpropagation [ 82 ]. It performs learning on a multi-layer feed-forward neural network that consists of an input layer, one or more hidden layers, and an output layer. The main difference between deep learning and classical machine learning is how performance changes as the amount of security data increases. Typically, deep learning algorithms perform well when the data volumes are large, whereas machine learning algorithms perform comparatively better on small datasets [ 44 ]. In our earlier work, Sarker et al. [ 129 ], we have illustrated the effectiveness of these approaches considering contextual datasets. Deep learning approaches, however, mimic the human brain mechanism to interpret large amounts of data or complex data such as images, sounds, and texts [ 44 , 129 ]. In terms of feature extraction to build models, deep learning reduces the effort of designing a feature extractor for each problem compared with classical machine learning techniques. Besides these characteristics, deep learning typically takes longer to train than a classical machine learning algorithm, whereas the test time is exactly the opposite [ 44 ]. Thus, deep learning relies more on high-performance machines with GPUs than classical machine learning algorithms do [ 44 , 156 ]. The most popular deep neural network learning models include the multi-layer perceptron (MLP) [ 157 ], the convolutional neural network (CNN) [ 158 ], and the recurrent neural network (RNN) or long-short term memory (LSTM) network [ 121 , 158 ]. Recently, researchers have used these deep learning techniques for different purposes in the domain of cybersecurity, such as detecting network intrusions and malware traffic detection and classification [ 44 , 159 ].
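The layered structure described above (input layer, hidden layer, output layer) can be sketched as a forward pass through a tiny feed-forward network. The weights below are hypothetical and hand-picked; in practice they would be learned by backpropagation, which is not shown here.

```python
from math import exp

def sigmoid(z):
    """Logistic activation, mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + exp(-z))

def layer(inputs, weights, biases):
    """One fully connected layer: sigmoid(W x + b), with one weight row per neuron."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def forward(features, hidden_w, hidden_b, output_w, output_b):
    """Input layer -> one hidden layer -> single output neuron."""
    hidden = layer(features, hidden_w, hidden_b)
    return layer(hidden, output_w, output_b)[0]

# Hypothetical, hand-picked weights; backpropagation would normally learn these.
HIDDEN_W = [[2.0, -1.0, 0.5], [-1.5, 1.0, 1.0]]
HIDDEN_B = [0.0, 0.1]
OUTPUT_W = [[1.2, -0.8]]
OUTPUT_B = [-0.2]

# Three normalized traffic features for one connection (illustrative values).
score = forward([0.9, 0.1, 0.4], HIDDEN_W, HIDDEN_B, OUTPUT_W, OUTPUT_B)
```

The output can be read as an intrusion score between 0 and 1. Deeper models such as CNNs or LSTMs follow the same principle with more layers and specialized connections.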

Other learning techniques

Semi-supervised learning can be described as a hybridization of the supervised and unsupervised techniques discussed above, as it works on both labeled and unlabeled data. In the area of cybersecurity, it could be useful when data must be labeled automatically without human intervention, to improve the performance of cybersecurity models. Reinforcement techniques are another type of machine learning, in which an agent creates its own learning experiences by interacting directly with the environment, i.e., an environment-driven approach, where the environment is typically formulated as a Markov decision process and the agent takes decisions based on a reward function [ 160 ]. Monte Carlo learning, Q-learning, and Deep Q Networks are the most common reinforcement learning algorithms [ 161 ]. For instance, in a recent work [ 126 ], the authors present an approach for detecting botnet traffic or malicious cyber activities using reinforcement learning combined with a neural network classifier. In another work [ 128 ], the authors discuss the application of deep reinforcement learning to intrusion detection for supervised problems, where they obtained the best results with the Deep Q-Network algorithm. In the context of cybersecurity, genetic algorithms that use fitness, selection, crossover, and mutation for optimization could also be used to solve a similar class of learning problems [ 119 ].
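The reward-driven learning loop can be illustrated with the standard tabular Q-learning update. The states, actions, and rewards below are hypothetical and do not reproduce the setups of the cited works.

```python
from collections import defaultdict

def q_update(q, state, action, reward, next_state, actions, alpha=0.5, gamma=0.9):
    """One tabular step: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(q[(next_state, a)] for a in actions)
    q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
    return q[(state, action)]

ACTIONS = ["allow", "block"]    # hypothetical responses to a connection
q_table = defaultdict(float)    # unseen (state, action) pairs start at 0.0

# Hypothetical experience: blocking suspicious traffic is rewarded, allowing it is penalized.
first = q_update(q_table, "suspicious", "block", 1.0, "benign", ACTIONS)
second = q_update(q_table, "suspicious", "block", 1.0, "benign", ACTIONS)
penalty = q_update(q_table, "suspicious", "allow", -1.0, "suspicious", ACTIONS)
best_action = max(ACTIONS, key=lambda a: q_table[("suspicious", a)])
```

After only three experiences the agent already prefers blocking in the suspicious state. A Deep Q-Network replaces the table with a neural network so that the same update can generalize across large state spaces.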

Various types of machine learning techniques discussed above can be useful in the domain of cybersecurity, to build an effective security model. In Table  4 , we have summarized several machine learning techniques that are used to build various types of security models for various purposes. Although these models typically represent a learning-based security model, in this paper, we aim to focus on a comprehensive cybersecurity data science model and relevant issues, in order to build a data-driven intelligent security system. In the next section, we highlight several research issues and potential solutions in the area of cybersecurity data science.

Research issues and future directions

Our study opens several research issues and challenges in the area of cybersecurity data science to extract insight from relevant data towards data-driven intelligent decision making for cybersecurity solutions. In the following, we summarize these challenges ranging from data collection to decision making.

Cybersecurity datasets : Source datasets are the primary component for working in the area of cybersecurity data science. Most of the existing datasets are old and might be insufficient for understanding the recent behavioral patterns of various cyber-attacks. Although the data can be transformed into a meaningful level of understanding after several processing tasks, there is still a lack of understanding of the characteristics of recent attacks and their patterns of occurrence. Thus, further processing or machine learning algorithms may provide a low accuracy rate for making the target decisions. Therefore, establishing a large number of recent datasets for a particular problem domain like cyber risk prediction or intrusion detection is needed, which could be one of the major challenges in cybersecurity data science.

Handling quality problems in cybersecurity datasets : Cyber datasets might be noisy, incomplete, insignificant, or imbalanced, or may contain inconsistent instances related to a particular security incident. Such problems in a dataset may affect the quality of the learning process and degrade the performance of machine learning-based models [ 162 ]. To make data-driven intelligent decisions for cybersecurity solutions, such problems in data need to be dealt with effectively before building the cyber models. Therefore, understanding such problems in cyber data and effectively handling them using existing or newly proposed algorithms for a particular problem domain like malware analysis or intrusion detection and prevention is needed, which could be another research issue in cybersecurity data science.

Security policy rule generation : Security policy rules reference security zones and enable a user to allow, restrict, and track traffic on the network based on the corresponding user or user group, and the service or application. The policy rules, including the general and more specific rules, are compared against the incoming traffic in sequence during execution, and the rule that matches the traffic is applied. The policy rules used in most cybersecurity systems are static and are generated by human expertise or are ontology-based [ 163 , 164 ]. Although association rule learning techniques produce rules from data, there is a problem of redundancy generation [ 153 ] that makes the policy rule-set complex. Therefore, understanding such problems in policy rule generation and effectively handling them using existing or newly proposed algorithms for a particular problem domain like access control [ 165 ] is needed, which could be another research issue in cybersecurity data science.

Hybrid learning method : Most commercial products in the cybersecurity domain contain signature-based intrusion detection techniques [ 41 ]. However, missing features or insufficient profiling can cause these techniques to miss unknown attacks. In that case, anomaly-based detection techniques or a hybrid technique combining signature-based and anomaly-based detection can be used to overcome such issues. A hybrid technique combining multiple learning techniques, or a combination of deep learning and machine learning methods, can be used to extract the target insight for a particular problem domain like intrusion detection, malware analysis, or access control, and to make intelligent decisions for the corresponding cybersecurity solutions.

Protecting the valuable security information : Another consequence of a cyber data attack is the loss of extremely valuable data and information, which could be damaging for an organization. With the use of encryption or highly complex signatures, one can stop others from probing into a dataset. In such cases, cybersecurity data science can be used to build a data-driven impenetrable protocol to protect such security information. To achieve this goal, cyber analysts can develop algorithms that analyze the history of cyberattacks to detect the most frequently targeted chunks of data. Thus, understanding such data protection problems and designing corresponding algorithms to handle them effectively could be another research issue in the area of cybersecurity data science.

Context-awareness in cybersecurity : Existing cybersecurity work mainly originates from the relevant cyber data containing several low-level features. When data mining and machine learning techniques are applied to such datasets, a related pattern can be identified that describes the data properly. However, broader contextual information [ 140 , 145 , 166 ], such as temporal and spatial information, relationships among events or connections, and dependencies, can be used to decide whether a suspicious activity exists or not. For instance, some approaches may consider individual connections as DoS attacks, while security experts might not treat them as malicious by themselves. Thus, a significant limitation of existing cybersecurity work is the lack of contextual information in predicting risks or attacks. Therefore, context-aware adaptive cybersecurity solutions could be another research issue in cybersecurity data science.

Feature engineering in cybersecurity : The efficiency and effectiveness of a machine learning-based security model have always been a major challenge due to the high volume of network data with a large number of traffic features. The large dimensionality of data has been addressed using several techniques such as principal component analysis (PCA) [ 167 ], singular value decomposition (SVD) [ 168 ], etc. In addition to low-level features in the datasets, the contextual relationships between suspicious activities might be relevant. Such contextual data can be stored in an ontology or taxonomy for further processing. Thus, how to effectively select the optimal features or extract the significant features, considering both the low-level features and the contextual features, for effective cybersecurity solutions could be another research issue in cybersecurity data science.

Remarkable security alert generation and prioritizing : In many cases, the cybersecurity system may not be well defined and may cause a substantial number of false alarms that are unexpected in an intelligent system. For instance, an IDS deployed in a real-world network generates around nine million alerts per day [ 169 ]. A network-based intrusion detection system typically inspects the incoming traffic for matches against the associated patterns to detect risks, threats, or vulnerabilities and to generate security alerts. However, responding to each such alert might not be effective, as it consumes relatively large amounts of time and resources and consequently may result in a self-inflicted DoS. To overcome this problem, high-level management is required that correlates the security alerts, considering the current context and their logical relationships, and prioritizes them before reporting them to users, which could be another research issue in cybersecurity data science.
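One simple form of such correlation and prioritization can be sketched as follows. The alert records, severity weights, and scoring rule are hypothetical; they only illustrate how repeated raw alerts can be collapsed into incidents and ranked before being reported.

```python
from collections import defaultdict

# Hypothetical severity weights and raw alerts: (source IP, signature name, time in seconds).
SEVERITY = {"port-scan": 1, "brute-force": 3, "malware-beacon": 5}
ALERTS = [
    ("10.0.0.5", "port-scan", 0),
    ("10.0.0.5", "port-scan", 4),
    ("10.0.0.5", "brute-force", 30),
    ("10.0.0.9", "port-scan", 12),
    ("10.0.0.7", "malware-beacon", 50),
    ("10.0.0.5", "port-scan", 55),
]

def correlate(alerts):
    """Group raw alerts by (source, signature) so repeats become a single incident."""
    groups = defaultdict(list)
    for source, signature, ts in alerts:
        groups[(source, signature)].append(ts)
    return groups

def prioritize(alerts):
    """Rank sources by the summed severity of their distinct signatures, highest first."""
    scores = defaultdict(int)
    for source, signature in correlate(alerts):
        scores[source] += SEVERITY[signature]
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

incidents = correlate(ALERTS)
ranking = prioritize(ALERTS)
```

Six raw alerts are reduced to four incidents, and the analyst sees the malware beacon first even though the port scans were far more numerous.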

Recency analysis in cybersecurity solutions : Machine learning-based security models typically use a large amount of static data to generate data-driven decisions. Anomaly detection systems rely on constructing such a model of normal and anomalous behavior according to their patterns. However, normal behavior in a large and dynamic security system is not well defined and may change over time, which can be considered as an incrementally growing dataset. The patterns in incremental datasets might change in several cases. This often results in a substantial number of false alarms, known as false positives. Thus, a recent malicious behavioral pattern is more likely to be interesting and significant than older ones for predicting unknown attacks. Therefore, effectively using the concept of recency analysis [ 170 ] in cybersecurity solutions could be another issue in cybersecurity data science.
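One simple way to favor recent behavior is to weight each observation by its age. The sketch below uses exponential decay with a hypothetical half-life of seven days; this is an illustrative stand-in chosen here, not the recency analysis method of [ 170 ].

```python
def recency_weight(age_days, half_life_days=7.0):
    """Exponential decay: an observation loses half its weight every half_life_days."""
    return 0.5 ** (age_days / half_life_days)

def weighted_score(observations, now_day, half_life_days=7.0):
    """Sum of recency weights per pattern; observations are (pattern, day observed)."""
    scores = {}
    for pattern, day in observations:
        weight = recency_weight(now_day - day, half_life_days)
        scores[pattern] = scores.get(pattern, 0.0) + weight
    return scores

# Hypothetical log: pattern-A was frequent four weeks ago, pattern-B appeared recently.
OBSERVATIONS = [
    ("pattern-A", 0), ("pattern-A", 1), ("pattern-A", 2),
    ("pattern-B", 27), ("pattern-B", 28),
]
scores = weighted_score(OBSERVATIONS, now_day=28)
```

Although pattern-A has more occurrences in total, pattern-B receives the higher score because its observations are recent, which matches the intuition that recent patterns are more significant for prediction.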

The most important work for an intelligent cybersecurity system is to develop an effective framework that supports data-driven decision making. In such a framework, we need to consider advanced data analysis based on machine learning techniques, so that the framework is capable of minimizing these issues and providing automated and intelligent security services. Thus, a well-designed security framework for cybersecurity data, together with its experimental evaluation, is a very important direction and a big challenge as well. In the next section, we suggest and discuss a data-driven cybersecurity framework based on machine learning techniques considering multiple processing layers.

A multi-layered framework for smart cybersecurity services

As discussed earlier, cybersecurity data science is data-focused, applies machine learning methods, attempts to quantify cyber risks, promotes inferential techniques to analyze behavioral patterns, focuses on generating security response alerts, and ultimately seeks to optimize cybersecurity operations. Hence, we briefly discuss a framework with multiple data processing layers that can potentially be used to discover security insights from raw data in order to build smart cybersecurity systems, e.g., dynamic policy rule-based access control or intrusion detection and prevention systems. Making data-driven intelligent decisions in the resulting cybersecurity system requires understanding the security problems, the nature of the corresponding security data, and extensive analysis of that data. For this purpose, our suggested framework not only considers machine learning techniques to build the security model but also takes into account incremental learning and dynamism to keep the model up-to-date, along with corresponding response generation, which could be more effective and intelligent for providing the expected services. Figure 3 shows an overview of the framework, involving several processing layers, from raw security event data to services. In the following, we briefly discuss the working procedure of the framework.

Fig. 3: A generic multi-layered framework based on machine learning techniques for smart cybersecurity services

Security data collecting

Collecting valuable cybersecurity data is a crucial step, which forms a connecting link between security problems in cyberinfrastructure and the corresponding data-driven solution steps in this framework, shown in Fig. 3. The reason is that cyber data can serve as the source for setting up the ground truth of the security model, which affects the model performance. The quality and quantity of cyber data determine the feasibility and effectiveness of solving the security problem according to our goal. Thus, the concern is how to collect valuable data that meets the specific needs of building data-driven security models.

The general step of collecting and managing security data from diverse data sources depends on the particular security problem and project within the enterprise. Data sources can be classified into several broad categories such as network, host, and hybrid [171]. Within the network infrastructure, the security system can leverage different types of security data, such as IDS logs, firewall logs, network traffic data, packet data, and honeypot data, for providing the target security services. For instance, whether a given IP is malicious could be determined by analyzing data on IP addresses and their cyber activities. In the domain of cybersecurity, the network source mentioned above is considered the primary security event source to analyze. The host category collects data from an organization’s host machines, where the data sources can be operating system logs, database access logs, web server logs, email logs, application logs, etc. Collecting data from both the network and host machines is considered the hybrid category. Overall, in the data collection layer, network activity, database activity, application activity, and user activity can be the possible security event sources in the context of cybersecurity data science.
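The source taxonomy described above can be written down as a small lookup. This is only an illustrative sketch: the source identifiers such as `ids_log` and the function name `collection_category` are hypothetical labels, not names from any particular tool.

```python
# Hypothetical identifiers for the data sources named in the text.
NETWORK_SOURCES = {
    "ids_log", "firewall_log", "network_traffic", "packet_data", "honeypot",
}
HOST_SOURCES = {
    "os_log", "database_access_log", "web_server_log",
    "email_log", "application_log",
}


def collection_category(sources):
    """Classify a data collection setup as 'network', 'host', or 'hybrid'."""
    sources = set(sources)
    unknown = sources - NETWORK_SOURCES - HOST_SOURCES
    if unknown:
        raise ValueError(f"unrecognized sources: {sorted(unknown)}")
    has_network = bool(sources & NETWORK_SOURCES)
    has_host = bool(sources & HOST_SOURCES)
    if has_network and has_host:
        return "hybrid"
    if has_network:
        return "network"
    if has_host:
        return "host"
    raise ValueError("no sources given")


print(collection_category(["ids_log", "firewall_log"]))
print(collection_category(["os_log", "ids_log"]))
```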

Security data preparing

After collecting the raw security data from various sources according to the problem domain discussed above, this layer is responsible for preparing the raw data for building the model by applying the necessary processes. However, not all of the collected data contributes to the model building process in the domain of cybersecurity [172]. Therefore, useless data should be removed from the data captured by the network sniffer. Moreover, data might be noisy, have missing or corrupted values, or have attributes of widely varying types and scales. High-quality data is necessary for achieving higher accuracy in a data-driven model, which is a process of learning a function that maps an input to an output based on example input-output pairs. Thus, a procedure for data cleaning and for handling missing or corrupted values might be required. Moreover, security data features or attributes can be of different types, such as continuous, discrete, or symbolic [106]. Beyond a solid understanding of these types of data and attributes and their permissible operations, it is necessary to preprocess the data and attributes to convert them into the target type. Besides, the raw data can be structured, semi-structured, or unstructured. Thus, normalization, transformation, or collation can be useful to organize the data in a structured manner. In some cases, natural language processing techniques might be useful depending on the data type and characteristics, e.g., textual contents. As both the quality and quantity of data determine the feasibility of solving the security problem, effective pre-processing, management, and representation of data can play a significant role in building an effective security model for intelligent services.
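The preparation steps mentioned above (handling missing values, bringing attributes to a common scale, and converting symbolic attributes to a numeric form) can be sketched with the standard library alone. The helper names and the tiny connection-record dataset are hypothetical, and mean imputation, min-max scaling, and one-hot encoding are just common choices, not ones prescribed by the framework.

```python
from statistics import mean


def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def min_max_scale(values):
    """Rescale values to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def one_hot(values):
    """Encode a symbolic attribute as 0/1 indicator columns per category."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


# Hypothetical connection records: one continuous and one symbolic attribute.
duration = [2.0, None, 10.0, 6.0]
protocol = ["tcp", "udp", "tcp", "icmp"]

scaled = min_max_scale(impute_mean(duration))
categories, encoded = one_hot(protocol)
rows = [[d] + e for d, e in zip(scaled, encoded)]
for row in rows:
    print(row)
```

Each output row is now a purely numeric vector (scaled duration followed by indicator columns for `icmp`, `tcp`, `udp`), which is the form most learning algorithms expect.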

Machine learning-based security modeling

This is the core step where insights and knowledge are extracted from data through the application of cybersecurity data science. In this section, we particularly focus on machine learning-based modeling, as machine learning techniques can significantly change the cybersecurity landscape. The security features or attributes and their patterns in data are of high interest to be discovered and analyzed to extract security insights. To achieve this goal, a deeper understanding of the data and machine learning-based analytical models that utilize a large amount of cybersecurity data can be effective. Thus, various machine learning tasks can be involved in this model building layer according to the solution perspective. The first is security feature engineering, which is mainly responsible for transforming raw security data into informative features that effectively represent the underlying security problem to the data-driven models. Thus, several data-processing tasks may be involved in this module according to the security data characteristics, such as feature transformation and normalization, feature selection by taking into account a subset of the available security features according to their correlations or importance in modeling, or feature generation and extraction by creating brand new principal components. For instance, the chi-squared test, analysis of variance test, correlation coefficient analysis, feature importance, as well as discriminant and principal component analysis, or singular value decomposition, etc. can be used for analyzing the significance of the security features to perform the security feature engineering tasks [82].
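As an illustration of one of the tests named above, the following sketch computes the Pearson chi-squared statistic between each categorical feature and the class label and ranks features by it. The feature names and the eight-record dataset are hypothetical, and a practical pipeline would also convert the statistic to a p-value, which is omitted here.

```python
from collections import Counter


def chi_squared(feature, label):
    """Pearson chi-squared statistic for two categorical sequences."""
    n = len(feature)
    observed = Counter(zip(feature, label))
    feature_totals = Counter(feature)
    label_totals = Counter(label)
    stat = 0.0
    for f, f_count in feature_totals.items():
        for l, l_count in label_totals.items():
            expected = f_count * l_count / n
            stat += (observed[(f, l)] - expected) ** 2 / expected
    return stat


# Hypothetical binary features for eight connections; label 1 means attack.
label = [1, 1, 1, 1, 0, 0, 0, 0]
features = {
    "failed_logins_high": [1, 1, 1, 1, 0, 0, 0, 0],  # tracks the label
    "uses_tcp": [1, 0, 1, 0, 1, 0, 1, 0],            # independent of label
}
scores = {name: chi_squared(col, label) for name, col in features.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

A feature that is independent of the label scores zero, while one that tracks the label closely scores high, so keeping the top-ranked features is one simple form of feature selection.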

Another significant module is security data clustering, which uncovers hidden patterns and structures in huge volumes of security data to identify where new threats exist. It typically involves grouping security data with similar characteristics, which can be used to solve several cybersecurity problems such as detecting anomalies, policy violations, etc. The malicious behavior or anomaly detection module is typically responsible for identifying a deviation from known behavior, where clustering-based analysis and techniques can also be used. In the cybersecurity area, attack classification or prediction is treated as one of the most significant modules, which is responsible for building a prediction model to classify attacks or threats and to predict future events for a particular security problem. Predicting a denial-of-service attack, or a spam filter separating spam from other messages, could be relevant examples. The association learning or policy rule generation module can play a role in building an expert security system that comprises several IF-THEN rules that define attacks. Thus, in a problem of policy rule generation for a rule-based access control system, association learning can be used, as it discovers the associations or relationships among a set of available security features in a given security dataset. The popular machine learning algorithms in these categories are briefly discussed in the “Machine learning tasks in cybersecurity” section. The model selection or customization module is responsible for choosing whether to use an existing machine learning model or to customize one. Analyzing data and building models based on traditional machine learning or deep learning methods could achieve acceptable results in certain cases in the domain of cybersecurity. However, in terms of effectiveness and efficiency or other performance measurements considering time complexity, generalization capacity, and most importantly the impact of the algorithm on the detection rate of a system, machine learning models need to be customized for a specific security problem. Moreover, customizing the related techniques and data could improve the performance of the resultant security model and make it more applicable in a cybersecurity domain. The modules discussed above can work separately or in combination depending on the target security problems.
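To make the policy rule generation module more concrete, the sketch below mines simple IF-THEN rules from access records by exhaustively checking small sets of conditions against support and confidence thresholds. It is a brute-force illustration rather than a specific published algorithm, and the record format, thresholds, and function name `mine_rules` are assumptions.

```python
from itertools import combinations


def mine_rules(records, decision, min_support=0.3, min_confidence=0.8,
               max_len=2):
    """Return rules (conditions, decision, support, confidence) where the
    conditions co-occur with `decision` often enough to meet both thresholds."""
    n = len(records)
    items = sorted({c for conditions, _ in records for c in conditions})
    rules = []
    for k in range(1, max_len + 1):
        for antecedent in combinations(items, k):
            wanted = set(antecedent)
            covered = [d for conditions, d in records if wanted <= conditions]
            if not covered:
                continue
            hits = sum(1 for d in covered if d == decision)
            support = hits / n
            confidence = hits / len(covered)
            if support >= min_support and confidence >= min_confidence:
                rules.append((antecedent, decision, support, confidence))
    return sorted(rules, key=lambda r: (-r[3], -r[2], r[0]))


# Hypothetical access records: (set of conditions, observed decision).
records = [
    ({"role=guest", "time=night"}, "deny"),
    ({"role=guest", "time=night"}, "deny"),
    ({"role=guest", "time=day"}, "allow"),
    ({"role=admin", "time=night"}, "allow"),
    ({"role=admin", "time=day"}, "allow"),
    ({"role=guest", "time=night"}, "deny"),
]
rules = mine_rules(records, "deny")
for conditions, outcome, support, confidence in rules:
    print("IF", " AND ".join(conditions), "THEN", outcome, support, confidence)
```

On this data neither `role=guest` nor `time=night` alone is reliable enough (confidence 0.75), but their conjunction is, so the only rule produced is "IF role=guest AND time=night THEN deny". Exhaustive enumeration does not scale to many features, which is why dedicated association learning algorithms are used in practice.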

Incremental learning and dynamism

In our framework, this layer is concerned with finalizing the resultant security model by incorporating additional intelligence as needed. This is possible through further processing in several modules. For instance, the post-processing and improvement module in this layer could play a role in simplifying the extracted knowledge according to the particular requirements by incorporating domain-specific knowledge. As attack classification or prediction models based on machine learning techniques rely strongly on the training data, they can hardly be generalized to other datasets, which could be significant for some applications. To address such limitations, this module is responsible for utilizing domain knowledge in the form of a taxonomy or ontology to improve attack correlation in cybersecurity applications.

Another significant module, recency mining and updating the security model, is responsible for keeping the security model up-to-date for better performance by extracting the latest data-driven security patterns. The extracted knowledge discussed in the earlier layer is based on a static initial dataset, considering the overall patterns in the dataset. However, such knowledge might not guarantee higher performance in several cases, because incremental security data brings recent patterns. In many cases, such incremental data may contain different patterns that conflict with existing knowledge. Thus, applying the concept of RecencyMiner [170] to incremental security data and extracting new patterns can be more effective than relying on the existing old patterns. The reason is that recent security patterns and rules are more likely to be significant than older ones for predicting cyber risks or attacks. Rather than processing the whole security dataset again, recency-based dynamic updating according to the new patterns would be more efficient in terms of processing and outcome. This could make the resultant cybersecurity model intelligent and dynamic. Finally, the response planning and decision making module is responsible for making decisions based on the extracted insights and taking the necessary actions to protect the system from cyber-attacks and to provide automated and intelligent services. The services might differ depending on the particular requirements of a given security problem.
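The idea of updating a model from recent data without reprocessing the whole dataset can be sketched with a fixed-size sliding window. This is a simplified illustration and not the RecencyMiner algorithm of [170]: the class name, the window size, and the majority-vote prediction are assumptions.

```python
from collections import Counter, deque


class RecentPatternModel:
    """Keeps (pattern, label) counts over the most recent `window` events."""

    def __init__(self, window):
        self.window = window
        self.events = deque()
        self.counts = Counter()

    def update(self, pattern, label):
        """Add one labeled event; evict the oldest if the window overflows."""
        self.events.append((pattern, label))
        self.counts[(pattern, label)] += 1
        if len(self.events) > self.window:
            old = self.events.popleft()
            self.counts[old] -= 1
            if self.counts[old] == 0:
                del self.counts[old]

    def predict(self, pattern, default="unknown"):
        """Return the majority label for `pattern` among recent events."""
        candidates = {
            label: count
            for (p, label), count in self.counts.items()
            if p == pattern
        }
        if not candidates:
            return default
        return max(candidates, key=candidates.get)


model = RecentPatternModel(window=4)
for _ in range(3):
    model.update("login-from-vpn", "benign")
before = model.predict("login-from-vpn")

# Newer observations of the same pattern conflict with the older knowledge.
for _ in range(3):
    model.update("login-from-vpn", "malicious")
after = model.predict("login-from-vpn")
print(before, after)
```

Each update costs constant time regardless of how much history has been seen, and once the newer, conflicting observations outnumber the older ones in the window, the prediction for the same pattern changes from benign to malicious.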

Overall, this framework is a generic description that can potentially be used to discover useful insights from security data and to build smart cybersecurity systems that address complex security challenges, such as intrusion detection, access control management, anomaly and fraud detection, or denial-of-service attacks, in the area of cybersecurity data science.

Although several research efforts have been directed towards cybersecurity solutions, as discussed in the “Background”, “Cybersecurity data science”, and “Machine learning tasks in cybersecurity” sections, this paper presents a comprehensive view of cybersecurity data science. For this, we have conducted a literature review to understand cybersecurity data, various defense strategies including intrusion detection techniques, and different types of machine learning techniques in cybersecurity tasks. Based on our discussion of existing work, we have identified several research issues that require further research attention in the domain of cybersecurity data science, related to security datasets, data quality problems, policy rule generation, learning methods, data protection, feature engineering, security alert generation, recency analysis, etc.

The scope of cybersecurity data science is broad. Several data-driven tasks, such as intrusion detection and prevention, access control management, security policy generation, anomaly detection, spam filtering, fraud detection and prevention, and various types of malware attack detection and defense strategies, can be considered within the scope of cybersecurity data science. Such task-based categorization could be helpful for security professionals, including researchers and practitioners, who are interested in the domain-specific aspects of security systems [171]. The output of cybersecurity data science can be used in many application areas such as Internet of Things (IoT) security [173], network security [174], cloud security [175], mobile and web applications [26], and other relevant cyber areas. Moreover, intelligent cybersecurity solutions are important for the banking industry, the healthcare sector, and the public sector, where data breaches typically occur [36, 176]. Besides, data-driven security solutions could also be effective in AI-based blockchain technology, where AI works with huge volumes of security event data to extract useful insights using machine learning techniques, and blockchain serves as a trusted platform to store such data [177].

Although in this paper we discuss cybersecurity data science with a focus on moving from raw security data to data-driven decision making for intelligent security solutions, it could also be related to big data analytics in terms of data processing and decision making. Big data deals with data sets that are too large or complex, characterized by high data volume, velocity, and variety. Big data analytics mainly has two parts: data management, involving data storage, and analytics [178]. Analytics typically describes the process of analyzing such datasets to discover patterns, unknown correlations, rules, and other useful insights [179]. Thus, several advanced data analysis techniques such as AI, data mining, and machine learning could play an important role in processing big data by converting big problems into small problems [180]. To do this, potential strategies like parallelization, divide-and-conquer, incremental learning, sampling, granular computing, and feature or instance selection can be used to make better decisions, reduce costs, or enable more efficient processing. In such cases, the concept of cybersecurity data science, particularly machine learning-based modeling, could be helpful for process automation and decision making for intelligent security solutions. Moreover, researchers could consider modified algorithms or models for handling big data on parallel computing platforms like Hadoop, Storm, etc. [181].
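The divide-and-conquer strategy mentioned above can be illustrated on a single machine: split the event log into partitions, summarize each partition independently, and merge the partial results. This sketch uses a thread pool only to show the structure; it is an assumption for illustration, the event data is hypothetical, and platforms such as Hadoop distribute the same map-and-merge pattern across many machines.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_partition(events):
    """Map step: count events per source IP within one partition."""
    return Counter(ip for ip, _ in events)


def parallel_count(events, partitions=4):
    """Split events into partitions, count each one, then merge the counts."""
    size = max(1, -(-len(events) // partitions))  # ceiling division
    chunks = [events[i:i + size] for i in range(0, len(events), size)]
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(count_partition, chunks))
    total = Counter()
    for partial in partials:
        total.update(partial)  # merge step: add partial counts together
    return total


# Hypothetical (source IP, event type) records.
events = (
    [("10.0.0.1", "login")] * 5
    + [("10.0.0.2", "scan")] * 3
    + [("10.0.0.3", "login")] * 2
)
totals = parallel_count(events)
print(totals.most_common(1))
```

Because counting is associative, the merged result equals what a single pass over the full log would produce, which is what makes this kind of problem easy to partition.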

Based on the concept of cybersecurity data science discussed in this paper, future work could include building a data-driven security model for a particular security problem, conducting a relevant empirical evaluation to measure the effectiveness and efficiency of the model, and assessing its usability in a real-world application domain.

Motivated by the growing significance of cybersecurity, data science, and machine learning technologies, in this paper we have discussed how cybersecurity data science applies to data-driven intelligent decision making in smart cybersecurity systems and services. We have also discussed how it can impact security data, both in terms of extracting insights into security incidents and the dataset itself. We aimed to work on cybersecurity data science by discussing the state of the art concerning security incident data and corresponding security services. We also discussed how machine learning techniques can have an impact in the domain of cybersecurity, and examined the security challenges that remain. In terms of existing research, much focus has been placed on traditional security solutions, with less work available on machine learning-based security systems. For each common technique, we have discussed relevant security research. The purpose of this article is to share an overview of the conceptualization, understanding, modeling, and thinking about cybersecurity data science.

We have further identified and discussed various key issues in security analysis to signpost future research directions in the domain of cybersecurity data science. Based on this knowledge, we have also provided a generic multi-layered framework for a cybersecurity data science model based on machine learning techniques, where the data is gathered from diverse sources and the analytics incorporate the latest data-driven patterns for providing intelligent security services. The framework consists of several main phases: security data collecting, data preparation, machine learning-based security modeling, and incremental learning and dynamism for smart cybersecurity systems and services. We specifically focused on extracting insights from security data, setting out a research design with particular attention to concepts for data-driven intelligent security solutions.

Overall, this paper aimed not only to discuss cybersecurity data science and relevant methods but also to discuss the applicability towards data-driven intelligent decision making in cybersecurity systems and services from machine learning perspectives. Our analysis and discussion can have several implications both for security researchers and practitioners. For researchers, we have highlighted several issues and directions for future research. Other areas for potential research include empirical evaluation of the suggested data-driven model, and comparative analysis with other security systems. For practitioners, the multi-layered machine learning-based model can be used as a reference in designing intelligent cybersecurity systems for organizations. We believe that our study on cybersecurity data science opens a promising path and can be used as a reference guide for both academia and industry for future research and applications in the area of cybersecurity.

Availability of data and materials

Not applicable.

Abbreviations

ML: Machine learning

AI: Artificial intelligence

ICT: Information and communication technology

IoT: Internet of Things

DDoS: Distributed denial of service

IDS: Intrusion detection system

IPS: Intrusion prevention system

HIDS: Host-based intrusion detection system

NIDS: Network intrusion detection system

SIDS: Signature-based intrusion detection system

AIDS: Anomaly-based intrusion detection system

References

Li S, Da Xu L, Zhao S. The internet of things: a survey. Inform Syst Front. 2015;17(2):243–59.

Sun N, Zhang J, Rimba P, Gao S, Zhang LY, Xiang Y. Data-driven cybersecurity incident prediction: a survey. IEEE Commun Surv Tutor. 2018;21(2):1744–72.

McIntosh T, Jang-Jaccard J, Watters P, Susnjak T. The inadequacy of entropy-based ransomware detection. In: International conference on neural information processing. New York: Springer; 2019. p. 181–189

Alazab M, Venkatraman S, Watters P, Alazab M, et al. Zero-day malware detection based on supervised learning algorithms of api call signatures (2010)

Shaw A. Data breach: from notification to prevention using pci dss. Colum Soc Probs. 2009;43:517.

Gupta BB, Tewari A, Jain AK, Agrawal DP. Fighting against phishing attacks: state of the art and future challenges. Neural Comput Appl. 2017;28(12):3629–54.

Av-test institute, germany, https://www.av-test.org/en/statistics/malware/ . Accessed 20 Oct 2019.

Ibm security report, https://www.ibm.com/security/data-breach . Accessed on 20 Oct 2019.

Fischer EA. Cybersecurity issues and challenges: In brief. Congressional Research Service (2014)

Juniper research. https://www.juniperresearch.com/ . Accessed on 20 Oct 2019.

Papastergiou S, Mouratidis H, Kalogeraki E-M. Cyber security incident handling, warning and response system for the european critical information infrastructures (cybersane). In: International Conference on Engineering Applications of Neural Networks, p. 476–487 (2019). New York: Springer

Aftergood S. Cybersecurity: the cold war online. Nature. 2017;547(7661):30.

Hey AJ, Tansley S, Tolle KM, et al. The fourth paradigm: data-intensive scientific discovery. 2009;1:

Cukier K. Data, data everywhere: A special report on managing information, 2010.

Google trends. In: https://trends.google.com/trends/ , 2019.

Anwar S, Mohamad Zain J, Zolkipli MF, Inayat Z, Khan S, Anthony B, Chang V. From intrusion detection to an intrusion response system: fundamentals, requirements, and future directions. Algorithms. 2017;10(2):39.

Mohammadi S, Mirvaziri H, Ghazizadeh-Ahsaee M, Karimipour H. Cyber intrusion detection by combined feature selection algorithm. J Inform Sec Appl. 2019;44:80–8.

Tapiador JE, Orfila A, Ribagorda A, Ramos B. Key-recovery attacks on kids, a keyed anomaly detection system. IEEE Trans Depend Sec Comput. 2013;12(3):312–25.

Tavallaee M, Stakhanova N, Ghorbani AA. Toward credible evaluation of anomaly-based intrusion-detection methods. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 40(5), 516–524 (2010)

Foroughi F, Luksch P. Data science methodology for cybersecurity projects. arXiv preprint arXiv:1803.04219 , 2018.

Saxe J, Sanders H. Malware data science: Attack detection and attribution, 2018.

Rainie L, Anderson J, Connolly J. Cyber attacks likely to increase. Digital Life in. 2014, vol. 2025.

Fischer EA. Creating a national framework for cybersecurity: an analysis of issues and options. LIBRARY OF CONGRESS WASHINGTON DC CONGRESSIONAL RESEARCH SERVICE, 2005.

Craigen D, Diakun-Thibault N, Purse R. Defining cybersecurity. Technology Innovation. Manag Rev. 2014;4(10):13–21.

Council NR. et al. Toward a safer and more secure cyberspace, 2007.

Jang-Jaccard J, Nepal S. A survey of emerging threats in cybersecurity. J Comput Syst Sci. 2014;80(5):973–93.

Mukkamala S, Sung A, Abraham A. Cyber security challenges: Designing efficient intrusion detection systems and antivirus tools. Vemuri, V. Rao, Enhancing Computer Security with Smart Technology.(Auerbach, 2006), 125–163, 2005.

Bilge L, Dumitraş T. Before we knew it: an empirical study of zero-day attacks in the real world. In: Proceedings of the 2012 ACM conference on computer and communications security. ACM; 2012. p. 833–44.

Davi L, Dmitrienko A, Sadeghi A-R, Winandy M. Privilege escalation attacks on android. In: International conference on information security. New York: Springer; 2010. p. 346–60.

Jovičić B, Simić D. Common web application attack types and security using asp .net. ComSIS, 2006.

Warkentin M, Willison R. Behavioral and policy issues in information systems security: the insider threat. Eur J Inform Syst. 2009;18(2):101–5.

Kügler D. “man in the middle” attacks on bluetooth. In: International Conference on Financial Cryptography. New York: Springer; 2003, p. 149–61.

Virvilis N, Gritzalis D. The big four-what we did wrong in advanced persistent threat detection. In: 2013 International Conference on Availability, Reliability and Security. IEEE; 2013. p. 248–54.

Boyd SW, Keromytis AD. Sqlrand: Preventing sql injection attacks. In: International conference on applied cryptography and network security. New York: Springer; 2004. p. 292–302.

Sigler K. Crypto-jacking: how cyber-criminals are exploiting the crypto-currency boom. Comput Fraud Sec. 2018;2018(9):12–4.

2019 data breach investigations report, https://enterprise.verizon.com/resources/reports/dbir/ . Accessed 20 Oct 2019.

Khraisat A, Gondal I, Vamplew P, Kamruzzaman J. Survey of intrusion detection systems: techniques, datasets and challenges. Cybersecurity. 2019;2(1):20.

Johnson L. Computer incident response and forensics team management: conducting a successful incident response, 2013.

Brahmi I, Brahmi H, Yahia SB. A multi-agents intrusion detection system using ontology and clustering techniques. In: IFIP international conference on computer science and its applications. New York: Springer; 2015. p. 381–93.

Qu X, Yang L, Guo K, Ma L, Sun M, Ke M, Li M. A survey on the development of self-organizing maps for unsupervised intrusion detection. In: Mobile networks and applications. 2019;1–22.

Liao H-J, Lin C-HR, Lin Y-C, Tung K-Y. Intrusion detection system: a comprehensive review. J Netw Comput Appl. 2013;36(1):16–24.

Alazab A, Hobbs M, Abawajy J, Alazab M. Using feature selection for intrusion detection system. In: 2012 International symposium on communications and information technologies (ISCIT). IEEE; 2012. p. 296–301.

Viegas E, Santin AO, Franca A, Jasinski R, Pedroni VA, Oliveira LS. Towards an energy-efficient anomaly-based intrusion detection engine for embedded systems. IEEE Trans Comput. 2016;66(1):163–77.

Xin Y, Kong L, Liu Z, Chen Y, Li Y, Zhu H, Gao M, Hou H, Wang C. Machine learning and deep learning methods for cybersecurity. IEEE Access. 2018;6:35365–81.

Dutt I, Borah S, Maitra IK, Bhowmik K, Maity A, Das S. Real-time hybrid intrusion detection system using machine learning techniques. 2018, p. 885–94.

Ragsdale DJ, Carver C, Humphries JW, Pooch UW. Adaptation techniques for intrusion detection and intrusion response systems. In: Smc 2000 conference proceedings. 2000 IEEE international conference on systems, man and cybernetics.’cybernetics evolving to systems, humans, organizations, and their complex interactions’(cat. No. 0). IEEE; 2000. vol. 4, p. 2344–2349.

Cao L. Data science: challenges and directions. Commun ACM. 2017;60(8):59–68.

Rizk A, Elragal A. Data science: developing theoretical contributions in information systems via text analytics. J Big Data. 2020;7(1):1–26.

Lippmann RP, Fried DJ, Graf I, Haines JW, Kendall KR, McClung D, Weber D, Webster SE, Wyschogrod D, Cunningham RK, et al. Evaluating intrusion detection systems: The 1998 darpa off-line intrusion detection evaluation. In: Proceedings DARPA information survivability conference and exposition. DISCEX’00. IEEE; 2000. vol. 2, p. 12–26.

Kdd cup 99. http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html . Accessed 20 Oct 2019.

Tavallaee M, Bagheri E, Lu W, Ghorbani AA. A detailed analysis of the kdd cup 99 data set. In: 2009 IEEE symposium on computational intelligence for security and defense applications. IEEE; 2009. p. 1–6.

Caida ddos attack 2007 dataset. http://www.caida.org/data/passive/ddos-20070804-dataset.xml/ . Accessed 20 Oct 2019.

Caida anonymized internet traces 2008 dataset. https://www.caida.org/data/passive/passive-2008-dataset . Accessed 20 Oct 2019.

Isot botnet dataset. https://www.uvic.ca/engineering/ece/isot/datasets/index.php/ . Accessed 20 Oct 2019.

The honeynet project. http://www.honeynet.org/chapters/france/ . Accessed 20 Oct 2019.

Canadian institute of cybersecurity, university of new brunswick, iscx dataset, http://www.unb.ca/cic/datasets/index.html/ . Accessed 20 Oct 2019.

Shiravi A, Shiravi H, Tavallaee M, Ghorbani AA. Toward developing a systematic approach to generate benchmark datasets for intrusion detection. Comput Secur. 2012;31(3):357–74.

The ctu-13 dataset. https://stratosphereips.org/category/datasets-ctu13 . Accessed 20 Oct 2019.

Moustafa N, Slay J. Unsw-nb15: a comprehensive data set for network intrusion detection systems (unsw-nb15 network data set). In: 2015 Military Communications and Information Systems Conference (MilCIS). IEEE; 2015. p. 1–6.

Cse-cic-ids2018 [online]. available: https://www.unb.ca/cic/datasets/ids-2018.html/ . Accessed 20 Oct 2019.

Cic-ddos2019 [online]. available: https://www.unb.ca/cic/datasets/ddos-2019.html/ . Accessed 28 Mar 2019.

Jing X, Yan Z, Jiang X, Pedrycz W. Network traffic fusion and analysis against ddos flooding attacks with a novel reversible sketch. Inform Fusion. 2019;51:100–13.

Xie M, Hu J, Yu X, Chang E. Evaluating host-based anomaly detection systems: application of the frequency-based algorithms to adfa-ld. In: International conference on network and system security. New York: Springer; 2015. p. 542–49.

Lindauer B, Glasser J, Rosen M, Wallnau KC, ExactData L. Generating test data for insider threat detectors. JoWUA. 2014;5(2):80–94.

Glasser J, Lindauer B. Bridging the gap: A pragmatic approach to generating insider threat data. In: 2013 IEEE Security and Privacy Workshops. IEEE; 2013. p. 98–104.

Enronspam. https://labs-repos.iit.demokritos.gr/skel/i-config/downloads/enron-spam/ . Accessed 20 Oct 2019.

Spamassassin. http://www.spamassassin.org/publiccorpus/ . Accessed 20 Oct 2019.

Lingspam. https://labs-repos.iit.demokritos.gr/skel/i-config/downloads/lingspampublic.tar.gz/ . Accessed 20 Oct 2019.

Alexa top sites. https://aws.amazon.com/alexa-top-sites/ . Accessed 20 Oct 2019.

Bambenek consulting—master feeds. available online: http://osint.bambenekconsulting.com/feeds/ . Accessed 20 Oct 2019.

Dgarchive. https://dgarchive.caad.fkie.fraunhofer.de/site/ . Accessed 20 Oct 2019.

Zago M, Pérez MG, Pérez GM. Umudga: A dataset for profiling algorithmically generated domain names in botnet detection. Data in Brief. 2020;105400.


Acknowledgements

The authors would like to thank all the reviewers for their rigorous review and comments in several revision rounds. The reviews are detailed and helpful to improve and finalize the manuscript. The authors are highly grateful to them.

Author information

Authors and affiliations.

Swinburne University of Technology, Melbourne, VIC, 3122, Australia

Iqbal H. Sarker

Chittagong University of Engineering and Technology, Chittagong, 4349, Bangladesh

La Trobe University, Melbourne, VIC, 3086, Australia

A. S. M. Kayes, Paul Watters & Alex Ng

University of Nevada, Reno, USA

Shahriar Badsha

Macquarie University, Sydney, NSW, 2109, Australia

Hamed Alqahtani


Contributions

This article not only discusses cybersecurity data science and relevant methods but also examines their applicability to data-driven intelligent decision making in cybersecurity systems and services. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Iqbal H. Sarker.

Ethics declarations

Competing interests.

The authors declare that they have no competing interests.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Cite this article.

Sarker, I.H., Kayes, A.S.M., Badsha, S. et al. Cybersecurity data science: an overview from machine learning perspective. J Big Data 7, 41 (2020). https://doi.org/10.1186/s40537-020-00318-5


Received: 26 October 2019

Accepted: 21 June 2020

Published: 01 July 2020

DOI: https://doi.org/10.1186/s40537-020-00318-5


  • Decision making
  • Cyber-attack
  • Security modeling
  • Intrusion detection
  • Cyber threat intelligence


List of Best Research and Thesis Topic Ideas for Data Science in 2022

In an era driven by digital and technological transformation, businesses actively seek skilled data science professionals who can turn data insights into higher productivity and help achieve organizational objectives. To meet this growing demand, universities offer a range of data science and big data courses that prepare students for the tech industry. Research projects are a crucial part of these programs, and a well-executed data science project can make your CV more compelling. A broad range of data science topics offers exciting possibilities for research, but choosing a research topic can be a real challenge for students. After all, a good research project relies first and foremost on a topic that draws on both mono-disciplinary and multi-disciplinary research to explore real-world applications.

As one of the leading online dissertation writing services for master’s and PhD students, we assist students through the entire research process, from initial conception to final execution, so that you have a fulfilling and enriching research experience. These resources are also helpful for students who are taking online classes.

By drawing on our research topics in data science, you can produce an innovative research project that will impress your professors and help attract the right employers.


Data science thesis topics

We have compiled a list of data science research topics that students can use in data science projects in 2022. Our team of data experts has brought together master’s and MBA thesis topics that cover the core areas driving data science and big data, to ease your research anxieties and provide a solid grounding for an interesting research project. The article features data science thesis ideas that cover a broad research agenda for the future of data science. These ideas are drawn from the 8 V’s of big data, namely Volume, Value, Veracity, Visualization, Variety, Velocity, Viscosity, and Virality, which provide interesting and challenging research areas for prospective researchers working on a master’s or PhD thesis. Overall, big data research topics can be divided into distinct categories to facilitate topic selection.

  • Security and privacy issues
  • Cloud Computing Platforms for Big Data Adoption and Analytics
  • Real-time data analytics for processing of image, video and text
  • Modeling uncertainty


DATA SCIENCE PHD RESEARCH TOPICS

The article will also guide students engaged in doctoral research by introducing them to an outstanding list of data science thesis topics that can lead to major real-time applications of big data analytics in your research projects.

  • Intelligent traffic control: gathering and monitoring traffic information using CCTV images.
  • Asymmetric protected storage methodology over multi-cloud service providers in Big data.
  • Leveraging disseminated data over big data analytics environment.
  • Internet of Things.
  • Large-scale data system and anomaly detection.

What makes us a unique research service for your research needs?

We offer all-round research services with a distinguished track record of helping students secure their desired grades in big data analytics research projects, paving the way for a promising career. These are the features that set us apart in the market for research services:

  • Plagiarism-free: We strictly adhere to a non-plagiarism policy in all our research work to provide you with well-written, original content with a low similarity index, to maximize the chances of acceptance of your research submissions.
  • Publication: We don’t just suggest PhD data science research topics; our PhD consultancy services also support the publication of your research in well-reputed journals. A PhD thesis is indispensable for a PhD degree, and our PhD thesis services tackle all aspects of research writing and cater to the essential requirements of journals, bringing you closer to your goal of earning a PhD in data analytics.
  • Research ethics: Solid research ethics lie at the core of our services, and we actively seek to protect the privacy and confidentiality of the technical and personal information of our valued customers.
  • Research experience: We take pride in our world-class team of computing industry professionals, who have the expertise and experience to assist in choosing data science research topics and in the subsequent phases of research, including finding solutions, code development and final manuscript writing.
  • Business ethics: We are driven by a business philosophy that’s wholly committed to achieving total customer satisfaction by providing constant online and offline support and timely submissions so that you can keep track of the progress of your research.

Now, we’ll proceed to cover specific research problems encompassing both data analytics research topics and big data thesis topics that have applications across multiple domains.

Get Help from Expert Thesis Writers!

TheResearchGuardian.com provides expert thesis assistance for university students at every level. Our thesis writing service has been serving students since 2011.

Multi-modal Transfer Learning for Cross-Modal Information Retrieval

Aim and objectives.

The research aims to examine how a cross-modal retrieval (CMR) approach can provide a flexible retrieval experience by combining data across different modalities to make use of abundant multimedia data.

  • Develop methods to enable learning across different modalities in shared cross-modal spaces comprising texts and images, and consider the limitations of existing cross-modal retrieval algorithms.
  • Investigate the presence and effects of bias in cross-modal transfer learning and suggest strategies for bias detection and mitigation.
  • Develop a tool with query expansion and relevance feedback capabilities to facilitate search and retrieval of multi-modal data.
  • Investigate the methods of multi-modal learning and elaborate on the importance of multi-modal deep learning in providing a comprehensive learning experience.
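
The idea of a shared cross-modal space in the first objective can be illustrated with a toy example. The sketch below assumes that text and image items have already been projected into a common vector space by learned encoders (not shown here); the embedding values, file names and the `retrieve` helper are hypothetical and chosen only for illustration. Items are ranked against a text query by cosine similarity.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def retrieve(query_vec, index, top_k=2):
    """Return the top_k item ids ranked by similarity to the query."""
    ranked = sorted(
        index,
        key=lambda item_id: cosine(query_vec, index[item_id]),
        reverse=True,
    )
    return ranked[:top_k]

# Hypothetical image embeddings in a 3-dimensional shared space
image_index = {
    "cat.jpg": [0.9, 0.1, 0.0],
    "dog.jpg": [0.7, 0.6, 0.1],
    "car.jpg": [0.0, 0.1, 0.95],
}

# Hypothetical embedding of the text query "a photo of a cat"
text_query = [0.88, 0.15, 0.05]

results = retrieve(text_query, image_index)
print(results)
```

In a real project the vectors would come from trained text and image encoders, and the quality of the shared space, rather than the ranking step, would be the main research question.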

The Role of Machine Learning in Facilitating the Implementation of Scientific Computing and Software Engineering

  • Evaluate how machine learning leads to improvements in computational tools and thus aids the implementation of scientific computing.
  • Evaluate the effectiveness of machine learning in solving complex problems and improving the efficiency of scientific computing and software engineering processes.
  • Assess the potential benefits and challenges of using machine learning in these fields, including factors such as cost, accuracy, and scalability.
  • Examine the ethical and social implications of using machine learning in scientific computing and software engineering, such as issues related to bias, transparency, and accountability.

Trustworthy AI

The research aims to explore the crucial role of data science in advancing scientific goals and solving problems, as well as the implications of using AI systems, especially with respect to ethical concerns.

  • Investigate the value of digital infrastructures available through open data in aiding the sharing and interlinking of data for enhanced global collaborative research efforts.
  • Provide explanations of the outcomes of a machine learning model that allow meaningful interpretation and build users’ trust in the reliability and authenticity of data.
  • Investigate how formal models can be used to verify and establish the efficacy of the results derived from probabilistic models.
  • Review the concept of trustworthy computing as a relevant framework for addressing the ethical concerns associated with AI systems.

The Implementation of Data Science and its impact on the management environment and sustainability

The aim of the research is to demonstrate how data science and analytics can be leveraged in achieving sustainable development.

  • To examine the implementation of data science using data-driven decision-making tools.
  • To evaluate the impact of modern information technology on the management environment and sustainability.
  • To examine the use of data science in achieving more effective and efficient environmental management.
  • To explore how data science and analytics can be used to achieve sustainability goals across the three dimensions: economic, social and environmental.

Big data analytics in healthcare systems

The aim of the research is to examine the application of big data analytics in creating smart healthcare systems and how it can lead to more efficient, accessible and cost-effective health care.

  • Identify the potential areas or opportunities in big data to transform the healthcare system, such as diagnosis, treatment planning, or drug development.
  • Assess the potential benefits and challenges of using AI and deep learning in healthcare, including factors such as cost, efficiency, and accessibility.
  • Evaluate the effectiveness of AI and deep learning in improving patient outcomes, such as reducing morbidity and mortality rates, improving the accuracy and speed of diagnoses, or reducing medical errors.
  • Examine the ethical and social implications of using AI and deep learning in healthcare, such as issues related to bias, privacy, and autonomy.

Large-Scale Data-Driven Financial Risk Assessment

The research aims to explore the possibilities that big data offers for consistent, real-time assessment of financial risks.

  • Investigate how the use of big data can help to identify and forecast risks that can harm a business.
  • Categorise the types of financial risks faced by companies.
  • Describe the importance of financial risk management for companies in business terms.
  • Train a machine learning model to classify transactions as fraudulent or genuine.
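
The last objective, classifying transactions as fraudulent or genuine, can be sketched in a few lines. The example below is a minimal illustration and not a production fraud model: the data is synthetic, the two features (a scaled amount and a night-time score) and their distributions are assumptions made purely for the demo, and the classifier is a plain logistic regression trained with stochastic gradient descent using only the standard library.

```python
import math
import random

def make_transactions(n, rng):
    """Generate synthetic ([amount, night_score], label) rows; label 1 = fraud."""
    rows = []
    for _ in range(n):
        fraud = rng.random() < 0.3
        # Assumption for illustration only: fraud tends to be larger and later at night
        amount = rng.gauss(3.0, 0.5) if fraud else rng.gauss(1.0, 0.5)
        night = rng.gauss(0.8, 0.2) if fraud else rng.gauss(0.2, 0.2)
        rows.append(([amount, night], 1 if fraud else 0))
    return rows

def sigmoid(z):
    """Logistic function, with z clamped to avoid overflow in exp."""
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))

def score(weights, bias, features):
    """Predicted probability that a transaction is fraudulent."""
    return sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)

def train(rows, epochs=50, lr=0.1):
    """Fit logistic regression with per-sample gradient updates."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for features, label in rows:
            error = score(weights, bias, features) - label
            weights = [w - lr * error * x for w, x in zip(weights, features)]
            bias -= lr * error
    return weights, bias

def predict(model, features):
    """Return 1 (fraud) when the predicted probability is at least 0.5."""
    weights, bias = model
    return 1 if score(weights, bias, features) >= 0.5 else 0

rng = random.Random(42)  # fixed seed so the run is repeatable
data = make_transactions(400, rng)
train_rows, test_rows = data[:300], data[300:]
model = train(train_rows)
accuracy = sum(predict(model, f) == y for f, y in test_rows) / len(test_rows)
print(f"held-out accuracy: {accuracy:.2f}")
```

Real transaction data is far more imbalanced than this synthetic set, so a thesis would normally report precision and recall for the fraud class rather than accuracy alone.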

Scalable Architectures for Parallel Data Processing

Big data has exposed us to an ever-growing volume of data that cannot be handled through traditional data management and analysis systems. This has given rise to the use of scalable system architectures to efficiently process big data and exploit its true value. The research aims to analyse the current state of practice in scalable architectures and identify common patterns and techniques for designing scalable architectures for parallel data processing.

  • To design and implement a prototype scalable architecture for parallel data processing
  • To evaluate the performance and scalability of the prototype architecture using benchmarks and real-world datasets
  • To compare the prototype architecture with existing solutions and identify its strengths and weaknesses
  • To evaluate the trade-offs and limitations of different scalable architectures for parallel data processing
  • To provide recommendations for the use of the prototype architecture in different scenarios, such as batch processing, stream processing, and interactive querying
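
A common pattern behind such architectures is to partition the data, process each partition independently, and merge the partial results. The sketch below shows that partition, map and reduce flow on a toy word count. It is an assumption-laden illustration: the function names and sample log lines are hypothetical, and a thread pool stands in for what would be separate processes or cluster nodes in a real system.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def partition(records, n_parts):
    """Split records into n_parts chunks using round-robin assignment."""
    return [records[i::n_parts] for i in range(n_parts)]

def map_chunk(chunk):
    """Map step: count words within a single partition."""
    counts = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts

def parallel_word_count(records, n_workers=4):
    """Run the map step on each partition concurrently, then merge the results."""
    chunks = partition(records, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(map_chunk, chunks))
    total = Counter()
    for partial in partials:  # reduce step: merge per-partition counts
        total.update(partial)
    return total

# Hypothetical sample input
logs = [
    "error disk full",
    "warning disk slow",
    "error network down",
    "info disk ok",
]
totals = parallel_word_count(logs, n_workers=2)
print(totals.most_common(2))
```

Because the map step touches only its own partition, the same structure carries over to batch frameworks and, with windowing, to stream processing; the thread pool here only demonstrates the control flow and does not give CPU-level parallelism in CPython.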

Robotic manipulation modelling

The aim of this research is to develop and validate a model-based control approach for robotic manipulation of small, precise objects.

  • Develop a mathematical model of the robotic system that captures the dynamics of the manipulator and the grasped object.
  • Design a control algorithm that uses the developed model to achieve stable and accurate grasping of the object.
  • Test the proposed approach in simulation and validate the results through experiments with a physical robotic system.
  • Evaluate the performance of the proposed approach in terms of stability, accuracy, and robustness to uncertainties and perturbations.
  • Identify potential applications and areas for future work in the field of robotic manipulation for precision tasks.

Big data analytics and its impacts on marketing strategy

The aim of this research is to investigate the impact of big data analytics on marketing strategy and to identify best practices for leveraging this technology to inform decision-making.

  • Review the literature on big data analytics and marketing strategy to identify key trends and challenges
  • Conduct a case study analysis of companies that have successfully integrated big data analytics into their marketing strategies
  • Identify the key factors that contribute to the effectiveness of big data analytics in marketing decision-making
  • Develop a framework for integrating big data analytics into marketing strategy.
  • Investigate the ethical implications of big data analytics in marketing and suggest best practices for responsible use of this technology.


Platforms for large scale data computing: big data analysis and acceptance

To investigate the performance and scalability of different large-scale data computing platforms.

  • To compare the features and capabilities of different platforms and determine which is most suitable for a given use case.
  • To identify best practices for using these platforms, including considerations for data management, security, and cost.
  • To explore the potential for integrating these platforms with other technologies and tools for data analysis and visualization.
  • To develop case studies or practical examples of how these platforms have been used to solve real-world data analysis challenges.

Distributed data clustering

Distributed data clustering can be a useful approach for analyzing and understanding complex datasets, as it allows for the identification of patterns and relationships that may not be immediately apparent.

To develop and evaluate new algorithms for distributed data clustering that are efficient and scalable.

  • To compare the performance and accuracy of different distributed data clustering algorithms on a variety of datasets.
  • To investigate the impact of different parameters and settings on the performance of distributed data clustering algorithms.
  • To explore the potential for integrating distributed data clustering with other machine learning and data analysis techniques.
  • To apply distributed data clustering to real-world problems and evaluate its effectiveness.
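
As a starting point for the objectives above, the following is a minimal sketch of one common way to distribute clustering: a k-means variant in which each data partition computes per-centroid sums and counts locally and a coordinator merges them. The function names, the 2-D point type, and the sequential loop standing in for parallel workers are all assumptions made for illustration, not a reference implementation.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def nearest(point: Point, centroids: List[Point]) -> int:
    """Index of the centroid closest to `point` (squared Euclidean distance)."""
    dists = [(point[0] - c[0]) ** 2 + (point[1] - c[1]) ** 2 for c in centroids]
    return dists.index(min(dists))


def local_stats(partition: List[Point], centroids: List[Point]):
    """Worker step: per-centroid coordinate sums and counts for one partition."""
    sums = [[0.0, 0.0] for _ in centroids]
    counts = [0] * len(centroids)
    for p in partition:
        k = nearest(p, centroids)
        sums[k][0] += p[0]
        sums[k][1] += p[1]
        counts[k] += 1
    return sums, counts


def merge_and_update(stats, centroids: List[Point]) -> List[Point]:
    """Coordinator step: combine all worker statistics into new centroids."""
    new_centroids = []
    for k, old in enumerate(centroids):
        total = sum(counts[k] for _, counts in stats)
        if total == 0:
            new_centroids.append(old)  # empty cluster: keep the old centroid
            continue
        sx = sum(sums[k][0] for sums, _ in stats)
        sy = sum(sums[k][1] for sums, _ in stats)
        new_centroids.append((sx / total, sy / total))
    return new_centroids


def distributed_kmeans(partitions, centroids, iterations=10):
    """Run a fixed number of rounds; workers are simulated sequentially here."""
    for _ in range(iterations):
        stats = [local_stats(part, centroids) for part in partitions]
        centroids = merge_and_update(stats, centroids)
    return centroids
```

Only the small per-centroid summaries travel between workers and the coordinator, so the communication cost per round depends on the number of clusters rather than on the number of points.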

Analyzing and predicting urbanization patterns using GIS and data mining techniques

The aim of this project is to use GIS and data mining techniques to analyze and predict urbanization patterns in a specific region.

  • To collect and process relevant data on urbanization patterns, including population density, land use, and infrastructure development, using GIS tools.
  • To apply data mining techniques, such as clustering and regression analysis, to identify trends and patterns in the data.
  • To use the results of the data analysis to develop a predictive model for urbanization patterns in the region.
  • To present the results of the analysis and the predictive model in a clear and visually appealing way, using GIS maps and other visualization techniques.

Use of big data and IoT in the media industry

Big data and the Internet of Things (IoT) are emerging technologies that are transforming the way that information is collected, analyzed, and disseminated in the media sector. The aim of the research is to understand how big data and IoT are used to dictate information flow in the media industry.

  • Identifying the key ways in which big data and IoT are being used in the media sector, such as for content creation, audience engagement, or advertising.
  • Analyzing the benefits and challenges of using big data and IoT in the media industry, including factors such as cost, efficiency, and effectiveness.
  • Examining the ethical and social implications of using big data and IoT in the media sector, including issues such as privacy, security, and bias.
  • Determining the potential impact of big data and IoT on the media landscape and the role of traditional media in an increasingly digital world.

Exigency computer systems for meteorology and disaster prevention

The research aims to explore the role of exigency computer systems in detecting weather and other hazards for disaster prevention and response.

  • Identifying the key components and features of exigency computer systems for meteorology and disaster prevention, such as data sources, analytics tools, and communication channels.
  • Evaluating the effectiveness of exigency computer systems in providing accurate and timely information about weather and other hazards.
  • Assessing the impact of exigency computer systems on the ability of decision makers to prepare for and respond to disasters.
  • Examining the challenges and limitations of using exigency computer systems, such as the need for reliable data sources, the complexity of the systems, or the potential for human error.

Network security and cryptography

The overall goal of this research is to improve our understanding of how to protect communication and information in the digital age, and to develop practical solutions for addressing the complex and evolving security challenges faced by individuals, organizations, and societies.

  • Developing new algorithms and protocols for securing communication over networks, such as for data confidentiality, data integrity, and authentication.
  • Investigating the security of existing cryptographic primitives, such as encryption and hashing algorithms, and identifying vulnerabilities that could be exploited by attackers.
  • Evaluating the effectiveness of different network security technologies and protocols, such as firewalls, intrusion detection systems, and virtual private networks (VPNs), in protecting against different types of attacks.
  • Exploring the use of cryptography in emerging areas, such as cloud computing, the Internet of Things (IoT), and blockchain, and identifying the unique security challenges and opportunities presented by these domains.
  • Investigating the trade-offs between security and other factors, such as performance, usability, and cost, and developing strategies for balancing these conflicting priorities.
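
To make the data integrity and authentication goals above concrete, here is a minimal sketch using a keyed hash (HMAC-SHA256) from Python's standard library. The helper names `sign` and `verify` are hypothetical and chosen for illustration. Note that an HMAC tag does not provide confidentiality, which would need encryption in addition.

```python
import hashlib
import hmac
import secrets


def sign(key: bytes, message: bytes) -> bytes:
    """Compute an HMAC-SHA256 tag over the message with a shared secret key."""
    return hmac.new(key, message, hashlib.sha256).digest()


def verify(key: bytes, message: bytes, tag: bytes) -> bool:
    """Recompute the tag and compare in constant time to resist timing attacks."""
    return hmac.compare_digest(sign(key, message), tag)


if __name__ == "__main__":
    key = secrets.token_bytes(32)  # shared secret, assumed to be exchanged securely
    tag = sign(key, b"sensor reading: 21.5")
    print(verify(key, b"sensor reading: 21.5", tag))  # unmodified message
    print(verify(key, b"sensor reading: 99.9", tag))  # tampered message
```

A receiver holding the same key can detect any modification of the message, because changing the message without the key invalidates the tag.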

BSc/MSc Thesis

Our research group offers various interesting topics for a BSc or MSc thesis, the latter both in Computer Science and Scientific Computing. These topics are typically closely related to ongoing research projects (see our Research Page and Publications). Below, we outline the basic procedure you should follow when planning to do a thesis in our group. Please read the following carefully! You might also want to take a quick look at past topics students covered in their theses. Please also note that we currently cannot accommodate all requests for thesis advising, because we are already advising numerous MSc and BSc theses in the current semester as well as in the upcoming summer semester 2024.

Requirements

A key requirement is that you have taken some advanced courses offered by our group. This includes Data Science for Text Analytics or Complex Network Analysis (ICNA) and the more recent master-level class on Natural Language Processing with Transformers (INLPT). Students should also have some background in machine learning, ideally in combination with NLP. We also strongly recommend that prior to starting a thesis (especially a BSc thesis) in our group, you do an advanced software practical to become familiar with the data and tools we use in many of our projects. Most students do this in the semester before they officially start their thesis. Further requirements include

  • very good programming experience with Python (strongly preferred, including frameworks like pandas and numpy)
  • solid background in statistics and linear algebra
  • (optionally) experience with machine learning frameworks such as PyTorch
  • (optionally) experience with NLP frameworks such as spaCy, gensim, LangChain
  • (optionally) experience with OpenSearch or Elasticsearch
  • knowledge of tools such as GitHub and Docker

It is also advantageous if you have taken some graduate courses in the areas of efficient algorithms (e.g., IEA1 ) and in particular machine learning (e.g., IML , IFML or IAI ). Being familiar with frameworks like scikit-learn , Keras or PyTorch is advantageous.

If you have only taken the undergraduate course introduction to databases (IDB) and none of the other above courses, it is unlikely that we can accommodate your request.

Also make sure that you are familiar with the examination regulations ("Prüfungsordnung") that apply to your program of study.

Getting in Contact

Prior to getting in contact with us you should, of course, read this page in its entirety. If you think your interests and expertise are a good fit for our group and research activities, send an email to Prof. Michael Gertz with the subject "Anfrage BSc Arbeit" or "Anfrage MSc Arbeit" and include the following information:

  • your current transcript (as PDF). You can download this from the LSF .
  • information about your field of application ("Anwendungsfach"), in particular the courses you have taken
  • your programming experience and projects you worked on
  • areas of interest based on the research conducted in our group
  • any other information you think might strengthen your request

We will then review this information and get back to you with the scheduling of an appointment in person to discuss further details.

Thesis Expose

Once we agree on a topic for your thesis, before you officially register for a thesis, we would like to get an idea of how you approach scientific research and whether you are able to do scientific writing. For this, we require that you write an expose of your planned thesis research (see, e.g., here or here). This document is about 4-6 pages long and has to include a description of

  • the context of your project and research
  • problem statement(s)
  • objectives and planned approaches
  • related work
  • milestones towards a timely completion of the thesis

Especially for the related work, it is important that you get a good overview  early on in your thesis project; of course, your advisor will give you some starting points. Most of the time, such an expose becomes an integral part of the introductory chapter of your thesis, so there is no time and effort wasted. The expose needs to be submitted to your advisor on schedule (which you arrange with your advisor), who will then discuss the expose with you and coordinate the next steps. Occasionally we also have students give a 10-15 minute presentation of their research plan in front of the members of our group in order to get further ideas, comments, suggestions, and pointers on their thesis.

Official Registration

In agreement with your advisor, after you have submitted an expose of good quality, you plan for an official start date of the thesis. For this, please fill out the  form suitable for your program of study:

  • For registering a bachelor's thesis, see here.
  • For officially registering your master's thesis, see here . 
  • Registration form for a MSc thesis in Scientific Computing (please see Mrs. Kiesel to obtain a form).

Hand in this form to Prof. Michael Gertz who will then turn in the signed form.

Thesis Research and Advising

  • Here are some hints on grammar and style we maintain locally.
  • Some easy, purely syntactic  hints  on writing good research papers (from Prof. Felix Naumann )
  • Dos and don'ts, Universität Heidelberg, Prof. Dr. Anette Frank
  • Leitfaden zur Abfassung wissenschaftlicher Arbeiten, Ruhr-Universität Bochum, Katarina Klein
  • Leitfaden zur Abfassung wissenschaftlicher Arbeiten, TU Dresden, Maria Lieber

In addition, you can find a detailed description of how to write a seminar paper using our template for seminar papers. The hints in this template might also be crucial when you are writing a thesis: [ seminar template .zip ] [ report sample pdf ] [ slides english pdf ] [ slides german pdf ]

Feel free to ask us for copies of BSc/MSc theses that students completed in our group in the past.

Thesis Template

  • Thesis template [.zip] ; see a sample PDF here .

Thesis Presentation

  • English LaTeX-Beamer template for the presentation: template [.zip] , sample PDF
  • German LaTeX-Beamer template for the presentation: template [.zip] , sample PDF

This collection of MIT Theses in DSpace contains selected theses and dissertations from all MIT departments. Please note that this is NOT a complete collection of MIT theses. To search all MIT theses, use MIT Libraries' catalog .

MIT's DSpace contains more than 58,000 theses completed at MIT dating as far back as the mid 1800's. Theses in this collection have been scanned by the MIT Libraries or submitted in electronic format by thesis authors. Since 2004 all new Masters and Ph.D. theses are scanned and added to this collection after degrees are awarded.

MIT Theses are openly available to all readers. Please share how this access affects or benefits you. Your story matters.

If you have questions about MIT theses in DSpace, email [email protected]. See also Access & Availability Questions or About MIT Theses in DSpace.

If you are a recent MIT graduate, your thesis will be added to DSpace within 3-6 months after your graduation date. Please email [email protected] with any questions.

Permissions

MIT Theses may be protected by copyright. Please refer to the MIT Libraries Permissions Policy for permission information. Note that the copyright holder for most MIT theses is identified on the title page of the thesis.

Theses by Department

  • Comparative Media Studies
  • Computation for Design and Optimization
  • Computational and Systems Biology
  • Department of Aeronautics and Astronautics
  • Department of Architecture
  • Department of Biological Engineering
  • Department of Biology
  • Department of Brain and Cognitive Sciences
  • Department of Chemical Engineering
  • Department of Chemistry
  • Department of Civil and Environmental Engineering
  • Department of Earth, Atmospheric, and Planetary Sciences
  • Department of Economics
  • Department of Electrical Engineering and Computer Sciences
  • Department of Humanities
  • Department of Linguistics and Philosophy
  • Department of Materials Science and Engineering
  • Department of Mathematics
  • Department of Mechanical Engineering
  • Department of Nuclear Science and Engineering
  • Department of Ocean Engineering
  • Department of Physics
  • Department of Political Science
  • Department of Urban Studies and Planning
  • Engineering Systems Division
  • Harvard-MIT Program of Health Sciences and Technology
  • Institute for Data, Systems, and Society
  • Media Arts & Sciences
  • Operations Research Center
  • Program in Real Estate Development
  • Program in Writing and Humanistic Studies
  • Science, Technology & Society
  • Science Writing
  • Sloan School of Management
  • Supply Chain Management
  • System Design & Management
  • Technology and Policy Program

Collections in this community

  • Doctoral theses
  • Graduate theses
  • Undergraduate theses

Recent submissions

  • L-dopa metabolism and the regulation of brain polysome aggregation
  • The North-Eastern Fishery question since 1886, a record of diplomatic relations
  • Metal complexes as models for vitamin B₆ catalysis

DS MS Thesis Defense | Dennis Hofmann | Tuesday, April 23, 2024 @ Noon, Gordon Library

DATA SCIENCE   

MS Thesis Defense  

Dennis Hofmann

Tuesday, April 23, 2024   | 12:00PM - 1:00PM

Location: Gordon Library, 303 Conference Room 

Thesis Committee:

Advisor: Elke Rundensteiner

Reader: Frank Zou 

Title: Agree to Disagree: Robust Anomaly Detection with Noisy Labels

Anomaly detection is extremely challenging due to the scarcity of reliable anomaly labels. Recent techniques thus rely on learning from generated lower-quality labels employing either clean sample selection or label refurbishment to correct the noisy labels. Both these approaches struggle for anomaly detection as a result of conflating anomalous samples with noisy labeled samples. For sample selection, the class imbalance of anomaly detection combined with the higher noise rate of anomalies (driven by their high diversity) leads selection techniques to unintentionally discard crucial anomaly samples. On the other hand, label refurbishment methods rely on anomalies having distinct properties from inliers, such as higher prediction variance. This can lead to incorrect refurbishment, especially for marginal clean samples which exhibit similar characteristics. To overcome these limitations, we introduce Unity, a new learning-from-noisy-labels approach for anomaly detection that elegantly leverages the merits of both sample selection and label refurbishment. Unity leverages two deep anomaly classifiers to collaboratively select easy samples with clean labels based on prediction agreement and marginal samples with clean labels via disagreement resolution. Instead of discarding samples that may have noisy labels, Unity introduces a feature-space-based metric called ContrastCorr to refurbish the remaining labels. The set of selected and refurbished clean samples are then combined to robustly update the anomaly classifiers in an iterative label cleaning process. Our experimental study on a rich variety of anomaly detection benchmark datasets demonstrates that Unity consistently outperforms state-of-the-art techniques for learning from noisy labels.
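
The agreement idea mentioned in the abstract can be illustrated with a small sketch. This is not the Unity implementation: the function name, the 0.5 decision threshold, and the rule "both classifiers agree with each other and with the given label" are assumptions made purely for illustration.

```python
def split_by_agreement(scores_a, scores_b, noisy_labels, threshold=0.5):
    """Partition sample indices by whether two classifiers agree with the label.

    scores_a, scores_b: anomaly scores in [0, 1] from two separate classifiers.
    noisy_labels: given labels (1 = anomaly, 0 = inlier), possibly incorrect.
    Returns (agreed, disputed) lists of indices.
    """
    agreed, disputed = [], []
    for i, (a, b, y) in enumerate(zip(scores_a, scores_b, noisy_labels)):
        pred_a = int(a >= threshold)
        pred_b = int(b >= threshold)
        if pred_a == pred_b == y:
            agreed.append(i)    # both models confirm the label: treat as likely clean
        else:
            disputed.append(i)  # left for a separate resolution or relabeling step
    return agreed, disputed
```

In this simplified view, the `agreed` indices would be used directly for training, while the `disputed` indices are the ones a label-cleaning method still has to resolve or refurbish.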


Computer Science Thesis Oral

April 22, 2024 10:00am — 12:00pm.

Location: In Person and Virtual - ET - Traffic21 Classroom, Gates Hillman 6501 and Zoom

Speaker: MARK GILLESPIE , Ph.D. Candidate, Computer Science Department, Carnegie Mellon University https://markjgillespie.com/

Evolving Intrinsic Triangulations

This thesis presents algorithms and data structures for performing robust computation on surfaces that evolve over time. Throughout scientific and geometric computing, surfaces are often modeled as triangle meshes. However, finding high-quality meshes remains a challenge because meshes play two distinct and often-conflicting roles: defining both the surface geometry and a space of functions on that surface.

One solution to this dilemma, which has proven quite powerful in recent years, is the use of intrinsic triangulations to decouple these two concerns. The key idea is that given a triangle mesh representing an input surface, one can find many alternative triangulations which encode the exact same intrinsic geometry but offer alternative function spaces to work in. This technique makes it easy to find high-quality intrinsic triangle meshes, sidestepping the tradeoffs of classical mesh construction. However, the fact that intrinsic triangulations exactly preserve the input geometry—one of the central benefits of the technique—also makes it challenging to apply to surfaces whose geometry changes over time.

In this thesis we relax the assumption of exact geometry preservation, allowing the intrinsic perspective to be applied to time-evolving surfaces. We take as examples the problems of mesh simplification and surface parameterization. In the case of mesh simplification, we provide a general-purpose data structure for intrinsic triangulations which share only the topological class of the input surface, but may feature different geometry. In the case of surface parameterization, we build more efficient data structures and algorithms for the special case where the geometry changes conformally, using a connection between discrete conformal maps and hyperbolic geometry. In both cases, we find that the intrinsic perspective leads to simple algorithms which are still robust and efficient on a variety of examples.
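
To give a feel for what "intrinsic" means here, the sketch below works with edge lengths only and never touches vertex positions. It checks the intrinsic Delaunay condition for an edge shared by two triangles, a commonly used mesh-quality criterion. Choosing this criterion as the example is an assumption, since the abstract does not name a specific one, and the function names are hypothetical.

```python
import math


def opposite_angle(a: float, b: float, c: float) -> float:
    """Angle opposite side c in a triangle with side lengths a, b, c (law of cosines)."""
    cos_c = (a * a + b * b - c * c) / (2.0 * a * b)
    return math.acos(max(-1.0, min(1.0, cos_c)))  # clamp against rounding error


def is_intrinsic_delaunay(e, a1, b1, a2, b2, tol=1e-12):
    """Check an edge of length e shared by triangles (e, a1, b1) and (e, a2, b2).

    The edge satisfies the Delaunay condition when the two angles opposite it
    sum to at most pi. Only edge lengths are needed.
    """
    return opposite_angle(a1, b1, e) + opposite_angle(a2, b2, e) <= math.pi + tol
```

For two equilateral triangles sharing an edge, the opposite angles sum to 120 degrees, so the check passes. For two thin triangles sharing their long edge, the sum exceeds 180 degrees, and an edge flip would be the usual remedy.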

Thesis Committee: Keenan Crane (Chair) James McCann Ioannis Gkioulekas Boris Springborn (Technische Universität Berlin)

In Person and Zoom Participation.  See announcement.


COMMENTS

  1. How to write a great data science thesis

    They will stress the importance of structure, substance and style. They will urge you to write down your methodology and results first, then progress to the literature review, introduction and conclusions and to write the summary or abstract last. To write clearly and directly with the reader's expectations always in mind.

  2. 10 Best Research and Thesis Topic Ideas for Data Science in 2022

    In this article, we have listed 10 such research and thesis topic ideas to take up as data science projects in 2022. Handling practical video analytics in a distributed cloud: With increased dependency on the internet, sharing videos has become a mode of data and information exchange. The role of the implementation of the Internet of Things ...

  3. Thesis/Capstone for Master's in Data Science

    Data Science; Capstone and Thesis Overview; Capstone and Thesis Overview. Capstone and thesis are similar in that they both represent a culminating, scholarly effort of high quality. Both should clearly state a problem or issue to be addressed. Both will allow students to complete a larger project and produce a product or publication that can ...

  4. Research Topics & Ideas: Data Science

    If you're just starting out exploring data science-related topics for your dissertation, thesis or research project, you've come to the right place. In this post, we'll help kickstart your research by providing a hearty list of data science and analytics-related research ideas, including examples from recent studies.. PS - This is just the start…

  5. Five Tips For Writing A Great Data Science Thesis

    Although educational programs, conventions and thesis requirements vary wildly, I hope to offer some common guidelines for any student currently working on a Data Science thesis. The article offers five guidance points, but may effectively be summarized in a single line: "Write for your reader, not for yourself."

  6. Computational and Data Sciences (PhD) Dissertations

    Computational and Data Sciences (PhD) Dissertations. Below is a selection of dissertations from the Doctor of Philosophy in Computational and Data Sciences program in Schmid College that have been included in Chapman University Digital Commons. Additional dissertations from years prior to 2019 are available through the Leatherby Libraries ...

  7. Thesis Option

    Data Science master's students can choose to satisfy the research experience requirement by selecting the thesis option. Students will spend the majority of their second year working on a substantial data science project that culminates in the submission and oral defense of a master's thesis. While all thesis projects must be related to data science, students are given leeway in finding a ...

  8. Ten Research Challenge Areas in Data Science

    Abstract. To drive progress in the field of data science, we propose 10 challenge areas for the research community to pursue. Since data science is broad, with methods drawing from computer science, statistics, and other disciplines, and with applications appearing in all sectors, these challenge areas speak to the breadth of issues spanning ...

  9. Computing & Info Sciences: Data Science (Thesis) (Master of Science)

    The MS-CIS, Data Science concentration (Thesis), requires a total of 30 graduate credit hours, of which 24 credit hours must be earned through coursework. The student must enroll in the graduate Thesis course for at least two semesters, which would require the student to conduct an in-depth study of a research problem leading to the composition ...

  10. 17 Compelling Machine Learning Ph.D. Dissertations

    This dissertation revisits and makes progress on some old but challenging problems concerning least squares estimation, the work-horse of supervised machine learning. Two major problems are addressed: (i) least squares estimation with heavy-tailed errors, and (ii) least squares estimation in non-Donsker classes.

  11. Data Science Masters Theses // Arch : Northwestern University

    Data Science Masters Theses. The Master of Science in Data Science program requires the successful completion of 12 courses to obtain a degree. These requirements cover six core courses, a leadership or project management course, two required courses corresponding to a declared specialization, two electives, and a capstone project or thesis.

  12. PDF Undergraduate Fundamentals of Machine Learning

    knowledge from data. This 'knowledge' may a ord us some sort of summarization, visualization, grouping, or even predictive power over data sets. With all that said, it's important to emphasize the limitations of machine learning. It is not nor will it ever be a replacement for critical thought and methodical, procedural work in data science.

  13. data science Latest Research Papers

    Assessing the effects of fuel energy consumption, foreign direct investment and GDP on CO2 emission: New data science evidence from Europe & Central Asia. Fuel . 10.1016/j.fuel.2021.123098 . 2022 . Vol 314 . pp. 123098. Author (s): Muhammad Mohsin . Sobia Naseem .

  14. Instructions for MSc Thesis

    For a Data Science thesis, this part typically describes the method for the analysis. Chapter 5: Results. This chapter describes the results obtained when the methods of Chapter 4 are used on data. For a Computer Science thesis, this part typically describes the performance of the developed algorithm(s) on various synthetic and real datasets.

  15. MSc in Data Science, Project Guide

    RTDS+ (120-point thesis option) Contact Introduction. The project is an essential component of the Masters course. It is a substantial piece of full-time independent research in some area of data science. You will carry out your project under the individual supervision of a member of CDT staff.

  16. 37 Research Topics In Data Science To Stay On Top Of » EML

    22.) Cybersecurity. Cybersecurity is a relatively new research topic in data science and in general, but it's already garnering a lot of attention from businesses and organizations. After all, with the increasing number of cyber attacks in recent years, it's clear that we need to find better ways to protect our data.

  17. Bachelor and Master Theses

    Jiahui Li: Styled Text Summarization via Domain-specific Paraphrasing , Master Thesis Scientific Computing, July 2023. Sophia Matthis: Multi-Aspect Exploration of Plenary Protocols, Master Thesis, June 2023. Till Rostalski: A Generic Patient Similarity Framework for Clinical Data Analysis, Bachelor Thesis, June 2023.

  18. Data Science

    A comparative study on Unsupervised Deep Learning Methods for X-Ray Image denoising with Multi-Image Self2Self and Single Frequency Denoising. Author: Sözen, Ç., 14 Oct 2022. Supervisor: Tavakol, M. (Supervisor 1), Zhaorui, Y. (External person) (External coach) & Vilanova, A. (Supervisor 2) Student thesis: Master.

  19. MS Thesis Archive

    Elliott Barinberg, M.S. Data Science. Within this thesis work, the applications of data collection, machine learning, and data visualization were used on National Hockey League (NHL) shot data collected between the 2014-2015 season and the 2022-2023 season. Modeling sports data to better understand player evaluation has always been a goal of ...

  20. Cybersecurity data science: an overview from machine learning

    In a computing context, cybersecurity is undergoing massive shifts in technology and its operations in recent days, and data science is driving the change. Extracting security incident patterns or insights from cybersecurity data and building corresponding data-driven model, is the key to make a security system automated and intelligent. To understand and analyze the actual phenomena with data ...

  21. Best Big Data Science Research Topics for Masters and PhD

    Data science thesis topics. We have compiled a list of data science research topics for students studying data science that can be utilized in data science projects in 2022. our team of professional data experts have brought together master or MBA thesis topics in data science that cater to core areas driving the field of data science and big ...
