Role of Artificial Intelligence in the Internet of Things (IoT) cybersecurity

  • Open access
  • Published: 24 February 2021
  • Volume 1 , article number  7 , ( 2021 )

  • Murat Kuzlu 1 ,
  • Corinne Fair 2 &
  • Ozgur Guler 3  

In recent years, the use of the Internet of Things (IoT) has increased exponentially, and cybersecurity concerns have increased along with it. On the cutting edge of cybersecurity is Artificial Intelligence (AI), which is used for the development of complex algorithms to protect networks and systems, including IoT systems. However, cyber-attackers have figured out how to exploit AI and have even begun to use adversarial AI in order to carry out cybersecurity attacks. This review paper compiles information from several other surveys and research papers regarding IoT, AI, and attacks with and against AI and explores the relationship between these three topics with the purpose of comprehensively presenting and summarizing relevant literature in these fields.

1 Introduction

Since around 2008, when the Internet of Things (IoT) was born [ 1 ], its growth has been booming, and now IoT is a part of daily life and has a place in many homes and businesses. IoT is hard to define as it has been evolving and changing since its conception, but it can be best understood as a network of digital and analog machines and computing devices provided with unique identifiers (UIDs) that have the ability to exchange data without human intervention [ 2 ]. In most cases, this manifests as a human interfacing with a central hub device or application, often a mobile app, that then goes on to send data and instructions to one or multiple fringe IoT devices [ 3 ]. The fringe devices are able to complete functions if required and send data back to the hub device or application, which the human can then view.

The IoT concept has given the world a higher level of accessibility, integrity, availability, scalability, confidentiality, and interoperability in terms of device connectivity [ 4 ]. However, IoT systems are vulnerable to cyberattacks due to a combination of their multiple attack surfaces and their newness, and thus their lack of security standardizations and requirements [ 5 ]. There is a large variety of cyberattacks that attackers can leverage against IoT systems, depending on what aspect of the system they are targeting and what they hope to gain from the attack. As such, there is a large volume of research into cybersecurity surrounding IoT. This includes Artificial Intelligence (AI) approaches to protecting IoT systems from attackers, usually in terms of detecting unusual behavior that may indicate an attack is occurring [ 6 ]. However, in the case of IoT, cyber-attackers always have the upper hand, as they only need to find one vulnerability while cybersecurity experts must protect multiple targets. This has led to increased use of AI by cyber-attackers as well, in order to thwart the complicated algorithms that detect anomalous activity and pass by unnoticed [ 7 ]. AI has received much attention with the growth of IoT technologies. With this growth, AI technologies such as decision trees, linear regression, machine learning, support vector machines, and neural networks have been used in IoT cybersecurity applications to identify threats and potential attacks.

Authors in [ 8 ] provide a comprehensive review of the security risks related to IoT applications and possible counteractions, and compare IoT technologies in terms of integrity, anonymity, confidentiality, privacy, access control, authentication, authorization, resilience, and self-organization. The authors of [ 9 ] propose deep learning models using the CICIDS2017 dataset for DDoS attack detection in IoT cybersecurity, which provide high accuracy, i.e., 97.16%. In [ 10 ], the authors evaluate Artificial Neural Networks (ANNs) in a gateway device to detect anomalies in the data sent from the edge devices. The results show that the proposed approach can improve the security of IoT systems. The authors in [ 11 ] propose an AI-based control approach for detection and estimation as well as compensation of cyber attacks in industrial IoT systems. In [ 12 ], the authors provide a robust pervasive detection for IoT environments and develop a variety of adversarial attacks and defense mechanisms against them, as well as validate their approach through datasets including MNIST, CIFAR-10, and SVHN. In [ 13 ], the authors analyze the recent evolution of AI decision-making in cyber physical systems and find that such evolution is virtually autonomous due to the increasing integration of IoT devices in cyber physical systems, and that the value of AI decision-making, due to its speed and efficiency in handling large loads of data, is likely to make this evolution inevitable. The authors of [ 14 ] discuss new approaches to risk analytics using AI and machine learning, particularly in IoT networks present in industry settings. Finally, [ 15 ] discusses methods of capturing and assessing cybersecurity risks to IoT devices for the purpose of standardizing such practices so that risk in IoT systems may be more efficiently identified and protected against.

This review paper covers a variety of topics regarding cybersecurity, the Internet of Things (IoT), Artificial Intelligence (AI), and how they all relate to each other in three survey-style sections. It provides a comprehensive review of cyberattacks against IoT devices and recommends AI-based methods of protecting against these attacks. The ultimate goal of this paper is to create a resource for others who are researching these prevalent topics by presenting summaries of and making connections between relevant works covering different aspects of these subjects.

2 Methods of attacking IoT devices

Due to the lax security in many IoT devices, cyberattackers have found many ways to attack IoT devices from many different attack surfaces. Attack surfaces include the IoT device itself (both its hardware and software), the network to which the IoT device is connected, and the application with which the device interfaces; these are the three most commonly used attack surfaces, as together they make up the main parts of an IoT system. Figure 1 illustrates a basic breakdown of a common IoT system; most of the attacks discussed in this paper occur at the network gateway and/or cloud data server connections, as these connections are generally where IoT security is most lacking.

figure 1

A high-level breakdown of typical IoT structure

2.1 Initial reconnaissance

Before IoT attackers even attempt cyberattacks on an IoT device, they will often study the device to identify vulnerabilities. This is often done by buying a copy of the IoT device they are targeting from the market. They then reverse engineer the device to create a test attack to see what outputs can be obtained and what avenues exist to attack the device. Examples of this include opening up the device and analyzing the internal hardware—such as the flash memory—in order to learn about the software, and tampering with the microcontroller to identify sensitive information or cause unintended behavior [ 16 ]. In order to counter reverse engineering, it is important for IoT devices to have hardware-based security. The application processor, which consists of sensors, actuators, power supply, and connectivity, should be placed in a tamper-resistant environment [ 16 ]. Device authentication can also be done with hardware-based security, such that the device can prove to the server it is connected to that it is not fake.

2.2 Physical attacks

An often low-tech category of attacks is physical attacks, in which the hardware of the target device is used to the benefit of the attacker in some way. There are several different types of physical attacks. These include outage attacks, where the network that the devices are connected to is shut off to disrupt their functions; physical damage, where devices or their components are damaged to prevent proper functionality; malicious code injection, an example of which is an attacker plugging a USB device containing a virus into the target device; and object jamming, in which signal jammers are used to block or manipulate the signals put out by the devices [ 17 ]. Permanent denial of service (PDoS) attacks, which are discussed later in this paper, can be carried out as a physical attack; if an IoT device is connected to a high voltage power source, for example, its power system may become overloaded and would then require replacement [ 18 ].

2.3 Man-in-the-Middle

One of the most popular attacks on IoT systems is the Man-in-the-Middle (MITM) attack. With regard to computers in general, an MITM attack intercepts communication between two nodes and allows the attacker to take the role of a proxy. Attackers can perform MITM attacks between many different connections such as a computer and a router, two cell phones, and, most commonly, a server and a client. Figure 2 shows a basic example of an MITM attack between a client and a server. With regard to IoT, the attacker usually performs MITM attacks between an IoT device and the application with which it interfaces. IoT devices, in particular, tend to be more vulnerable to MITM attacks as they lack the standard implementations to fight the attacks. There are two common modes of MITM attacks: cloud polling and direct connection. In cloud polling, the smart home device is in constant communication with the cloud, usually to look for firmware updates. Attackers can redirect network traffic using Address Resolution Protocol (ARP) poisoning or by altering Domain Name System (DNS) settings, or intercept HTTPS traffic by using self-signed certificates or tools such as Secure Sockets Layer (SSL) strip [ 19 ]. Many IoT devices do not verify the authenticity or the trust level of certificates, making the self-signed certificate method particularly effective. In the case of direct connections, devices communicate with a hub or application in the same network. In this mode, mobile apps can locate new devices by probing every IP address on the local network for a specific port. An attacker can do the same thing to discover devices on the network [ 19 ]. An example of an MITM IoT attack is that of a smart refrigerator that could display the user’s Google calendar. It seems like a harmless feature, but attackers found that the system did not validate SSL certificates, which allowed them to perform an MITM attack and steal the user’s Google credentials [ 19 ].

figure 2

A simple representation of a Man-in-the-Middle attack
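The certificate-validation weakness described above can be illustrated with a minimal sketch using Python's standard ssl module. The two helper names are hypothetical and chosen only for illustration; the first builds a client context that rejects self-signed or mismatched certificates, while the second reproduces the anti-pattern of a device that skips validation.

```python
import ssl


def make_verifying_context() -> ssl.SSLContext:
    """Client-side TLS context that rejects untrusted certificates."""
    # create_default_context() loads the system CA store and enables
    # certificate and hostname verification by default.
    context = ssl.create_default_context()
    # Stated explicitly so a later change cannot silently disable them.
    context.check_hostname = True
    context.verify_mode = ssl.CERT_REQUIRED
    return context


def make_insecure_context() -> ssl.SSLContext:
    """Anti-pattern: roughly what a device that skips validation does."""
    context = ssl.create_default_context()
    # check_hostname must be switched off before CERT_NONE is allowed.
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    return context
```

A client that wraps its sockets with the first context fails the handshake when presented with a self-signed certificate, whereas the second context accepts any certificate, which is the condition an MITM attacker relies on.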

2.3.1 Bluetooth Man-in-the-Middle

A common form of MITM attack leveraged against IoT devices is via Bluetooth connection. Many IoT devices run Bluetooth Low Energy (BLE), which is designed with IoT devices in mind to be smaller, cheaper, and more power-efficient [ 20 ]. However, BLE is vulnerable to MITM attacks. BLE uses AES-CCM encryption; AES encryption is considered secure, but the way that the encryption keys are exchanged is often insecure. The level of security relies on the pairing method used to exchange temporary keys between the devices. BLE specifically uses three-phase pairing processes: first, the initiating device sends a pairing request, and the devices exchange pairing capabilities over an insecure channel; second, the devices exchange temporary keys and verify that they are using the same temporary key, which is then used to generate a short-term key (some newer devices use a long-term key exchanged using Elliptic Curve Diffie-Hellman public-key cryptography, which is significantly more secure than the standard BLE protocol); third, the created key is exchanged over a secure connection and can be used to encrypt data [ 20 ]. Figure  3 represents this three-phase pairing process.

figure 3

A diagram illustrating the basic BLE pairing process

The temporary key is determined according to the pairing method, which is determined on the OS level of the device. There are three common pairing methods popular with IoT devices. One, called Just Works, always sets the temporary key to 0, which is obviously very insecure. However, it remains one of if not the most popular pairing methods used with BLE devices [ 20 ]. The second, Passkey, uses six-digit number combinations, which the user must manually enter into a device, which is fairly secure, though there are methods of bypassing this [ 20 ]. Finally, the Out-of-Band pairing method exchanges temporary keys using methods such as Near Field Communication. The security level of this method is determined by the security capabilities of the exchange method. If the exchange channel is protected from MITM attacks, the BLE connection can also be considered protected. Unfortunately, the Out-of-Band method is not yet common in IoT devices [ 20 ]. Another important feature of BLE devices is the Generic Attribute Profile (GATT), which is used to communicate between devices using a standardized data schema. The GATT describes devices’ roles, general behaviors, and other metadata. Any BLE-supported app within the range of an IoT device can read its GATT schema, which provides the app with necessary information [ 20 ]. In order for attackers to perform MITM attacks in BLE networks, the attacker must use two connected BLE devices himself: one device acting as the IoT device to connect to the target mobile app, and a fake mobile app to connect to the target IoT device. Some other tools for BLE MITM attacks exist, such as GATTacker, a Node.js package that scans and copies BLE signals and then runs a cloned version of the IoT device, and BtleJuice, which allows MITM attacks on Bluetooth Smart devices which have improved security over BLE [ 20 ].
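To make the difference between the Just Works and Passkey methods concrete, the short calculation below compares the size of the temporary-key space for each. It is a sketch under the assumption that an attacker guesses uniformly at random; the dictionary and function names are illustrative only.

```python
import math

# Number of possible temporary-key values per pairing method,
# following the description above.
TEMP_KEY_SPACE = {
    "just_works": 1,      # temporary key is always 0
    "passkey": 10 ** 6,   # six decimal digits: 000000 to 999999
}


def entropy_bits(key_space: int) -> float:
    """Bits of uncertainty an eavesdropper faces for a uniform key."""
    return math.log2(key_space)


def expected_guesses(key_space: int) -> float:
    """Average number of guesses needed to hit a uniformly random key."""
    return (key_space + 1) / 2


for method, space in TEMP_KEY_SPACE.items():
    print(method, round(entropy_bits(space), 2), expected_guesses(space))
```

Just Works offers zero bits of uncertainty, while a six-digit passkey offers just under 20 bits, which is small by cryptographic standards and is consistent with the remark that methods of bypassing it exist.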

2.3.2 False data injection attacks

Once an attacker has access to some or all of the devices on an IoT network via an MITM attack, one example of an attack they could carry out next is a False Data Injection (FDI) attack. In an FDI attack, an attacker alters measurements from IoT sensors by a small amount so as to avoid suspicion and then outputs the faulty data [ 21 ]. FDI attacks can be perpetrated in a number of ways, but doing so via MITM attacks is the most practical. FDI attacks are often leveraged against sensors that send data to an algorithm that attempts to make predictions based on the data it has received or otherwise uses data to draw conclusions. These algorithms, sometimes referred to as predictive maintenance systems, are commonly used in monitoring the state of a mechanical machine and predicting when it will need to be maintained or tuned [ 21 ]. Predictive maintenance algorithms and similar systems would also be a staple feature of smart cities, where FDI attacks could be disastrous. An example of an FDI attack on a predictive maintenance system involves sensors on an airplane engine that predict when the engine will need critical maintenance. When attackers are able to access even a small portion of the sensors, they are able to create a small amount of noise that goes undetected by faulty data detection mechanisms but is just enough to skew the algorithm’s predictions [ 21 ]. In testing, this noise was even enough to delay critical maintenance to the system, which could cause catastrophic failure while in use, resulting in a costly unplanned delay or loss of life.
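The effect of a small injected bias on a predictive maintenance estimate can be sketched as follows. This is a toy model and not the system studied in [ 21 ]: the linear wear trend, the failure threshold, the per-cycle plausibility check, and all function names are assumptions made for illustration.

```python
FAILURE_THRESHOLD = 100.0  # hypothetical wear level at which maintenance is due
MAX_STEP = 1.5             # hypothetical plausibility limit on per-cycle change


def looks_plausible(readings):
    """Naive faulty-data check: reject jumps larger than MAX_STEP."""
    return all(abs(b - a) <= MAX_STEP for a, b in zip(readings, readings[1:]))


def cycles_until_maintenance(readings):
    """Extrapolate the average wear rate linearly to the failure threshold."""
    slope = (readings[-1] - readings[0]) / (len(readings) - 1)
    return (FAILURE_THRESHOLD - readings[-1]) / slope


def inject_bias(readings, drift_per_cycle):
    """Attacker under-reports wear by a slowly growing amount."""
    return [r - drift_per_cycle * i for i, r in enumerate(readings)]


true_readings = [50.0 + 1.0 * i for i in range(11)]  # wear rises 1.0 per cycle
forged_readings = inject_bias(true_readings, 0.2)

print(cycles_until_maintenance(true_readings))    # honest estimate
print(cycles_until_maintenance(forged_readings))  # later, delayed estimate
```

Each forged step differs from the true step by only 0.2, so the plausibility check passes, yet the estimated time to maintenance is pushed noticeably later than the honest estimate.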

2.4 Botnets

Another kind of common attack on IoT devices is recruiting many devices to create botnets and launch Distributed Denial of Service (DDoS) attacks. A denial of service (DoS) attack is characterized by an orchestrated effort to prevent legitimate use of a service; a DDoS attack uses attacks from multiple entities to achieve this goal. DDoS attacks aim to overwhelm the infrastructure of the target service and disrupt normal data flow. DDoS attacks generally go through a few phases: recruitment, in which the attacker scans for vulnerable machines to be used in the DDoS attack against the target; exploitation and infection, in which the vulnerable machines are exploited, and malicious code is injected; communication, in which the attacker assesses the infected machines, sees which are online and decides when to schedule attacks or upgrade the machines; and attack, in which the attacker commands the infected machines to send malicious packets to the target [ 22 ]. One of the most popular ways to gain infected machines and conduct DDoS attacks is through IoT devices due to their high availability and generally poor security and maintenance. Figure  4 shows a common command structure, in which the attacker’s master computer sends commands to one or more infected command and control centers, who each control a series of zombie devices that can then attack the target.

figure 4

A graphical representation of a common botnet hierarchy

One of the most famous pieces of malware, the Mirai worm, has been used to perpetrate some of the largest DDoS attacks ever known and is designed to infect and control IoT devices such as DVRs, CCTV cameras, and home routers. The infected devices become part of a large-scale botnet and can perpetrate several types of DDoS attacks. Mirai was built to handle multiple CPU architectures that are popular in IoT devices, such as x86, ARM, Sparc, PowerPC, Motorola, etc., in order to capture as many devices as possible [ 23 ]. In order to be covert, the virus is quite small and does not actually reside on the device’s hard disk. It stays in memory, which means that once the device is rebooted, the virus is lost. However, devices that have been infected once are susceptible to reinfection due to having already been discovered as being vulnerable, and reinfection can take as little as a few minutes [ 23 ]. Today, many well-known IoT-targeting botnet viruses are derived from Mirai’s source code, including Okiru, Satori, and Reaper [ 23 ].

2.5 Denial of service attacks

IoT devices may often carry out DoS attacks, but they themselves are susceptible to them as well. IoT devices are particularly susceptible to permanent denial of service (PDoS) attacks that render a device or system completely inoperable. This can be done by overloading the battery or power systems or, more popularly, firmware attacks. In a firmware attack, the attacker may use vulnerabilities to replace a device’s basic software (usually its operating system) with a corrupted or defective version of the software, rendering it useless [ 18 ]. This process, when done legitimately, is known as flashing, and its illegitimate counterpart is known as “phlashing”. When a device is phlashed, the owner of the device has no choice but to flash the device with a clean copy of the OS and any content that might’ve been put on the device. In a particularly powerful attack, the corrupted software could overwork the hardware of the device such that recovery is impossible without replacing parts of the device [ 18 ]. The attacks to the device’s power system, though less popular, are possibly even more devastating. One example of this type of attack is a USB device with malware loaded on it that, when plugged into a computer, overuses the device’s power to the point that the hardware of the device is rendered completely ruined and needs to be replaced [ 18 ].

One example of PDoS malware is known as BrickerBot. BrickerBot uses brute force dictionary attacks to gain access to IoT devices and, once logged in to the device, runs a series of commands that result in permanent damage to the device. These commands include misconfiguring the device’s storage and kernel parameters, hindering internet connection, sabotaging device performance, and wiping all files on the device [ 24 ]. This attack is devastating enough that it often requires reinstallation of hardware or complete replacement of the device. Even if the hardware survives the attack, the software will not and would need reflashing, which would erase everything that might have been on it. Interestingly enough, BrickerBot was designed to target the same devices that the Mirai botnet targets and would employ as bots, and it uses the same or a similar dictionary to make its brute force attacks. As it turns out, BrickerBot was actually intended to render useless those devices that Mirai would have been able to recruit, in an effort to fight back against the botnet [ 24 ].

Due to the structure of IoT systems, there are multiple attack surfaces, but the most popular way of attacking IoT systems is through their connections as these tend to be the weakest links. In the future, it is advisable that IoT developers ensure that their products have strong protections against such attacks, and the introduction of IoT security standards would prevent users from unknowingly purchasing products that are insecure. Alternatively, keeping the network that the IoT system resides on secure will help prevent many popular attacks, and keeping the system largely separated from other critical systems or having backup measures will help mitigate the damage done should an attack be carried out.

3 Artificial Intelligence in cybersecurity

In order to dynamically protect systems from cyber threats, many cybersecurity experts are turning to Artificial Intelligence (AI). In cybersecurity, AI is most commonly used for intrusion detection by analyzing traffic patterns and looking for activity that is characteristic of an attack.

3.1 Machine learning

There are two main kinds of machine learning: supervised and unsupervised learning. In supervised learning, humans manually label training data as malicious or legitimate and then input that data into the algorithm to create a model with “classes” of data against which it compares the traffic it is analyzing. Unsupervised learning forgoes labeled training data; instead, the algorithm groups together similar pieces of data into classes and then classifies them according to the data coherence within one class and the data modularity between classes [ 25 ]. One popular machine learning algorithm for cybersecurity is naïve Bayes, which seeks to classify data based on Bayes’ theorem, wherein anomalous activities are all assumed to originate from independent events instead of one attack. Naïve Bayes is a supervised learning algorithm; once it is trained and has generated its classes, it analyzes each activity to determine the probability that it is anomalous [ 25 ]. Machine learning algorithms can also be used to create the other models discussed in this section.
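The naïve Bayes idea described above can be sketched in a few lines of plain Python. This is a minimal categorical variant with Laplace smoothing; the class name, the two traffic features, and the tiny labelled data set are all hypothetical and serve only to show how independent per-feature probabilities are combined.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Categorical naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, samples, labels):
        self.class_counts = Counter(labels)
        self.feature_counts = defaultdict(Counter)  # (label, position) -> value counts
        self.values = defaultdict(set)              # position -> values seen in training
        for features, label in zip(samples, labels):
            for pos, value in enumerate(features):
                self.feature_counts[(label, pos)][value] += 1
                self.values[pos].add(value)
        return self

    def log_posteriors(self, features):
        total = sum(self.class_counts.values())
        scores = {}
        for label, count in self.class_counts.items():
            score = math.log(count / total)  # class prior
            for pos, value in enumerate(features):
                seen = self.feature_counts[(label, pos)][value]
                # Each feature contributes independently (the "naive" assumption).
                score += math.log((seen + 1) / (count + len(self.values[pos])))
            scores[label] = score
        return scores

    def predict(self, features):
        scores = self.log_posteriors(features)
        return max(scores, key=scores.get)


# Hypothetical labelled traffic: (login failures, destination port)
SAMPLES = [
    ("many_failures", "unusual_port"),
    ("many_failures", "unusual_port"),
    ("many_failures", "common_port"),
    ("few_failures", "common_port"),
    ("few_failures", "common_port"),
    ("few_failures", "unusual_port"),
]
LABELS = ["attack", "attack", "attack", "normal", "normal", "normal"]

model = NaiveBayes().fit(SAMPLES, LABELS)
print(model.predict(("many_failures", "unusual_port")))
```

Working in log space avoids numerical underflow when many features are multiplied together, and the smoothing term keeps a feature value that was never seen with a class from forcing that class's probability to zero.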

3.2 Decision trees

A decision tree is a type of AI that creates a set of rules based on its training data samples. It uses iterative division to find a description (often simply “attack” or “normal”) that best categorizes the traffic it is analyzing. An example of this approach in cybersecurity is detecting DoS attacks by analyzing the flow rate, size, and duration of traffic. For example, if the flow rate is low, but the duration of the traffic is long, it is likely to be an attack and will, therefore, be classified as such [ 25 ]. Decision trees can also be used to detect command injection attacks in robotic vehicles by categorizing values from CPU consumption, network flow, and volume of data written [ 25 ] as shown in Fig.  5 . This technique is popular as it is intuitive in that what the AI does and doesn’t consider anomalous traffic is known to the developer. Additionally, once an effective series of rules is found, the AI can analyze traffic in real-time, providing an almost immediate alert if unusual activity is detected.

figure 5

An example of a decision tree for classifying network traffic
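The DoS example above (low flow rate combined with long duration) can be written as a hand-built tree of rules. This is only a sketch: the thresholds and field names are hypothetical, and a trained decision tree would learn its split points from labelled traffic rather than have them hard-coded.

```python
from dataclasses import dataclass


@dataclass
class Flow:
    rate_pps: float    # packets per second
    duration_s: float  # seconds the flow has been open


# Hypothetical split points; a learned tree would derive these from data.
LOW_RATE_PPS = 10.0
LONG_DURATION_S = 300.0


def classify(flow: Flow) -> str:
    """Walk the tree from the root split down to a leaf label."""
    if flow.rate_pps < LOW_RATE_PPS:
        if flow.duration_s > LONG_DURATION_S:
            return "attack"  # slow but long-lived traffic
        return "normal"
    return "normal"


print(classify(Flow(rate_pps=2.0, duration_s=900.0)))
print(classify(Flow(rate_pps=500.0, duration_s=900.0)))
```

Because every decision is an explicit comparison, a developer can read off exactly why a flow was flagged, which is the intuitiveness the text attributes to this technique.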

Another approach to decision trees is the Rule-Learning technique, which searches for a set of attack characteristics in each iteration while optimizing some score that denotes the quality of the classification (e.g., minimizing the number of incorrectly classified data samples) [ 25 ]. The main difference between traditional decision trees and rule-learning techniques is that traditional decision trees look for characteristics that will lead to a classification, whereas the rule-learning technique finds a complete set of rules that can describe a class. This can be an advantage, as it can factor in human advice when generating rules, which creates an optimized set of rules [ 25 ].

3.3 K-nearest neighbors

The k-nearest neighbor (k-NN) technique learns from data samples to create classes. Put simply, it analyzes the Euclidean distance between a new piece of data and already classified pieces of data to decide which class the new piece should be put in [ 25 ]. For example, as shown in Fig. 6, when k, the number of nearest neighbors, equals three (3), the new piece of data would be classified into class two (2), but when k equals nine (9), it would be classified into class one (1). The k-NN technique is attractive for intrusion detection systems as it can quickly learn from new traffic patterns to notice previously unseen, even zero-day, attacks. Cybersecurity experts are also researching applications of k-NN for real-time detection of cyberattacks [ 25 ]. The technique has been employed to detect attacks such as false data injection attacks and performs well when data can be represented through a model that allows the measurement of their distance to other data, e.g., through a Gaussian distribution or a vector.

figure 6

How k-NN technique can classify a data point differently given different k values
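The dependence on k can be reproduced with a small sketch. The nine labelled points below are invented so that the three nearest neighbours of the query favour class 2 while all nine favour class 1; they are not taken from Fig. 6.

```python
import math
from collections import Counter


def knn_classify(query, labelled_points, k):
    """Majority vote among the k points closest to the query (Euclidean)."""
    nearest = sorted(labelled_points, key=lambda item: math.dist(query, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical labelled data: ((x, y), class)
POINTS = [
    ((1.0, 0.0), 2), ((0.0, 1.0), 2), ((-2.5, 0.0), 2),
    ((1.0, 1.0), 1), ((2.0, 0.0), 1), ((0.0, 2.0), 1),
    ((-2.0, 0.0), 1), ((0.0, -2.0), 1), ((2.0, 2.0), 1),
]

QUERY = (0.0, 0.0)
print(knn_classify(QUERY, POINTS, k=3))
print(knn_classify(QUERY, POINTS, k=9))
```

Choosing an odd k for a two-class problem avoids tied votes, and in practice k is usually tuned on held-out data rather than fixed in advance.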

3.4 Support vector machines

Support vector machines (SVMs) are an extension of linear regression models that locate a plane that separates data into two classes [ 25 ]. This plane can be linear, non-linear, polynomial, Gaussian, sigmoid, etc., depending on the function used in the algorithm. SVMs can also separate data into more than two classes by using more than one plane. In cybersecurity, this technique is used to analyze Internet traffic patterns and separate them into their component classes such as HTTP, FTP, SMTP, and so on [ 25 ]. As SVM is a supervised machine learning technique, it is often used in applications where attacks can be simulated, such as using network traffic generated from penetration testing as training data.
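A linear separating plane of the kind described above can be found with a minimal sketch that applies subgradient descent to the regularized hinge loss. This is one simple training approach, not necessarily the one used in [ 25 ]; the two-dimensional points, the labels (+1 and -1), and the hyperparameters are hypothetical.

```python
def train_linear_svm(points, labels, epochs=200, lr=0.01, reg=0.01):
    """Fit w and b so that sign(w . x + b) separates the two classes."""
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(points, labels):  # y is +1 or -1
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:
                # Inside the margin or misclassified: move the plane toward x.
                w = [wi - lr * (reg * wi - y * xi) for wi, xi in zip(w, x)]
                b += lr * y
            else:
                # Correct with enough margin: only apply the regularizer.
                w = [wi - lr * reg * wi for wi in w]
    return w, b


def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1


# Hypothetical, already normalized traffic features for two classes.
POINTS = [(2.0, 2.0), (3.0, 3.0), (2.0, 3.0), (-2.0, -2.0), (-3.0, -3.0), (-2.0, -3.0)]
LABELS = [1, 1, 1, -1, -1, -1]

w, b = train_linear_svm(POINTS, LABELS)
print(predict(w, b, (2.5, 2.5)), predict(w, b, (-2.5, -2.5)))
```

Separating more than two classes, as mentioned above, is commonly done by training one such plane per class or per pair of classes and combining their outputs.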

3.5 Artificial neural networks

Artificial neural networks (ANNs) are a technique derived from the way that neurons interact with each other in the brain in order to pass and interpret information. In ANNs, a neuron is a mathematical equation that reads data and outputs a target value, which is then passed along to the next neuron based on its value. The ANN algorithm then iterates until the output value is acceptably close to the target value, which allows the neurons to learn and correct their weights by measuring the error between the expected value and the previous output value. Once this process is finished, the algorithm presents a mathematical equation that outputs a value that can be used to classify the data [ 25 ].
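The iterate-and-correct loop described above can be sketched with a single sigmoid neuron. This is a deliberately reduced illustration, since a practical ANN stacks many such neurons in layers; the training samples, targets, learning rate, and function names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def activation(weights, bias, x):
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def train_neuron(samples, targets, epochs=2000, lr=0.5):
    """Repeatedly nudge weights by the error between target and output."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, target in zip(samples, targets):
            error = target - activation(weights, bias, x)
            # Gradient step for a sigmoid unit under cross-entropy loss.
            weights = [w + lr * error * xi for w, xi in zip(weights, x)]
            bias += lr * error
    return weights, bias


def predict(weights, bias, x):
    return 1 if activation(weights, bias, x) >= 0.5 else 0


# Hypothetical indicators: flag traffic if either indicator is present.
SAMPLES = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
TARGETS = [0, 1, 1, 1]

weights, bias = train_neuron(SAMPLES, TARGETS)
print([predict(weights, bias, x) for x in SAMPLES])
```

The same update can be re-run whenever newly labelled traffic arrives, which is the adaptability discussed in the paragraph that follows.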

A large benefit of ANNs is that they are able to adjust their mathematical models when presented with new information, whereas other mathematical models may become obsolete as new types of traffic and attacks become common [ 25 ]. This also means that ANNs are adept at catching previously unseen and zero-day attacks as they take new information into heavier consideration than static mathematical models can. Because of this, ANNs make solid intrusion detection systems and have performed well with attacks such as DoS [ 25 ].

At present, using AI in cybersecurity is a small but rapidly growing field. It is also expensive and resource intensive, so using AI to protect a small system may not be feasible. However, businesses that have large networks may benefit from these solutions, especially if they are considering or have already introduced IoT devices into their network. AI cybersecurity would also be beneficial in the massive systems one would find in a smart city, and the AI would be able to give the very quick response times that are important in systems like traffic management. In the future, AI cybersecurity could also be integrated into smaller systems such as self-driving cars or smart homes. Additionally, many AI cybersecurity measures detect or thwart attacks in progress rather than preventing attacks in the first place, meaning that other preventative security measures should also be in place.

4 AI to attack IoT

Not all AI is used for the purposes of cybersecurity; cybercriminals have begun using malicious AI to aid attacks, often to thwart the intrusion detection algorithms in the case of IoT, or attacking beneficial AI in such a way that the AI works against its own system.

4.1 Automation of vulnerability detection

Machine learning can be used to discover vulnerabilities in a system. While this can be useful for those trying to secure a system to intelligently search for vulnerabilities that need to be patched, attackers also use this technology to locate and exploit vulnerabilities in their target system. As technology soars in usage, especially technologies with low-security standards such as IoT devices, the number of vulnerabilities that attackers are able to exploit has soared as well, including zero-day vulnerabilities. In order to identify vulnerabilities quickly, attackers often use AI to discover vulnerabilities and exploit them much more quickly than developers can fix them. Developers are able to use these detection tools as well, but it should be noted that developers are at a disadvantage when it comes to securing a system or device; they must find and correct every single vulnerability that could potentially exist, while attackers need only find one, making automatic detection a valuable tool for attackers.

4.1.1 Fuzzing

Fuzzing, at its core, is a testing method that generates random inputs (e.g., numbers, chars, metadata, binary, and especially “known-to-be-dangerous” values such as zero, negative or very large numbers, SQL requests, and special characters) in an attempt to cause the target software to crash [ 26 ]. It can be divided into dumb fuzzing and smart fuzzing. Dumb fuzzing simply generates defects by randomly changing the input variables; this is very fast, as changing the input variable is simple, but it is not very good at finding defects because code coverage is narrow [ 26 ]. Smart fuzzing, on the other hand, generates input values suitable for the target software based on the software’s format and error generation. This software analysis is a big advantage for smart fuzzing as it allows the fuzzing algorithm to know where errors can occur; however, developing an efficient smart fuzzing algorithm takes expert knowledge and tuning [ 26 ].
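Dumb fuzzing can be illustrated against a toy target written for this purpose. The parser below and its deliberate missing bounds check are hypothetical; the fuzzer simply throws seeded random byte strings at it and records every input that raises an unexpected exception.

```python
import random


def parse_record(data: bytes) -> bytes:
    """Toy target: the first byte declares a length, the rest is payload."""
    if not data:
        raise ValueError("empty input")  # expected, clean rejection
    length = data[0]
    payload = data[1:]
    # Deliberate bug: the declared length is trusted without a bounds check.
    trailer = payload[length]  # IndexError when length >= len(payload)
    return payload[:length] + bytes([trailer])


def dumb_fuzz(target, rounds=1000, max_len=8, seed=0):
    """Feed random byte strings to target and collect unexpected crashes."""
    rng = random.Random(seed)  # seeded so that runs are repeatable
    crashes = []
    for _ in range(rounds):
        size = rng.randrange(max_len + 1)
        data = bytes(rng.randrange(256) for _ in range(size))
        try:
            target(data)
        except ValueError:
            pass  # input was rejected cleanly, not a defect
        except Exception as exc:
            crashes.append((data, type(exc).__name__))
    return crashes


crashes = dumb_fuzz(parse_record)
print(len(crashes), crashes[0])
```

A smart fuzzer would instead use knowledge of the record format, for example by keeping the payload well-formed and varying only the length byte around the boundary, to reach deeper code paths with fewer inputs.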

4.1.2 Symbolic execution

Symbolic execution is a technique similar to fuzzing that searches for vulnerabilities by setting input variables to a symbol instead of a real value [ 26 ]. This technique is often split into offline and online symbolic execution. Offline symbolic execution chooses only one path to explore at a time to create new input variables by resolving the path predicate [ 26 ]. This means that each time one wishes to explore a new path, the algorithm must be run from the beginning, which is a disadvantage because code re-execution incurs a large amount of overhead. Online symbolic execution replicates states and generates path predicates at every branch statement [ 26 ]. This method does not incur much overhead, but it does require a large amount of storage to hold all the status information and simultaneous processing of all the states it creates, leading to significant resource consumption.
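The online variant can be sketched on a toy program represented as a tree of branches. Everything here is a simplification made for illustration: the program representation is hypothetical, and the brute-force search over a small integer range stands in for the constraint solver a real symbolic execution engine would use.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple, Union

Predicate = Callable[[int], bool]


@dataclass
class Leaf:
    name: str


@dataclass
class Branch:
    condition: Predicate
    if_true: "Node"
    if_false: "Node"


Node = Union[Leaf, Branch]
Path = List[Tuple[Predicate, bool]]


def explore(program: Node) -> List[Tuple[str, Path]]:
    """Online style: fork the state at every branch, keeping all states."""
    finished = []
    worklist = [(program, [])]
    while worklist:
        node, path = worklist.pop()
        if isinstance(node, Leaf):
            finished.append((node.name, path))
        else:
            # Both outcomes are kept alive, each with its own path predicate.
            worklist.append((node.if_true, path + [(node.condition, True)]))
            worklist.append((node.if_false, path + [(node.condition, False)]))
    return finished


def solve(path: Path, domain=range(-1000, 1001)) -> Optional[int]:
    """Brute-force stand-in for a solver: first value satisfying the path."""
    for candidate in domain:
        if all(cond(candidate) == expected for cond, expected in path):
            return candidate
    return None


# Toy program: if x > 100: (if x % 7 == 0: crash else ok_large) else ok_small
PROGRAM = Branch(
    lambda x: x > 100,
    Branch(lambda x: x % 7 == 0, Leaf("crash"), Leaf("ok_large")),
    Leaf("ok_small"),
)

inputs = {name: solve(path) for name, path in explore(PROGRAM)}
print(inputs)
```

The worklist holds every forked state at once, which mirrors the storage cost of the online approach; an offline variant would instead follow a single path per run and restart from the beginning for each new path.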

4.2 Input attacks

When an attacker alters the input of an AI system in a way that causes the AI to malfunction or give an incorrect output, it is known as an input attack. Input attacks are carried out by adding an attack pattern to the input, which can be anything from putting tape on a physical stop sign to confuse self-driving cars to adding small amounts of noise to an image that are imperceptible to the human eye but will confuse an AI [ 27 ]. Notably, neither the algorithm nor the security of the AI needs to be compromised to carry out an input attack; the attacker only has to alter the input whose output they want to corrupt. In the case of tape on a stop sign, the attacker may not need to use technology at all. More sophisticated attacks, however, are completely hidden from the human eye: the attacker alters a tiny part of the image in a very precise manner designed to misdirect the algorithm. Input attacks are often categorized by where they fall on two axes: perceivability and format.

The perceivability of an input attack is a measure of how noticeable the attack is to humans, while the format is a measure of how digital versus physical the attack is [ 27 ]. On one end of the perceivability axis are perceivable attacks. Altering a target, such as by deforming it, removing part of it, or changing its colors, and adding to a target, such as by affixing physical tape or adding digital marks, are types of perceivable attacks [ 27 ]. Although humans can perceive these attacks, they may not notice slight changes like tape on a stop sign or may not consider them important. A human driver still sees a stop sign with tape or scratches as a stop sign, even though a self-driving car may not. This contributes to the effectiveness of perceivable attacks, allowing them, in many cases, to hide in plain sight. Conversely, imperceivable attacks are invisible to the human eye. Examples include “digital dust,” a small amount of noise added to the entire image that is not visible to the human eye but is significant enough to change an AI’s output, and an imperceptible pattern on a 3D printed object that can be picked up by AI [ 27 ]. Imperceivable attacks can also be made through audio, such as by playing sound at frequencies outside the range of human hearing that a microphone would still pick up [ 27 ]. Imperceivable attacks are generally more of a security risk, as there is almost no chance that a human would notice the attack before the AI algorithm outputs an incorrect response.

The format of an attack is usually either digital or physical; few attacks combine both [ 27 ]. In many physical attacks, the attack pattern must be obvious rather than imperceivable, because physical objects must be digitized to be processed and may lose some finer detail in that process [ 27 ]. Some attacks are still difficult to perceive even with this loss of detail, however, as in the case of 3D printed objects with a pattern that blends into the structure of the object such that it is imperceptible to humans [ 27 ]. The opposite of physical attacks are digital attacks, which target digital inputs such as images, videos, audio recordings, and files. As these inputs are already digitized, no detail is lost, so attackers can craft very exact attacks that are more imperceptible to the human eye than physical attacks [ 27 ]. Digital attacks are not necessarily imperceptible, however: photoshopping glasses with a strange pattern onto a celebrity, for example, may cause the AI to identify the image as a different person, but still a person nonetheless. Input attacks are a particular concern for IoT smart cars and, more broadly, smart cities. As mentioned earlier, simply placing pieces of tape in a specific way on a stop sign is enough for an algorithm to fail to recognize the stop sign or even to classify it as a green light. This is harmful for passengers if the car does not heed the stop sign, and at a larger scale it could disrupt traffic pattern detectors in smart cities. Additionally, noise-based input attacks could cause smart assistants to malfunction and carry out unintended commands.
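
A digital, imperceivable input attack can be sketched on a toy linear classifier. The example below is a simplified illustration, not the method of [ 27 ]: the 64-pixel “image”, the weights, the bias, and the class names are all hypothetical, and the perturbation simply moves each pixel by a small step against the sign of its weight, in the spirit of sign-of-gradient attacks.

```python
def score(weights, bias, pixels):
    """Linear classifier score; a positive score means 'stop sign'."""
    return sum(w * p for w, p in zip(weights, pixels)) + bias


def classify(weights, bias, pixels):
    return "stop sign" if score(weights, bias, pixels) > 0 else "other"


def perturb(weights, pixels, epsilon):
    """Shift each pixel by epsilon in the direction that lowers the score.

    Pixels whose weight is zero are left unchanged.
    """
    def sign(value):
        return (value > 0) - (value < 0)

    return [p - epsilon * sign(w) for w, p in zip(weights, pixels)]


# Hypothetical model and input: 64 mid-grey pixels with intensities in [0, 1].
WEIGHTS = [0.5, -0.25, 0.75, -0.5] * 16
BIAS = -3.5
clean = [0.5] * 64

# Each pixel changes by only 1/32 (about 3% of the intensity range).
adversarial = perturb(WEIGHTS, clean, epsilon=1 / 32)

if __name__ == "__main__":
    print("clean image:      ", classify(WEIGHTS, BIAS, clean))
    print("adversarial image:", classify(WEIGHTS, BIAS, adversarial))
```

Because the small per-pixel changes all push the score in the same direction, their combined effect is large enough to cross the decision boundary even though no single pixel changes noticeably.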

4.3 Data poisoning/false data injection

Data poisoning attacks and input attacks are very similar, but while the goal of an input attack is simply to alter the output for the affected input, the goal of data poisoning is to alter inputs over a long enough period of time that the AI analyzing the data shifts and becomes inherently flawed. Because of this, data poisoning is usually carried out while the AI is still being trained, before it is deployed [ 27 ]. In many cases, the AI learns to fail on specific inputs that the attacker chooses; for example, if a military uses AI to detect aircraft, an enemy military may poison the AI so that it does not recognize certain types of aircraft, such as drones [ 27 ]. Data poisoning can also be used against AIs that continuously learn from and analyze data in order to make and adjust predictions, such as predictive maintenance systems. There are three main methods attackers can use to poison an AI.

4.3.1 Dataset poisoning

Poisoning the dataset of an AI is perhaps the most direct method of data poisoning: because AIs gain all of their knowledge from the training datasets they are provided, any flaws within those datasets will flaw the AI’s knowledge. A basic example is shown in Fig.  7 : a significant portion of the data in the second dataset is corrupted, leading the resultant machine learning model to be flawed. Dataset poisoning is done by including incorrect or mislabeled information in the target dataset [ 27 ]. As AIs learn by recognizing patterns in datasets, poisoned datasets break those patterns or introduce new, incorrect ones, causing the AI to misidentify inputs [ 27 ]. Many datasets are very large, so finding poisoned data within them can be difficult. Continuing the example of traffic patterns, an attacker could change dataset labels so that the AI no longer recognizes stop signs, or add data and labels that cause the AI to classify a red light as a green light.

Fig. 7 A visual representation of dataset poisoning
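
The effect of mislabeled training data can be reproduced with a very small sketch. The nearest-centroid classifier, the one-dimensional feature, and all data values below are assumptions chosen for illustration; they are not the model or data behind Fig. 7.

```python
from collections import defaultdict


def train(dataset):
    """Learn one centroid (mean feature value) per label from (feature, label) pairs."""
    groups = defaultdict(list)
    for feature, label in dataset:
        groups[label].append(feature)
    return {label: sum(values) / len(values) for label, values in groups.items()}


def predict(model, feature):
    """Return the label whose centroid lies closest to the feature."""
    return min(model, key=lambda label: abs(model[label] - feature))


# Hypothetical 1-D colour feature: low values look red, high values look green.
clean_data = [
    (1.0, "red"), (2.0, "red"), (3.0, "red"),
    (8.0, "green"), (9.0, "green"), (10.0, "green"),
]

# The attacker adds red-looking samples that are deliberately labeled "green".
poison = [(1.0, "green"), (2.0, "green"), (3.0, "green"), (4.0, "green")]

clean_model = train(clean_data)
poisoned_model = train(clean_data + poison)

if __name__ == "__main__":
    print("clean model says:   ", predict(clean_model, 4.0))
    print("poisoned model says:", predict(poisoned_model, 4.0))
```

The mislabeled samples drag the “green” centroid toward the red region, so an input that the clean model classifies as red is classified as green by the poisoned model.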

4.3.2 Algorithm poisoning

Algorithm poisoning attacks take advantage of weaknesses in the learning algorithm of the AI. This method of attack is very prominent in federated learning, a method of training machine learning models while protecting the privacy of individuals’ data. Rather than collecting potentially sensitive data from users and combining it into one dataset, federated learning trains small models directly on users’ devices and then combines these models to form the final model. The users’ data never leaves their devices and so is more secure; however, an attacker who is one of the participating users is free to manipulate their own data in order to poison the model [ 27 ]. The poisoned local model, when combined with the rest of the local models, has the potential to poison the final model. In this manner, an attacker could degrade the model or even install a backdoor.

One example of federated learning is Google’s Gboard, which used federated learning to learn about text patterns in order to train predictive keyboards [ 28 ]. Google has extensive data vetting measures, but in a less careful approach, users could type nonsensical messages to confuse the predictive text or, more sinisterly, inject code into the algorithm to give themselves a backdoor. Similarly, some cutting-edge IoT devices are beginning to employ federated learning in order to learn from each other. One example is using machine learning to predict changes in air pressure as air flows through gradually clogging filters, allowing an IoT sensor to predict when a filter will need to be changed [ 29 ]. With just a few filters, this learning process would take long enough to make the study infeasible, but federated learning speeds it up significantly. However, users could easily manipulate the process with their own filters in order to poison the algorithm. Although this is a relatively innocent example of algorithm poisoning, as federated learning becomes more common in IoT, so will its potentially harmful applications.
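
How a single participant can skew a federated model is sketched below. This is a simplified illustration under stated assumptions: each local model is reduced to a two-element weight vector, the server combines them with a plain unweighted average, and all numbers are hypothetical. It does not describe the aggregation used by Gboard [ 28 ] or by the filter study [ 29 ].

```python
def federated_average(client_weights):
    """Combine local models by averaging their weight vectors element by element."""
    count = len(client_weights)
    return [sum(column) / count for column in zip(*client_weights)]


# Local models trained on three honest devices (hypothetical weights).
honest_clients = [[1.0, 2.0], [1.5, 2.5], [0.5, 1.5]]

# A fourth participant submits an exaggerated update to drag the average away.
malicious_client = [-9.0, 22.0]

clean_model = federated_average(honest_clients)
poisoned_model = federated_average(honest_clients + [malicious_client])

if __name__ == "__main__":
    print("without attacker:", clean_model)
    print("with attacker:   ", poisoned_model)
```

Because a plain average gives every participant equal influence and places no bound on the size of an update, one extreme contribution is enough to move the combined model far from what the honest devices agreed on.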

4.3.3 Model poisoning

Finally, some attackers simply replace a legitimate model with a poisoned model prepared ahead of time; all the attacker has to do is get into the system that stores the model and replace the file [ 27 ]. Alternatively, the equations and data within the trained model file could be altered. This method is potentially dangerous because, even if a trained model is double-checked and its data is verified to be free of poisoning, the attacker can still alter the model at various points in its distribution, such as while the model is still in the company’s network awaiting placement on an IoT device, or on an individual IoT device once it has been distributed [ 27 ].
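
One possible safeguard against a swapped or edited model file is to compare a cryptographic digest of the file against a trusted value before loading it. The source does not prescribe this countermeasure; the sketch below is an assumption added for illustration, and the model contents and function names are hypothetical. It also assumes the trusted digest is stored or delivered separately from the model, since an attacker who can replace both gains nothing from the check.

```python
import hashlib
import hmac


def fingerprint(model_bytes: bytes) -> str:
    """Return the SHA-256 digest of a serialized model as a hex string."""
    return hashlib.sha256(model_bytes).hexdigest()


def is_untampered(model_bytes: bytes, trusted_digest: str) -> bool:
    """Check a model against the digest recorded when it was released."""
    # compare_digest avoids leaking, through timing, where the strings differ.
    return hmac.compare_digest(fingerprint(model_bytes), trusted_digest)


# Hypothetical serialized models: the released one and an attacker's replacement.
released_model = b"weights:0.12,0.98;bias:-0.40"
replaced_model = b"weights:9.99,0.98;bias:-0.40"

# Recorded at release time and kept out of the attacker's reach.
trusted_digest = fingerprint(released_model)

if __name__ == "__main__":
    print("released model accepted:", is_untampered(released_model, trusted_digest))
    print("replaced model accepted:", is_untampered(replaced_model, trusted_digest))
```

A check of this kind detects changes made after release, but it cannot detect a model that was already poisoned before the digest was recorded.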

Many of the attacks described above can be mitigated or prevented by properly sanitizing inputs and checking for unusual data. However, some attacks are subtle and can escape the notice of humans and even other AI, especially when the attacks are created by malevolent AI systems. These attacks, and how to defend against them effectively, are at the forefront of current research as their popularity grows. At present, however, many attacks do not use AI for the same reason that many security systems do not: AI is resource intensive, and a good algorithm requires high-level knowledge to build, making it inaccessible and infeasible for many attackers.
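
As a minimal sketch of what sanitizing inputs and checking for unusual data could look like for a stream of sensor readings, the example below first applies a plausibility range and then discards values far from the median. The thresholds, the temperature readings, and the use of the median absolute deviation are assumptions made for illustration, not a method taken from the surveyed literature.

```python
import statistics


def sanitize(readings, low, high, max_deviations=3.0):
    """Drop readings outside [low, high], then drop statistical outliers.

    A reading is kept if it lies within `max_deviations` median absolute
    deviations of the median. If that deviation is zero, only readings equal
    to the median are kept.
    """
    in_range = [r for r in readings if low <= r <= high]
    if not in_range:
        return []
    median = statistics.median(in_range)
    spread = statistics.median([abs(r - median) for r in in_range])
    tolerance = max_deviations * spread
    return [r for r in in_range if abs(r - median) <= tolerance]


# Hypothetical temperature readings in degrees Celsius; 500.0 is physically
# implausible for the assumed sensor and 35.0 is unusual next to its neighbours.
readings = [20.0, 21.0, 19.0, 20.0, 500.0, 20.0, 35.0, 21.0]

if __name__ == "__main__":
    print(sanitize(readings, low=-40.0, high=125.0))
```

A filter like this catches crude manipulation, but, as noted above, subtle attacks that stay within the normal range of the data would pass through it.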

5 Summary of attacks and their defenses

The various attacks discussed in this paper are listed in Table  1 , and are paired with one or more ways of protecting an IoT system from the attack. While comprehensively protecting an IoT system can be a challenging task due to the number of attack surfaces present, many of the methods listed will defend against many types of attacks; for example, as many of the attacks listed are carried out by first conducting MITM attacks, protecting the network on which an IoT system resides will protect the system from many common attacks.

6 Conclusion

Because IoT systems have many attack surfaces, a wide variety of attacks against them exists, and more are being discovered as IoT grows in popularity. It is necessary to protect systems against these attacks as effectively as possible. As the number and speed of attacks grow, experts are turning to AI as a means of protecting these systems intelligently and in real time. Of course, attackers find ways to thwart these AI and may even use AI to attack systems. This paper explores popular techniques for disrupting or compromising IoT systems and explains, at a surface level, how these attacks are carried out. Where applicable, examples are provided to clarify these explanations. Next, several AI algorithms are introduced, and their applications in cybersecurity are investigated. In many cases, these models are not yet common in commercial applications; they are still undergoing research and development or remain difficult to implement, and are thus rare. Nonetheless, the models discussed are promising and may become common attack detection systems within just a couple of years. Methods of attacking AI and of using AI to attack are also discussed in the context of IoT systems. As IoT systems grow, these types of attacks will become more and more of a threat, especially as massive networks such as smart cities begin experimentation: massive networks have a multitude of attack surfaces and are harder to protect, and daily life and safety will revolve around AI that needs to be more or less failure-proof. This is followed by a chart reiterating the threats covered in this paper, paired with common or recommended methods of protecting against each attack. Having covered all these topics, this paper aims to provide a useful tool with which researchers and cybersecurity professionals may study IoT in the context of cybersecurity and AI in order to secure IoT systems.
It also aims to emphasize the implications of up-and-coming technology and the impacts that each of these fields will have on the others. It is important to consider all the potential consequences of a technological development both before and after it is made public, as cyberattackers are constantly looking to use new technologies to their benefit, whether by diverting a technology from its original purpose or by using it as a tool to perpetrate other attacks. As an example of this, this paper discusses how IoT and AI have been taken advantage of for criminal purposes or have had their weaknesses exploited. This will help readers understand current risks and cultivate an understanding such that these weaknesses are accounted for in the future in order to prevent cyberattacks.

References

Evans D. The Internet of Things: how the next evolution of the internet is changing everything. Cisco Internet Business Solutions Group: Cisco; 2011.

Rouse M. What is IoT (Internet of Things) and how does it work? IoT Agenda, TechTarget. http://www.internetofthingsagenda.techtarget.com/definition/Internet-of-Things-IoT . Accessed 11 Feb 2020.

Linthicum D. App nirvana: when the internet of things meets the API economy. https://techbeacon.com/app-dev-testing/app-nirvana-when-internet-things-meets-api-economy . Accessed 15 Nov 2019.

Lu Y, Xu LD. Internet of Things (IoT) cybersecurity research: a review of current research topics. IEEE Internet Things J. 2019;6(2):2103–15.

Vorakulpipat C, Rattanalerdnusorn E, Thaenkaew P, Hai HD. Recent challenges, trends, and concerns related to IoT security: an evolutionary study. In: 2018 20th international conference on advanced communication technology (ICACT), Chuncheon-si Gangwon-do, Korea (South); 2018. p. 405–10.

Lakhani A. The role of artificial intelligence in IoT and OT security. https://www.csoonline.com/article/3317836/the-role-of-artificial-intelligence-in-iot-and-ot-security.html . Accessed 11 Feb 2020.

Pendse A. Transforming cybersecurity with AI and ML: view. https://ciso.economictimes.indiatimes.com/news/transforming-cybersecurity-with-ai-and-ml/67899197 . Accessed 12 Feb 2020.

Meneghello F, Calore M, Zucchetto D, Polese M, Zanella A. IoT: internet of threats? A survey of practical security vulnerabilities in real IoT devices. IEEE Internet Things J. 2019;6(5):8182–201.

Roopak M, Yun Tian G, Chambers J. Deep learning models for cyber security in IoT networks. In: 2019 IEEE 9th annual computing and communication workshop and conference (CCWC), Las Vegas, NV, USA; 2019. p. 0452–7.

Cañedo J, Skjellum A. Using machine learning to secure IoT systems. In: 2016 14th annual conference on privacy, security and trust (PST), Auckland; 2016. p. 219–22, https://doi.org/10.1109/PST.2016.7906930 .

Farivar F, Haghighi MS, Jolfaei A, Alazab M. Artificial intelligence for detection, estimation, and compensation of malicious attacks in nonlinear cyber-physical systems and industrial IoT. IEEE Trans Ind Inf. 2020;16(4):2716–25. https://doi.org/10.1109/TII.2019.2956474 .

Wang S, Qiao Z. Robust pervasive detection for adversarial samples of artificial intelligence in IoT environments. IEEE Access. 2019;7:88693–704. https://doi.org/10.1109/ACCESS.2019.2919695 .

Radanliev P, De Roure D, Van Kleek M, Santos O, Ani U. Artificial intelligence in cyber physical systems. AI & society. 2020; p. 1–14.

Radanliev P, De Roure D, Page K, Nurse JR, Mantilla Montalvo R, Santos O, Maddox LT, Burnap P. Cyber risk at the edge: current and future trends on cyber risk analytics and artificial intelligence in the industrial internet of things and industry 4.0 supply chains. Cybersecurity. 2020;3:1–21.

Radanliev P, De Roure DC, Nurse JR, Montalvo RM, Cannady S, Santos O, Burnap P, Maple C. Future developments in standardisation of cyber risk in the Internet of Things (IoT). SN Appl Sci. 2020;2(2):169.

Woo S. The right security for IoT: physical attacks and how to counter them. In: Minj VP, editor. Profit From IoT. http://www.iot.electronicsforu.com/headlines/the-right-security-for-iot-physical-attacks-and-how-to-counter-them/ . Accessed 13 June 2019.

Akram H, Dimitri K, Mohammed M. A comprehensive IoT attacks survey based on a building-blocked reference model. Int J Adv Comput Sci Appl. 2018. https://doi.org/10.14569/IJACSA.2018.090349 .

Herberger C. DDoS fire & forget: PDoS—a permanent denial of service. Radware Blog, Radware Ltd. http://www.blog.radware.com/security/2015/10/ddos-fire-forget-pdos-a-permanent-denial-of-service/ . Accessed 12 Sept 2016.

Cekerevac Z, Dvorak Z, Prigoda L, Čekerevac P. Internet of things and the man-in-the-middle attacks–security and economic risks. Mest J. 2017;5:15–25. https://doi.org/10.12709/mest.05.05.02.03 .

Melamed T. An active man-in-the-middle attack on bluetooth smart devices. WIT Press, International Journal of Safety and Security Engineering. http://www.witpress.com/elibrary/sse-volumes/8/2/2120 . Accessed 1 Feb 2018.

Mode G, Calyam P, Hoque K. False data injection attacks in Internet of Things and deep learning enabled predictive analytics; 2019.

De Donno M, Dragoni N, Giaretta A, Spognardi A. Analysis of DDoS-capable IoT malwares. In: 2017 federated conference on computer science and information systems (FedCSIS), Prague; 2017. p. 807–16. https://doi.org/10.15439/2017F288 .

Mirai Botnet DDoS Attack. Corero, Corero. http://www.corero.com/resource-hub/mirai-botnet-ddos-attack/ . Accessed 9 Dec 2019.

BrickerBot Malware emerges, permanently bricks iot devices. Trend Micro, Trend Micro Incorporated. http://www.trendmicro.com/vinfo/us/security/news/internet-of-things/brickerbot-malware-permanently-bricks-iot-devices . Accessed 19 Apr 2017.

Zeadally S, Adi E, Baig Z, Khan IA. Harnessing artificial intelligence capabilities to improve cybersecurity. IEEE Access. 2020;8:23817–37.

Jurn J, Kim T, Kim H. An automated vulnerability detection and remediation method for software security. Sustainability. 2018;10:1652. https://doi.org/10.3390/su10051652 .

Comiter M. Attacking artificial intelligence. Belfer Center for Science and International Affairs, Belfer Center for Science and International Affairs. http://www.belfercenter.org/sites/default/files/2019-08/AttackingAI/AttackingAI.pdf . Accessed 25 Aug 2019.

McMahan B, Daniel R. Federated learning: collaborative machine learning without centralized training data. Google AI Blog, Google. http://www.ai.googleblog.com/2017/04/federated-learning-collaborative.html . Accessed 6 Apr 2017.

Rojek M. Federated learning for IoT. Medium, becoming human: artificial intelligence magazine. http://www.becominghuman.ai/theres-a-better-way-of-doing-ai-in-The-iot-era-feabbbc1b589 . Accessed 16 Apr 2019.

Porter E. What is a botnet? And how to protect yourself in 2020. SafetyDetectives, Safety Detectives. http://www.safetydetectives.com/blog/what-is-a-botnet-and-how-to-protect-yourself-in/#review-2 . Accessed 28 Dec 2019.

Hendrickson J. What is the mirai botnet, and how can i protect my devices? How to geek, LifeSavvy media. http://www.howtogeek.com/408036/what-is-the-mirai-botnet-and-how-can-i-protect-my-devices/ . Accessed 22 Mar 2019.

Understanding denial of service attacks. Cybersecurity and infrastructure security agency CISA. http://www.us-cert.cisa.gov/ncas/tips/ST04-015 . Accessed 20 Nov 2019.

Moisejevs I. Poisoning attacks on machine learning. Towards data science, medium. http://www.towardsdatascience.com/poisoning-attacks-on-machine-learning-1ff247c254db . Accessed 15 July 2019.

Fang M et al. Local model poisoning attacks to Byzantine-Robust federated learning. In: Usenix security symposium. arXiv:1911.11815 . Accessed 6 Apr 2020.

Acknowledgements

This work was supported in part by the Commonwealth Cyber Initiative, an investment in the advancement of cyber R&D, innovation and workforce development in Virginia, USA. For more information about CCI, visit cyberinitiative.org.

Author information

Authors and affiliations.

Batten College of Engineering and Technology, Old Dominion University, Norfolk, VA, USA

Murat Kuzlu

Computer Science, Christopher Newport University, Newport News, VA, USA

Corinne Fair

eKare, Inc, Fairfax, VA, USA

Ozgur Guler

Contributions

MK and CF conceived and designed the work as well as contributed to the acquisition, analysis, and interpretation of data. All authors discussed the results and wrote the final manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Murat Kuzlu .

Ethics declarations

Competing interests.

The authors declare that they have no competing interests.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .

About this article

Kuzlu, M., Fair, C. & Guler, O. Role of Artificial Intelligence in the Internet of Things (IoT) cybersecurity. Discov Internet Things 1 , 7 (2021). https://doi.org/10.1007/s43926-020-00001-4

Received : 29 September 2020

Accepted : 30 November 2020

Published : 24 February 2021

DOI : https://doi.org/10.1007/s43926-020-00001-4

  • Artificial Intelligence
  • Internet of Things (IoT)
  • Cybersecurity

PeerJ Comput Sci. PMCID: PMC10280223

Machine learning and deep learning approaches in IoT

1 Department of Computer Science, University of Engineering and Technology, Lahore, Punjab, Pakistan

Muhammad Awais

Muhammad Shoaib, Khaldoon S. Khurshid, Mahmoud Othman

2 Computer Science Department, Future University in Egypt, New Cairo, Egypt

Associated Data

The following information was supplied regarding data availability:

There was no raw data in our literature review.

The internet is a booming sector for exchanging information because of all the gadgets in today’s world. Attacks on Internet of Things (IoT) devices are alarming as these devices evolve. The two primary areas of the IoT that should be secure in terms of authentication, authorization, and data privacy are the IoMT (Internet of Medical Things) and the IoV (Internet of Vehicles). IoMT and IoV devices monitor real-time healthcare and traffic trends to protect an individual’s life. With the proliferation of these devices comes a rise in security assaults and threats, necessitating the deployment of an IPS (intrusion prevention system) for these systems. As a result, machine learning and deep learning technologies are utilized to identify and control security in IoMT and IoV devices. This research study aims to investigate the research fields of current IoT security research trends. Papers about the domain were searched, and the top 50 papers were selected. In addition, research objectives are specified concerning the problem, which leads to research questions. After evaluating the associated research, data is retrieved from digital archives. Furthermore, based on the findings of this SLR, a taxonomy of IoT subdomains has been given. This article also identifies the difficult areas and suggests ideas for further research in the IoT.

Introduction

The Internet of Things (IoT) is the network of objects embedded with sensors and software to exchange information with other devices over the internet ( Hassan et al., 2019 ). IoT devices ( Atzori, Iera & Morabito, 2017 ), including wearable devices, smartphones, and sensors, have been widely adopted ( Da Xu, He & Li, 2014 ) to help humans achieve their daily life goals ( Almiani et al., 2020 ). Over the last decade, IoT devices have proliferated; their number is projected to grow from eight billion to 41 billion in the next five years ( Ahmad & Alsmadi, 2021 ) and their users from five billion to 27 billion. Due to this rapid growth, subdomains of IoT are emerging, such as the Internet of Medical Things (IoMT) and the Internet of Vehicles (IoV) ( Quasim et al., 2019 ). IoMT and IoV relate directly to humans ( Sullivan, 2022 ), so they are emerging fields in IoT. Authentication, privacy, access control, confidentiality, and unauthorized access to computing devices are major challenges for IoT devices in the internet era ( Rizvi et al., 2018 ).

With the increasing number of internet devices and users ( Sharma & Liu, 2020 ), numerous security and privacy problems and challenges need to be addressed ( Sha et al., 2018 ). Weak password protection is a significant security concern: most IoT devices are used without a password or with only a simple password. Hackers can easily gain access to these devices and abuse them. These flaws expose data and allow unauthorized people to access IoT devices. The infrastructure and privacy of networked IoT devices must be protected against attacks, since these vulnerabilities can cause network capacity to be exceeded ( Ahmad & Alsmadi, 2021 ). Attacks against IoMT devices can endanger human lives, and IoV likewise requires safe and secure networks to function properly. Countermeasures must be developed to protect these lifesaving technologies against attack.

Researchers have proposed rule-based systems to overcome security issues in the IoT domain ( Feng et al., 2018 ). However, these rule-based systems do not work properly against the latest security attacks ( Makkar et al., 2021 ). Machine learning has emerged as a rising field in the network security domain. Systems that can learn from past data are a good measure to ensure security ( Bagaa et al., 2020 ). Although these systems have been used to address network security issues ( Vaiyapuri, Binbusayyis & Varadarajan, 2021 ), they require specific hardware and software.

IoT devices do not have high-end computing resources. Researchers have been actively working on lightweight software that does not require expensive, complex hardware ( Boutros-Saikali, Saikali & Abou Naoum, 2018 ). Several reviews of the recent literature have been conducted to assist practitioners in studying the core concepts of IoT security. Some of those researchers covered only machine learning-based approaches, such as  Xiao et al. (2018) , who discussed machine learning for IoT ( Cui et al., 2018 ), while other authors focused only on deep learning-based approaches ( Zikria et al., 2020 ; Tiwari et al., 2019 ). Other works on these topics are limited to only IoV ( Yang et al., 2017 ), IoMT ( Papaioannou et al., 2020 ), or IPSs ( Oke et al., 2018 ). Previous SLRs ignored security issues such as authentication, confidentiality, authorization, and privacy ( Tahsien, Karimipour & Spachos, 2020 ).

This research addresses the mentioned shortcomings of the previous systematic review to fill this gap. This study’s systematic literature review is conducted based on developed research questions.

To the best of our knowledge, no prior researchers have worked on the systematic review of IoT security issues in its subdomains, including IoMT, IoT, and IoH. Additionally, prior works are limited to traditional security issues such as intrusion detection and prevention problems ( Priyan & Devi, 2019 ). Furthermore, IoT devices do not contain powerful hardware, so processing the attack information can be challenging. Therefore, a detailed systematic review of these security attacks in resource-constrained IoT domains and their sub-fields is necessary. Additionally, this survey can help early-stage researchers and practitioners learn more about these emerging fields. In Table 1 , we have compared our work with the prior research studies along three dimensions: IoV security, IoMT, and IPS for IoT devices. Based on the systematic literature review, 50 papers that meet the basic criteria have been selected. This survey covers only papers related to IoT security, IoMT, IoT, and IPS. Papers published before 2016 are not included in this survey. Moreover, papers that do not cover the main security issues, such as authentication, authorization, and privacy, are not included. The selected papers are evaluated qualitatively and empirically in different aspects.

The novelty of this systematic literature review is that it covers all the existing literature related to IoT security in the domains of IoMT, IoV, and IPS. Moreover, the research covers the security aspects in terms of privacy, authentication, and authorization in all defined domains of IoT. According to the query string, no identified survey covers all these dimensions.

This article is organized as follows: Section II covers the existing literature survey and provides the path to the current SLR of the article. Section III presents the methodology adopted to conduct a good survey with research questions and objectives. Section IV covers the answers to these research questions, and Section V presents the taxonomy of the domain. The last section, VI, covers the conclusion of the article.

Literature Review

Bai et al. (2021) published a systematic review on the security issues in the healthcare domain. They focused on the security and provenance issues for the Internet of Medical Things. As per the authors, no prior work was done on the security issues for the IoT in the medical domain. Their work only focuses on the security issues for the healthcare domain, and they reviewed the existing security issues from 2011 to 2020. They selected sixty-nine papers from five repositories related to IoT applications in the healthcare domain.

Additionally, the present work only focused on a single dimension of IoT regarding security and provenance. They do not address the security issues, particularly device authentication, authorization, and data privacy. Moreover, this article focused on machine learning and deep learning techniques to overcome the security issues in IoMT, IoV, and IPS.

Another study, conducted by Abbasi et al. (2021), focused on the applications and challenges of the internet of vehicles. The authors categorized the services and applications of the internet of vehicles. Their work focuses only on application and service issues for the IoV domain, and they reviewed the existing literature from 2010 to 2019. The selected papers were drawn from six digital repositories related to IoT applications in vehicular networks. They also discussed some of the challenges and open issues in the domain. However, this study did not address a major component of the internet of vehicles, namely security. According to the authors, the security and accuracy of these systems are very important for their deployment, yet the study ignores this dimension. An insecure system might not work in a real-time environment, and it will always be open to new security attacks. Therefore, security issues in the internet of vehicles domain must be addressed before newer systems are deployed in real time. The article did not perform a quality assessment of the selected papers and ignored those that used machine learning and deep learning techniques to secure IoMT and IoV. Furthermore, only intrusion detection systems are discussed, not intrusion prevention systems that secure systems before an attack.

Patel, Qassim & Wills (2010) presented an intrusion prevention system covering security-related issues. The authors categorized intrusion detection and prevention systems from a security perspective. This survey deals with both intrusion detection and prevention systems, which help users cover the basic security issues. An IPS works alongside security tools such as firewalls and malware filters. The review only considers related literature from 2000 to 2010. The selected papers discussed that security issues grow as the number of Internet-connected devices increases. Internet-connected devices are affected by different security issues, such as malware intrusion, authorization, and violation of private data. As IoT devices are not mature enough to handle security, there is a need to implement an advanced intelligent security system that maintains data privacy in all aspects and avoids unknown attacks. However, the study has not mentioned the dimensions of the work. Therefore, a lightweight system should be implemented to resolve the authorization and authentication issues in IPS. The study discussed machine learning techniques but ignored proper quality assessment criteria for the selected studies.

Selected studies discuss machine learning and deep learning techniques to secure devices from attacks. The studies mentioned in Table 1 have shortcomings, as they focus on a single subdomain of IoT and present literature only on its security constraints (Cui et al., 2018). However, these security constraints vary from field to field, so an opportunity exists to synthesize the existing work into a single study and perform a comparative analysis. In this regard, the novelty of our study is that the complete SLR includes the IoMT, IoV, and IPS subdomains. According to the defined research questions, papers are selected from 2016 to 2021.

Research Methodology

Systematic literature review guidelines (Bai et al., 2021) are followed in this review. According to the research protocol, the review includes three main stages: planning, conducting, and reviewing the data. The search protocol is described after the research questions are finalized. These research questions help in searching for the related review data and in avoiding bias in the selected studies.

Review plan

Figures 1 and 2 show the methodology that defines the research process for the classification scheme, relevant publications, and publication mapping criteria. A search strategy has been implemented to find all the related data on IoT. A complete systematic approach is used to select the relevant studies without bias. In this review, a structured process has been followed that involved:

  • Research objectives
  • Research questions
  • Organizing search repositories
  • Selection studies
  • Screening results
  • Data extraction
  • Results
  • Review report finalization

To achieve the objectives of the above-defined review plan, see the RQs in Table 2 :

  • RQ1 identifies and explores the high-level databases in which literature on the security of IoT-related smart devices using machine learning and deep learning has been published. The answers can help in choosing the best venues from the highest-priority platforms.
  • RQ2 helps identify the primary studies conducted within the last five years that discuss the implementation of secure systems in IoT.
  • RQ3 deals with the basic methods to implement authorization, authentication, and privacy in IoMT, IoV, and IPS environments.
  • The objective of RQ4 is to implement machine learning and deep learning techniques that cover the security issues faced by the IoMT.

Review conduct

In this systematic literature review, we have extracted the most relevant data from the selected digital databases. Furthermore, the predefined inclusion/exclusion criteria select the papers from the repositories. Moreover, the quality assessment is added to enhance the paper selection approach. After that, the most important papers are extracted from the existing literature by implementing snowballing techniques.

Automated search in digital repositories

A systematic search is implemented to extract the related data from the available online repositories and to filter out irrelevant information. Both manual and automatic search techniques have been applied while exploring the search terms. Different digital libraries were visited during this process, and only those repositories were selected that surfaced in our search process and are commonly accessed for literature surveys. Public venues related to the SLR are selected. Google Scholar was also added as a venue because it reaches data from indirect venues. The following digital venues, which cover almost all the relevant searches, are therefore selected as the primary sources for the automatic search:

  • Google Scholar ( https://scholar.google.com/ )
  • HEC Digital Library ( http://www.digitallibrary.edu.pk/ )
  • ACM Digital Library ( http://dl.acm.org )
  • IEEE Xplore ( http://ieeexplore.ieee.org )
  • ScienceDirect ( https://www.sciencedirect.com )

Manual search is implemented to collect more related literature on IoT machine learning techniques and their related domain. The extracted information will provide a limited search of the related data, so it is specified according to the given conditions:

  • Primary keywords are selected based on the research questions
  • Secondary keywords are identified and used as additional keywords
  • The search string is developed by adding the “AND” and “OR” Boolean operators

Primary keywords are chosen as key identifiers to search the IoT data. Secondary or additional keywords are added with the primary keyword to search the related data. Boolean operators, keywords, and wildcards have been added to develop the final search query.
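The way such a search string is assembled can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the keyword lists and the function name `build_search_string` are hypothetical and are not the actual query shown in Figure 3. Synonyms within a keyword group are joined with "OR", and the primary and secondary groups are joined with "AND".

```python
def build_search_string(primary, secondary):
    """Join terms with OR inside each group, then join the groups with AND."""
    def group(terms):
        # Quote multi-word terms so a search engine treats them as phrases.
        quoted = [f'"{t}"' if " " in t else t for t in terms]
        return "(" + " OR ".join(quoted) + ")"

    return " AND ".join(group(g) for g in (primary, secondary) if g)


# Hypothetical keywords, chosen only for illustration.
primary = ["IoT", "Internet of Things"]
secondary = ["security", "authentication", "machine learning"]

query = build_search_string(primary, secondary)
print(query)
```

Grouping synonyms with "OR" widens each concept, while "AND" between the groups narrows the result set to papers that match both concepts.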

Figure 3 defines the search query, which is restrictive during the initial search process. The query is unable to search the final string data. The final string query is restrictive and returns only the related articles on Google Scholar and the other defined repositories. After the search query is applied, the related articles that fulfill the defined criteria are selected. Selecting studies with string queries is very effective compared to traditional systems. Figure 4 depicts the visualization of all possible combinations of the query string.

Selection based on Inclusion/Exclusion Criteria

Inclusion criteria.

  • The included paper must have the IoT as a central topic.
  • The paper must target the research questions.
  • Selected papers must be published in an SJR-indexed journal.
  • Conference papers must be published in the top conferences.
  • The paper explores challenges, issues, and shortcomings of IoT devices.
  • The paper must discuss the IoMT, IoV, and IPS.
  • Papers must discuss machine learning and deep learning methods to solve IoT problems.

Exclusion criteria

  • Exclude papers that are not written in the English language.
  • Exclude papers that do not discuss any RQ.
  • Exclude papers that were published before 2016.
  • Exclude duplicate papers.
  • Add the most recent version of the paper.
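The exclusion criteria above can be expressed as a simple filter. The following is a minimal sketch under assumptions: the `Paper` record, its field names, and the rule that duplicates are detected by matching titles are all hypothetical, and the inclusion criteria (indexing, venue ranking, topic) are not modelled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Paper:
    title: str
    year: int
    language: str
    addresses_rq: bool  # True if the paper discusses at least one RQ


def apply_exclusion(papers, min_year=2016):
    """Return the papers that survive the exclusion criteria."""
    kept, seen = [], set()
    # Visit newest papers first so the most recent version of a duplicate is kept.
    for p in sorted(papers, key=lambda p: p.year, reverse=True):
        key = p.title.strip().lower()
        if p.language != "English" or not p.addresses_rq:
            continue
        if p.year < min_year or key in seen:
            continue
        seen.add(key)
        kept.append(p)
    return kept


# Hypothetical records, for illustration only.
papers = [
    Paper("Deep learning IDS for IoMT", 2019, "English", True),
    Paper("Deep learning IDS for IoMT", 2020, "English", True),
    Paper("IoV routing survey", 2014, "English", True),
    Paper("Securite IoT", 2018, "French", True),
    Paper("Smart farming sensors", 2020, "English", False),
]
selected = apply_exclusion(papers)
print([(p.title, p.year) for p in selected])
```

In this toy input only the 2020 version of the duplicated title survives; the other records fail the year, language, or research-question checks.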

Selection based on quality assessment

The selection of the papers is based on the quality assessment (QA), which is the most important step in conducting any review. Quality assessment is done to enhance the quality of the paper. As the primary studies vary in design, different tools, such as the qualitative and quantitative methods of Bai et al. (2021) and Abbasi et al. (2021), are used to perform the QA in the review. The QA of our study is carried out by three authors, and each study is scored based on the defined criteria:

  1. A paper published in an Impact Factor journal is awarded 2; otherwise 1.
  2. A paper that covers more than three IoT security issues is awarded 2; a paper that discusses any one IoT security issue is awarded 1; otherwise 0.
  3. A paper that has citations is awarded 1; otherwise 0.
  4. A paper that identifies a research gap is awarded 1; a paper that defines the problem is awarded 2; otherwise 0.
  5. A paper that discusses the evaluation of the research is awarded 2; a paper that only gives results is awarded 1; otherwise 0.
  6. A paper that gives a conclusion is awarded 1; otherwise 0.

The maximum overall score across the six criteria is 10. Papers that score more than 6 are included in the final results. Table 3 shows the possible scores for the journals and conferences, graded from 0 to 4.
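The scoring rubric can be written down as a small function, which also confirms that the six criteria sum to a maximum of 10. This is a sketch under assumptions: the function and parameter names are hypothetical, a paper covering two or three security issues is treated like one covering a single issue, and where two award levels could both apply (criteria 4 and 5) the higher one is taken.

```python
def qa_score(impact_factor_journal, n_security_issues, has_citations,
             defines_problem, has_research_gap, discusses_evaluation,
             reports_results, has_conclusion):
    """Score one paper against the six quality-assessment criteria."""
    score = 2 if impact_factor_journal else 1                                      # criterion 1
    score += 2 if n_security_issues > 3 else (1 if n_security_issues >= 1 else 0)  # criterion 2
    score += 1 if has_citations else 0                                             # criterion 3
    score += 2 if defines_problem else (1 if has_research_gap else 0)              # criterion 4
    score += 2 if discusses_evaluation else (1 if reports_results else 0)          # criterion 5
    score += 1 if has_conclusion else 0                                            # criterion 6
    return score


def is_included(score, threshold=6):
    """A paper is included only if it scores strictly more than the threshold."""
    return score > threshold


# Two hypothetical papers, for illustration only.
strong = qa_score(impact_factor_journal=True, n_security_issues=4, has_citations=True,
                  defines_problem=True, has_research_gap=False,
                  discusses_evaluation=True, reports_results=False, has_conclusion=True)
weak = qa_score(impact_factor_journal=False, n_security_issues=1, has_citations=True,
                defines_problem=False, has_research_gap=True,
                discusses_evaluation=False, reports_results=True, has_conclusion=True)
print(strong, is_included(strong))
print(weak, is_included(weak))
```

The second paper lands exactly on 6 and is therefore left out, because the inclusion rule requires a score of more than 6.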

Selection based on Snowballing

After the quality assessment, snowballing is applied to the reference list of Bai et al. (2021) to finalize the extracted papers. Only papers that fulfill the inclusion/exclusion criteria are selected through snowballing. The papers are found by running the search query on the different digital libraries, as defined in Table 4. The inclusion or exclusion of a paper is decided after reading its abstract and then the rest of the paper. Figure 5 shows that a total of 50 papers are extracted by filtering.

Review report

The final selected papers are inspected thoroughly; the 50 selected papers are based on the search query and fulfill the inclusion/exclusion criteria. An overview of the selected papers is given in Table 4. Papers shorter than four pages are excluded, and the remaining papers are filtered according to the following parameters: publication since 2016, title, introduction, abstract, and conclusion. Finally, the papers are extracted as full articles. The paper count per year is shown in Fig. 5.

Figure 6 shows that most of the papers included in this review are journal papers; reports are skipped because they do not fulfill the inclusion/exclusion criteria.

Figure 7 shows that the selected papers come from different geographical areas, and most of them belong to different states of America.

Assessment and Discussion of Research Questions

The research questions are evaluated from the selected 50 papers extracted from a systematic literature review.

Which are relevant publication channels for IoT research?

IoT security is still a challenging domain in research due to the growth of IoT devices which causes security threats. There is a need to identify the proper publication tools and venues to access the relevant data to solve these security issues in the IoT domain. Moreover, this section presents knowledge base publication sources, types, and publication channels.

After the inspection phase, a maximum of eight publications are selected from IEEE Xplore and one from the ACM journal, as mentioned in Table 5. These publications are considered the world’s largest professional publishing source. Table 6 presents all the publication channels from which the papers are selected for the current SLR. Table 7 discusses the contributions and proposed solutions provided by the different authors of the selected studies.

Moreover, Table 8 presents the quality score of each study, which determines the classification of the studies for the systematic literature review. These studies are classified based on empirical search, research type, and methodology. Quality assessment scores are used to categorize the studies that are included in the paper. The empirical studies are further classified as surveys, evaluation studies, primary search, and experimental search. On the basis of these classifications, the research taxonomy is defined in later sections. Codes are assigned, including IoV and IoM for the Internet of Vehicles and the Internet of Medical Things, respectively. Future research directions define a clear path for new researchers to explore more relevant studies.

What are the current challenges in different IoT types regarding implementing security measures?

IoMT and IoV are the major areas of the IoT domain discussed with regard to security, as they are sensitive to human life. In recent years, security measures such as authentication, authorization, and privacy have been implemented in these areas (Tahsien, Karimipour & Spachos, 2020). Table 9 discusses some current challenges faced when implementing these security measures. Moreover, the existing literature has not considered the authentication, authorization, and privacy of data in the IoMT and IoV domains using machine learning and deep learning techniques. Machine learning and deep learning algorithms are implemented to overcome these challenges and cover the security issues (Cui et al., 2018; Stiawan, Abdullah & Idris, 2010).

What are some of the authorization and authentication methods used for general IoT security purposes?

According to the existing literature (Tama & Rhee, 2017), security is the major issue in the IoT domain. Security methods are implemented to secure the IoT from all perspectives, including authentication and authorization (Uprety, Rawat & Li, 2021). A two-way authentication method provides the required security and resists attacks (Mahmood, 2020). If one factor is compromised, the second factor still provides enough security to the IoT system. Elliptic-curve cryptography (ECC) keys are mostly used for first-factor authentication (Bhatia, Verma & Sharma, 2020), as they provide lightweight and reliable protection overall. Moreover, biometric sensors are used as second-factor authentication for everyday use because they are convenient (Sikarwar & Das, 2021). Different methods of authentication and authorization are described in Table 10.

How can we implement or utilize lightweight ML-based security methods on resource-constrained IoMT devices?

Attackers mostly target the integrity and availability of IoMT systems. AI techniques build detection models to avoid these attacks (Holbrook & Alamaniotis, 2019). Machine learning and deep learning models are used for intrusion detection. When suspicious activity is detected in the system, the compromised connection is terminated to diminish the attack. The ML-based lightweight methods are described in Table 11.
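The detect-then-terminate flow can be illustrated with a deliberately simple detector. The sketch below is an assumption-laden stand-in and not one of the methods in Table 11: instead of a trained ML model it uses a z-score threshold on a connection's packet rate, and the class name, function name, device identifiers, and traffic values are all hypothetical.

```python
from statistics import mean, stdev


class RateAnomalyDetector:
    """Flags a connection whose packet rate deviates strongly from a baseline."""

    def __init__(self, baseline_rates, z_threshold=3.0):
        self.mu = mean(baseline_rates)
        self.sigma = stdev(baseline_rates)
        self.z_threshold = z_threshold

    def is_suspicious(self, rate):
        if self.sigma == 0:
            return rate != self.mu
        return abs(rate - self.mu) / self.sigma > self.z_threshold


def keep_safe_connections(detector, connections):
    """Return the ids of connections to keep; all others would be terminated."""
    return {cid for cid, rate in connections.items()
            if not detector.is_suspicious(rate)}


# Hypothetical packets-per-second observations from normal operation.
baseline = [10, 12, 11, 9, 10, 11, 12, 10]
detector = RateAnomalyDetector(baseline)

# Hypothetical live connections: device id -> current packet rate.
connections = {"pump-01": 11, "monitor-07": 250}
kept = keep_safe_connections(detector, connections)
print(kept)
```

A detector of this kind needs only two stored numbers per feature, which is the sort of footprint a resource-constrained device can afford; a trained model would replace `is_suspicious` while the termination step stays the same.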

Discussion and Future Direction

This section summarizes the results of this systematic literature review.

Taxonomic hierarchy

This systematic literature review aimed to implement the security parameters by selecting the relevant papers and critically reviewing them. We designed the taxonomic hierarchy of selected studies shown in Fig. 8 that are only focused on the security issues of the IoV domain. We have investigated the challenges and developments in different aspects, including high-level features, methods, and security areas. However, these aspects are further divided into sub-domains that show each aspect’s depth and their role in terms of secure devices. Table 12 shows the criteria, evaluation and findings of the papers that are added in the review.

General observations and future directions

This systematic literature review studies different machine learning and deep learning methods. We have reviewed an extensive number of studies in which IoT-related security is implemented in the domains of IoMT, IoV, and IPS. Moreover, selected studies were shortlisted considering the defined inclusion/exclusion criteria and quality assessment scoring. In addition, thematic analysis was performed to extract relevancy relations from these selected studies, which are coded.

The codes selected from the existing literature are shown in Table 13: “IoMT” for the internet of medical things, “IoV” for the internet of vehicles, and “IPS” for intrusion prevention systems. After that, papers that worked on machine learning are selected and assigned the machine learning codes “MM”, “MV”, and “MP”. Finally, papers are categorized according to deep learning techniques with the codes “DM”, “DV”, and “DP”. Selected studies were assessed and analyzed with respect to their aims, methodologies, areas of discussion, and limitations.
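The two-letter codes appear to combine a technique letter with a domain letter. The following sketch makes that reading explicit; the interpretation (first letter for the technique, second for the domain) is an assumption inferred from the six codes listed above, and the function name `study_code` is hypothetical.

```python
# Assumed reading: first letter = technique, second letter = IoT subdomain.
TECHNIQUES = {"machine learning": "M", "deep learning": "D"}
DOMAINS = {"IoMT": "M", "IoV": "V", "IPS": "P"}


def study_code(technique, domain):
    """Build the two-letter code for a study from its technique and domain."""
    return TECHNIQUES[technique] + DOMAINS[domain]


all_codes = [study_code(t, d) for t in TECHNIQUES for d in DOMAINS]
print(all_codes)
```

Enumerating every technique and domain pair reproduces the six codes named in the text, which is a quick check that the scheme is complete.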

Furthermore, Table 14 defined the codes that are implemented on the selected papers from the defined query strings. All the IoT devices have low computing power, so there is a need to implement a security model covering these embedded devices’ authentication, authorizations, and privacy issues. A lightweight method is found to solve the security issues in the IoMT domain, as mentioned in the RQ4.

The IoT domain faces huge security challenges in IoMT, IoV, and IPS. Uprety, Rawat & Li (2021) investigated the challenges, including weak password protection, insecure interfaces, insufficient data protection, and poor management of IoT devices. The main challenge identified in this article is the authorization and authentication problem. Two-way factor techniques are used to implement authorization in IoT. Two-way factor authentication (Mahmood, 2020) is selected as the best security option for avoiding attacks. If one factor is compromised, the other still provides essential security. Elliptic-curve cryptography (ECC) keys are used for first-factor authentication in IoMT and other domains of IoT, as they are lightweight and provide reliable protection. Security issues also arise when data is delivered over the internet, so constrained application protocols are used to overcome them. The Constrained Application Protocol (CoAP) (Kumar et al., 2019) is an application-layer protocol used for resource-constrained IoT applications, including IoMT and IoV.

Moreover, some attacks target the availability and integrity of the system, such as stepping-stone attacks. Therefore, the deep neural network is implemented to build the intrusion prevention system. However, important gaps need to be addressed to ensure the IoT devices are secure and not affected by any attacks on critical infrastructures.

The main objective of this SLR is to cover the security issues in the different domains of the internet of things by considering related articles. To address these security issues, the hierarchical taxonomy of the selected articles is formulated in Fig. 8. The internet of things sits at the top of the hierarchy, as it is the major area of concern. This taxonomic hierarchy gives a broader view of the SLR. It covers the different methods and security issues in IoT, including IoMT, IoV, and IPS. Furthermore, machine learning and deep learning methods are placed in the sub-levels together with the security issues of authorization, authentication, and privacy to give a better understanding of the IoT and its domains.

Questions for primary study

Based on this systematic literature review, we identified the following shortcomings in the existing research.

  • What are the other major security issues in IoT subdomains, and which intrusion detection and prevention system exists that covers all the security issues in IoT subdomains?
  • Which model can be implemented for the security of IoT devices in all the domains, including IoMT, IoV, IoH, and IoT? Future research requires authentication, vulnerability, confidentiality, authorization, and privacy methods.
  • Most techniques used for IPS do not provide complete security against complex attacks. Future researchers can develop intrusion prevention systems for IoT that can be implemented across multiple IoT subdomains to secure devices from all attacks.
  • Different security methods are implemented according to the nature of the IoT domain, including a signature group scheme with various limitations. Therefore, researchers are encouraged to implement security in IoT domains that protects the devices from different attacks to achieve better results.
  • In the current era, heavy models are implemented in IoT devices to deal with complex and dynamic attacks. IoT devices have limited computing power and cannot run this heavy software to overcome the security issues. Therefore, a lightweight method is required that covers the security issues in the IoT domain and provides an authentic model to secure these devices from attacks.

As IoT advances, more security measures need to be implemented to provide a secure IoT environment. Different hardware and software security parameters are implemented to protect data from interruption and unauthorized access (Tahsien, Karimipour & Spachos, 2020; Mawgoud, Karadawy & Tawfik, 2019; Bhatia, Verma & Sharma, 2020; Vajar et al., 2021). Authentication, authorization, and privacy issues are currently discussed in IoT domains. However, other security issues can also be addressed, such as secure data availability at the right time, resource authentication, integrity, and confidentiality (Das & Nene, 2017). The current state-of-the-art security models do not cover security in all the domains of IoT. A few of them cover IoMT security issues (Aljumaie et al., 2021; Raj & Madiajagan, 2021), and the remaining ones focus on the IoV (Ali, Hassan & Saeed, 2021). Other domains exist in IoT, including IoH and IoT, which should also be secured from threats. In the current era, heavy models are implemented in IoT devices to deal with complex and dynamic attacks. However, IoT devices have limited computing power and cannot run this heavy software. Different machine learning (Anbalagan et al., 2021) and deep learning techniques are implemented to maintain privacy in IoT, but traditional deep learning models work with large data sets, and training on that data requires enormous computational power (Lv et al., 2020; Li et al., 2022). Therefore, a lightweight method is required that covers the security issues in the IoT domain and provides an authentic model to secure these devices from attacks.

We have followed the systematic approach to extract the machine learning and deep learning models in IoT devices. A systematic literature review analyzes the research trends in IoT for security. A query string is constructed and applied to different repositories to select the relevant publications. Proper inclusion-exclusion criteria and quality assessment are conducted to extract the related 50 articles from the repositories from 2016 to 2021.

Existing literature focused on only a single domain of IoT when implementing security. In this SLR, however, the use of the query string removes selection bias, because only the articles returned by the query string are selected. The results reveal that most papers are selected from journals and top conferences. The selected papers discussed machine learning and deep learning techniques to implement security in IoT subdomains. In future work, machine learning and deep learning techniques will be implemented in other domains of IoT. Furthermore, various security parameters such as confidentiality, vulnerability, authentication, and privacy of data can be implemented to secure IoT devices.

Funding Statement

The authors received no funding for this work.

Additional Information and Declarations

The authors declare there are no competing interests.

Abqa Javed conceived and designed the experiments, analyzed the data, prepared figures and/or tables, and approved the final draft.

Muhammad Awais conceived and designed the experiments, performed the computation work, prepared figures and/or tables, and approved the final draft.

Muhammad Shoaib performed the experiments, prepared figures and/or tables, and approved the final draft.

Khaldoon S. Khurshid performed the experiments, performed the computation work, authored or reviewed drafts of the article, and approved the final draft.

Mahmoud Othman analyzed the data, authored or reviewed drafts of the article, and approved the final draft.

  • Open access
  • Published: 26 March 2024

Predicting and improving complex beer flavor through machine learning

  • Michiel Schreurs   ORCID: orcid.org/0000-0002-9449-5619 1 , 2 , 3   na1 ,
  • Supinya Piampongsant 1 , 2 , 3   na1 ,
  • Miguel Roncoroni   ORCID: orcid.org/0000-0001-7461-1427 1 , 2 , 3   na1 ,
  • Lloyd Cool   ORCID: orcid.org/0000-0001-9936-3124 1 , 2 , 3 , 4 ,
  • Beatriz Herrera-Malaver   ORCID: orcid.org/0000-0002-5096-9974 1 , 2 , 3 ,
  • Christophe Vanderaa   ORCID: orcid.org/0000-0001-7443-5427 4 ,
  • Florian A. Theßeling 1 , 2 , 3 ,
  • Łukasz Kreft   ORCID: orcid.org/0000-0001-7620-4657 5 ,
  • Alexander Botzki   ORCID: orcid.org/0000-0001-6691-4233 5 ,
  • Philippe Malcorps 6 ,
  • Luk Daenen 6 ,
  • Tom Wenseleers   ORCID: orcid.org/0000-0002-1434-861X 4 &
  • Kevin J. Verstrepen   ORCID: orcid.org/0000-0002-3077-6219 1 , 2 , 3  

Nature Communications volume 15, Article number: 2368 (2024)

50k Accesses

851 Altmetric

  • Chemical engineering
  • Gas chromatography
  • Machine learning
  • Metabolomics
  • Taste receptors

The perception and appreciation of food flavor depends on many interacting chemical compounds and external factors, and therefore proves challenging to understand and predict. Here, we combine extensive chemical and sensory analyses of 250 different beers to train machine learning models that allow predicting flavor and consumer appreciation. For each beer, we measure over 200 chemical properties, perform quantitative descriptive sensory analysis with a trained tasting panel and map data from over 180,000 consumer reviews to train 10 different machine learning models. The best-performing algorithm, Gradient Boosting, yields models that significantly outperform predictions based on conventional statistics and accurately predict complex food features and consumer appreciation from chemical profiles. Model dissection allows identifying specific and unexpected compounds as drivers of beer flavor and appreciation. Adding these compounds results in variants of commercial alcoholic and non-alcoholic beers with improved consumer appreciation. Together, our study reveals how big data and machine learning uncover complex links between food chemistry, flavor and consumer perception, and lays the foundation to develop novel, tailored foods with superior flavors.

Introduction

Predicting and understanding food perception and appreciation is one of the major challenges in food science. Accurate modeling of food flavor and appreciation could yield important opportunities for both producers and consumers, including quality control, product fingerprinting, counterfeit detection, spoilage detection, and the development of new products and product combinations (food pairing) 1 , 2 , 3 , 4 , 5 , 6 . Accurate models for flavor and consumer appreciation would contribute greatly to our scientific understanding of how humans perceive and appreciate flavor. Moreover, accurate predictive models would also facilitate and standardize existing food assessment methods and could supplement or replace assessments by trained and consumer tasting panels, which are variable, expensive and time-consuming 7 , 8 , 9 . Lastly, apart from providing objective, quantitative, accurate and contextual information that can help producers, models can also guide consumers in understanding their personal preferences 10 .

Despite the myriad of applications, predicting food flavor and appreciation from its chemical properties remains a largely elusive goal in sensory science, especially for complex food and beverages 11 , 12 . A key obstacle is the immense number of flavor-active chemicals underlying food flavor. Flavor compounds can vary widely in chemical structure and concentration, making them technically challenging and labor-intensive to quantify, even in the face of innovations in metabolomics, such as non-targeted metabolic fingerprinting 13 , 14 . Moreover, sensory analysis is perhaps even more complicated. Flavor perception is highly complex, resulting from hundreds of different molecules interacting at the physiochemical and sensorial level. Sensory perception is often non-linear, characterized by complex and concentration-dependent synergistic and antagonistic effects 15 , 16 , 17 , 18 , 19 , 20 , 21 that are further convoluted by the genetics, environment, culture and psychology of consumers 22 , 23 , 24 . Perceived flavor is therefore difficult to measure, with problems of sensitivity, accuracy, and reproducibility that can only be resolved by gathering sufficiently large datasets 25 . Trained tasting panels are considered the prime source of quality sensory data, but require meticulous training, are low throughput and high cost. Public databases containing consumer reviews of food products could provide a valuable alternative, especially for studying appreciation scores, which do not require formal training 25 . Public databases offer the advantage of amassing large amounts of data, increasing the statistical power to identify potential drivers of appreciation. However, public datasets suffer from biases, including a bias in the volunteers that contribute to the database, as well as confounding factors such as price, cult status and psychological conformity towards previous ratings of the product.

Classical multivariate statistics and machine learning methods have been used to predict flavor of specific compounds by, for example, linking structural properties of a compound to its potential biological activities or linking concentrations of specific compounds to sensory profiles 1 , 26 . Importantly, most previous studies focused on predicting organoleptic properties of single compounds (often based on their chemical structure) 27 , 28 , 29 , 30 , 31 , 32 , 33 , thus ignoring the fact that these compounds are present in a complex matrix in food or beverages and excluding complex interactions between compounds. Moreover, the classical statistics commonly used in sensory science 34 , 35 , 36 , 37 , 38 , 39 require a large sample size and sufficient variance amongst predictors to create accurate models. They are not fit for studying an extensive set of hundreds of interacting flavor compounds, since they are sensitive to outliers, have a high tendency to overfit and are less suited for non-linear and discontinuous relationships 40 .

In this study, we combine extensive chemical analyses and sensory data of a set of different commercial beers with machine learning approaches to develop models that predict taste, smell, mouthfeel and appreciation from compound concentrations. Beer is particularly suited to model the relationship between chemistry, flavor and appreciation. First, beer is a complex product, consisting of thousands of flavor compounds that partake in complex sensory interactions 41 , 42 , 43 . This chemical diversity arises from the raw materials (malt, yeast, hops, water and spices) and biochemical conversions during the brewing process (kilning, mashing, boiling, fermentation, maturation and aging) 44 , 45 . Second, the advent of the internet saw beer consumers embrace online review platforms, such as RateBeer (ZX Ventures, Anheuser-Busch InBev SA/NV) and BeerAdvocate (Next Glass, inc.). In this way, the beer community provides massive data sets of beer flavor and appreciation scores, creating extraordinarily large sensory databases to complement the analyses of our professional sensory panel. Specifically, we characterize over 200 chemical properties of 250 commercial beers, spread across 22 beer styles, and link these to the descriptive sensory profiling data of a 16-person in-house trained tasting panel and data acquired from over 180,000 public consumer reviews. These unique and extensive datasets enable us to train a suite of machine learning models to predict flavor and appreciation from a beer’s chemical profile. Dissection of the best-performing models allows us to pinpoint specific compounds as potential drivers of beer flavor and appreciation. Follow-up experiments confirm the importance of these compounds and ultimately allow us to significantly improve the flavor and appreciation of selected commercial beers. 
Together, our study represents a significant step towards understanding complex flavors and reinforces the value of machine learning to develop and refine complex foods. In this way, it represents a stepping stone for further computer-aided food engineering applications 46 .

To generate a comprehensive dataset on beer flavor, we selected 250 commercial Belgian beers across 22 different beer styles (Supplementary Fig.  S1 ). Beers with ≤ 4.2% alcohol by volume (ABV) were classified as non-alcoholic and low-alcoholic. Blonds and Tripels constitute a significant portion of the dataset (12.4% and 11.2%, respectively) reflecting their presence on the Belgian beer market and the heterogeneity of beers within these styles. By contrast, lager beers are less diverse and dominated by a handful of brands. Rare styles such as Brut or Faro make up only a small fraction of the dataset (2% and 1%, respectively) because fewer of these beers are produced and because they are dominated by distinct characteristics in terms of flavor and chemical composition.

Extensive analysis identifies relationships between chemical compounds in beer

For each beer, we measured 226 different chemical properties, including common brewing parameters such as alcohol content, iso-alpha acids, pH, sugar concentration 47 , and over 200 flavor compounds (Methods, Supplementary Table  S1 ). A large portion (37.2%) are terpenoids arising from hopping, responsible for herbal and fruity flavors 16 , 48 . A second major category are yeast metabolites, such as esters and alcohols, that result in fruity and solvent notes 48 , 49 , 50 . Other measured compounds are primarily derived from malt, or other microbes such as non- Saccharomyces yeasts and bacteria (‘wild flora’). Compounds that arise from spices or staling are labeled under ‘Others’. Five attributes (caloric value, total acids and total ester, hop aroma and sulfur compounds) are calculated from multiple individually measured compounds.

As a first step in identifying relationships between chemical properties, we determined correlations between the concentrations of the compounds (Fig.  1 , upper panel, Supplementary Data  1 and 2 , and Supplementary Fig.  S2 ; for the sake of clarity, only a subset of the measured compounds is shown in Fig.  1 ). Compounds of the same origin typically show a positive correlation, while absence of correlation hints at parameters varying independently. For example, the hop aroma compounds citronellol and alpha-terpineol show moderate correlations with each other (Spearman’s rho=0.39 and 0.57), but not with the bittering hop component iso-alpha acids (Spearman’s rho=0.16 and −0.07). This illustrates how brewers can independently modify hop aroma and bitterness by selecting hop varieties and dosage time. If hops are added early in the boiling phase, chemical conversions increase bitterness while aromas evaporate; conversely, late addition of hops preserves aroma but limits bitterness 51 . Similarly, hop-derived iso-alpha acids show a strong anti-correlation with lactic acid and acetic acid, likely reflecting growth inhibition of lactic acid and acetic acid bacteria, or the consequent use of fewer hops in sour beer styles, such as West Flanders ales and Fruit beers, that rely on these bacteria for their distinct flavors 52 . Finally, yeast-derived esters (ethyl acetate, ethyl decanoate, ethyl hexanoate, ethyl octanoate) and alcohols (ethanol, isoamyl alcohol, isobutanol, and glycerol) correlate with Spearman coefficients above 0.5, suggesting that these secondary metabolites are correlated with the yeast genetic background and/or fermentation parameters and may be difficult to influence individually, although the choice of yeast strain may offer some control 53 .
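The rank correlations reported above can be illustrated with a minimal sketch of Spearman's rho, computed as the Pearson correlation of average ranks. The concentration values and helper names below are hypothetical and are not taken from the study's data; the sketch only shows how such a coefficient could be obtained.

```python
from math import sqrt

def average_ranks(values):
    """Return 1-based ranks; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation applied to the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical concentrations of two hop aroma compounds in six beers.
citronellol = [1.2, 3.4, 2.2, 8.9, 5.1, 0.7]
alpha_terpineol = [0.5, 1.9, 2.5, 6.0, 2.4, 0.3]
rho = spearman_rho(citronellol, alpha_terpineol)
```

Because only ranks enter the calculation, any monotonic relationship yields a coefficient of 1 or −1 regardless of its shape, which is why this measure suits concentration data that span several orders of magnitude.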

figure 1

Spearman rank correlations are shown. Descriptors are grouped according to their origin (malt (blue), hops (green), yeast (red), wild flora (yellow), Others (black)), and sensory aspect (aroma, taste, palate, and overall appreciation). Please note that for the chemical compounds, for the sake of clarity, only a subset of the total number of measured compounds is shown, with an emphasis on the key compounds for each source. For more details, see the main text and Methods section. Chemical data can be found in Supplementary Data  1 , correlations between all chemical compounds are depicted in Supplementary Fig.  S2 and correlation values can be found in Supplementary Data  2 . See Supplementary Data  4 for sensory panel assessments and Supplementary Data  5 for correlation values between all sensory descriptors.

Interestingly, different beer styles show distinct patterns for some flavor compounds (Supplementary Fig.  S3 ). These observations agree with expectations for key beer styles, and serve as a control for our measurements. For instance, Stouts generally show high values for color (darker), while hoppy beers contain elevated levels of iso-alpha acids, compounds associated with bitter hop taste. Acetic and lactic acid are not prevalent in most beers, with notable exceptions such as Kriek, Lambic, Faro, West Flanders ales and Flanders Old Brown, which use acid-producing bacteria ( Lactobacillus and Pediococcus ) or unconventional yeast ( Brettanomyces ) 54 , 55 . Glycerol, ethanol and esters show similar distributions across all beer styles, reflecting their common origin as products of yeast metabolism during fermentation 45 , 53 . Finally, low/no-alcohol beers contain low concentrations of glycerol and esters. This is in line with the production process for most of the low/no-alcohol beers in our dataset, which are produced through limiting fermentation or by stripping away alcohol via evaporation or dialysis, with both methods having the unintended side-effect of reducing the amount of flavor compounds in the final beer 56 , 57 .

Besides expected associations, our data also reveals less trivial associations between beer styles and specific parameters. For example, geraniol and citronellol, two monoterpenoids responsible for citrus, floral and rose flavors and characteristic of Citra hops, are found in relatively high amounts in Christmas, Saison, and Brett/co-fermented beers, where they may originate from terpenoid-rich spices such as coriander seeds instead of hops 58 .

Tasting panel assessments reveal sensorial relationships in beer

To assess the sensory profile of each beer, a trained tasting panel evaluated each of the 250 beers for 50 sensory attributes, including different hop, malt and yeast flavors, off-flavors and spices. Panelists used a tasting sheet (Supplementary Data  3 ) to score the different attributes. Panel consistency was evaluated by repeating 12 samples across different sessions and performing ANOVA. In 95% of cases no significant difference was found across sessions ( p  > 0.05), indicating good panel consistency (Supplementary Table  S2 ).

Aroma and taste perception reported by the trained panel are often linked (Fig.  1 , bottom left panel and Supplementary Data  4 and 5 ), with high correlations between hops aroma and taste (Spearman’s rho=0.83). Bitter taste was found to correlate with hop aroma and taste in general (Spearman’s rho=0.80 and 0.69), and particularly with “grassy” noble hops (Spearman’s rho=0.75). Barnyard flavor, most often associated with sour beers, is identified together with stale hops (Spearman’s rho=0.97) that are used in these beers. Lactic and acetic acid, which often co-occur, are correlated (Spearman’s rho=0.66). Interestingly, sweetness and bitterness are anti-correlated (Spearman’s rho = −0.48), confirming the hypothesis that they mask each other 59 , 60 . Beer body is highly correlated with alcohol (Spearman’s rho = 0.79), and overall appreciation is found to correlate with multiple aspects that describe beer mouthfeel (alcohol, carbonation; Spearman’s rho= 0.32, 0.39), as well as with hop and ester aroma intensity (Spearman’s rho=0.39 and 0.35).

Similar to the chemical analyses, sensorial analyses confirmed typical features of specific beer styles (Supplementary Fig.  S4 ). For example, sour beers (Faro, Flanders Old Brown, Fruit beer, Kriek, Lambic, West Flanders ale) were rated acidic, with flavors of both acetic and lactic acid. Hoppy beers were found to be bitter and showed hop-associated aromas like citrus and tropical fruit. Malt taste is most detected among scotch, stout/porters, and strong ales, while low/no-alcohol beers, which often have a reputation for being ‘worty’ (reminiscent of unfermented, sweet malt extract) appear in the middle. Unsurprisingly, hop aromas are most strongly detected among hoppy beers. Like its chemical counterpart (Supplementary Fig.  S3 ), acidity shows a right-skewed distribution, with the most acidic beers being Krieks, Lambics, and West Flanders ales.

Tasting panel assessments of specific flavors correlate with chemical composition

We find that the concentrations of several chemical compounds strongly correlate with specific aromas or tastes, as evaluated by the tasting panel (Fig.  2 , Supplementary Fig.  S5 , Supplementary Data  6 ). In some cases, these correlations confirm expectations and serve as a useful control for data quality. For example, iso-alpha acids, the bittering compounds in hops, strongly correlate with bitterness (Spearman’s rho=0.68), while ethanol and glycerol correlate with tasters’ perceptions of alcohol and body, the mouthfeel sensation of fullness (Spearman’s rho=0.82/0.62 and 0.72/0.57 respectively). Likewise, darker color from roasted malts is a good indication of malt perception (Spearman’s rho=0.54).

figure 2

Heatmap colors indicate Spearman’s Rho. Axes are organized according to sensory categories (aroma, taste, mouthfeel, overall), chemical categories and chemical sources in beer (malt (blue), hops (green), yeast (red), wild flora (yellow), Others (black)). See Supplementary Data  6 for all correlation values.

Interestingly, for some relationships between chemical compounds and perceived flavor, correlations are weaker than expected. For example, the rose-smelling phenethyl acetate only weakly correlates with floral aroma. This hints at more complex relationships and interactions between compounds and suggests a need for a more complex model than simple correlations. Lastly, we uncovered unexpected correlations. For instance, the esters ethyl decanoate and ethyl octanoate appear to correlate slightly with hop perception and bitterness, possibly due to their fruity flavor. Iron is anti-correlated with hop aromas and bitterness, most likely because it is also anti-correlated with iso-alpha acids. This could be a sign of metal chelation of hop acids 61 , given that our analyses measure unbound hop acids and total iron content, or could result from the higher iron content in dark and Fruit beers, which typically have less hoppy and bitter flavors 62 .

Public consumer reviews complement expert panel data

To complement and expand the sensory data of our trained tasting panel, we collected 180,000 reviews of our 250 beers from the online consumer review platform RateBeer. This provided numerical scores for beer appearance, aroma, taste, palate, overall quality as well as the average overall score.

Public datasets are known to suffer from biases, such as price, cult status and psychological conformity towards previous ratings of a product. For example, prices correlate with appreciation scores for these online consumer reviews (rho=0.49, Supplementary Fig.  S6 ), but not for our trained tasting panel (rho=0.19). This suggests that prices affect consumer appreciation, which has been reported in wine 63 , while blind tastings are unaffected. Moreover, we observe that some beer styles, like lagers and non-alcoholic beers, generally receive lower scores, reflecting that online reviewers are mostly beer aficionados with a preference for specialty beers over lager beers. In general, we find a modest correlation between our trained panel’s overall appreciation score and the online consumer appreciation scores (Fig.  3 , rho=0.29). Apart from the aforementioned biases in the online datasets, serving temperature, sample freshness and surroundings, which are all tightly controlled during the tasting panel sessions, can vary tremendously across online consumers and can further contribute to differences (in appreciation, among other attributes) between the two categories of tasters. Importantly, in contrast to the overall appreciation scores, for many sensory aspects the results from the professional panel correlated well with results obtained from RateBeer reviews. Correlations were highest for features that are relatively easy to recognize even for untrained tasters, like bitterness, sweetness, alcohol and malt aroma (Fig.  3 and below).

figure 3

RateBeer text mining results can be found in Supplementary Data  7 . Rho values shown are Spearman correlation values, with asterisks indicating significant correlations ( p  < 0.05, two-sided). All p values were smaller than 0.001, except for Esters aroma (0.0553), Esters taste (0.3275), Esters aroma—banana (0.0019), Coriander (0.0508) and Diacetyl (0.0134).

Besides collecting consumer appreciation from these online reviews, we developed automated text analysis tools to gather additional data from review texts (Supplementary Data  7 ). Processing review texts on the RateBeer database yielded comparable results to the scores given by the trained panel for many common sensory aspects, including acidity, bitterness, sweetness, alcohol, malt, and hop tastes (Fig.  3 ). This is in line with what would be expected, since these attributes require less training for accurate assessment and are less influenced by environmental factors such as temperature, serving glass and odors in the environment. Consumer reviews also correlate well with our trained panel for 4-vinyl guaiacol, a compound associated with a very characteristic aroma. By contrast, more specific aromas like ester, coriander or diacetyl are underrepresented in the online reviews, underscoring the importance of using a trained tasting panel and standardized tasting sheets with explicit factors to be scored for evaluating specific aspects of a beer. Taken together, our results suggest that public reviews are trustworthy for some, but not all, flavor features and can complement or substitute taste panel data for these sensory aspects.

Models can predict beer sensory profiles from chemical data

The rich datasets of chemical analyses, tasting panel assessments and public reviews gathered in the first part of this study provided us with a unique opportunity to develop predictive models that link chemical data to sensorial features. Given the complexity of beer flavor, basic statistical tools such as correlations or linear regression may not always be the most suitable for making accurate predictions. Instead, we applied different machine learning models that can model both simple linear and complex interactive relationships. Specifically, we constructed a set of regression models to predict (a) trained panel scores for beer flavor and quality and (b) public reviews’ appreciation scores from beer chemical profiles. We trained and tested 10 different models (Methods), 3 linear regression-based models (simple linear regression with first-order interactions (LR), lasso regression with first-order interactions (Lasso), partial least squares regressor (PLSR)), 5 decision tree models (AdaBoost regressor (ABR), extra trees (ET), gradient boosting regressor (GBR), random forest (RF) and XGBoost regressor (XGBR)), 1 support vector regression (SVR), and 1 artificial neural network (ANN) model.

To compare the performance of our machine learning models, the dataset was randomly split into a training and test set, stratified by beer style. After a model was trained on data in the training set, its performance was evaluated on its ability to predict the test dataset obtained from multi-output models (based on the coefficient of determination, see Methods). Additionally, individual-attribute models were ranked per descriptor and the average rank was calculated, as proposed by Korneva et al. 64 . Importantly, both ways of evaluating the models’ performance agreed in general. Performance of the different models varied (Table  1 ). It should be noted that all models perform better at predicting RateBeer results than results from our trained tasting panel. One reason could be that sensory data is inherently variable, and this variability is averaged out with the large number of public reviews from RateBeer. Additionally, all tree-based models perform better at predicting taste than aroma. Linear models (LR) performed particularly poorly, with negative R 2 values, due to severe overfitting (training set R 2  = 1). Overfitting is a common issue in linear models with many parameters and limited samples, especially with interaction terms further amplifying the number of parameters. L1 regularization (Lasso) successfully overcomes this overfitting, out-competing multiple tree-based models on the RateBeer dataset. Similarly, the dimensionality reduction of PLSR avoids overfitting and improves performance, to some extent. Still, tree-based models (ABR, ET, GBR, RF and XGBR) show the best performance, out-competing the linear models (LR, Lasso, PLSR) commonly used in sensory science 65 .
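The two evaluation criteria described above can be sketched in a few lines. The first function computes the coefficient of determination and shows how a model that fits worse than the test-set mean receives a negative R 2. The second averages per-descriptor ranks across models, which is one plausible reading of the ranking procedure and not necessarily the exact scheme used in the study. All scores and model labels below are hypothetical.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def average_rank(scores_by_model):
    """Map each model to its mean rank over descriptors (rank 1 = highest score).

    scores_by_model: {model name: [score for descriptor 0, descriptor 1, ...]}.
    Ties are broken by insertion order, a simplification made for this sketch.
    """
    models = list(scores_by_model)
    n_desc = len(next(iter(scores_by_model.values())))
    totals = {m: 0 for m in models}
    for d in range(n_desc):
        ordered = sorted(models, key=lambda m: scores_by_model[m][d], reverse=True)
        for rank, m in enumerate(ordered, start=1):
            totals[m] += rank
    return {m: totals[m] / n_desc for m in models}

# Hypothetical test-set scores and the predictions of a badly overfitted model.
y_test = [3.0, 3.5, 4.0, 4.5]
overfit_pred = [4.5, 3.0, 5.0, 2.5]
overfit_r2 = r2_score(y_test, overfit_pred)  # negative: worse than predicting the mean

# Hypothetical per-descriptor R2 values for three models.
scores = {
    "GBR": [0.70, 0.50, 0.60],
    "Lasso": [0.60, 0.55, 0.40],
    "LR": [-1.00, -0.50, -2.00],
}
ranks = average_rank(scores)
```

A negative R 2 on the test set combined with a training R 2 of 1 is the signature of overfitting noted for the unregularized linear model.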

GBR models showed the best overall performance in predicting sensory responses from chemical information, with R 2 values up to 0.75 depending on the predicted sensory feature (Supplementary Table  S4 ). The GBR models predict consumer appreciation (RateBeer) better than our trained panel’s appreciation (R 2 value of 0.67 compared to R 2 value of 0.09) (Supplementary Table  S3 and Supplementary Table  S4 ). ANN models showed intermediate performance, likely because neural networks typically perform best with larger datasets 66 . The SVR shows intermediate performance, mostly due to the weak predictions of specific attributes that lower the overall performance (Supplementary Table  S4 ).

Model dissection identifies specific, unexpected compounds as drivers of consumer appreciation

Next, we leveraged our models to infer important contributors to sensory perception and consumer appreciation. Consumer preference is a crucial sensory aspect, because a product that shows low consumer appreciation scores often does not succeed commercially 25 . Additionally, the requirement for a large number of representative evaluators makes consumer trials one of the more costly and time-consuming aspects of product development. Hence, a model for predicting chemical drivers of overall appreciation would be a welcome addition to the available toolbox for food development and optimization.

Since GBR models on our RateBeer dataset showed the best overall performance, we focused on these models. Specifically, we used two approaches to identify important contributors. First, rankings of the most important predictors for each sensorial trait in the GBR models were obtained based on impurity-based feature importance (mean decrease in impurity). High-ranked parameters were hypothesized to be either the true causal chemical properties underlying the trait, to correlate with the actual causal properties, or to take part in sensory interactions affecting the trait 67 (Fig.  4A ). In a second approach, we used SHAP 68 to determine which parameters contributed most to the model for making predictions of consumer appreciation (Fig.  4B ). SHAP calculates parameter contributions to model predictions on a per-sample basis, which can be aggregated into an importance score.
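To make the idea of per-sample contributions concrete, the sketch below computes exact Shapley values for a toy three-feature model by enumerating all feature subsets, then aggregates them into a global importance score as the mean absolute contribution. This brute-force enumeration is an illustration only and is not the algorithm used by the SHAP library. The toy model, the all-zero baseline and the sample values are assumptions made for this example.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, x, baseline):
    """Exact Shapley values of each feature for one sample.

    Features outside a coalition are held at their baseline value.
    """
    n = len(x)

    def value(subset):
        # Features in `subset` take the sample's values; the rest stay at baseline.
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return predict(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi

def mean_abs_importance(predict, samples, baseline):
    """Aggregate per-sample contributions into one importance score per feature."""
    totals = [0.0] * len(baseline)
    for x in samples:
        for i, phi in enumerate(shapley_values(predict, x, baseline)):
            totals[i] += abs(phi)
    return [t / len(samples) for t in totals]

def toy_model(z):
    # Hypothetical appreciation model with one interaction term (z[0] * z[2]).
    return 2.0 * z[0] + 1.0 * z[1] + z[0] * z[2]

baseline = [0.0, 0.0, 0.0]
sample = [1.0, 2.0, 3.0]
phi = shapley_values(toy_model, sample, baseline)
importance = mean_abs_importance(toy_model, [sample, [-1.0, -2.0, -3.0]], baseline)
```

The contributions of one sample always sum to the difference between its prediction and the baseline prediction, and the interaction term is split evenly between the two features involved.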

figure 4

A The impurity-based feature importance (mean decrease in impurity, MDI) calculated from the Gradient Boosting Regression (GBR) model predicting RateBeer appreciation scores. The top 15 highest ranked chemical properties are shown. B SHAP summary plot for the top 15 parameters contributing to our GBR model. Each point on the graph represents a sample from our dataset. The color represents the concentration of that parameter, with bluer colors representing low values and redder colors representing higher values. Greater absolute values on the horizontal axis indicate a higher impact of the parameter on the prediction of the model. C Spearman correlations between the 15 most important chemical properties and consumer overall appreciation. Numbers indicate the Spearman Rho correlation coefficient, and the rank of this correlation compared to all other correlations. The top 15 important compounds were determined using SHAP (panel B).

Both approaches identified ethyl acetate as the most predictive parameter for beer appreciation (Fig.  4 ). Ethyl acetate is the most abundant ester in beer with a typical ‘fruity’, ‘solvent’ and ‘alcoholic’ flavor, but is often considered less important than other esters like isoamyl acetate. The second most important parameter identified by SHAP is ethanol, the most abundant beer compound after water. Apart from directly contributing to beer flavor and mouthfeel, ethanol drastically influences the physical properties of beer, dictating how easily volatile compounds escape the beer matrix to contribute to beer aroma 69 . Importantly, it should also be noted that the importance of ethanol for appreciation is likely inflated by the very low appreciation scores of non-alcoholic beers (Supplementary Fig.  S4 ). Despite not often being considered a driver of beer appreciation, protein level also ranks highly in both approaches, possibly due to its effect on mouthfeel and body 70 . Lactic acid, which contributes to the tart taste of sour beers, is the fourth most important parameter identified by SHAP, possibly due to the generally high appreciation of sour beers in our dataset.

Interestingly, some of the most important predictive parameters for our model are not well-established as beer flavors or are even commonly regarded as being negative for beer quality. For example, our models identify methanethiol and ethyl phenyl acetate, an ester commonly linked to beer staling 71 , as key factors contributing to beer appreciation. Although there is no doubt that high concentrations of these compounds are considered unpleasant, the positive effects of modest concentrations are not yet known 72 , 73 .

To compare our approach to conventional statistics, we evaluated how well the 15 most important SHAP-derived parameters correlate with consumer appreciation (Fig.  4C ). Interestingly, only 6 of the properties derived by SHAP rank amongst the top 15 most correlated parameters. For some chemical compounds, the correlations are so low that they would have likely been considered unimportant. For example, lactic acid, the fourth most important parameter, shows a bimodal distribution for appreciation, with sour beers forming a separate cluster, that is missed entirely by the Spearman correlation. Additionally, the correlation plots reveal outliers, emphasizing the need for robust analysis tools. Together, this highlights the need for alternative models, like the Gradient Boosting model, that better grasp the complexity of (beer) flavor.

Finally, to observe the relationships between these chemical properties and their predicted targets, partial dependence plots were constructed for the six most important predictors of consumer appreciation 74 , 75 , 76 (Supplementary Fig.  S7 ). One-way partial dependence plots show how a change in concentration affects the predicted appreciation. These plots reveal an important limitation of our models: appreciation predictions remain constant at ever-increasing concentrations. This implies that once a threshold concentration is reached, further increasing the concentration does not affect appreciation. This is false, as it is well-documented that certain compounds become unpleasant at high concentrations, including ethyl acetate (‘nail polish’) 77 and methanethiol (‘sulfury’ and ‘rotten cabbage’) 78 . The inability of our models to grasp that flavor compounds have optimal levels, above which they become negative, is a consequence of working with commercial beer brands where (off-)flavors are rarely too high to negatively impact the product. The two-way partial dependence plots show how changing the concentration of two compounds influences predicted appreciation, visualizing their interactions (Supplementary Fig.  S7 ). In our case, the top 5 parameters are dominated by additive or synergistic interactions, with high concentrations for both compounds resulting in the highest predicted appreciation.
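A one-way partial dependence curve can be sketched as follows: for each grid value of one feature, that feature is overwritten in every sample and the model's predictions are averaged. The toy model below imitates a single tree split and therefore reproduces the plateau described above, where predictions stop changing once a threshold is passed. The model, the threshold of 20 and the sample values are hypothetical.

```python
def partial_dependence(predict, samples, feature, grid):
    """Average prediction over all samples with one feature fixed to each grid value."""
    curve = []
    for g in grid:
        total = 0.0
        for row in samples:
            modified = list(row)   # copy so the original sample is untouched
            modified[feature] = g  # overwrite the feature of interest
            total += predict(modified)
        curve.append(total / len(samples))
    return curve

def toy_tree_model(z):
    # Hypothetical model: a step at z[0] > 20, as a tree split would produce,
    # plus a small linear effect of a second feature.
    return 3.0 + (0.4 if z[0] > 20 else 0.0) + 0.1 * z[1]

# Hypothetical samples: [ester concentration, second property].
samples = [[10.0, 1.0], [25.0, 3.0], [40.0, 5.0]]
grid = [5.0, 15.0, 25.0, 50.0, 100.0]
curve = partial_dependence(toy_tree_model, samples, feature=0, grid=grid)
```

A tree-based model can only return values learned within the range of its training data, so the curve stays flat beyond the last split, which matches the limitation noted for beers whose off-flavor concentrations rarely reach unpleasant levels.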

To assess the robustness of our best-performing models and model predictions, we performed 100 iterations of the GBR, RF and ET models. In general, all iterations of the models yielded similar performance (Supplementary Fig.  S8 ). Moreover, the main predictors (including the top predictors ethanol and ethyl acetate) remained virtually the same, especially for GBR and RF. For the iterations of the ET model, we did observe more variation in the top predictors, which is likely a consequence of the model’s inherent random architecture in combination with co-correlations between certain predictors. However, even in this case, several of the top predictors (ethanol and ethyl acetate) remain unchanged, although their rank in importance changes (Supplementary Fig.  S8 ).

Next, we investigated if a combination of RateBeer and trained panel data into one consolidated dataset would lead to stronger models, under the hypothesis that such a model would suffer less from bias in the datasets. A GBR model was trained to predict appreciation on the combined dataset. This model underperformed compared to the RateBeer model, both in the native case and when including a dataset identifier (R 2  = 0.67, 0.26 and 0.42 respectively). For the latter, the dataset identifier is the most important feature (Supplementary Fig.  S9 ), while most of the feature importance remains unchanged, with ethyl acetate and ethanol ranking highest, like in the original model trained only on RateBeer data. It seems that the large variation in the panel dataset introduces noise, weakening the models’ performances and reliability. In addition, it seems reasonable to assume that both datasets are fundamentally different, with the panel dataset obtained by blind tastings by a trained professional panel.

Lastly, we evaluated whether beer style identifiers would further enhance the model’s performance. A GBR model was trained with parameters that explicitly encoded the styles of the samples. This did not improve model performance (R 2  = 0.66 with style information vs R 2  = 0.67 without). The most important chemical features are consistent with the model trained without style information (e.g., ethanol and ethyl acetate), and with the exception of the most preferred (strong ale) and least preferred (low/no-alcohol) styles, none of the styles were among the most important features (Supplementary Fig.  S9 , Supplementary Table  S5 and S6 ). This is likely due to a combination of style-specific chemical signatures, such as iso-alpha acids and lactic acid, that implicitly convey style information to the original models, as well as the low number of samples belonging to some styles, making it difficult for the model to learn style-specific patterns. Moreover, beer styles are not rigorously defined, with some styles overlapping in features and some beers being misattributed to a specific style, all of which leads to more noise in models that use style parameters.

Model validation

To test if our predictive models give insight into beer appreciation, we set up experiments aimed at improving existing commercial beers. We specifically selected overall appreciation as the trait to be examined because of its complexity and commercial relevance. Beer flavor comprises a complex bouquet rather than single aromas and tastes 53 . Hence, adding a single compound to the extent that a difference is noticeable may lead to an unbalanced, artificial flavor. Therefore, we evaluated the effect of combinations of compounds. Because Blond beers represent the most extensive style in our dataset, we selected a beer from this style as the starting material for these experiments (Beer 64 in Supplementary Data  1 ).

In the first set of experiments, we adjusted the concentrations of compounds that made up the most important predictors of overall appreciation (ethyl acetate, ethanol, lactic acid, ethyl phenyl acetate) together with correlated compounds (ethyl hexanoate, isoamyl acetate, glycerol), bringing them up to 95th-percentile ethanol-normalized concentrations (Methods) within the Blond group (‘Spiked’ concentration in Fig.  5A ). Compared to controls, the spiked beers were found to have significantly improved overall appreciation among trained panelists, with panelists noting increased intensity of ester flavors, sweetness, alcohol, and body fullness (Fig.  5B ). To disentangle the contribution of ethanol to these results, a second experiment was performed without the addition of ethanol. This resulted in a similar outcome, including increased perception of alcohol and overall appreciation.

Fig. 5: Adding the top chemical compounds, identified as best predictors of appreciation by our model, into poorly appreciated beers results in increased appreciation from our trained panel. Results of sensory tests between base beers and those spiked with compounds identified as the best predictors by the model. (A) Blond and Non/Low-alcohol (0.0% ABV) base beers were brought up to 95th-percentile ethanol-normalized concentrations within each style. (B) For each sensory attribute, tasters indicated the more intense sample and selected the sample they preferred. The numbers above the bars correspond to the p values that indicate significant changes in perceived flavor (two-sided binomial test: alpha 0.05, n = 20 or 13).

In a last experiment, we tested whether using the model’s predictions can boost the appreciation of a non-alcoholic beer (beer 223 in Supplementary Data  1 ). Again, the addition of a mixture of predicted compounds (omitting ethanol, in this case) resulted in a significant increase in appreciation, body, ester flavor and sweetness.

Predicting flavor and consumer appreciation from chemical composition is one of the ultimate goals of sensory science. A reliable, systematic and unbiased way to link chemical profiles to flavor and food appreciation would be a significant asset to the food and beverage industry. Such tools would substantially aid in quality control and recipe development, offer an efficient and cost-effective alternative to pilot studies and consumer trials and would ultimately allow food manufacturers to produce superior, tailor-made products that better meet the demands of specific consumer groups more efficiently.

A limited set of studies have previously tried, to varying degrees of success, to predict beer flavor and beer popularity based on (a limited set of) chemical compounds and flavors 79 , 80 . Current sensitive, high-throughput technologies allow measuring an unprecedented number of chemical compounds and properties in a large set of samples, yielding a dataset that can train models that help close the gaps between chemistry and flavor, even for a complex natural product like beer. To our knowledge, no previous research gathered data at this scale (250 samples, 226 chemical parameters, 50 sensory attributes and 5 consumer scores) to disentangle and validate the chemical aspects driving beer preference using various machine-learning techniques. We find that modern machine learning models outperform conventional statistical tools, such as correlations and linear models, and can successfully predict flavor appreciation from chemical composition. This could be attributed to the natural incorporation of interactions and non-linear or discontinuous effects in machine learning models, which are not easily grasped by the linear model architecture. While linear models and partial least squares regression represent the most widespread statistical approaches in sensory science, in part because they allow interpretation 65 , 81 , 82 , modern machine learning methods allow for building better predictive models while preserving the possibility to dissect and exploit the underlying patterns. Of the 10 different models we trained, tree-based models, such as our best performing GBR, showed the best overall performance in predicting sensory responses from chemical information, outcompeting artificial neural networks. This agrees with previous reports for models trained on tabular data 83 . Our results are in line with the findings of Colantonio et al., who also identified the gradient boosting architecture as performing best at predicting appreciation and flavor (of tomatoes and blueberries, in their specific study) 26 . Importantly, besides our larger experimental scale, we were able to directly confirm our models’ predictions in vivo.

Our study confirms that flavor compound concentration does not always correlate with perception, suggesting complex interactions that are often missed by more conventional statistics and simple models. Specifically, we find that tree-based algorithms may perform best in developing models that link complex food chemistry with aroma. Furthermore, we show that massive datasets of untrained consumer reviews provide a valuable source of data, that can complement or even replace trained tasting panels, especially for appreciation and basic flavors, such as sweetness and bitterness. This holds despite biases that are known to occur in such datasets, such as price or conformity bias. Moreover, GBR models predict taste better than aroma. This is likely because taste (e.g. bitterness) often directly relates to the corresponding chemical measurements (e.g., iso-alpha acids), whereas such a link is less clear for aromas, which often result from the interplay between multiple volatile compounds. We also find that our models are best at predicting acidity and alcohol, likely because there is a direct relation between the measured chemical compounds (acids and ethanol) and the corresponding perceived sensorial attribute (acidity and alcohol), and because even untrained consumers are generally able to recognize these flavors and aromas.

The predictions of our final models, trained on review data, hold even for blind tastings with small groups of trained tasters, as demonstrated by our ability to validate specific compounds as drivers of beer flavor and appreciation. Since adding a single compound to the extent of a noticeable difference may result in an unbalanced flavor profile, we specifically tested our identified key drivers as a combination of compounds. While this approach does not allow us to validate if a particular single compound would affect flavor and/or appreciation, our experiments do show that this combination of compounds increases consumer appreciation.

It is important to stress that, while it represents an important step forward, our approach still has several major limitations. A key weakness of the GBR model architecture is that amongst co-correlating variables, the largest main effect is consistently preferred for model building. As a result, co-correlating variables often have artificially low importance scores, both for impurity and SHAP-based methods, like we observed in the comparison to the more randomized Extra Trees models. This implies that chemicals identified as key drivers of a specific sensory feature by GBR might not be the true causative compounds, but rather co-correlate with the actual causative chemical. For example, the high importance of ethyl acetate could be (partially) attributed to the total ester content, ethanol or ethyl hexanoate (rho=0.77, rho=0.72 and rho=0.68), while ethyl phenylacetate could hide the importance of prenyl isobutyrate and ethyl benzoate (rho=0.77 and rho=0.76). Expanding our GBR model to include beer style as a parameter did not yield additional power or insight. This is likely due to style-specific chemical signatures, such as iso-alpha acids and lactic acid, that implicitly convey style information to the original model, as well as the smaller sample size per style, limiting the power to uncover style-specific patterns. This can be partly attributed to the curse of dimensionality, where the high number of parameters results in the models mainly incorporating single parameter effects, rather than complex interactions such as style-dependent effects 67 . A larger number of samples may overcome some of these limitations and offer more insight into style-specific effects. On the other hand, beer style is not a rigid scientific classification, and beers within one style often differ a lot, which further complicates the analysis of style as a model factor.

Our study is limited to beers from Belgian breweries. Although these beers cover a large portion of the beer styles available globally, some beer styles and consumer patterns may be missing, while other features might be overrepresented. For example, many Belgian ales exhibit yeast-driven flavor profiles, which is reflected in the chemical drivers of appreciation discovered by this study. In future work, expanding the scope to include diverse markets and beer styles could lead to the identification of even more drivers of appreciation and better models for special niche products that were not present in our beer set.

In addition to inherent limitations of GBR models, there are also some limitations associated with studying food aroma. Even if our chemical analyses measured most of the known aroma compounds, the total number of flavor compounds in complex foods like beer is still larger than the subset we were able to measure in this study. For example, hop-derived thiols, that influence flavor at very low concentrations, are notoriously difficult to measure in a high-throughput experiment. Moreover, consumer perception remains subjective and prone to biases that are difficult to avoid. It is also important to stress that the models are still immature and that more extensive datasets will be crucial for developing more complete models in the future. Besides more samples and parameters, our dataset does not include any demographic information about the tasters. Including such data could lead to better models that grasp external factors like age and culture. Another limitation is that our set of beers consists of high-quality end-products and lacks beers that are unfit for sale, which limits the current model in accurately predicting products that are appreciated very badly. Finally, while models could be readily applied in quality control, their use in sensory science and product development is restrained by their inability to discern causal relationships. Given that the models cannot distinguish compounds that genuinely drive consumer perception from those that merely correlate, validation experiments are essential to identify true causative compounds.

Despite the inherent limitations, dissection of our models enabled us to pinpoint specific molecules as potential drivers of beer aroma and consumer appreciation, including compounds that were unexpected and would not have been identified using standard approaches. Important drivers of beer appreciation uncovered by our models include protein levels, ethyl acetate, ethyl phenyl acetate and lactic acid. Currently, many brewers already use lactic acid to acidify their brewing water and ensure optimal pH for enzymatic activity during the mashing process. Our results suggest that adding lactic acid can also improve beer appreciation, although its individual effect remains to be tested. Interestingly, ethanol appears to be unnecessary to improve beer appreciation, both for blond beer and alcohol-free beer. Given the growing consumer interest in alcohol-free beer, with a predicted annual market growth of >7% 84 , it is relevant for brewers to know what compounds can further increase consumer appreciation of these beers. Hence, our model may readily provide avenues to further improve the flavor and consumer appreciation of both alcoholic and non-alcoholic beers, which is generally considered one of the key challenges for future beer production.

Whereas we see a direct implementation of our results for the development of superior alcohol-free beverages and other food products, our study can also serve as a stepping stone for the development of novel alcohol-containing beverages. We want to echo the growing body of scientific evidence for the negative effects of alcohol consumption, both on the individual level by the mutagenic, teratogenic and carcinogenic effects of ethanol 85 , 86 , as well as the burden on society caused by alcohol abuse and addiction. We encourage the use of our results for the production of healthier, tastier products, including novel and improved beverages with lower alcohol contents. Furthermore, we strongly discourage the use of these technologies to improve the appreciation or addictive properties of harmful substances.

The present work demonstrates that despite some important remaining hurdles, combining the latest developments in chemical analyses, sensory analysis and modern machine learning methods offers exciting avenues for food chemistry and engineering. Soon, these tools may provide solutions in quality control and recipe development, as well as new approaches to sensory science and flavor research.

Beer selection

250 commercial Belgian beers were selected to cover the broad diversity of beer styles and corresponding diversity in chemical composition and aroma. See Supplementary Fig.  S1 .

Chemical dataset

Sample preparation.

Beers within their expiration date were purchased from commercial retailers. Samples were prepared in biological duplicates at room temperature, unless explicitly stated otherwise. Bottle pressure was measured with a manual pressure device (Steinfurth Mess-Systeme GmbH) and used to calculate CO₂ concentration. The beer was poured through two filter papers (Macherey-Nagel, 500713032 MN 713 ¼) to remove carbon dioxide and prevent spontaneous foaming. Samples were then prepared for measurements by targeted Headspace-Gas Chromatography-Flame Ionization Detector/Flame Photometric Detector (HS-GC-FID/FPD), Headspace-Solid Phase Microextraction-Gas Chromatography-Mass Spectrometry (HS-SPME-GC-MS), colorimetric analysis, enzymatic analysis and Near-Infrared (NIR) analysis, as described in the sections below. The mean values of biological duplicates are reported for each compound.

HS-GC-FID/FPD

HS-GC-FID/FPD (Shimadzu GC 2010 Plus) was used to measure higher alcohols, acetaldehyde, esters, 4-vinyl guaiacol, and sulfur compounds. Each measurement comprised 5 ml of sample pipetted into a 20 ml glass vial containing 1.75 g NaCl (VWR, 27810.295). 100 µl of 2-heptanol (Sigma-Aldrich, H3003) (internal standard) solution in ethanol (Fisher Chemical, E/0650DF/C17) was added for a final concentration of 2.44 mg/L. Samples were flushed with nitrogen for 10 s, sealed with a silicone septum, stored at −80 °C and analyzed in batches of 20.

The GC was equipped with a DB-WAXetr column (length, 30 m; internal diameter, 0.32 mm; layer thickness, 0.50 µm; Agilent Technologies, Santa Clara, CA, USA) to the FID and an HP-5 column (length, 30 m; internal diameter, 0.25 mm; layer thickness, 0.25 µm; Agilent Technologies, Santa Clara, CA, USA) to the FPD. N₂ was used as the carrier gas. Samples were incubated for 20 min at 70 °C in the headspace autosampler (Flow rate, 35 cm/s; Injection volume, 1000 µL; Injection mode, split; Combi PAL autosampler, CTC analytics, Switzerland). The injector, FID and FPD temperatures were kept at 250 °C. The GC oven temperature was first held at 50 °C for 5 min and then allowed to rise to 80 °C at a rate of 5 °C/min, followed by a second ramp of 4 °C/min until 200 °C (held for 3 min) and a final ramp of 4 °C/min until 230 °C (held for 1 min). Results were analyzed with the GCSolution software version 2.4 (Shimadzu, Kyoto, Japan). The GC was calibrated with a 5% EtOH solution (VWR International) containing the volatiles under study (Supplementary Table S7).
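As a quick arithmetic check, the oven program described above can be tallied into a total run time. The sketch below is illustrative only: it assumes that "kept for 3 min" and "for 1 min" denote isothermal holds at 200 °C and 230 °C, and the names `PROGRAM` and `total_run_time` are hypothetical.

```python
# Oven program transcribed from the text above.
# Assumption: the 3 min and 1 min periods are isothermal holds.
START_TEMP_C = 50.0
PROGRAM = [
    ("hold", 5.0),          # 5 min at 50 °C
    ("ramp", 80.0, 5.0),    # to 80 °C at 5 °C/min
    ("ramp", 200.0, 4.0),   # to 200 °C at 4 °C/min
    ("hold", 3.0),          # 3 min at 200 °C
    ("ramp", 230.0, 4.0),   # to 230 °C at 4 °C/min
    ("hold", 1.0),          # 1 min at 230 °C
]


def total_run_time(start_temp_c, program):
    """Sum hold times and ramp durations (temperature change / rate), in minutes."""
    temp = start_temp_c
    minutes = 0.0
    for step in program:
        if step[0] == "hold":
            minutes += step[1]
        else:
            _, target, rate = step
            minutes += (target - temp) / rate
            temp = target
    return minutes


run_time_min = total_run_time(START_TEMP_C, PROGRAM)
```

Under these assumptions the segments are 5 + 6 + 30 + 3 + 7.5 + 1 minutes.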

HS-SPME-GC-MS

HS-SPME-GC-MS (Shimadzu GCMS-QP-2010 Ultra) was used to measure additional volatile compounds, mainly comprising terpenoids and esters. Samples were analyzed by HS-SPME using a triphase DVB/Carboxen/PDMS 50/30 μm SPME fiber (Supelco Co., Bellefonte, PA, USA) followed by gas chromatography (Thermo Fisher Scientific Trace 1300 series, USA) coupled to a mass spectrometer (Thermo Fisher Scientific ISQ series MS) equipped with a TriPlus RSH autosampler. 5 ml of degassed beer sample was placed in 20 ml vials containing 1.75 g NaCl (VWR, 27810.295). 5 µl internal standard mix was added, containing 2-heptanol (1 g/L) (Sigma-Aldrich, H3003), 4-fluorobenzaldehyde (1 g/L) (Sigma-Aldrich, 128376), 2,3-hexanedione (1 g/L) (Sigma-Aldrich, 144169) and guaiacol (1 g/L) (Sigma-Aldrich, W253200) in ethanol (Fisher Chemical, E/0650DF/C17). Each sample was incubated at 60 °C in the autosampler oven with constant agitation. After 5 min equilibration, the SPME fiber was exposed to the sample headspace for 30 min. The compounds trapped on the fiber were thermally desorbed in the injection port of the chromatograph by heating the fiber for 15 min at 270 °C.

The GC-MS was equipped with a low polarity RXi-5Sil MS column (length, 20 m; internal diameter, 0.18 mm; layer thickness, 0.18 µm; Restek, Bellefonte, PA, USA). Injection was performed in splitless mode at 320 °C, a split flow of 9 ml/min, a purge flow of 5 ml/min and an open valve time of 3 min. To obtain a pulsed injection, a programmed gas flow was used whereby the helium gas flow was set at 2.7 mL/min for 0.1 min, followed by a decrease in flow of 20 ml/min to the normal 0.9 mL/min. The temperature was first held at 30 °C for 3 min and then allowed to rise to 80 °C at a rate of 7 °C/min, followed by a second ramp of 2 °C/min till 125 °C and a final ramp of 8 °C/min with a final temperature of 270 °C.

Mass acquisition range was 33 to 550 amu at a scan rate of 5 scans/s. Electron impact ionization energy was 70 eV. The interface and ion source were kept at 275 °C and 250 °C, respectively. A mix of linear n-alkanes (from C7 to C40, Supelco Co.) was injected into the GC-MS under identical conditions to serve as external retention index markers. Identification and quantification of the compounds were performed using an in-house developed R script as described in Goelen et al. and Reher et al. 87 , 88 (for package information, see Supplementary Table  S8 ). Briefly, chromatograms were analyzed using AMDIS (v2.71) 89 to separate overlapping peaks and obtain pure compound spectra. The NIST MS Search software (v2.0 g) in combination with the NIST2017, FFNSC3 and Adams4 libraries were used to manually identify the empirical spectra, taking into account the expected retention time. After background subtraction and correcting for retention time shifts between samples run on different days based on alkane ladders, compound elution profiles were extracted and integrated using a file with 284 target compounds of interest, which were either recovered in our identified AMDIS list of spectra or were known to occur in beer. Compound elution profiles were estimated for every peak in every chromatogram over a time-restricted window using weighted non-negative least square analysis after which peak areas were integrated 87 , 88 . Batch effect correction was performed by normalizing against the most stable internal standard compound, 4-fluorobenzaldehyde. Out of all 284 target compounds that were analyzed, 167 were visually judged to have reliable elution profiles and were used for final analysis.
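To illustrate how an n-alkane ladder serves as a set of retention index markers, the sketch below computes a linear (temperature-programmed) retention index by interpolating between the two alkanes that bracket a peak. The text does not state which index formula the in-house R script uses, so this common form is an assumption, and the retention times in the usage note are made up.

```python
from bisect import bisect_right


def linear_retention_index(t, alkane_times):
    """Linear retention index of a peak eluting at time t.

    alkane_times maps carbon number -> retention time (same units as t)
    and must be strictly increasing with carbon number.
    """
    carbons = sorted(alkane_times)
    times = [alkane_times[c] for c in carbons]
    if not times[0] <= t <= times[-1]:
        raise ValueError("retention time lies outside the alkane ladder")
    # Index of the alkane eluting just before (or at) t, capped so that a
    # peak coinciding with the last alkane still has an upper neighbour.
    i = min(bisect_right(times, t) - 1, len(times) - 2)
    n, n_next = carbons[i], carbons[i + 1]
    fraction = (t - times[i]) / (times[i + 1] - times[i])
    return 100.0 * (n + (n_next - n) * fraction)
```

For example, with a hypothetical ladder `{7: 2.0, 8: 4.0, 9: 7.0}` (minutes), a peak at 5.5 min sits halfway between C8 and C9 and receives an index of 850.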

Discrete photometric and enzymatic analysis

Discrete photometric and enzymatic analysis (Thermo Scientific TM Gallery TM Plus Beermaster Discrete Analyzer) was used to measure acetic acid, ammonia, beta-glucan, iso-alpha acids, color, sugars, glycerol, iron, pH, protein, and sulfite. 2 ml of sample volume was used for the analyses. Information regarding the reagents and standard solutions used for analyses and calibrations is included in Supplementary Table  S7 and Supplementary Table  S9 .

NIR analyses

NIR analysis (Anton Paar Alcolyzer Beer ME System) was used to measure ethanol. Measurements comprised 50 ml of sample, and a 10% EtOH solution was used for calibration.

Correlation calculations

Pairwise Spearman Rank correlations were calculated between all chemical properties.
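For readers unfamiliar with the statistic, a Spearman rank correlation is the Pearson correlation of the ranked values, with tied values sharing their mean rank. The following is a minimal standard-library sketch of that definition, not the code used in the study; the function names and the `columns` input format are assumptions.

```python
from math import sqrt


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def pairwise_spearman(columns):
    """columns: {property name: list of values}. Returns {(a, b): rho}."""
    names = sorted(columns)
    return {
        (a, b): spearman(columns[a], columns[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }
```

Because only ranks enter the calculation, any monotonic relationship between two properties yields rho = 1 or −1, regardless of its shape.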

Sensory dataset

Trained panel.

Our trained tasting panel consisted of volunteers who gave prior verbal informed consent. All compounds used for the validation experiment were of food-grade quality. The tasting sessions were approved by the Social and Societal Ethics Committee of the KU Leuven (G-2022-5677-R2(MAR)). All online reviewers agreed to the Terms and Conditions of the RateBeer website.

Sensory analysis was performed according to the American Society of Brewing Chemists (ASBC) Sensory Analysis Methods 90 . 30 volunteers were screened through a series of triangle tests. The sixteen most sensitive and consistent tasters were retained as taste panel members. The resulting panel was diverse in age [22–42, mean: 29], sex [56% male] and nationality [7 different countries]. The panel developed a consensus vocabulary to describe beer aroma, taste and mouthfeel. Panelists were trained to identify and score 50 different attributes, using a 7-point scale to rate attributes’ intensity. The scoring sheet is included as Supplementary Data  3 . Sensory assessments took place between 10–12 a.m. The beers were served in black-colored glasses. Per session, between 5 and 12 beers of the same style were tasted at 12 °C to 16 °C. Two reference beers were added to each set and indicated as ‘Reference 1 & 2’, allowing panel members to calibrate their ratings. Not all panelists were present at every tasting. Scores were scaled by standard deviation and mean-centered per taster. Values are represented as z-scores and clustered by Euclidean distance. Pairwise Spearman correlations were calculated between taste and aroma sensory attributes. Panel consistency was evaluated by repeating samples on different sessions and performing ANOVA to identify differences, using the ‘stats’ package (v4.2.2) in R (for package information, see Supplementary Table  S8 ).

Online reviews from a public database

The ‘scrapy’ package in Python (v3.6) (for package information, see Supplementary Table S8) was used to collect 232,288 online reviews (mean=922, min=6, max=5343) from RateBeer, an online beer review database. Each review entry comprised 5 numerical scores (appearance, aroma, taste, palate and overall quality) and an optional review text. The total number of reviews per reviewer was collected separately. Numerical scores were scaled and centered per rater, and mean scores were calculated per beer.
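The per-rater scaling and per-beer averaging described above can be sketched as follows. This is an illustration rather than the study's code: the `(rater, beer, score)` tuple format is hypothetical, and the use of the population standard deviation is an assumption, since the text does not specify the estimator.

```python
from collections import defaultdict
from statistics import mean, pstdev


def scale_per_rater(reviews):
    """reviews: list of (rater, beer, score). Returns (rater, beer, z-score)."""
    by_rater = defaultdict(list)
    for rater, _, score in reviews:
        by_rater[rater].append(score)
    stats = {r: (mean(s), pstdev(s)) for r, s in by_rater.items()}
    scaled = []
    for rater, beer, score in reviews:
        mu, sd = stats[rater]
        # A rater who always gives the same score carries no ranking
        # information; map those scores to 0 instead of dividing by zero.
        z = (score - mu) / sd if sd > 0 else 0.0
        scaled.append((rater, beer, z))
    return scaled


def mean_per_beer(scaled):
    """Average the rater-standardized scores for each beer."""
    by_beer = defaultdict(list)
    for _, beer, z in scaled:
        by_beer[beer].append(z)
    return {beer: mean(zs) for beer, zs in by_beer.items()}
```

Standardizing within each rater removes differences in how generously or how widely individual reviewers use the scale before the scores are pooled per beer.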

For the review texts, the language was estimated using the packages ‘langdetect’ and ‘langid’ in Python. Reviews that were classified as English by both packages were kept. Reviewers with fewer than 100 entries overall were discarded. 181,025 reviews from >6000 reviewers from >40 countries remained. Text processing was done using the ‘nltk’ package in Python. Texts were corrected for slang and misspellings; proper nouns and rare words that are relevant to the beer context were specified and kept as-is (‘Chimay’, ‘Lambic’, etc.). A dictionary of semantically similar sensorial terms, for example ‘floral’ and ‘flower’, was created and collapsed together into one term. Words were stemmed and lemmatized to avoid identifying words such as ‘acid’ and ‘acidity’ as separate terms. Numbers and punctuation were removed.

Sentences from up to 50 randomly chosen reviews per beer were manually categorized according to the aspect of beer they describe (appearance, aroma, taste, palate, overall quality—not to be confused with the 5 numerical scores described above) or flagged as irrelevant if they contained no useful information. If a beer contained fewer than 50 reviews, all reviews were manually classified. This labeled data set was used to train a model that classified the rest of the sentences for all beers 91 . Sentences describing taste and aroma were extracted, and term frequency–inverse document frequency (TFIDF) was implemented to calculate enrichment scores for sensorial words per beer.
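The TFIDF enrichment score can be illustrated with the textbook definition: the term frequency within one beer's reviews multiplied by the log of the inverse fraction of beers that mention the term. The text does not state which TFIDF variant or library was used, so the unsmoothed form below, and the `docs` input format, are assumptions.

```python
from collections import Counter
from math import log


def tfidf(docs):
    """docs: {beer: list of tokens}. Returns {beer: {term: tf * idf}}.

    tf  = count of term in the beer's tokens / number of tokens
    idf = log(number of beers / number of beers containing the term)
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))
    scores = {}
    for beer, tokens in docs.items():
        counts = Counter(tokens)
        total = len(tokens)
        scores[beer] = {
            term: (count / total) * log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return scores
```

With this definition, a word that appears in the reviews of every beer receives a score of zero, while a word concentrated in a few beers is scored highly for those beers.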

The sex of the tasting subject was not considered when building our sensory database. Instead, results from different panelists were averaged, both for our trained panel (56% male, 44% female) and the RateBeer reviews (70% male, 30% female for RateBeer as a whole).

Beer price collection and processing

Beer prices were collected from the following stores: Colruyt, Delhaize, Total Wine, BeerHawk, The Belgian Beer Shop, The Belgian Shop, and Beer of Belgium. Where applicable, prices were converted to Euros and normalized per liter. Spearman correlations were calculated between these prices and mean overall appreciation scores from RateBeer and the taste panel, respectively.

Pairwise Spearman Rank correlations were calculated between all sensory properties.

Machine learning models

Predictive modeling of sensory profiles from chemical data.

Regression models were constructed to predict (a) trained panel scores for beer flavors and quality from beer chemical profiles and (b) public reviews’ appreciation scores from beer chemical profiles. Z-scores were used to represent sensory attributes in both data sets. Chemical properties with log-normal distributions (Shapiro-Wilk test, p < 0.05) were log-transformed. Missing chemical measurements (0.1% of all data) were replaced with mean values per attribute. Observations from 250 beers were randomly separated into a training set (70%, 175 beers) and a test set (30%, 75 beers), stratified per beer style. Chemical measurements (p = 231) were normalized based on the training set average and standard deviation. In total, ten models were trained: three linear regression-based models, namely linear regression with first-order interaction terms (LR), lasso regression with first-order interaction terms (Lasso) and partial least squares regression (PLSR); five decision tree models, namely Adaboost regressor (ABR), Extra Trees (ET), Gradient Boosting regressor (GBR), Random Forest (RF) and XGBoost regressor (XGBR); one support vector machine model (SVR); and one artificial neural network model (ANN). The models were implemented using the ‘scikit-learn’ package (v1.2.2) and ‘xgboost’ package (v1.7.3) in Python (v3.9.16). Models were trained, and hyperparameters optimized, using five-fold cross-validated grid search with the coefficient of determination (R²) as the evaluation metric. The ANN (scikit-learn’s MLPRegressor) was optimized using Bayesian Tree-Structured Parzen Estimator optimization with the ‘Optuna’ Python package (v3.2.0). Individual models were trained per attribute, and a multi-output model was trained on all attributes simultaneously.
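Two preprocessing steps described above are easy to get subtly wrong: the split stratified per beer style, and normalization using training-set statistics only. The sketch below shows both with the standard library. It is not the study's scikit-learn code; the function names, the fixed seed and the per-style rounding rule are assumptions.

```python
import random
from collections import defaultdict
from statistics import mean, pstdev


def stratified_split(styles, test_fraction=0.3, seed=0):
    """styles: one style label per beer. Returns (train_idx, test_idx)."""
    rng = random.Random(seed)
    by_style = defaultdict(list)
    for idx, style in enumerate(styles):
        by_style[style].append(idx)
    train_idx, test_idx = [], []
    for members in by_style.values():
        rng.shuffle(members)
        # Assumption: each style contributes its own rounded share to the test set.
        n_test = round(len(members) * test_fraction)
        test_idx.extend(members[:n_test])
        train_idx.extend(members[n_test:])
    return sorted(train_idx), sorted(test_idx)


def fit_scaler(train_rows):
    """Column-wise (mean, standard deviation) from the training rows only."""
    columns = list(zip(*train_rows))
    # Constant columns get a divisor of 1.0 to avoid division by zero.
    return [(mean(c), pstdev(c) or 1.0) for c in columns]


def apply_scaler(rows, params):
    """Standardize any rows (training or test) with the training statistics."""
    return [[(v - mu) / sd for v, (mu, sd) in zip(row, params)] for row in rows]
```

Fitting the scaler on the training rows and reusing those parameters for the test rows keeps information about the test set out of model training.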

Model dissection

GBR was found to outperform other methods, resulting in models with the highest average R² values in both trained panel and public review data sets. Impurity-based rankings of the most important predictors for each predicted sensorial trait were obtained using the ‘scikit-learn’ package. To observe the relationships between these chemical properties and their predicted targets, partial dependence plots (PDP) were constructed for the six most important predictors of consumer appreciation 74 , 75 .
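A partial dependence curve is the model's average prediction when one feature is forced to a given value for every sample while all other features keep their observed values. The sketch below implements that definition for any prediction function. `toy_model` is a hypothetical stand-in for a fitted regressor, and the data are made up.

```python
def partial_dependence(predict, rows, feature_index, grid):
    """Average prediction over `rows` with one feature set to each grid value."""
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = list(row)
            modified[feature_index] = value
            total += predict(modified)
        curve.append(total / len(rows))
    return curve


# Hypothetical stand-in for a fitted model: the effect of feature 0
# saturates at 2.0, and feature 1 contributes linearly.
def toy_model(row):
    return min(row[0], 2.0) + 0.5 * row[1]


toy_rows = [[0.0, 1.0], [1.0, 3.0], [3.0, 2.0]]
toy_curve = partial_dependence(toy_model, toy_rows, 0, [0.0, 1.0, 2.0, 3.0])
```

In this toy case the curve rises with feature 0 and then flattens, which is the kind of non-linear, saturating relationship such plots are meant to reveal.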

The ‘SHAP’ package in Python (v0.41.0) was implemented to provide an alternative ranking of predictor importance and to visualize the predictors’ effects as a function of their concentration 68 .
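SHAP importance scores are based on Shapley values, which split a prediction among the features by averaging each feature's marginal contribution over all subsets of the other features. The brute-force sketch below follows that definition against a single baseline row. It is an assumption-laden illustration, not the algorithm inside the ‘SHAP’ package, which uses efficient model-specific methods and a background data set.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values for sample x relative to one baseline row.

    Features outside a coalition take their baseline value. The cost grows
    as 2**n, so this is only practical for a handful of features.
    """
    n = len(x)

    def value(coalition):
        row = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(row)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi
```

The values sum to the difference between the prediction for `x` and the prediction for the baseline, and features that act only through an interaction share the credit for it.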

Validation of causal chemical properties

To validate the effects of the most important model features on predicted sensory attributes, beers were spiked with the chemical compounds identified by the models and descriptive sensory analyses were carried out according to the American Society of Brewing Chemists (ASBC) protocol 90 .

Compound spiking was done 30 min before tasting. Compounds were spiked into fresh beer bottles, that were immediately resealed and inverted three times. Fresh bottles of beer were opened for the same duration, resealed, and inverted thrice, to serve as controls. Pairs of spiked samples and controls were served simultaneously, chilled and in dark glasses as outlined in the Trained panel section above. Tasters were instructed to select the glass with the higher flavor intensity for each attribute (directional difference test 92 ) and to select the glass they prefer.

The final concentration after spiking was equal to the within-style average, after normalizing by ethanol concentration. This was done to ensure balanced flavor profiles in the final spiked beer. The same methods were applied to improve a non-alcoholic beer. Compounds were the following: ethyl acetate (Merck KGaA, W241415), ethyl hexanoate (Merck KGaA, W243906), isoamyl acetate (Merck KGaA, W205508), phenethyl acetate (Merck KGaA, W285706), ethanol (96%, Colruyt), glycerol (Merck KGaA, W252506), lactic acid (Merck KGaA, 261106).

Significant differences in preference or perceived intensity were determined by performing the two-sided binomial test on each attribute.
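The two-sided binomial test can be written out directly for a paired comparison. The sketch below assumes a null hypothesis of no difference (each taster picks either glass with probability 0.5), which the text does not state explicitly, and defines the p value as the total probability of all outcomes that are no more likely than the observed count.

```python
from math import comb


def binomial_two_sided_p(k, n, p=0.5):
    """Exact two-sided binomial p value for k successes in n trials."""
    pmf = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    # Small relative tolerance so that outcomes tied with the observed one
    # are not dropped because of floating-point rounding.
    threshold = pmf[k] * (1 + 1e-9)
    return min(1.0, sum(q for q in pmf if q <= threshold))
```

Under these assumptions, 15 of 20 tasters choosing the same glass gives p ≈ 0.041, just below the 0.05 threshold, whereas 14 of 20 gives p ≈ 0.115.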

Reporting summary

Further information on research design is available in the  Nature Portfolio Reporting Summary linked to this article.

Data availability

The data that support the findings of this work are available in the Supplementary Data files and have been deposited to Zenodo under accession code 10653704 93 . The RateBeer scores data are under restricted access; they are not publicly available as they are property of RateBeer (ZX Ventures, USA). Access can be obtained from the authors upon reasonable request and with permission of RateBeer (ZX Ventures, USA). Source data are provided with this paper.

Code availability

The code for training the machine learning models, analyzing the models, and generating the figures has been deposited to Zenodo under accession code 10653704 93 .

Tieman, D. et al. A chemical genetic roadmap to improved tomato flavor. Science 355 , 391–394 (2017).

Plutowska, B. & Wardencki, W. Application of gas chromatography–olfactometry (GC–O) in analysis and quality assessment of alcoholic beverages – A review. Food Chem. 107 , 449–463 (2008).

Legin, A., Rudnitskaya, A., Seleznev, B. & Vlasov, Y. Electronic tongue for quality assessment of ethanol, vodka and eau-de-vie. Anal. Chim. Acta 534 , 129–135 (2005).

Loutfi, A., Coradeschi, S., Mani, G. K., Shankar, P. & Rayappan, J. B. B. Electronic noses for food quality: A review. J. Food Eng. 144 , 103–111 (2015).

Ahn, Y.-Y., Ahnert, S. E., Bagrow, J. P. & Barabási, A.-L. Flavor network and the principles of food pairing. Sci. Rep. 1 , 196 (2011).

Bartoshuk, L. M. & Klee, H. J. Better fruits and vegetables through sensory analysis. Curr. Biol. 23 , R374–R378 (2013).

Piggott, J. R. Design questions in sensory and consumer science. Food Qual. Prefer. 3293 , 217–220 (1995).

Kermit, M. & Lengard, V. Assessing the performance of a sensory panel-panellist monitoring and tracking. J. Chemom. 19 , 154–161 (2005).

Cook, D. J., Hollowood, T. A., Linforth, R. S. T. & Taylor, A. J. Correlating instrumental measurements of texture and flavour release with human perception. Int. J. Food Sci. Technol. 40 , 631–641 (2005).

Chinchanachokchai, S., Thontirawong, P. & Chinchanachokchai, P. A tale of two recommender systems: The moderating role of consumer expertise on artificial intelligence based product recommendations. J. Retail. Consum. Serv. 61 , 1–12 (2021).

Ross, C. F. Sensory science at the human-machine interface. Trends Food Sci. Technol. 20 , 63–72 (2009).

Chambers, E. IV & Koppel, K. Associations of volatile compounds with sensory aroma and flavor: The complex nature of flavor. Molecules 18 , 4887–4905 (2013).

Pinu, F. R. Metabolomics—The new frontier in food safety and quality research. Food Res. Int. 72 , 80–81 (2015).

Danezis, G. P., Tsagkaris, A. S., Brusic, V. & Georgiou, C. A. Food authentication: state of the art and prospects. Curr. Opin. Food Sci. 10 , 22–31 (2016).

Shepherd, G. M. Smell images and the flavour system in the human brain. Nature 444 , 316–321 (2006).

Meilgaard, M. C. Prediction of flavor differences between beers from their chemical composition. J. Agric. Food Chem. 30 , 1009–1017 (1982).

Xu, L. et al. Widespread receptor-driven modulation in peripheral olfactory coding. Science 368 , eaaz5390 (2020).

Kupferschmidt, K. Following the flavor. Science 340 , 808–809 (2013).

Billesbølle, C. B. et al. Structural basis of odorant recognition by a human odorant receptor. Nature 615 , 742–749 (2023).

Article   ADS   PubMed   PubMed Central   Google Scholar  

Smith, B. Perspective: Complexities of flavour. Nature 486 , S6–S6 (2012).

Pfister, P. et al. Odorant receptor inhibition is fundamental to odor encoding. Curr. Biol. 30 , 2574–2587 (2020).

Moskowitz, H. W., Kumaraiah, V., Sharma, K. N., Jacobs, H. L. & Sharma, S. D. Cross-cultural differences in simple taste preferences. Science 190 , 1217–1218 (1975).

Eriksson, N. et al. A genetic variant near olfactory receptor genes influences cilantro preference. Flavour 1 , 22 (2012).

Ferdenzi, C. et al. Variability of affective responses to odors: Culture, gender, and olfactory knowledge. Chem. Senses 38 , 175–186 (2013).

Article   PubMed   Google Scholar  

Lawless, H. T. & Heymann, H. Sensory evaluation of food: Principles and practices. (Springer, New York, NY). https://doi.org/10.1007/978-1-4419-6488-5 (2010).

Colantonio, V. et al. Metabolomic selection for enhanced fruit flavor. Proc. Natl. Acad. Sci. 119 , e2115865119 (2022).

Fritz, F., Preissner, R. & Banerjee, P. VirtualTaste: a web server for the prediction of organoleptic properties of chemical compounds. Nucleic Acids Res 49 , W679–W684 (2021).

Tuwani, R., Wadhwa, S. & Bagler, G. BitterSweet: Building machine learning models for predicting the bitter and sweet taste of small molecules. Sci. Rep. 9 , 1–13 (2019).

Dagan-Wiener, A. et al. Bitter or not? BitterPredict, a tool for predicting taste from chemical structure. Sci. Rep. 7 , 1–13 (2017).

Pallante, L. et al. Toward a general and interpretable umami taste predictor using a multi-objective machine learning approach. Sci. Rep. 12 , 1–11 (2022).

Malavolta, M. et al. A survey on computational taste predictors. Eur. Food Res. Technol. 248 , 2215–2235 (2022).

Lee, B. K. et al. A principal odor map unifies diverse tasks in olfactory perception. Science 381 , 999–1006 (2023).

Mayhew, E. J. et al. Transport features predict if a molecule is odorous. Proc. Natl. Acad. Sci. 119 , e2116576119 (2022).

Niu, Y. et al. Sensory evaluation of the synergism among ester odorants in light aroma-type liquor by odor threshold, aroma intensity and flash GC electronic nose. Food Res. Int. 113 , 102–114 (2018).

Yu, P., Low, M. Y. & Zhou, W. Design of experiments and regression modelling in food flavour and sensory analysis: A review. Trends Food Sci. Technol. 71 , 202–215 (2018).

Oladokun, O. et al. The impact of hop bitter acid and polyphenol profiles on the perceived bitterness of beer. Food Chem. 205 , 212–220 (2016).

Linforth, R., Cabannes, M., Hewson, L., Yang, N. & Taylor, A. Effect of fat content on flavor delivery during consumption: An in vivo model. J. Agric. Food Chem. 58 , 6905–6911 (2010).

Guo, S., Na Jom, K. & Ge, Y. Influence of roasting condition on flavor profile of sunflower seeds: A flavoromics approach. Sci. Rep. 9 , 11295 (2019).

Ren, Q. et al. The changes of microbial community and flavor compound in the fermentation process of Chinese rice wine using Fagopyrum tataricum grain as feedstock. Sci. Rep. 9 , 3365 (2019).

Hastie, T., Friedman, J. & Tibshirani, R. The Elements of Statistical Learning. (Springer, New York, NY). https://doi.org/10.1007/978-0-387-21606-5 (2001).

Dietz, C., Cook, D., Huismann, M., Wilson, C. & Ford, R. The multisensory perception of hop essential oil: a review. J. Inst. Brew. 126 , 320–342 (2020).

CAS   Google Scholar  

Roncoroni, Miguel & Verstrepen, Kevin Joan. Belgian Beer: Tested and Tasted. (Lannoo, 2018).

Meilgaard, M. Flavor chemistry of beer: Part II: Flavor and threshold of 239 aroma volatiles. in (1975).

Bokulich, N. A. & Bamforth, C. W. The microbiology of malting and brewing. Microbiol. Mol. Biol. Rev. MMBR 77 , 157–172 (2013).

Dzialo, M. C., Park, R., Steensels, J., Lievens, B. & Verstrepen, K. J. Physiology, ecology and industrial applications of aroma formation in yeast. FEMS Microbiol. Rev. 41 , S95–S128 (2017).

Article   PubMed   PubMed Central   Google Scholar  

Datta, A. et al. Computer-aided food engineering. Nat. Food 3 , 894–904 (2022).

American Society of Brewing Chemists. Beer Methods. (American Society of Brewing Chemists, St. Paul, MN, U.S.A.).

Olaniran, A. O., Hiralal, L., Mokoena, M. P. & Pillay, B. Flavour-active volatile compounds in beer: production, regulation and control. J. Inst. Brew. 123 , 13–23 (2017).

Verstrepen, K. J. et al. Flavor-active esters: Adding fruitiness to beer. J. Biosci. Bioeng. 96 , 110–118 (2003).

Meilgaard, M. C. Flavour chemistry of beer. part I: flavour interaction between principal volatiles. Master Brew. Assoc. Am. Tech. Q 12 , 107–117 (1975).

Briggs, D. E., Boulton, C. A., Brookes, P. A. & Stevens, R. Brewing 227–254. (Woodhead Publishing). https://doi.org/10.1533/9781855739062.227 (2004).

Bossaert, S., Crauwels, S., De Rouck, G. & Lievens, B. The power of sour - A review: Old traditions, new opportunities. BrewingScience 72 , 78–88 (2019).

Google Scholar  

Verstrepen, K. J. et al. Flavor active esters: Adding fruitiness to beer. J. Biosci. Bioeng. 96 , 110–118 (2003).

Snauwaert, I. et al. Microbial diversity and metabolite composition of Belgian red-brown acidic ales. Int. J. Food Microbiol. 221 , 1–11 (2016).

Spitaels, F. et al. The microbial diversity of traditional spontaneously fermented lambic beer. PLoS ONE 9 , e95384 (2014).

Blanco, C. A., Andrés-Iglesias, C. & Montero, O. Low-alcohol Beers: Flavor Compounds, Defects, and Improvement Strategies. Crit. Rev. Food Sci. Nutr. 56 , 1379–1388 (2016).

Jackowski, M. & Trusek, A. Non-Alcohol. beer Prod. – Overv. 20 , 32–38 (2018).

Takoi, K. et al. The contribution of geraniol metabolism to the citrus flavour of beer: Synergy of geraniol and β-citronellol under coexistence with excess linalool. J. Inst. Brew. 116 , 251–260 (2010).

Kroeze, J. H. & Bartoshuk, L. M. Bitterness suppression as revealed by split-tongue taste stimulation in humans. Physiol. Behav. 35 , 779–783 (1985).

Mennella, J. A. et al. A spoonful of sugar helps the medicine go down”: Bitter masking bysucrose among children and adults. Chem. Senses 40 , 17–25 (2015).

Wietstock, P., Kunz, T., Perreira, F. & Methner, F.-J. Metal chelation behavior of hop acids in buffered model systems. BrewingScience 69 , 56–63 (2016).

Sancho, D., Blanco, C. A., Caballero, I. & Pascual, A. Free iron in pale, dark and alcohol-free commercial lager beers. J. Sci. Food Agric. 91 , 1142–1147 (2011).

Rodrigues, H. & Parr, W. V. Contribution of cross-cultural studies to understanding wine appreciation: A review. Food Res. Int. 115 , 251–258 (2019).

Korneva, E. & Blockeel, H. Towards better evaluation of multi-target regression models. in ECML PKDD 2020 Workshops (eds. Koprinska, I. et al.) 353–362 (Springer International Publishing, Cham, 2020). https://doi.org/10.1007/978-3-030-65965-3_23 .

Gastón Ares. Mathematical and Statistical Methods in Food Science and Technology. (Wiley, 2013).

Grinsztajn, L., Oyallon, E. & Varoquaux, G. Why do tree-based models still outperform deep learning on tabular data? Preprint at http://arxiv.org/abs/2207.08815 (2022).

Gries, S. T. Statistics for Linguistics with R: A Practical Introduction. in Statistics for Linguistics with R (De Gruyter Mouton, 2021). https://doi.org/10.1515/9783110718256 .

Lundberg, S. M. et al. From local explanations to global understanding with explainable AI for trees. Nat. Mach. Intell. 2 , 56–67 (2020).

Ickes, C. M. & Cadwallader, K. R. Effects of ethanol on flavor perception in alcoholic beverages. Chemosens. Percept. 10 , 119–134 (2017).

Kato, M. et al. Influence of high molecular weight polypeptides on the mouthfeel of commercial beer. J. Inst. Brew. 127 , 27–40 (2021).

Wauters, R. et al. Novel Saccharomyces cerevisiae variants slow down the accumulation of staling aldehydes and improve beer shelf-life. Food Chem. 398 , 1–11 (2023).

Li, H., Jia, S. & Zhang, W. Rapid determination of low-level sulfur compounds in beer by headspace gas chromatography with a pulsed flame photometric detector. J. Am. Soc. Brew. Chem. 66 , 188–191 (2008).

Dercksen, A., Laurens, J., Torline, P., Axcell, B. C. & Rohwer, E. Quantitative analysis of volatile sulfur compounds in beer using a membrane extraction interface. J. Am. Soc. Brew. Chem. 54 , 228–233 (1996).

Molnar, C. Interpretable Machine Learning: A Guide for Making Black-Box Models Interpretable. (2020).

Zhao, Q. & Hastie, T. Causal interpretations of black-box models. J. Bus. Econ. Stat. Publ. Am. Stat. Assoc. 39 , 272–281 (2019).

Article   MathSciNet   Google Scholar  

Hastie, T., Tibshirani, R. & Friedman, J. The Elements of Statistical Learning. (Springer, 2019).

Labrado, D. et al. Identification by NMR of key compounds present in beer distillates and residual phases after dealcoholization by vacuum distillation. J. Sci. Food Agric. 100 , 3971–3978 (2020).

Lusk, L. T., Kay, S. B., Porubcan, A. & Ryder, D. S. Key olfactory cues for beer oxidation. J. Am. Soc. Brew. Chem. 70 , 257–261 (2012).

Gonzalez Viejo, C., Torrico, D. D., Dunshea, F. R. & Fuentes, S. Development of artificial neural network models to assess beer acceptability based on sensory properties using a robotic pourer: A comparative model approach to achieve an artificial intelligence system. Beverages 5 , 33 (2019).

Gonzalez Viejo, C., Fuentes, S., Torrico, D. D., Godbole, A. & Dunshea, F. R. Chemical characterization of aromas in beer and their effect on consumers liking. Food Chem. 293 , 479–485 (2019).

Gilbert, J. L. et al. Identifying breeding priorities for blueberry flavor using biochemical, sensory, and genotype by environment analyses. PLOS ONE 10 , 1–21 (2015).

Goulet, C. et al. Role of an esterase in flavor volatile variation within the tomato clade. Proc. Natl. Acad. Sci. 109 , 19009–19014 (2012).

Article   ADS   CAS   PubMed   PubMed Central   Google Scholar  

Borisov, V. et al. Deep Neural Networks and Tabular Data: A Survey. IEEE Trans. Neural Netw. Learn. Syst. 1–21 https://doi.org/10.1109/TNNLS.2022.3229161 (2022).

Statista. Statista Consumer Market Outlook: Beer - Worldwide.

Seitz, H. K. & Stickel, F. Molecular mechanisms of alcoholmediated carcinogenesis. Nat. Rev. Cancer 7 , 599–612 (2007).

Voordeckers, K. et al. Ethanol exposure increases mutation rate through error-prone polymerases. Nat. Commun. 11 , 3664 (2020).

Goelen, T. et al. Bacterial phylogeny predicts volatile organic compound composition and olfactory response of an aphid parasitoid. Oikos 129 , 1415–1428 (2020).

Article   ADS   Google Scholar  

Reher, T. et al. Evaluation of hop (Humulus lupulus) as a repellent for the management of Drosophila suzukii. Crop Prot. 124 , 104839 (2019).

Stein, S. E. An integrated method for spectrum extraction and compound identification from gas chromatography/mass spectrometry data. J. Am. Soc. Mass Spectrom. 10 , 770–781 (1999).

American Society of Brewing Chemists. Sensory Analysis Methods. (American Society of Brewing Chemists, St. Paul, MN, U.S.A., 1992).

McAuley, J., Leskovec, J. & Jurafsky, D. Learning Attitudes and Attributes from Multi-Aspect Reviews. Preprint at https://doi.org/10.48550/arXiv.1210.3926 (2012).

Meilgaard, M. C., Carr, B. T. & Carr, B. T. Sensory Evaluation Techniques. (CRC Press, Boca Raton). https://doi.org/10.1201/b16452 (2014).

Schreurs, M. et al. Data from: Predicting and improving complex beer flavor through machine learning. Zenodo https://doi.org/10.5281/zenodo.10653704 (2024).

Download references

Acknowledgements

We thank all lab members for their discussions and thank all tasting panel members for their contributions. Special thanks go out to Dr. Karin Voordeckers for her tremendous help in proofreading and improving the manuscript. M.S. was supported by a Baillet-Latour fellowship, L.C. acknowledges financial support from KU Leuven (C16/17/006), F.A.T. was supported by a PhD fellowship from FWO (1S08821N). Research in the lab of K.J.V. is supported by KU Leuven, FWO, VIB, VLAIO and the Brewing Science Serves Health Fund. Research in the lab of T.W. is supported by FWO (G.0A51.15) and KU Leuven (C16/17/006).

Author information

These authors contributed equally: Michiel Schreurs, Supinya Piampongsant, Miguel Roncoroni.

Authors and Affiliations

VIB—KU Leuven Center for Microbiology, Gaston Geenslaan 1, B-3001, Leuven, Belgium

Michiel Schreurs, Supinya Piampongsant, Miguel Roncoroni, Lloyd Cool, Beatriz Herrera-Malaver, Florian A. Theßeling & Kevin J. Verstrepen

CMPG Laboratory of Genetics and Genomics, KU Leuven, Gaston Geenslaan 1, B-3001, Leuven, Belgium

Leuven Institute for Beer Research (LIBR), Gaston Geenslaan 1, B-3001, Leuven, Belgium

Laboratory of Socioecology and Social Evolution, KU Leuven, Naamsestraat 59, B-3000, Leuven, Belgium

Lloyd Cool, Christophe Vanderaa & Tom Wenseleers

VIB Bioinformatics Core, VIB, Rijvisschestraat 120, B-9052, Ghent, Belgium

Łukasz Kreft & Alexander Botzki

AB InBev SA/NV, Brouwerijplein 1, B-3000, Leuven, Belgium

Philippe Malcorps & Luk Daenen

Contributions

S.P., M.S. and K.J.V. conceived the experiments. S.P., M.S. and K.J.V. designed the experiments. S.P., M.S., M.R., B.H. and F.A.T. performed the experiments. S.P., M.S., L.C., C.V., L.K., A.B., P.M., L.D., T.W. and K.J.V. contributed analysis ideas. S.P., M.S., L.C., C.V., T.W. and K.J.V. analyzed the data. All authors contributed to writing the manuscript.

Corresponding author

Correspondence to Kevin J. Verstrepen.

Ethics declarations

Competing interests.

K.J.V. is affiliated with bar.on. The other authors declare no competing interests.

Peer review

Peer review information.

Nature Communications thanks Florian Bauer, Andrew John Macintosh and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. A peer review file is available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .

About this article

Cite this article.

Schreurs, M., Piampongsant, S., Roncoroni, M. et al. Predicting and improving complex beer flavor through machine learning. Nat Commun 15, 2368 (2024). https://doi.org/10.1038/s41467-024-46346-0

Received: 30 October 2023

Accepted: 21 February 2024

Published: 26 March 2024

DOI: https://doi.org/10.1038/s41467-024-46346-0


IoT and Machine Learning Based Prediction of Smart Building Indoor Temperature

Title: IoT Security: Botnet Detection in IoT Using Machine Learning

Abstract: The growing acceptance of Internet of Things (IoT) applications and services has driven an enormous rise of interest in IoT. Organizations have begun to create various IoT-based gadgets, ranging from small personal devices such as smart watches to whole networks for smart grids, smart mining, smart manufacturing, and autonomous driverless vehicles. The overwhelming number and ubiquitous presence of these devices have attracted hackers seeking to carry out cyber-attacks and data theft. Security is considered one of the prominent challenges in IoT. The key scope of this research work is to propose an innovative model that uses machine learning algorithms to detect and mitigate botnet-based distributed denial of service (DDoS) attacks in IoT networks. Our proposed model tackles the security issue concerning the threats from bots. Different machine learning algorithms, namely K-Nearest Neighbour (KNN), Naive Bayes, and Multi-layer Perceptron Artificial Neural Network (MLP ANN), were used to develop a model trained on the BoT-IoT dataset. The best algorithm was selected based on accuracy and the area under the receiver operating characteristic curve (ROC AUC) score. Feature engineering and the synthetic minority oversampling technique (SMOTE) were combined with the machine learning algorithms (MLAs). The performance of the three algorithms was compared on both the class-imbalanced and the class-balanced dataset.
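The SMOTE step named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the function name `smote`, the toy points, and the parameters `k` and `seed` are assumptions made for illustration. It shows only the core idea, which is to create a synthetic minority-class sample by interpolating between a minority point and one of its nearest minority-class neighbours.

```python
import random


def smote(minority, n_new, k=3, seed=0):
    """Return n_new synthetic points interpolated between minority samples.

    minority: list of equal-length numeric tuples (the rare class only).
    k, seed:  illustrative parameters, not taken from the paper.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the chosen point (squared Euclidean distance)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        mate = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> mate, in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, mate)))
    return synthetic


# Toy minority class (hypothetical feature vectors, e.g. two traffic statistics).
minority_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5)]
new_points = smote(minority_points, n_new=4, k=2)
print(new_points)
```

Because every synthetic point lies on a segment between two real minority samples, oversampling of this kind balances the classes without simply duplicating rows. Synthetic samples would normally be added to the training split only, never to the data used for evaluation.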

Original Research Article

Impact of industrial policy on urban green innovation: empirical evidence of China’s national high-tech zones based on double machine learning

  • College of Economics and Management, Taiyuan University of Technology, Taiyuan, China

Effective industrial policies, particularly those aligned with environmental protection goals, are needed to drive the high-quality growth of China’s economy in the new era. Setting up national high-tech zones falls under the purview of both regional and industrial policies. Using panel data from 163 prefecture-level cities in China from 2007 to 2019, this paper empirically analyzes the impact of national high-tech zones on the level of urban green innovation and its underlying mechanisms. It treats the establishment of national high-tech zones as a quasi-natural experiment and employs a double machine learning model. The findings reveal that the national high-tech zone policy greatly enhances urban green innovation. This conclusion remains consistent after adjusting the measurement method, changing the empirical sample, and controlling for interference from other policies. The heterogeneity analysis shows that the impact of the national high-tech zone policy on green innovation varies significantly by region, with a particularly significant effect in the central and western regions. Among cities, the policy notably raises green innovation levels in second-tier, third-tier, and fourth-tier cities. The moderating effect results indicate that, at the current stage of development, transportation infrastructure primarily exerts a negative moderating effect on how the national high-tech zone policy impacts the level of urban green innovation. This research provides robust empirical evidence to inform the optimization of China’s industrial policy and the establishment of a future ecological civilization system.

1 Introduction

The Chinese economy currently focuses on high-quality development rather than rapid growth. Its traditional demographic and resource advantages are gradually diminishing, making the earlier extensive development model, which relied on excessive resource input and consumption, unsustainable. Simultaneously, resource depletion, environmental pollution, and carbon emissions are growing more severe ( Wang F. et al., 2022 ). Consequently, pursuing a mutually beneficial equilibrium between the economy and the environment has emerged as a critical concern in China’s economic growth. Green innovation, the integration of innovation with sustainable development ideas, is progressively gaining significance in reshaping China’s economic development strategy and addressing resource and environmental constraints. In light of the present circumstances, and with the objectives of carbon peaking and carbon neutrality outlined in the “3060 Plan”, the pursuit of a green and innovative development trajectory that combines heightened innovation with environmental preservation has emerged as a pivotal concern in China’s contemporary economic progress.

Industrial policy is pivotal in government intervention within market-driven resource allocation and correcting structural disparities. The government orchestrates this initiative to bolster industrial expansion and operational effectiveness. In contrast to Western industrial policies, those in China are predominantly crafted within the administrative framework and promulgated through administrative regulations. Over an extended period, numerous industrial policies have been devised in response to regional disparities in industrial development. These policies aim to identify new growth opportunities in diverse regions, focusing on optimizing and upgrading industrial structures. These strategies have been implemented at various administrative levels, from the central government to local authorities ( Sun and Sun, 2015 ). As a distinctive regional economic policy in China, the national high-tech zone represents one of the foremost supportive measures a city can acquire at the national level. Its crucial role involves facilitating the dissemination and advancement of regional economic growth. Over more than three decades, it has evolved into the primary platform through which China executes its strategy of concentrating on high-tech industries and fostering development driven by innovation. Concurrently, the national high-tech zone, operating as a geographically focused policy customized for a specific region ( Cao, 2019 ), enhances the precision of policy support for the industries under its purview, covering a more limited range of municipalities, counties, and regions. 
In contrast to conventional regional industrial policies, the industry-focused policy within national high-tech zones prioritizes comprehensive resource allocation guidance and economic foundations to maximize synergy and promote the long-term sustainable growth of the regional economy. This represents a significant paradigm shift in location-based policies within the framework of implementing the new development philosophy. Its inception embodies a combination of central authorization, high-level strategic planning, local grassroots decision-making, and innovative system development. In recent years, driven by the dual carbon objective, national high-tech zones have proactively promoted environmentally friendly innovation. Nevertheless, given the proliferation of new industrial policies and the escalating complexity of the policy framework, has the setting up of national high-tech zones genuinely elevated the level of urban green innovation compared with conventional regional industrial policies? What are the underlying mechanisms? Furthermore, given the variations among cities, have the industrial policy tools within the national high-tech zones been employed judiciously and adaptably? What are the concrete practical outcomes? Investigating these matters has emerged as a significant subject requiring resolution by government, industry, and academia.

2 Literature review and research hypothesis

2.1 Literature review

When considering industrial policy, the setting up of national high-tech zones embodies the intersection of regional and industrial policies. Domestic and international academic research on national high-tech zones primarily centers on economic activities and innovation. Notably, the economic impact of national high-tech zones encompasses a wide range of factors, including their influence on total factor productivity ( Tan and Zhang, 2018 ; Wang and Liu, 2023 ), foreign trade ( Alder et al., 2016 ), industrial structure upgrades ( Yuan and Zhu, 2018 ), and economic growth ( Liu and Zhao, 2015 ; Huang and Fernández-Maldonado, 2016 ; Wang Z. et al., 2022 ). Regarding innovation, numerous researchers have confirmed the positive effects of national high-tech zones on company innovation ( Vásquez-Urriago et al., 2014 ; Díez-Vial and Fernández-Olmos, 2017 ; Wang and Xu, 2020 ); nevertheless, a few scholars have disagreed on this matter ( Hong et al., 2016 ; Sosnovskikh, 2017 ). In general, the consensus among scholars is that setting up national high-tech zones fosters regional innovation significantly. This consensus is supported by various aspects of innovation, including innovation efficiency ( Park and Lee, 2004 ; Chandrashekar and Bala Subrahmanya, 2017 ), agglomeration effect ( De Beule and Van Beveren, 2012 ), and innovation capability ( Yang and Guo, 2020 ), among other relevant dimensions. The existing literature predominantly delves into the correlation between the setting up of national high-tech zones, innovation, and economic significance. However, the rise of the digital economy, notably industrial digitization, has accentuated the limitations of the traditional innovation paradigm. These shortcomings, such as the inadequate exploration of the social importance and sustainability of innovation, have become apparent in recent years. 
As the primary driver of sustainable development, green innovation represents a potent avenue for achieving economic benefits and environmental value ( Weber et al., 2014 ). Its distinctiveness from other forms of innovation lies in its potential to facilitate the transformation of development modes, reshape economic structures, and address pollution prevention and control challenges. In the context of green innovation, however, Wang et al. (2020), using a double-difference (difference-in-differences) approach, pointed out that national high-tech zones enhance the effectiveness of urban green innovation, but that this effect is significant only in the eastern region.

Furthermore, scholars have also explored the mechanisms underlying the innovation effects of national high-tech zones. For example, Cattapan et al. (2012) focused on science parks in Italy and found that technology transfer services positively influence product innovation. Albahari et al. (2017) confirmed that higher education institutions’ involvement in advancing corporate innovation within technology and science parks has a beneficial moderating effect. Taking the moderating effect of spatial agglomeration as a basis, Li WH. et al. (2022) found that industrial agglomeration has a significantly unfavorable moderating influence on the effectiveness of performance transformation in national high-tech zones. Multiple studies have examined the regulatory framework of the national high-tech zone industrial policy and urban innovation. However, in the age of rapidly expanding new infrastructure, infrastructure construction is concentrated on information technologies such as blockchain, big data, cloud computing, artificial intelligence, and the Internet. Further research is needed to explore whether traditional infrastructure, particularly transportation infrastructure, can promote urban green innovation. Transportation infrastructure has consistently been vital in fostering economic expansion, integrating regional resources, and facilitating coordinated development ( Behrens et al., 2007 ; Zhang et al., 2018 ; Pokharel et al., 2021 ). Therefore, it is necessary to investigate whether transportation infrastructure can continue to encourage urban green innovation in the digital economy.

In summary, the existing literature has extensively examined the influence of national high-tech zones on economic growth and innovation from various levels and perspectives, establishing a solid foundation and offering valuable research insights for this study. Nonetheless, previous studies have frequently overlooked the impact of national high-tech zones on urban green innovation levels, a gap this paper aims to address. Further exploration is needed to understand how the industrial policy of national high-tech zones relates to urban green innovation, and the research model and methodology also need further improvement and refinement. On this basis, this paper discusses the industrial policy effects of national high-tech zones from the perspective of urban green innovation to enrich and expand the existing research.

In contrast to earlier research, the marginal contribution of this paper lies in three dimensions: 1) Most scholars have focused on the effects of national high-tech zones on economic activity and innovation, with less emphasis on green innovation; studies from the perspective of the level of urban green innovation are rare. This work enriches the existing research on national high-tech zones as an industrial policy. 2) Regarding the research methodology, the Double Machine Learning (DML) approach is used to evaluate the policy effects of national high-tech zones, leveraging the advantages of machine learning algorithms for high-dimensional and non-parametric prediction. This approach circumvents the problems of model misspecification bias and the “curse of dimensionality” encountered in traditional econometric models ( Chernozhukov et al., 2018 ), enhancing the credibility of the research findings. 3) By introducing transportation infrastructure as a moderator variable, this study investigates the underlying mechanism through which national high-tech zones affect urban green innovation, offering suggestions for maximizing the policy effect of these zones.
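The partialling-out idea behind DML can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the estimator or data used in this paper: the data are simulated, the outcome is assumed to follow a partially linear model, Y = θD + g(X) + ε with D = m(X) + v, and a one-dimensional k-nearest-neighbour average stands in for the machine learning learners. All function names, functional forms, and parameter values are hypothetical.

```python
import math
import random


def simulate(n, theta=0.5, seed=0):
    """Simulated (hypothetical) data: y = theta*d + g(x) + e, d = m(x) + v."""
    rng = random.Random(seed)
    x = [rng.uniform(-2.0, 2.0) for _ in range(n)]
    d = [math.sin(xi) + rng.gauss(0.0, 1.0) for xi in x]      # m(x) = sin(x)
    y = [theta * di + 2.0 * xi + rng.gauss(0.0, 1.0)          # g(x) = 2x
         for di, xi in zip(d, x)]
    return y, d, x


def knn_mean(train_x, train_t, x0, k):
    """Nuisance learner: mean target of the k training points closest to x0."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x0))[:k]
    return sum(train_t[i] for i in nearest) / len(nearest)


def dml_plr(y, d, x, n_folds=2, k=15, seed=0):
    """Cross-fitted partialling-out estimate of theta."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    folds = [idx[f::n_folds] for f in range(n_folds)]
    num = den = 0.0
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        tx = [x[i] for i in train]
        ty = [y[i] for i in train]
        td = [d[i] for i in train]
        for i in held_out:
            # Residualize outcome and treatment with models fitted on the other folds.
            y_res = y[i] - knn_mean(tx, ty, x[i], k)
            d_res = d[i] - knn_mean(tx, td, x[i], k)
            num += d_res * y_res
            den += d_res * d_res
    return num / den


def naive_slope(y, d):
    """Slope from regressing y on d alone, ignoring the control x."""
    my, md = sum(y) / len(y), sum(d) / len(d)
    cov = sum((di - md) * (yi - my) for di, yi in zip(d, y))
    var = sum((di - md) ** 2 for di in d)
    return cov / var


y, d, x = simulate(1000)
theta_hat = dml_plr(y, d, x)
naive = naive_slope(y, d)
print(f"DML estimate: {theta_hat:.3f}  naive slope: {naive:.3f}  (true theta = 0.5)")
```

In this simulation the control x drives both the treatment and the outcome, so the naive slope is biased upward, while the cross-fitted residual-on-residual estimate should land close to the true effect. In applied work the k-nearest-neighbour learner would be replaced by flexible learners suited to many controls, such as random forests or lasso.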

2.2 Theoretical analysis and hypotheses

2.2.1 National high-tech zones’ industrial policies and urban green innovation

As one of the ways to land industrial policies at the national level, national high-tech zones serve as effective driving forces for enhancing China’s ability to innovate regionally and its contribution to economic growth ( Xu et al., 2022 ). Green innovation is a novel form of innovation activity that harmoniously balances the competing goals of environmental preservation and technological advancement, facilitating the superior expansion of the economy by alleviating the strain on resources and the environment ( Li, 2015 ). National high-tech zones mainly impact urban green innovation through three main aspects. Firstly, based on innovation compensation effects, national high-tech zones, established based on the government’s strategic planning, receive special treatment in areas such as land, taxation, financing, credit, and more, serving as pioneering special zones and experimental fields established by the government to promote high-quality regional development. When the government offers R&D subsidies to enterprises engaged in green innovation activities within the zones, enterprises are inclined to respond positively to the government’s policy support and enhance their level of green innovation as a means of seeking external legitimacy ( Fang et al., 2021 ), thereby contributing to the advancement of urban green innovation. Secondly, based on the industrial restructuring effect, strict regulation of businesses with high emissions, high energy consumption, and high pollution levels is another aspect of implementing the national high-tech zone program. Consequently, businesses with significant emissions and energy consumption are required to optimize their industrial structure to access various benefits within the park, resulting in the gradual transformation and upgrading of high-energy-consumption industries towards green practices, thereby further contributing to regional green innovation. 
Based on Porter’s hypothesis, the green and low-carbon requirements of the park policy increase the production costs for polluting industries, prompting polluting enterprises to upgrade their existing technology and adopt green innovation practices. Lastly, based on the theory of industrial agglomeration, the national high-tech zones’ industrial policy facilitates the concentration of innovative talents to a certain extent, resulting in intensified competition in the green innovation market. Increased competition fosters the sharing of knowledge, technology, and talent, stimulating a market environment where the survival of the fittest prevails ( Melitz and Ottaviano, 2008 ). These increase the effectiveness of urban green innovation, helping to propel urban green innovation forward. Furthermore, the infrastructure development within the national high-tech zones establishes a favorable physical environment for enterprises to engage in creative endeavors. Also, it enables the influx of high-quality innovation capital from foreign sources, complementing the inherent characteristics of national high-tech zones that attract such capital and concentrate green innovation resources, ultimately resulting in both environmental and economic benefits. Based on the above analysis, Hypothesis 1 is proposed:

Hypothesis 1. Implementing industrial policies in national high-tech zones enhances levels of urban green innovation.

2.2.2 Heterogeneity analysis

Given the variations in economic foundations, industrial statuses, and population distributions across different regions, development strategies in different regions are also influenced by these variations ( Chen and Zheng, 2008 ). Theoretically, when using administrative boundaries or geographic locations as benchmarks, the impact of national high-tech zone industrial policy on urban green innovation should be achieved through strategies like aligning with the region’s existing industrial structure. Compared to the western and central regions, the eastern region exhibits greater innovation and dynamism due to advantages such as a developed economy, good infrastructure, advanced management concepts, and technologies, combined with a relatively high initial level of green innovation factor endowment. Considering the principle of diminishing marginal effects of green innovation, industrial policy implementation in national high-tech zones amounts to “icing on the cake” in the eastern region, in contrast to “sending charcoal in the snow” (timely help where it is most needed) in the central and western regions. In other words, the effect of national high-tech zones in promoting urban green innovation may be weaker in the eastern region than in the central and western regions. The literature confirms that establishing national high-tech zones yields a more beneficial technology agglomeration effect in the less developed central and western regions ( Liu and Zhao, 2015 ), leading to a more substantial impact on enhancing the level of urban green innovation.

Moreover, local governments consider economic development, industrial structure, and infrastructure levels when establishing national high-tech zones. These factors serve as the foundation for regional classification to address variations in regional quality and to compensate for gaps in theoretical research on the link between national high-tech zone industrial policy implementation and urban green innovation. Consequently, the execution of industrial policies in national high-tech zones relies on other vital factors influencing urban green innovation. Significant variations exist in economic development and infrastructure levels among cities of different grades ( Luo and Wang, 2023 ). Generally, cities with higher rankings exhibit strong economic growth and infrastructure, contrasting those with lower rankings. Consequently, the effect of establishing a national high-tech zone on green innovation may vary across different city grades. Thus, considering the disparities across city rankings, we delve deeper into identifying the underlying reasons for regional diversity in the green innovation outcomes of industrial policies implemented in national high-tech zones based on city grades. Based on the above analysis, Hypothesis 2 is proposed:

Hypothesis 2. There is regional heterogeneity and city-level heterogeneity in the impact of national high-tech zone policies on the level of urban green innovation.

2.2.3 The moderating effect of transportation infrastructure

Implementing industrial policies and facilitating the flow of innovation factors are closely intertwined with the role of transport infrastructure as carriers and linkages. Generally, enhanced transportation infrastructure facilitates the absorption of local factors and improves resource allocation efficiency, thereby influencing the spatial redistribution of production factors like labor, resources, and technology across cities. Enhanced transportation infrastructure fosters the development of more robust and advanced innovation networks ( Fritsch and Slavtchev, 2011 ). Banister and Berechman (2001) highlighted that transportation infrastructure exhibits network properties that are fundamental to its agglomeration or diffusion effects. From this perspective, robust infrastructure impacts various economic activities, including interregional labor mobility, factor agglomeration, and knowledge exchange among firms, thereby expediting the spillover effects of green technological innovations ( Yu et al., 2013 ). In turn, this could positively moderate the influence of national high-tech zone policies on green innovation. On the other hand, while transportation infrastructure facilitates the implementation of national high-tech zone policies, it also brings negative impacts, including high pollution, emissions, and ecological landscape fragmentation. Improving transportation infrastructure can also lead to the “relative congestion effect” in national high-tech zones. This phenomenon, observed in specific regions, refers to the excessive concentration of similar enterprises across different links of the same industrial chain, which exacerbates the competition for innovation resources among enterprises, making it challenging for enterprises in the region to allocate their limited innovation resources to technological research and development activities ( Li et al., 2015 ). As a result, the level of green innovation is lower than it would otherwise be.
Therefore, the impact of transportation infrastructure at the current stage of development is likely to be complex. When the level of transport infrastructure is moderate, it supports the promotion of urban green innovation through national high-tech zone policies. Beyond that level, however, the moderating effect of transport infrastructure may turn negative. Based on the above analysis, Hypothesis 3 is proposed:

Hypothesis 3. Transportation infrastructure moderates the relationship between national high-tech zones and levels of urban green innovation.

3 Research design

3.1 Model setting

This research explores the impact of industrial policies of national high-tech zones on the level of urban green innovation. Many related studies utilize traditional causal inference models to assess the impact of these policies. However, these models have several limitations in their application. For instance, the commonly used difference-in-differences model relies on the parallel trend test, which imposes stringent requirements on the sample data. Although the synthetic control approach can create a virtual control group that satisfies the parallel trend requirement, it is limited to addressing the ‘one-to-many’ problem and requires excluding groups with extreme values. The selection of matching variables in propensity score matching is subjective, among other limitations ( Zhang and Li, 2023 ). To address the limitations of conventional causal inference models, scholars have started to explore applying machine learning to infer causality ( Chernozhukov et al., 2018 ; Knittel and Stolper, 2021 ). Machine learning algorithms excel at prediction, which supports a less biased assessment of the effect on the target variable.

DML was formally proposed in 2018 ( Chernozhukov et al., 2018 ). In contrast to traditional machine learning algorithms, it offers a more robust approach to causal inference by mitigating bias through the incorporation of residual modeling. Currently, some scholars utilize DML to assess causality in economic phenomena. For instance, Hull and Grodecka-Messi (2022) examined the effects of local taxation, crime, education, and public services on migration using DML in the context of Swedish cities between 2010 and 2016. These existing research findings serve as valuable references for this study. Compared to traditional causal inference models, DML offers distinct advantages in variable selection and model estimation ( Zhang and Li, 2023 ). In promoting urban green innovation in China, there is a high probability of non-linear relationships between variables, so a traditional linear regression model may lead to bias and errors, whereas the double machine learning model can effectively avoid problems such as model setting bias. Based on this, the present study employs a DML model to evaluate the policy implications of establishing a national high-tech zone.

3.1.1 Double machine learning framework

Prior to applying the DML algorithm, this paper refers to the practice of Chernozhukov et al. (2018) to construct a partially linear DML model, as depicted in Eq. 1 below:
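Written in the notation defined in the next paragraph, the partially linear model of Chernozhukov et al. (2018) takes the form below. This is a sketch consistent with the surrounding definitions; stating the zero-conditional-mean restriction on the error term alongside the model is an assumption about how Eq. 1 is laid out.

$$
\ln GI_{it} = \theta_0 \, Zone_{it} + g(X_{it}) + U_{it}, \qquad E\left[U_{it} \mid Zone_{it}, X_{it}\right] = 0 \tag{1}
$$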

where $i$ represents the city, $t$ represents the year, and $\ln GI_{it}$ represents the explained variable, which in this paper is the green innovation level of the city. $Zone_{it}$ represents the disposal (treatment) variable, which in this case is the national high-tech zone policy variable. It takes a value of 1 after the implementation of the pilot and 0 otherwise. $\theta_0$ is the disposal coefficient that is the focus of this paper. $X_{it}$ represents the set of high-dimensional control variables. Machine learning algorithms are utilized to estimate the specific form $\hat{g}(X_{it})$, whereas $U_{it}$, which has a conditional mean of 0, stands for the error term. $n$ represents the sample size. Direct estimation of Eq. 1 provides an estimate of the disposal coefficient.
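Assuming the nuisance function is replaced by its machine-learning estimate $\hat{g}(X_{it})$, the direct estimator can be sketched as the least-squares coefficient obtained by regressing $\ln GI_{it} - \hat{g}(X_{it})$ on $Zone_{it}$. The expression below follows the standard exposition in Chernozhukov et al. (2018) and is offered as a reconstruction of Eq. 2:

$$
\hat{\theta}_0 = \left( \frac{1}{n} \sum_{i \in I,\, t \in T} Zone_{it}^{2} \right)^{-1} \frac{1}{n} \sum_{i \in I,\, t \in T} Zone_{it} \left( \ln GI_{it} - \hat{g}(X_{it}) \right) \tag{2}
$$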

We can further explore the estimation bias by combining Eqs 1 , 2 as depicted in Eq. ( 3 ) below:
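Substituting the model for $\ln GI_{it}$ into the direct estimator and scaling by $\sqrt{n}$ gives the decomposition below. It is a worked step under the assumption that the direct estimator is the least-squares coefficient of $\ln GI_{it} - \hat{g}(X_{it})$ on $Zone_{it}$; the $1/\sqrt{n}$ scaling is the one used in Chernozhukov et al. (2018).

$$
\sqrt{n}\left(\hat{\theta}_0 - \theta_0\right) = \underbrace{\left( \frac{1}{n} \sum_{i \in I,\, t \in T} Zone_{it}^{2} \right)^{-1} \frac{1}{\sqrt{n}} \sum_{i \in I,\, t \in T} Zone_{it} U_{it}}_{a} + \underbrace{\left( \frac{1}{n} \sum_{i \in I,\, t \in T} Zone_{it}^{2} \right)^{-1} \frac{1}{\sqrt{n}} \sum_{i \in I,\, t \in T} Zone_{it} \left[ g(X_{it}) - \hat{g}(X_{it}) \right]}_{b} \tag{3}
$$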

where $a = \left( \frac{1}{n} \sum_{i \in I,\, t \in T} Zone_{it}^{2} \right)^{-1} \frac{1}{\sqrt{n}} \sum_{i \in I,\, t \in T} Zone_{it} U_{it}$ follows a normal distribution with mean 0, and $b = \left( \frac{1}{n} \sum_{i \in I,\, t \in T} Zone_{it}^{2} \right)^{-1} \frac{1}{\sqrt{n}} \sum_{i \in I,\, t \in T} Zone_{it} \left[ g(X_{it}) - \hat{g}(X_{it}) \right]$. It is important to note that DML utilizes machine learning and a regularization algorithm to estimate the specific functional form $\hat{g}(X_{it})$. Regularization inevitably introduces “regularization bias”: it prevents the estimates from having excessive variance, but at the cost of unbiasedness. Specifically, $\hat{g}(X_{it})$ converges to $g(X_{it})$ at the rate $n^{-\varphi_g}$ with $n^{-\varphi_g} > n^{-1/2}$, which is slower than the parametric rate. Hence, as $n$ tends to infinity, $b$ also tends to infinity, and $\hat{\theta}_0$ has difficulty converging to $\theta_0$. To expedite convergence and ensure unbiasedness of the disposal coefficient estimates with small samples, an auxiliary regression is constructed as follows:
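In the notation defined in the next paragraph, the auxiliary regression of the disposal variable on the controls can be written as below. This is a sketch of Eq. 4 consistent with those definitions, following Chernozhukov et al. (2018):

$$
Zone_{it} = m(X_{it}) + V_{it}, \qquad E\left[V_{it} \mid X_{it}\right] = 0 \tag{4}
$$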

where $m(X_{it})$ represents the regression function of the disposal variable on the high-dimensional control variables; its specific form $\hat{m}(X_{it})$ also requires estimation using a machine learning algorithm. Additionally, $V_{it}$ represents the error term with a conditional mean of 0.
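The logic of combining the main and auxiliary regressions can be illustrated with a minimal, standard-library sketch. Several things here are assumptions rather than the paper's implementation: the data are simulated, a k-nearest-neighbour average stands in for the random forest, a single control variable replaces the high-dimensional set, the 1:4 split is read as 5-fold cross-fitting, and the "partialling-out" form of the estimator (residualize both the outcome and the treatment on the controls, then regress residual on residual) is used for simplicity. The names `knn_mean` and `dml_partial_linear` are hypothetical.

```python
import math
import random


def knn_mean(train_x, train_v, x0, k):
    """Average of train_v over the k training points whose x is closest to x0."""
    order = sorted(range(len(train_x)), key=lambda j: abs(train_x[j] - x0))
    nearest = order[:k]
    return sum(train_v[j] for j in nearest) / len(nearest)


def dml_partial_linear(y, d, x, n_folds=5, k=15, seed=0):
    """Cross-fitted partialling-out estimate of theta in y = theta*d + g(x) + u."""
    n = len(y)
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    # Each fold is held out once while the other four train the learners.
    folds = [idx[f::n_folds] for f in range(n_folds)]
    y_res, d_res = [0.0] * n, [0.0] * n
    for held_out in folds:
        held = set(held_out)
        train = [j for j in idx if j not in held]
        tx = [x[j] for j in train]
        ty = [y[j] for j in train]
        td = [d[j] for j in train]
        for i in held_out:
            y_res[i] = y[i] - knn_mean(tx, ty, x[i], k)  # outcome minus E[y | x]
            d_res[i] = d[i] - knn_mean(tx, td, x[i], k)  # treatment minus m(x)
    # Residual-on-residual least squares gives the treatment coefficient.
    return sum(a * b for a, b in zip(d_res, y_res)) / sum(a * a for a in d_res)


# Simulated data (hypothetical): true effect 0.3, non-linear confounding via x.
rng = random.Random(42)
n = 600
x = [rng.uniform(-2.0, 2.0) for _ in range(n)]
d = [1.0 if rng.random() < 1.0 / (1.0 + math.exp(-xi)) else 0.0 for xi in x]
y = [0.3 * di + math.sin(xi) + rng.gauss(0.0, 0.3) for di, xi in zip(d, x)]
theta_hat = dml_partial_linear(y, d, x)
print(round(theta_hat, 3))
```

Because the nuisance predictions for each observation come from learners trained on the other folds, the errors in the two predictions enter the estimate only through their product, which is what makes the estimator less sensitive to regularization bias.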

3.1.2 The test of the moderating effect within the DML framework

This study investigates how the national high-tech zone industrial policy influences urban green innovation. It incorporates moderating variables within the DML framework, drawing on the testing procedure outlined by Jiang (2022) and combining it with the practice of He et al. (2022) , as outlined below:

Equation 5 is based on Eq. 1 with the addition of the variables $\ln tra_{it}$ and $Zone_{it} \times \ln tra_{it}$, where $\ln tra_{it}$ represents the moderating variable, which in this paper is the transportation infrastructure, and $Zone_{it} \times \ln tra_{it}$ represents the interaction term of the moderating variable and the disposal variable. The variables $\ln tra_{it}$ and $Zone_{it}$ are added to the high-dimensional control variables $X_{it}$, and the rest of the variables in Eq. 5 are identical to Eq. 1. $\theta_1$ represents the disposal coefficient of interest.

3.2 Variable selection

3.2.1 Dependent variable: level of urban green innovation (lngi)

Many academics use indicators such as the number of patent applications or patent authorizations to assess the degree of urban innovation. More precisely, the number of patent applications measures technological innovation effort, while authorized patents have undergone strict auditing and more directly reflect the achievements and capacity of scientific and technological innovation. Thus, this paper follows Zhou and Shen (2020) and Li X. et al. (2022) in using the count of authorized green invention patents in each prefecture-level city to indicate the level of green innovation. For the empirical study, the natural logarithm of the count of authorized green patents plus 1 is used.
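The transformation can be sketched as follows; the patent counts are hypothetical and the function name `log_green_patents` is illustrative. Adding 1 before taking the logarithm keeps the variable defined for cities with zero authorized green patents.

```python
import math


def log_green_patents(count):
    """Return ln(1 + count), which is defined when a city has zero patents."""
    if count < 0:
        raise ValueError("patent counts cannot be negative")
    return math.log1p(count)


counts = [0, 3, 120]  # hypothetical counts of authorized green patents
lngi = [log_green_patents(c) for c in counts]
print(lngi)
```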

3.2.2 Disposal variable: dummy variables for national high-tech zones (Zone)

The value of the national high-tech zone dummy variable is determined by the city and by the list of national high-tech zones released by China’s Ministry of Science and Technology. If a national high-tech zone was established in the city by 2017, the value is set to 1 for the year the high-tech zone is established and subsequent years. Otherwise, it is set to 0.
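A minimal sketch of this coding rule is given below. The city names and establishment years are hypothetical, and the function name `zone_dummy` and its `cutoff` parameter are illustrative; the 2017 cutoff and the 2007 to 2019 window follow the text.

```python
def zone_dummy(year, established, cutoff=2017):
    """1 from the establishment year onward if a zone exists by the cutoff, else 0."""
    if established is None or established > cutoff:
        return 0
    return 1 if year >= established else 0


# Hypothetical cities: None means no national high-tech zone was established.
established_year = {"City A": 2010, "City B": None, "City C": 2018}
panel = {
    (city, year): zone_dummy(year, est)
    for city, est in established_year.items()
    for year in range(2007, 2020)
}
print(panel[("City A", 2009)], panel[("City A", 2010)])
```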

3.2.3 Moderating variable: transportation infrastructure (lntra)

Previous studies have shown that highway freight accounts for 75% of China’s total freight transport ( Li and Tang, 2015 ). Highway transportation infrastructure has a significant influence on the evolution of the Chinese economy, and the development and improvement of highway infrastructure are crucial for modern transportation. Following Wu (2019) , this paper uses the ratio of roadway mileage (measured in kilometers) to population as a measure of the quality of the transportation system.

3.2.4 Control variables

(1) Foreign direct investment (lnfdi): There is general agreement among academics that foreign direct investment (FDI) significantly influences urban green innovation, as FDI provides expertise in management, human resources, and cutting-edge industrial technology ( Luo et al., 2021 ). Thus, it is necessary to consider and control the level of FDI. This paper uses the ratio of foreign investment to the local GDP in a million yuan.

(2) Financial development level (lnfd): Innovation in science and technology is greatly aided by finance. For the green innovation-driven strategy to advance, it is imperative that funding for science and technology innovation be strengthened. The amount of capital raised for innovation is strongly impacted by the state of urban financial development ( Zhou and Du, 2021 ). Thus, this paper uses the loan balance to GDP ratio as an indicator.

(3) Human capital (lnhum): Highly skilled human capital is essential for cities to drive green innovation. Generally, highly qualified human capital significantly boosts green innovation ( Ansaris et al., 2016 ). Therefore, this paper measures human capital by the proportion of people in the city who have completed a bachelor’s degree or above.

(4) Industrial structure (lnind): Generally, the secondary industry in China is the primary source of pollution, and there is a significant impact of industrial structure on green innovation ( Qiu et al., 2023 ). The metric used in this paper is the secondary industry-to-GDP ratio for the area.

(5) Regional economic development level (lnagdp): A region’s level of economic growth is indicative of the material foundation for urban green innovation and influences the growth of green innovation in the region ( Bo et al., 2020 ). This research uses the annual gross domestic product per capita as a measurement.

3.3 Data source

By 2017, China had developed 157 national high-tech zones in total. In line with the study’s objectives, the sample is adjusted and screened as follows. The study’s sample period spans from 2007 to 2019. The 57 national high-tech zones that were created prior to 2000 are omitted to lessen the impact on the test results of cities whose high-tech zones were founded before 2007. Because high-tech zones in county-level cities are limited in their ability to promote urban green innovation, 8 high-tech zones located in county-level cities are excluded. A further 4 high-tech zones with severely missing data are also excluded. This leaves 88 high-tech zones, which are distributed across 83 prefecture-level cities because some cities host more than one zone. As a result, 83 cities are selected as the experimental group for this study. Additionally, a control group of 80 cities was selected from among those that did not have high-tech zones by the end of 2019, resulting in a final sample size of 163 cities. This paper collects green patent data for each city from the China Green Patent Statistical Report published by the State Intellectual Property Office. The list of national high-tech zones and their years of establishment was compiled from the official government website. The remaining data primarily originate from the China Urban Statistical Yearbook (2007–2019), the EPS database, and the official websites of the respective cities’ Bureaus of Statistics. Missing values were addressed through linear interpolation. To address heteroskedasticity in the model, the study logarithmically transforms the variables, excluding the disposal variable. Table 1 shows the descriptive analysis of the variables.
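The linear interpolation of missing values can be sketched as below for a single city's yearly series. The numbers are hypothetical, the function name `interpolate_linear` is illustrative, and leaving leading or trailing gaps unfilled is an assumption, since the text does not say how gaps at the ends of a series were handled.

```python
def interpolate_linear(values):
    """Fill interior None gaps on a straight line between observed neighbours."""
    filled = list(values)
    known = [i for i, v in enumerate(filled) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (filled[right] - filled[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = filled[left] + step * (i - left)
    return filled  # gaps before the first or after the last observation stay None


series = [10.0, None, None, 16.0, None]  # hypothetical yearly values for one city
print(interpolate_linear(series))
```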

Table 1. Descriptive analysis.

4 Empirical analysis

4.1 National high-tech zones’ policy effects on urban green innovation

This study utilizes the DML model to estimate the impact of industrial policies implemented in national high-tech zones on the level of urban green innovation. Following the approach of Zhang and Li (2023) , the sample is split in a ratio of 1:4, the random forest algorithm is used for prediction, and Eq. ( 1 ) is combined with Eq. ( 4 ) for the regression. Table 2 presents the results with and without controlling for time and city effects. The estimated treatment effects in the four columns are 0.376, 0.293, 0.396, and 0.268, respectively, each of which is significant at the 1% level. Thus, Hypothesis 1 is supported.

Table 2. Benchmark regression results.

4.2 Robustness tests

4.2.1 Eliminating the influence of extreme values

To reduce the impact of extreme values on the estimation outcomes, all variables in the benchmark regression, excluding the disposal variable, are winsorized at the upper and lower 1% and 5% quantiles: values below the lower quantile or above the upper quantile are replaced with the corresponding quantile value. The regressions are then re-estimated. Table 3 demonstrates that treating extreme values in this way did not substantially alter the findings of this study.
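The shrinkage step can be sketched as a two-sided winsorization. The quantile convention used here (the default "exclusive" method of Python's `statistics.quantiles`) is an assumption, since the text does not state which convention was applied, and the data are hypothetical.

```python
import statistics


def winsorize(values, pct):
    """Clip values to the pct-th and (100 - pct)-th percentiles of the sample."""
    cuts = statistics.quantiles(values, n=100)  # 99 percentile cut points
    lower, upper = cuts[pct - 1], cuts[99 - pct]
    return [min(max(v, lower), upper) for v in values]


data = list(range(1, 101))  # hypothetical variable with values 1..100
w5 = winsorize(data, 5)     # clip at the 5% and 95% quantiles
w1 = winsorize(data, 1)     # clip at the 1% and 99% quantiles
print(min(w5), max(w5))
```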

Table 3. Extreme values removal results.

4.2.2 Considering province-time interaction fixed effects

Since provinces are critical administrative units in the governance system of the Chinese government, cities within the same province often share similarities in policy environment and location characteristics. Therefore, to account for the influence of temporal changes across different provinces, this study adds province-time interaction fixed effects to the benchmark regression. Table 4 presents the regression results. After accounting for the correlation between city characteristics within the same province, national high-tech zone policies continue to significantly influence urban green innovation at the 1% level.

Table 4. The addition of province and time fixed effects interaction terms.

4.2.3 Excluding other policy disturbances

When analyzing how national high-tech zones affect urban green innovation, the estimate is susceptible to the influence of concurrent policies. This study accounts for other comparable policies during the same period to ensure an accurate estimation of the policy effect. Since 2007, other policies, including the development of “smart cities,” have been successively implemented alongside national high-tech zone policies. Therefore, this study incorporates a policy dummy variable for “smart cities” in the benchmark regression. The specific regression findings are shown in Table 5 . After controlling for the impact of concurrent policies, the significance of the national high-tech zones’ policy effect remains unchanged.

Table 5. Results of removing the impact of parallel policies.

4.2.4 Resetting the DML model

To mitigate the potential bias introduced by the settings of the DML model, this study assesses the robustness of the conclusions using the following methods. First, the sample split ratio of the DML model is adjusted from 1:4 to 1:2 to examine its potential impact on the conclusions. Second, the random forest algorithm used for prediction is replaced with lasso regression, gradient boosting, and neural networks to investigate the potential influence of the prediction algorithm. Third, the benchmark regression is built on a partially linear DML model, which involves subjective decisions regarding the model form. Therefore, DML was employed to construct a more general interactive model, aiming to assess the influence of model settings on the conclusions of this study. The main and auxiliary regressions utilized for the analysis were modified as follows:
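In the interactive model of Chernozhukov et al. (2018), the disposal variable enters the outcome function non-separably rather than additively. Assuming the paper adopts that standard specification with its own variable names, the main and auxiliary regressions (Eqs. 7 and 8) would read:

$$
\ln GI_{it} = g(Zone_{it}, X_{it}) + U_{it} \tag{7}
$$

$$
Zone_{it} = m(X_{it}) + V_{it} \tag{8}
$$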

Combining Eqs ( 7 ), ( 8 ) for the regression, the interactive model yielded estimated coefficients for the disposition effect:
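Under the standard interactive specification of Chernozhukov et al. (2018), which is assumed here, the disposal effect is the average difference between the outcome function evaluated with and without the policy. A sketch of Eq. 9 in that form:

$$
\theta_0 = E\left[ g(Zone_{it} = 1, X_{it}) - g(Zone_{it} = 0, X_{it}) \right] \tag{9}
$$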

The results of Eq. ( 9 ) are shown in column (5) of Table 6 . And all the regression results obtained from the modified DML model are presented in Table 6 .

Table 6. Results of resetting the DML model.

The findings indicate that the sample split ratio in the DML model, the prediction algorithm used, or the model estimation approach does not impact the conclusion that the national high-tech zone policy raises urban areas’ level of green innovation. These factors only modify the magnitude of the policy effect to some degree.

4.3 Heterogeneity analysis

4.3.1 Regional heterogeneity

The sample cities were further divided into the east, central, and west regions based on the three major economic subregions to examine regional variations in the effects of national high-tech zone policies on urban green innovation, with the results presented in Table 7 . National high-tech zone policies do not have a statistically significant effect on urban green innovation in the eastern region. However, they have a considerable beneficial influence in the central and western regions. The lack of statistical significance in the east may be explained by the possibility that setting up national high-tech zones there creates obstacles to the growth of urban green innovation, such as resource strain and environmental pollution. Given the central and western regions’ relatively underdeveloped economic status and industrial structure, and in line with the preceding theoretical analysis, establishing national high-tech zones acts as a crucial catalyst there, significantly boosting urban green innovation levels. Furthermore, the central government emphasizes that the establishment of national high-tech zones should consider regional resource endowments and local conditions, implementing tailored policies. The central and western regions possess unique geographic locations and natural conditions that make them well-suited for developing solar energy, wind energy, and other forms of green energy. Compared to the central region, the national high-tech zone initiative has a more pronounced impact on promoting urban green innovation in the western region. While the western region’s urban innovation environment needs further optimization, the policy on national high-tech zones has a stronger incentive effect in this region due to its greater development potential, positive transformation of industrial structure, and increased policy support from the state, including the development strategy for the western region.

Table 7. Heterogeneity test results for different regions.

4.3.2 Urban hierarchical heterogeneity

The New Tier 1 Cities Institute’s ‘2020 City Business Charm Ranking’ is the basis for this study, with the sample cities categorized into Tier 1 (New Tier 1), Tier 2, Tier 3, Tier 4, and Tier 5. Table 8 presents the regression findings for each of the groups.

Table 8. Heterogeneity test results for different classes of cities.

The results in Table 8 reveal significant heterogeneity at the city level regarding national high-tech zones’ effects on urban green innovation, confirming Hypothesis 2 . In particular, the coefficients for the first-tier cities are not statistically significant due to the small sample size, and the coefficients for the fifth-tier cities are not significant either. The latter could be attributed to the relatively weak economy and infrastructure development issues in the fifth-tier cities. Additionally, due to their limited level of development, the fifth-tier cities may have a relatively homogeneous industrial structure, dominated by traditional industries or agriculture and lacking a diversified industrial layout. National high-tech zones have therefore not greatly aided the development of green innovation in these cities. In contrast, national high-tech zone policies in second-tier, third-tier, and fourth-tier cities have a significant positive impact on green innovation. Despite the lower level of economic development in fourth-tier cities compared to second-tier and third-tier cities, the fourth-tier cities’ national high-tech zones have the most pronounced impact on promoting green innovation. This could be attributed to the ongoing transformation of industries in fourth-tier cities, which are still in the technology diffusion and imitation stage, allowing these cities’ national high-tech zones to maintain a high marginal effect. Thus, Hypothesis 2 is supported.

5 Further analysis

According to the empirical findings, establishing national high-tech zones significantly raises the level of urban green innovation. It is therefore essential to understand the underlying factors and mechanisms that contribute to this positive effect. This paper constructs a moderating effect test model using Eqs 5 , 6 and provides a detailed discussion by introducing transportation infrastructure as a moderating variable.

The empirical results for the moderating effect of transportation infrastructure are shown in Table 9 . The interaction term Zone*lntra is significantly negative at the 5% level, suggesting that the impact of national high-tech zone policies on the level of urban green innovation is negatively moderated by transportation infrastructure. This result deviates from the general expectation, but it aligns with the complexity of the role played by transportation infrastructure in the context of modern economic development, as discussed in the previous theoretical analysis. This could be attributed to the insufficient green innovation benefits generated by the policy on national high-tech zones at the current stage, which fail to compensate for the adverse effects of excessive resource consumption and environmental pollution caused by the construction of the zone. Furthermore, transportation infrastructure can lead to an excessive concentration of similar enterprises in the high-tech zones. This excessive concentration creates a relative crowding effect, intensifying competition among enterprises. It diminishes their inclination to engage in green innovation collaboration and investment and hinders their effective implementation of technological research and development activities. Moreover, the excessive clustering of similar enterprises implies a lack of diversity in green innovation activities among businesses located in national high-tech zones. This results in duplicated green innovation outputs and hinders the advancement of green innovation. Thus, Hypothesis 3 is supported.

Table 9. Empirical results of moderating effects.

6 Conclusion and policy recommendations

6.1 Conclusion

Based on panel data from 163 prefecture-level cities in China from 2007 to 2019, the net effect of setting up national high-tech zones on urban green innovation was analyzed using the double machine learning model. The results show the following. Firstly, the national high-tech zone policy significantly raises the degree of local green innovation, and this result remains robust after accounting for various factors that could affect the estimation. Secondly, in the central and western regions, the level of urban green innovation is positively impacted by the national high-tech zone policy; however, this impact is not statistically significant in the eastern region. The national high-tech zone initiative has a stronger impact on the level of urban green innovation in the western region than in the central region. Across different city levels, the high-tech zone policy has a more substantial impact on the level of green innovation in fourth-tier cities than in second-tier and third-tier cities. Thirdly, based on the moderating effect mechanism test, the construction of transportation infrastructure weakens the promotional effect of national high-tech zones on urban green innovation.

6.2 Policy recommendations

In order that national high-tech zones can better promote China’s high-quality development, this paper proposes the following policy recommendations:

(1) Urban green innovation in China depends on accelerating the setting up of national high-tech zones and creating an atmosphere that supports innovation. Establishing national high-tech zones as testbeds for high-quality development and green innovation has significantly elevated urban green innovation. Thus, cities can efficiently foster urban green innovation by supporting the development of national high-tech zones. Cities that have already established national high-tech zones should further encourage enterprises within these zones to increase their investment in research and development. They should also proceed to foster the leadership of national high-tech zones for urban green innovation, assuming the role of pilot cities as models and leaders. Additionally, it is essential to establish mechanisms for cooperation and synergy between the pilot cities and their neighboring cities to promote collective green development in the region.

(2) Expanding the pilot program and implementing tailored policies based on local conditions are essential. Industrial policies for national high-tech zones have differing effects on urban green innovation. Regions should leverage their comparative advantages, consider both the commonalities and the unique aspects of urban development, and foster a stable and sustainable green innovation ecosystem. The western and central regions should prioritize constructing and enhancing new infrastructure and bolster support for the high-tech green industry. The western region should seize the opportunity presented by national policies that prioritize support, quicken the pace of environmental innovation, and progressively bridge the gap with the eastern and central regions in various aspects. Furthermore, second-tier, third-tier, and fourth-tier cities should build on the advantages of national high-tech zone policies to keep green innovation at a high level. Regions facing challenges in green innovation, particularly fifth-tier cities, should learn from the development experiences of advanced regions with national high-tech zones to compensate for their deficiencies in green innovation.

(3) Highlighting the importance of transportation regulation and enhancing collaboration in green innovation is crucial. Firstly, transportation infrastructure should be maximized to strengthen coordination and cooperation among regions, facilitate the smooth movement of innovative talents across regions, and facilitate the rational sharing of innovative resources, collectively enhancing green innovation. Additionally, attention ought to be given to the industrial clustering effect of parks to prevent the wastage of resources and inefficiencies resulting from the excessive clustering of similar industries. Efforts should be focused on effectively harnessing the latent potential of crucial transportation infrastructure areas as long-term drivers of development, promptly mitigating the negative impact of transportation infrastructure construction, and gradually achieving the synergistic promotion of the setting up of national high-tech zones and the raising of urban levels of green innovation, among other overarching objectives.

6.3 Limitations and future research

Our study has some limitations. First, the research is conducted in the institutional context of China, and not all countries are suited to implementing similar industrial policies that develop the economy while protecting the environment. Nevertheless, we believe the study is relevant, and it encourages closer attention to environmental protection from an industrial policy perspective. Second, the urban green innovation index is measured by the quantity of green patent authorizations; future studies could consider other aspects of green innovation, such as the quality of green patents granted. Third, the paper employs machine learning techniques for causal inference; subsequent investigations could delve further into the potential applications of machine learning algorithms in environmental sciences to maximize the benefits of innovative research methodologies.

Data availability statement

The raw data supporting the conclusion of this article will be made available by the authors, without undue reservation.

Author contributions

WC: Conceptualization, Data curation, Formal Analysis, Funding acquisition, Investigation, Methodology, Project administration, Resources, Supervision, Validation, Visualization, Writing–review and editing. YJ: Conceptualization, Data curation, Formal Analysis, Investigation, Project administration, Resources, Software, Supervision, Validation, Visualization, Writing–original draft, Writing–review and editing. BT: Investigation, Project administration, Writing–review and editing.

Funding

The authors declare that financial support was received for the research, authorship, and/or publication of this article. This research was supported by the Youth Fund for Humanities and Social Science research of Ministry of Education (20YJC790004).

Acknowledgments

The authors are grateful to the editors and the reviewers for their insightful comments.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References

Albahari, A., Pérez-Canto, S., Barge-Gil, A., and Modrego, A. (2017). Technology parks versus science parks: does the university make the difference? Technol. Forecast. Soc. Change 116, 13–28. doi:10.1016/j.techfore.2016.11.012

Alder, S., Shao, L., and Zilibotti, F. (2016). Economic reforms and industrial policy in a panel of Chinese cities. J. Econ. Growth 21, 305–349. doi:10.1007/s10887-016-9131-x

Ansaris, M., Ashrafi, S., and Jebellie, H. (2016). The impact of human capital on green innovation. Industrial Manag. J. 8 (2), 141–162. doi:10.22059/imj.2016.60653

Banister, D., and Berechman, Y. (2001). Transport investment and the promotion of economic growth. J. Transp. Geogr. 9 (3), 209–218. doi:10.1016/s0966-6923(01)00013-8

Behrens, K., Lamorgese, A. R., Ottaviano, G. I., and Tabuchi, T. (2007). Changes in transport and non-transport costs: local vs global impacts in a spatial network. Regional Sci. Urban Econ. 37 (6), 625–648. doi:10.1016/j.regsciurbeco.2007.08.003

Bo, W., Yongzhong, Z., Lingshan, C., and Xing, Y. (2020). Urban green innovation level and decomposition of its determinants in China. Sci. Res. Manag. 41 (8), 123. doi:10.19571/j.cnki.1000-2995.2020.08.013

Cao, Q. F. (2019). The latest researches on place based policy and its implications for the construction of xiong’an national new district. Sci. Technol. Prog. Policy 36 (2), 36–43. (in Chinese).

Cattapan, P., Passarelli, M., and Petrone, M. (2012). Brokerage and SME innovation: an analysis of the technology transfer service at area science park, Italy. Industry High. Educ. 26 (5), 381–391. doi:10.5367/ihe.2012.0119

Chandrashekar, D., and Bala Subrahmanya, M. H. (2017). Absorptive capacity as a determinant of innovation in SMEs: a study of Bengaluru high-tech manufacturing cluster. Small Enterp. Res. 24 (3), 290–315. doi:10.1080/13215906.2017.1396491

Chen, M., and Zheng, Y. (2008). China's regional disparity and its policy responses. China & World Econ. 16 (4), 16–32. doi:10.1111/j.1749-124x.2008.00119.x

Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., et al. (2018). Double/debiased machine learning for treatment and structural parameters. Econ. J. 21 (1), C1–C68. doi:10.1111/ectj.12097

De Beule, F., and Van Beveren, I. (2012). Does firm agglomeration drive product innovation and renewal? An application for Belgium. Tijdschr. Econ. Soc. Geogr. 103 (4), 457–472. doi:10.1111/j.1467-9663.2012.00715.x

Díez-Vial, I., and Fernández-Olmos, M. (2017). The effect of science and technology parks on firms’ performance: how can firms benefit most under economic downturns? Technol. Analysis Strategic Manag. 29 (10), 1153–1166. doi:10.1080/09537325.2016.1274390

Fang, Z., Kong, X., Sensoy, A., Cui, X., and Cheng, F. (2021). Government’s awareness of environmental protection and corporate green innovation: a natural experiment from the new environmental protection law in China. Econ. Analysis Policy 70, 294–312. doi:10.1016/j.eap.2021.03.003

Fritsch, M., and Slavtchev, V. (2011). Determinants of the efficiency of regional innovation systems. Reg. Stud. 45 (7), 905–918. doi:10.1080/00343400802251494

He, J. A., Peng, F. P., and Xie, X. Y. (2022). Mixed-ownership reform, political connection and enterprise innovation: based on the double/unbiased machine learning method. Sci. Technol. Manag. Res. 42 (11), 116–126. (in Chinese).

Hong, J., Feng, B., Wu, Y., and Wang, L. (2016). Do government grants promote innovation efficiency in China's high-tech industries? Technovation 57, 4–13. doi:10.1016/j.technovation.2016.06.001

Huang, W. J., and Fernández-Maldonado, A. M. (2016). High-tech development and spatial planning: comparing The Netherlands and Taiwan from an institutional perspective. Eur. Plan. Stud. 24 (9), 1662–1683. doi:10.1080/09654313.2016.1187717

Hull, I., and Grodecka-Messi, A. (2022). Measuring the impact of taxes and public services on property values: a double machine learning approach. arXiv preprint arXiv:2203.14751.

Jiang, T. (2022). Mediating effects and moderating effects in causal inference. China Ind. Econ. 5, 100–120. doi:10.19581/j.cnki.ciejournal.2022.05.005

Knittel, C. R., and Stolper, S. (2021). Machine learning about treatment effect heterogeneity: the case of household energy use. AEA Pap. Proc. 111, 440–444. doi:10.1257/pandp.20211090

Li, H., and Tang, L. (2015). Transportation infrastructure investment, spatial spillover effect and enterprise inventory. Manag. World 4, 126–136. doi:10.19744/j.cnki.11-1235/f.2015.04.012

Li, W. H., Liu, F., and Liu, T. S. (2022a). Can national high-tech zones improve the urban innovation efficiency? an empirical test based on the effect of spatial agglomeration regulation. Manag. Rev. 34 (5), 93. doi:10.14120/j.cnki.cn11-5057/f.2022.05.007

Li, X. (2015). Analysis and outlook of the related researches on green innovation. R&D Manag. 27 (2), 1–11. doi:10.13581/j.cnki.rdm.2015.02.001

Li, X., Shao, X., Chang, T., and Albu, L. L. (2022b). Does digital finance promote the green innovation of China's listed companies? Energy Econ. 114, 106254. doi:10.1016/j.eneco.2022.106254

Li, X. P., Li, P., Lu, D. G., and Jiang, F. T. (2015). Economic agglomeration, selection effects and firm productivity. J. Manag. World 4, 25–37+51. (in Chinese). doi:10.19744/j.cnki.11-1235/f.2015.04.004

Liu, R. M., and Zhao, R. J. (2015). Does the national high-tech zone promote regional economic development? A verification based on differences-in-differences method. J. Manag. World 8, 30–38. doi:10.19744/j.cnki.11-1235/f.2015.08.005

Luo, R., and Wang, Q. M. (2023). Does the construction of national demonstration logistics park produce economic growth effect? Econ. Surv. 40 (1), 47–56. doi:10.15931/j.cnki.1006-1096.2023.01.015

Luo, Y., Salman, M., and Lu, Z. (2021). Heterogeneous impacts of environmental regulations and foreign direct investment on green innovation across different regions in China. Sci. Total Environ. 759, 143744. doi:10.1016/j.scitotenv.2020.143744

Melitz, M. J., and Ottaviano, G. I. (2008). Market size, trade, and productivity. Rev. Econ. Stud. 75 (1), 295–316. doi:10.1111/j.1467-937x.2007.00463.x

Park, S. C., and Lee, S. K. (2004). The regional innovation system in Sweden: a study of regional clusters for the development of high technology. AI Soc. 18 (3), 276–292. doi:10.1007/s00146-003-0277-7

Pokharel, R., Bertolini, L., Te Brömmelstroet, M., and Acharya, S. R. (2021). Spatio-temporal evolution of cities and regional economic development in Nepal: does transport infrastructure matter? J. Transp. Geogr. 90, 102904. doi:10.1016/j.jtrangeo.2020.102904

Qiu, Y., Wang, H., and Wu, J. (2023). Impact of industrial structure upgrading on green innovation: evidence from Chinese cities. Environ. Sci. Pollut. Res. 30 (2), 3887–3900. doi:10.1007/s11356-022-22162-1

Sosnovskikh, S. (2017). Industrial clusters in Russia: the development of special economic zones and industrial parks. Russ. J. Econ. 3 (2), 174–199. doi:10.1016/j.ruje.2017.06.004

Sun, Z., and Sun, J. C. (2015). The effect of Chinese industrial policy: industrial upgrading or short-term economic growth. China Ind. Econ. 7, 52–67. (in Chinese). doi:10.19581/j.cnki.ciejournal.2015.07.004

Tan, J., and Zhang, J. (2018). Does national high-tech development zones promote the growth of urban total factor productivity? Based on "quasi-natural experiments" of 277 cities. Res. Econ. Manag. 39 (9), 75–90. doi:10.13502/j.cnki.issn1000-7636.2018.09.007

Vásquez-Urriago, Á. R., Barge-Gil, A., Rico, A. M., and Paraskevopoulou, E. (2014). The impact of science and technology parks on firms’ product innovation: empirical evidence from Spain. J. Evol. Econ. 24, 835–873. doi:10.1007/s00191-013-0337-1

Wang, F., Dong, M., Ren, J., Luo, S., Zhao, H., and Liu, J. (2022a). The impact of urban spatial structure on air pollution: empirical evidence from China. Environ. Dev. Sustain. 24, 5531–5550. doi:10.1007/s10668-021-01670-z

Wang, M., and Liu, X. (2023). The impact of the establishment of national high-tech zones on total factor productivity of Chinese enterprises. China Econ. 18 (3), 68–93. doi:10.19602/j.chinaeconomist.2023.05.04

Wang, Q., She, S., and Zeng, J. (2020). The mechanism and effect identification of the impact of National High-tech Zones on urban green innovation: based on a DID test. China Popul. Resour. Environ. 30 (02), 129–137.

Wang, W. S., and Xu, T. S. (2020). A research on the impact of national high-tech zone establishment on enterprise innovation performance. Econ. Surv. 37 (6), 76–87. doi:10.15931/j.cnki.1006-1096.20201010.001

Wang, Z., Yang, Y., and Wei, Y. (2022b). Has the construction of national high-tech zones promoted regional economic growth? empirical research from prefecture-level cities in China. Sustainability 14 (10), 6349. doi:10.3390/su14106349

Weber, M., Driessen, P. P., and Runhaar, H. A. (2014). Evaluating environmental policy instruments mixes; a methodology illustrated by noise policy in The Netherlands. J. Environ. Plan. Manag. 57 (9), 1381–1397. doi:10.1080/09640568.2013.808609

Wu, Y. B. (2019). Does fiscal decentralization promote technological innovation. Mod. Econ. Sci. 41, 13–25.

Xu, S. D., Jiang, J., and Zheng, J. (2022). Has the establishment of national high-tech zones promoted industrial Co-Agglomeration? an empirical test based on difference in difference method. Inq. into Econ. Issues 11, 113–127. (in Chinese).

Yang, F., and Guo, G. (2020). Fuzzy comprehensive evaluation of innovation capability of Chinese national high-tech zone based on entropy weight—taking the northern coastal comprehensive economic zone as an example. J. Intelligent Fuzzy Syst. 38 (6), 7857–7864. doi:10.3233/jifs-179855

Yu, N., De Jong, M., Storm, S., and Mi, J. (2013). Spatial spillover effects of transport infrastructure: evidence from Chinese regions. J. Transp. Geogr. 28, 56–66. doi:10.1016/j.jtrangeo.2012.10.009

Yuan, H., and Zhu, C. L. (2018). Do national high-tech zones promote the transformation and upgrading of China’s industrial structure. China Ind. Econ. 8, 60–77. doi:10.19581/j.cnki.ciejournal.2018.08.004

Zhang, T., Chen, L., and Dong, Z. (2018). Highway construction, firm dynamics and regional economic efficiency. China Ind. Econ. 1, 79–99. doi:10.19581/j.cnki.ciejournal.20180115.003

Zhang, T., and Li, J. C. (2023). Network infrastructure, inclusive green growth, and regional inequality: from causal inference based on double machine learning. J. Quantitative Technol. Econ. 40 (4), 113–135. doi:10.13653/j.cnki.jqte.20230310.005

Zhou, L., and Shen, K. (2020). National city group construction and green innovation. China Popul. Resour. Environ. 30 (8), 92–99.

Zhou, X., and Du, J. (2021). Does environmental regulation induce improved financial development for green technological innovation in China? J. Environ. Manag. 300, 113685. doi:10.1016/j.jenvman.2021.113685

Keywords: national high-tech zone, industrial policy, green innovation, heterogeneity analysis, moderating effect, double machine learning

Citation: Cao W, Jia Y and Tan B (2024) Impact of industrial policy on urban green innovation: empirical evidence of China’s national high-tech zones based on double machine learning. Front. Environ. Sci. 12:1369433. doi: 10.3389/fenvs.2024.1369433

Received: 12 January 2024; Accepted: 15 March 2024; Published: 04 April 2024.

Copyright © 2024 Cao, Jia and Tan. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Yu Jia

COMMENTS

  1. Machine Learning-Enabled Internet of Things (IoT): Data, Applications

    Machine learning (ML) allows the Internet of Things (IoT) to gain hidden insights from the treasure trove of sensed data and be truly ubiquitous without explicitly looking for knowledge and data patterns. Without ML, IoT cannot withstand the future requirements of businesses, governments, and individual users. The primary goal of IoT is to perceive what is happening in our surroundings and ...

  2. Review article Machine learning approaches to IoT security: A

    This paper aims to investigate research trends for the applications of machine learning in IoT security. We adopted a systematic approach to evaluating recent studies and future trends in IoT security by extracting the most relevant and scholarly literature published in the last two years (2019 and 2020).

  3. Machine learning techniques for IoT security: Current research and

    The paper compared leading IoT NID proposals and emphasized potential future research directions, focusing mainly on machine learning algorithms. To provide a comprehensive review, Al-Garadi et al. [ 11 ], presented ML as separate sections, despite the latter being a subset of the former.

  4. (PDF) Machine Learning Powered IoT for Smart Applications

    To imitate the human intelligence level, the machine or software is made smarter by using advanced deep learning. In the paper, several diverse types of IoT technologies will be referenced ...

  5. Machine learning for internet of things data analysis: a survey

    Machine learning algorithms in eight categories based on recent studies on IoT data and frequency of machine learning algorithms are reviewed and summarized in Section 5. The matching of the algorithms to particular smart city applications is carried out in Section 6 , and the conclusion together with future research trends and open issues are ...

  6. A survey on application of machine learning for Internet of Things

    The application of machine learning for IoT enables users to obtain deep analytics and develop efficient intelligent IoT applications. This paper is different from the previously published survey ...

  7. Machine Learning in Real-Time Internet of Things (IoT) Systems: A

    Over the last decade, machine learning (ML) and deep learning (DL) algorithms have significantly evolved and been employed in diverse applications, such as computer vision, natural language processing, automated speech recognition, etc. Real-time safety-critical embedded and Internet of Things (IoT) systems, such as autonomous driving systems, UAVs, drones, security robots, etc., heavily rely ...

  8. Machine learning and data analytics for the IoT

    In this paper, we study and analyze the role of machine learning to facilitate data analytics for the IoT paradigm. We present a thorough analysis of the integration of machine learning with the IoT paradigm in Sect. 2. In Sect. 3, we define the application of machine learning for processing and analysis of IoT data.

  9. Artificial intelligence Internet of Things: A new paradigm of

    In the sensing and device layer, the AIoT paradigm can take advantage of recently developed edge computing architectures 2 and machine-learning approaches, such as active learning (AL), 3 transfer learning (TL), 4 and federated learning (FL). 5 AL techniques can deal with the time-varying and unpredictable data over the IoT network. TL utilizes pre-trained models developed at the edge servers ...

  10. Machine Learning: Algorithms, Real-World Applications and Research

    In the current age of the Fourth Industrial Revolution (4IR or Industry 4.0), the digital world has a wealth of data, such as Internet of Things (IoT) data, cybersecurity data, mobile data, business data, social media data, health data, etc. To intelligently analyze these data and develop the corresponding smart and automated applications, the knowledge of artificial intelligence (AI ...

  11. Machine Learning in IoT Security: Current Solutions and Future

    Machine Learning (ML) and Deep Learning (DL) techniques, which are able to provide embedded intelligence in the IoT devices and networks, can be leveraged to cope with different security problems. In this paper, we systematically review the security requirements, attack vectors, and the current security solutions for the IoT networks.

  12. Edge Machine Learning for AI-Enabled IoT Devices: A Review

    Several research papers focused on the possibility of bringing artificial intelligence to devices with limited resources ... In this work, a detailed review on models, architectures, and requirements on solutions that implement edge machine learning on IoT devices was presented, with the main goal to define the state of the art and envisioning ...

  13. IoT and Machine Learning

    In IoT applications, intelligent processing and analysis of big data act as key for their development. Data science technologies help to find new pattern and new insights from data to make IoT applications more intelligent. Data science with IoT is mainly used in various sectors dealing with volume, velocity and pattern recognition.

  14. Internet of Things (IoT) for Next-Generation Smart Systems: A Review of

    This paper presents an exhaustive review for these key enabling technologies and also discusses the new emerging use cases of 5G-IoT driven by the advances in artificial intelligence, machine and deep learning, ongoing 5G initiatives, quality of service (QoS) requirements in 5G and its standardization issues.

  15. Machine Learning in IoT Security: Current Solutions and ...

    MCA-based DDoS detection mechanism focuses on the server side and in the context of the IoT, this mechanism will detect DDoS attack as a result of data flow between the back-end servers for ...

  16. Enhancing IoT Intelligence: A Transformer-based Reinforcement Learning

    The results indicate significant advancements in enabling RL agents to navigate the complexities of IoT ecosystems, highlighting the potential of our approach to revolutionize intelligent automation and decision-making in the IoT landscape. Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI) Cite as: arXiv:2404.04205 [cs.LG]

  17. Role of Artificial Intelligence in the Internet of Things (IoT

    This review paper compiles information from several other surveys and research papers regarding IoT, AI, and attacks with and against AI and explores the relationship between these three topics with the purpose of comprehensively presenting and summarizing relevant literature in these fields. ... Using machine learning to secure IoT systems. In ...

  18. Review article Machine learning approaches to IoT security: A

    In this systematic literature review (SLR) paper, our goal is to provide a research asset to researchers on recent research trends in IoT security. As the main driver of our SLR paper, we proposed six research questions related to IoT security and machine learning. This extensive literature survey on the most recent publications in IoT security ...

  19. Machine learning and deep learning approaches in IoT

    As a result, machine learning and deep learning technologies are utilized to identify and control security in IoMT and IoV devices. This research study aims to investigate the research fields of current IoT security research trends. Papers about the domain were searched, and the top 50 papers were selected.

  20. Predicting and improving complex beer flavor through machine learning

    The perception and appreciation of food flavor depends on many interacting chemical compounds and external factors, and therefore proves challenging to understand and predict. Here, we ...

  21. IoT and Machine Learning Based Prediction of Smart Building Indoor

    The paper carries out a Machine Learning based experimentation on recorded real sensor data [1] to validate the approach. Following that, the paper suggests integration of following strategy into an Edge Computing based IoT architecture for enabling the building to work in an energy-efficient fashion. ...

  22. Research on Fault Diagnosis System of IOT for Oil Well Pump Based on

    This paper deeply explains the algorithm steps of using LSTM for fault diagnosis of oil well pumps and the principle of machine learning for fault diagnosis and prediction, and deeply explains the main module content of the IoT central computer software design. In order to realize automatic prediction and processing of remote fault diagnosis of oil well pumps distributed in different regions ...

  23. Accelerating Progress: Qualcomm AI Research Makes Code Available for

    Qualcomm AI Research releasing code for their top papers in various machine learning areas is a way to fuel collaboration and to advance the state-of-the-art in AI. The team will be releasing more papers with code in the following months. Follow the GitHub account for more updates, as well as the Qualcomm Innovation Center and the Qualcomm AI Hub.

  24. IoT Security: Botnet detection in IoT using Machine learning

    Security is considered as one of the prominent challenges in IoT. The key scope of this research work is to propose an innovative model using machine learning algorithm to detect and mitigate botnet-based distributed denial of service (DDoS) attack in IoT network. Our proposed model tackles the security issue concerning the threats from bots.

  25. Frontiers

    This paper uses the research methods of Wu (2019) and uses the roadway mileage (measured in kilometers) to population as a measure of the quality of the transportation system. 3.2.4 Control variables ... Secondly, the paper employs machine learning techniques for causal inference. Subsequent investigations could delve further into the potential ...

  26. Internet of Things and smart sensors in agriculture: Scopes and

    IoT generates huge data that requires fast data analysis and such task can be completed through AI algorithms with better efficiency and high quality of decision making. New logic and methods like machine learning, machine vision, artificial neural networks, natural languages processing, etc., have improved the automation process in agriculture.

  27. Road Accident Prediction Using Machine Learning

    A deep learning-based road accident prediction system utilizing various factors, such as speed, traffic condition, weather, and more, aims to accurately predict road accidents, ultimately contributing to enhancing road safety. Road accidents are a significant cause of fatalities and injuries worldwide. Predicting road accidents is crucial for implementing preventive measures and saving lives ...

  28. Breast cancer diagnosis using machine learning techniques

    The goal of this research is to diagnose breast cancer using machine learning methods including decision trees, support vector machines (SVM), and naïve Bayesian classifiers. The properties of cell nuclei taken from breast biopsies are included in the Breast Cancer Wisconsin dataset, which is used in this study. Three different machine learning algorithms - naive Bayesian classifier, SVM and ...