Google develops a bioacoustic model called HeAR that can detect diseases through coughing, talking, and breathing

Aug 23, 2024

With advances in deep learning, neural networks can now learn high-quality general representations directly from raw speech data and apply them to a variety of semantic and non-semantic speech-related tasks. For example, by analyzing non-semantic features of speech (such as pronunciation and resonance), some cerebrovascular and neurodegenerative diseases (such as stroke, Parkinson’s disease, and Alzheimer’s disease) can be detected and monitored. Sounds produced by airflow in the respiratory system (such as coughs and breathing patterns) can also be used for health monitoring. For example, doctors can recognize the characteristic “whoop” of whooping cough, or the wheezing sound in acute cardiovascular events, and use it to diagnose the corresponding condition.

Google’s research team has developed a bioacoustic model called Health Acoustic Representations (HeAR), which is designed to detect diseases by analyzing the body’s acoustic signals, such as coughing, speaking, and breathing. HeAR was trained on roughly 300 million audio clips, and its cough model in particular on about 100 million cough sounds, to identify acoustic patterns associated with health.

The HeAR system was tested on 13 health acoustic event detection tasks, 14 cough inference tasks, and 6 lung function inference tasks, and exceeded the performance of existing baseline models in many tasks.

For example, HeAR performed best on 10 of the 14 cough inference tasks, including detection of COVID-19 and tuberculosis. It also performed very well on lung function inference tasks, especially on key indicators such as forced expiratory volume in one second (FEV1) and forced vital capacity (FVC).

HeAR’s goal is to help researchers develop custom bioacoustic models when data is limited, thereby accelerating research on specific diseases and populations.

Salcit Technologies in India has applied the HeAR model in a product called Swaasa® that analyzes cough sounds and assesses lung health, particularly for early detection of tuberculosis (TB), a treatable disease that still goes undiagnosed in millions of cases each year because of poor access to healthcare.

The company is exploring how HeAR can help expand the capabilities of its bioacoustic AI models. As a first step, it is using HeAR in Swaasa® to study and improve early detection of tuberculosis from cough sounds.

HeAR’s innovations and main functions

HeAR was trained using 313 million audio clips extracted from YouTube and evaluated on 33 tasks on 6 different datasets. These tasks include health acoustic event detection, cough inference, lung function assessment, etc.

Innovation

Self-supervised learning framework

The HeAR system uses a self-supervised learning (SSL) framework, which is a learning method that does not rely on a large amount of manually labeled data. By training the masked autoencoder (MAE), the system can learn a general low-dimensional audio representation from large-scale unlabeled audio data. This method can effectively improve the generalization ability of the model in a variety of tasks, especially when dealing with out-of-distribution (OOD) data.

Large-scale dataset training

The HeAR system was trained on a large-scale dataset of 313 million two-second audio clips extracted from three billion YouTube videos, covering a variety of non-semantic health acoustic events (such as coughing, breathing, etc.). The use of a large-scale dataset improves the robustness and wide applicability of the system.

Health Acoustic Event Detector

The system introduces a multi-label classification convolutional neural network (CNN) as a health acoustic event detector, which identifies non-semantic health acoustic events in audio clips. These events include coughing, baby coughing, breathing, throat clearing, laughing, and speaking. The detector not only extends the functionality of the system but also lets it handle a variety of health acoustic tasks.
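To make "multi-label" concrete, here is a minimal sketch of how per-class logits from such a detector could be turned into a set of detected events. The class names come from the list above; the 0.5 threshold and the example logits are illustrative assumptions, not values from the HeAR work.

```python
import math

# Event classes as listed in the article.
EVENT_CLASSES = ["cough", "baby cough", "breathing", "throat clearing", "laughter", "speech"]


def sigmoid(logit: float) -> float:
    """Map a logit (log odds) to a probability, in a numerically stable way."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    e = math.exp(logit)
    return e / (1.0 + e)


def detect_events(logits, threshold=0.5):
    """Return every class whose own probability reaches the threshold.

    Each class gets an independent sigmoid (no softmax), so one clip can
    contain zero, one, or several events at the same time.
    """
    if len(logits) != len(EVENT_CLASSES):
        raise ValueError("expected one logit per event class")
    return [name for name, z in zip(EVENT_CLASSES, logits) if sigmoid(z) >= threshold]


# Hypothetical logits for one two-second clip (made-up values).
example_logits = [2.0, -3.0, 0.5, -1.0, -4.0, 1.2]
print(detect_events(example_logits))
```

Because every class is scored on its own, a clip of someone coughing while talking can be tagged with both events, which a single-label softmax classifier could not express.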

Multi-task performance evaluation

The HeAR system was benchmarked on 33 different health acoustic tasks, demonstrating its superior performance in a variety of tasks. In particular, on cough inference and lung function inference tasks, the HeAR system surpassed many existing technical benchmarks, demonstrating its potential as a general health acoustic model.

Key Features

Disease inference and screening:

The HeAR system encodes two-second audio clips and generates audio embeddings that can be used for downstream tasks. These embeddings can be directly applied in various health acoustic tasks, such as health event detection, cough inference, and lung function inference.
HeAR can infer the possibility of specific diseases by analyzing health acoustic signals such as cough sounds. For example, it can be used to detect tuberculosis (TB), COVID-19, chronic obstructive pulmonary disease (COPD), etc. This inference function is particularly suitable for resource-limited environments, where screening can be performed through simple audio collection devices (such as smartphones).
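A common way to use frozen embeddings for a downstream task is a linear probe: the encoder is left untouched and only a small linear classifier is trained on top of its outputs. The sketch below assumes that setup. It uses toy 4-dimensional random vectors in place of real HeAR embeddings, and every function name in it is hypothetical.

```python
import math
import random


def _sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(weights, bias, embedding):
    """Probability of the positive class for one embedding."""
    return _sigmoid(sum(w * x for w, x in zip(weights, embedding)) + bias)


def train_linear_probe(embeddings, labels, lr=0.5, epochs=200):
    """Fit logistic regression on frozen embeddings with batch gradient descent."""
    dim, n = len(embeddings[0]), len(embeddings)
    weights, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(embeddings, labels):
            err = predict_proba(weights, bias, x) - y  # d(log loss)/d(logit)
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def make_toy_embeddings(n_per_class=20, dim=4, seed=0):
    """Synthetic stand-ins for embeddings: two Gaussian clusters, one per label."""
    rng = random.Random(seed)
    xs, ys = [], []
    for label, center in ((0, -1.0), (1, 1.0)):
        for _ in range(n_per_class):
            xs.append([rng.gauss(center, 0.5) for _ in range(dim)])
            ys.append(label)
    return xs, ys


xs, ys = make_toy_embeddings()
w, b = train_linear_probe(xs, ys)
accuracy = sum((predict_proba(w, b, x) >= 0.5) == bool(y) for x, y in zip(xs, ys)) / len(ys)
print(f"training accuracy on toy data: {accuracy:.2f}")
```

Since only a handful of parameters are trained, this kind of probe can work with a small labeled dataset, which fits the resource-limited settings described above.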

Health Acoustic Event Detection

HeAR can detect and identify various health-related acoustic events from audio data, such as coughing, breathing, throat clearing, laughing, and talking. The detection of these events can be used to monitor health status and provide early warning of diseases.
The HeAR system can also infer health-related information from the cough sounds in the audio, such as the presence of specific diseases (for example COVID-19 or tuberculosis) and an individual’s gender, age, BMI, and lifestyle habits (such as smoking status).

Lung function inference

HeAR can estimate a patient’s lung function parameters, such as forced expiratory volume in one second (FEV1), forced vital capacity (FVC), and peak expiratory flow (PEF), by analyzing respiratory audio data.
These estimates can help doctors track changes in a patient’s lung function over time, support disease management, and screen for chronic obstructive pulmonary disease (COPD).

Equipment compatibility and environmental adaptability

HeAR has been trained and tested on a variety of devices (such as different models of smartphones) and can adapt to audio data from different recording devices. This makes HeAR more compatible with devices in real-world applications and suitable for audio recording environments of different qualities, including those with limited resources.

Self-supervised learning and data efficiency

HeAR uses self-supervised learning on a large amount of unlabeled audio data to achieve better generalization across tasks. Compared with traditional methods, HeAR can maintain high performance even when labeled health data is scarce.

Medical research and development support

HeAR is a foundation model that is available to researchers to accelerate the development of customized bioacoustic models for specific diseases and populations. This allows medical researchers to build health monitoring tools for specific application scenarios in less time.

Technical methods

The technical approach of the HeAR system consists of three main parts: data processing, model training, and task evaluation. The following is a detailed introduction to each part:

1. Data processing

Health Acoustic Event Detector:

The HeAR system first uses a multi-label classification convolutional neural network (CNN) as a health acoustic event detector to detect non-semantic health acoustic events in audio clips, including coughing, baby coughing, breathing, throat clearing, laughter, and speaking.
The audio is converted to mono at a 16 kHz sampling rate and transformed into a log-mel spectrogram with 48 frequency bands covering 125 Hz to 7.5 kHz, followed by per-channel energy normalization (PCEN).
These spectrograms are input to a small convolutional neural network, which is trained with a balanced binary cross entropy loss function and ultimately outputs logits (log odds) for each predicted class.
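To give a feel for what "48 frequency bands covering 125 Hz to 7.5 kHz" means, the sketch below computes the center frequencies of 48 bands spaced evenly on the mel scale. It assumes the widely used HTK-style mel formula; the article does not say which mel variant the detector uses, so the exact values are illustrative only.

```python
import math


def hz_to_mel(f_hz: float) -> float:
    """HTK-style Hz to mel conversion (an assumption, see the lead-in)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_centers(n_bands=48, f_min=125.0, f_max=7500.0):
    """Center frequencies in Hz of n_bands filters evenly spaced in mel."""
    m_min, m_max = hz_to_mel(f_min), hz_to_mel(f_max)
    # n_bands triangular filters are defined by n_bands + 2 edge points;
    # the centers are the interior points, so the two outer edges are skipped.
    step = (m_max - m_min) / (n_bands + 1)
    return [mel_to_hz(m_min + step * (i + 1)) for i in range(n_bands)]


centers = mel_band_centers()
print(f"lowest center: {centers[0]:.1f} Hz, highest center: {centers[-1]:.1f} Hz")
```

The upper limit of 7.5 kHz sits just below the 8 kHz Nyquist frequency of 16 kHz audio, and the mel spacing puts more bands at low frequencies, where the bands are narrower.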

Dataset:

The HeAR system was trained using a dataset called YT-NS (YouTube Non-Semantic), which contains two-second audio clips extracted from three billion non-copyrighted YouTube videos, for a total of 313 million audio clips (about 174,000 hours of audio).
A two-second window was chosen because most of the events of interest are short. The audio encoder of HeAR is trained entirely on this dataset.
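The dataset figures can be checked with a little arithmetic. The sketch below assumes the 16 kHz sampling rate mentioned for the detector and non-overlapping two-second windows. The article does not state how clips were cut, so the splitting function is only an illustration.

```python
SAMPLE_RATE_HZ = 16_000
CLIP_SECONDS = 2
CLIP_SAMPLES = SAMPLE_RATE_HZ * CLIP_SECONDS  # 32,000 samples per clip


def split_into_clips(waveform):
    """Cut a mono waveform into non-overlapping two-second clips.

    A trailing remainder shorter than two seconds is dropped.
    """
    n_clips = len(waveform) // CLIP_SAMPLES
    return [waveform[i * CLIP_SAMPLES:(i + 1) * CLIP_SAMPLES] for i in range(n_clips)]


def total_hours(n_clips, clip_seconds=CLIP_SECONDS):
    """Total audio duration in hours for n_clips clips."""
    return n_clips * clip_seconds / 3600


# Five seconds of silence yields two full clips; the last second is dropped.
clips = split_into_clips([0.0] * (5 * SAMPLE_RATE_HZ))
print(len(clips), len(clips[0]))

# 313 million two-second clips is roughly 174,000 hours of audio.
print(round(total_hours(313_000_000)))
```

The second print confirms that 313 million two-second clips correspond to about 174,000 hours, in line with the figure quoted above.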

2. Model Training

Self-supervised learning framework:

The HeAR system adopts a generative self-supervised learning (SSL) framework. Specifically, a masked autoencoder (MAE) learns audio representations by reconstructing masked 16×16 spectrogram patches.
During training, 75% of the input spectrogram patches are masked, and only the remaining visible patches are encoded by a ViT-L (Vision Transformer) encoder. Learnable mask tokens are then added to the encoded token sequence, and an 8-layer transformer decoder reconstructs the missing patches. The model is optimized by minimizing the L2 distance between the normalized patches and the predictions.
The HeAR system was trained using the AdamW optimizer for a total of 950k steps (approximately 4 epochs) with a global batch size of 4096. The learning rate was scheduled using cosine annealing with an initial learning rate of 4.8e-4, following the commonly used linear batch scaling rule.
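The training recipe above can be illustrated with a few small helper functions: random patch masking at a 75% ratio, an L2 reconstruction loss against a normalized target patch, and a cosine-annealed learning rate. This is a sketch under stated assumptions. The 192×128 spectrogram size is hypothetical, the schedule has no warmup and decays to zero, and none of the function names come from the HeAR code.

```python
import math
import random

PATCH = 16          # patch side length, from the article
MASK_RATIO = 0.75   # fraction of patches hidden from the encoder, from the article


def num_patches(height, width, patch=PATCH):
    """Number of non-overlapping patch x patch tiles in a spectrogram."""
    if height % patch or width % patch:
        raise ValueError("spectrogram size must be a multiple of the patch size")
    return (height // patch) * (width // patch)


def random_mask(n_patches, mask_ratio=MASK_RATIO, rng=None):
    """Return (visible_ids, masked_ids); only visible patches go to the encoder."""
    rng = rng or random.Random()
    ids = list(range(n_patches))
    rng.shuffle(ids)
    n_masked = int(round(n_patches * mask_ratio))
    return sorted(ids[n_masked:]), sorted(ids[:n_masked])


def normalize_patch(values, eps=1e-6):
    """Normalize one patch to zero mean and unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


def l2_loss(target_patch, predicted_patch):
    """Mean squared error between the normalized target and the prediction."""
    target = normalize_patch(target_patch)
    return sum((t - p) ** 2 for t, p in zip(target, predicted_patch)) / len(target)


def cosine_lr(step, total_steps=950_000, base_lr=4.8e-4):
    """Cosine annealing from base_lr down to 0 (no warmup: an assumption)."""
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * step / total_steps))


# Hypothetical 192 x 128 spectrogram: 12 x 8 = 96 patches in total.
n = num_patches(192, 128)
visible, masked = random_mask(n, rng=random.Random(0))
print(n, len(visible), len(masked))
```

With a 75% mask ratio the encoder sees only a quarter of the patches, which is what makes masked-autoencoder pretraining comparatively cheap for a large encoder.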

Benchmark Results

The HeAR system was extensively benchmarked on 33 tasks, covering three major categories of tasks: health acoustic event detection, cough inference, and lung function inference.

Health Acoustic Event Detection: HeAR performs well on health acoustic event detection tasks, accurately identifying events such as coughing, breathing, throat clearing, and laughter. These detection tasks are drawn from the FSD50K and FluSense datasets.
Cough Inference Task: HeAR achieved top results on 10 out of 14 cough inference tasks, including diagnosing specific diseases (such as COVID-19 and tuberculosis) and inferring demographic information (such as gender, smoking status, age, etc.).
Lung function evaluation: Among the five lung function-related tasks (such as forced expiratory volume, vital capacity, peak expiratory flow, etc.), HeAR performed better than other baseline models in four tasks.

1. Overall performance

Across all tasks, the HeAR system achieved a Mean Reciprocal Rank (MRR) score of 0.708 and was the best performer in 17 out of 33 tasks, demonstrating its strength as a general-purpose health acoustic model.
The results fall into three categories: health acoustic event detection, cough inference, and lung function inference. HeAR achieved the highest scores in 3, 10, and 4 tasks respectively in these three categories.
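For readers unfamiliar with the metric, Mean Reciprocal Rank averages 1/rank over tasks, where rank is the model's position among the compared models on each task. The sketch below uses made-up ranks, not the benchmark's actual per-task rankings, and does not attempt to reproduce how ties were handled.

```python
def mean_reciprocal_rank(ranks):
    """Average of 1/rank over tasks; rank 1 means the model was best on that task."""
    if not ranks:
        raise ValueError("need at least one rank")
    if any(r < 1 for r in ranks):
        raise ValueError("ranks start at 1")
    return sum(1.0 / r for r in ranks) / len(ranks)


# Made-up example: a model ranked 1st, 2nd, 1st, and 4th on four tasks.
toy_ranks = [1, 2, 1, 4]
print(mean_reciprocal_rank(toy_ranks))  # (1 + 0.5 + 1 + 0.25) / 4 = 0.6875
```

An MRR of 1.0 would mean a model ranked first on every task, so a score around 0.7 indicates a model that is usually at or near the top.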

2. Health Acoustic Event Detection

Datasets: FSD50K and FluSense.
Main results: In the health acoustic event detection task, while the CLAP model performed best overall (mean average precision of 0.691 and MRR of 0.846), HeAR performed best among models not trained with FSD50K (mean average precision of 0.658 and MRR of 0.538).
HeAR performs well on breathing, coughing, laughing, breathing sound, and sneezing detection tasks. For example, in the breathing sound detection task of the FSD50K dataset, HeAR achieves an average precision of 0.434, which is significantly higher than other models.
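Average precision (AP) summarizes how well a model ranks the clips that contain an event above those that do not, and mean average precision is the mean of AP over event classes. The sketch below shows one common way to compute it, using toy labels and scores. The benchmark's exact evaluation code is not described in the article.

```python
def average_precision(labels, scores):
    """AP for one class: mean of the precision measured at each positive's rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, precisions = 0, []
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            precisions.append(hits / rank)
    if not precisions:
        raise ValueError("need at least one positive example")
    return sum(precisions) / len(precisions)


def mean_average_precision(per_class):
    """Mean of AP over classes; per_class is a list of (labels, scores) pairs."""
    aps = [average_precision(labels, scores) for labels, scores in per_class]
    return sum(aps) / len(aps)


# Toy example: positives are ranked 1st and 3rd, so AP = (1/1 + 2/3) / 2.
toy_labels = [1, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.1]
print(average_precision(toy_labels, toy_scores))
```

Unlike accuracy, AP does not depend on a single decision threshold, which makes it a natural fit for rare events in long recordings.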

3. Cough inference

Datasets: CoughVID, Coswara, CIDRZ tuberculosis datasets.
Main results: HeAR performs best in 10 out of 14 cough inference tasks, especially in detecting COVID-19, tuberculosis, chest X-ray (CXR) abnormalities, and inferring gender, age, and BMI.
In the COVID-19 detection task on the CIDRZ dataset, HeAR achieved an AUROC of 0.710, significantly higher than other baseline models. For the gender inference task, HeAR achieved an AUROC of 0.897 on the CoughVID dataset.
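AUROC can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one, so 0.5 is chance level and 1.0 is perfect separation. The sketch below computes it directly from that definition on toy data. The labels and scores are made up and unrelated to the datasets above.

```python
def auroc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: 3 of the 4 positive/negative pairs are ordered correctly.
toy_labels = [1, 1, 0, 0]
toy_scores = [0.9, 0.4, 0.6, 0.2]
print(auroc(toy_labels, toy_scores))  # 0.75
```

The pairwise loop is quadratic in the number of examples, which is fine for illustration. Production code would typically use a rank-based formula or a library routine instead.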