Wearable devices, from smartwatches to fitness trackers, have become ubiquitous, continuously capturing a rich stream of data about our lives. They track our heart rate, step counts, sleep quality, and much more. This deluge of information holds immense potential for personalized health and wellness. Yet while we can easily observe what our bodies are doing (say, a heart rate of 150 beats per minute), the crucial context of why (a brisk uphill run versus a stressful public speaking event) is often missing. This gap between raw sensor data and its real-world meaning has been a significant obstacle to realizing these devices’ full potential.
The primary obstacle is the lack of large-scale datasets that pair sensor recordings with rich, descriptive text: manually annotating millions of hours of data is prohibitively slow and expensive. To close this gap, and to truly let wearable data “speak for itself”, we need models that can learn the intricate connections between sensor signals and human language directly from the data.
In “SensorLM: Learning the Language of Wearable Sensors”, we introduce SensorLM, a family of sensor–language foundation models that bridges this gap. Pre-trained on an unprecedented 59.7 million hours of multimodal sensor data from over 103,000 individuals, SensorLM learns to interpret and generate nuanced, human-readable descriptions from high-dimensional wearable data, setting a new state of the art in sensor data understanding.
Training the SensorLM models
To build the sensor dataset for SensorLM, we sampled nearly 2.5 million person-days of de-identified data from 103,643 people across 127 countries. The data was collected between March 1st and May 1st, 2024, from Fitbit and Pixel Watch devices, with participants consenting to the use of their de-identified data for research that contributes to general knowledge about health and science.
To overcome the annotation bottleneck, we developed a novel hierarchical pipeline that automatically generates descriptive text captions by calculating statistics, identifying trends, and describing events from the sensor data itself. This process allowed us to curate the largest known sensor–language dataset to date, orders of magnitude larger than those used in previous studies.
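To make the hierarchical idea concrete, here is a minimal sketch of how such a pipeline could compose statistics, trends, and events into a caption. The function names, thresholds, and wording below are illustrative assumptions, not the pipeline used in the paper.

```python
# Illustrative sketch of a hierarchical sensor-captioning pipeline.
# All names, thresholds, and phrasings are assumptions for exposition.
import numpy as np

def summarize_statistics(heart_rate, steps):
    """Level 1: basic statistics computed from the raw signals."""
    return {
        "mean_hr": float(np.mean(heart_rate)),
        "max_hr": float(np.max(heart_rate)),
        "total_steps": int(np.sum(steps)),
    }

def describe_trend(heart_rate):
    """Level 2: coarse trend of the signal over the window."""
    slope = np.polyfit(np.arange(len(heart_rate)), heart_rate, 1)[0]
    if slope > 0.5:
        return "heart rate rising steadily"
    if slope < -0.5:
        return "heart rate recovering"
    return "heart rate roughly stable"

def describe_event(stats):
    """Level 3: a higher-level event label from simple heuristics."""
    if stats["mean_hr"] > 120 and stats["total_steps"] > 1000:
        return "a sustained bout of vigorous activity"
    if stats["total_steps"] < 50:
        return "a mostly sedentary period"
    return "light everyday movement"

def caption(heart_rate, steps):
    """Compose the three levels into a single descriptive caption."""
    stats = summarize_statistics(heart_rate, steps)
    return (
        f"Over this window, {describe_event(stats)}: "
        f"{describe_trend(heart_rate)}, mean heart rate "
        f"{stats['mean_hr']:.0f} bpm, {stats['total_steps']} steps."
    )

# Example: a 30-minute window sampled once per minute.
hr = np.linspace(90, 150, 30) + np.random.randn(30)
st = np.random.randint(80, 140, size=30)
print(caption(hr, st))
```

The appeal of this style of pipeline is that captions come entirely from the data, so it scales to millions of hours without human annotators.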
The SensorLM architecture builds on and unifies two well-known multimodal pre-training strategies:
Contrastive learning: The model learns to match a segment of sensor data with its corresponding text description from a set of options. This teaches it to discriminate between different states and activities (e.g., a “light swim” versus a “strength workout”).
Generative pre-training: The model learns to generate text captions directly from the sensor data. This equips it to produce rich, context-aware descriptions from an understanding of the high-dimensional sensor signals.
By integrating these approaches into a single, cohesive framework, SensorLM develops a deep, multimodal understanding of the relationship between sensor signals and language.
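The sketch below shows one common way to combine these two objectives in a single training step, assuming a sensor encoder and a text decoder already exist. The module interfaces, the CLIP-style InfoNCE formulation, and the 0.5 loss weighting are assumptions for illustration, not the configuration reported in the paper.

```python
# Minimal sketch of a combined contrastive + generative pre-training
# objective; interfaces and weights are illustrative assumptions.
import torch
import torch.nn.functional as F

def contrastive_loss(sensor_emb, text_emb, temperature=0.07):
    """InfoNCE loss: match each sensor window to its own caption in the batch."""
    sensor_emb = F.normalize(sensor_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = sensor_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric loss: sensor-to-text and text-to-sensor directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def generative_loss(decoder_logits, caption_tokens, pad_id=0):
    """Next-token prediction loss for captions conditioned on sensor data."""
    return F.cross_entropy(
        decoder_logits.reshape(-1, decoder_logits.size(-1)),
        caption_tokens.reshape(-1),
        ignore_index=pad_id,
    )

def pretraining_step(sensor_emb, text_emb, decoder_logits, caption_tokens,
                     gen_weight=0.5):
    """Single combined objective over a batch of paired sensor/caption data."""
    return (contrastive_loss(sensor_emb, text_emb) +
            gen_weight * generative_loss(decoder_logits, caption_tokens))
```

The contrastive term shapes a shared embedding space for retrieval and zero-shot classification, while the generative term gives the model its captioning ability.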
Key capabilities and scaling behaviors
We tested SensorLM on a wide range of real-world tasks in healthcare and human activity recognition. The results demonstrate significant advances over previous state-of-the-art models.
Activity recognition and retrieval
SensorLM shines in tasks with limited labeled data. It excels in few-shot learning, quickly adapting from just a few examples, and achieves remarkable zero-shot classification across 20 activities without any fine-tuning. This makes the model highly adaptable to new tasks and users with minimal data. SensorLM also enables cross-modal understanding between sensor data and language through powerful cross-modal retrieval: we can query descriptions using sensor input, or find specific sensor patterns using natural language, facilitating expert-driven analysis (further results can be found in the paper).
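Zero-shot classification falls out of the shared embedding space almost for free: embed a description of each candidate activity, embed the sensor window, and pick the closest match. In the sketch below, encode_sensor and encode_text stand in for the model’s two encoders and are assumptions, not a published API, as is the prompt wording.

```python
# Minimal sketch of zero-shot activity classification via cross-modal
# retrieval; encoder callables and prompt format are assumptions.
import torch
import torch.nn.functional as F

def zero_shot_classify(sensor_window, activity_names, encode_sensor, encode_text):
    """Return the best-matching activity name and a softmax over candidates."""
    prompts = [f"a segment of {name}" for name in activity_names]
    sensor_emb = F.normalize(encode_sensor(sensor_window), dim=-1)   # (1, d)
    text_emb = F.normalize(encode_text(prompts), dim=-1)             # (k, d)
    scores = (sensor_emb @ text_emb.t()).squeeze(0)                  # (k,)
    return activity_names[int(scores.argmax())], scores.softmax(dim=-1)
```

The same similarity scores support retrieval in both directions: ranking captions for a given sensor window, or ranking sensor windows for a natural-language query.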
Generative capabilities
Beyond its classification power, SensorLM demonstrates impressive caption generation capabilities. Given only a wearable device’s high-dimensional sensor signals, SensorLM generates captions that are contextually and hierarchically relevant. Experimental results indicate that these generated captions were more coherent and factually correct than those produced by powerful non-specialist LLMs.
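At inference time, captioning amounts to decoding text conditioned on the sensor embedding. The sketch below shows simple greedy decoding; the decoder_step callable and the token IDs are illustrative assumptions rather than the model’s actual interface.

```python
# Minimal sketch of caption generation by greedy decoding from a sensor
# embedding; decoder_step, bos_id, and eos_id are assumed placeholders.
import torch

def generate_caption(sensor_emb, decoder_step, bos_id, eos_id, max_len=64):
    """Greedily decode token IDs for a caption conditioned on sensor_emb."""
    tokens = [bos_id]
    for _ in range(max_len):
        # decoder_step is assumed to return next-token logits of shape (1, vocab).
        logits = decoder_step(sensor_emb, torch.tensor([tokens]))
        next_id = int(logits.argmax(dim=-1))
        if next_id == eos_id:
            break
        tokens.append(next_id)
    return tokens[1:]  # drop the BOS token
```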
Scaling behavior
Our experiments also revealed that SensorLM’s performance consistently improves with more data, larger model sizes, and increased computation, in line with established scaling laws. This steady improvement suggests we have only scratched the surface of what is possible with large-scale sensor–language pre-training, and that further research into this paradigm will be rewarding.
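For readers unfamiliar with scaling laws, such trends are commonly summarized with a power-law fit of the form below; the functional form, exponent, and constants here are a generic reference point, not values fit in the paper.

```latex
% Generic power-law form often used to describe scaling trends;
% L_\infty, X_0, and \alpha are fit constants, not values from the paper.
L(X) \approx L_\infty + \left(\frac{X_0}{X}\right)^{\alpha},
\qquad X \in \{\text{model size},\ \text{data},\ \text{compute}\}
```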
Conclusion
Our research establishes a foundation for understanding wearable sensor data through natural language, enabled by a novel hierarchical captioning pipeline and the largest sensor–language dataset to date. The SensorLM family of models represents a significant advance in making personal health data understandable and actionable. By teaching AI to comprehend the language of our bodies, we can move beyond simple metrics to truly individualized insights. Looking forward, we plan to scale pre-training data into new domains, including metabolic health and detailed sleep analysis, to address the messy reality of consumer health devices. We envision SensorLM leading to a future generation of digital health coaches, clinical monitoring tools, and personal wellness applications that can offer advice through natural-language query, interaction, and generation. Any future products or applications inspired by this foundational research may require further assessment of any clinical and regulatory considerations that may be applicable.