• DMP ID: 10.48321/D1BK5T
  • Version: 21 Apr 2023

This page describes a data management plan written for the University of Arizona (arizona.edu) using the DMPTool.

Using natural language processing to determine predictors of healthy diet and physical activity behavior change in ovarian cancer survivors

Contributors to this project

Project details

  • Research domain: Health sciences
  • Project Start: January 01, 2021
  • Project End: December 31, 2022
  • Created: February 15, 2022
  • Modified: April 21, 2023
  • Ethical issues related to data that this DMP describes? no

Citation

  • When citing this DMP use:
    Damian Yukio Romero Diaz. (2022). "Using natural language processing to determine predictors of healthy diet and physical activity behavior change in ovarian cancer survivors" [Data Management Plan]. DMPHub. https://doi.org/10.48321/D1BK5T
  • When connecting this DMP to related project outputs (such as datasets) use the ID:
    https://doi.org/10.48321/D1BK5T

Funding status and sources for this project

Project description

  • Cancer survivors are a growing population in the United States; more than 16 million currently live in the US, and by 2030 this number is expected to exceed 22 million. It is estimated that more than 50 percent of new cancer cases could be eliminated through a combination of healthy behaviors (e.g., physical activity and healthy diet), and cancer survivors are at high risk of developing new and recurrent cancer. Unfortunately, a significant percentage of cancer survivors do not meet the cancer preventive guidelines for healthy diet and physical activity. In the past few decades, a variety of telephone-based lifestyle interventions have demonstrated effectiveness in helping survivors meet cancer preventive guidelines; however, these trials are labor-intensive and expensive to deliver, limiting their potential for broad dissemination. We propose to address this hurdle by taking advantage of recent advances in artificial intelligence to reduce the cost and maximize the impact of these much-needed interventions. Machine learning (ML) and natural language processing (NLP) are analytical techniques that automatically learn from direct and indirect patterns in data. We propose to use machine-learned algorithms to analyze speech to aid in predicting who may be at risk of poor adoption of healthy lifestyle behaviors. These speech data will come from the Lifestyle Intervention for Ovarian cancer Enhanced Survival (LIVES) study, a telephone-based lifestyle intervention testing whether a diet low in fat and high in vegetables, fruit, and fiber, coupled with increased physical activity, will increase time to disease progression in 1200 ovarian cancer survivors who have recently completed treatment, as compared to an attention control. Intervention coaches employed motivational interviewing to elicit behavior change, and all calls in the LIVES trial were recorded, along with repeat assessments of diet, physical activity, and patient-reported and clinical outcomes.
We will use this existing and robust longitudinal data set, which pairs conversational speech data with explicit outcomes, to achieve the following objectives. 1) Develop an ML model to identify patterns in the interactions between coaches and their participants that signal a likelihood of optimal behavior change in diet and physical activity given the comprehensive LIVES data set, utilizing voice-recorded calls, demographics, and clinical and patient-reported outcomes collected at multiple time points. 2) Decompose the ML model in terms of “intervenable factors”, so that participant affect, coach adherence to the intervention protocol, and other important aspects of the interaction can be individually evaluated for their role in predicting behavior change, as well as adherence to intervention goals. This decomposition will directly enable early and targeted adjustments to intervention plans for individuals, reducing the cost and increasing the efficacy of intervention strategies. ML and NLP methods can produce models that listen to a coaching conversation and automatically predict whether it will result in positive change towards enactment of healthy lifestyle behaviors. Such predictive models would enable more efficient, effective, and individualized lifestyle interventions, the first step towards personalized behavioral medicine.

Planned outputs

Python code for the creation of machine-learned models

We will make accessible all the computer code used to generate the machine-learned models, together with the documentation needed to use the code and models. Researchers who have access to similar data (annotated patient telephone coaching recordings) can use this code to create machine-learned models that automatically annotate such data and predict patient outcomes of the same kind. The computer code will be distributed as standard Python (.py) files encoded in UTF-8.

  • Format: Software
  • Anticipated volume: unspecified
  • Release timeline: December 31, 2022
  • Intended repository: ReDATA
  • License for reuse: Apache License 2.0

Machine-learned models for data annotation

We expect to release ~6 machine-learned models for data annotation in one of two open formats (to be determined): PyTorch or TensorFlow. These models will help researchers automatically annotate similar data. Although the resulting annotations may not be completely accurate, they are provided to support future researchers' language data annotation efforts, which may reduce the cost of their projects. We expect the models to perform the following functions: divide speaker turns according to raw audio from telephone motivational interviews; divide speaker turns based on text transcriptions and time-stamps derived from motivational interviews; detect linguistic constructs such as questions based on text transcriptions; and profile a speaker's personality based on raw audio from motivational interviews.

  • Format: Software
  • Anticipated volume: unspecified
  • Release timeline: December 31, 2022
  • Intended repository: ReDATA
  • License for reuse: Apache License 2.0
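
To give a concrete sense of one annotation task named for these models (detecting questions in transcribed speaker turns), the following is a minimal rule-based sketch in Python. It is not the project's method: the released models are described as machine-learned, whereas this is a simple heuristic baseline. The `Turn` record, the `QUESTION_OPENERS` word list, and the function names are hypothetical and are not taken from the LIVES data or code.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Turn:
    """One speaker turn in a transcript (hypothetical layout, assumed for illustration)."""
    speaker: str
    start: float  # turn start time in seconds
    end: float    # turn end time in seconds
    text: str


# Assumed list of words that commonly open an English question.
QUESTION_OPENERS = (
    "what", "how", "why", "when", "where", "who", "which",
    "do", "does", "did", "can", "could", "would", "are", "is", "have",
)


def is_question(text: str) -> bool:
    """Heuristic: a turn is a question if it ends in '?' or starts with a question word."""
    stripped = text.strip().lower()
    if not stripped:
        return False
    if stripped.endswith("?"):
        return True
    first_word = stripped.split()[0].strip(",.!")
    return first_word in QUESTION_OPENERS


def annotate(turns: List[Turn]) -> List[Tuple[str, float, bool]]:
    """Label each turn with (speaker, start time, is-question flag)."""
    return [(t.speaker, t.start, is_question(t.text)) for t in turns]


if __name__ == "__main__":
    example = [
        Turn("coach", 0.0, 3.2, "How did the walking go this week"),
        Turn("participant", 3.4, 6.0, "I walked three times."),
    ]
    for label in annotate(example):
        print(label)
```

A heuristic like this could serve as a baseline against which a learned annotation model is compared; transcripts produced by speech recognition often lack punctuation, which is one reason the sketch does not rely on the question mark alone.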

Machine-learned models for outcome prediction

We expect to release ~4 machine-learned models for outcome prediction in one of two open formats (to be determined): PyTorch or TensorFlow. These models will help researchers automatically predict dietary outcomes for patients undergoing motivational interview interventions. There will be at least one general model that will identify patterns in the interactions between coaches and their participants that signal a likelihood of optimal behavior change in diet and physical activity given voice-recorded motivational interviews. There will also be at least two models of different "intervenable factors" that will identify participant affect, coach adherence to the intervention protocol, and other important aspects of the interaction, so that each can be individually evaluated for its role in predicting behavior change, as well as adherence to intervention goals.

  • Format: Software
  • Anticipated volume: unspecified
  • Release timeline: December 31, 2022
  • Intended repository: ReDATA
  • License for reuse: Apache License 2.0

Other works associated with this research project