Title: | Tools for Population Health Management Analytics |
---|---|
Description: | Created for population health analytics and monitoring. The functions in this package work best when working with patient level Master Patient Index-like datasets . Built to be used by NHS bodies and other health service providers. |
Authors: | Asif Laldin [aut, cre],
Gary Hutson [aut] |
Maintainer: | Asif Laldin <[email protected]> |
License: | AGPL (>= 3) |
Version: | 1.0.2 |
Built: | 2025-02-12 04:26:47 UTC |
Source: | https://github.com/ald0405/sangertools |
Age Band Creation: Create a new column of 5 year Age Bands from an integer column
age_bandizer(df, Age_col)
age_bandizer(df, Age_col)
df |
a tidy dataframe in standard Master Patient Index format ie SangerTools::PopHealthData |
Age_col |
a integer column within @param df NAs must be removed or imputed prior to running this function |
A dataframe with width ncol(df)+1, new column will be named Ageband and will be a factor with levels defined
library(SangerTools) library(dplyr) health_data <- SangerTools::PopHealthData
library(SangerTools) library(dplyr) health_data <- SangerTools::PopHealthData
An alternative age banding function that allows users greater flexibility for defining band size. This function utilises Base R standard evaluation. The function currently supports band size of 2, 5, 10 & 20. The input,column, Age_col should be numeric and must not contain NAs; if either of these conditions is violated the function will terminate.
age_bandizer_2(df, Age_col, Age_band_size = 5)
age_bandizer_2(df, Age_col, Age_band_size = 5)
df |
A dataframe with a numerical column denoting Age. |
Age_col |
A numerical column within 'df'; passed with quotation marks. |
Age_band_size |
The size of the Age band to use. Defaults to 5; will take values 2,5,10,20. |
A dataframe containing a new column 'Ageband' which has factor levels defined.
## Not run: library(SangerTools) df <- data.frame(Age = sample(x = 0:120, size = 100, replace = TRUE)) df_agebanded <- age_bandizer_2( df = df, Age_col = "Age", Age_band = 5 ) print(df_agebanded) ## End(Not run)
## Not run: library(SangerTools) df <- data.frame(Age = sample(x = 0:120, size = 100, replace = TRUE)) df_agebanded <- age_bandizer_2( df = df, Age_col = "Age", Age_band = 5 ) print(df_agebanded) ## End(Not run)
Create a ggplot2 column chart of categorical variables with labels, in ascending order.
The plot will be customised using the provided theme theme_sanger
, y-axis labels will have a comma for every third integer value.
If the column provided to 'grouping_var' has more than approximately 5 values, you may need to consider
rotating x axis labels using theme
A comprehensive explanation of ggplot2 customisation is available here
categorical_col_chart(df, grouping_var)
categorical_col_chart(df, grouping_var)
df |
A dataframe with categorical variables |
grouping_var |
a categorical variable by which to group the count by |
a ggplot2 object
library(SangerTools) library(dplyr) library(ggplot2) # Group by Age Band health_data <- SangerTools::PopHealthData health_data %>% dplyr::filter(Smoker == 1) %>% SangerTools::categorical_col_chart(AgeBand) + labs( title = "Smoking Population by Age Band", subtitle = "Majority of Smokers are Working Aged ", x = NULL, y = "Patient Number" )
library(SangerTools) library(dplyr) library(ggplot2) # Group by Age Band health_data <- SangerTools::PopHealthData health_data %>% dplyr::filter(Smoker == 1) %>% SangerTools::categorical_col_chart(AgeBand) + labs( title = "Smoking Population by Age Band", subtitle = "Majority of Smokers are Working Aged ", x = NULL, y = "Patient Number" )
Population Health Management commonly leads practitioners to identify a cohort that will have an intervention applied. As a rule of thumb most analysts will work with pseudonymised data sets. For targeted interventions patients require re-identification; this process is generally carried out by a third party organisation. As third party organisations work with many health care providers they have a strict set of requirements. This has been based around SW CSU's required formatting.
cohort_processing( df, Split_by, path, prefix = "DSCRO", com_code = "11M", date_format = "%Y%m%d", suffix = "_REID_V01" )
cohort_processing( df, Split_by, path, prefix = "DSCRO", com_code = "11M", date_format = "%Y%m%d", suffix = "_REID_V01" )
df |
a tidy dataframe in standard Master Patient Index format ie SangerTools::PopHealthData. |
Split_by |
A column within df that will be used to split the patients and will also appear in the file name. Ideally should be a health organisation code such as GP Practice Code or Hospital Trust Code. Should only have alpha-numeric values |
path |
A file path to which the CSV files will be written |
prefix |
File name prefix, default is "DSCRO" See more here: NHS DSCRO |
com_code |
Commissioner Code, default is "11M"; Gloucestershire. |
date_format |
A date format passed internally to 'format(Sys.Date())'; will form part of file name to denote date of generation. You can read more about date formatting in R from R lang |
suffix |
A file name suffix, default is "_REID_V01", To be left as blank use "", without spaces. |
n number of CSV files written to the location specified by path argument.
Calculate the crude prevalence of a health condition from a Master Patient Index like dataset
crude_rates(df, Condition, ...)
crude_rates(df, Condition, ...)
df |
a tidy dataframe in standard Master Patient Index format ie SangerTools::PopHealthData |
Condition |
A Health condition flag denoted by 1 & 0; where 1 denotes the patient being positive for the health condition |
... |
Variables used to standardise by; Must always have Ageband, additional variables are optional |
a tibble with Crude Prevalence Rates(Rate per 1,000) for each value included in ...
library(SangerTools) library(dplyr) health_data <- SangerTools::PopHealthData glimpse(health_data) # Generate crude prevalene rate stats crude_prevalence <- SangerTools::crude_rates(health_data, Diabetes, Locality) print(crude_prevalence)
library(SangerTools) library(dplyr) health_data <- SangerTools::PopHealthData glimpse(health_data) # Generate crude prevalene rate stats crude_prevalence <- SangerTools::crude_rates(health_data, Diabetes, Locality) print(crude_prevalence)
DataFrame to SQL; Write your DataFrame or Tibble directly to SQL from R This wrapper function allows for the easy movement of your computed results in R to a SQL Database for saving. The function uses a ODBC driver to establish a connection. You will need to select a Database that your user has write-access to. The user credentials are the same as your OS login details; as such this function will most likely only work from you work computer.
df_to_sql(df, driver, server, database, sql_table_name, overwrite = FALSE, ...)
df_to_sql(df, driver, server, database, sql_table_name, overwrite = FALSE, ...)
df |
A 'dataFrame' or 'tibble' ie PopHealthData. |
driver |
A driver for database ie "SQL Server"; must be passed in quotation. |
server |
The unique name of your database server; must be passed in quotation. |
database |
The name of the database to which you will write 'df'; must be passed in quotation. |
sql_table_name |
The name that 'df' will be referred to in SQL database; must be passed in quotation. |
overwrite |
If there is a SQL table with the same name whether it will be overwritten; defaults to FALSE. |
... |
Function forwarding for additional functionality. |
A message confirming that a new table has been created in a SQL 'database'.
## Not run: library(odbc) library(DBI) health_data <- SangerTools::PopHealthData df_to_sql( df = health_data, driver = "SQL SERVER", database = "DATABASE", sql_table_name = "New Table Name", overwrite = FALSE ) ## End(Not run)
## Not run: library(odbc) library(DBI) health_data <- SangerTools::PopHealthData df_to_sql( df = health_data, driver = "SQL SERVER", database = "DATABASE", sql_table_name = "New Table Name", overwrite = FALSE ) ## End(Not run)
This function copies a data frame or tibble to your clipboard in a format that allows for a simple paste into excel whilst maintaining column and row structure. By default row_names has been set to FALSE.
excel_clip(df, row_names = FALSE, col_names = TRUE, ...)
excel_clip(df, row_names = FALSE, col_names = TRUE, ...)
df |
A dataframe or tibble |
row_names |
Set to FALSE for row.names not to be included |
col_names |
Set to TRUE for col.names to be included |
... |
function forwarding for additional write.table functionality |
a data frame copied to your clipboard
A fabricated Master Patient Index (MPI) inspired by Gloucestershire's population to be used with functions included in SangerTools
master_patient_index
master_patient_index
A tibble with 10,000 rows and 11 variables:
A Pseudonymised NHS Patient Identifier
The identifiable sex of the patient
Health Condition Flag: 1 denotes if the patient is a smoker
Health Condition Flag: 1 denotes if the patient has diabetes
Health Condition Flag: 1 denotes if the patient has dementia
Health Condition Flag: 1 denotes if the patient is Obese
Age of the patient
The decile of indices of multiple deprivation: https://www.gov.uk/government/statistics/english-indices-of-deprivation-2019
The identifiable ethnicity of the patient
The region where the patient lives - sampled from Gloucestershire Clinical Commissioning Group
The network of General Practioners that the patient is registerd with - sampled from Gloucestershire Clinical Commissioning Group
Generated by Asif Laldin [email protected], Feb-2022
library(dplyr) data(master_patient_index) # Convert diabetes data to factor' master_patient_index %>% glimpse()
library(dplyr) data(master_patient_index) # Convert diabetes data to factor' master_patient_index %>% glimpse()
This function reads multiple CSVs in a directory must be same structure. This function reads multiple excel files into R after which all files are aggregated into a single data frame.
There are assumptions about they underlying files:
All files must have column names for each column (The function will fail without this; later versions will amend this)
All files have the same number of columns
All files have the same column names
All files should have data starting from the same row number
All relevant data is stored in the same sheet in each of the files
multiple_csv_reader(file_path, sheet = 1, rows_to_skip = 0, col_names = TRUE)
multiple_csv_reader(file_path, sheet = 1, rows_to_skip = 0, col_names = TRUE)
file_path |
The Directory in which the files are located |
sheet |
Sheet to read. Either a string (the name of a sheet), or an integer (the position of the sheet). Defaults to the first sheet |
rows_to_skip |
The number of rows from the top to be excluded |
col_names |
If columns are named; defaults to TRUE |
a data frame object full of file paths
library(SangerTools) file_path <- "my_file_path_where_csvs_are_stored" if (length(SangerTools::multiple_csv_reader(file_path)) == 0) { message("This won't work without changing the variable input to a local file path with CSVs in") }
library(SangerTools) file_path <- "my_file_path_where_csvs_are_stored" if (length(SangerTools::multiple_csv_reader(file_path)) == 0) { message("This won't work without changing the variable input to a local file path with CSVs in") }
This function reads multiple excel files into R after which all files are aggregated into a single data frame.
There are assumptions about they underlying files:
All files must have column names for each column (The function will fail without this; later versions will amend this)
All files have the same number of columns
All files have the same column names
All files should have data starting from the same row number
All relevant data is stored in the same sheet in each of the files
To understand more about the underlying function that 'multiple_excel_reader' wraps around Click Here
multiple_excel_reader( file_path, pattern = "*.xlsx", sheet = 1, rows_to_skip = 0, col_names = TRUE )
multiple_excel_reader( file_path, pattern = "*.xlsx", sheet = 1, rows_to_skip = 0, col_names = TRUE )
file_path |
The Directory in which the files are located |
pattern |
The file extension of the files of which you are going to read. Defaults to "*.xlsx" |
sheet |
Sheet to read. Either a string (the name of a sheet), or an integer (the position of the sheet). Defaults to the first sheet |
rows_to_skip |
The number of rows from the top to be excluded |
col_names |
A boolean value to determine if column headers name are present in files. Currently only accepts TRUE |
a data frame object full of file paths
## Not run: combined_excel_files <- multiple_excel_reader("Inputs/", 1, TRUE) ## End(Not run)
## Not run: combined_excel_files <- multiple_excel_reader("Inputs/", 1, TRUE) ## End(Not run)
Population Health NHS data to use with the package and allows the calculation of the various metrics.
PopHealthData
PopHealthData
A small dataset with 1000 observations (rows) and 8 columns, as described hereunder:
The identifiable sex of the patient
Indicates if the patient is a smoker
Flag to indicate if patient has a type of diabetes
The age of the patient when they came into contact with the service
The decile of indices of multiple deprivation: https://www.gov.uk/government/statistics/english-indices-of-deprivation-2019
The identifiable ethnicity of the patient
The region where the patient lives - sampled from Gloucestershire Clinical Commissioning Group
The primary care network of the patient
This anonymous function allows you to apply the Sanger Theme colours to your ggplot2 plot
scale_fill_sanger()
scale_fill_sanger()
A custom colour filled ggplot2 plot
library(SangerTools) library(dplyr) library(ggplot2) # Group by Age Band health_data <- SangerTools::PopHealthData health_data %>% dplyr::filter(Smoker == 1) %>% SangerTools::categorical_col_chart(AgeBand) + labs( title = "Smoking Population by Age Band", subtitle = "Majority of Smokers are Working Aged ", x = NULL, y = "Patient Number" )+ scale_fill_sanger()
library(SangerTools) library(dplyr) library(ggplot2) # Group by Age Band health_data <- SangerTools::PopHealthData health_data %>% dplyr::filter(Smoker == 1) %>% SangerTools::categorical_col_chart(AgeBand) + labs( title = "Smoking Population by Age Band", subtitle = "Majority of Smokers are Working Aged ", x = NULL, y = "Patient Number" )+ scale_fill_sanger()
Displays a brand colour palette for showing the hex codes associated with brand
show_brand_palette()
show_brand_palette()
a Base R plot object
library(scales) library(SangerTools) show_brand_palette()
library(scales) library(SangerTools) show_brand_palette()
Displays extended brand colour palette for charting
show_extended_palette()
show_extended_palette()
a Base R plot object
library(scales) library(SangerTools) show_extended_palette()
library(scales) library(SangerTools) show_extended_palette()
A simpler alternative to cohort_processing
. Will split a data frame
and save as a csv
split_and_save(df, Split_by, path, prefix = NULL)
split_and_save(df, Split_by, path, prefix = NULL)
df |
A 'dataFrame' or 'tibble' ie PopHealthData. |
Split_by |
A column within df that will be used to split the patients and will also appear in the file name. Ideally should be a health organisation code such as GP Practice Code or Hospital Trust Code. Should only have alpha-numeric values |
path |
A file path to which the CSV files will be written |
prefix |
File name prefix |
n number of CSV files written to the location specified by path argument.
## Not run: split_and_save( df = pseudo_data, Split_by = "Locality", file_path = "Inputs/", prefix = NULL ) ## End(Not run)
## Not run: split_and_save( df = pseudo_data, Split_by = "Locality", file_path = "Inputs/", prefix = NULL ) ## End(Not run)
Standardisation will be performed
for all unique values in the column passed to 'split_by'. If input data frame does not contain age bands
or age bands are not of class factor, it is recommended to use age_bandizer
or age_bandizer_2
.
After the function has run, the output can be copied using excel_clip
or written to a database using df_to_sql
.
Alternatively, if you are interested in seeing the effects of age confounding; consider joining the outputs of this function with the output from crude_rates
using a left_join
standardised_rates_df( df, Split_by, Condition, Population_Standard, Granular = FALSE, ... )
standardised_rates_df( df, Split_by, Condition, Population_Standard, Granular = FALSE, ... )
df |
a tidy data frame in standard Master Patient Index format ie SangerTools::PopHealthData. |
Split_by |
A column name within df for which the standardised rates will be calculated for. |
Condition |
A Health condition flag denoted by 1 & 0; where 1 denotes the patient being positive for the health condition. |
Population_Standard |
Population Standard Weight used for Standardising; default set to NULL; which denotes use of Age Structure of df. |
Granular |
Takes a boolean value. If set to TRUE will output a tibble with Standardised Rates using values provided in 'Split_col' and '...'By default is set to FALSE. |
... |
Variables used to standardise by; Must always have Age band for age standardisation, additional variables are optional and should be passed separated by commas. |
A tibble containing standardised Prevalence Rates by specified group.
library(SangerTools) health_data <- SangerTools::age_bandizer(df = SangerTools::master_patient_index, Age_col=Age) df_rates <- standardised_rates_df( df = health_data, Split_by = Locality, Condition = Diabetes, Population_Standard = NULL, Granular = TRUE, Ageband ) print(df_rates)
library(SangerTools) health_data <- SangerTools::age_bandizer(df = SangerTools::master_patient_index, Age_col=Age) df_rates <- standardised_rates_df( df = health_data, Split_by = Locality, Condition = Diabetes, Population_Standard = NULL, Granular = TRUE, Ageband ) print(df_rates)
A customised ggplot2 theme for the SangerTools package
theme_sanger()
theme_sanger()
A customised ggplot2 plot
library(SangerTools) library(ggthemes) library(ggplot2) library(ggtext) categorical_col_chart(SangerTools::PopHealthData, Locality) + theme_sanger()+ labs(title = "Categorical Column Chart", x = "Locality", y = "Number of Patients")+ scale_fill_sanger()
library(SangerTools) library(ggthemes) library(ggplot2) library(ggtext) categorical_col_chart(SangerTools::PopHealthData, Locality) + theme_sanger()+ labs(title = "Categorical Column Chart", x = "Locality", y = "Number of Patients")+ scale_fill_sanger()
Data is taken from ONS and is split into 5 year age band
uk_pop_standard
uk_pop_standard
A tibble with 29 rows and 2 variables:
dbl Year price was recorded
5 Year age band for population
https://www.ons.gov.uk/peoplepopulationandcommunity/populationandmigration/populationestimates