Package 'SangerTools'

Title: Tools for Population Health Management Analytics
Description: Created for population health analytics and monitoring. The functions in this package work best when working with patient level Master Patient Index-like datasets . Built to be used by NHS bodies and other health service providers.
Authors: Asif Laldin [aut, cre], Gary Hutson [aut]
Maintainer: Asif Laldin <[email protected]>
License: AGPL (>= 3)
Version: 1.0.2
Built: 2025-02-12 04:26:47 UTC
Source: https://github.com/ald0405/sangertools

Help Index


Age Band Creation: Create a new column of 5 year Age Bands from an integer column

Description

Age Band Creation: Create a new column of 5 year Age Bands from an integer column

Usage

age_bandizer(df, Age_col)

Arguments

df

a tidy dataframe in standard Master Patient Index format ie SangerTools::PopHealthData

Age_col

a integer column within @param df NAs must be removed or imputed prior to running this function

Value

A dataframe with width ncol(df)+1, new column will be named Ageband and will be a factor with levels defined

Examples

library(SangerTools)
library(dplyr)
health_data <- SangerTools::PopHealthData

Create age bands from a numerical column

Description

An alternative age banding function that allows users greater flexibility for defining band size. This function utilises Base R standard evaluation. The function currently supports band size of 2, 5, 10 & 20. The input,column, Age_col should be numeric and must not contain NAs; if either of these conditions is violated the function will terminate.

Usage

age_bandizer_2(df, Age_col, Age_band_size = 5)

Arguments

df

A dataframe with a numerical column denoting Age.

Age_col

A numerical column within 'df'; passed with quotation marks.

Age_band_size

The size of the Age band to use. Defaults to 5; will take values 2,5,10,20.

Value

A dataframe containing a new column 'Ageband' which has factor levels defined.

Examples

## Not run: 
library(SangerTools)
df <- data.frame(Age = sample(x = 0:120, size = 100, replace = TRUE))
df_agebanded <- age_bandizer_2(
  df = df,
  Age_col = "Age",
  Age_band = 5
)
print(df_agebanded)

## End(Not run)

Plot Counts of Categorical Variables

Description

Create a ggplot2 column chart of categorical variables with labels, in ascending order. The plot will be customised using the provided theme theme_sanger, y-axis labels will have a comma for every third integer value. If the column provided to 'grouping_var' has more than approximately 5 values, you may need to consider rotating x axis labels using theme

A comprehensive explanation of ggplot2 customisation is available here

Usage

categorical_col_chart(df, grouping_var)

Arguments

df

A dataframe with categorical variables

grouping_var

a categorical variable by which to group the count by

Value

a ggplot2 object

Examples

library(SangerTools)
library(dplyr)
library(ggplot2)
# Group by Age Band
health_data <- SangerTools::PopHealthData
health_data %>%
  dplyr::filter(Smoker == 1) %>%
  SangerTools::categorical_col_chart(AgeBand) +
  labs(
    title = "Smoking Population by Age Band",
    subtitle = "Majority of Smokers are Working Aged ",
    x = NULL,
    y = "Patient Number"
  )

Patient Cohort Re-Identification Processing

Description

Population Health Management commonly leads practitioners to identify a cohort that will have an intervention applied. As a rule of thumb most analysts will work with pseudonymised data sets. For targeted interventions patients require re-identification; this process is generally carried out by a third party organisation. As third party organisations work with many health care providers they have a strict set of requirements. This has been based around SW CSU's required formatting.

Usage

cohort_processing(
  df,
  Split_by,
  path,
  prefix = "DSCRO",
  com_code = "11M",
  date_format = "%Y%m%d",
  suffix = "_REID_V01"
)

Arguments

df

a tidy dataframe in standard Master Patient Index format ie SangerTools::PopHealthData.

Split_by

A column within df that will be used to split the patients and will also appear in the file name. Ideally should be a health organisation code such as GP Practice Code or Hospital Trust Code. Should only have alpha-numeric values

path

A file path to which the CSV files will be written

prefix

File name prefix, default is "DSCRO" See more here: NHS DSCRO

com_code

Commissioner Code, default is "11M"; Gloucestershire.

date_format

A date format passed internally to 'format(Sys.Date())'; will form part of file name to denote date of generation. You can read more about date formatting in R from R lang

suffix

A file name suffix, default is "_REID_V01", To be left as blank use "", without spaces.

Value

n number of CSV files written to the location specified by path argument.


Crude Prevalence Calculator

Description

Calculate the crude prevalence of a health condition from a Master Patient Index like dataset

Usage

crude_rates(df, Condition, ...)

Arguments

df

a tidy dataframe in standard Master Patient Index format ie SangerTools::PopHealthData

Condition

A Health condition flag denoted by 1 & 0; where 1 denotes the patient being positive for the health condition

...

Variables used to standardise by; Must always have Ageband, additional variables are optional

Value

a tibble with Crude Prevalence Rates(Rate per 1,000) for each value included in ...

Examples

library(SangerTools)
library(dplyr)
health_data <- SangerTools::PopHealthData
glimpse(health_data)
# Generate crude prevalene rate stats
crude_prevalence <- SangerTools::crude_rates(health_data, Diabetes, Locality)
print(crude_prevalence)

Dataframe to SQL

Description

DataFrame to SQL; Write your DataFrame or Tibble directly to SQL from R This wrapper function allows for the easy movement of your computed results in R to a SQL Database for saving. The function uses a ODBC driver to establish a connection. You will need to select a Database that your user has write-access to. The user credentials are the same as your OS login details; as such this function will most likely only work from you work computer.

Usage

df_to_sql(df, driver, server, database, sql_table_name, overwrite = FALSE, ...)

Arguments

df

A 'dataFrame' or 'tibble' ie PopHealthData.

driver

A driver for database ie "SQL Server"; must be passed in quotation.

server

The unique name of your database server; must be passed in quotation.

database

The name of the database to which you will write 'df'; must be passed in quotation.

sql_table_name

The name that 'df' will be referred to in SQL database; must be passed in quotation.

overwrite

If there is a SQL table with the same name whether it will be overwritten; defaults to FALSE.

...

Function forwarding for additional functionality.

Value

A message confirming that a new table has been created in a SQL 'database'.

Examples

## Not run: 
library(odbc)
library(DBI)
health_data <- SangerTools::PopHealthData
df_to_sql(
  df = health_data,
  driver = "SQL SERVER",
  database = "DATABASE",
  sql_table_name = "New Table Name",
  overwrite = FALSE
)

## End(Not run)

Dataframe or Tibble to Clipboard

Description

This function copies a data frame or tibble to your clipboard in a format that allows for a simple paste into excel whilst maintaining column and row structure. By default row_names has been set to FALSE.

Usage

excel_clip(df, row_names = FALSE, col_names = TRUE, ...)

Arguments

df

A dataframe or tibble

row_names

Set to FALSE for row.names not to be included

col_names

Set to TRUE for col.names to be included

...

function forwarding for additional write.table functionality

Value

a data frame copied to your clipboard


Master Patient Index

Description

A fabricated Master Patient Index (MPI) inspired by Gloucestershire's population to be used with functions included in SangerTools

Usage

master_patient_index

Format

A tibble with 10,000 rows and 11 variables:

PseudoNHSNumber

A Pseudonymised NHS Patient Identifier

Sex

The identifiable sex of the patient

Smoker

Health Condition Flag: 1 denotes if the patient is a smoker

Diabetes

Health Condition Flag: 1 denotes if the patient has diabetes

Dementia

Health Condition Flag: 1 denotes if the patient has dementia

Obesity

Health Condition Flag: 1 denotes if the patient is Obese

Age

Age of the patient

IMD_Decile

The decile of indices of multiple deprivation: https://www.gov.uk/government/statistics/english-indices-of-deprivation-2019

Ethnicity

The identifiable ethnicity of the patient

Locality

The region where the patient lives - sampled from Gloucestershire Clinical Commissioning Group

PrimaryCareNetwork

The network of General Practioners that the patient is registerd with - sampled from Gloucestershire Clinical Commissioning Group

Source

Generated by Asif Laldin [email protected], Feb-2022

Examples

library(dplyr)
data(master_patient_index)
# Convert diabetes data to factor'
master_patient_index %>%
 glimpse()

Read Multiple CSV files into R

Description

This function reads multiple CSVs in a directory must be same structure. This function reads multiple excel files into R after which all files are aggregated into a single data frame.

There are assumptions about they underlying files:

  • All files must have column names for each column (The function will fail without this; later versions will amend this)

  • All files have the same number of columns

  • All files have the same column names

  • All files should have data starting from the same row number

  • All relevant data is stored in the same sheet in each of the files

Usage

multiple_csv_reader(file_path, sheet = 1, rows_to_skip = 0, col_names = TRUE)

Arguments

file_path

The Directory in which the files are located

sheet

Sheet to read. Either a string (the name of a sheet), or an integer (the position of the sheet). Defaults to the first sheet

rows_to_skip

The number of rows from the top to be excluded

col_names

If columns are named; defaults to TRUE

Value

a data frame object full of file paths

Examples

library(SangerTools)
file_path <- "my_file_path_where_csvs_are_stored"
if (length(SangerTools::multiple_csv_reader(file_path)) == 0) {
  message("This won't work without changing the variable input to a local file path with CSVs in")
}

Read Multiple Excel files into R

Description

This function reads multiple excel files into R after which all files are aggregated into a single data frame.

There are assumptions about they underlying files:

  • All files must have column names for each column (The function will fail without this; later versions will amend this)

  • All files have the same number of columns

  • All files have the same column names

  • All files should have data starting from the same row number

  • All relevant data is stored in the same sheet in each of the files

To understand more about the underlying function that 'multiple_excel_reader' wraps around Click Here

Usage

multiple_excel_reader(
  file_path,
  pattern = "*.xlsx",
  sheet = 1,
  rows_to_skip = 0,
  col_names = TRUE
)

Arguments

file_path

The Directory in which the files are located

pattern

The file extension of the files of which you are going to read. Defaults to "*.xlsx"

sheet

Sheet to read. Either a string (the name of a sheet), or an integer (the position of the sheet). Defaults to the first sheet

rows_to_skip

The number of rows from the top to be excluded

col_names

A boolean value to determine if column headers name are present in files. Currently only accepts TRUE

Value

a data frame object full of file paths

Examples

## Not run: 
combined_excel_files <- multiple_excel_reader("Inputs/", 1, TRUE)

## End(Not run)

PopHealthData - Population health data for testing functions

Description

Population Health NHS data to use with the package and allows the calculation of the various metrics.

Usage

PopHealthData

Format

A small dataset with 1000 observations (rows) and 8 columns, as described hereunder:

Sex

The identifiable sex of the patient

Smoker

Indicates if the patient is a smoker

Diabetes

Flag to indicate if patient has a type of diabetes

AgeBand

The age of the patient when they came into contact with the service

IMD_Decile

The decile of indices of multiple deprivation: https://www.gov.uk/government/statistics/english-indices-of-deprivation-2019

Ethnicity

The identifiable ethnicity of the patient

Locality

The region where the patient lives - sampled from Gloucestershire Clinical Commissioning Group

PrimaryCareNetwork

The primary care network of the patient


Branded discrete colour scale

Description

This anonymous function allows you to apply the Sanger Theme colours to your ggplot2 plot

Usage

scale_fill_sanger()

Value

A custom colour filled ggplot2 plot

Examples

library(SangerTools)
library(dplyr)
library(ggplot2)
# Group by Age Band
health_data <- SangerTools::PopHealthData
health_data %>%
  dplyr::filter(Smoker == 1) %>%
  SangerTools::categorical_col_chart(AgeBand) +
  labs(
    title = "Smoking Population by Age Band",
    subtitle = "Majority of Smokers are Working Aged ",
    x = NULL,
    y = "Patient Number"
  )+
  scale_fill_sanger()

Brand Colour Palette

Description

Displays a brand colour palette for showing the hex codes associated with brand

Usage

show_brand_palette()

Value

a Base R plot object

Examples

library(scales)
library(SangerTools)
show_brand_palette()

Extended Brand Colour Palette

Description

Displays extended brand colour palette for charting

Usage

show_extended_palette()

Value

a Base R plot object

Examples

library(scales)
library(SangerTools)
show_extended_palette()

Split & Save

Description

A simpler alternative to cohort_processing. Will split a data frame and save as a csv

Usage

split_and_save(df, Split_by, path, prefix = NULL)

Arguments

df

A 'dataFrame' or 'tibble' ie PopHealthData.

Split_by

A column within df that will be used to split the patients and will also appear in the file name. Ideally should be a health organisation code such as GP Practice Code or Hospital Trust Code. Should only have alpha-numeric values

path

A file path to which the CSV files will be written

prefix

File name prefix

Value

n number of CSV files written to the location specified by path argument.

Examples

## Not run: 
split_and_save(
 df = pseudo_data,
 Split_by = "Locality",
 file_path = "Inputs/",
 prefix = NULL
)

## End(Not run)

Standardised Prevalence Rates.

Description

Standardisation will be performed for all unique values in the column passed to 'split_by'. If input data frame does not contain age bands or age bands are not of class factor, it is recommended to use age_bandizer or age_bandizer_2. After the function has run, the output can be copied using excel_clip or written to a database using df_to_sql. Alternatively, if you are interested in seeing the effects of age confounding; consider joining the outputs of this function with the output from crude_rates using a left_join

Usage

standardised_rates_df(
  df,
  Split_by,
  Condition,
  Population_Standard,
  Granular = FALSE,
  ...
)

Arguments

df

a tidy data frame in standard Master Patient Index format ie SangerTools::PopHealthData.

Split_by

A column name within df for which the standardised rates will be calculated for.

Condition

A Health condition flag denoted by 1 & 0; where 1 denotes the patient being positive for the health condition.

Population_Standard

Population Standard Weight used for Standardising; default set to NULL; which denotes use of Age Structure of df.

Granular

Takes a boolean value. If set to TRUE will output a tibble with Standardised Rates using values provided in 'Split_col' and '...'By default is set to FALSE.

...

Variables used to standardise by; Must always have Age band for age standardisation, additional variables are optional and should be passed separated by commas.

Value

A tibble containing standardised Prevalence Rates by specified group.

Examples

library(SangerTools)
health_data <- SangerTools::age_bandizer(df = SangerTools::master_patient_index,
                                         Age_col=Age)
df_rates <- standardised_rates_df(
  df = health_data,
  Split_by = Locality,
  Condition = Diabetes,
  Population_Standard = NULL,
  Granular = TRUE,
  Ageband
)
print(df_rates)

Customised ggplot2 Theme

Description

A customised ggplot2 theme for the SangerTools package

Usage

theme_sanger()

Value

A customised ggplot2 plot

Examples

library(SangerTools)
library(ggthemes)
library(ggplot2)
library(ggtext)
categorical_col_chart(SangerTools::PopHealthData, Locality) +
  theme_sanger()+
  labs(title = "Categorical Column Chart",
  x = "Locality",
  y = "Number of Patients")+
  scale_fill_sanger()

Data set of 2018 UK Population

Description

Data is taken from ONS and is split into 5 year age band

Usage

uk_pop_standard

Format

A tibble with 29 rows and 2 variables:

UK_Population

dbl Year price was recorded

Ageband

5 Year age band for population

Source

https://www.ons.gov.uk/peoplepopulationandcommunity/populationandmigration/populationestimates