---
title: "Demo: Data and Functions"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Demo: Data and Functions}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 7,
  fig.height = 4,
  message = FALSE,
  warning = FALSE
)
```

```{r setup}
library(SangerTools)
library(dplyr)
library(ggplot2)
```

This vignette is a single-page tour of **SangerTools**: every bundled dataset, every exported function, and a short end-to-end mini-case that chains them together. If you have just installed the package and want to confirm everything works, run this notebook top to bottom.

# Bundled datasets

SangerTools ships three fabricated population health datasets so every example in the package is reproducible without external files.

## `PopHealthData`

A small (1,000 rows) NHS-style population dataset with one row per patient.

```{r pop-health}
data("PopHealthData", package = "SangerTools")
glimpse(PopHealthData)
head(PopHealthData)
```

## `master_patient_index`

A larger (10,000 rows) fabricated Master Patient Index inspired by Gloucestershire's population. Includes a numeric `Age` column suitable for banding.

```{r mpi}
data("master_patient_index", package = "SangerTools")
glimpse(master_patient_index)
```

## `uk_pop_standard`

A 5-year age band weighting taken from ONS 2018 mid-year population estimates, ready to use as a standard population for direct standardisation.
```{r uk-pop}
data("uk_pop_standard", package = "SangerTools")
uk_pop_standard
```

# Function gallery

## Wrangling

### `age_bandizer()`: fixed 5-year bands

```{r age-bandizer}
mpi_banded <- age_bandizer(master_patient_index, Age)
mpi_banded %>%
  count(Ageband) %>%
  head()
```

### `age_bandizer_2()`: configurable band width

```{r age-bandizer-2}
ages <- data.frame(Age = sample(0:100, 30, replace = TRUE))
age_bandizer_2(ages, Age_col = "Age", Age_band_size = 10) %>%
  head()
```

### `cohort_processing()` and `split_and_save()`

Both split a data frame by an organisational identifier and write one CSV per group. Because they write to the file system, the calls are shown here without being evaluated; `split_and_save()` is then run against a temporary directory.

```{r split-helpers, eval = FALSE}
cohort_processing(
  df = PopHealthData,
  Split_by = "Locality",
  path = "outputs/"
)

split_and_save(
  df = PopHealthData,
  Split_by = "Locality",
  path = "outputs/",
  prefix = "Locality_"
)
```

A quick live demo that cleans up after itself:

```{r split-live}
tmp <- file.path(tempdir(), "sanger-demo")
dir.create(tmp, showWarnings = FALSE, recursive = TRUE)

split_and_save(
  df = PopHealthData,
  Split_by = "Locality",
  path = paste0(tmp, "/"),
  prefix = "Locality_"
)

list.files(tmp, pattern = "\\.csv$")
unlink(tmp, recursive = TRUE)
```

## Analytics

### `crude_rates()`

Crude prevalence per 1,000 patients, grouped by one or more variables.

```{r crude}
crude_rates(PopHealthData, Diabetes, Locality)
```

### `standardised_rates_df()`

Direct age-standardised prevalence using either the dataset's own age structure or a supplied population standard.
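
Direct standardisation weights each age band's rate by that band's share of a standard population. The chunk below is a minimal base-R sketch of the arithmetic with made-up figures. The data frame `toy` and its column names are hypothetical, and this is not a description of how `standardised_rates_df()` is implemented internally.

```{r direct-std-sketch}
# Hypothetical figures for illustration only; not taken from the bundled data.
toy <- data.frame(
  Ageband  = c("0-39", "40-64", "65+"),
  cases    = c(10, 60, 90),       # patients with the condition
  patients = c(1000, 800, 400),   # all patients in the band
  std_pop  = c(50, 32, 18)        # standard population size (or weight) per band
)

# Age-specific rate per 1,000 patients in each band
toy$rate_1k <- 1000 * toy$cases / toy$patients

# Normalise the standard population so the weights sum to 1
toy$weight <- toy$std_pop / sum(toy$std_pop)

# Directly standardised rate: weighted sum of the age-specific rates
standardised_rate_1k <- sum(toy$rate_1k * toy$weight)

# Crude rate for comparison: total cases over total patients
crude_rate_1k <- 1000 * sum(toy$cases) / sum(toy$patients)

c(standardised = standardised_rate_1k, crude = crude_rate_1k)
```

With these made-up figures the standardised rate (69.5 per 1,000) sits below the crude rate (about 72.7) because the standard population gives more weight to the youngest, lowest-rate band than the sample does. `standardised_rates_df()` is called on the banded data as follows: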
```{r standardised}
standardised_rates_df(
  df = mpi_banded,
  Split_by = Locality,
  Condition = Diabetes,
  Population_Standard = NULL,
  Granular = FALSE,
  Ageband
)
```

## Charts and theming

### `categorical_col_chart()` + `theme_sanger()` + `scale_fill_sanger()`

```{r chart-themed}
PopHealthData %>%
  filter(Smoker == 1) %>%
  categorical_col_chart(AgeBand) +
  labs(
    title = "Smoking population by age band",
    subtitle = "Most smokers are working-aged",
    x = NULL,
    y = "Patients"
  ) +
  scale_fill_sanger()
```

### Palette previews

```{r palettes, fig.height = 3}
show_brand_palette()
show_extended_palette()
```

## I/O

### `excel_clip()`

Copies a data frame to the system clipboard in a tab-separated layout that pastes cleanly into Excel. Windows only (it uses the `"clipboard"` device); skipped in this vignette.

```{r excel-clip, eval = FALSE}
excel_clip(PopHealthData)
```

### `df_to_sql()`

Writes a data frame to a Microsoft SQL Server table via ODBC. Requires a configured DSN and a Windows machine with integrated authentication; not runnable in a vignette build.

```{r df-to-sql, eval = FALSE}
df_to_sql(
  df = PopHealthData,
  driver = "ODBC Driver 17 for SQL Server",
  server = "your-server",
  database = "your-database",
  sql_table_name = "PopHealthData",
  overwrite = FALSE
)
```

### `multiple_csv_reader()` / `multiple_excel_reader()`

Aggregate a directory of CSVs or Excel files into one tidy frame. Pointed at an empty directory below to keep the vignette hermetic.

```{r readers}
empty <- file.path(tempdir(), "no-files")
dir.create(empty, showWarnings = FALSE)
multiple_csv_reader(paste0(empty, "/"))
unlink(empty, recursive = TRUE)
```

# Putting it together

A short end-to-end mini-case: take the Master Patient Index, band ages, compute age-standardised diabetes prevalence by locality, and plot the result with the Sanger theme. The call below passes `Population_Standard = NULL`, so the dataset's own age structure serves as the standard rather than the ONS UK standard population.
```{r e2e}
banded <- age_bandizer(master_patient_index, Age)

rates <- standardised_rates_df(
  df = banded,
  Split_by = Locality,
  Condition = Diabetes,
  Population_Standard = NULL,
  Granular = FALSE,
  Ageband
)
rates

ggplot(
  rates,
  aes(
    x = reorder(Locality, Standardised_Rate_1k),
    y = Standardised_Rate_1k,
    fill = Locality
  )
) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  labs(
    title = "Age-standardised diabetes prevalence",
    subtitle = "Per 1,000 patients, fabricated Gloucestershire MPI",
    x = NULL,
    y = "Standardised rate per 1,000"
  ) +
  theme_sanger() +
  scale_fill_sanger()
```
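
The banding step that opens the mini-case can be approximated in base R with `cut()`, which may help when checking band edges by hand. This is only a sketch under assumptions: the example ages, the `"0-4"` style labels and the open-ended `"90+"` top band are hypothetical choices, not necessarily the labels that `age_bandizer()` produces.

```{r cut-banding-sketch}
# Hypothetical ages for illustration
example_ages <- c(0, 4, 5, 17, 42, 89, 90, 103)
band_size <- 5

# Lower edges of the closed bands: 0, 5, ..., 85, plus 90 as the start of the top band
edges <- seq(0, 90, by = band_size)
lower <- head(edges, -1)
band_labels <- paste0(lower, "-", lower + band_size - 1)

# right = FALSE makes each band include its lower edge: [0, 5), [5, 10), ...
# Ages of 90 and over fall into a single open-ended band.
example_bands <- cut(
  example_ages,
  breaks = c(edges, Inf),
  labels = c(band_labels, "90+"),
  right = FALSE
)

data.frame(Age = example_ages, Ageband = example_bands)
```

Using `right = FALSE` keeps an age of exactly 5 in the `"5-9"` band rather than `"0-4"`, which is the usual convention for age bands.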