Welcome to this project where we will embark on a comparative journey of data wrangling and cleaning using two powerful tools: RStudio and Tableau Prep. Our mission is to analyze the “bc-stats_2018-new-homes-data_tosend.xlsx” Excel file and explore the similarities and differences between these two platforms.
Through this project, we aim to gain valuable insights into the data wrangling process and understand the unique features and advantages offered by RStudio and Tableau Prep. By working side-by-side with these tools, we will uncover distinct approaches, techniques, and functionalities employed to transform raw data into a clean and structured format.
By comparing the workflows, functionalities, and outcomes achieved using RStudio and Tableau Prep, we will gain a comprehensive understanding of how each tool can effectively handle data wrangling tasks. This comparative analysis will provide us with valuable insights into the strengths and limitations of each platform, allowing us to make informed decisions about which tool best suits our specific data manipulation needs.
So, fasten your seatbelts and get ready for an exciting exploration of data wrangling with RStudio and Tableau Prep. Let’s dive into the world of data transformation and discover the nuances and possibilities of these two powerful tools!
Let’s dive into the data and uncover its secrets. In this section, we will list the sheet names in the file and assign more intuitive names to each sheet to enhance our understanding.
# Specifying the path and filename of the Excel file
housing <- "bc-stats_2018-new-homes-data_tosend.xlsx"
# Listing the sheet names in the Excel file
excel_sheets(housing)
## [1] "Notes" "Regional Districts" "Single Detached"
## [4] "Multi Unit Homes" "Purpose Built Rental"
To import a specific sheet from the Excel file and examine its structure, we can use the read_excel() function with the sheet parameter set to the desired sheet name. Here’s the updated code and description:
# Importing the "Single Detached" sheet from the Excel file
read_excel(housing, sheet = "Single Detached")
## New names:
## • `` -> `...2`
## • `` -> `...3`
## • `` -> `...4`
## • `` -> `...5`
## • `` -> `...6`
## # A tibble: 162 × 6
## Single Detached Homes by Community, 2016, 201…¹ ...2 ...3 ...4 ...5 ...6
## <chr> <chr> <chr> <chr> <lgl> <chr>
## 1 Community Sing… <NA> <NA> NA Note:
## 2 <NA> 2016 2017 2018 NA * Fo…
## 3 100 Mile House 25 16 19 NA **
## 4 Abbotsford 336 252 270 NA Cent…
## 5 Alert Bay * * * NA Delt…
## 6 Anmore 18 18 6 NA Kent…
## 7 Armstrong 35 43 34 NA Lang…
## 8 Ashcroft * * * NA Nort…
## 9 Barriere * * 12 NA Nort…
## 10 Belcarra 9 * 5 NA West…
## # … with 152 more rows, and abbreviated variable name
## # ¹`Single Detached Homes by Community, 2016, 2017, 2018`
By reading the “Single Detached” sheet, we gain insight into the position and column types of the data. This step is crucial for understanding the structure of the dataset and preparing it for further analysis. Let’s proceed to the next section to explore and transform the data in RStudio and Tableau Prep simultaneously.
To address the challenges with the file’s multi-header and the presence of “*” instead of NAs, we can read a specific range from each sheet and assign appropriate column names and types.
# For Single Detached sheet
single_detached <-
read_excel(
housing,
sheet = "Single Detached",
range = "B5:E164",
col_names = c("community", "2016", "2017", "2018"),
col_types = c("guess", "numeric", "numeric", "numeric"),
na = "*"
)
single_detached
## # A tibble: 160 × 4
## community `2016` `2017` `2018`
## <chr> <dbl> <dbl> <dbl>
## 1 100 Mile House 25 16 19
## 2 Abbotsford 336 252 270
## 3 Alert Bay NA NA NA
## 4 Anmore 18 18 6
## 5 Armstrong 35 43 34
## 6 Ashcroft NA NA NA
## 7 Barriere NA NA 12
## 8 Belcarra 9 NA 5
## 9 Bowen Island 28 30 31
## 10 Burnaby 371 315 254
## # … with 150 more rows
#For Multi Unit Homes sheet
multi_unit <-
read_excel(
housing,
sheet = "Multi Unit Homes",
range = "B5:E164",
col_names = c("community", "2016", "2017", "2018"),
col_types = c("guess", "numeric", "numeric", "numeric"),
na = "*"
)
multi_unit
## # A tibble: 160 × 4
## community `2016` `2017` `2018`
## <chr> <dbl> <dbl> <dbl>
## 1 100 Mile House NA 6 NA
## 2 Abbotsford 617 654 579
## 3 Alert Bay NA NA NA
## 4 Anmore NA NA NA
## 5 Armstrong NA 10 18
## 6 Ashcroft NA NA NA
## 7 Barriere NA NA NA
## 8 Belcarra NA NA NA
## 9 Bowen Island NA 6 NA
## 10 Burnaby 4322 3044 2347
## # … with 150 more rows
# For Purpose Built Rental sheet
purpose_rental <-
read_excel(
housing,
sheet = "Purpose Built Rental",
range = "B5:E164",
col_names = c("community", "2016", "2017", "2018"),
col_types = c("guess", "numeric", "numeric", "numeric"),
na = "*"
)
purpose_rental
## # A tibble: 160 × 4
## community `2016` `2017` `2018`
## <chr> <dbl> <dbl> <dbl>
## 1 100 Mile House NA NA NA
## 2 Abbotsford 327 NA 428
## 3 Alert Bay NA NA NA
## 4 Anmore NA NA NA
## 5 Armstrong NA NA NA
## 6 Ashcroft NA NA NA
## 7 Barriere NA NA NA
## 8 Belcarra NA NA NA
## 9 Bowen Island NA NA NA
## 10 Burnaby NA 237 114
## # … with 150 more rows
By specifying the range, column names, column types, and handling the “*” values as NAs, we ensure that the data is imported correctly and ready for exploration and analysis.
Discover how Tableau Prep handles the data import phase. We’ll explore the steps involved in bringing the sheets into Tableau Prep for further analysis and visualization.
Click on the image below to view it in full resolution.
In this step, we tackle essential data cleaning tasks, including column renaming, filtering out null values, and removing unwanted headers and punctuation marks. By applying these transformations, we ensure that the data is in a clean and structured format for further analysis.
Click on the image below to view it in full resolution.
Efficiency and consistency are paramount when working with multiple sheets containing similar data structures. Tableau Prep empowers you to streamline your data cleaning process by applying the same cleaning steps across multiple sheets effortlessly.
By leveraging the power of Tableau Prep’s intuitive interface, you can copy the cleaning steps from one sheet and paste them into the others, ensuring consistency in data cleaning operations. This feature eliminates the need for repetitive manual work, saving time and reducing the risk of errors.
Take a look at the image below to see how the cleaning steps are seamlessly applied to multiple sheets:
To facilitate further analysis and visualization, it’s essential to transform the data into a tidy structure. By reshaping the data, we can organize it in a consistent and standardized format. In this section, we will convert the three sheets into a tidy format using the pivot_longer() function.
Let’s take a closer look at the code:
single_detached_long <- single_detached %>%
pivot_longer(-community,
names_to = "year",
values_to = "single")
single_detached_long
## # A tibble: 480 × 3
## community year single
## <chr> <chr> <dbl>
## 1 100 Mile House 2016 25
## 2 100 Mile House 2017 16
## 3 100 Mile House 2018 19
## 4 Abbotsford 2016 336
## 5 Abbotsford 2017 252
## 6 Abbotsford 2018 270
## 7 Alert Bay 2016 NA
## 8 Alert Bay 2017 NA
## 9 Alert Bay 2018 NA
## 10 Anmore 2016 18
## # … with 470 more rows
multi_unit_long <- multi_unit %>%
pivot_longer(-community,
names_to = "year",
values_to = "multi")
multi_unit_long
## # A tibble: 480 × 3
## community year multi
## <chr> <chr> <dbl>
## 1 100 Mile House 2016 NA
## 2 100 Mile House 2017 6
## 3 100 Mile House 2018 NA
## 4 Abbotsford 2016 617
## 5 Abbotsford 2017 654
## 6 Abbotsford 2018 579
## 7 Alert Bay 2016 NA
## 8 Alert Bay 2017 NA
## 9 Alert Bay 2018 NA
## 10 Anmore 2016 NA
## # … with 470 more rows
purpose_rental_long <- purpose_rental %>%
pivot_longer(-community,
names_to = "year",
values_to = "rental")
purpose_rental_long
## # A tibble: 480 × 3
## community year rental
## <chr> <chr> <dbl>
## 1 100 Mile House 2016 NA
## 2 100 Mile House 2017 NA
## 3 100 Mile House 2018 NA
## 4 Abbotsford 2016 327
## 5 Abbotsford 2017 NA
## 6 Abbotsford 2018 428
## 7 Alert Bay 2016 NA
## 8 Alert Bay 2017 NA
## 9 Alert Bay 2018 NA
## 10 Anmore 2016 NA
## # … with 470 more rows
The code above transforms the single_detached dataset into a tidy structure by pivoting the columns (excluding the community column) into two new columns: year and single. Each row now represents a unique combination of a community, year, and the corresponding value for single detached properties.
Similarly, we apply the pivot_longer() function to the multi_unit and purpose_rental datasets to create multi_unit_long and purpose_rental_long, respectively. These datasets now follow the tidy data principles, providing a consistent and organized structure for analysis.
Take a look at the resulting datasets to observe the transformation and the tidy structure achieved.
As we move forward, this tidy data structure will enable us to easily perform various analyses, generate insightful visualizations, and extract valuable insights from the data.
To replicate the data transformation performed in RStudio’s pivot_longer() function, we will use Tableau Prep’s pivot functionality. This step allows us to reshape the data from a wide format to a long format, making it easier to analyze and visualize.
Take a closer look at the image below:
By copying the transformation process, we ensure consistency and efficiency in preparing the data for further exploration and analysis. The steps that we defined for pivoting the first sheet can be easily replicated for the other sheets, saving us time and effort.
In this section, we focus on combining the individual tables we obtained from the previous step. By joining these tables, we can create a consolidated view that incorporates all the relevant information previously stored in separate sheets.
Using the power of R, we leverage the left_join() function to merge the tables based on common columns such as “community” and “year”. This ensures that the data from different sources align correctly, allowing us to perform comprehensive analysis.
Take a look at the code snippet below:
new_homes_bc <-
single_detached_long %>%
left_join(multi_unit_long, by = c("community", "year")) %>%
left_join(purpose_rental_long, by = c("community", "year"))
new_homes_bc
## # A tibble: 480 × 5
## community year single multi rental
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 100 Mile House 2016 25 NA NA
## 2 100 Mile House 2017 16 6 NA
## 3 100 Mile House 2018 19 NA NA
## 4 Abbotsford 2016 336 617 327
## 5 Abbotsford 2017 252 654 NA
## 6 Abbotsford 2018 270 579 428
## 7 Alert Bay 2016 NA NA NA
## 8 Alert Bay 2017 NA NA NA
## 9 Alert Bay 2018 NA NA NA
## 10 Anmore 2016 18 NA NA
## # … with 470 more rows
By executing this code, we create a new table named new_homes_bc that combines the data from the single detached, multi-unit homes, and purpose-built rental sheets. This consolidated table provides a holistic view of the new homes data in British Columbia.
Joining the tables enables us to perform advanced analysis and gain insights that would be challenging to achieve with individual datasets. We now have a unified dataset that lays the foundation for deeper exploration and visualization.
Our first step in Tableau Prep is to join the tables containing the new homes data. By combining the information from the single detached, multi-unit homes, and purpose-built rental sheets, we create a unified dataset that will serve as the foundation for our visualization.
Refer to the image below, which showcases the join operation in Tableau Prep:
Once we have completed the necessary data preparations in Tableau Prep, we are ready to send the processed data to Tableau Desktop. Tableau Desktop is a robust data visualization and analytics software that empowers us to create interactive and engaging visualizations based on our prepared data.
The image below illustrates the seamless transition from Tableau Prep to Tableau Desktop:
By leveraging the capabilities of Tableau Desktop, we can unleash the full potential of our prepared data and craft compelling visualizations. Tableau Desktop offers a wide range of visualization options, interactive features, and customization tools to help us effectively communicate our findings and insights.
In this section, we will create visualizations to showcase the total number of housing units built in B.C. by municipality. We will focus on the top 10 municipalities with the greatest number of homes built, sorted from most to least. You have the option to choose whether you want to analyze this based on a single year or the total of all three years in the data.
Let’s start by plotting the data with the municipalities on the x-axis and the total number of new houses on the y-axis:
homes_chart <- new_homes_bc %>%
filter(year %in% c("2016", "2017", "2018")) %>%
group_by(community) %>%
summarise(total = sum(multi, na.rm= TRUE, rental, na.rm= TRUE, single, na.rm= TRUE)) %>%
arrange(desc(total)) %>%
slice(1:10) %>%
ggplot() +
geom_col(aes(x = reorder(community, -total), y = total))
homes_chart +
labs(title = "Top 10 Municipalities in B.C. with Highest Number of New Houses",
subtitle = "2016-2018",
caption = "Source: B.C. Ministry of Municipal Affairs and Housing",
y = "Total New Houses",
x = NULL)
In this plot, the municipalities are arranged on the x-axis based on the total number of new houses, from highest to lowest. The y-axis represents the total number of new houses. The title and subtitle provide context for the visualization, and the caption indicates the data source.
However, if you find that the labels of the municipalities are overlapping, we can swap the axes to improve readability:
homes_chart2 <- new_homes_bc %>%
filter(year %in% c("2016", "2017", "2018")) %>%
group_by(community) %>%
summarise(total = sum(multi, na.rm= TRUE, rental, na.rm= TRUE, single, na.rm= TRUE)) %>%
arrange(desc(total)) %>%
slice(1:10) %>%
ggplot() +
geom_col(aes(x = total, y = reorder(community, -total)))
homes_chart2 +
labs(title = "Top 10 Municipalities in B.C. with Highest Number of New Houses",
subtitle = "2016-2018",
caption = "Source: B.C. Ministry of Municipal Affairs and Housing",
y = "Total New Houses",
x = NULL)
In this alternative plot, the total number of new houses is represented on the x-axis, while the municipalities are reordered on the y-axis based on the total number of new houses. The other elements, such as the title, subtitle, caption, and axis labels, remain the same.
These visualizations provide a clear understanding of the top 10 municipalities in B.C. with the highest number of new houses, allowing for easy comparison and identification of the leading regions.
After processing the data in Tableau Prep, we created a bar chart to visualize the results. Here is the image of the bar chart:
The bar chart provides a visual representation of the data, allowing you to explore the top 10 municipalities in B.C. with the highest number of new houses. The interactive version of the chart provides additional functionality for further analysis and exploration. You can access the interactive chart by following this link:
https://public.tableau.com/app/profile/jose.velazquez/viz/NewHomesinB_C_/NewHomesB_C_
By analyzing the project and comparing the workflows and outcomes of data wrangling using RStudio and Tableau Prep, we can draw several conclusions:
Data Import and Cleaning: Both RStudio and Tableau Prep offer efficient ways to import and clean data from various sources. RStudio’s read_excel() function allows for flexible data import and manipulation, while Tableau Prep provides a user-friendly interface for visually exploring and transforming data.
Data Transformation: RStudio’s pivot_longer() function and Tableau Prep’s pivot functionality enable us to reshape the data from wide to long format, creating a tidy structure. These tools simplify the process of organizing and standardizing data, making it easier to analyze and visualize.
Consistency and Efficiency: Tableau Prep excels in handling multiple sheets with similar structures. The ability to copy cleaning steps and apply them across multiple sheets saves time and ensures consistency in data cleaning operations. This feature is especially useful when dealing with datasets that require repetitive transformations.
Data Joining: RStudio’s left_join() function and Tableau Prep’s join functionality allow us to merge datasets based on common columns, creating a unified dataset for comprehensive analysis. Both tools offer powerful capabilities to combine data from different sources, facilitating deeper insights and exploration.
Visualization: RStudio’s ggplot2 package and Tableau Desktop provide powerful visualization tools. With RStudio, we can create customizable and publication-ready plots using the ggplot() function. Tableau Desktop offers a rich set of interactive features, enabling the creation of visually appealing and interactive visualizations.
Data Insights: Through our analysis, we can identify the top 10 municipalities in B.C. with the highest number of new houses. By sorting the data and visualizing it in a bar chart, we can easily compare the housing development across municipalities. These insights can inform decision-making processes, urban planning strategies, and resource allocation.
Overall, this project highlights the strengths and unique features of both RStudio and Tableau Prep. RStudio provides powerful data manipulation capabilities, extensive statistical analysis tools, and flexible programming options. Tableau Prep, on the other hand, offers an intuitive visual interface, streamlined data cleaning operations, and seamless integration with Tableau Desktop for advanced visualization.
By understanding the nuances and possibilities offered by these tools, we can choose the most appropriate tool based on our specific data requirements and analytical goals. Whether it’s data wrangling and analysis with RStudio or data exploration and visualization with Tableau Prep and Tableau Desktop, these tools empower us to extract valuable insights from complex datasets and make data-driven decisions.
Special thanks to Martin Monkman! The RStudio side of this project is part of the course Data Analytics Coding Fundamentals at UVic.
Data from: New Homes Registry (2016-2018)