Introduction

Welcome to this project, a comparative journey through data wrangling and cleaning with two powerful tools: RStudio and Tableau Prep. Our mission is to clean and analyze the “bc-stats_2018-new-homes-data_tosend.xlsx” Excel file while exploring the similarities and differences between the two platforms.

By working side by side with these tools, we will see the distinct approaches, techniques, and functionalities each one uses to transform raw data into a clean, structured format. Comparing their workflows and outcomes will reveal the strengths and limitations of each platform, helping us make an informed decision about which tool best suits a given data manipulation task.

So fasten your seatbelts: let’s dive into the world of data transformation and discover the nuances and possibilities of these two powerful tools!

Data Exploration

Let’s dive into the data. In this section, we list the sheet names in the Excel file so that, in the steps that follow, we can import each sheet into an object with a more intuitive name.

# Loading the packages used throughout this analysis
library(readxl)     # for excel_sheets() and read_excel()
library(tidyverse)  # for pivot_longer(), left_join(), and ggplot2

# Specifying the path and filename of the Excel file
housing <- "bc-stats_2018-new-homes-data_tosend.xlsx"

# Listing the sheet names in the Excel file
excel_sheets(housing)
## [1] "Notes"                "Regional Districts"   "Single Detached"     
## [4] "Multi Unit Homes"     "Purpose Built Rental"

Importing a sheet

To import a specific sheet from the Excel file and examine its structure, we can use the read_excel() function with the sheet argument set to the desired sheet name:

# Importing the "Single Detached" sheet from the Excel file
read_excel(housing, sheet = "Single Detached")
## New names:
## • `` -> `...2`
## • `` -> `...3`
## • `` -> `...4`
## • `` -> `...5`
## • `` -> `...6`
## # A tibble: 162 × 6
##    Single Detached Homes by Community, 2016, 201…¹ ...2  ...3  ...4  ...5  ...6 
##    <chr>                                           <chr> <chr> <chr> <lgl> <chr>
##  1 Community                                       Sing… <NA>  <NA>  NA    Note:
##  2 <NA>                                            2016  2017  2018  NA    * Fo…
##  3 100 Mile House                                  25    16    19    NA    **   
##  4 Abbotsford                                      336   252   270   NA    Cent…
##  5 Alert Bay                                       *     *     *     NA    Delt…
##  6 Anmore                                          18    18    6     NA    Kent…
##  7 Armstrong                                       35    43    34    NA    Lang…
##  8 Ashcroft                                        *     *     *     NA    Nort…
##  9 Barriere                                        *     *     12    NA    Nort…
## 10 Belcarra                                        9     *     5     NA    West…
## # … with 152 more rows, and abbreviated variable name
## #   ¹​`Single Detached Homes by Community, 2016, 2017, 2018`

Reading the “Single Detached” sheet with the defaults reveals the layout of the data: the sheet title is picked up as a column name, the real headers sit in the first two rows, and a notes column trails the data. Understanding this structure is crucial for importing the dataset cleanly. Let’s proceed to the next section to transform the data in RStudio and Tableau Prep side by side.

To address the file’s multi-row header and its use of “*” in place of missing values, we can read a specific range from each sheet and assign appropriate column names and types.

# For Single Detached sheet
single_detached <-
  read_excel(
    housing,
    sheet = "Single Detached",
    range = "B5:E164",
    col_names = c("community", "2016", "2017", "2018"),
    col_types = c("guess", "numeric", "numeric", "numeric"),
    na = "*"
  )

single_detached
## # A tibble: 160 × 4
##    community      `2016` `2017` `2018`
##    <chr>           <dbl>  <dbl>  <dbl>
##  1 100 Mile House     25     16     19
##  2 Abbotsford        336    252    270
##  3 Alert Bay          NA     NA     NA
##  4 Anmore             18     18      6
##  5 Armstrong          35     43     34
##  6 Ashcroft           NA     NA     NA
##  7 Barriere           NA     NA     12
##  8 Belcarra            9     NA      5
##  9 Bowen Island       28     30     31
## 10 Burnaby           371    315    254
## # … with 150 more rows
# For Multi Unit Homes sheet
multi_unit <-
  read_excel(
    housing,
    sheet = "Multi Unit Homes",
    range = "B5:E164",
    col_names = c("community", "2016", "2017", "2018"),
    col_types = c("guess", "numeric", "numeric", "numeric"),
    na = "*"
  )

multi_unit
## # A tibble: 160 × 4
##    community      `2016` `2017` `2018`
##    <chr>           <dbl>  <dbl>  <dbl>
##  1 100 Mile House     NA      6     NA
##  2 Abbotsford        617    654    579
##  3 Alert Bay          NA     NA     NA
##  4 Anmore             NA     NA     NA
##  5 Armstrong          NA     10     18
##  6 Ashcroft           NA     NA     NA
##  7 Barriere           NA     NA     NA
##  8 Belcarra           NA     NA     NA
##  9 Bowen Island       NA      6     NA
## 10 Burnaby          4322   3044   2347
## # … with 150 more rows
# For Purpose Built Rental sheet
purpose_rental <-
  read_excel(
    housing,
    sheet = "Purpose Built Rental",
    range = "B5:E164",
    col_names = c("community", "2016", "2017", "2018"),
    col_types = c("guess", "numeric", "numeric", "numeric"),
    na = "*"
  )

purpose_rental
## # A tibble: 160 × 4
##    community      `2016` `2017` `2018`
##    <chr>           <dbl>  <dbl>  <dbl>
##  1 100 Mile House     NA     NA     NA
##  2 Abbotsford        327     NA    428
##  3 Alert Bay          NA     NA     NA
##  4 Anmore             NA     NA     NA
##  5 Armstrong          NA     NA     NA
##  6 Ashcroft           NA     NA     NA
##  7 Barriere           NA     NA     NA
##  8 Belcarra           NA     NA     NA
##  9 Bowen Island       NA     NA     NA
## 10 Burnaby            NA    237    114
## # … with 150 more rows

By specifying the range, column names, column types, and handling the “*” values as NAs, we ensure that the data is imported correctly and ready for exploration and analysis.
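The three read_excel() calls above differ only in the sheet name, so they can be collapsed into a single helper applied over the sheet names. Here is a minimal sketch, assuming the same `housing` path and the "B5:E164" range used above; `read_homes_sheet` and `homes_list` are names introduced for illustration:

```r
library(readxl)
library(purrr)

# Read one sheet of the workbook with the shared range, names, and NA marker
read_homes_sheet <- function(sheet) {
  read_excel(
    housing,
    sheet = sheet,
    range = "B5:E164",
    col_names = c("community", "2016", "2017", "2018"),
    col_types = c("guess", "numeric", "numeric", "numeric"),
    na = "*"
  )
}

# A named list of the three tibbles, keyed by sheet name
sheets <- c("Single Detached", "Multi Unit Homes", "Purpose Built Rental")
homes_list <- map(set_names(sheets), read_homes_sheet)
```

This produces the same three tibbles as the separate assignments above, and any future change to the range or column types only needs to be made once.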

Exploring the Data with Tableau Prep

Discover how Tableau Prep handles the data import phase. We’ll explore the steps involved in bringing the sheets into Tableau Prep for further analysis and visualization.

Click on the image below to view it in full resolution.

In this step, we tackle essential data cleaning tasks, including column renaming, filtering out null values, and removing unwanted headers and punctuation marks. By applying these transformations, we ensure that the data is in a clean and structured format for further analysis.

Click on the image below to view it in full resolution.

Efficiency and consistency are paramount when working with multiple sheets containing similar data structures. Tableau Prep empowers you to streamline your data cleaning process by applying the same cleaning steps across multiple sheets effortlessly.

By leveraging the power of Tableau Prep’s intuitive interface, you can copy the cleaning steps from one sheet and paste them into the others, ensuring consistency in data cleaning operations. This feature eliminates the need for repetitive manual work, saving time and reducing the risk of errors.

Take a look at the image below to see how the cleaning steps are seamlessly applied to multiple sheets:

Creating a Tidy Data Structure

To facilitate further analysis and visualization, it’s essential to transform the data into a tidy structure. By reshaping the data, we can organize it in a consistent and standardized format. In this section, we will convert the three sheets into a tidy format using the pivot_longer() function.

Let’s take a closer look at the code:

single_detached_long <- single_detached %>%
  pivot_longer(-community, 
               names_to = "year", 
               values_to = "single")

single_detached_long
## # A tibble: 480 × 3
##    community      year  single
##    <chr>          <chr>  <dbl>
##  1 100 Mile House 2016      25
##  2 100 Mile House 2017      16
##  3 100 Mile House 2018      19
##  4 Abbotsford     2016     336
##  5 Abbotsford     2017     252
##  6 Abbotsford     2018     270
##  7 Alert Bay      2016      NA
##  8 Alert Bay      2017      NA
##  9 Alert Bay      2018      NA
## 10 Anmore         2016      18
## # … with 470 more rows
multi_unit_long <- multi_unit %>%
  pivot_longer(-community, 
               names_to = "year", 
               values_to = "multi")

multi_unit_long
## # A tibble: 480 × 3
##    community      year  multi
##    <chr>          <chr> <dbl>
##  1 100 Mile House 2016     NA
##  2 100 Mile House 2017      6
##  3 100 Mile House 2018     NA
##  4 Abbotsford     2016    617
##  5 Abbotsford     2017    654
##  6 Abbotsford     2018    579
##  7 Alert Bay      2016     NA
##  8 Alert Bay      2017     NA
##  9 Alert Bay      2018     NA
## 10 Anmore         2016     NA
## # … with 470 more rows
purpose_rental_long <- purpose_rental %>%
  pivot_longer(-community, 
               names_to = "year", 
               values_to = "rental")

purpose_rental_long
## # A tibble: 480 × 3
##    community      year  rental
##    <chr>          <chr>  <dbl>
##  1 100 Mile House 2016      NA
##  2 100 Mile House 2017      NA
##  3 100 Mile House 2018      NA
##  4 Abbotsford     2016     327
##  5 Abbotsford     2017      NA
##  6 Abbotsford     2018     428
##  7 Alert Bay      2016      NA
##  8 Alert Bay      2017      NA
##  9 Alert Bay      2018      NA
## 10 Anmore         2016      NA
## # … with 470 more rows

The code above transforms the single_detached dataset into a tidy structure by pivoting the columns (excluding the community column) into two new columns: year and single. Each row now represents a unique combination of a community, year, and the corresponding value for single detached properties.

Similarly, we apply the pivot_longer() function to the multi_unit and purpose_rental datasets to create multi_unit_long and purpose_rental_long, respectively. These datasets now follow the tidy data principles, providing a consistent and organized structure for analysis.

Take a look at the resulting datasets to observe the transformation and the tidy structure achieved.

As we move forward, this tidy data structure will enable us to easily perform various analyses, generate insightful visualizations, and extract valuable insights from the data.
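As a quick sanity check, pivot_wider() is the inverse of pivot_longer(), so widening a long table should reproduce the original columns. A small sketch on a toy tibble (the values here are illustrative, not taken from the dataset):

```r
library(tidyr)
library(tibble)

# A toy wide table with the same shape as the housing sheets
wide <- tibble(
  community = c("Anmore", "Belcarra"),
  `2016` = c(18, 9),
  `2017` = c(18, NA)
)

long <- pivot_longer(wide, -community, names_to = "year", values_to = "single")

# Widening the long table recovers the original wide layout
round_trip <- pivot_wider(long, names_from = year, values_from = single)
all.equal(as.data.frame(round_trip), as.data.frame(wide))  # TRUE
```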

Transforming the Data in Tableau Prep

To replicate the data transformation performed in RStudio’s pivot_longer() function, we will use Tableau Prep’s pivot functionality. This step allows us to reshape the data from a wide format to a long format, making it easier to analyze and visualize.

Take a closer look at the image below:

By copying the transformation process, we ensure consistency and efficiency in preparing the data for further exploration and analysis. The steps that we defined for pivoting the first sheet can be easily replicated for the other sheets, saving us time and effort.

Joining tables

In this section, we focus on combining the individual tables we obtained from the previous step. By joining these tables, we can create a consolidated view that incorporates all the relevant information previously stored in separate sheets.

Using the power of R, we leverage the left_join() function to merge the tables on their common columns, “community” and “year”. This ensures that rows from the different sources align correctly, allowing us to perform comprehensive analysis.

Take a look at the code snippet below:

new_homes_bc <- 
single_detached_long %>% 
  left_join(multi_unit_long, by = c("community", "year")) %>% 
  left_join(purpose_rental_long, by = c("community", "year")) 

new_homes_bc
## # A tibble: 480 × 5
##    community      year  single multi rental
##    <chr>          <chr>  <dbl> <dbl>  <dbl>
##  1 100 Mile House 2016      25    NA     NA
##  2 100 Mile House 2017      16     6     NA
##  3 100 Mile House 2018      19    NA     NA
##  4 Abbotsford     2016     336   617    327
##  5 Abbotsford     2017     252   654     NA
##  6 Abbotsford     2018     270   579    428
##  7 Alert Bay      2016      NA    NA     NA
##  8 Alert Bay      2017      NA    NA     NA
##  9 Alert Bay      2018      NA    NA     NA
## 10 Anmore         2016      18    NA     NA
## # … with 470 more rows

By executing this code, we create a new table named new_homes_bc that combines the data from the single detached, multi-unit homes, and purpose-built rental sheets. This consolidated table provides a holistic view of the new homes data in British Columbia.

Joining the tables enables us to perform advanced analysis and gain insights that would be challenging to achieve with individual datasets. We now have a unified dataset that lays the foundation for deeper exploration and visualization.
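Since the three long tables share the same keys, the chain of left_join() calls can also be written as a single reduce() over a list of tables — a sketch assuming the long tables created above. Note that left_join() keeps only the communities present in the first table; full_join() would keep every community that appears in any of them:

```r
library(dplyr)
library(purrr)

# Collapse the chain of joins into one reduce() over the three long tables
new_homes_bc <- reduce(
  list(single_detached_long, multi_unit_long, purpose_rental_long),
  left_join,
  by = c("community", "year")
)
```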

Joining Tables in Tableau Prep

Our first step in Tableau Prep is to join the tables containing the new homes data. By combining the information from the single detached, multi-unit homes, and purpose-built rental sheets, we create a unified dataset that will serve as the foundation for our visualization.

Refer to the image below, which showcases the join operation in Tableau Prep:

Once we have completed the necessary data preparations in Tableau Prep, we are ready to send the processed data to Tableau Desktop. Tableau Desktop is a robust data visualization and analytics software that empowers us to create interactive and engaging visualizations based on our prepared data.

The image below illustrates the seamless transition from Tableau Prep to Tableau Desktop:

By leveraging the capabilities of Tableau Desktop, we can unleash the full potential of our prepared data and craft compelling visualizations. Tableau Desktop offers a wide range of visualization options, interactive features, and customization tools to help us effectively communicate our findings and insights.

Plotting

In this section, we will create visualizations of the total number of housing units built in B.C. by municipality, focusing on the top 10 municipalities with the greatest number of homes built, sorted from most to least. The analysis could be based on a single year or on the total across all three years in the data; here we use all three.

Let’s start by plotting the data with the municipalities on the x-axis and the total number of new houses on the y-axis:

homes_chart <- new_homes_bc %>%
  filter(year %in% c("2016", "2017", "2018")) %>%
  group_by(community) %>%
  summarise(total = sum(single, multi, rental, na.rm = TRUE)) %>%
  arrange(desc(total)) %>%
  slice(1:10) %>%
  ggplot() +
  geom_col(aes(x = reorder(community, -total), y = total))

homes_chart +
  labs(title = "Top 10 Municipalities in B.C. with Highest Number of New Houses",
       subtitle = "2016-2018",
       caption = "Source: B.C. Ministry of Municipal Affairs and Housing",
       y = "Total New Houses",
       x = NULL)

In this plot, the municipalities are arranged on the x-axis based on the total number of new houses, from highest to lowest. The y-axis represents the total number of new houses. The title and subtitle provide context for the visualization, and the caption indicates the data source.

However, if you find that the labels of the municipalities are overlapping, we can swap the axes to improve readability:

homes_chart2 <- new_homes_bc %>%
  filter(year %in% c("2016", "2017", "2018")) %>%
  group_by(community) %>%
  summarise(total = sum(single, multi, rental, na.rm = TRUE)) %>%
  arrange(desc(total)) %>%
  slice(1:10) %>%
  ggplot() +
  geom_col(aes(x = total, y = reorder(community, total)))

homes_chart2 +
  labs(title = "Top 10 Municipalities in B.C. with Highest Number of New Houses",
       subtitle = "2016-2018",
       caption = "Source: B.C. Ministry of Municipal Affairs and Housing",
       x = "Total New Houses",
       y = NULL)

In this alternative plot, the total number of new houses is represented on the x-axis, while the municipalities are reordered on the y-axis by their totals, with the largest at the top. The title, subtitle, and caption remain the same, and the axis labels are swapped to match.
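A third option, rather than swapping the x and y aesthetics by hand, is to flip the coordinate system of the first chart with ggplot2’s coord_flip() — a sketch reusing the homes_chart object built earlier:

```r
library(ggplot2)

# Flip the vertical bar chart onto its side; the bars and ordering carry over
homes_chart +
  coord_flip() +
  labs(title = "Top 10 Municipalities in B.C. with Highest Number of New Houses",
       subtitle = "2016-2018",
       caption = "Source: B.C. Ministry of Municipal Affairs and Housing",
       y = "Total New Houses",
       x = NULL)
```

With coord_flip(), the `y` label still refers to the total-houses scale even though it is now drawn horizontally.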

These visualizations provide a clear understanding of the top 10 municipalities in B.C. with the highest number of new houses, allowing for easy comparison and identification of the leading regions.
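An equivalent way to compute the per-community totals is to sum the three housing-type columns row by row before aggregating — a sketch using rowSums() with across(), assuming the joined new_homes_bc table from earlier; `row_total` and `totals` are names introduced for illustration:

```r
library(dplyr)

# Sum the three housing types within each row, treating NA as zero,
# then aggregate across years for each community
totals <- new_homes_bc %>%
  mutate(row_total = rowSums(across(c(single, multi, rental)), na.rm = TRUE)) %>%
  group_by(community) %>%
  summarise(total = sum(row_total)) %>%
  arrange(desc(total))
```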

Tableau Desktop

After processing the data in Tableau Prep, we created a bar chart to visualize the results. Here is the image of the bar chart:

The bar chart provides a visual representation of the data, allowing you to explore the top 10 municipalities in B.C. with the highest number of new houses. The interactive version of the chart provides additional functionality for further analysis and exploration. You can access the interactive chart by following this link:

https://public.tableau.com/app/profile/jose.velazquez/viz/NewHomesinB_C_/NewHomesB_C_

Conclusion

By analyzing the project and comparing the workflows and outcomes of data wrangling using RStudio and Tableau Prep, we can draw several conclusions:

  1. Data Import and Cleaning: Both RStudio and Tableau Prep offer efficient ways to import and clean data from various sources. RStudio’s read_excel() function allows for flexible data import and manipulation, while Tableau Prep provides a user-friendly interface for visually exploring and transforming data.

  2. Data Transformation: RStudio’s pivot_longer() function and Tableau Prep’s pivot functionality enable us to reshape the data from wide to long format, creating a tidy structure. These tools simplify the process of organizing and standardizing data, making it easier to analyze and visualize.

  3. Consistency and Efficiency: Tableau Prep excels in handling multiple sheets with similar structures. The ability to copy cleaning steps and apply them across multiple sheets saves time and ensures consistency in data cleaning operations. This feature is especially useful when dealing with datasets that require repetitive transformations.

  4. Data Joining: RStudio’s left_join() function and Tableau Prep’s join functionality allow us to merge datasets based on common columns, creating a unified dataset for comprehensive analysis. Both tools offer powerful capabilities to combine data from different sources, facilitating deeper insights and exploration.

  5. Visualization: RStudio’s ggplot2 package and Tableau Desktop provide powerful visualization tools. With RStudio, we can create customizable and publication-ready plots using the ggplot() function. Tableau Desktop offers a rich set of interactive features, enabling the creation of visually appealing and interactive visualizations.

  6. Data Insights: Through our analysis, we can identify the top 10 municipalities in B.C. with the highest number of new houses. By sorting the data and visualizing it in a bar chart, we can easily compare the housing development across municipalities. These insights can inform decision-making processes, urban planning strategies, and resource allocation.

Overall, this project highlights the strengths and unique features of both RStudio and Tableau Prep. RStudio provides powerful data manipulation capabilities, extensive statistical analysis tools, and flexible programming options. Tableau Prep, on the other hand, offers an intuitive visual interface, streamlined data cleaning operations, and seamless integration with Tableau Desktop for advanced visualization.

By understanding the nuances and possibilities offered by these tools, we can choose the most appropriate tool based on our specific data requirements and analytical goals. Whether it’s data wrangling and analysis with RStudio or data exploration and visualization with Tableau Prep and Tableau Desktop, these tools empower us to extract valuable insights from complex datasets and make data-driven decisions.

Resources