Autumn Biggie

Course Reflections

2021-11-19T00:00:00+00:00

Looking back on my previous posts, not much has changed about my view on what data scientists do. As I’ve grown in my skill set and my knowledge of data-handling techniques, ways to access data, and how to create beautiful visualizations and presentations, I’ve been discovering what it FEELS like to be a data scientist. Reading about the difference between a statistician and a data scientist is one thing, but using new tools and learning to THINK like a data scientist is another. I’m excited to keep using the tools I’ve learned and continue exploring this growing field.

Over the span of this semester, I have grown to view R not only as an intuitive language for data science, but also as a language I will use for the rest of my career. It has easy syntax, it’s well documented, and other coders are constantly sharing ways to do new things and analyze data. I think the value in R lies in the many different ways to present my work. R Markdown, GitHub Pages, and Shiny apps provide so many possibilities for creating engaging reports and interactive presentations.

Now that I’ve taken ST 558, I’ll be looking for a job that allows me to use R frequently, or even as my main coding language. In addition, I’m excited to start using R for some personal projects outside of school, as well as to present data for other statistics classes.

I’ll keep you posted on what I create!

My Project 2 Experience

2021-10-30T00:00:00+00:00

Automation, Predictive Modeling, and Partners? Oh my!

Usually, I’m not too thrilled to work with a partner on a project. I’ve always been the one to end up with the heaviest workload, often because my eye for detail allows me to catch mistakes when reviewing the group’s work. I’m always nervous that my partner will do a lousy job and I’ll have to put in twice the effort.

This project was far from that scenario! For this project, my partner Ryan Bunn and I analyzed a news popularity dataset, producing multiple files reporting analyses for each of the six different data channels: lifestyle, entertainment, social media, business, tech, and world. We went through the process of reading in the dataset, data manipulation and variable creation, summary table creation, data visualization, and model fitting using linear regression, random forest, and boosted tree methods. At the end of each document, a “best model” was declared.
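For anyone curious what the "automation" piece can look like, here is a minimal sketch, not our actual project code. It assumes a single parameterized R Markdown template (the file name `project2.Rmd` and the `channel` parameter are hypothetical) and builds one output file per data channel. The rendering step only runs if that template and the rmarkdown package are available.

```r
# The six data channels named in the post.
channels <- c("lifestyle", "entertainment", "social media",
              "business", "tech", "world")

# One row per report: which channel to analyze and where to write it.
# The "_analysis.md" naming scheme is an assumption for illustration.
render_plan <- data.frame(
  channel     = channels,
  output_file = paste0(gsub(" ", "_", channels), "_analysis.md"),
  stringsAsFactors = FALSE
)

# Render the (hypothetical) template once for a single channel.
render_channel <- function(channel, output_file, input = "project2.Rmd") {
  rmarkdown::render(
    input,
    output_file = output_file,
    params      = list(channel = channel)
  )
}

# Only attempt the rendering when the template and package exist.
if (file.exists("project2.Rmd") &&
    requireNamespace("rmarkdown", quietly = TRUE)) {
  invisible(Map(render_channel, render_plan$channel, render_plan$output_file))
}

print(render_plan)
```

Keeping the plan in a data frame makes it easy to check the file names before anything is rendered.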

It was a pleasure to collaborate with Ryan on this project, and I appreciate the opportunity for my view of group work to change. There is nothing that I would change about the process or the product of this project. Throughout the development process, each of our tasks as collaborators was clear, and each of us completed them in a timely manner. Communication between Ryan and me was always thorough and clear, allowing for a smooth workflow.

The most difficult part for me was getting excited about this data. During my last project, it was easy to visualize trends in the data during graph creation. However, this dataset was more complex, having many observations, interactions, and relationships between variables that were less pronounced. Exploring the variables absorbed a lot of my time because it was difficult to find variables that created an interesting scatterplot, histogram, etc.

My biggest takeaway from this project was the importance of doing model comparison on a test set. Sometimes the random forest or boosted tree model would appear to perform better on the training set we created, but one of the linear regression models would rise to the top when tested on the test set. This was surprising to me, but it exposed the value of having a test set to accurately measure the efficacy of each model.
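To make that takeaway concrete, here is a small sketch of comparing models on a held-out test set. It is illustrative only: the built-in `mtcars` data stands in for the news popularity dataset, and two linear models of different complexity stand in for the random forest and boosted tree models, so that the example runs in base R without extra packages. The 70/30 split and the seed are arbitrary choices.

```r
# Illustrative only: mtcars stands in for the news popularity data.
set.seed(558)
n         <- nrow(mtcars)
train_idx <- sample(seq_len(n), size = floor(0.7 * n))
train     <- mtcars[train_idx, ]
test      <- mtcars[-train_idx, ]

# Root mean squared error of a model's predictions on a given data set.
rmse <- function(model, data) {
  sqrt(mean((data$mpg - predict(model, newdata = data))^2))
}

# A simple model and a more flexible one that contains it.
simple_fit   <- lm(mpg ~ wt, data = train)
flexible_fit <- lm(mpg ~ wt + hp + disp + qsec + drat, data = train)

# Compare both models on the training set and on the test set.
results <- data.frame(
  model      = c("simple", "flexible"),
  train_rmse = c(rmse(simple_fit, train), rmse(flexible_fit, train)),
  test_rmse  = c(rmse(simple_fit, test),  rmse(flexible_fit, test))
)
print(results)
```

The flexible model can never look worse on the training set, since it contains the simple one, which is exactly why the training RMSE is a poor judge. The test RMSE column is the one to use when declaring a "best model".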

Overall, my experience with this project was a pleasant one and I’m grateful to have been paired with someone so easy to work with. I’m excited to tackle the next project!

Visit our Project 2 Repository here

Visit the Project 2 landing page here

Reflections On My First R Project

2021-10-04T00:00:00+00:00

Today I finished my first project in RStudio. I built a GitHub page through R that teaches visitors how to access an API, build user-friendly functions along the way, and perform exploratory data analysis.

When I first glanced at the Project instructions, I was both excited and intimidated. The estimated amount of time seemed like a lot! However, I was glad to be able to choose the API I wanted to access using a provided list. Initially I chose the Covid-19 API, but became frustrated when a network error lasted all day, putting the project on hold. Since I had only written a function to access the API up to that point, I decided to start over and choose the OneCall portion of the OpenWeather API.

The longer I spent coding my way through this project, the more I enjoyed the process of function writing, debugging, discovering trends in the graphs I created, and committing my thoughts to the R Markdown notebook. The most difficult part of the process was getting over the intimidation of using the render() function to render the document instead of using the knit button, as well as having to add, commit, and push my changes regularly to GitHub. In retrospect, these are both very easy processes that didn’t take much time or brain power, but the fact that I couldn’t visualize how these processes function made both tasks seem difficult at first.
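For readers who share that intimidation, here is a minimal sketch of rendering from code instead of the Knit button. It is not my project file: it writes a tiny throwaway `.Rmd` (the file name `vignette_demo.Rmd` is made up) and renders it to a GitHub-flavored `README.md`, assuming the rmarkdown package and pandoc are installed. If they are not, the render step is skipped.

```r
# Write a tiny throwaway .Rmd so the example is self-contained.
fence    <- strrep("`", 3)  # builds the three-backtick chunk delimiter
rmd_path <- file.path(tempdir(), "vignette_demo.Rmd")
writeLines(c(
  "---",
  "title: \"Render demo\"",
  "---",
  "",
  paste0(fence, "{r}"),
  "summary(cars)",
  fence
), rmd_path)

# Render to a GitHub-flavored markdown file next to the input,
# but only when rmarkdown and pandoc are actually available.
if (requireNamespace("rmarkdown", quietly = TRUE) &&
    rmarkdown::pandoc_available()) {
  out <- rmarkdown::render(
    rmd_path,
    output_format = "github_document",
    output_file   = "README.md",
    quiet         = TRUE
  )
  print(out)  # path of the rendered file
}
```

After that, the add, commit, and push steps happen in Git as usual; `render()` only produces the file.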

As far as the logic and programming, I didn’t have much trouble. The hardest part was probably looking up new functions for data cleaning as well as making decisions about what contingency tables, numerical summaries, and plots I wanted to make. However, that was also the most fun. Imagining interesting comparisons between variables and then working to put together the necessary code to create the plot built suspense because I was excited to see the result.

If I were to do this project over again (which I may do just for fun with a different API), I would become more familiar with which functions and options can be used when rendering a GitHub document in R Markdown. I ran into a few issues with leaflet() and a few other functions that work easily in HTML output but are trickier to include in GitHub pages. Becoming familiar with these limitations would save me hours of Google searches in the future.

Overall, I really enjoyed this project, including the learning process as well as the final vignette I created. I’m looking forward to tackling more stuff like this in the future!

To check out my project, visit https://atbiggie.github.io/Project1/
To see my Project 1 Repository, visit https://github.com/atbiggie/Project1
To see the repository that hosts my blog, visit https://github.com/atbiggie/atbiggie.github.io

Programming Background

2021-09-07T00:00:00+00:00

My (not so hot) take on R

I wasn’t looking forward to learning R…

When I started my first R class almost two years ago, I wasn’t all that excited about it. I was more comfortable in SAS and had dabbled a bit in Python, and I was content with limiting my coding knowledge to those two languages. R just looked… ugly?

After I got over my initial disgust toward the syntax, I realized that the same nasty-looking syntax was surprisingly easy to learn. The more capable I became as an R programmer, the more I realized that I could perform many of the same analyses and render the same plots that I could in SAS, only in far fewer lines of code and often with better graphics.

I especially appreciate the capability to create nicely formatted HTML pages as well as interactive graphics. Although I miss the visual appeal of a detailed and hierarchically formatted SAS proc step, I can finally say that I prefer R as my primary coding language because of its ease of use, flexibility, and intuitive (albeit ugly) syntax.

Example R Markdown Output

plot(iris)

What is a Data Scientist?

2021-08-17T00:00:00+00:00

As discussed in many of the articles below, defining what makes a data scientist seems to be quite complicated. I’ve always thought of them as being rebranded statisticians, but with a more current name. However, I’m learning that there are some skill set and responsibility differences that set data scientists apart from their fellow statisticians. Being able to handle data with thousands of variables and millions of rows is one skill that data scientists must master and that statisticians can avoid. Presumably, being able to manage that much data requires superior coding skills in multiple languages and maybe some knowledge of artificial intelligence. Data science seems to be about taking massive amounts of data and using statistics and dynamic code to learn from it as efficiently as possible. Data scientists drive the world’s most impactful business decisions for companies like Google and Facebook, while the word “statistician” evokes images of smaller clinical trials and carefully planned experiments or surveys. Data scientists know how to make sense of data sets that never stop growing.

Although the traditional role of statisticians may be based more in statistical theory than in programming, I think the profession is evolving to meet the current big-data-driven world’s needs. Statisticians still have a massive knowledge base of statistics and mathematics, but many educational programs are shifting toward more of a focus on programming so that their graduating statisticians have a better chance of qualifying for those data science jobs. More classes about artificial intelligence and developing strong programming skills are popping up in course catalogs to bolster theory-based statisticians with the computer skills needed to use their statistics knowledge efficiently and on a large scale.

As for myself, I believe I’m a statistician who is gaining the skills necessary to enter the world of data science. Most of my education has been about statistical theory and practice, but I’ve taken more recent classes that have allowed me to gain a strong knowledge base in SAS, dip my toes into SQL, and dive deep into R. I think it’s honing these skills, as well as exploring deep learning and artificial intelligence, that will prepare me to walk the common ground between statistics and data science.

~ Autumn

https://medium.com/odscjournal/data-scientists-versus-statisticians-8ea146b7a47f
https://www.springboard.com/blog/ai-machine-learning/machine-learning-engineer-vs-data-scientist/
https://www.simplilearn.com/data-science-vs-data-analytics-vs-machine-learning-article
https://mixpanel.com/blog/this-is-the-difference-between-statistics-and-data-science/