Data Science Beginnings

Matt Zhang
4 min read · Oct 26, 2020

It’s hard to believe that we have just about finished Phase 1, and it has definitely been a long process for me. I remember first being introduced to Git: it was such a foreign concept that I had to spend a couple of nights figuring everything out, and I still don’t fully understand it without looking things up. The idea of a version control system had never really occurred to me, but as I dove into it I became fascinated.

Then, as we began the data cleaning topic, it was a completely new concept that I had difficulty grasping, especially doing it through Python. What was interesting to me was the fact that we could sift through data frames and change them entirely however we saw fit. It has helped me better understand what a data scientist can do and the power they have to manipulate large amounts of data.

Furthermore, beyond the labs we completed, the Phase 1 Project was our first real application of data cleaning, and it was an experience I learned a lot from. The project itself had a very interesting prompt: retrieve and analyze information about movies to help a company enter the video production industry. I ran into two roadblocks while working on this project that I would like to bring to light. The first was data cleaning and my ability to edit data frames accordingly. With so much data, it is hard to know where to begin, and with so much in front of me it was hard to decide what to use for my project. In the end, I went with my intuition and used the provided datasets for the sake of convenience and organization. Of the given files, I chose the IMDB, TMDB, and Box Office Mojo files because I found those to contain the most useful information about movies. Using these files, I knew I had to remove unnecessary information by dropping columns with the .drop() method. This was a tool I found very useful throughout the project since it expedited the cleaning process. With such a large data set, the null values were also easier to deal with using the .dropna() method, which removes rows containing null values (I’ll sketch an example of both further down). Although bits of information were lost, I think the bigger picture remained.

After cleaning the data, I had to create questions based on the data frames I had. This leads into my second roadblock: linking insightful questions to my data. I found this part extremely tricky, as I spent so much time creating new questions only to have them foiled when I ran into technical issues. Perhaps I could have approached this roadblock differently.

Going back to the phase itself, this is definitely the one I want to spend more time on in terms of coding and data manipulation. Although all the topics are important, I find this phase to carry the most weight personally. I realize Python is a vital tool not only in data science but also in software engineering, so I hope to move beyond familiarity and become fluent at some point in my career. In retrospect, loops specifically intrigue me, and I think they were part of the problem in my project roadblocks. Being able to use loops effectively probably would have made some of my abandoned questions viable. I know I had to brush up on loops when I needed to incorporate them into my project; specifically, when I had to loop through a data frame to count up the number of True and False occurrences in each column, which I’ll also sketch below.
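First, the cleaning step: here is a minimal sketch of what I mean by dropping columns and null rows. The file name and column names are made up for illustration; they are not the actual project files.

```python
import pandas as pd

# Hypothetical file and column names, just to illustrate the idea;
# these are not the actual project files
df = pd.read_csv("movie_data.csv")

# .drop() removes columns that aren't needed for the analysis
df = df.drop(columns=["poster_url", "tagline"])

# .dropna() removes any rows that still contain null values
df = df.dropna()

print(df.shape)  # rows and columns remaining after cleaning
```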
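And here is roughly the shape of that counting loop. This is a sketch over a tiny made-up data frame, not the real project data, but the idea is the same:

```python
import pandas as pd

# Toy data frame standing in for the real project data;
# each column is a genre flag, True if the movie has that genre
df = pd.DataFrame({
    "action": [True, False, True],
    "comedy": [False, False, True],
    "drama":  [True, True, True],
})

# Count the True occurrences in every column
genre_counts = {}
for col in df.columns:
    # .loc pulls out a single column; .sum() treats True as 1 and
    # False as 0, so the total is the number of True values
    genre_counts[col] = int(df.loc[:, col].sum())

print(genre_counts)  # {'action': 2, 'comedy': 1, 'drama': 3}
```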
Within this loop, I assigned the values of each column to its individual name (in my case, the genres). Problems like this are the ones I tend to spend the most time on, thinking about and actually understanding what is happening behind the code. It always helps to think smaller first and then build the loop around that. With assistance, I was able to first locate a single column (using the efficient .loc method) and sum up its values with .sum(). Although the values were True and False, Python reads these as 1 and 0, respectively, so calculations can be performed on them directly. There is much I have to learn and many new ways of thinking that I have not come across, but that all comes with more practice.

Something else that I have found extremely helpful, and that I hope to participate in more actively, are the weekly coding warmups, which are more technical and require you to think quickly. This is important for training my brain to work at a certain level, and I know it will be useful in future interviews. My favorite questions come from the LeetCode website; they are challenging yet fun to do, and I know it is a popular practice site for people going through recruiting. The nice thing about coding is that although you spend a lot of time trying to solve a problem, the rewarding feeling you get when you answer it correctly is unmatched, and that’s really what drives me to code more.

There are definitely many things I am looking forward to in the next phase, and taking a peek ahead, there is a lot of useful statistics I need to refresh on. I realize data science encompasses a variety of topics, so I shouldn’t stress entirely about the coding. Although this first phase is over, I have gained valuable insights into how I should approach future phases, and I hope this first phase is a precursor for how I approach the rest of the program. There are some study habits I wish to improve on, and hopefully I can remove some of the distractions I experienced this phase. Looking forward to the rest of the program and learning the ins and outs of data science with my amazing cohort!
