Week 3 - BALT 4363 - Handling and Cleaning Data with Python Libraries

Chapter 3: Handling and Cleaning Data with Python Libraries

This past week in BALT 4363, I learned about handling and cleaning data with Python libraries. This is a very important topic to understand. During this semester, I have learned how to create nice visuals using RStudio and Python. Creating the visuals themselves is not very difficult. The difficulty comes from organizing the data so that it can be plotted, and the trickiest part of that is cleaning the data so the visuals come out accurate and readable.

Pandas

Pandas is a library that provides easy-to-use, high-performance data structures and data analysis tools. It is very useful for handling large datasets because it offers flexible data manipulation tools. Pandas has two primary data structures: Series and DataFrame. A Series is a one-dimensional labeled array, and a DataFrame is a two-dimensional table made up of rows and columns.
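To make the difference between the two structures concrete, here is a small sketch. The product names and numbers are made up for illustration and do not come from the chapter.

```python
import pandas as pd

# Hypothetical sample data, invented for this example.
# A Series is one-dimensional: one value per index label.
prices = pd.Series([2.50, 3.75, 1.20], index=["apple", "banana", "cherry"])

# A DataFrame is two-dimensional: rows and named columns.
sales = pd.DataFrame({
    "product": ["apple", "banana", "cherry"],
    "units": [10, 4, 25],
    "price": [2.50, 3.75, 1.20],
})

print(prices.ndim)   # 1
print(sales.ndim)    # 2
print(sales.shape)   # (3, 3), meaning 3 rows and 3 columns

# Selecting one column of a DataFrame gives back a Series.
units = sales["units"]
print(type(units).__name__)  # Series
```

One way to think about it is that a DataFrame is a collection of Series that share the same row index.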

NumPy

NumPy (Numerical Python) is a Python library that adds support for large arrays and matrices, along with a large collection of high-level mathematical functions that operate on those arrays. The core of NumPy is the homogeneous multidimensional array: a table of elements of the same type, indexed by a tuple of non-negative integers.
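Here is a minimal sketch of what "homogeneous" and "indexed by non-negative integers" look like in practice. The values in the array are arbitrary and only there for illustration.

```python
import numpy as np

# A 2-D array with 2 rows and 3 columns; every element has the same type.
grid = np.array([[1, 2, 3],
                 [4, 5, 6]])

print(grid.shape)   # (2, 3)
print(grid.ndim)    # 2

# Elements are indexed by a tuple of non-negative integers: (row, column).
print(grid[1, 2])   # 6

# Mathematical functions operate on the whole array at once.
print(grid.sum())   # 21
print(grid.mean())  # 3.5
```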

What do these do?

Now, you may be wondering what these tools really do, and the answer to that is, well, a lot! To clean data with Pandas, you can handle missing data, remove duplicates, rename columns, and replace values. Missing data is a really annoying issue when trying to create a visualization. With Pandas, you can write clean_data = data.dropna(). This method returns a copy of the data with every row that has a missing value removed, which cleans up how your visualizations look and makes the data easier to analyze!
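The four cleaning steps mentioned above can be sketched on a tiny table. The column names and values here are hypothetical, made up just to show each step, and this is one possible order of operations rather than the only correct one.

```python
import numpy as np
import pandas as pd

# Hypothetical messy data: one duplicate row and one missing sales value.
data = pd.DataFrame({
    "Cust Name": ["Ann", "Bob", "Bob", "Cara"],
    "region": ["N", "S", "S", "N"],
    "sales": [100.0, 250.0, 250.0, np.nan],
})

# 1. Handle missing data: drop rows that contain any missing value.
clean_data = data.dropna()

# 2. Remove duplicates: keep only the first copy of identical rows.
clean_data = clean_data.drop_duplicates()

# 3. Rename columns: swap an awkward name for a simpler one.
clean_data = clean_data.rename(columns={"Cust Name": "customer"})

# 4. Replace values: expand the region codes into full names.
clean_data = clean_data.replace({"region": {"N": "North", "S": "South"}})

print(clean_data)
print(len(data), "rows before,", len(clean_data), "rows after")
```

Each of these methods returns a new DataFrame instead of changing the original, so data still holds the untouched rows if you need to check what was removed.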

Why use these?

In my time as a business analytics major, I have heard the question "Why would we do this instead of just using Excel?" multiple times. In this chapter, the question was answered really well. To summarize, Python offers flexibility, scalability, automation, customizable visualizations, advanced analytics, collaboration, and plenty of resources. Python's tools are more powerful than traditional spreadsheet software, and they are not extremely hard to use.











