Data analysis is the process of inspecting, cleaning, transforming, and modeling data to uncover useful information and support decision-making. It involves various techniques and tools that help dissect complex data sets and extract meaningful insights, and it can follow different approaches such as descriptive, exploratory, inferential, and predictive analysis. In Python, a common choice for this work is the pandas library.
Data analysis using pandas
One of the popular tools for data analysis is the Python library called pandas. Pandas provides powerful and efficient data structures to manipulate and analyze structured data. Here’s an example of how to import pandas and load a dataset:
import pandas as pd
# Load dataset
# Replace 'SolarCal1567.csv' with the path or filename of your dataset
data = pd.read_csv('....SolarCal1567.csv')
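To try the loading step without the original file, the sketch below reads a tiny in-memory CSV. The three rows and their values are made up for illustration; only the column names are borrowed from the dataset used in this article.

```python
import io

import pandas as pd

# Hypothetical sample rows; the values are invented for illustration only.
csv_text = """Lat,Lon,Siteid,DaylengthMax,Tilt_Angle_for_Solar_PV
34.05,-118.24,A1,14.4,30
36.77,-119.42,B2,14.7,32
38.58,-121.49,C3,14.9,34
"""

# read_csv accepts a path or any file-like object,
# so StringIO stands in for a file on disk here.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)           # (number of rows, number of columns)
print(list(sample.columns))   # column names taken from the header row
```

With a real file, the only change is passing its path to read_csv instead of the StringIO object.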
You can use the head() method of a pandas DataFrame to display its first few rows:
data.head()
This returns the first 5 rows of the DataFrame data. To display a different number of rows, pass an argument to head(); for example, data.head(10) returns the first 10 rows.
Similarly, you can use the tail() method to display the last few rows of a DataFrame:
data.tail()
This returns the last 5 rows of the DataFrame data. To display a different number of rows, pass an argument to tail(); for example, data.tail(10) returns the last 10 rows.
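Here is a small self-contained sketch of both methods. The 12-row frame and its column names are hypothetical, chosen only so the selected rows are easy to recognize.

```python
import pandas as pd

# Hypothetical frame with 12 rows, used only to show head() and tail().
demo = pd.DataFrame({"n": range(12), "square": [i * i for i in range(12)]})

first_five = demo.head()       # default: first 5 rows
first_three = demo.head(3)     # first 3 rows
last_five = demo.tail()        # default: last 5 rows
last_two = demo.tail(2)        # last 2 rows

# A negative argument means "all rows except that many":
# head(-2) returns everything except the last 2 rows.
all_but_last_two = demo.head(-2)

print(first_three)
print(last_two)
```

Both methods return a new DataFrame, so the result can be stored or chained like any other frame.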
The info() method provides a concise summary of a DataFrame. It displays the column names, the data type of each column, and the number of non-null values.
To use info(), simply call it on your DataFrame, like this:
data.info()
The output will look something like this:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1566 entries, 0 to 1565
Data columns (total 7 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   Lat                       1566 non-null   float64
 1   Lon                       1566 non-null   float64
 2   Siteid                    1566 non-null   object
 3   DaylengthMax              1566 non-null   float64
 4   DaylengthMin              1566 non-null   float64
 5   Annual_Global_Insolation  1566 non-null   object
 6   Tilt_Angle_for_Solar_PV   1566 non-null   int64
dtypes: float64(4), int64(1), object(2)
memory usage: 85.8+ KB
This summary provides important information about your dataset, such as the total number of rows, the number of non-null values in each column, and the data types of the columns. This can be helpful in understanding the structure and integrity of your data.
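The facts that info() prints can also be read programmatically, which helps when checking data integrity in a script. In the output above, Annual_Global_Insolation is reported as object, which may mean the numbers were read as text. The sketch below shows one possible way to inspect and convert such a column; the frame, its column names, and its values are hypothetical.

```python
import pandas as pd

# Hypothetical data: 'insolation' holds numbers as strings, 'tilt' has a gap.
frame = pd.DataFrame({
    "site": ["A1", "B2", "C3"],
    "insolation": ["5.1", "5.6", "n/a"],
    "tilt": [30.0, None, 34.0],
})

frame.info()                # prints the summary to standard output

non_null = frame.count()    # non-null values per column, as a Series
dtypes = frame.dtypes       # data type per column, as a Series

# A numeric-looking column stored as text can be converted;
# errors="coerce" turns entries that cannot be parsed into NaN.
insolation = pd.to_numeric(frame["insolation"], errors="coerce")

print(non_null["tilt"])           # non-null count for the 'tilt' column
print(insolation.isna().sum())    # entries that could not be converted
```

Whether coercing to NaN is appropriate depends on the dataset; it is worth inspecting the unparseable entries before discarding them.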
The describe() method provides descriptive statistics of a DataFrame. It computes summary statistics such as count, mean, standard deviation, minimum, 25th percentile, median, 75th percentile, and maximum for each numeric column in the DataFrame.
To use describe(), simply call it on your DataFrame, like this:
data.describe()
The output will look something like this:
               Lat          Lon  DaylengthMax  DaylengthMin  \
count  1566.000000  1566.000000   1566.000000   1566.000000
mean     36.771178  -119.152447     12.830366     11.456182
std       3.598548     6.788176      2.570655      2.179341
min      32.984700  -124.422100      7.830000      7.360000
25%      34.014200  -121.835900     10.989900      9.900000
50%      36.778300  -119.027500     12.782000     11.405000
75%      38.542200  -114.763850     14.487675     12.919900
max      41.998600  -114.318100     18.167700     16.532900

       Tilt_Angle_for_Solar_PV
count              1566.000000
mean                 26.587499
std                  17.331640
min                 -16.000000
25%                  14.000000
50%                  30.000000
75%                  41.000000
max                  51.000000
This summary provides key statistical information about each numeric column in your dataset, including the count of non-null values, mean, standard deviation, minimum, quartiles, and maximum values. It can help you gain insights into the distribution and range of your data.
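Because describe() returns a DataFrame, individual statistics can be looked up by row and column label. A minimal sketch, assuming a hypothetical frame whose values are chosen so the results are easy to check by hand:

```python
import pandas as pd

# Hypothetical values: the mean and median of 1..5 are both 3.
scores = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "label": ["v", "w", "x", "y", "z"],
})

summary = scores.describe()    # numeric columns only by default
print(summary)

# Look up single statistics by row label and column name.
mean_a = summary.loc["mean", "a"]
median_a = summary.loc["50%", "a"]    # the 50th percentile is the median

# include="all" also summarizes non-numeric columns
# (adding rows such as unique, top, and freq).
full = scores.describe(include="all")
```

The default skips the text column label, which is why only a appears in summary.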
The corr() method calculates the pairwise correlation between the columns of a DataFrame, excluding missing values.
To use corr(), call it on your DataFrame. In pandas 2.0 and later, pass numeric_only=True so that text columns such as Siteid are skipped rather than raising an error:
data.corr(numeric_only=True)
The output will be a correlation matrix, which is a square matrix where the columns and rows represent the variables, and each cell holds the correlation coefficient between two variables. The coefficient (Pearson's r by default) ranges from -1 to 1, where -1 indicates a perfect negative linear correlation, 1 indicates a perfect positive linear correlation, and 0 indicates no linear correlation.
Here’s an example of what the output might look like:
                              Lat       Lon  DaylengthMax  DaylengthMin  Tilt_Angle_for_Solar_PV
Lat                      1.000000 -0.024078      0.002094     -0.008762                -0.048916
Lon                     -0.024078  1.000000     -0.743472     -0.182390                 0.064835
DaylengthMax             0.002094 -0.743472      1.000000      0.319335                 0.052063
DaylengthMin            -0.008762 -0.182390      0.319335      1.000000                 0.000927
Tilt_Angle_for_Solar_PV -0.048916  0.064835      0.052063      0.000927                 1.000000
In this example, you can see the correlation coefficients between the various columns in the DataFrame. For example, the correlation coefficient between “Lat” and “Lon” is -0.024078, which is close to zero and indicates almost no linear relationship. The correlation coefficient between “DaylengthMax” and “Lon” is -0.743472, indicating a strong negative correlation.
Understanding correlations between variables can help you identify relationships and dependencies within your data, which can be useful for making informed decisions and predictions.
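To see the two extremes of the coefficient, the sketch below builds a hypothetical frame in which one column rises exactly linearly with x and another falls exactly linearly. The column names and values are invented for illustration.

```python
import pandas as pd

# Hypothetical data: y rises with x, z falls with x, both exactly linearly.
points = pd.DataFrame({
    "x": [1, 2, 3, 4],
    "y": [2, 4, 6, 8],
    "z": [8, 6, 4, 2],
    "name": ["a", "b", "c", "d"],
})

# numeric_only=True skips the text column (available in pandas 1.5 and later).
matrix = points.corr(numeric_only=True)
print(matrix)

# A single pair can be computed with Series.corr.
r_xy = points["x"].corr(points["y"])   # perfect positive relationship
r_xz = points["x"].corr(points["z"])   # perfect negative relationship
print(round(r_xy, 6), round(r_xz, 6))
```

Real data rarely reaches these extremes, and a high coefficient on its own does not show that one variable causes the other.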
To plot data, use the plot() method. Its x and y parameters name the columns to place on each axis:
data.plot(x='DaylengthMax',y='Tilt_Angle_for_Solar_PV')
<AxesSubplot:xlabel='DaylengthMax'>
This draws a line plot of Tilt_Angle_for_Solar_PV against DaylengthMax.
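For two numeric variables, a scatter plot is often easier to read than the default line plot, since the rows have no natural order. The sketch below is one way to do it, assuming matplotlib is installed (pandas uses it for plotting); the frame, labels, and title are hypothetical.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed

import pandas as pd

# Hypothetical data, used only to demonstrate the plotting call.
trend = pd.DataFrame({"x": [1, 2, 3, 4], "y": [10, 20, 25, 40]})

# kind="scatter" draws one marker per row instead of connecting the points.
ax = trend.plot(x="x", y="y", kind="scatter", title="y against x")
ax.set_ylabel("y (hypothetical units)")

# Save the figure as PNG; pass a filename instead of the buffer to write a file.
buffer = io.BytesIO()
ax.figure.savefig(buffer, format="png")
```

plot() returns a matplotlib Axes object, so labels and titles can be adjusted afterwards with the usual matplotlib methods.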