# Python Data & Analysis Tools

Data & Analysis Tools Using Python

Last Updated: January 30, 2021 by Pepe Sandoval


## numpy

### numpy arrays

• numpy (Numerical Python) is a library that allows us to generate and handle numerical data

• numpy uses an object/structure referred to as a numpy array (numpy.ndarray) that can store data efficiently and execute functions on that data (better than a built-in Python list)

• install

• pip: python -m pip install numpy
• numpy arrays can be vectors (one-dimensional arrays) or matrices (two-dimensional arrays)

• In numpy a number followed by a dot . indicates it is a floating point number, e.g. [0., 0., 1.]

• arange generates an array from start up to but not including end, while linspace generates an array with exactly the number of elements specified, evenly spaced over the interval

• Numpy has a great variety of functions and distributions to generate random numbers

import numpy

# Create Numpy Array from python list
print("numpy array")
np_array = numpy.array([1, 2, 3])
print(type(np_array), np_array)

# Create Numpy matrix from python list of lists
print("numpy matrix")
np_matrix = numpy.array([[1, 1, 1], [2, 2, 2], [3, 3, 3]])
print(type(np_matrix))
print(np_matrix)

# numpy arange to generate arrays
print("numpy arange")
np_ar1 = numpy.arange(0, 5, 1)
np_ar2 = numpy.arange(0, 5, 0.5)
print(np_ar1, np_ar2)

# Create array or matrix of zeros
print("numpy zeros")
zeros = numpy.zeros(4)
print(zeros)

# Create array or matrix of ones
print("numpy ones")
ones = numpy.ones((2, 3))
print(ones)

# numpy linspace to generate evenly spaced numbers
# over specified interval
print("numpy linspace")
np_lin1 = numpy.linspace(0, 4, 5)
np_lin2 = numpy.linspace(0, 4.5, 6)
print(np_lin1, np_lin2)

# Create identity matrix
print("identity matrix")
eye = numpy.eye(3)
print(eye)

# Creates array with random samples
# of uniform distribution [0, 1)
# every number has the same probability of being picked
print("uniform distribution")
uni_rand1 = numpy.random.rand(1)
print(uni_rand1)
uni_rand2 = numpy.random.rand(4)
print(uni_rand2)
uni_rand3 = numpy.random.rand(3, 3)
print(uni_rand3)

# Creates array with samples from a
# standard normal distribution (mean 0, standard deviation 1)
# the closer a number is to zero the higher its probability of being picked
print("standard normal distribution")
snor_rand1 = numpy.random.randn(1)
print(snor_rand1)
snor_rand2 = numpy.random.randn(4)
print(snor_rand2)
snor_rand3 = numpy.random.randn(3, 3)
print(snor_rand3)

# Random integer up to but not including high
print("random integers")
int_rand = numpy.random.randint(low=1, high=10+1)
print(int_rand)
int_rand_array = numpy.random.randint(low=1, high=10+1, size=5)
print(int_rand_array)

# generate always same random data with seed
numpy.random.seed(101)
print(numpy.random.rand(1))
print(numpy.random.rand(1))

# reshape array to a VALID dimension
print("reshape arrays")
orig_array = numpy.arange(0, 6)
print(orig_array, "shape", orig_array.shape, "storing", orig_array.dtype,
"max", orig_array.max(), "at index", orig_array.argmax())
print(orig_array.reshape(2, 3))  # numbers here must multiply to size of original array
print(orig_array.reshape(1, 6))
print("shape", orig_array.reshape(6, 1).shape)
print(orig_array.reshape(6, 1))

• Python indexing and slicing apply to numpy arrays, and numpy also allows broadcasting, which here means reassignment of values to the whole array or a certain range of it (arr[0:5] = 100)

• assigning a slice of an array to a name only gives a reference (a view) to those values, so changes affect the original array; you can use the copy() method to copy explicitly
• numpy supports [r][c] or [r, c] indexing for matrices

• Conditional selection is used to grab elements from an array based on some comparison operator by creating a numpy array of booleans, e.g. arr[arr > 4]

import numpy as np

# Array slicing
arr = np.arange(0, 11)
print("arr", arr)
arr[1:5] = 100
print("reassign", arr)

arr = np.arange(0, 11)
slice_of_arr = arr[1:6]
print("slice_of_arr", slice_of_arr)
slice_of_arr[:] = 99
print("slice_of_arr reassign", slice_of_arr)
print("reassign", arr)
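Since a plain slice is only a view into the original array, the copy() method mentioned above detaches the data; a small sketch continuing the example:

```python
import numpy as np

arr = np.arange(0, 11)
slice_copy = arr[1:6].copy()  # explicit copy, detached from arr
slice_copy[:] = 99
print("original untouched", arr[1:6])
print("copy reassigned", slice_copy)
```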

# Matrix slicing and indexing
mat = np.arange(9).reshape(3, 3)
print("mat")
print(mat)
print("mat[1][2]", mat[1][2])
print("mat[1,2]", mat[1, 2])
print("row 1 = mat[1]", mat[1])
print("mat[:2,1:]") ; print(mat[:2, 1:])
print("mat[1:,:2]") ; print(mat[1:, :2])
print("mat.sum") ; print(mat.sum(axis=0))  # axis=0 sums down the rows (per column), axis=1 across the columns (per row)

# Conditional selection
arr = np.arange(1, 11)
bool_array = arr > 4
print("arr > 4", bool_array)
print("arr[bool_array]", arr[bool_array])
print("arr[arr<=9]", arr[arr <= 9])
print("arr[arr<=0]", arr[arr <= 0])


### numpy operations

• Operations usually behave element by element
• numpy usually returns a warning on indeterminate operations but still performs them: nan for 0/0, inf for a positive number divided by zero, -inf for a negative number divided by zero
import numpy as np

arr = np.arange(0, 10)

print("sum", arr+arr)
print("mul", arr*arr)
print("div", arr/arr)
print("inv", 1/arr)
print("pow", arr ** 2)
print("sum scalar", arr+100)
print("sqrt", np.sqrt(arr))
print("exp", np.exp(arr)) # e to the power of each element of the array
print("sin", np.sin(arr))
print("log", np.log(arr))


## Pandas

• Pandas (PANel-DAta) is a data analysis library built on top of numpy, created to help work with datasets, especially financial data

• Pandas features:

• Provides a fast and efficient DataFrame object used for data manipulation; this object also has integrated indexing
• Provides tools for reading and writing data between in-memory data structures and different formats like CSV files, text files, Excel files, SQL databases, the HDF5 format, etc.
• It integrates well with Python visualization libraries
• Pandas is highly optimized for performance; critical code paths are written in Cython or C
• It can transform or aggregate data
• Pandas Series are arrays (or numpy arrays) that can be indexed by a named index, datetime index or any object index instead of just a numerical index. They are one-dimensional numpy.ndarray objects with axis labels

import numpy as np
import pandas as pd

labels = ['a', 'b', 'c']
my_list = [0, 10, 20]
arr = np.arange(0, 30, 10)

print("Series from list use default index")
my_series = pd.Series(data=my_list)
print(type(my_series))
print(my_series)

print("Series from list with labels")
my_series = pd.Series(data=my_list, index=labels)
print(my_series)

print("Series from numpy ndarray with labels")
my_series = pd.Series(data=arr, index=labels)
print(my_series)

print("Series from any iterable list of objects")
my_series = pd.Series(data=[sum, max, len, all])
print(my_series)

print("Series from strings")
my_series = pd.Series(data=labels)
print(my_series)

print("Series from dict")
ser1 = pd.Series(data={'apple': 7, 'orange': 4, 'banana': 5})
ser2 = pd.Series(data={'apple': 3, 'watermelon': 1})
print("Series index as dict", ser1["apple"])
print("Adding series by default sums ONLY where there is a match")
print(ser1+ser2)
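If NaN for unmatched labels is not desired, the Series add method (rather than the + operator) accepts a fill_value; a minimal sketch reusing the series above:

```python
import pandas as pd

ser1 = pd.Series(data={'apple': 7, 'orange': 4, 'banana': 5})
ser2 = pd.Series(data={'apple': 3, 'watermelon': 1})
# fill_value treats a label missing from one series as 0 instead of producing NaN
total = ser1.add(ser2, fill_value=0)
print(total['apple'])       # 10.0
print(total['watermelon'])  # 1.0
```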


### DataFrames

• DataFrames are built on top of Pandas Series objects (pandas.core.series.Series); they are a collection of series: a DataFrame has rows and columns where each column is a series and each row is also a series

• DataFrames are just fancier numpy matrices where axis=0 refers to the rows and axis=1 refers to the columns

import numpy as np
import pandas as pd
from numpy.random import randn
np.random.seed(101)

data = randn(5, 4) # 5 rows 4 columns
rows =  ['A', 'B', 'C', 'D', 'E'] # 5 rows
columns =  ['W', 'X', 'Y', 'Z']   # 4 columns

# each column is a series
df = pd.DataFrame(data=data, index=rows, columns=columns)
print("DataFrame")
print(type(df))
print(df)

print("Get a column = Get a series")
print(type(df['W']))
print(df['W']) # pass column name as index
# print(df.W) # also possible but not recommended, to avoid clashing with DataFrame attributes/methods

print("Get multiple columns = Get multiple series = Get a subset DataFrame of original DataFrame")
print(df[['W', 'Z']])
print("Get just certain rows of all columns in a DataFrame")
print(df.iloc[0:3])

df['V'] = df['W'] + df['Y']
df['T'] = pd.Series(data={'A': 1, 'B': 2, 'C': 3, 'D': 4, 'E': 5})
print(df)

print("Remove columns from DataFrame")
df.drop('V', axis=1) # by default just return resulting dataframe but doesn't affect original
df.drop('T', axis=1, inplace=True) # set inplace to affect original dataframe
print(df)

print("Remove rows from DataFrame")
print(df.drop('E', axis=0))

print("Get a row = Get a series")
print(type(df.loc['A']))
print(df.loc['A']) # row name
print(df.iloc[0]) # row index

print("Get element from DataFrame")
print(type(df.loc['B', 'Y']))
print(df.loc['B', 'Y'])

print("Get subset of DataFrame")
print(type(df.loc[['A', 'B'], ['W', 'Y']]))
print(df.loc[['A', 'B'], ['W', 'Y']])

• DataFrames allow conditional selection; for multiple conditions we need to use & and | instead of and and or due to operator overloading

• If you pass a series of boolean values (for example a column compared with an operator, df["W"] > 0) you will get the rows where the series is True

import numpy as np
import pandas as pd
from numpy.random import randn
np.random.seed(101)

data = randn(5, 4) # 5 rows 4 columns
rows =  ['A', 'B', 'C', 'D', 'E'] # 5 rows
columns =  ['W', 'X', 'Y', 'Z']   # 5 columns

# each column is a series
df = pd.DataFrame(data=data, index=rows, columns=columns)
print("DataFrame conditional selection")
print(df[df > 0])

print("DataFrame conditional columns selection")
print(df[df["W"] > 0])
print("DataFrame conditional selection chained to access")
print(df[df["W"] > 0]["X"])
print("DataFrame multi conditional selection")
print(df[(df["W"] > 0) & (df["Y"]>1)])
# print(df[(df["W"] > 0) | (df["Y"]>1)])

print("DataFrame reset index to integers")
print(df.reset_index()) # needs inplace to affect original

print("DataFrame set new index")
new_rows = ["F", "G", "H", "I", "J"]
df["new"] = new_rows
print(df.set_index(keys="new")) # needs inplace to affect original
print(df)

• We can create dataframes with an index hierarchy which means an index with multiple levels (multi-level index)
import numpy as np
import pandas as pd
from numpy.random import randn
np.random.seed(101)

outside =  ['G1', 'G1', 'G1', 'G2', 'G2', 'G2']
inside =  [1, 2, 3, 1, 2, 3]

print("Create MultiIndex object from list of tuples")
hier_index = list(zip(outside, inside))
print(type(hier_index), hier_index)
hier_index = pd.MultiIndex.from_tuples(hier_index)
print(hier_index)

print("Create multi-level dataframe")
df = pd.DataFrame(data=randn(6, 2), index=hier_index, columns = ['A', 'B'])
print(df)

print("Get subset dataframe with an index level")
print(df.loc["G1"])
print("Get subset dataframe with chained index level")
print(df.loc["G1"].loc[2])

print("Set names for indexes")
df.index.names = ['Groups', 'Numbers']
print(df)

print("Get element on multi level index dataframe")
print(df.loc["G2"].loc[2]["B"],"=",df.loc["G2"].loc[2].loc["B"])

print("Get cross section of rows or columns")
print(df.xs(key="G1"))
print(df.xs(key=1, level="Numbers"))

• We can create Dataframes from dictionaries where keys will be the columns and each key must have a list (of equal size) that represent the data values, you can pass a list of index equal to the size of these lists to set specific index

• DataFrames access are always from top to bottom, from outer to inner indexes,from outer columns to last inner column, then from outer rows (indexes) to inner indexes

import numpy as np
import pandas as pd

d = {"A": [1,2, np.nan], "B": [7,np.nan,np.nan], "C": [4,5,6]}

df = pd.DataFrame(data=d)
#df = pd.DataFrame(data=d, index=["a", "b", "c"])
print(df)

print("Drop any row with missing values")
print(df.dropna(axis=0))

print("Drop any columns with missing values")
print(df.dropna(axis=1))

print("Drop rows based on threshold (minimum number of non-NaN values) / keep rows with 2 or more valid values")
print(df.dropna(axis=0, thresh=2))

print("Fill missing values")
print(df.fillna(value=df['A'].mean()))

• Pandas GroupBy can be seen as similar to grouping rows by a column in Excel

• Pandas GroupBy functionality allows you to group multiple rows by the values in a certain column and then perform an operation to combine those values

• groupby returns a pandas.core.groupby.generic.DataFrameGroupBy object; methods of this object by default ignore non-numeric columns and return DataFrame objects where the indexes are the values from the column you passed to groupby

• GroupBy lets you choose a column to group by, gathers all the rows that share a value in that column and then performs an aggregate function on those values

• an aggregate function is just any function that operates on a collection of values and returns a single value, for example the sum of the values, the average of the values, the standard deviation of the values, etc.
import numpy as np
import pandas as pd

d = {
"Company": ['GOOG','GOOG','MSFT','MSFT','FB','FB'],
"Person": ['Sam', 'Charlie', 'Amy', 'Vanessa', 'Carl', 'Sarah'],
"Sales": [200,120,340,124,243,350]
}

df=pd.DataFrame(data=d)
print(df)

byCompany = df.groupby("Company")
print(type(byCompany))

print("Get mean sales by company")
print(byCompany.mean())

print("Get std sales by company")
print(byCompany.std())

print("Get sum sales by company")
print(byCompany.sum())

print("Get sum sales of a certain company")
print(df.groupby("Company").sum().loc['GOOG'])

print("Count number of instances")
print(byCompany.count())

print("Get Max/Mins")
print(byCompany.max())
#print(byCompany.min())

print("Get Describe")
descdf = byCompany.describe()
print(descdf)
print("Specific mean value", descdf["Sales"]["mean"]["FB"])

print("Get Describe transposed")
tdf = byCompany.describe().transpose()
print(tdf)
print("Specific mean value", tdf["FB"]["Sales"]["mean"])

• Pandas provides 3 main ways of combining DataFrames

• Concatenating (pd.concat([df1, df2])): combines DataFrames; dimensions should match along the axis to concatenate (axis=0 adds rows, so the DataFrames must have the same columns)
• Merging (pd.merge(left, right, how='inner', on='key')): merges DataFrames like SQL tables, i.e. combines them based on a key column they share
• Joining (left.join(right)): combines the columns of two DataFrames that can have different indexes; it is the same as merge except the keys are on the index instead of in a column

Concatenate can add rows or columns; join and merge usually add columns, but in general they can add both at the same time

import numpy as np
import pandas as pd

df1=pd.DataFrame(data={
"A": ['A0','A1','A2','A3'],
"B": ['B0','B1','B2','B3'],
"C": ['C0','C1','C2','C3'],
"D": ['D0','D1','D2','D3']
}, index=[0, 1, 2, 3])

df2=pd.DataFrame(data={
"A": ['A4','A5','A6','A7'],
"B": ['B4','B5','B6','B7'],
"C": ['C4','C5','C6','C7'],
"D": ['D4','D5','D6','D7']
}, index=[4, 5, 6, 7])

df3=pd.DataFrame(data={
"A": ['A8','A9','A10','A11'],
"B": ['B8','B9','B10','B11'],
"C": ['C8','C9','C10','C11'],
"D": ['D8','D9','D10','D11']
}, index=[8, 9, 10, 11])

#print(df1) ; print(df2) ; print(df3)

print(pd.concat([df1, df2, df3], axis=0))

print(pd.concat([df1, df2, df3], axis=1))

left=pd.DataFrame(data={
"A": ['A0','A1','A2','A3'],
"B": ['B0','B1','B2','B3'],
"key": ['K0','K1','K2','K3']
})

right=pd.DataFrame(data={
"C": ['C0','C1','C2','C3'],
"D": ['D0','D1','D2','D3'],
"key": ['K0','K1','K2','K3']
})
#print(left) ; print(right)
print("Merge on key")
print(pd.merge(left, right, how='inner', on='key'))

left=pd.DataFrame(data={
"A": ['A0','A1','A2'],
"B": ['B0','B1','B2'],
},  index=['K0','K1','K2'])

right=pd.DataFrame(data={
"C": ['C0','C1','C2'],
"D": ['D0','D1','D2'],
},  index=['K0','K2','K3'])
#print(left) ; print(right)

print("Join on index")
print(left.join(right))
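join also accepts a how argument like merge does; a short sketch with the same frames, assuming we want the union of both indexes:

```python
import pandas as pd

left = pd.DataFrame(data={"A": ['A0', 'A1', 'A2'],
                          "B": ['B0', 'B1', 'B2']}, index=['K0', 'K1', 'K2'])
right = pd.DataFrame(data={"C": ['C0', 'C1', 'C2'],
                           "D": ['D0', 'D1', 'D2']}, index=['K0', 'K2', 'K3'])
# how='outer' keeps the union of both indexes, filling gaps with NaN
outer = left.join(right, how='outer')
print(outer)
```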

• DataFrames support multiple other useful operations, like the creation of pivot tables, which are just a multi-level-index view built out of a DataFrame
import pandas as pd

df = pd.DataFrame(data={
"col1": [1, 2, 3, 4],
"col2": [444, 555, 666, 444],
"col3": ['abc', 'def', 'ghi', 'xyz']
})
print(df)

print("Find Unique values in certain column")
print(df['col2'].nunique(), type(df['col2'].unique()), df['col2'].unique())

print("Count values in certain column")
print(type(df['col2'].value_counts()))
print(df['col2'].value_counts())

print("Applies function to column")
print(df['col1'].apply(func=lambda x: x*2))
print(df['col3'].apply(len))

print("Get column and index names")
print(type(df.columns), df.columns)
print(type(df.index), df.index)

print("Sort by column")
print(df.sort_values(by='col2'))

print("Find null values")
print(df.isnull())

df = pd.DataFrame(data={
"A": ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'],
"B": ['one', 'one', 'two', 'two', 'one', 'one'],
"C": ['x', 'y', 'x', 'y', 'x', 'y'],
"D": [1, 3, 2, 5, 4, 1]
})
print(df)

print("Create Pivot table")
print(df.pivot_table(values='D', index=['A', 'B'], columns=['C']))

• Pandas allows you to read and write data from multiple sources; the most common and stable is .csv files
• example .csv file
a,b,c,d
0,1,2,3
4,5,6,7
8,9,10,11
12,13,14,15

import pandas as pd

in_file = r'.\example.csv'
out_file = r'.\example_out.csv'

print("Read from file to create dataframe")
df = pd.read_csv(filepath_or_buffer=in_file)
print(df)

print("Write dataframe to file ignore index")
df.to_csv(path_or_buf=out_file, index=False)

print("Read HTML tables, needs an HTML parser installed")
# read_html returns a list of DataFrames, one per table found in the page
# html_data = pd.read_html(...) # pass a URL or a local HTML file
# print(html_data[0])

print("Read sql, need sql engine installed")
from sqlalchemy import create_engine
engine = create_engine("sqlite:///:memory:")
df.to_sql("my_table", engine)
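The same readers and writers also accept file-like objects instead of paths, which is handy for quick tests; a sketch round-tripping the example CSV content from above through StringIO:

```python
from io import StringIO
import pandas as pd

csv_text = "a,b,c,d\n0,1,2,3\n4,5,6,7\n8,9,10,11\n12,13,14,15\n"
df = pd.read_csv(StringIO(csv_text))  # a file-like object works like a path
buf = StringIO()
df.to_csv(buf, index=False)  # write back to an in-memory buffer
print(df.shape)          # (4, 4)
print(df['a'].tolist())  # [0, 4, 8, 12]
```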



## Visualization

• Pandas visualizations components are built on top of Matplotlib

### Matplotlib

• It is a Python 2-dimensional plotting library based on MATLAB's plotting capabilities; it allows you to generate plots, histograms, power spectra, bar charts, error charts, scatter plots, etc.

• Matplotlib has two API structures you can use

1. Matplotlib Object Oriented API structure: instantiate figure objects then call methods and attributes from those objects
2. Matplotlib Functional API structure: just make calls to functions
• A figure in Matplotlib is just like an empty canvas; you then add axes by setting left, bottom, width and height as percentages of the canvas

• Matplotlib allows you to control figure size, aspect ratio and DPI (Dots Per Inch / Pixels Per Inch)

##### Simple Functional plot
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 5, 11)
y = x ** 2

# Functional method to plot
## Simple plot
plt.plot(x, y)
plt.xlabel("x")
plt.ylabel("y")
plt.title("y=x^2")
plt.show()

##### Simple Functional Subplot
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 5, 11)
y = x ** 2

# Functional method to plot
## Subplot
plt.subplot(1, 2, 1)
plt.plot(x, y, 'r')

plt.subplot(1, 2, 2)
plt.plot(y, x, 'b')

plt.show()

##### Object Oriented Simple and Inner plot
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 5, 11)
y = x ** 2

# Object Oriented method to plot
fig = plt.figure()
# add axes: left, bottom, width, height as fractions of the canvas
axes = fig.add_axes([0.1, 0.1, 0.8, 0.8])
axes.set_xlabel('x')
axes.set_ylabel('y')
axes.set_title('y = f(x)')
axes.plot(x, y)

fig = plt.figure()

axes1 = fig.add_axes([0.1, 0.1, 0.8, 0.8])
axes1.set_xlabel('x')
axes1.set_ylabel('y')
axes1.set_title('y = x')
axes1.plot(x, y)

axes2 = fig.add_axes([0.2, 0.5, 0.4, 0.3])
axes2.set_xlabel('y')
axes2.set_ylabel('x')
axes2.set_title('x = y')
axes2.plot(y, x)

plt.show()

##### Object Oriented Subplot
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 5, 11)
y = x ** 2

# Object Oriented method to subplot
fig, axes = plt.subplots(nrows=1, ncols=2)

# axes is an array of Axes objects we can iterate
for current_axis in axes:
    current_axis.plot(x, y)

# axes is a list we can index
fig, axes = plt.subplots(nrows=1, ncols=2)

axes[0].plot(x, y, 'r')
axes[0].set_xlabel('x')
axes[0].set_ylabel('y')
axes[0].set_title('y = x')

axes[1].plot(x, y, 'b')
axes[1].set_xlabel('x')
axes[1].set_ylabel('y')
axes[1].set_title('y = f(x)')

plt.tight_layout()
plt.show()

##### Matplotlib plot arguments and attributes
• Matplotlib plot allow you to set
• Figure Size
• DPI
• Color
• Line Width ('lw')
• Line Style ('ls')
• Draw Style ('ds')
• Markers
• Axis limits and plot range
• Labels
• Location to save a file of figure
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 5, 11)

# Set figure size, DPI, legends and save figure
width, height = 10, 5
dpi = 100
fig = plt.figure(figsize=(width, height), dpi=dpi)

ax = fig.add_axes([0.1, 0.1, 0.8, 0.8])
ax.plot(x, x ** 3, label='x^3 steps', color='black', ds='steps')
ax.plot(x, x ** 3, label='x^3', color='blue', lw=1, ls='dashed')
ax.plot(x, x ** 2, label='x^2', color='green', lw=2, ls='dashdot')
ax.plot(x, 10*x , label='10x', color='red', ls='solid', marker='o', markersize=5)
ax.plot(x, x, label='x', color='#FF8C00', lw=3, alpha=0.5, ls='dotted')  # RGB Hex Code
ax.set_xlabel('x')
ax.set_ylabel('f(x)')
ax.set_title('f(x) Plot(s)')
ax.legend(loc=(0.05, 0.2))

fig.savefig('my_fig.png', dpi=dpi)
plt.show()
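Axis limits and plot range from the list above can be set explicitly on an axes object with set_xlim/set_ylim; a minimal sketch (using the non-GUI 'agg' backend so it runs anywhere):

```python
import matplotlib
matplotlib.use('agg')  # render off-screen, no GUI needed
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 5, 11)
fig, ax = plt.subplots()
ax.plot(x, x ** 2)
ax.set_xlim([0, 3])   # only show x in [0, 3]
ax.set_ylim([0, 10])  # clip y to [0, 10]
print(ax.get_xlim())  # (0.0, 3.0)
fig.savefig('limits_fig.png')
```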

###### Matplotlib plotting a sine wave
import numpy
import math
import matplotlib.pyplot as plt

def f(x):
    return math.sin(x)

start, stop, step = -2.0*math.pi, 2.0*math.pi, 0.1

x = [i for i in numpy.arange(start, stop, step)]
f_x = list(map(f, x))

plt.plot(x, f_x)
plt.ylabel('f(x)')
plt.xlabel('x')
plt.grid()
plt.xticks([i for i in numpy.arange(round(start), round(stop), 1)])

plt.show()

###### Matplotlib plotting with no GUI
import numpy as np
import matplotlib
matplotlib.use('agg')
import matplotlib.pyplot as plt

x = np.linspace(0, 5, 11)
fig = plt.figure(figsize=(16, 4))

ax = fig.add_axes([0.1, 0.1, 0.8, 0.8])
ax.plot(x, x ** 3)

fig.savefig('my_fig.png')


### Pandas Visualization

• Usually we use the plot method from either a Pandas DataFrame or a Pandas Series object (columns in a DataFrame are Series, so we can call it directly on a column of the DataFrame)

• optionally install seaborn to format plots automatically: python -m pip install seaborn

###### Pandas plotting with no GUI
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use('agg')
import matplotlib.pyplot as plt

d = {
"A": [0.039761986133905136, 0.9372879037285884,  0.7805044779316328,  0.6727174963492204,  0.05382860859967886, 0.2860433671280178,  0.4304355863327313,  0.3122955538295512,  0.1877648514121828,  0.9081621790575398],
"B": [0.2185172274750622,  0.04156728027953449, 0.008947537857148302,0.24786984946279625, 0.5201244020579979,  0.5934650440000543,  0.16623013749421356, 0.5028232900921878,  0.9970746427719338,  0.23272641071536715],
"C": [0.10342298051665423,0.8991254222382951, 0.5578084027546968, 0.2640713103088026, 0.5522642392797277, 0.9073072637456548, 0.4693825447762464, 0.8066087010958843, 0.8959552961495315, 0.4141382611943452],
"D": [0.9579042338107532, 0.9776795571253272, 0.7975104497549266, 0.44435791644122935, 0.19000759632053632, 0.6378977150631427, 0.4977008828313123, 0.8505190941429479, 0.530390137569463, 0.4320069001558664]
}
df = pd.DataFrame(data=d)

fig = df[["A", "B"]].plot(figsize=(16,6)).get_figure()
fig.savefig('my_fig.png')

##### Different types of plots with Pandas
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import pandas as pd
sns.set_theme()

d = {
"A": [0.039761986133905136, 0.9372879037285884,  0.7805044779316328,  0.6727174963492204,  0.05382860859967886, 0.2860433671280178,  0.4304355863327313,  0.3122955538295512,  0.1877648514121828,  0.9081621790575398],
"B": [0.2185172274750622,  0.04156728027953449, 0.008947537857148302,0.24786984946279625, 0.5201244020579979,  0.5934650440000543,  0.16623013749421356, 0.5028232900921878,  0.9970746427719338,  0.23272641071536715],
"C": [0.10342298051665423,0.8991254222382951, 0.5578084027546968, 0.2640713103088026, 0.5522642392797277, 0.9073072637456548, 0.4693825447762464, 0.8066087010958843, 0.8959552961495315, 0.4141382611943452],
"D": [0.9579042338107532, 0.9776795571253272, 0.7975104497549266, 0.44435791644122935, 0.19000759632053632, 0.6378977150631427, 0.4977008828313123, 0.8505190941429479, 0.530390137569463, 0.4320069001558664]
}
df1 = pd.DataFrame(data=d)

# Plot a histogram from a column (Equivalents, just uncomment one)
df1['A'].hist(bins=5)
#df1['A'].plot.hist(bins=5)
#df1['A'].plot(kind='hist', bins=5)

# Plot an area from a dataframe
df1.plot.area(alpha=0.5)

# Plot a bar plot from a dataframe (takes the index as the category)
#df1.plot.bar()
df1.plot.bar(stacked=True)

# Plot line, index will be used for x
# you can add matplotlib arguments like ds, lw, ls, etc
df1.plot.line(y='B', lw=2)

# Plot scatter, using two columns one for 'x' one for 'y'
df1.plot.scatter(x='A', y='B', color="blue")

# Plot scatter, 3 variables one column as color value
df1.plot.scatter(x='A', y='B', c='C', cmap='coolwarm')

# Plot scatter, 3 variables one column as s=size
df1.plot.scatter(x='A', y='B', s=df1['C']*500, color='blue')

# Plot box plot of dataframe
df1.plot.box()

# hexagonal bin plot (hexagons get darker as more points fall in them)
df = df1[['A', 'B']]
print(df)
df.plot.hexbin(x='A', y='B', gridsize=15)

# Kernel density estimation (Equivalents, just uncomment one)
df1["A"].plot.density()
#df1["A"].plot.kde()
#df1.plot.density() # Density estimation of whole dataframe

plt.show()


#### Pandas time series visualization

##### Basic Time series plot with Pandas
import matplotlib.dates as dates
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
sns.set_theme()

from io import StringIO
# mc_str is assumed to hold CSV text with 'Date' and 'Adj. Close' columns
mcdon = pd.read_csv(StringIO(mc_str), sep=",", index_col='Date', parse_dates=True)
print("DF", mcdon)

plt.figure()
mcdon["Adj. Close"].plot(xlim=['2007-01-01', '2009-01-01'], ylim=[30, 50], ls='dashed', c='red')

idx_all = mcdon.index
idx = mcdon.loc['2007-01-01':'2007-05-01'].index
stock = mcdon.loc['2007-01-01':'2007-05-01']["Adj. Close"]

print("ALL Index", idx_all) ; print("SEL Index", idx) ; print("SEL stock", stock)

fig, ax = plt.subplots()
ax.plot_date(idx, stock, '-')
fig.autofmt_xdate()

# Location / locating
ax.xaxis.set_major_locator(dates.MonthLocator())
ax.xaxis.set_minor_locator(dates.WeekdayLocator(byweekday=0))
# Formatting
ax.xaxis.set_major_formatter(dates.DateFormatter('\n%B-%Y'))
ax.xaxis.set_minor_formatter(dates.DateFormatter('%d'))

plt.show()


## Data sources

• pandas-datareader is a separate package that allows you to connect to certain stock data APIs (like Alpha Vantage, used below) to grab data and create DataFrames with it

• Quandl is a company that offers a Python API to grab data from different data sources (some sources are free, others are paid) and has a certain limit of queries per day for free accounts

• quandl.get gets you a single time series
• quandl.get_table gets you an entire datatable
import pandas_datareader.data as web
import seaborn as sns
import matplotlib.pyplot as plt
import quandl
import datetime
sns.set_theme()

YOUR_AV_API_KEY_HERE = "AAAAAAAAAA"
YOUR_QUANDL_API_KEY_HERE = "QQQQQQQQQQQQQ"

start, end = datetime.datetime(2015, 1, 1), datetime.datetime(2017, 1, 1)
fb = web.DataReader(name='FB', data_source='av-daily', start=start, end=end, api_key=YOUR_AV_API_KEY_HERE)
fb["close"].plot(title="FB close AV data")

# Save data .csv file
fb.to_csv("fb.csv", index_label="Date")

# quandl API
mydata = quandl.get("EIA/PET_RWTC_D", api_key=YOUR_QUANDL_API_KEY_HERE)
mydata.plot(title="EIA/PET_RWTC_D Quandl data")

# https://www.quandl.com/databases/WIKIP/data
aapl_data = quandl.get("WIKI/AAPL", api_key=YOUR_QUANDL_API_KEY_HERE)

aapl_open = aapl_data["Open"] # quandl.get("WIKI/AAPL.1")
plt.figure()
aapl_open.plot(title="WIKI/AAPL.1 Quandl data")

plt.show()


## Pandas Time Series Data

• Refers to data with a DatetimeIndex and some corresponding value; Pandas has specific features for this kind of data

### DateTime Index basics

• For cases where date and time information is not just another column in the dataframe but instead is the actual index

• datetime is the Python built-in module to create timestamps and datetime objects (from datetime import datetime)

import numpy as np
import pandas as pd
from datetime import datetime

my_datetime = datetime(year=2017, month=1, day=2, hour=13, minute=30, second=15)
print(type(my_datetime), my_datetime)

n_days = 2
dates = [datetime(2016, 1, d) for d in range(1, n_days+1)]
print(type(dates), dates)

dt_index = pd.DatetimeIndex(data=dates)
print(type(dt_index), dt_index)

data = {
"a": list(range(1, n_days+1)),
"b": list(map(lambda x: x*3, range(1, n_days+1)))
}

df = pd.DataFrame(data=data, index=dt_index)
print(df)

print("min/first date is at int index", df.index.argmin(), "date value is", df.index.min())
print("max/latest date is at int index", df.index.argmax(), "date value is", df.index.max())


### Time Resampling

• We usually have data with a fine-grained DatetimeIndex (like every day, every hour, every minute, etc.) but often we want to aggregate or group the data based on some coarser frequency (like monthly, quarterly, yearly, etc.); to do this we use time resampling

• Resample works on dataframes with a DatetimeIndex; you give it a rule (an alias that tells how you want to resample)

| Alias/Rule | Description |
| --- | --- |
| C | custom business day frequency (experimental) |
| D | calendar day frequency |
| W | weekly frequency |
| M | month end frequency |
| SM | semi-month end frequency (15th and end of month) |
| CBM | custom business month end frequency |
| MS | month start frequency |
| SMS | semi-month start frequency (1st and 15th) |
| CBMS | custom business month start frequency |
| Q | quarter end frequency |
| QS | quarter start frequency |
| A | year end frequency |
| AS | year start frequency |
| H | hourly frequency |
| T, min | minutely frequency |
| S | secondly frequency |
| L, ms | milliseconds |
| U, us | microseconds |
| N | nanoseconds |
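As a quick sketch of the rules above, daily data can be grouped into weekly bins with rule 'W' (note that newer pandas versions rename some aliases, e.g. 'M' becomes 'ME'):

```python
import numpy as np
import pandas as pd

idx = pd.date_range(start='2021-01-01', periods=14, freq='D')
df = pd.DataFrame(data={'value': np.arange(14)}, index=idx)
# one output row per week-ending date, values aggregated with sum
weekly = df.resample(rule='W').sum()
print(len(weekly))
print(weekly['value'].sum())  # totals are preserved: 91
```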

### Time Shifts

• We use time shifting when we want to shift data forward or backward a certain amount of time steps
import seaborn as sns
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
sns.set_theme()

# Split steps to Convert column to datetime index
# df['Date'] = pd.to_datetime(df['Date']) # df['Date'].apply(pd.to_datetime) # Equivalent
# df.set_index('Date', inplace=True)

# Read and parse in one line ('stock.csv' is an example file name)
df = pd.read_csv('stock.csv', index_col='Date', parse_dates=True)

print("Resampler")
df_resampler_year = df.resample(rule='A') # A = year end frequency
print(type(df_resampler_year), df_resampler_year)
print("Mean of Resampler year")
print(df_resampler_year.mean())
print("Max of Resampler year")
print(df_resampler_year.max())
print("Function to Resampler year")
print(df_resampler_year.apply(lambda entry: entry[0])) # value of first day for the particular time period, lambda is applied to all elements of the period

print("Mean of Resampler quarter")
print(df.resample(rule='Q').mean())

df['Close'].resample(rule='A').mean().plot.bar(width=150)

print(df.tail())

print("Shift downwards / push at beginning remove last")
print(df.shift(periods=1).tail())

print("Shift upwards / push at end remove first")
print(df.shift(periods=-1).tail())

print("T shift / shift index")
print(df.shift(freq='M')) # tshift(freq=...) was deprecated in newer pandas; shift(freq=...) shifts the index instead of the data
print(df.shift(freq='A'))

plt.tight_layout()
plt.show()


### Rolling and Expanding

• The Rolling Mean / Moving Average (MA) is a common indicator of the general trend of the data (it reduces noise). It is calculated by taking a window of time (e.g. 30 days), computing the mean of the values in that window, then shifting the window by one unit (e.g. 1 day) and repeating the process over the whole data set. This results in a data set of the same size, but the first N values, determined by the size of the window (e.g. the first 30 days), won't have values because a full window of data is needed to compute the first rolling mean value

• Pandas rolling method allows you to provide a window time period and use it to calculate an aggregate statistic (like the mean, sum or std)

• The rolling method is usually used alongside the mean method to calculate the MA. For example, if you set the window argument to 7 days the average is computed over the previous 7 values of the data set, which is the 7 day MA
• Pandas expanding allows you to provide a minimum number of periods and groups all values from the start of the data set up to the current value

• The expanding method is usually used alongside the mean method to calculate the average of all the values that came before a certain value
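As a minimal sketch of the difference, on a toy series (values made up): rolling needs a full window before producing values, while expanding averages everything seen so far:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])

rolling_mean = s.rolling(window=3).mean()
expanding_mean = s.expanding().mean()

print(rolling_mean.tolist())    # [nan, nan, 2.0, 3.0, 4.0]
print(expanding_mean.tolist())  # [1.0, 1.5, 2.0, 2.5, 3.0]
```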
import pandas as pd
import matplotlib.pyplot as plt

# Read and parse in same line (file name is an example)
df = pd.read_csv('stock.csv', sep=',', index_col='Date', parse_dates=True)

# With window=7 the first 6 rows will be NaN; the 7th row is the average of the first 7 values

# Create new rows with MAs and Expanding data
df["Close 7 Day MA"] = df["Close"].rolling(window=7).mean()
df["Close 30 Day MA"] = df["Close"].rolling(window=30).mean()
df["Close 90 Day MA"] = df["Close"].rolling(window=90).mean()
df["Close Expanding"] = df["Close"].expanding().mean()

df[["Close 7 Day MA", "Close 30 Day MA", "Close 90 Day MA","Close", "Close Expanding"]].plot(figsize=(16,6))

plt.show()


#### Bollinger Bands

• Volatility Bands placed above (upper band) and below (lower band) a moving average

• Typically uses a 20 day Moving Average
• Volatility is based on the Standard Deviation (because the STD changes as volatility increases or decreases)
• Bands will widen when volatility increases
• high volatility = drastic changes = big drops & jumps
• Bands will narrow when volatility decreases
• low volatility = small changes = small drops & jumps
• Volatility Bands usage:

• To identify tops & bottoms and to determine strength of a trend
• To determine whether a price movement is significant
• Based on Bollinger Bands prices are relatively high when they are above the upper band and low when they are below the lower band

import pandas as pd
import matplotlib.pyplot as plt

# Read and parse in same line (file name is an example)
df = pd.read_csv('stock.csv', sep=',', index_col='Date', parse_dates=True)

# days window (e.x days window = 20)
dw = 20
dw_str = str(dw)

# Close days_window MA
df["Close {} Day MA".format(dw_str)] = df["Close"].rolling(window=dw).mean()

# Upper Band = days_window MA + 2*STD(days_window)
df["Upper"] = df["Close {} Day MA".format(dw_str)] + 2*(df["Close"].rolling(window=dw).std())

# Lower Band = days_window MA - 2*STD(days_window)
df["Lower"] = df["Close {} Day MA".format(dw_str)] - 2*(df["Close"].rolling(window=dw).std())

df[["Close", "Close {} Day MA".format(dw_str), "Upper", "Lower", ]].plot(figsize=(16,6))

plt.show()


## Stocks/Assets Analysis

• Market Cap of a company or asset refers to the stock/asset price multiplied by how many available units of the stock/asset there are

• You can obtain the Total Traded (Total money traded) by multiplying the volume by the opening price, which is a representation of the total amount of money being traded in a period of time

• A Candlestick OHLC is a graph where the candles show the Open, High, Low, Close values of an asset

• A green candle means that the closing price on that day was higher than the opening price
• A red candle means that the closing price on that day was lower than the opening price
• Open and close values are the edges of the rectangle/candle
• High and Low are represented by the line
• Percentage change/Returns defines $r_t$, the return at time $t$, as the price at time $t$ divided by the price at a previous point in time $t-1$ (the previous day, hour, month, etc.), minus one

• Used in analyzing the volatility of a stock/asset. If daily returns have a wide distribution, the stock/asset is more volatile from one point in time to the next, but if you have a narrow distribution centered around zero that means you have a relatively stable stock/asset
• Daily Percentage Change or Daily Returns is a specific case of the percent change that measures changes from one day to another

$$r_t = \dfrac{p_t}{p_{t-1}} - 1$$

• Cumulative return is the aggregate amount an investment has gained or lost over time independent of the period of time involved
• It is calculated relative to the day an investment is made
• If the cumulative return is above one you are making a profit, otherwise you are at a loss

$$i_t = \dfrac{p_t}{p_{t_0}}$$ $$i_t = (1+r_t) \cdot i_{t-1}$$
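The two cumulative return formulas above are equivalent; a quick check on a made-up price series:

```python
import pandas as pd

# Hypothetical price series
prices = pd.Series([100.0, 110.0, 99.0, 118.8])

daily_returns = prices.pct_change(1)        # r_t = p_t / p_{t-1} - 1
cumulative = (1 + daily_returns).cumprod()  # i_t = (1 + r_t) * i_{t-1}
direct = prices / prices.iloc[0]            # i_t = p_t / p_0

# Both give the same values (the first cumprod entry is NaN from pct_change)
print(cumulative.tolist())
print(direct.tolist())
```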

import os
import seaborn as sns
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix
import mplfinance as mpf
from mplfinance.original_flavor import candlestick_ohlc
from matplotlib.dates import DateFormatter,date2num,WeekdayLocator,DayLocator,MONDAY
import matplotlib
matplotlib.use('agg')
import matplotlib.pyplot as plt
sns.set_theme()

base_path = "/home/ubuntu/"

# Get data
tesla_stock = pd.read_csv(os.path.join(base_path, "tsla.csv"), sep=",", index_col='Date', parse_dates=True)
tesla_stock.name = "Tesla"
ford_stock = pd.read_csv(os.path.join(base_path, "ford.csv"), sep=",", index_col='Date', parse_dates=True)
ford_stock.name = "Ford"
gm_stock = pd.read_csv(os.path.join(base_path, "gm.csv"), sep=",", index_col='Date', parse_dates=True)
gm_stock.name ="GM"

car_companies_tuple = (tesla_stock, ford_stock, gm_stock)

tesla_stock['Total Traded'] = tesla_stock['Open'] * tesla_stock['Volume']
ford_stock['Total Traded'] = ford_stock['Open'] * ford_stock['Volume']
gm_stock['Total Traded'] = gm_stock['Open'] * gm_stock['Volume']

# Get max stats of stocks
## Max volume of stocks: Volume increase means big sell of or lots of trading happening
## Total Traded: Refers to total amount of money being traded on a given day
stats = ('Volume', 'Total Traded')
for stat in stats:
    for stock_df in car_companies_tuple:
        i_max = stock_df[stat].argmax()
        date_max = stock_df.index[i_max]
        max_stat = stock_df[stat].max()
        print('{} max {} of {} in {}'.format(stock_df.name, stat, max_stat, date_max))

# Calculate MAs
tesla_stock["MA50"] = tesla_stock["Open"].rolling(window=50).mean()
tesla_stock["MA200"] = tesla_stock["Open"].rolling(window=200).mean()

ford_stock["MA50"] = ford_stock["Open"].rolling(window=50).mean()
ford_stock["MA200"] = ford_stock["Open"].rolling(window=200).mean()

gm_stock["MA50"] = gm_stock["Open"].rolling(window=50).mean()
gm_stock["MA200"] = gm_stock["Open"].rolling(window=200).mean()

# Combine columns in single dataframe
car_companies_df = pd.concat([tesla_stock["Open"], gm_stock["Open"], ford_stock["Open"]], axis=1)
car_companies_df.columns = ["Tesla Open", "GM Open", "Ford Open"]

# Calculate Daily Returns manually
#tesla_stock["Daily Returns"] = (tesla_stock["Close"] / tesla_stock["Close"].shift(1)) - 1
#ford_stock["Daily Returns"] = (ford_stock["Close"] / ford_stock["Close"].shift(1)) - 1
#gm_stock["Daily Returns"] = (gm_stock["Close"] / gm_stock["Close"].shift(1)) - 1

# Calculate Daily Returns pandas pct method
tesla_stock["Daily Returns"] = tesla_stock["Close"].pct_change(1)
ford_stock["Daily Returns"] = ford_stock["Close"].pct_change(1)
gm_stock["Daily Returns"] = gm_stock["Close"].pct_change(1)

# Concat Daily returns for box plotting
box_df = pd.concat([tesla_stock['Daily Returns'], ford_stock['Daily Returns'], gm_stock['Daily Returns']], axis=1)
box_df.columns = ['Tesla Returns', 'Ford Returns', 'GM Returns']

# Calculate Cumulative Returns pandas cumprod method
tesla_stock["Cumulative Returns"] = (1 + tesla_stock["Daily Returns"]).cumprod()
ford_stock["Cumulative Returns"] = (1 + ford_stock["Daily Returns"]).cumprod()
gm_stock["Cumulative Returns"] = (1 + gm_stock["Daily Returns"]).cumprod()

# Plot Single stats
"""
#to_plot = ('Open', 'Volume', 'Total Traded', 'Cumulative Returns')
to_plot = ('Cumulative Returns',)
for stat in to_plot:
    fig = plt.figure(figsize=(12, 6))
    plt.title(stat)
    tesla_stock[stat].plot(label='Tesla')
    ford_stock[stat].plot(label='Ford')
    gm_stock[stat].plot(label='GM')
    plt.legend()
    fig.savefig(os.path.join(base_path, "{}_fig.png".format(stat.replace(" ", "_").lower())))
"""

# Plot MAs
"""
for stock_df in car_companies_tuple:
    fig = stock_df[['Open', 'MA50', 'MA200']].plot(title="{} MA".format(stock_df.name), figsize=(16,6)).get_figure()
    fig.savefig(os.path.join(base_path, "{}_ma.png".format(stock_df.name)))
"""

# Plot scatter matrix
"""
scatter_matrix(car_companies_df, alpha=0.2, hist_kwds={'bins':50})
plt.savefig(os.path.join(base_path, "scatter_matrix.png"))
"""

# Plot candlestick original ohlc method
"""
ford_reset = ford_stock.loc['2012-01'].reset_index()
ford_reset['dates_ax'] = ford_reset["Date"].apply(lambda date: date2num(date))

cols = ['dates_ax', 'Open', 'High', 'Low', 'Close']
ford_values = [tuple(vals) for vals in ford_reset[cols].values]

mondays = WeekdayLocator(MONDAY)        # major ticks on the mondays
alldays = DayLocator()              # minor ticks on the days
weekFormatter = DateFormatter('%b %d')  # e.g., Jan 12
dayFormatter = DateFormatter('%d')      # e.g., 12

fig, ax = plt.subplots()
ax.xaxis.set_major_locator(mondays)
ax.xaxis.set_minor_locator(alldays)
ax.xaxis.set_major_formatter(weekFormatter)
#ax.xaxis.set_minor_formatter(dayFormatter)

candlestick_ohlc(ax, ford_values, width=0.6, colorup='g', colordown="r")
fig.savefig(os.path.join(base_path, "candles.png"))
"""

# Plot Histograms
"""
fig = plt.figure(figsize = (12, 6))
for stock_df in car_companies_tuple:
    stock_df["Daily Returns"].hist(bins=100, label="{} Daily Returns Hist".format(stock_df.name))
plt.legend()
fig.savefig(os.path.join(base_path, "comp_hist.png"))
"""

# Plot Kernel Density estimation
"""
# Based on KDE we can see Ford is the most stable and Tesla is the most volatile
fig = plt.figure(figsize = (10, 8))
for stock_df in car_companies_tuple:
    stock_df["Daily Returns"].plot(kind='kde', label="{} Daily Returns KDE".format(stock_df.name))
plt.legend()
fig.savefig(os.path.join(base_path, "comp_kde.png"))
"""

# Plot box of daily returns
"""
fig = box_df.plot(kind='box', figsize = (8, 11)).get_figure()
fig.savefig(os.path.join(base_path, "comp_box.png"))
"""

# Plot scatter matrix of daily returns
"""
scatter_matrix(box_df, figsize=(16,16), alpha=0.2, hist_kwds={'bins':100})
plt.savefig(os.path.join(base_path, "scatter_matrix_daily.png"))
"""

# Plot scatter plot of two daily returns
"""
fig = box_df.plot(kind='scatter', x='Ford Returns', y='GM Returns', c='b', alpha=0.5, figsize=(8, 8)).get_figure()
fig.savefig(os.path.join(base_path, "ford_gm_scatter.png"))
"""


## Time Series

• As we know Time Series store data with a Date Time Index and some corresponding value for each Date Time Index

### Time Series Properties

#### Trend

• Describes on average what the value is doing. A trend tells us what is happening to the mean value e.x. moving upwards (increasing), staying stationary or going downwards (decreasing)

#### Seasonality

• Seasonality is a repeating trend or pattern, it tells us if there is a repetitive trend. E.x. a general trend can go downwards but have a seasonality that increases in certain months, so you can pinpoint a season

#### Cyclical

• Refers to trends with no repetition (no seasonality). E.x. a stock that sometimes goes up and sometimes goes down with no pattern or repeating trend, so you cannot pinpoint a season
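A synthetic series can make trend and seasonality concrete. The sketch below (all numbers made up) builds a monthly series from an increasing mean plus a component that repeats every 12 months plus noise:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
idx = pd.date_range("2015-01-01", periods=48, freq="MS")

trend = np.linspace(100, 160, 48)                          # increasing mean (trend)
seasonality = 10 * np.sin(2 * np.pi * np.arange(48) / 12)  # repeats every 12 months
noise = rng.normal(0, 2, 48)                               # random component

ts = pd.Series(trend + seasonality + noise, index=idx)
print(ts.head())
```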

### statsmodels

• Python library to deal with Time Series data; allows us to explore data, estimate statistical models and perform statistical tests

• statsmodel comes with datasets that can be used for testing purposes

• Using statsmodels to get the trend of time series can be done with the Hodrick-Prescott Filter that separates a time series $y_t$ into a trend $\tau_t$ and cyclical component (cycle) $\zeta_t$

$$y_t = \tau_t + \zeta_t$$

import os
import seaborn as sns
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use('agg')
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf

base_path = "/home/ubuntu/"

# Get Data information
print(sm.datasets.macrodata.NOTE)

# Load the dataset as a pandas dataframe
df = sm.datasets.macrodata.load_pandas().data

# Generate index with statsmodels and update dataframe index
index = pd.Index(sm.tsa.datetools.dates_from_range(start='1959Q1', end='2009Q3'))
df.index = index

# Calculate trend
gdp_cycle, gdp_trend = sm.tsa.filters.hpfilter(df["realgdp"])
df["trend"] = gdp_trend

# Plot portion of data and trend
fig = df[["realgdp", "trend"]]["2000-03-31":].plot(figsize=(16, 6)).get_figure()
fig.savefig(os.path.join(base_path, "my_fig.png"))


### ETS Models

• ETS (Error-Trend-Seasonality) Models try to take the Error, Trend or Seasonality terms for smoothing purposes and may perform other linear transformations

• A Time Series Decomposition with ETS is a method of breaking down a time series into Error/Residual, Trend and Seasonality terms

### EWMA Models

• Simple Moving Average weak points:

• It can give more noise than signal in a smaller window
• We will have missing data at the beginning
• Doesn't really inform about possible future behavior, it only describes the trend
• Extreme historical values can really affect the MA
• EWMA (Exponentially Weighted Moving Average) is an alternative to the Moving Average that avoids these weak points by putting more weight on values that occurred more recently. The span can be thought of as the time window size, e.x. a span of 12 in a time series where each point is one month of data gives a 12 month EWMA
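As a sketch of what `ewm(span=...)` computes: pandas derives the smoothing factor as alpha = 2 / (span + 1), and with `adjust=False` the result matches the simple recursion y_t = (1 - alpha) * y_{t-1} + alpha * x_t (toy values below):

```python
import pandas as pd

s = pd.Series([10.0, 11.0, 12.0, 11.0, 13.0])

span = 3
alpha = 2 / (span + 1)  # how pandas derives alpha from span

ewma = s.ewm(span=span, adjust=False).mean()

# Manual recursion: y_t = (1 - alpha) * y_{t-1} + alpha * x_t
manual = [s.iloc[0]]
for x in s.iloc[1:]:
    manual.append((1 - alpha) * manual[-1] + alpha * x)

print(ewma.tolist())
print(manual)
```

Note that the pandas default is `adjust=True`, which uses a bias-corrected weighted average instead of the plain recursion.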

import os
import seaborn as sns
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use('agg')
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.tsa.seasonal import seasonal_decompose

base_path = "/home/ubuntu/"

# Read data (file name is an example)
airline = pd.read_csv('airline_passengers.csv', index_col='Month')

# EWMA
airline.dropna(inplace=True)
airline.index = pd.to_datetime(airline.index)

airline['6M SMA'] = airline['Thousands of Passengers'].rolling(window=6).mean()
airline['12M SMA'] = airline['Thousands of Passengers'].rolling(window=12).mean()
airline['12M EWMA'] = airline['Thousands of Passengers'].ewm(span=12).mean()

airline.plot(figsize=(16, 6)).get_figure().savefig(os.path.join(base_path, "EWMA.png"))

# ETS
result = seasonal_decompose(airline['Thousands of Passengers'], model="multiplicative")
fig = result.plot()
fig.set_size_inches(10, 16)
fig.savefig(os.path.join(base_path, "ETS.png"))


### ARIMA Models

• ARIMA (AutoRegressive Integrated Moving Averages) Models are a generalization of the ARMA (AutoRegressive Moving Averages) Models; these models are used to predict future points in a time series (forecasting). Types:

• Non-Seasonal ARIMA: for data without repetitions or patterns (no season) - we need to set p, d and q
• Seasonal ARIMA: for data with a repetition pattern or season - we need to set additional P, D and Q
• ARIMA Models are usually not used for financial data because they assume the Y values have a strong connection or correlation to time; stocks and financial assets follow a random walk (they go up and down), so for those other models like Monte Carlo are used

• Components of ARIMA models:

• AR (p) - Autoregression: Regression model that uses the relationship between the current and previous observations over a period
• I (d) - Integrated: Differencing of observations (subtracts an observation from a previous one)
• MA (q) - Moving Average: Model that uses the dependency between an observation and a residual error from a MA
• Stationary data

• Data has a constant mean and variance over a period of time
• Allows us to predict the mean and variance in a future period of time
• The average value is constant throughout a period of time
• Variance should not be a function of time
• Covariance should not be a function of time
• The Augmented Dickey-Fuller test is a mathematical test to check for stationary data

• To use ARIMA we need stationary data; in case our data is not stationary we can transform it through differencing, which means we just subtract the previous value: $y_t = x_t - x_{t-1}$ where $x$ are the original values and $y$ are the new values. This comes with the cost of losing one value at the beginning of the data set
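The differencing above is exactly what pandas `diff` does; a small check on made-up values:

```python
import pandas as pd

x = pd.Series([5.0, 7.0, 6.0, 9.0, 8.0])

# y_t = x_t - x_{t-1}, two equivalent ways
manual_diff = x - x.shift(1)
pandas_diff = x.diff(1)

print(manual_diff.tolist())  # first value is NaN (the value lost at the beginning)
print(pandas_diff.tolist())
```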

#### ACF & PACF

• Autocorrelation plots (Correlogram) show the correlation of a data set with itself shifted by $x$ time units; in these plots the y-axis is the correlation coefficient and the x-axis is the number of units the data was shifted

• Take the data set $T1$ of length $T$ and make a copy of it to get $T2$
• Delete the first observation of $T1$ (shift left) and delete the last observation of $T2$ (shift right)
• Now you have two series of length $T-1$
• Calculate the correlation coefficient and plot it at $x=1$
• Repeat this for other values of $x$
• Types of autocorrelation plots:

• Gradual Decline: Gradually declining as we increase the number of shifts
• Sharp Drop-off: Large positive or negative value at first then it hovers around zero

• We use the autocorrelation plots to determine if we will use the AR, MA or both components of an ARIMA model, when we use AR or MA we set the p and q

• Positive correlation at $x=1$: we use AR <-> Gradual Decline
• Negative correlation at $x=1$: we use MA <-> Sharp Drop-off
• Partial autocorrelation plots show conditional correlation: the correlation between two variables under the assumption that we know and take into account the values of some other set of variables
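The shift-and-correlate steps above can be sketched by hand and compared with pandas `autocorr` (toy values; note that statsmodels' `acf` uses a slightly different normalization, so its numbers won't match exactly):

```python
import numpy as np
import pandas as pd

s = pd.Series([2.0, 4.0, 3.0, 5.0, 6.0, 5.0, 7.0])

# Drop the first observation of one copy and the last of the other,
# leaving two series of length T-1, then correlate them
t1 = s.iloc[1:].reset_index(drop=True)   # shifted left
t2 = s.iloc[:-1].reset_index(drop=True)  # shifted right
manual_corr = np.corrcoef(t1, t2)[0, 1]

print(manual_corr)
print(s.autocorr(lag=1))  # pandas computes the same Pearson correlation at lag 1
```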

#### ARIMA Implementation

• ARIMA general process:

1. Visualize the Time Series Data
2. Make the time series data stationary
3. Plot Correlation and Autocorrelation to get which parameters to use in ARIMA
4. Construct ARIMA Model with previously defined parameters
5. Use ARIMA to make predictions

import os
import pandas as pd
import matplotlib
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from pandas.plotting import autocorrelation_plot
from statsmodels.tsa.arima.model import ARIMA # statsmodels.tsa.arima_model.ARIMA is deprecated in newer statsmodels
from pandas.tseries.offsets import DateOffset

base_path = "."

# Read data (file name is an example)
df = pd.read_csv('monthly-milk-production.csv')
df.columns = ["Month", "Milk in Pounds Per Cow"]
df.drop(len(df)-1, axis=0, inplace=True)
df["Month"] = pd.to_datetime(df["Month"])
df.set_index("Month", inplace=True)
print(df.describe().transpose())

# 1. Visualize the Time Series Data
# df.plot()
"""
time_series = df["Milk in Pounds Per Cow"]
time_series.rolling(window=12).mean().plot(label='12 Month Moving Average')
time_series.rolling(window=12).std().plot(label='12 Month STD')
time_series.plot()
plt.legend()
decomp = seasonal_decompose(x=time_series, period=12)
decomp.plot()
"""

# 2. Make the time series data stationary
# 2.1 First check whether data is stationary
# Small p-value (<= 0.05) points to stationary data
# large p-value (> 0.05) points to non-stationary data
print("Augmented Dicky Fuller test for '{}'".format(time_series.name))
labels = ["ADF Test Stat", "p-value", "# of lags", "Num of Observations used"]
for value, label in zip(result, labels):
print("{} : {}".format(label, value))
if result[1] <= 0.05:
print("Evidence points to stationary data")
else:
print("Evidence points to non-stationary data")
print("{}".format('-'*20))

df["First Diff"] = df["Milk in Pounds Per Cow"] - df["Milk in Pounds Per Cow"].shift(1)
df["Second Diff"] = df["First Diff"] - df["First Diff"].shift(1)
df["Seasonal Diff"] = df["Milk in Pounds Per Cow"] - df["Milk in Pounds Per Cow"].shift(12)
df["Seasonal First Diff"] = df["First Diff"] - df["First Diff"].shift(12)

# df["First Diff"].plot()
# df["Second Diff"].plot()
# df["Seasonal Diff"].plot()
# df["Seasonal First Diff"].plot()

# 3. Plot Correlation and Auto-correlation
# plot with statsmodels
# fig_first = plot_acf(df["First Diff"].dropna())
# fig_sec = plot_acf(df["Seasonal First Diff"].dropna())

# plot with pandas
# autocorrelation_plot(df["First Diff"].dropna())
# autocorrelation_plot(df["Seasonal First Diff"].dropna())

# Positive correlation at lag 1, so we use AR terms
# plot_acf(df["Seasonal First Diff"].dropna())
# plot_pacf(df["Seasonal First Diff"].dropna())

# 4. Construct ARIMA Model
model = sm.tsa.statespace.SARIMAX(df["Milk in Pounds Per Cow"], order=(0, 1, 0), seasonal_order=(1, 1, 1, 12))
result = model.fit()
print(result.summary())
#result.resid.plot()
#result.resid.plot(kind="kde")

# 5. Make predictions with ARIMA
n_months_ahead = 24 # how many months ahead to forecast (example value)
future_dates = [df.index[-1] + DateOffset(months=x) for x in range(1, n_months_ahead + 1)]
future_df = pd.DataFrame(index=future_dates, columns=df.columns)
final_df = pd.concat([df, future_df])
# start/end are row positions; depending on the index you may need to pass dates instead
final_df["forecast"] = result.predict(start=len(df), end=len(df) + n_months_ahead)

final_df[["Milk in Pounds Per Cow", "forecast"]].plot()

plt.show()

