Review of Python Courses (Part 26)
Posted by Mark on February 12, 2021 at 07:29 | Last modified: February 15, 2021 11:54

In Part 25, I summarized my DataCamp courses 74-76. Today I will continue with the next three.
As a reminder, I introduced you to my recent work learning Python here.
My course #77 was Visualizing Time Series Data in Python. This course covers:
- Plot your first time series
- Customize your time series plot
- Clean your time series data (counting missing values in df)
- Plot aggregates of your data
- Summarizing the values in your time series data
- Autocorrelation and partial autocorrelation (from statsmodels.graphics import tsaplots)
- Seasonality, noise, and trend in time series data [from pylab import rcParams, sm.tsa.seasonal_decompose()]
- Working with more than one time series
- Plot multiple time series (adding statistical summaries to your plots)
- Find relationships between multiple time series [sns.heatmap(), sns.clustermap()]
- Apply your knowledge to a new dataset
- Beyond summary statistics
- Decompose time series data
- Compute correlations between time series
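The autocorrelation material is the heart of this course. As a pure-Python sketch (my own illustration, not the course's statsmodels code), lag-1 autocorrelation can be computed directly:

```python
def autocorr(series, lag=1):
    """Sample autocorrelation of a sequence at the given lag."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    cov = sum((series[i] - mean) * (series[i + lag] - mean)
              for i in range(n - lag))
    return cov / var

# A steadily rising series is strongly positively autocorrelated.
trend = [1, 2, 3, 4, 5, 6, 7, 8]
print(round(autocorr(trend, lag=1), 3))  # 0.625
```

In the course itself, `plot_acf()` from `statsmodels.graphics.tsaplots` draws the full correlogram with confidence bands.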
My course #78 was Financial Forecasting in Python. Topics covered in this course include:
- Introduction to financial statements
- Calculating sales and the cost of goods sold
- Working with raw datasets
- Introduction to the balance sheet
- Balance sheet efficiency ratios
- Financial periods and how to work with them
- The datetime module and the split() function
- Tips and tricks when working with datasets
- Building sensitive forecast models and common forecast assumptions
- Dependencies and sensitivity in financial forecasting
- Working with variances in the forecast
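The sales and cost-of-goods-sold material boils down to simple arithmetic. A minimal sketch with hypothetical figures (not the course's dataset):

```python
def gross_profit(units_sold, price_per_unit, cost_per_unit):
    """Sales minus cost of goods sold, the first line of the income statement."""
    sales = units_sold * price_per_unit
    cogs = units_sold * cost_per_unit
    return sales - cogs, sales, cogs

profit, sales, cogs = gross_profit(1000, 5.0, 3.0)
print(profit)  # 2000.0
```

Forecasting then becomes a matter of flexing the assumptions (units sold, price, unit cost) and watching how sensitive the bottom line is to each.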
My course #79 was Foundations of Probability in Python. This course covers:
- Let’s flip a coin in Python (from scipy.stats import bernoulli, binom)
- Probability mass and distribution functions
- Expected value, mean, and variance (from scipy.stats import describe)
- Calculating probabilities of two events (from scipy.stats import find_repeats, relfreq)
- Conditional probabilities
- Total probability law
- Bayes’ rule
- Normal distributions (from scipy.stats import norm, import matplotlib.pyplot as plt, import seaborn as sns)
- Normal probabilities
- Poisson distributions (from scipy.stats import poisson)
- Geometric distributions (from scipy.stats import geom)
- From sample mean to population mean (from scipy.stats import binom, describe)
- Adding random variables
- Linear regression (from sklearn.linear_model import LinearRegression, from scipy.stats import linregress)
- Logistic regression (from sklearn.linear_model import LogisticRegression)
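Bayes' rule is worth a worked example. A minimal sketch with hypothetical numbers (the sensitivity, prevalence, and overall positive rate below are made up for illustration):

```python
def bayes(p_b_given_a, p_a, p_b):
    """P(A|B) = P(B|A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b

# Hypothetical: a test is 90% sensitive, the condition has 1% prevalence,
# and the test comes back positive 10.8% of the time overall.
print(round(bayes(0.9, 0.01, 0.108), 4))  # 0.0833
```

Even with a positive result, the posterior probability is only about 8%, which is the classic base-rate lesson the course builds toward.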
I will review more courses next time.
Categories: Python | Comments (0) | Permalink

Review of Python Courses (Part 25)
Posted by Mark on February 9, 2021 at 07:29 | Last modified: February 12, 2021 09:25

In Part 24, I summarized my DataCamp courses 71-73. Today I will continue with the next three.
As a reminder, I introduced you to my recent work learning Python here.
My course #74 was Writing Functions in Python. Overall, I found this content to be quite challenging. The course covers:
- Docstrings (require string)
- DRY and “do one thing” [standardize function, mean_and_median()]
- Pass by assignment
- Using context managers
- Writing context managers
- Advanced topics
- Functions as objects
- Scope
- Closures
- Decorators
- Real-world examples
- Decorators and metadata (from functools import wraps)
- Decorators that take arguments
- Timeout(): a real-world example
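Decorators plus `functools.wraps` were the trickiest part of this course for me, so here is a small self-contained sketch (my own example, not the course's):

```python
from functools import wraps

def log_calls(func):
    """Decorator that counts calls while preserving the function's metadata."""
    @wraps(func)  # keeps func's __name__ and docstring intact
    def wrapper(*args, **kwargs):
        wrapper.calls += 1
        return func(*args, **kwargs)
    wrapper.calls = 0
    return wrapper

@log_calls
def add(a, b):
    """Return the sum of a and b."""
    return a + b

add(1, 2)
add(3, 4)
print(add.__name__, add.calls)  # add 2
```

Without `@wraps`, `add.__name__` would report `wrapper` and the docstring would vanish, which is exactly the metadata problem the course highlights.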
My course #75 was AI Fundamentals. Topics covered in this course include:
- What is all the AI fuss about?
- All models are wrong but some are useful
- Three flavors of machine learning
- Supervised learning fundamentals
- Training and evaluating classification models (confusion matrix, true/false positives/negatives)
- Training and evaluating regression models (from sklearn.preprocessing import PolynomialFeatures)
- Dimensionality reduction
- Clustering
- Anomaly detection
- Selecting the right model
- Deep learning and beyond
- Convolutional neural networks
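The confusion-matrix material translates directly into code. A pure-Python sketch (sklearn computes the same counts with `confusion_matrix`; the labels below are toy data):

```python
def confusion_counts(y_true, y_pred):
    """Tally true/false positives/negatives for binary labels."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, tn, fp, fn

y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1]
tp, tn, fp, fn = confusion_counts(y_true, y_pred)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(tp, tn, fp, fn)  # 2 2 1 1
```

Precision and recall fall straight out of the four counts, which is why the confusion matrix anchors the classification chapter.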
My course #76 was Introduction to Portfolio Analysis in Python. This course covers:
- Welcome to portfolio analysis
- Portfolio returns
- Measuring risk of a portfolio (formatting as percentage)
- Annualized returns
- Risk-adjusted returns (calculating the Sharpe ratio)
- Non-normal distribution of returns
- Alternative measures of risk
- Comparing against a benchmark
- Risk factors
- Factor models
- Portfolio analysis tools
- MPT (from pypfopt.efficient_frontier import EfficientFrontier; from pypfopt import risk_models, expected_returns)
- Maximum Sharpe vs. minimum volatility
- Alternative portfolio optimization
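Risk-adjusted returns reduce to a short formula. A stdlib-only sketch of an annualized Sharpe ratio (the course works in pandas; the daily figures below are hypothetical):

```python
import statistics

def sharpe_ratio(daily_returns, risk_free_daily=0.0, periods=252):
    """Annualized Sharpe ratio from a list of daily returns."""
    excess = [r - risk_free_daily for r in daily_returns]
    mean = statistics.mean(excess)
    sd = statistics.stdev(excess)
    return mean / sd * periods ** 0.5

print(round(sharpe_ratio([0.01, -0.005, 0.007, 0.002]), 2))
```

The `periods ** 0.5` factor is the usual square-root-of-time annualization for daily data.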
I will review more courses next time.
Categories: Python | Comments (0) | Permalink

Review of Python Courses (Part 24)
Posted by Mark on February 4, 2021 at 07:41 | Last modified: February 10, 2021 16:23

In Part 23, I summarized my DataCamp courses 68-70. Today I will continue with the next three.
As a reminder, I introduced you to my recent work learning Python here.
My course #71 was Improving Your Data Visualizations in Python. This course covers:
- Highlighting data
- Comparing groups
- Annotations
- Color in visualizations
- Continuous color palettes
- Categorical palettes
- Point estimate intervals
- Confidence bands
- Beyond 95% (visualizing multiple confidence bands at once)
- Visualizing the bootstrap
- Looking at the farmers market data
- Exploring the patterns
- Making your visualizations efficient
- Tweaking your plots
My course #72 was Command Line Automation in Python. Because I don’t use the shell much, I don’t see a whole lot of application here for me and I’m not sure how much I absorbed. In any case, topics covered in this course include:
- Learn the Python interpreter
- Capture IPython shell output
- Automate with SList
- Execute shell commands in subprocess (import subprocess; import os)
- Capture output of shell commands (from subprocess import Popen, PIPE)
- Sending input to processes
- Passing arguments safely to shell commands
- Dealing with file systems
- Find files matching a pattern (from pathlib import Path; import fnmatch, re)
- High-level file and directory operations (from shutil import copytree, ignore_patterns, rmtree, make_archive)
- Using pathlib (from pathlib import Path)
- Using functions for automation (from functools import wraps)
- Understand script input
- Introduction to click (import click)
- Using click to write command line tools (from click.testing import CliRunner)
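The subprocess material is easy to demonstrate. A small sketch using `subprocess.run()` rather than the course's lower-level `Popen`/`PIPE` plumbing (invoking the current Python interpreter keeps it cross-platform):

```python
import subprocess
import sys

# Run a command and capture its output as text.
result = subprocess.run(
    [sys.executable, "-c", "print('hello from a subprocess')"],
    capture_output=True, text=True, check=True,
)
print(result.stdout.strip())  # hello from a subprocess
```

Passing the command as a list of arguments, rather than one shell string, is also the safe-argument-passing habit the course teaches.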
My course #73 was Unit Testing for Data Science in Python. This course covers:
- Why unit test?
- Write a simple unit test using pytest
- Understanding test result report
- More benefits and test types
- Mastering assert statements
- Testing for exceptions instead of return values
- The well-tested function
- Test driven development (TDD)
- How to organize a growing set of tests?
- Mastering test execution
- Expected failures and conditional skipping
- Continuous integration and code coverage
- Beyond assertion: setup and teardown
- Mocking (from unittest.mock import call)
- Testing models
- Testing plots
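Mocking was the most useful chapter for me. A minimal sketch with the stdlib's `unittest.mock` (the loader function and filenames are hypothetical stand-ins for a real data dependency):

```python
from unittest.mock import MagicMock, call

# Stand-in for a data-loading dependency we don't want to hit in a test.
loader = MagicMock(return_value=[1, 2, 3])

def total_rows(load):
    """Count rows across the training and test files."""
    return len(load("train.csv")) + len(load("test.csv"))

assert total_rows(loader) == 6
# Verify the mock saw exactly the calls we expected, in order.
assert loader.call_args_list == [call("train.csv"), call("test.csv")]
```

The `call` helper is what lets a test assert not just that the dependency was used, but how.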
I will review more courses next time.
Categories: Python | Comments (0) | Permalink

Review of Python Courses (Part 23)
Posted by Mark on February 1, 2021 at 07:34 | Last modified: February 10, 2021 10:35

In Part 22, I summarized my DataCamp courses 65-67. Today I will continue with the next three.
As a reminder, I introduced you to my recent work learning Python here.
My course #68 was Linear Classifiers in Python. This course covers:
- Introduction (import sklearn.datasets)
- Applying logistic regression and SVM (general process, from sklearn.svm import LinearSVC)
- Linear decision boundaries
- Linear classifiers: prediction equations
- What is a loss function (from scipy.optimize import minimize)?
- Loss function diagrams
- Logistic regression and regularization
- Logistic regression and probabilities
- Multi-class logistic regression
- Support vectors
- Kernel SVMs
- Comparing logistic regression and SVM (from sklearn.linear_model import SGDClassifier)
My course #69 was Analyzing Social Media Data in Python. While I found this somewhat interesting, it seemed to incorporate as much JSON as it did Python. I have a hard enough time studying one new language—adding a second on top of that made things even more confusing for me:
- Analyzing Twitter data
- Collecting data through the Twitter API (from tweepy import Stream, OAuthHandler, API)
- Understanding Twitter JSON
- Processing Twitter text
- Counting words
- Time series
- Sentiment analysis
- Twitter networks
- Importing and visualizing Twitter networks (import networkx as nx)
- Node-level metrics
- Maps and Twitter data
- Geographical data in Twitter JSON
- Creating Twitter maps (from mpl_toolkits.basemap import Basemap)
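The word-counting chapter is straightforward to sketch with the stdlib (the payload below is a toy stand-in; real Twitter JSON carries many more fields than `text`):

```python
import json
from collections import Counter

# A tiny tweet-like payload, purely for illustration.
payload = '[{"text": "python is fun"}, {"text": "learning python daily"}]'
tweets = json.loads(payload)

counts = Counter(word for tw in tweets for word in tw["text"].split())
print(counts.most_common(1))  # [('python', 2)]
```

Once the JSON is parsed into plain dictionaries, the "second language" problem mostly goes away and it is ordinary Python from there.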
My course #70 was Fraud Detection in Python. This course covers:
- Introduction to fraud detection
- Increasing successful detections using data resampling (from imblearn.over_sampling import RandomOverSampler)
- Fraud detection algorithms in action (from imblearn.pipeline import Pipeline)
- Review of classification methods
- Performance evaluation (from sklearn.metrics import precision_recall_curve, average_precision_score)
- More performance evaluation (from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score)
- Adjusting your algorithm weights
- Performance evaluation (from sklearn.model_selection import GridSearchCV)
- Ensemble methods (from sklearn.ensemble import VotingClassifier)
- Normal versus abnormal behavior
- Clustering methods (from sklearn.preprocessing import MinMaxScaler; from sklearn.cluster import MiniBatchKMeans)
- Assigning fraud versus non-fraud
- Other clustering fraud detection methods (from sklearn.cluster import DBSCAN)
- Using text data (from nltk import word_tokenize; import string)
- Text mining to detect fraud (from nltk.corpus import stopwords; from nltk.stem.wordnet import WordNetLemmatizer)
- Topic modeling on fraud (from gensim import corpora)
- Flagged fraud based on topics (import pyLDAvis.gensim for use with Jupyter Notebooks only)
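Random oversampling is simple enough to sketch by hand. A pure-Python illustration of what `RandomOverSampler` automates (toy data, not the course's transaction set):

```python
import random

def oversample_minority(rows, labels, minority=1, seed=42):
    """Duplicate minority-class rows until the two classes are balanced."""
    rng = random.Random(seed)
    majority_n = max(labels.count(0), labels.count(1))
    minority_rows = [r for r, y in zip(rows, labels) if y == minority]
    extra = [rng.choice(minority_rows)
             for _ in range(majority_n - len(minority_rows))]
    return rows + extra, labels + [minority] * len(extra)

rows = [[1.0], [2.0], [3.0], [9.9]]   # one fraud case among normals
labels = [0, 0, 0, 1]
X, y = oversample_minority(rows, labels)
print(y.count(0), y.count(1))  # 3 3
```

Balancing the classes this way is what lets the classifier see enough fraud examples to learn from, which is the motivation behind the imblearn chapter.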
I will review more courses next time.
Categories: Python | Comments (0) | Permalink

Review of Python Courses (Part 22)
Posted by Mark on January 29, 2021 at 07:31 | Last modified: February 9, 2021 13:29

In Part 21, I summarized my DataCamp courses 62-64. Today I will continue with the next three.
As a reminder, I introduced you to my recent work learning Python here.
My course #65 was Reshaping Data with pandas. This course covers:
- Wide and long formats
- Reshaping using pivot method
- Pivot tables
- Reshaping with melt
- Wide to long function
- Working with string columns
- Stacking dataframes
- Unstacking dataframes
- Working with multiple levels
- Handling missing data
- Reshaping and combining data
- Transforming a list-like column
- Reading nested data into a dataframe (from pandas import json_normalize)
- Dealing with nested data columns
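Melt and pivot are easiest to see side by side. A minimal round-trip sketch (my own toy frame, not the course's data):

```python
import pandas as pd

wide = pd.DataFrame({
    "ticker": ["AAPL", "MSFT"],
    "q1": [100, 200],
    "q2": [110, 190],
})

# Wide -> long with melt ...
long = wide.melt(id_vars="ticker", var_name="quarter", value_name="price")
# ... and back to wide with pivot.
back = long.pivot(index="ticker", columns="quarter", values="price")
print(long.shape, back.shape)  # (4, 3) (2, 2)
```

Each wide row fans out into one long row per quarter, and `pivot` reverses the trip, which is the core mental model of the course.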
My course #66 was Building Data Engineering Pipelines in Python. For some reason, these data engineering courses did not sit well with me and much of this sailed over my head. This course covers:
- Components of a data platform
- Introduction to data ingestion with Singer
- Running an ingestion pipeline with Singer
- Basic introduction to PySpark (from pyspark.sql import SparkSession)
- Cleaning data
- Transforming data with Spark
- Packaging your application
- On the importance of tests
- Writing unit tests for PySpark
- Continuous testing
- Modern day workflow management
- Building a data pipeline with Airflow (from airflow.operators.bash_operator import BashOperator)
- Deploying Airflow (from airflow.models import DagBag)
My course #67 was Importing and Managing Financial Data in Python. This course covers:
- Reading, inspecting, and cleaning data from CSV (parse_dates explained)
- Read data from Excel worksheets
- Combine data from multiple worksheets (importing market data from multiple Excel files)
- The DataReader: access financial data online (from pandas_datareader.data import DataReader)
- Economic data from the Federal Reserve
- Select stocks and get data from Google Finance
- Get several stocks and manage a MultiIndex
- Summarize your data with descriptive stats
- Describe the distribution of your data with quantiles (passing np.arange() to .describe() for constant-step percentiles)
- Visualize the distribution of your data [ax = sns.distplot(df)]
- Summarize categorical variables
- Aggregate your data by category
- Summary statistics by category with seaborn [sns.countplot()]
- Distributions by category with seaborn [sns.boxplot(), sns.swarmplot()]
I will review more courses next time.
Categories: Python | Comments (0) | Permalink

Review of Python Courses (Part 21)
Posted by Mark on January 26, 2021 at 07:10 | Last modified: February 8, 2021 14:22

In Part 20, I summarized my DataCamp courses 59-61. Today I will continue with the next three.
As a reminder, I introduced you to my recent work learning Python here.
My course #62 was Time Series Analysis in Python. This clearly has potential applications for investment returns, but in the end I wasn’t totally sure what those might be. The course covers:
- Introduction to the course
- Correlation of two time series
- Simple linear regressions (in statsmodels, numpy, pandas, scipy)
- Autocorrelation (convert index to datetime)
- Autocorrelation function (from statsmodels.graphics.tsaplots import plot_acf; from statsmodels.tsa.stattools import acf)
- White noise
- Random walk (from statsmodels.tsa.stattools import adfuller)
- Stationarity
- Introducing an AR model (from statsmodels.tsa.arima_process import ArmaProcess)
- Estimating and forecasting an AR model
- Choosing the right model (from statsmodels.graphics.tsaplots import plot_pacf)
- Estimation and forecasting an MA model
- ARMA models
- Cointegration models
- Case study: climate change
My course #63 was Intermediate Predictive Analytics in Python. This course covers:
- The basetable timeline
- The population
- The target
- Adding predictive variables
- Adding aggregated variables
- Adding evolutions
- Using evolution variables
- Creating dummies (avoiding multicollinearity)
- Missing values (list comprehension)
- Handling outliers (from scipy.stats.mstats import winsorize)
- Transformations
- Seasonality
- Using multiple snapshots
- The timegap
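Winsorizing is worth a quick sketch. A simplified pure-Python version of what `scipy.stats.mstats.winsorize` does (the percentile indexing here is cruder than scipy's; toy data for illustration):

```python
def winsorize(values, lower_pct=0.05, upper_pct=0.05):
    """Clip extreme values to the given lower/upper percentile cutoffs."""
    s = sorted(values)
    n = len(s)
    lo = s[int(n * lower_pct)]
    hi = s[max(0, int(n * (1 - upper_pct)) - 1)]
    return [min(max(v, lo), hi) for v in values]

data = [1, 2, 3, 4, 100]
print(winsorize(data, 0.2, 0.2))  # [2, 2, 3, 4, 4]
```

Unlike simply dropping outliers, winsorizing keeps every row in the basetable; only the extreme values are pulled in toward the cutoffs.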
My course #64 was Building and Distributing Packages with Conda. This is another shell-related course I found hard to absorb since I do very little in the shell. I’m not the only newbie who feels this way, either. This was a recent post to the group:
> I have been doing Python courses for a while but now I actually wanna try some real
> live data on my laptop and I am not sure on how to install all of the needed stuff
> (pandas, numpy, etc.). I have downloaded the latest Python version and the PyCharm
> editor but… [the courses] do not really have anything to show you how to actually
> make the rest of the things work for inexperienced people such as myself.
I downloaded Spyder IDE, which has met most of my needs. It crashes sometimes and gives repetitive errors upon start-up, though, which are both quite annoying. I’ve also had mixed results downloading some libraries like Backtester.
Speaking of Anaconda (whose package manager is conda), my 64th course covers:
- Anaconda Project
- Anaconda Project specification file
- Anaconda Project commands
- Python module and packages
- Python package directory
- Conda packages
- Conda package dependencies
I will review more courses next time.
Categories: Python | Comments (0) | Permalink

Review of Python Courses (Part 20)
Posted by Mark on January 21, 2021 at 07:00 | Last modified: February 8, 2021 10:04

In Part 19, I summarized my DataCamp courses 56-58. Today I will continue with the next three.
As a reminder, I introduced you to my recent work learning Python here.
My course #59 was Dealing with Missing Data in Python. This course covers:
- Why deal with missing data (built-in Python NoneType vs. np.nan)?
- Handling missing values
- Analyze the amount of missingness (import missingno as msno)
- Is the data missing at random?
- Finding patterns in missing data
- Visualizing missingness across a variable
- When and how to delete missing data
- Mean, median, and mode imputations (from sklearn.impute import SimpleImputer)
- Imputing time-series data
- Visualizing time-series imputations
- Imputing using fancyimpute (from fancyimpute import KNN; from fancyimpute import IterativeImputer)
- Imputing categorical values
- Evaluation of different imputation techniques
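Mean imputation is the simplest technique in the course. A stdlib sketch of what `SimpleImputer(strategy='mean')` does to each column:

```python
import math

def impute_mean(values):
    """Replace NaNs with the mean of the observed values."""
    observed = [v for v in values if not math.isnan(v)]
    mean = sum(observed) / len(observed)
    return [mean if math.isnan(v) else v for v in values]

print(impute_mean([1.0, float("nan"), 3.0]))  # [1.0, 2.0, 3.0]
```

The course's larger point is that this only makes sense when data is missing at random; the fancier KNN and iterative imputers exist for when it is not.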
My course #60 was Intermediate Python for Finance. This course covers:
- Representing time with datetimes
- Working with datetimes
- Dictionaries
- Comparison operators
- Boolean operators
- If statements (with dictionary)
- For and while loops
- Creating a dataframe
- Accessing data
- Aggregating and summarizing
- Extending and manipulating data
- Peeking at data with head, tail, and describe
- Filtering data
- Plotting data
My course #61 was Object-Oriented Programming in Python. These OOP-related courses were really confusing to me the first time through. This course covers:
- What is OOP?
- Class anatomy: attributes and methods
- Class anatomy: the __init__ constructor
- Instance and class data
- Class inheritance
- Customizing functionality via inheritance
- Operator overloading: comparison
- Operator overloading: string representation
- Exceptions (try – except – finally)
- Designing for inheritance and polymorphism (Liskov substitution principle)
- Managing data access: private attributes
- Properties
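Inheritance and operator overloading come together nicely in one small class. A sketch of my own (not the course's example):

```python
class Account:
    def __init__(self, balance=0.0):
        self.balance = balance

    # Operator overloading: make == compare balances, not object identities.
    def __eq__(self, other):
        return type(self) is type(other) and self.balance == other.balance

    # String representation used by print() in the absence of __str__.
    def __repr__(self):
        return f"{type(self).__name__}(balance={self.balance})"

class SavingsAccount(Account):
    def __init__(self, balance=0.0, rate=0.02):
        super().__init__(balance)  # reuse the parent constructor
        self.rate = rate

a = SavingsAccount(100.0)
print(a)                            # SavingsAccount(balance=100.0)
print(a == SavingsAccount(100.0))   # True
print(a == Account(100.0))          # False: different types
```

The `type(self) is type(other)` check is one way to keep equality consistent across a class hierarchy, a point the Liskov-substitution chapter circles around.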
I will review more courses next time.
Categories: Python | Comments (0) | Permalink

Review of Python Courses (Part 19)
Posted by Mark on January 19, 2021 at 07:12 | Last modified: February 6, 2021 04:54

In Part 18, I summarized my DataCamp courses 53-55. Today I will continue with the next three.
As a reminder, I introduced you to my recent work learning Python here.
My course #56 was Writing Efficient Code with pandas. This course covers:
- The need for efficient coding (time.time(), list comprehensions faster than for loop)
- Locate rows: .iloc[] (generally faster for rows) and .loc[] (generally faster for columns)
- Select random rows (pandas .sample() method faster than NumPy's random integer generator)
- Replace scalar values using .replace() (much faster than using .loc[] to find values and reassigning them)
- Replace values using lists (.replace() faster than using .loc[] )
- Replace values using dictionaries (faster than using lists)
- Looping through the .iterrows() method [for loop using range() is faster than the smarter/cleaner/optimized .iterrows()]
- Looping through the .apply() function (faster iterating along rows while native pandas .sum() faster along columns)
- Vectorization over pandas series [vectorization method .apply() works faster than .iterrows()]
- Vectorization with NumPy arrays via the .values attribute (summing arrays is faster than summing series)
- Data transformation using .groupby().transform (.transform() cleaner and much faster than native Python code)
- Missing value imputation using .transform() (.transform() much faster than native Python code)
- Data filtration using the .filter() function (.groupby().filter() faster than list comprehension + for loop)
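The central claim of this course, that vectorized NumPy operations beat row-wise iteration while returning identical results, is easy to verify (toy frame, not the course's data):

```python
import pandas as pd

df = pd.DataFrame({"a": range(1000), "b": range(1000)})

# Row-by-row with .iterrows(): flexible but slow.
slow = [row["a"] + row["b"] for _, row in df.iterrows()]

# Vectorized over the underlying NumPy arrays: same result, far faster.
fast = df["a"].values + df["b"].values

print(fast[:3].tolist())  # [0, 2, 4]
```

Wrapping each version in `time.time()` calls, as the course does, makes the speed gap obvious even on a thousand rows.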
My course #57 was Credit Risk Modeling in Python. This course covers:
- Understanding credit risk
- Outliers in credit data
- Risk with missing data in loan data (finding, counting, and replacing missing data)
- Logistic regression for probability of default
- Predicting the probability of default
- Credit model performance
- Model discrimination and impact
- Gradient boosted trees with XGBoost
- Column selection for credit risk
- Cross validation for credit models
- Class imbalance in loan data
- Model evaluation and implementation (from sklearn.calibration import calibration_curve)
- Credit acceptance rates
- Credit strategy and maximum expected loss
My course #58 was Analyzing IoT Data in Python. This course covers:
- Introduction to IoT data
- Understand the data
- Introduction to data streams (import paho.mqtt.subscribe as subscribe)
- Perform EDA
- Clean data
- Gather minimalistic incremental data
- Prepare and visualize incremental data
- Combining data sources for further analysis
- Correlation
- Outliers (from statsmodels.graphics import tsaplots)
- Seasonality and trends
- Prepare data for machine learning
- Scaling data for machine learning
- Develop machine learning pipeline (from sklearn.pipeline import Pipeline)
- Apply a machine learning model
I will review more classes next time.
Categories: Python | Comments (0) | Permalink

Review of Python Courses (Part 18)
Posted by Mark on January 15, 2021 at 07:15 | Last modified: February 5, 2021 10:08

In Part 17, I summarized my DataCamp courses 50-52. Today I will continue with the next three.
As a reminder, I introduced you to my recent work learning Python here.
My course #53 was Introduction to Python for Finance. This course covers:
- Why Python for finance?
- Comments and variables
- Variable data types
- Lists in Python
- Lists in lists
- Methods and functions
- Arrays (probably best for financial analysis)
- Two dimensional arrays
- Using arrays for analyses (indexing arrays—might work in place of .loc or .iloc?)
- Visualization in Python
- Histograms (normed arg)
- Introducing the dataset
- Closer look at the sectors
- Visualizing trends
My course #54 was Experimental Design in Python. This course covers:
- Intro to experimental design (import plotnine as p9)
- Our first hypothesis test—Student’s t-test (from scipy import stats)
- Testing proportion and correlation [stats.chisquare(), stats.fisher_exact(), stats.pearsonr()]
- Confounding variables
- Blocking and randomization (random sampling)
- ANOVA [import statsmodels as sm, stats.f_oneway()]
- Interactive effects (two- and three-way ANOVAs)
- Type I error (Bonferroni and Šidák correction for multiple comparisons)
- Sample size (from statsmodels.stats import power as pwr)
- Power
- Assumptions and normal distributions (Q-Q plot)
- Testing for normality [from scipy import stats, stats.shapiro()]
- Non-parametric tests: Wilcoxon rank-sum and signed-rank (paired) test
- More non-parametric tests: Spearman correlation
My course #55 was Introduction to Data Engineering. For some reason, these data engineering courses are not my cup of tea. This course covers:
- What is data engineering?
- Tools of the data engineer (data engineers are expert users of database systems)
- Cloud providers
- Databases
- Parallel computing (from multiprocessing import Pool) and computation frameworks
- Workflow scheduling frameworks
- Extract
- Transform
- Loading
- Putting it all together
- Case study: course ratings
- From ratings to recommendations
- Scheduling daily jobs
I will review more classes next time.
Categories: Python | Comments (0) | Permalink

Review of Python Courses (Part 17)
Posted by Mark on January 12, 2021 at 07:13 | Last modified: February 4, 2021 13:11

In Part 16, I summarized my DataCamp courses 47-49. Today I will continue with the next three.
As a reminder, I introduced you to my recent work learning Python here.
My course #50 was Introduction to Shell. This course covers:
- How does the shell compare to a desktop interface?
- Where am I and how can I identify files and directories?
- How can I move to another directory (~ is home)?
- How to copy, rename, move, and delete files
- How to create and delete directories
- How to view file contents
- Modifying commands with flags
- Getting help for a command
- Selecting columns from a file
- Repeating commands
- Selecting lines with certain values
- Storing command output to a file or using as input
- Combining commands with pipe symbol
- Counting records in a file
- Specifying multiple files at once
- Wildcards
- Sorting lines of text and removing duplicate lines
- How to stop a running program
- Printing a variable’s value
- How does the shell store information?
- Repeating commands many times or once for each file
- Recording names of a set of files
- Variable’s name versus its value
- Running many commands in a single loop
- Using semicolons to do multiple things in a single loop
- Editing a file
- Saving commands to rerun later
- Reusing pipes
- Passing filenames to scripts
- Processing a single argument
- Writing loops in a shell script
My course #51 was Generalized Linear Models (GLM) in Python. This material is thick and really demands a third look (for me). This course covers:
- Going beyond linear regression (import statsmodels.api as sm; from statsmodels.formula.api import glm)
- How to build a GLM?
- How to fit a GLM in Python?
- Binary data and logistic regression (odds, odds ratio, and probability)
- Interpreting coefficients
- Interpreting model inference
- Computing and describing predictions
- Count data and Poisson distribution
- Interpreting model fit
- The problem of overdispersion
- Multivariable logistic regression (from statsmodels.stats.outliers_influence import variance_inflation_factor)
- Comparing models
- Model formula (from patsy import dmatrix)
- Categorical and interaction terms
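The odds/odds-ratio/probability conversions are the key to interpreting logistic-regression coefficients. A quick worked sketch (the coefficient value below is hypothetical, not from a fitted model):

```python
import math

# A fitted logistic-regression coefficient lives on the log-odds scale;
# exponentiating it yields an odds ratio.
coef = 0.693  # roughly ln(2)
odds_ratio = math.exp(coef)
print(round(odds_ratio, 2))  # 2.0: a one-unit increase doubles the odds

# Converting odds to probability: p = odds / (1 + odds)
odds = 3.0
print(odds / (1 + odds))  # 0.75
```

Keeping these three scales straight (log-odds, odds, probability) is most of what "interpreting coefficients" amounts to in this course.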
My course #52 was Pandas Joins for Spreadsheet Users. This course covers:
- Joining data: a real-world necessity
- Concatenation
- Power and flexibility
- Types of joins
- A closer look at one-to-one joins
- Combining common data with inner joins
- “Out of many, one”
- Joining on key columns
- Index-based joins
- Joining data in real life
- Working with time data
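Inner joins are the workhorse here. A minimal `merge()` sketch with toy frames (not the course's data):

```python
import pandas as pd

prices = pd.DataFrame({"ticker": ["AAPL", "MSFT", "TSLA"],
                       "price": [150, 250, 200]})
sectors = pd.DataFrame({"ticker": ["AAPL", "MSFT"],
                        "sector": ["Tech", "Tech"]})

# An inner join keeps only tickers present in both frames,
# like a VLOOKUP that silently drops non-matches.
joined = prices.merge(sectors, on="ticker", how="inner")
print(joined.shape)  # (2, 3)
```

Switching `how` to `"left"`, `"right"`, or `"outer"` covers the other join types the course walks through.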
I will review more classes next time.
Categories: Python | Comments (0) | Permalink