Downsizing the Data Set – Resampling and Binning of Time Series and other Data Sets

Data sets are often too small: we do not have all the data we need to interpret, explain, visualize, or train a meaningful model. Quite often, however, our data sets are too large. Or, more specifically, they have a higher resolution than is necessary or even desirable. We may have a time series with values for every other second, although meaningful changes do not happen more often than once every 30 seconds (more...)
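
As a rough illustration of the idea, here is a minimal sketch that downsamples a series with one value every other second into 30-second averages. It assumes Pandas is used for the resampling; the readings and the helper name downsample_mean are made up for this example and do not come from the article.

```python
import pandas as pd


def downsample_mean(series: pd.Series, seconds: int) -> pd.Series:
    """Average all readings that fall into each window of `seconds` seconds."""
    return series.resample(pd.Timedelta(seconds=seconds)).mean()


# Hypothetical data: one reading every other second, for two minutes.
index = pd.date_range(
    "2020-01-01 00:00:00", periods=60, freq=pd.Timedelta(seconds=2)
)
readings = pd.Series(range(60), index=index, dtype="float64")

# 60 readings collapse into 4 windows of 30 seconds each.
downsampled = downsample_mean(readings, 30)
print(downsampled)
```

Taking the mean is only one choice; depending on the data, the median, maximum or last value per window may be a better summary.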

Prepare Jupyter Notebook Workshop Environment through Docker container image and Bootstrap Notebook

Earlier this week, I presented a workshop on Data Analytics. I wanted to provide each of the participants with a fully prepared environment, right on everyone’s own laptop (and optionally in a cloud environment such as Katacoda). The environment consisted of Python 3.7, Jupyter Labs (for Notebooks), many additional Python libraries (Pandas, Plotly, Chart Studio, Matrix Profile, SAX, Fuzzy Search and many more) and a number of my own GitHub repositories containing the workshop (more...)

Django on Fedora 30

It seemed opportune to add Django to the Fedora 30 instance that I build and maintain for my students. Here are the instructions, which I developed with the prior Fedora 28/29 instructions.

  1. Check your Python3 installation with the following command:

    python3 -V
    

    It should return the following; if it doesn’t, you should install python3:

    Python 3.7.4
    

  2. Check whether pip3 is installed, and install it if it’s not:

    sudo dnf -y install python3-pip
    

    (more...)

Python MySQL Query

Somebody asked me how to expand a prior example that used static variables so that it takes its values as arguments at the command line. This example uses new Python 3 features in the datetime package.

There’s a small trick converting the string arguments to date data types. Here’s a quick example that shows you how to convert the argument list into individual date data type variables:

#!/usr/bin/python3

# include standard modules
import sys
 (more...)
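
The conversion step described above might look like the following minimal sketch. It assumes the arguments arrive in ISO format (YYYY-MM-DD) and uses date.fromisoformat, which is available since Python 3.7; the original post may well use a different call. The function name parse_dates and the argv list are stand-ins for this example.

```python
from datetime import date


def parse_dates(arguments):
    """Convert ISO-formatted strings (YYYY-MM-DD) into date objects."""
    return [date.fromisoformat(argument) for argument in arguments]


# Hypothetical stand-in for sys.argv; the first item is the script name.
argv = ["script.py", "2001-01-01", "2003-12-31"]

start_date, end_date = parse_dates(argv[1:])
print(start_date, end_date, type(start_date).__name__)
```

A malformed argument raises ValueError, so a real script would probably catch that and print a usage message.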

Python – Counter – Compare Lists

A few days back I wrote about using sorted(list) to compare two lists.

Recently I learned that we can also use Counter to compare lists without taking their order into account.
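
A minimal sketch of that comparison, with made-up lists and a hypothetical helper name (same_elements):

```python
from collections import Counter


def same_elements(first, second):
    """True when both lists hold the same items the same number of times."""
    return Counter(first) == Counter(second)


print(same_elements(["a", "b", "c", "a"], ["c", "a", "a", "b"]))  # True
print(same_elements(["a", "b"], ["a", "b", "b"]))  # False
```

Counter needs hashable items, so this works for strings, numbers and tuples, but not for lists of lists or lists of dicts.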

Happy Learning !!

Merge json files using Pandas

Quick demo for merging multiple json files using Pandas –

import pandas as pd
import glob
import json

file_list = glob.glob("*.json")
>>> file_list
['b.json', 'c.json', 'a.json']

Use enumerate to assign a counter to the files.


allFilesDict = {v:k for v, k in enumerate(file_list, 1)}
>>> allFilesDict
{1: 'b.json', 2: 'c.json', 3: 'a.json'}

Append the data to a list –

>>> data = []

for k,v in allFilesDict.items():
    if 1  (more...)

MySQL Python Connector

While building my student image on Fedora 30, I installed the MySQL PHP Connector (php-mysqlndrp) but neglected to install the Python Connector. This adds the installation and basic test of the Python Connector to the original blog post.

You use the following command with a wildcard as a privileged user. The wildcard is necessary because you need to load two libraries to support Python 2.7 and 3.7, which are installed on Fedora 30. (more...)

Determine the Language of a Document from the Letter Frequency – using Levenshtein Distance between sequences

Even though many languages share the same or a very similar alphabet, the use of letters in documents written in these languages is quite distinct. The letter “e” is quite popular, but it is not the most used letter in every language. In fact, the letter frequency is very specific to a language, and it can be used to determine the language of a document in a simple and pretty fast way.

The very simple steps (more...)
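
The approach named in the title can be sketched roughly as follows: rank the letters of a text by frequency, then pick the language whose reference ranking has the smallest Levenshtein distance to it. This is an illustration only, not the article’s code. The function names are hypothetical, and the two reference profiles are built from single toy sentences; a real profile would come from a large body of text per language.

```python
from collections import Counter


def letter_ranking(text):
    """Return the letters of `text`, most frequent first (ties alphabetical)."""
    counts = Counter(ch for ch in text.lower() if ch.isalpha())
    return "".join(sorted(counts, key=lambda ch: (-counts[ch], ch)))


def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two sequences."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (item_a != item_b),  # substitution
            ))
        previous = current
    return previous[-1]


def detect_language(text, profiles):
    """Pick the language whose letter ranking is closest to that of `text`."""
    ranking = letter_ranking(text)
    return min(profiles, key=lambda lang: levenshtein(ranking, profiles[lang]))


# Toy reference texts (assumption): far too short for real use.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog",
    "nl": "de snelle bruine vos springt over de luie hond",
}
PROFILES = {lang: letter_ranking(text) for lang, text in SAMPLES.items()}

print(detect_language("the lazy dog jumps over the quick brown fox", PROFILES))
```

With such tiny samples the result only shows the mechanics; longer reference texts give more stable rankings.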

Tour de France Data Analysis using Strava data in Jupyter Notebook with Python, Pandas and Plotly – Step 2: combining and aligning multi rider data for analyzing and visualizing the Race

In this article, I analyze the race that took place in stage 14 of the 2019 Tour de France in a Jupyter Notebook using Python, Pandas and Plotly and based on the Strava performance data published by Steven Kruijswijk, Thomas de Gendt, Thibaut Pinot and Marco Haller. In this previous article I have explained how we can retrieve the Strava data for a specific rider for a stage in the Tour de France, and in (more...)

Tour de France Data Analysis using Strava data in Jupyter Notebook with Python, Pandas and Plotly – Step 1: single rider loading, exploration, wrangling, visualization

In this article, I will show how to analyze the performance of Steven Kruijswijk during stage 14 of the 2019 Tour de France in a Jupyter Notebook using Python, Pandas and Plotly. Strava collects data from athletes regarding their activities – such as running, cycling, walking and hiking. Members can upload data – and tens of millions do so, including some well known cyclists such as Steven Kruijswijk. In my previous article I have explained (more...)

Pandas – ValueError: If using all scalar values, you must pass an index

Reading a JSON file with Pandas read_json can fail with “ValueError: If using all scalar values, you must pass an index”. Let’s see an example –

cat a.json
{
  "creator": "CaptainAmerica",
  "last_modifier": "NickFury",
  "title": "Captain America: The First Avenger",
  "view_count": 12000
}
>>> import pandas as pd
>>> import glob
>>> for f in glob.glob('*.json'):
...     print(f)
...
b.json
c.json
a.json
>>> pd.read_json('a.json')
Traceback (most recent call last):
  File  (more...)
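
The same error can be reproduced without a file: a dict whose values are all scalars gives Pandas no way to infer how many rows the frame should have. The sketch below shows two common ways around it; it is an illustration, not necessarily the fix used in the original post, and it inlines the contents of a.json so that it can run on its own.

```python
import json

import pandas as pd

# Same content as a.json above: every value is a scalar.
record = json.loads("""
{
  "creator": "CaptainAmerica",
  "last_modifier": "NickFury",
  "title": "Captain America: The First Avenger",
  "view_count": 12000
}
""")

try:
    pd.DataFrame(record)  # all scalars and no index: raises ValueError
except ValueError as error:
    print("ValueError:", error)

# Option 1: wrap the record in a list, so it becomes one row.
one_row = pd.DataFrame([record])

# Option 2: keep the dict and pass an explicit index.
also_one_row = pd.DataFrame(record, index=[0])

print(one_row)
```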

Python – sort() vs sorted(list)

You can compare lists using sort() or sorted(list), but be careful with sort() –

>>> c = [('d',4), ('c',3), ('a',1), ('b', 2)]
>>> a = [('a',1), ('b', 2), ('c',3), ('d',4)]
>>> a.sort() == c.sort()
True
>>>
>>> a = [('a',1), ('b', 2), ('c',3), ('d',4)]
>>> b = [('b',2), ('c', 3), ('a',1)]
>>>
>>> a.sort() == b.sort()
True

>>> a = [('a',1), ('b', 2), ('c',3), ('d',4)]
>>> b = [('b',2), ('c', 3),  (more...)
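
The reason for the surprising True above is that list.sort() sorts in place and returns None, so a.sort() == b.sort() compares None with None. A short sketch with the same lists:

```python
a = [('a', 1), ('b', 2), ('c', 3), ('d', 4)]
b = [('b', 2), ('c', 3), ('a', 1)]

# list.sort() returns None, so this is None == None, whatever the contents.
in_place_result = a.sort() == b.sort()

# sorted() returns new lists, so this compares the actual contents.
sorted_result = sorted(a) == sorted(b)

print(in_place_result, sorted_result)  # True False
```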

Report Time Execution Prediction with Keras and TensorFlow

The aim of this post is to explain Machine Learning to software developers in hands-on terms. The model is based on a common use case in enterprise systems: predicting the wait time until a business report is generated.

Report generation in business applications typically takes time, anywhere from a few seconds to minutes, because it usually has to fetch and process many records. Users often get frustrated, (more...)

Forecast Model Tuning with Additional Regressors in Prophet

I’m going to share the results of my experiment with additional regressors in Prophet. My goal was to check how an extra regressor would weigh on the forecast calculated by Prophet.

I am using a dataset from Kaggle: Bike Sharing in Washington D.C. Dataset. The data comes with the number of bike rentals per day and the weather conditions. I have created and compared three models:

1. Time series Prophet model with date and number of bike rentals
2. A model with additional (more...)

NumPy in a Nutshell

Hello and welcome back. I have started a new category in my blog about Python. The purpose of this post is to go through the NumPy library. I will be using Jupyter for the demo but will provide the py file if you prefer to run it in PyCharm, for example. NumPy is a core Python linear algebra library for data science, used for faster array processing than native Python lists, with a bunch of (more...)
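
To give a flavour of the difference with native lists, here is a minimal sketch with made-up values: the list needs an explicit loop, while the NumPy array applies the operation to every element at once.

```python
import numpy as np

values = [1.0, 2.0, 3.0, 4.0]

# Plain Python: loop over the list explicitly.
doubled_list = [value * 2 for value in values]

# NumPy: the multiplication is applied element-wise to the whole array.
array = np.array(values)
doubled_array = array * 2

print(doubled_array, array.mean())
```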

Serving Prophet Model with Flask — Predicting Future

This solution demonstrates how to serve a Prophet model API on the web with Flask. Prophet is an open-source Python library developed by Facebook to predict time series data.

An accurate forecast and future prediction are crucial for almost any business. This is obvious and doesn’t need explanation. There is a concept of time series data: this data is ordered by date, and typically each date is assigned one or more values specific to (more...)

Managing imbalanced Data Sets with SMOTE in Python

When working with data sets for machine learning, lots of these data sets and examples we see have approximately the same number of case records for each of the possible predicted values. In this kind of scenario we are trying to perform some kind of classification, where the machine learning model looks to build a model based on the input data set against a target variable. It is this target variable that contains the value (more...)

Python – str.maketrans()

While working on some Python code, I needed to remove the single/double quotes and open/close brackets from a string in the format below:

>>> text = """with summary as (select '
...  'p.col1,p.col2,p.col3, ROW_NUMBER() '
...  'OVER(PARTITION BY p.col1,p.col3 ORDER BY '
...  'p.col2) AS rk from (select * from (select '
...  'col2, col1, col3, '
...  'sum(col4) as col6 from '
...  '"demo"."tab1" a join '
...  "(select lpad(col5, 12, '0') as  (more...)
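
A minimal sketch of that kind of clean-up with str.maketrans: its third argument lists the characters to delete. It assumes that “brackets” means round parentheses, and it uses a short made-up string rather than the full query above; the function name is hypothetical.

```python
def strip_quotes_and_brackets(text):
    """Delete the characters ' " ( ) and leave everything else untouched."""
    # Empty first and second arguments: no replacements, only deletions.
    table = str.maketrans("", "", "'\"()")
    return text.translate(table)


# Hypothetical sample in the same style as the string above.
sample = "('select col1, ' \"(col2)\")"
print(strip_quotes_and_brackets(sample))
```

Note that this removes every parenthesis, including the ones that belong to SQL function calls, so the character list may need adjusting for a real query.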

Cat or Dog — Image Classification with Convolutional Neural Network

The goal of this post is to show how a convnet (CNN, Convolutional Neural Network) works. I will be using the classical cat/dog classification example described in François Chollet’s book, Deep Learning with Python. The source code for this example is available on François Chollet’s GitHub. I’m using this source code to run my experiment.

A convnet works by abstracting image features from the detail to higher-level elements. An analogy can be drawn with the way humans think. (more...)

Build it Yourself — Chatbot API with Keras/TensorFlow Model

It is not as complex to build your own chatbot (or assistant, the new trendy term for a chatbot) as you may think. Various chatbot platforms use classification models to recognize user intent. While you obviously get a strong head start when building a chatbot on top of an existing platform, it never hurts to study the background concepts and try to build it yourself. Why not use a similar model yourself? Chatbot implementation (more...)