Saturday, March 7, 2020

hello, Random Forest

"""
The task here is to predict whether a bank currency note is authentic or not based on attributes such as variance (of wavelet transformed image).

The code is tuned from https://stackabuse.com/random-forest-algorithm-with-python-and-scikit-learn/

Get dataset.csv from https://drive.google.com/file/d/13nw-uRXPY8XIZQxKRNZ3yYlho-CYm_Qt/view
 , column [0,4) is features(X: x0~x3),  column 4 is class value(label).

This demo must run under conda, setup your conda env and go into yours( for me, ` conda activate zxxu_conda ` ):
 (zxxu_conda) root@BadSectorsUbun...

"""

from os import name as os_name

from os.path import dirname
from os.path import join as join_path

OSN_HINT_UNIX    = 'posix'  # os.name value on Unix-like systems
OSN_HINT_WINDOWS = 'nt'     # os.name value on Windows
OSN              = os_name  # current OS name (not used further in this demo)

import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
# on import error, run ` pip install -U scikit-learn ` inside the conda env

from sklearn.preprocessing import StandardScaler

from sklearn.ensemble import RandomForestRegressor

from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

PrjDir = dirname(__file__)

dataset_path = join_path(PrjDir, "is_bank_currency_fake.csv")

dataset = pd.read_csv(dataset_path)
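#print(dataset.head())  # optional: eyeball the first rows to confirm the column layout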

# select all rows, and the feature columns [0, 4) (x0..x3)
X = dataset.iloc[:, 0:4].values

# column 4 holds the class label
y = dataset.iloc[:, 4].values

num_recs = X.shape[0]
print( "number of records:%d" % num_recs )



# fixing random_state makes the split reproducible; drop it if you want a different split on every run
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

num_1s_in_y_test = np.count_nonzero(y_test)
num_0s_in_y_test = y_test.size - num_1s_in_y_test
print( "number of 1s VS 0s in y_test:%d VS %d" % (num_1s_in_y_test, num_0s_in_y_test))

"""
if use NUM_HIJACK_OFF_1s_of_y_test, 1s in y_test will decrease while increasing 0s

NUM_HIJACK_OFF_1s_of_y_test = 2
assert num_1s_in_y_test > NUM_HIJACK_OFF_1s_of_y_test
num_hijack_off_1s_of_y_test = 0
for i in range(0,num_1s_in_y_test):
    if num_hijack_off_1s_of_y_test >= NUM_HIJACK_OFF_1s_of_y_test:
        break
    if y_test[i]:
        y_test[i] = 0
        num_hijack_off_1s_of_y_test += 1
"""

sc = StandardScaler()
X_train = sc.fit_transform(X_train)  # fit the scaler on the training data only
X_test = sc.transform(X_test)        # then apply the same scaling to the test data


#The number of trees in the forest
#Changed in sklearn 0.22: The default value of n_estimators changed from 10 to 100
regressor = RandomForestRegressor(n_estimators=20, random_state=0)
regressor.fit(X_train, y_train)

#now we have predictions for the 20 percent test split, but we don't yet know how good they are
y_pred = regressor.predict(X_test)
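# the regressor's predictions are continuous scores rather than hard 0/1
# labels; rounding them (see the doctest below) picks the nearest class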


"""
>>> import numpy as np
>>> np.round([0.49])
array([0.])
>>> np.round([0.51])
array([1.])
"""

"""
https://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/

true positives (TP): These are cases in which we predicted yes (they have the disease), and they do have the disease.
true negatives (TN): We predicted no, and they don't have the disease.

True Positive Rate = When it's actually yes, how often does it predict yes = also known as "Sensitivity" or "Recall"
True Negative Rate = When it's actually no, how often does it predict no = also known as "Specificity"

FP_rate = 1 - TN_rate


Accuracy: Overall, how often is the classifier correct? (TP+TN)/total
Misclassification Rate = also known as "Error Rate": 1 - Accuracy

Precision: When it predicts yes, how often is it correct? TP/predicted_yes

Prevalence: How often does the yes condition actually occur in our sample? actual_yes/total

"""
y_pred_rounded = y_pred.round()
tn, fp, fn, tp = confusion_matrix(y_test, y_pred_rounded).ravel()
assert (fp + tn) == num_0s_in_y_test
assert (fn + tp) == num_1s_in_y_test
print("true positives rate=%.2f, true negatives rate=%.2f"
    % (float(tp)/num_1s_in_y_test,float(tn)/num_0s_in_y_test))
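# tiny worked example: confusion_matrix([0, 1, 0, 1], [1, 1, 0, 1]).ravel()
# returns (tn=1, fp=1, fn=0, tp=2); rows of the matrix are the truth,
# columns are the prediction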

"""
def of support:
The support is the number of occurrences of each class in y_true.
y_true is the ground truth (correct) target values.

if you don't understand the support column, use NUM_HIJACK_OFF_1s_of_y_test


F Score: This is a weighted average of the true positive rate (recall) and precision
"""
print(classification_report(y_test,y_pred_rounded))
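
# Aside: since the labels are 0/1 classes, sklearn's RandomForestClassifier
# predicts hard labels directly, with no rounding step. A minimal sketch under
# that assumption (reusing X_train/X_test/y_train/y_test from above;
# n_estimators=20 just mirrors the regressor, it is not a tuned value):
from sklearn.ensemble import RandomForestClassifier

classifier = RandomForestClassifier(n_estimators=20, random_state=0)
classifier.fit(X_train, y_train)
y_pred_cls = classifier.predict(X_test)   # already hard 0/1 labels
print(accuracy_score(y_test, y_pred_cls))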



Thursday, February 13, 2020

R for Beginners

in the RStudio console, use ctrl+L to clear the console;
in the RStudio editor, use ctrl+shift+C to comment out (or uncomment) the selected lines.

?lm will show the help page of the function lm().
help.search("tree") will display a list of the functions which help pages mention “tree”. Note that if some packages have been recently installed, it may be useful to refresh the database used by help.search using the option rebuild (e.g., help.search("tree", rebuild = TRUE)).


When R is running, variables, data, and functions are stored in the active memory of the computer in the form of objects, each of which has a name. The name of an object must start with a letter and can include dots (.)

#The functions available to the user are stored in a library localised on the disk in a directory called R_HOME/library

R.home() will show R_HOME; tested on Ubuntu 19.10, it is "/usr/lib/R". This directory contains the packages of functions.

The package named base is in a way the core of R. Each package has a directory called R containing a file named like the package; for instance, for the package base this is the file R_HOME/library/base/R/base. This file contains the functions of the package.

#++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
# the following example creates & shows the details of a data frame:
# myFrame=data.frame(
#   emp_id = c (1:5),
#   emp_name = c("Rick","Dan","Michelle","Ryan","Gary")
# )
# ls.str(pat="myFrame")
#
# myFrame : 'data.frame':    5 obs. of  2 variables:
# $ emp_id  : int  1 2 3 4 5
# $ emp_name: Factor w/ 5 levels "Dan","Gary","Michelle",..: 4 1 3 5 2

# if there are too many lines, use ls.str(pat="myFrame", max.level = -1) to hide details.

#To delete objects in memory, we use the function rm: rm(x) deletes the
#object x, rm(x,y) deletes both the objects x and y, rm(list=ls()) deletes all
#the objects in memory

#++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
# > A <- "Gomphotherium"; compar <- TRUE; z <- -Inf
# > mode(A); mode(compar); mode(z); length(A)
# [1] "character"
# [1] "logical"
# [1] "numeric"
# [1] 1
#++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
read a table from https://s3.amazonaws.com/assets.datacamp.com/blog_assets/test.txt :
1   6   a
2   7   b
3   8   c
4   9   d
5   10  e

url <- "https://s3.amazonaws.com/assets.datacamp.com/blog_assets/test.txt"
read.table(
 url,
 header = FALSE,
 quote = "\"’",
 colClasses = c("numeric","numeric","character"),
 nrows = 2, #only read some rows
 skip = 0, #start from the first row
 check.names = TRUE, #checks that the variable|column names are valid
 blank.lines.skip = TRUE,
 comment.char = "" # no comment in this file
)

  V1 V2 V3
1  1  6  a
2  2  7  b

#++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

use scan to read table:

scan(url, n = 2, blank.lines.skip = TRUE, comment.char="#")
Read 2 items
[1] 1 6

#sep = "" , not " "
scan(url, sep = "", what = list(0,0,""), nmax=2)
Read 2 records
[[1]]
[1] 1 2

[[2]]
[1] 6 7

[[3]]
[1] "a" "b"

# the columns are read into 3 separate vectors (one list element per column)
#++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
sequence(c(3, 2, 4))
[1] 1 2 3 1 2 1 2 3 4

sequence(1:3)
[1] 1 1 2 1 2 3

seq(1, 5, 0.5)
[1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0

The function gl (generate levels) is very useful because it generates regular
series of factors.

> gl(3, 5, length=30)
[1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3
Levels: 1 2 3

> gl(2, 6, label=c("Male", "Female"))
[1] Male Male Male Male Male Male
[7] Female Female Female Female Female Female
Levels: Male Female

> expand.grid(h=c(60,80), w=c(100, 300), sex=c("Male", "Female"))
   h   w    sex
1 60 100   Male
2 80 100   Male
3 60 300   Male
4 80 300   Male
5 60 100 Female
6 80 100 Female
7 60 300 Female
8 80 300 Female
#++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> matrix(1:6, 2, 3)
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
> matrix(1:6, 2, 3, byrow=TRUE)
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6

#++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++


> fac <- factor(c(1, 10))
> fac
[1] 1 10
Levels: 1 10
> as.numeric(fac)   # returns the level codes, not the original values!
[1] 1 2
> as.numeric(as.character(fac))   # the safe way to recover the original values
[1]  1 10
#++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> x <- matrix(1:6, 2, 3)
> x
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6


> x[, 3]
[1] 5 6

> x[, 3, drop = FALSE]
     [,1]
[1,]    5
[2,]    6
#++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

> x <- 1:10
> x[x >= 5] <- 20
> x
 [1]  1  2  3  4 20 20 20 20 20 20
> x[x == 1] <- 25
> x
[1] 25  2  3  4 20 20 20 20 20 20












Friday, January 31, 2020

important gcc preprocess, compile, and assemble options

-E Stop after the preprocessing stage; the output is preprocessed source code, which is sent to standard output.
-S Stop after the stage of compilation proper; do not assemble. The output is an assembler code file.
-c Compile or assemble the source files, but do not link; the output is an object file.
-o file Place output in the named file. If this option is not specified, the defaults are: the executable file in ‘a.out’, the object file in ‘source.o’, the assembler file in ‘source.s’, a precompiled header file in ‘source.suffix.gch’, and all preprocessed C source on standard output (see the walk-through below).
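
A minimal walk through the stages, assuming a hypothetical hello.c in the current directory:

 gcc -E hello.c -o hello.i   # preprocess only; hello.i is the preprocessed source
 gcc -S hello.i              # compile to assembler code in hello.s
 gcc -c hello.s              # assemble into the object file hello.o
 gcc hello.o -o hello        # link the object file into the executable 'hello'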