I. Implementation in R
R package: ‘stringdist’
Description: Calculate various string distances based on edits (dameraulevenshtein, hamming, levenshtein, optimal sting alignment),qgrams (q gram, cosine, jaccard distance) or heuristic metrics (jaro,jarowinkler).
1. amatch Examples
# which scifi heroes are stringdistantly nearest amatch("leia",c("uhura","leela"),maxDist=5) # we can restrict the search amatch("leia",c("uhura","leela"),maxDist=1) #setting nomatch returns a different value when no match is found amatch("leia",c("uhura","leela"),maxDist=1,nomatch=0) #this is always true if maxDist is Inf ain("leia",c("uhura","leela"),maxDist=Inf) # Let’s look in a neighbourhood of maximum 2 typo’s (by default, the OSA algorithm is used) ain("leia",c("uhura","leela"), maxDist=2)
2. qgrams Examples
qgrams(’hello world’,q=3) # qgrams are counted uniquely over a character vector qgrams(rep(’hello world’,2),q=3)
# to count them separately, do something like x < c(’hello’, ’world’) lapply(x,qgrams, q=3) # output rows may be named, and you can pass any number of character vectors x < "I will not buy this record, it is scratched" y < "My hovercraft is full of eels" z < c("this", "is", "a", "dead","parrot") qgrams(A = x, B = y, C = z,q=2) # a tonque twister, showing the effects of useBytes and useNames x < "peter piper picked a peck of pickled peppers" qgrams(x, q=2) qgrams(x, q=2, useNames=FALSE) qgrams(x, q=2, useBytes=TRUE) qgrams(x, q=2, useBytes=TRUE, useNames=TRUE)
3. stringdist Examples
# Simple example using optimal string alignment stringdist("ca","abc") # The same example using DamerauLevenshtein distance (multiple editing of substrings allowed) stringdist("ca","abc",method="dl") # string distance matching is case sensitive: stringdist("ABC","abc")
# so you may want to normalize a bit: stringdist(tolower("ABC"),"abc") # stringdist recycles the shortest argument: stringdist(c(’a’,’b’,’c’),c(’a’,’c’)) # stringdistmatrix gives the distance matrix (by default for optimal string alignment): stringdist(c(’a’,’b’,’c’),c(’a’,’c’)) # different edit operations may be weighted; e.g. weighted substitution: stringdist(’ab’,’ba’,weight=c(1,1,1,0.5)) # Nonunit weights for insertion and deletion makes the distance metric asymetric stringdist(’ca’,’abc’) stringdist(’abc’,’ca’) stringdist(’ca’,’abc’,weight=c(0.5,1,1,1)) stringdist(’abc’,’ca’,weight=c(0.5,1,1,1)) # Hamming distance is undefined for # strings of unequal lengths so stringdist returns Inf stringdist("ab","abc",method="h") # For strings of eqal length it counts the number of unequal characters as they occur # in the strings from beginning to end stringdist("hello","HeLl0",method="h") # The lcm (longest common substring) distance returns the number of # characters that are not part of the lcs. # # Here, the lcs is either ’a’ or ’b’ and one character cannot be paired: stringdist(’ab’,’ba’,method="lcs") # Here the lcs is ’surey’ and ’v’, ’g’ and one ’r’ of ’surgery’ are not paired stringdist(’survey’,’surgery’,method="lcs") # qgrams are based on the difference between occurrences of q consecutive characters # in string a and string b. # Since each character abc occurs in ’abc’ and ’cba’, the q=1 distance equals 0: stringdist(’abc’,’cba’,method=’qgram’,q=1) # since the first string consists of ’ab’,’bc’ and the second # of ’cb’ and ’ba’, the q=2 distance equals 4 (they have no q=2 grams in common): stringdist(’abc’,’cba’,method=’qgram’,q=2) # Wikipedia has the following example of the Jarodistance. stringdist(’MARTHA’,’MATHRA’,method=’jw’) # Note that stringdist gives a _distance_ where wikipedia gives the corresponding # _similarity measure_. To get the wikipedia result: 1  stringdist(’MARTHA’,’MATHRA’,method=’jw’) # The corresponding JaroWinkler distance can be computed by setting p=0.1 stringdist(’MARTHA’,’MATHRA’,method=’jw’,p=0.1) # or, as a similarity measure 1  stringdist(’MARTHA’,’MATHRA’,method=’jw’,p=0.1)
II. Wikipedia details:
Mathematically, the Levenshtein distance between two strings is given by where
where is the indicator function equal to 0 when and 1 otherwise.
Note that the first element in the minimum corresponds to deletion (from to ), the second to insertion and the third to match or mismatch, depending on whether the respective symbols are the same.
For example, the Levenshtein distance between “kitten” and “sitting” is 3, since the following three edits change one into the other, and there is no way to do it with fewer than three edits:
The Levenshtein distance has several simple upper and lower bounds. These include:
In approximate string matching, the objective is to find matches for short strings in many longer texts, in situations where a small number of differences is to be expected. The short strings could come from a dictionary, for instance. Here, one of the strings is typically short, while the other is arbitrarily long. This has a wide range of applications, for instance, spell checkers, correction systems for optical character recognition, and software to assist natural language translation based on translation memory.
The Levenshtein distance can also be computed between two longer strings, but the cost to compute it, which is roughly proportional to the product of the two string lengths, makes this impractical. Thus, when used to aid in fuzzy string searching in applications such as record linkage, the compared strings are usually short to help improve speed of comparisons.
There are other popular measures of edit distance, which are calculated using a different set of allowable edit operations. For instance,
Edit distance is usually defined as a parameterizable metric calculated with a specific set of allowed edit operations, and each operation is assigned a cost (possibly infinite). This is further generalized by DNA sequence alignment algorithms such as the Smith–Waterman algorithm, which make an operation’s cost depend on where it is applied.
A straightforward recursive implementation LevenshteinDistance
function takes two strings, s and t, and returns the Levenshtein distance between them:
// len_s and len_t are the number of characters in string s and t respectively int LevenshteinDistance(string s, int len_s, string t, int len_t) { /* base case: empty strings */ if (len_s == 0) return len_t; if (len_t == 0) return len_s; /* test if last characters of the strings match */ if (s[len_s1] == t[len_t1]) cost = 0; else cost = 1; /* return minimum of delete char from s, delete char from t, and delete char from both */ return minimum(LevenshteinDistance(s, len_s  1, t, len_t ) + 1, LevenshteinDistance(s, len_s , t, len_t  1) + 1, LevenshteinDistance(s, len_s  1, t, len_t  1) + cost); }
Unfortunately, the straightforward recursive implementation is very inefficient because it recomputes the Levenshtein distance of the same substrings many times. A better method would never repeat the same distance calculation. For example, the Levenshtein distance of all possible prefixes might be stored in an array d[][]
where d[i][j]
is the distance between the first i
characters of string s
and the first j
characters of string t
. The table is easy to construct one row at a time starting with row 0. When the entire table has been built, the desired distance is d[len_s][len_t]
. While this technique is significantly faster, it will consume len_s * len_t
more memory than the straightforward recursive implementation.
This algorithm, an example of bottomup dynamic programming, is discussed, with variants, in the 1974 article The Stringtostring correction problem by Robert A. Wagner and Michael J. Fischer.^{[2]}
A straightforward implementation, as pseudocode for a function LevenshteinDistance that takes two strings, s of length m, and t of length n, and returns the Levenshtein distance between them:
int LevenshteinDistance(char s[1..m], char t[1..n]) { // for all i and j, d[i,j] will hold the Levenshtein distance between // the first i characters of s and the first j characters of t; // note that d has (m+1)*(n+1) values declare int d[0..m, 0..n] clear all elements in d // set each element to zero // source prefixes can be transformed into empty string by // dropping all characters for i from 1 to m { d[i, 0] := i } // target prefixes can be reached from empty source prefix // by inserting every characters for j from 1 to n { d[0, j] := j } for j from 1 to n { for i from 1 to m { if s[i] = t[j] then d[i, j] := d[i1, j1] // no operation required else d[i, j] := minimum ( d[i1, j] + 1, // a deletion d[i, j1] + 1, // an insertion d[i1, j1] + 1 // a substitution ) } } return d[m, n] }
Note that this implementation does not fit the definition precisely: it always prefers matches, even if insertions or deletions provided a better score. This is equivalent; it can be shown that for every optimal alignment (which induces the Levenshtein distance) there is another optimal alignment that prefers matches in the sense of this implementation.^{[3]}
Two examples of the resulting matrix (hovering over a number reveals the operation performed to get that number):


The invariant maintained throughout the algorithm is that we can transform the initial segment s[1..i]
into t[1..j]
using a minimum of d[i,j]
operations. At the end, the bottomright element of the array contains the answer.
As mentioned earlier, the invariant is that we can transform the initial segment s[1..i]
into t[1..j]
using a minimum of d[i,j]
operations. This invariant holds since:
s[1..i]
can be transformed into the empty string t[1..0]
by simply dropping all i
characters. Similarly, we can transform s[1..0]
to t[1..j]
by simply adding all j
characters.s[i] = t[j]
, and we can transform s[1..i1]
to t[1..j1]
in k
operations, then we can do the same to s[1..i]
and just leave the last character alone, giving k
operations.s[1..i]
to t[1..j1]
in k
operations, then we can simply add t[j]
afterwards to get t[1..j]
in k+1
operations (insertion).s[1..i1]
to t[1..j]
in k
operations, then we can remove s[i]
and then do the same transformation, for a total of k+1
operations (deletion).s[1..i1]
to t[1..j1]
in k
operations, then we can do the same to s[1..i]
, and exchange the original s[i]
for t[j]
afterwards, for a total of k+1
operations (substitution).s[1..n]
into t[1..m]
is of course the number required to transform all of s
into all of t
, and so d[n,m]
holds our result.This proof fails to validate that the number placed in d[i,j]
is in fact minimal; this is more difficult to show, and involves an argument by contradiction in which we assume d[i,j]
is smaller than the minimum of the three, and use this to show one of the three is not minimal.
Possible modifications to this algorithm include:
j
.[0,1]
.cost
values can be computed in parallel, and the algorithm can be adapted to perform the minimum
function in phases to eliminate dependencies.It turns out that only two rows of the table are needed for the construction: the previous row and the current row (the one being calculated).
The Levenshtein distance may be calculated iteratively using the following algorithm:^{[8]}
int LevenshteinDistance(string s, string t) { // degenerate cases if (s == t) return 0; if (s.Length == 0) return t.Length; if (t.Length == 0) return s.Length; // create two work vectors of integer distances int[] v0 = new int[t.Length + 1]; int[] v1 = new int[t.Length + 1]; // initialize v0 (the previous row of distances) // this row is A[0][i]: edit distance for an empty s // the distance is just the number of characters to delete from t for (int i = 0; i < v0.Length; i++) v0[i] = i; for (int i = 0; i < s.Length; i++) { // calculate v1 (current row distances) from the previous row v0 // first element of v1 is A[i+1][0] // edit distance is delete (i+1) chars from s to match empty t v1[0] = i + 1; // use formula to fill in the rest of the row for (int j = 0; j < t.Length; j++) { var cost = (s[i] == t[j]) ? 0 : 1; v1[j + 1] = Minimum(v1[j] + 1, v0[j + 1] + 1, v0[j] + cost); } // copy v1 (current row) to v0 (previous row) for next iteration for (int j = 0; j < v0.Length; j++) v0[j] = v1[j]; } return v1[t.Length]; }
The ggplot2 package is extremely flexible. The “gg” stands for the Grammar of Graphics.
In the book, The Grammar of Graphics, Wilkinson showed how you could describe plots not as discrete types like bar plot or pie chart, but using a “grammar” that would work not only for plots we commonly use but for almost any conceivable graphic. From this perspective a pie chart is just a bar chart with a circular (polar) coordinate system replacing the rectangular Cartesian coordinate system.
The ggplot2 is a simplified implementation of grammar of graphics written by Hadley Wickham for R. Wickham’s book, ggplot2: Elegant Graphics for Data Analysis, provides a detailed presentation of the ggplot2 package.
2. Functions and Examples
The ggplot2 package offers two main functions: quickplot() and ggplot().
2. The ggplot() function.
What are the fundamental parts of every data graph? They are:
Let us start our use of the ggplot() function with a single stacked bar plot. On the xaxis there really is no variable, so I plugged in a call to the factor() function that creates an empty one on the fly. I then fill the single bar in using the fill argument. There is only one type of geometric object on the plot, which I add with geom_bar.
1
2

> ggplot (mydata100, aes (x = factor ( "" ), fill = workshop) ) + + geom_bar () 
The xaxis comes out labeled as “factor(“”)” but we can overwrite that with a title for the xaxis. What is particularly interesting is that this can become a pie chart simply by changing its coordinate system to polar. The final line of code changes the label on the discrete xaxis to blank with “”.
1
2
3
4
5

> ggplot (mydata100, + aes (x = factor ( "" ), fill = workshop) ) + + geom_bar () + + coord_polar (theta = "y" ) + + scale_x_discrete ( "" ) 
Bar Plots
Note the unusual use of the plus sign “+” to add the effect of of geom_bar() to ggplot(). Only one variable plays an “aesthetic” role: workshop. The aes() function sets that role. So here is one way to write the code:
1
2

> ggplot (mydata100) + + geom_bar ( aes (workshop) ) 
A very useful feature of the ggplot() function is that it can pass aesthetic roles to all the functions that are “added” to it. As graphs become more complex, it can be a big timesaver to set as many aesthetic roles in the ggplot() function call and let it pass them through to various other functions that we will add on to build a more complex plot. For example, we can create exactly the same barplot with this code:
1
2

> ggplot (mydata100, aes (workshop) ) + + geom_bar () 
1
2

> ggplot (mydata100, aes (workshop) ) + + geom_bar () + coord_flip () 
If you want to fill the bars with color, you can do that using the “fill” argument.
1
2

> ggplot (mydata100, aes (workshop, fill = workshop ) ) + + geom_bar () 
Below I use fill to color the bars by workshop and set the “position” to stack.
1
2

> ggplot (mydata100, aes (gender, fill = workshop) ) + + geom_bar (position = "stack" ) 
In the plot above, the height of the bars represents the total number of males and females. This is fine if you want to compare counts, but if you want to compare proportions of each gender that took each class, you would have to make the bars equal heights. You can do that by simply changing the position to “fill”.
1
2

> ggplot (mydata100, aes (gender, fill=workshop) ) + + geom_bar (position= "fill" ) 
Here is the same plot changing only the bar position to be “dodge”.
1
2

> ggplot (mydata100, aes (gender, fill=workshop ) ) + + geom_bar (position= "dodge" ) 
You can change any of the above colored graphs to shades of grey by simply adding the scale_fill_grey() function. Here is the plot immediately above repeated in greyscale.
1
2
3

> ggplot (mydata100, aes (gender, fill=workshop ) ) + + geom_bar (position= "dodge" ) + + scale_fill_grey (start = 0, end = 1) 
You can get the same information that is in the above plot by making small separate plots for one of the groups. You can accomplish that with the facet_grid() function. It accepts a formula in the form “rows ~ colums”, so using “gender ~ .” asks for two rows for the genders (three if we had not removed missing values) and no columns.
1
2

> ggplot (mydata100, aes (workshop) ) + + geom_bar () + facet_grid (gender ~ .) 
Unsummarized Data
1
2
3
4
5
6
7

myTemp < data.frame ( + myGroup= factor ( c ( "Before" , "After" ) ), + myMeasure= c (40, 60) + ) > ggplot (data=myTemp, aes (myGroup, myMeasure) ) + + geom_bar () 
Dot Charts
Dot charts are similar to bar charts, but since they are plotting points on both an x and yaxis, they require a special variable called “..count..”. It calculates the counts and lets you plot them on the yaxis. The points use the “bin” statistic. Since dot charts are usually shown “sideways” I am adding the coord_flip() funtion.
1
2
3

> ggplot (mydata100, + aes (workshop, ..count.. ) ) + + geom_point (stat = "bin" , size = 3) + coord_flip () + facet_grid (gender ~ .) 
Adding Titles and Labels
To add a title, use the opts() function and its title argument. Adding titless to axes is trickier. You use four different functions depending on the axis and whether or not it is discrete: scale_x_discrete scale_y_distrete scale_x_continuous scale_y_continuous For a bar plot, the xaxis is discrete so I will use scale_x_discrete to assign a label to it. The character sequence “\n” tells R to go to a new line in all R packages.
1
2
3
4

> ggplot (mydata100, aes (workshop, ..count..)) + + geom_bar () + + opts ( title= "Workshop Attendance" ) + + scale_x_discrete ( "Statistics Package \nWorkshops" ) 
Histograms
The geom_histogram function is all you need. I have set the color of the bar edges to white. Without that, the bars all run together in the same shade of grey.
1
2

> ggplot (mydata100, aes (posttest) ) + + geom_histogram (color= "white" ) 
1
2

> ggplot (mydata100, aes (posttest) ) + > geom_histogram (binwidth = 0.5) 
If you prefer a density plot, that is easy too.
1
2

> ggplot (mydata100, aes (posttest)) + > geom_density () 
1
2
3
4

> ggplot (data=mydata100) + + geom_histogram ( aes (posttest, ..density..) ) + + geom_density ( aes (posttest, ..density..) ) + > geom_rug ( aes (posttest) ) 
1
2
3

> ggplot (mydata100, aes (posttest) ) + + geom_histogram (color = "white" ) + + facet_grid (gender ~ .) 
Normal QQ Plots
Normal QQ plots are done in ggplot with the stat_qq() function and the sample aesthetic.
1
2

> ggplot (mydata100, aes (sample = posttest) ) + + stat_qq () 
Strip Plots
With fairly small data sets, you can do strip plots using the point geom.
1
2

> ggplot (mydata100, aes (workshop, posttest) ) + + geom_point () 
1
2

> ggplot (mydata100, aes (workshop, posttest) ) + + geom_jitter () 
Scatter and Line Plots Various type of scatter and line plots can be done using different geoms as shown below. You can, of course, add multiple geoms to a plot. For example, you might want both points and lines, in which case you would simply add both geoms.
1
2

> ggplot (mydata100, aes (pretest, posttest)) + > geom_point () 
When you add a line geom, the ggplot sorts the data along the xaxis automatically. If you had timeseries data that were not sorted by date, it would do so.
1
2

> ggplot (mydata100, aes (pretest, posttest) ) + + geom_line () 
1
2

> ggplot (mydata100, aes (pretest, posttest) ) + + geom_path () 
Scatter Plots for Large Data Sets
Large data sets provide a challenge since so many points are obscured by other points. First let us create a data set with 5,000 points.
1
2
3
4
5

> pretest2 < round ( rnorm ( n = 5000, mean = 80, sd = 5) ) > posttest2 < round ( pretest2 + rnorm ( n = 5000, mean = 3, sd = 3) ) > pretest2 [pretest2 > 100] < 100 > posttest2[posttest2 > 100] < 100 > temp= data.frame (pretest2,posttest2) 
Now I will plot the data using smallsized points, jittering their positions and coloring them with some transparency (called “alpha” in computerspeak).
1
2
3

> ggplot (temp, aes (pretest2, posttest2), + size=2, position = position_jitter (x = 2,y = 2) ) + + geom_jitter (colour= alpha ( "black" ,0.15) ) 
1
2

> ggplot (temp, aes ( x=pretest2, y=posttest2) ) + + geom_point ( size=1 ) + geom_density2d () 
1
2

> ggplot (temp, aes (pretest2, posttest2)) + + geom_hex ( bins=30 ) 
Scatter Plots with Fit Lines
The ggplot() function makes it particularly easy to add fit lines to scatter plots. Simply adding the geom_smooth() function does the trick.
1
2

> ggplot (mydata100, aes (pretest, posttest) ) + + geom_point () + geom_smooth () 
Adding a linear regression fit requires only the addition of “method = lm” argument.
1
2

> ggplot (mydata100, aes (pretest, posttest) ) + + geom_point () + geom_smooth (method=lm) 
1
2
3

> ggplot (mydata100, + aes (pretest, posttest, label = as.character (gender))) + + geom_text (size = 3) 
To use point shapes to represent the value of a third variable, simply set the shape aesthetic.
1
2

> ggplot (mydata100, aes (pretest, posttest) ) + + geom_point ( aes (shape=gender ) ) 
Scatter Plots with Linear Fits by Group
One way to use a different fit for each group is to do them on the same plot. This involves setting aesthetics for both linetype and point shape. I tend to think of lines being added to the scattered points, but in this case I placed the geom_point() call last so that the shading from the gray confidence intervals would not shade the points themselves.
1
2
3

> ggplot (mydata100, aes (pretest, posttest) ) + + geom_smooth ( aes (linetype = gender), method = "lm" ) + + geom_point ( aes (shape = gender) ) 
Another way to display linear fits per group is to facet the plot.
1
2
3
4
5

> ggplot (mydata100, + aes (pretest, posttest ) ) + + geom_smooth (method = "lm" ) + + geom_point () + + facet_grid (gender ~ .) 
Box Plots
The ggplot package offers considerable control over how you can do box plots. Here I plot the raw points and then the boxes on top of them. This hides the points that are actually in the middle 50% of the data. They are usually dense and of less interest than the points that are further out. If you have a lot of data you might consider using geom_jitter() to spread the points around, preventing overplotting.
1
2
3

> ggplot (mydata100, + aes (workshop, posttest )) + + geom_point () + geom_boxplot () 
I’m expanding the coverage of my article, The Popularity of Data Analysis Software. This is the first installment, which includes a new opening and a greatly expanded analysis of the analytics job market. Here it is, from the abstract onward through the first section…
Abstract: This article presents various ways of measuring the popularity or market share of software for analytics including: Alteryx, Angoss, C / C++ / C#, BMDP, Cognos, Java, JMP, Lavastorm, MATLAB, Minitab, NCSS, Oracle Data Mining, Python, R, SAP Business Objects, SAP HANA, SAS, SAS Enterprise Miner, Salford Predictive Modeler (SPM) etc., TIBCO Spotfire, SPSS, Stata, Statistica, Systat, Tableau, Teradata Miner, WEKA / Pentaho. I don’t attempt to differentiate among variants of languages such as R vs. Revolution…
View original post 2,161 more words
Inner join is the most common type of Join which is used to combine the rows from two tables and create a result set containing only such records that are present in both the tables based on the joining condition (predicate).
Inner join returns rows when there is at least one match in both tables
If none of the record matches between two tables, then INNER JOIN will return a NULL set. Below is an example of INNER JOIN and the resulting set.
SELECT dept.name DEPARTMENT, emp.name EMPLOYEE FROM DEPT dept, EMPLOYEE emp WHERE emp.dept_id = dept.id
Department  Employee 

HR  Inno 
HR  Privy 
Engineering  Robo 
Engineering  Hash 
Engineering  Anno 
Engineering  Darl 
Marketing  Pete 
Marketing  Meme 
Sales  Tomiti 
Sales  Bhuti 
Outer Join can be full outer or single outer
Outer Join, on the other hand, will return matching rows from both tables as well as any unmatched rows from one or both the tables (based on whether it is single outer or full outer join respectively).
Notice in our record set that there is no employee in the department 5 (Logistics). Because of this if we perform inner join, then Department 5 does not appear in the above result. However in the below query we perform an outer join (dept left outer join emp), and we can see this department.
SELECT dept.name DEPARTMENT, emp.name EMPLOYEE FROM DEPT dept, EMPLOYEE emp WHERE dept.id = emp.dept_id (+)
Department  Employee 

HR  Inno 
HR  Privy 
Engineering  Robo 
Engineering  Hash 
Engineering  Anno 
Engineering  Darl 
Marketing  Pete 
Marketing  Meme 
Sales  Tomiti 
Sales  Bhuti 
Logistics 
The (+) sign on the emp side of the predicate indicates that emp is the outer table here. The above SQL can be alternatively written as below (will yield the same result as above):
SELECT dept.name DEPARTMENT, emp.name EMPLOYEE FROM DEPT dept LEFT OUTER JOIN EMPLOYEE emp ON dept.id = emp.dept_id
SQL JOIN allows us to “lookup” records on other table based on the given conditions between two tables. For example, if we have the department ID of each employee, then we can use this department ID of the employee table to join with the department ID of department table to lookup department names.
UNION operation allows us to add 2 similar data sets to create resulting data set that contains all the data from the source data sets. Union does not require any condition for joining. For example, if you have 2 employee tables with same structure, you can UNION them to create one result set that will contain all the employees from both of the tables.
SELECT * FROM EMP1 UNION SELECT * FROM EMP2;
UNION and UNION ALL both unify for add two structurally similar data sets, but UNION operation returns only the unique records from the resulting data set whereas UNION ALL will return all the rows, even if one or more rows are duplicated to each other.
In the following example, I am choosing exactly the same employee from the emp table and performing UNION and UNION ALL. Check the difference in the result.
SELECT * FROM EMPLOYEE WHERE ID = 5 UNION ALL SELECT * FROM EMPLOYEE WHERE ID = 5
ID  MGR_ID  DEPT_ID  NAME  SAL  DOJ 

5.0  2.0  2.0  Anno  80.0  01Feb2012 
5.0  2.0  2.0  Anno  80.0  01Feb2012 
SELECT * FROM EMPLOYEE WHERE ID = 5 UNION SELECT * FROM EMPLOYEE WHERE ID = 5
ID  MGR_ID  DEPT_ID  NAME  SAL  DOJ 

5.0  2.0  2.0  Anno  80.0  01Feb2012 
WHERE and HAVING both filters out records based on one or more conditions. The difference is, WHERE clause can only be applied on a static nonaggregated column whereas we will need to use HAVING for aggregated columns.
To understand this, consider this example.
Suppose we want to see only those departments where department ID is greater than 3. There is no aggregation operation and the condition needs to be applied on a static field. We will use WHERE clause here:
SELECT * FROM DEPT WHERE ID > 3
ID  NAME 

4  Sales 
5  Logistics 
Next, suppose we want to see only those Departments where Average salary is greater than 80. Here the condition is associated with a nonstatic aggregated information which is “average of salary”. We will need to use HAVING clause here:
SELECT dept.name DEPARTMENT, avg(emp.sal) AVG_SAL FROM DEPT dept, EMPLOYEE emp WHERE dept.id = emp.dept_id (+) GROUP BY dept.name HAVING AVG(emp.sal) > 80
DEPARTMENT  AVG_SAL 

Engineering  90 
As you see above, there is only one department (Engineering) where average salary of employees is greater than 80.
UNION combines the results from 2 tables and eliminates duplicate records from the result set.
MINUS operator when used between 2 tables, gives us all the rows from the first table except the rows which are present in the second table.
INTERSECT operator returns us only the matching or common rows between 2 result sets.
To understand these operators, let’s see some examples. We will use two different queries to extract data from our emp table and then we will perform UNION, MINUS and INTERSECT operations on these two sets of data.
SELECT * FROM EMPLOYEE WHERE ID = 5 UNION SELECT * FROM EMPLOYEE WHERE ID = 6
ID  MGR_ID  DEPT_ID  NAME  SAL  DOJ 

5  2  2.0  Anno  80.0  01Feb2012 
6  2  2.0  Darl  80.0  11Feb2012 
SELECT * FROM EMPLOYEE MINUS SELECT * FROM EMPLOYEE WHERE ID > 2
ID  MGR_ID  DEPT_ID  NAME  SAL  DOJ 

1  2  Hash  100.0  01Jan2012  
2  1  2  Robo  100.0  01Jan2012 
SELECT * FROM EMPLOYEE WHERE ID IN (2, 3, 5) INTERSECT SELECT * FROM EMPLOYEE WHERE ID IN (1, 2, 4, 5)
ID  MGR_ID  DEPT_ID  NAME  SAL  DOJ 

5  2  2  Anno  80.0  01Feb2012 
2  1  2  Robo  100.0  01Jan2012 
Self Join is the act of joining one table with itself.
Self Join is often very useful to convert a hierarchical structure into a flat structure
In our employee table example above, we have kept the manager ID of each employee in the same row as that of the employee. This is an example of how a hierarchy (in this case employeemanager hierarchy) is stored in the RDBMS table. Now, suppose if we need to print out the names of the manager of each employee right beside the employee, we can use self join. See the example below:
SELECT e.name EMPLOYEE, m.name MANAGER FROM EMPLOYEE e, EMPLOYEE m WHERE e.mgr_id = m.id (+)
EMPLOYEE  MANAGER 

Pete  Hash 
Darl  Hash 
Inno  Hash 
Robo  Hash 
Tomiti  Robo 
Anno  Robo 
Privy  Robo 
Meme  Pete 
Bhuti  Tomiti 
Hash 
The only reason we have performed a left outer join here (instead of INNER JOIN) is we have one employee in this table without a manager (employee ID = 1). If we perform inner join, this employee will not showup.
The usual way to do it in SQL is to use CASE statement or DECODE statement.
Generating a row number – that is a running sequence of numbers for each row is not easy using plain SQL. In fact, the method I am going to show below is not very generic either. This method only works if there is at least one unique column in the table. This method will also work if there is no single unique column, but collection of columns that is unique. Anyway, here is the query:
SELECT name, sal, (SELECT COUNT(*) FROM EMPLOYEE i WHERE o.name >= i.name) row_num FROM EMPLOYEE o order by row_num
NAME  SAL  ROW_NUM 

Anno  80  1 
Bhuti  60  2 
Darl  80  3 
Hash  100  4 
Inno  50  5 
Meme  60  6 
Pete  70  7 
Privy  50  8 
Robo  100  9 
Tomiti  70  10 
The column that is used in the row number generation logic is called “sort key”. Here sort key is “name” column. For this technique to work, the sort key needs to be unique. We have chosen the column “name” because this column happened to be unique in our Employee table. If it was not unique but some other collection of columns was, then we could have used those columns as our sort key (by concatenating those columns to form a single sort key).
Also notice how the rows are sorted in the result set. We have done an explicit sorting on the row_num column, which gives us all the row numbers in the sorted order. But notice that name column is also sorted (which is probably the reason why this column is referred as sortkey). If you want to change the order of the sorting from ascending to descending, you will need to change “>=” sign to “<=” in the query.
As I said before, this method is not very generic. This is why many databases already implement other methods to achieve this. For example, in Oracle database, every SQL result set contains a hidden column called ROWNUM. We can just explicitly select ROWNUM to get sequence numbers.
This question, often asked in many interviews, does not make any sense to me. The problem here is how do you define which record is first and which is second. Which record is retrieved first from the database is not deterministic. It depends on many uncontrollable factors such as how database works at that moment of execution etc. So the question should really be – “how to select any 5 records from the table?” But whatever it is, here is the solution:
In Oracle,
SELECT * FROM EMP WHERE ROWNUM <= 5;
In SQL Server,
SELECT TOP 5 * FROM EMP;
Generic solution,
I believe a generic solution can be devised for this problem if and only if there exists at least one distinct column in the table. For example, in our EMP table ID is distinct. We can use that distinct column in the below way to come up with a generic solution of this question that does not require database specific functions such as ROWNUM, TOP etc.
SELECT name FROM EMPLOYEE o WHERE (SELECT count(*) FROM EMPLOYEE i WHERE i.name < o.name) < 5
name 

Inno 
Anno 
Darl 
Meme 
Bhuti 
I have taken “name” column in the above example since “name” is happened to be unique in this table. I could very well take ID column as well.
In this example, if the chosen column was not distinct, we would have got more than 5 records returned in our output.
Do you have a better solution to this problem? If yes, post your solution in the comment.
ROWNUM is a pseudo column present in Oracle database returned result set prior to ORDER BY being evaluated. So ORDER BY ROWNUM does not work.
ROW_NUMBER() is an analytical function which is used in conjunction to OVER() clause wherein we can specify ORDER BY and also PARTITION BY columns.
Suppose if you want to generate the row numbers in the order of ascending employee salaries for example, ROWNUM will not work. But you may use ROW_NUMBER() OVER() like shown below:
SELECT name, sal, row_number() over(order by sal desc) rownum_by_sal FROM EMPLOYEE o
name  Sal  ROWNUM_BY_SAL 

Hash  100  1 
Robo  100  2 
Anno  80  3 
Darl  80  4 
Tomiti  70  5 
Pete  70  6 
Bhuti  60  7 
Meme  60  8 
Inno  50  9 
Privy  50  10 
The goal of the online advertising companies is to optimize the advertising campaign performance through some technologies that assess the traffic quality of publishers other than the major search engines and place different ads according to the type of keywords, device and geographic characteristics.
1. Some basic terms for the online advertising industry:
Impression, Click through, Conversion
CPM (cost per mille, Cost Per Thousand Impression)
CPC/PPC (cost per click/pay per click)
CPA/CPS (cost per action/cost per sale)
search portfolio ROI (return on investment)
KPI (Key Performance Indicators)
RPQ (Request price quotation)
bounce rate: the percentage of visitors who enter the site and “bounce” (leave the site) rather than continue viewing other pages within the same site.
2. Goals:
In a nutshell, Increase search portfolio ROI.
More detailed:
3. Possible performance metrics:
Note that performance metrics can be very specific according to different campaigns.For example, for a campaign of car ads, we might be interested in the data amount of visitors that request a quote, locate closest dealer, search inventory, compare vehicles, etc.
4. Possible elements for an advertiser campaign report:
Revenue over time
Avg. CPC and CPM
Coverage
Queries
Yield (RPQ, CPM)
Clicks (total, invalid)
top 10 keywords received
invalid reasons pie chart
Ads delivered
5. Possible Traffic Analyses besides the basic elements:
Geocustomization: detect the geographical origin of visitors
Event tracking: track the paths visitors follow inside the website, including the tracking of events such as downloads, or clicks to outside locations.
Referrer tracker
As prior distribution for P, we use a uniform p.d.f. on the interval [0,1]. Uniform at [0,1] is a special case of a Beta distribution for α = 1, β = 1.
With a Beta (α, β) prior on P, and a conditional binomial distribution with parameters n, p for X, the posterior of P is again Beta with parameters α˜ = α+x, β˜ = n−x+β.
Therefore, the posterior probablity distribution of P is Beta with α=1+x, β=40x+1.
Now let us compute E(Px) which is the best predictor of P when P has its posterior distribution. Since this distribution is Beta (α˜, β˜) with expectation E(Px) = α˜/(α˜ +β˜), we obtain
E(Px) = (α+x)/(α+x+n−x+β) = (1+x)/(1+ x + 40−x + 1)= (x+1)/42
Note that this predictor is very close to the observed sample proportion of success from X: pˆ= x/40.
]]>(1) SORT DATA
select variables
from table
order by variable desc;
(2) FILTER DATA
between
is null
= <> <=
in () not in ()
(3) WILDCARD FILTER
Wildcard  Description 

%  A substitute for zero or more characters where var like ‘fish%’ 
_  A substitute for a single character where var like ‘_ inch’ 
[charlist]  Sets and ranges of characters to match where name like ‘[JM]%’ first letter J or M 
[^charlist]  Matches only a character NOT specified within the brackets 
(4) CONCATENATE
Oracle : select city  ‘,’  state  ‘,’  zip as address
SQL Server +: select city + ‘,’ + state + ‘,’ + zip as address
trim padded spaces from right side with RTRIM() function : select RTRIM(city)  ‘,’  RTRIM(state)  ‘,’  RTRIM(zip)
(5) DATA MANIPULATION FUNCTION
len(), upper(), lower(), trim(), substr() Oracle / substring() SQL Server, sysdate Oracle/getdate() SQL Server, to_number(), to_char() Oracle / convert() SQL Server
soundex(): match all data that sound similar to the one you assign e.g. where soundex(name)=soundex(‘Mike Green’)
date: 1. SQL Server where datepart (yy, order_date)=2001 and datepart (mm, order_date)=2
2. where order_date between to_date(’01Jan2001′) and to_date(’31Dec2001′)
(6) AGGREGATE FUNCTION
avg(), count(), max(), min(), sum(), avg(distinct variable)
(7) GROUP DATA
select ID, count(*) as orders
from Orders
group by ID
having count(*) >=2;
(8) JOIN TABLES
1) left join: returns all rows from the left table, with the matching rows in the right table. The result is NULL in the right side when there is no match.
SELECT column_name(s)
FROM table1
LEFT JOIN table2
ON table1.column_name=table2.column_name;
2) inner join is the same as JOIN: selects all rows from both tables as long as there is a match between the columns in both tables.
SELECT column_name(s)
FROM table1
INNER JOIN table2
ON table1.column_name=table2.column_name;
3) full outer join: returns all rows from the left table and from the right table
SELECT column_name(s)
FROM table1
FULL OUTER JOIN table2
ON table1.column_name=table2.column_name;
4) Union: combines the result of two or more SELECT statements.each SELECT statement within the UNION must have the same number of columns. The columns must also have similar data types. Also, the columns in each SELECT statement must be in the same order.
(9) INSERT DATA
insert into data1 (name, age) values(‘Peter’, 25);
insert into data1 (name, age) select name, age from data 2 where name=’Peter’;
copy one table to another:
select * create table data1 as
into data1 select *
from data2; from data2; Oracle
(10) UPDATE AND DELETE DATA
update customers set cust_email=’spring14@yahoo.cn’ where cust_id=’001′;
delete from customers where cust_id=’001′;
delete table: drop table customers;
create index prod_index on table(product);
(11) OTHERS
SELECT DISTINCT column_name,column_name
FROM table_name;
SELECT SUBSTRING(column_name,start,length) AS sth FROM table_name e.g. substr(proc, 1, 1) not in (‘T’, ‘F’);
SELECT ROUND(column_name,decimals) FROM table_name e.g. ROUND (PER25 * decode(RLV,0,100, RLV)*(1c.fct),2);
increase speed: e.g. CREATE TABLE name PARALLEL 8 nologging compress for query high as
use partition instead of group by: e.g. SUM(FREQ) OVER (PARTITION BY PROC, MOD,GEOZIP)
create temporary table: e.g. WITH new_name AS (SELECT * from table where…) select * from new_name where…
case when: e.g. SELECT CASE WHEN (N = 1 OR CHARGE_SD = 0) THEN NULL
ELSE MD3/((N1)*POWER(CHARGE_SD,3)) END AS skewness from…
macro: SET DEFINE~ DEFINE VAR=’name’ select ~VAR from table;
nvl(): available in Oracle, and not in MySQL or SQL Server. replace NULL value with another value. ifnull() in MySQL and the isnull() in SQL Server.
e.g. nvl(trim(a.mod),’0′) not in (’20’, ’22’, ‘SG’)
references: http://www.w3schools.com/sql/default.asp
(12) PRACTICE
There’s no better way to improve SQL skills than to practice with some real SQL interview questions. The following SQL practice exercises were actually taken from real interview tests with Google and Amazon.
1. The names of all salespeople that have an order with Samsonic.
2. The names of all salespeople that do not have any order with Samsonic.
3. The names of salespeople that have 2 or more orders.
4. Write a SQL statement to insert rows into a table called highAchiever(Name, Age), where a salesperson must have a salary of 100,000 or greater to be included in the table.
1. select a.name from Salesperson a, Customer b, Orders c
where b.name = “Samsonic”
and b.ID=c.cust_id
and c.salesperson_id = a. ID;
2. select a.name from Salesperson a, Customer b, Orders c
where b.name not in “Samsonic”
and b.ID=c.cust_id
and c.salesperson_id = a. ID;
3. select a. name, count(c.salesperson_id) from Salesperson a, Orders c
where count(c.salesperson_id)>1
and c.salesperson_id = a. ID;
4. insert into highAchiever (Name, Age)
(select Name, Age from Salesperson where salary>100,000);
create table highAchiever as
select Name, Age from Salesperson
where salary>100,000;
Salesperson  Customer  


Orders  

Python is an Language that supports the Object Oriented Programming paradigm. Like other OOP languages, Python has classes which are defined wireframes of objects. Python supports class inheritance. A class may have many subclasses but many only inherit directly from one superclass.
class ClassName(object):
"""This is a class"""
class_variable
def __init__(self,*args):
self.args = args
def __repr__(self):
return "Something to represent the object as a string"
def other_method(self,*args):
//do something else
class Horse(object):
"""Horse represents a Horse"""
species = "Equus ferus caballus"
def __init__(self,color,weight,wild=False):
self.color = color
self.weight = weight
self.wild = wild
def __repr__(self):
return "%s horse weighing %f and wild status is %b" (self.color,self.weight,self.wild)
def make_sound(self):
print("neighhhh")
def movement(self):
return "walk"
class ClassName(SuperClass):
//same as above
//use 'super' keyword to get from above
class RaceHorse(Horse):
"""A faster horse that inherits from Horse"""
def movement(self):
return "run"
def movement_slow(self):
super(Horse,self).movement()
def __repr__(self):
return "%s race horse weighing %f and wild status is %b" (self.color,self.weight,self.wild)
>> horse3 = RaceHorse("white",200)
>> print horse3.movement_slow()
"walk"
>> print horse3.movement()
"run"
Dictionaries are Python’s builtin associative data type. A dictionary is made of keyvalue pairs where each key corresponds to a value. Like sets, dictionaries are unordered. A few notes about keys and values: * The key must be immutable and hashable while the value can be of any type. Common examples of keys are tuples, strings and numbers. * A single dictionary can contain keys of varying types and values of varying types.
dict() #creates new empty dictionary
{} #creates new empty dictionary
>> my_dict = {}
>> content_of_value1 = "abcd"
>> content_of_value2 = "wxyz"
>> my_dict.update({"key_name1":content_of_value1})
>> my_dict.update({"key_name2":content_of_value2})
>> my_dict
{'key_name1':"abcd", 'key_name2':"wxyz"}
>> my_dict.get("key_name2")
"wxyz"
{key1:value1,key2:value2}
>> my_dict = {"key1":[1,2,3],"key2":"I am a string",123:456}
>> my_dict["key1"] #[1,2,3]
>> my_dict[123] #456
>> my_dict["new key"] = "New value"
>> print my_dict
{"key2":"I am a string", "new key":"New value", "key1":[1,2,3],123:456}
Python functions can be used to abstract pieces of code to use elsewhere.
def function_name(parameters):
# Some code here
def add_two(a, b):
c = a + b
return c
# or without the interim assignment to c
def add_two(a, b):
return a + b
def function_name(parameters, named_default_parameter=value):
# Some code here
def shout(exclamation="Hey!"):
print exclamation
shout() # Displays "Hey!"
shout("Watch Out!") # Displays "Watch Out!"
Python functions are firstclass objects, which means that they can be stored in variables and lists and can even be returned by other functions.
# Storing function objects in variables:
def say_hello(name):
return "Hello, " + name
foo = say_hello("Alice")
# Now the value of 'foo' is "Hello, Alice"
fun = say_hello
# Now the value of 'fun' is a function object we can use like the original function:
bar = fun("Bob")
# Now the value of 'bar' is "Hello, Bob"
# Returning functions from functions
# A simple function
def say_hello(greeter, greeted):
return "Hello, " + greeted + ", I'm " + greeter + "."
# We can use it like this:
print say_hello("Alice", "Bob") # Displays "Hello, Bob, I'm Alice."
# We can also use it in a function:
def produce_greeting_from_alice(greeted):
return say_hello("Alice", greeted)
print produce_greeting_from_alice("Bob") # Displays "Hello, Bob, I'm Alice."
# We can also return a function from a function by nesting them:
def produce_greeting_from(greeter):
def greet(greeted):
return say_hello(greeter, greeted)
return greet
# Here we create a greeting function for Eve:
produce_greeting_from_eve = produce_greeting_from("Eve")
# 'produce_greeting_from_eve' is now a function:
print produce_greeting_from_eve("Alice") # Displays "Hello, Alice, I'm Eve."
# You can also invoke the function directly if you want:
print produce_greeting_from("Bob")("Eve") # Displays "Hello, Eve, I'm Bob."
# Using functions in a dictionary instead of long if statements:
# Let's say we have a variable called 'current_action' and we want stuff to happen based on its value:
if current_action == 'PAUSE':
pause()
elif current_action == 'RESTART':
restart()
elif current_action == 'RESUME':
resume()
# This can get long and complicated if there are many values.
# Instead, we can use a dictionary:
response_dict = {
'PAUSE': pause,
'RESTART': restart,
'RESUME': resume
}
response_dict[current_action]() # Gets the correct function from response_dict and calls it
Using len(some_object) returns the number of _toplevel_ items contained in the object being queried.
len(iterable)
>> my_list = [0,4,5,2,3,4,5]
>> len(my_list)
7
>> my_string = 'abcdef'
>> len(my_string)
6
Convenient ways to generate or extract information from lists.
[variable for variable in iterable condition]
[variable for variable in iterable]
>> x_list = [1,2,3,4,5,6,7]
>> even_list = [num for num in x_list if (num % 2 == 0)]
>> even_list
[2,4,6]
>> m_list = ['AB', 'AC', 'DA', 'FG', 'LB']
>> A_list = [duo for duo in m_list if ('A' in duo)]
>> A_list
['AB', 'AC', 'DA']
A Python data type that holds an ordered collection of values, which can be of any type. Lists are Python’s ordered mutable data type. Unlike tuples, lists can be modified inplace.
>> x = [1, 2, 3, 4]
>> y = ['spam', 'eggs']
>> x
[1, 2, 3, 4]
>> y
['spam','eggs']
>> y.append('mash')
>> y
['spam', 'eggs', 'mash']
>> y += ['beans']
>> y
['spam', 'eggs', 'mash', 'beans']
Python provides a clean iteration syntax. Note the colon and indentation.
>> for i in range(0, 3):
>> print(i*2)
0
2
4
>> m_list = ["Sir", "Lancelot", "Coconuts"]
>> for item in m_list:
>> print(item)
Sir
Lancelot
Coconuts
>> w_string = "Swift"
>> for letter in w_string:
>> print(letter)
S
w
i
f
t
A While loop permits code to execute repeatedly until a certain condition is met. This is useful if the number of iterations required to complete a task is unknown prior to flow entering the loop.
while condition:
//do something
>> looping_needed = True
>>
>> while looping_needed:
>> # some operation on data
>> if condition:
>> looping_needed = False
A function to display the output of a program. Using the parenthesized version is arguably more consistent.
>> # this will work in all modern versions of Python
>> print("some text here")
"some text here"
>> # but this only works in Python versions lower than 3.x
>> print "some text here too"
"some text here too"
The range() function returns a list of integers, the sequence of which is defined by the arguments passed to it.
argument variations:
range(terminal)
range(start, terminal)
range(start, terminal, step_size)
>> range(4)
[0, 1, 2, 3]
>> range(2, 8)
[2, 3, 4, 5, 6, 7]
>> range(2, 13, 3)
[2, 5, 8, 11]
Sets are collections of unique but unordered items. It is possible to convert certain iterables to a set.
>> new_set = {1, 2, 3, 4, 4, 4,'A', 'B', 'B', 'C'}
>> new_set
{'A', 1, 'C', 3, 4, 2, 'B'}
>> dup_list = [1,1,2,2,2,3,4,55,5,5,6,7,8,8]
>> set_from_list = set(dup_list)
>> set_from_list
{1, 2, 3, 4, 5, 6, 7, 8, 55}
A Pythonic way of extracting “slices” of a list using a special bracket notation that specifies the start and end of the section of the list you wish to extract. Leaving the beginning value blank indicates you wish to start at the beginning of the list, leaving the ending value blank indicates you wish to go to the end of the list. Using a negative value references the end of the list (so that in a list of 4 elements, 1 means the 4th element). Slicing always yields another list, even when extracting a single value.
>> # Specifying a beginning and end:
>> x = [1, 2, 3, 4]
>> x[2:3]
[3]
>> # Specifying start at the beginning and end at the second element
>> x[:2]
[1, 2]
>> # Specifying start at the next to last element and go to the end
>> x[2:]
[3, 4]
>> # Specifying start at the beginning and go to the next to last element
>> x[:1]
[1, 2, 3]
>> # Specifying a step argument returns every nth item
>> y = [1, 2, 3, 4, 5, 6, 7, 8]
>> y[::2]
[1, 3, 5, 7]
>> # Return a reversed version of the list ( or string )
>> x[::1]
[4, 3, 2, 1]
>> # String reverse
>> my_string = "Aloha"
>> my_string[::1]
"aholA"
Using the str() function allows you to represent the content of a variable as a string, provided that the data type of the variable provides a neat way to do so. str() does not change the variable in place, it returns a ‘stringified’ version of it. On a more technical note, `str()` calls the special `__str__` method of the object passed to it.
str(object)
>> # such features can be useful for concatenating strings
>> my_var = 123
>> my_var
123
>> str(my_var)
'123'
>> my_booking = "DB Airlines Flight " + str(my_var)
>> my_booking
'DB Airlines Flight 123'
Strings store characters and have many builtin convenience methods that let you modify their content. Strings are immutable, meaning they cannot be changed in place.
>> my_string1 = "this is a valid string"
>> my_string2 = 'this is also a valid string'
>> my_string3 = 'this is' + ' ' + 'also' + ' ' + 'a string'
>> my_string3
"this is also a string"
A Python data type that holds an ordered collection of values, which can be of any type. Python tuples are “immutable,” meaning that they cannot be changed once created.
>> x = (1, 2, 3, 4)
>> y = ('spam', 'eggs')
>> my_list = [1,2,3,4]
>> my_tuple = tuple(my_list)
>> my_tuple
(1, 2, 3, 4)
Tuples can be expanded into variables easily.
name, age = ("Alice", 19)
# Now name has the value "Alice" and age has the value 19
# You can also omit the parentheses:
name, age = "Alice", 19
Variables are assigned values using the ‘=’ operator, which is not to be confused with the ‘==’ sign used for testing equality. A variable can hold almost any type of value such as lists, dictionaries, functions.
>> x = 12
>> x
12
iteration
items()
will iterate over a dictionary and return those key/value pairs.keys()
function returns an array of the dictionary’s keysvalues()
function returns an array of the dictionary’s valuesfor letter in "Eric":
print letter, # note the comma!
# => "E" "r" "i" "c"
for
/in
and if
keywords e.g. evens_to_50 = [i for i in range(51) if i % 2 == 0]start
describes where the slice starts (inclusive), end
is where it ends (exclusive), and stride
describes the space between items in the sliced list.bitwise operators
0b111 & 0b1010 = 0b10
0 & 0 = 0 0 & 1 = 0 1 & 0 = 0 1 & 1 = 1
0
b110  0
b1010 = 0
b1110 0  0 = 0 0  1 = 1 1  0 = 1 1  1 = 1
0 ^ 0 = 0 0 ^ 1 = 1 1 ^ 0 = 1 1 ^ 1 = 0
Class
class DerivedClass(Base/SuperClass):
def some_method(self, args):
super(DerivedClass, self).method(args)
Input/Output
f = open("output.txt", "w")
w–write, r–readonly, r+ — read and write, a–appendTrue
when the file is closed andFalse
otherwise if my_file.closed==False:2. Example
1. simple decoding
garbled = “!XeXgXaXsXsXeXmX XtXeXrXcXeXsX XeXhXtX XmXaX XI”
message=garbled[::2]
print message
garbled = “IXXX aXXmX aXXXnXoXXXXXtXhXeXXXXrX sXXXXeXcXXXrXeXt mXXeXsXXXsXaXXXXXXgXeX!XX”
message=filter(lambda i: i!=”X”,garbled)
print message
2. turn on or turn off binary
(1) mask & : turn on
def check_bit4(x):
mask=0b1000
desired=x&mask
if desired >0:
return “on”
else:
return “off”
(2)mask  : turn on
a = 0b10111011
mask=0b100
result=amask
print bin(result)
(3)mask ^ : flip out
a = 0b11101110
mask =0b11111111
print bin(a^mask)
(4)mask ^ slide: flip out
def flip_bit(number,n):
mask=(0b1<<(n1))
return bin(number^mask)
3. Class Overview
(1) Fruit example
class Fruit(object):
“””A class that makes various tasty fruits.”””
def __init__(self, name, color, flavor, poisonous):
self.name = name
self.color = color
self.flavor = flavor
self.poisonous = poisonous
def description(self):
print “I’m a %s %s and I taste %s.” % (self.color, self.name, self.flavor)
def is_edible(self):
if not self.poisonous:
print “Yep! I’m edible.”
else:
print “Don’t eat me! I am super poisonous.”
lemon = Fruit(“lemon”, “yellow”, “sour”, False)
lemon.description()
lemon.is_edible()
(2) Shopping Cart example
class ShoppingCart(object):
“””Creates shopping cart objects
for users of our fine website.”””
items_in_cart = {}
def __init__(self, customer_name):
self.customer_name = customer_name
def add_item(self, product, price):
“””Add product to the cart.”””
if not product in self.items_in_cart:
self.items_in_cart[product] = price
print product + ” added.”
else:
print product + ” is already in the cart.”
def remove_item(self, product):
“””Remove product from the cart.”””
if product in self.items_in_cart:
del self.items_in_cart[product]
print product + ” removed.”
else:
print product + ” is not in the cart.”
my_cart=ShoppingCart(“Alice”)
my_cart.add_item(“Electric Toothbrush”, 35)
(3) Employee wage example
class Employee(object):
“””Models reallife employees!”””
def __init__(self, employee_name):
self.employee_name = employee_name
def calculate_wage(self, hours):
self.hours = hours
return hours * 20.00
# Add your code below!
class PartTimeEmployee(Employee):
def calculate_wage(self, hours):
self.hours = hours
return hours * 12.00
def full_time_wage(self, hours):
return super(PartTimeEmployee, self).calculate_wage(hours)
milton = PartTimeEmployee(“Milton”)
print milton.full_time_wage(10)
4. Input/Output
(1) Output
my_list = [i**2 for i in range(1,11)]
f = open(“output.txt”, “w”)
for item in my_list:
f.write(str(item) + “\n”)
f.close()
(2) Input: read
my_file=open(“output.txt”,”r”)
print my_file.read()
my_file.close()
(3) Input: Write
with open(“text.txt”,”w”) as my_file:
my_file.write(“Almost done!”)
loops
functions
2. Example
1. Board game
from random import randint
board = []
for x in range(5):
board.append([“O”] * 5)
def print_board(board):
for row in board:
print ” “.join(row)
print “Let’s play Battleship!”
print_board(board)
def random_row(board):
return randint(0, len(board) – 1)
def random_col(board):
return randint(0, len(board[0]) – 1)
ship_row = random_row(board)
ship_col = random_col(board)
for turn in range(4):
# Everything from here on should go in your for loop!
# Be sure to indent four spaces!
guess_row = int(raw_input(“Guess Row:”))
guess_col = int(raw_input(“Guess Col:”))
if guess_row == ship_row and guess_col == ship_col:
print “Congratulations! You sunk my battleship!”
break
else:
if (guess_row < 0 or guess_row > 4) or (guess_col < 0 or guess_col > 4):
print “Oops, that’s not even in the ocean.”
elif(board[guess_row][guess_col] == “X”):
print “You guessed that one already.”
else:
if turn==3:
print “Game Over”
else:
print “You missed my battleship!”
board[guess_row][guess_col] = “X”
# Print (turn + 1) here!
print turn+1
print_board(board)
2. Guess a number from 19
from random import randrange
random_number = randrange(1, 10)
count = 0
# Start your game!
while count<3:
guess = int(raw_input(“Enter a guess:”))
count+=1
if guess==random_number:
print ‘You win.’
break
else:
print ‘You lose!’
print random_number
3. is_even function
def is_even(x):
if x%2==0:
return True
else:
return False
4. is_int function
def is_int(x):
if xround(x)==0:
return True
else:
return False
5. digit_sum function
def digit_sum(n):
lst=[]
for i in str(n):
lst.append(int(i))
return sum(lst)
print digit_sum(1234)
6. factorial function
def factorial(x):
if x < 2:
return 1
else:
total = 1
for i in range (2, x+1):
total = total * i
return total
7. is_prime function
def is_prime(x):
if x < 2:
return False
else:
for i in range(2, x):
if x % i == 0:
return False
return True
8. string reverse function
def reverse(text):
word = “”
y = len(text) – 1
for i in range(y, 1, 1):
word += str(text[i])
return word
9. antivowel function:
def anti_vowel(text):
store = []
for letter in text:
if letter.lower() not in “aeiou”:
store.append(letter)
return “”.join(store)
text = “Hey look words!”
print anti_vowel(text)
10. add score function
score = {“a”: 1, “c”: 3, “b”: 3, “e”: 1, “d”: 2, “g”: 2, “f”: 4, “i”: 1, “h”: 4, “k”: 5, “j”: 8, “m”: 3, “l”: 1, “o”: 1, “n”: 1, “q”: 10, “p”: 3, “s”: 1, “r”: 1, “u”: 1, “t”: 1, “w”: 4, “v”: 4, “y”: 4, “x”: 8, “z”: 10}
def scrabble_score(word):
word = word.lower()
total=0
for i in word:
total=total+score[i]
return total
11. replace function
def censor(text, word):
return text.replace(word, ‘*’ * len(word))
print censor(“this hack is wack hack”,”hack”)
def censor(text,word):
text = text.split()
for i in range(len(text)):
if text[i] == word:
text[i] = “*” * len(word)
return ” “.join(text)
12. count repetition function
def count(sequence, item):
count=0
for i in sequence:
if i==item:
count+=1
return count
print count([1,2,1,1], 1)
13. remove odd function
def purify(lst):
new_lst=[]
for i in lst:
if i%2==0:
new_lst.append(i)
return new_lst
14. product function
def product(lst):
n=1
for i in range(len(lst)):
n=n*lst[i]
return n
15. remove duplicates function
def remove_duplicates(lst):
total=[]
for i in lst:
if i not in total:
total.append(i)
return total
16. median function
def median(lst):
lst=sorted(lst)
if len(lst)%2==0:
return (lst[len(lst)/21]+lst[len(lst)/2])/2.0
else:
return lst[(len(lst)+1)/21]
Comments
Multiline Comments
Some comments need to span several lines, use this if you have more than 4 single line comments in a row.
Singleline Comments
Augmenting code with human readable descriptions can help document design decisions.