Levenshtein distance

In information theory and computer science, the Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (i.e. insertions, deletions or substitutions) required to change one word into the other. The phrase edit distance is often used to refer specifically to Levenshtein distance. It is named after Vladimir Levenshtein, who considered this distance in 1965. It is closely related to pairwise string alignments.

I. Implementation in R

R package: ‘stringdist’

Description: Calculate various string distances based on edits (damerau-levenshtein, hamming, levenshtein, optimal sting alignment),qgrams (q- gram, cosine, jaccard distance) or heuristic metrics (jaro,jaro-winkler). 

1. amatch Examples

    # which sci-fi heroes are stringdistantly nearest
    amatch("leia",c("uhura","leela"),maxDist=5)

    # we can restrict the search
   amatch("leia",c("uhura","leela"),maxDist=1)

    #setting nomatch returns a different value when no match is found
    amatch("leia",c("uhura","leela"),maxDist=1,nomatch=0)

    #this is always true if maxDist is Inf
    ain("leia",c("uhura","leela"),maxDist=Inf)

    # Let’s look in a neighbourhood of maximum 2 typo’s (by default, the OSA algorithm is used)
    ain("leia",c("uhura","leela"), maxDist=2)

2. qgrams Examples

qgrams(’hello world’,q=3)

# q-grams are counted uniquely over a character vector
qgrams(rep(’hello world’,2),q=3)
# to count them separately, do something like
    x <- c(’hello’, ’world’)
    lapply(x,qgrams, q=3)

# output rows may be named, and you can pass any number of character vectors
    x <- "I will not buy this record, it is scratched"
    y <- "My hovercraft is full of eels"
    z <- c("this", "is", "a", "dead","parrot")

    qgrams(A = x, B = y, C = z,q=2)

    # a tonque twister, showing the effects of useBytes and useNames
    x <- "peter piper picked a peck of pickled peppers"
    qgrams(x, q=2)
    qgrams(x, q=2, useNames=FALSE)

    qgrams(x, q=2, useBytes=TRUE)
    qgrams(x, q=2, useBytes=TRUE, useNames=TRUE)

3. stringdist Examples

  1. # Simple example using optimal string alignment
        stringdist("ca","abc")
    
        # The same example using Damerau-Levenshtein distance (multiple editing of substrings allowed)
        stringdist("ca","abc",method="dl")
    
        # string distance matching is case sensitive:
        stringdist("ABC","abc")
# so you may want to normalize a bit:    
stringdist(tolower("ABC"),"abc")

    # stringdist recycles the shortest argument:
    stringdist(c(’a’,’b’,’c’),c(’a’,’c’))

    # stringdistmatrix gives the distance matrix (by default for optimal string alignment):
    stringdist(c(’a’,’b’,’c’),c(’a’,’c’))

    # different edit operations may be weighted; e.g. weighted substitution:
    stringdist(’ab’,’ba’,weight=c(1,1,1,0.5))

    # Non-unit weights for insertion and deletion makes the distance metric asymetric
    stringdist(’ca’,’abc’)
    stringdist(’abc’,’ca’)
    stringdist(’ca’,’abc’,weight=c(0.5,1,1,1))
    stringdist(’abc’,’ca’,weight=c(0.5,1,1,1))

    # Hamming distance is undefined for
    # strings of unequal lengths so stringdist returns Inf
    stringdist("ab","abc",method="h")
    # For strings of eqal length it counts the number of unequal characters as they occur
    # in the strings from beginning to end
    stringdist("hello","HeLl0",method="h")

    # The lcm (longest common substring) distance returns the number of
    # characters that are not part of the lcs.
    #
    # Here, the lcs is either ’a’ or ’b’ and one character cannot be paired:
    stringdist(’ab’,’ba’,method="lcs")

    # Here the lcs is ’surey’ and ’v’, ’g’ and one ’r’ of ’surgery’ are not paired
    stringdist(’survey’,’surgery’,method="lcs")

    # q-grams are based on the difference between occurrences of q consecutive characters
    # in string a and string b.
    # Since each character abc occurs in ’abc’ and ’cba’, the q=1 distance equals 0:
    stringdist(’abc’,’cba’,method=’qgram’,q=1)

    # since the first string consists of ’ab’,’bc’ and the second
    # of ’cb’ and ’ba’, the q=2 distance equals 4 (they have no q=2 grams in common):
    stringdist(’abc’,’cba’,method=’qgram’,q=2)

    # Wikipedia has the following example of the Jaro-distance.
    stringdist(’MARTHA’,’MATHRA’,method=’jw’)
    # Note that stringdist gives a  _distance_ where wikipedia gives the corresponding
    # _similarity measure_. To get the wikipedia result:
    1 - stringdist(’MARTHA’,’MATHRA’,method=’jw’)

    # The corresponding Jaro-Winkler distance can be computed by setting p=0.1
    stringdist(’MARTHA’,’MATHRA’,method=’jw’,p=0.1)
    # or, as a similarity measure
    1 - stringdist(’MARTHA’,’MATHRA’,method=’jw’,p=0.1)

II. Wikipedia details:

1. Definition

Mathematically, the Levenshtein distance between two strings a, b is given by \operatorname{lev}_{a,b}(|a|,|b|) where

\qquad\operatorname{lev}_{a,b}(i,j) = \begin{cases}<br />
  \max(i,j) & \text{ if} \min(i,j)=0, \\<br />
  \min \begin{cases}<br />
          \operatorname{lev}_{a,b}(i-1,j) + 1 \\<br />
          \operatorname{lev}_{a,b}(i,j-1) + 1 \\<br />
          \operatorname{lev}_{a,b}(i-1,j-1) + 1_{(a_i \neq b_j)}<br />
       \end{cases} & \text{ otherwise.}<br />
\end{cases}

where 1_{(a_i \neq b_j)} is the indicator function equal to 0 when a_i = b_j and 1 otherwise.

Note that the first element in the minimum corresponds to deletion (from a to b), the second to insertion and the third to match or mismatch, depending on whether the respective symbols are the same.

Example

For example, the Levenshtein distance between “kitten” and “sitting” is 3, since the following three edits change one into the other, and there is no way to do it with fewer than three edits:

  1. kitten → sitten (substitution of “s” for “k”)
  2. sitten → sittin (substitution of “i” for “e”)
  3. sittin → sitting (insertion of “g” at the end).

Upper and lower bounds

The Levenshtein distance has several simple upper and lower bounds. These include:

  • It is always at least the difference of the sizes of the two strings.
  • It is at most the length of the longer string.
  • It is zero if and only if the strings are equal.
  • If the strings are the same size, the Hamming distance is an upper bound on the Levenshtein distance.
  • The Levenshtein distance between two strings is no greater than the sum of their Levenshtein distances from a third string (triangle inequality).

2. Applications

In approximate string matching, the objective is to find matches for short strings in many longer texts, in situations where a small number of differences is to be expected. The short strings could come from a dictionary, for instance. Here, one of the strings is typically short, while the other is arbitrarily long. This has a wide range of applications, for instance, spell checkers, correction systems for optical character recognition, and software to assist natural language translation based on translation memory.

The Levenshtein distance can also be computed between two longer strings, but the cost to compute it, which is roughly proportional to the product of the two string lengths, makes this impractical. Thus, when used to aid in fuzzy string searching in applications such as record linkage, the compared strings are usually short to help improve speed of comparisons.

3. Relationship with other edit distance metrics

There are other popular measures of edit distance, which are calculated using a different set of allowable edit operations. For instance,

Edit distance is usually defined as a parameterizable metric calculated with a specific set of allowed edit operations, and each operation is assigned a cost (possibly infinite). This is further generalized by DNA sequence alignment algorithms such as the Smith–Waterman algorithm, which make an operation’s cost depend on where it is applied.

4. Computing Levenshtein distance

Recursive

A straightforward recursive implementation LevenshteinDistance function takes two strings, s and t, and returns the Levenshtein distance between them:

// len_s and len_t are the number of characters in string s and t respectively
int LevenshteinDistance(string s, int len_s, string t, int len_t)
{
  /* base case: empty strings */
  if (len_s == 0) return len_t;
  if (len_t == 0) return len_s;

  /* test if last characters of the strings match */
  if (s[len_s-1] == t[len_t-1]) cost = 0;
  else                          cost = 1;

  /* return minimum of delete char from s, delete char from t, and delete char from both */
  return minimum(LevenshteinDistance(s, len_s - 1, t, len_t    ) + 1,
                 LevenshteinDistance(s, len_s    , t, len_t - 1) + 1,
                 LevenshteinDistance(s, len_s - 1, t, len_t - 1) + cost);
}

Unfortunately, the straightforward recursive implementation is very inefficient because it recomputes the Levenshtein distance of the same substrings many times. A better method would never repeat the same distance calculation. For example, the Levenshtein distance of all possible prefixes might be stored in an array d[][] where d[i][j] is the distance between the first i characters of string s and the first j characters of string t. The table is easy to construct one row at a time starting with row 0. When the entire table has been built, the desired distance is d[len_s][len_t]. While this technique is significantly faster, it will consume len_s * len_t more memory than the straightforward recursive implementation.

Iterative with full matrix

Computing the Levenshtein distance is based on the observation that if we reserve a matrix to hold the Levenshtein distances between all prefixes of the first string and all prefixes of the second, then we can compute the values in the matrix in a dynamic programming fashion, and thus find the distance between the two full strings as the last value computed.

This algorithm, an example of bottom-up dynamic programming, is discussed, with variants, in the 1974 article The String-to-string correction problem by Robert A. Wagner and Michael J. Fischer.[2]

A straightforward implementation, as pseudocode for a function LevenshteinDistance that takes two strings, s of length m, and t of length n, and returns the Levenshtein distance between them:

int LevenshteinDistance(char s[1..m], char t[1..n])
{
  // for all i and j, d[i,j] will hold the Levenshtein distance between
  // the first i characters of s and the first j characters of t;
  // note that d has (m+1)*(n+1) values
  declare int d[0..m, 0..n]

  clear all elements in d // set each element to zero

  // source prefixes can be transformed into empty string by
  // dropping all characters
  for i from 1 to m
    {
      d[i, 0] := i
    }

  // target prefixes can be reached from empty source prefix
  // by inserting every characters
  for j from 1 to n
    {
      d[0, j] := j
    }

  for j from 1 to n
    {
      for i from 1 to m
        {
          if s[i] = t[j] then
            d[i, j] := d[i-1, j-1]       // no operation required
          else
            d[i, j] := minimum
                    (
                      d[i-1, j] + 1,  // a deletion
                      d[i, j-1] + 1,  // an insertion
                      d[i-1, j-1] + 1 // a substitution
                    )
        }
    }

  return d[m, n]
}

Note that this implementation does not fit the definition precisely: it always prefers matches, even if insertions or deletions provided a better score. This is equivalent; it can be shown that for every optimal alignment (which induces the Levenshtein distance) there is another optimal alignment that prefers matches in the sense of this implementation.[3]

Two examples of the resulting matrix (hovering over a number reveals the operation performed to get that number):

k i t t e n
0 1 2 3 4 5 6
s 1 1 2 3 4 5 6
i 2 2 1 2 3 4 5
t 3 3 2 1 2 3 4
t 4 4 3 2 1 2 3
i 5 5 4 3 2 2 3
n 6 6 5 4 3 3 2
g 7 7 6 5 4 4 3
S a t u r d a y
0 1 2 3 4 5 6 7 8
S 1 0 1 2 3 4 5 6 7
u 2 1 1 2 2 3 4 5 6
n 3 2 2 2 3 3 4 5 6
d 4 3 3 3 3 4 3 4 5
a 5 4 3 4 4 4 4 3 4
y 6 5 4 4 5 5 5 4 3

The invariant maintained throughout the algorithm is that we can transform the initial segment s[1..i] into t[1..j] using a minimum of d[i,j]operations. At the end, the bottom-right element of the array contains the answer.

Proof of correctness

As mentioned earlier, the invariant is that we can transform the initial segment s[1..i] into t[1..j] using a minimum of d[i,j] operations. This invariant holds since:

  • It is initially true on row and column 0 because s[1..i] can be transformed into the empty string t[1..0] by simply dropping all i characters. Similarly, we can transform s[1..0] to t[1..j] by simply adding all j characters.
  • If s[i] = t[j], and we can transform s[1..i-1] to t[1..j-1] in k operations, then we can do the same to s[1..i] and just leave the last character alone, giving k operations.
  • Otherwise, the distance is the minimum of the three possible ways to do the transformation:
    • If we can transform s[1..i] to t[1..j-1] in k operations, then we can simply add t[j] afterwards to get t[1..j] in k+1 operations (insertion).
    • If we can transform s[1..i-1] to t[1..j] in k operations, then we can remove s[i] and then do the same transformation, for a total of k+1operations (deletion).
    • If we can transform s[1..i-1] to t[1..j-1] in k operations, then we can do the same to s[1..i], and exchange the original s[i] for t[j]afterwards, for a total of k+1 operations (substitution).
  • The operations required to transform s[1..n] into t[1..m] is of course the number required to transform all of s into all of t, and so d[n,m] holds our result.

This proof fails to validate that the number placed in d[i,j] is in fact minimal; this is more difficult to show, and involves an argument by contradiction in which we assume d[i,j] is smaller than the minimum of the three, and use this to show one of the three is not minimal.

Possible modifications

Possible modifications to this algorithm include:

  • We can adapt the algorithm to use less space, O(min(n,m)) instead of O(mn), since it only requires that the previous row and current row be stored at any one time.
  • We can store the number of insertions, deletions, and substitutions separately, or even the positions at which they occur, which is always j.
  • We can normalize the distance to the interval [0,1].
  • If we are only interested in the distance if it is smaller than a threshold k, then it suffices to compute a diagonal stripe of width 2k+1 in the matrix. In this way, the algorithm can be run in O(kl) time, where l is the length of the shortest string.[4]
  • We can give different penalty costs to insertion, deletion and substitution. We can also give penalty costs that depend on which characters are inserted, deleted or substituted.
  • By initializing the first row of the matrix with 0, the algorithm can be used for fuzzy string search of a string in a text.[5] This modification gives the end-position of matching substrings of the text. To determine the start-position of the matching substrings, the number of insertions and deletions can be stored separately and used to compute the start-position from the end-position.[6]
  • This algorithm parallelizes poorly, due to a large number of data dependencies. However, all the cost values can be computed in parallel, and the algorithm can be adapted to perform the minimum function in phases to eliminate dependencies.
  • By examining diagonals instead of rows, and by using lazy evaluation, we can find the Levenshtein distance in O(m (1 + d)) time (where d is the Levenshtein distance), which is much faster than the regular dynamic programming algorithm if the distance is small.[7]

Iterative with two matrix rows

It turns out that only two rows of the table are needed for the construction: the previous row and the current row (the one being calculated).

The Levenshtein distance may be calculated iteratively using the following algorithm:[8]

int LevenshteinDistance(string s, string t)
{
    // degenerate cases
    if (s == t) return 0;
    if (s.Length == 0) return t.Length;
    if (t.Length == 0) return s.Length;

    // create two work vectors of integer distances
    int[] v0 = new int[t.Length + 1];
    int[] v1 = new int[t.Length + 1];

    // initialize v0 (the previous row of distances)
    // this row is A[0][i]: edit distance for an empty s
    // the distance is just the number of characters to delete from t
    for (int i = 0; i < v0.Length; i++)
        v0[i] = i;

    for (int i = 0; i < s.Length; i++)
    {
        // calculate v1 (current row distances) from the previous row v0

        // first element of v1 is A[i+1][0]
        //   edit distance is delete (i+1) chars from s to match empty t
        v1[0] = i + 1;

        // use formula to fill in the rest of the row
        for (int j = 0; j < t.Length; j++)
        {
            var cost = (s[i] == t[j]) ? 0 : 1;
            v1[j + 1] = Minimum(v1[j] + 1, v0[j + 1] + 1, v0[j] + cost);
        }

        // copy v1 (current row) to v0 (previous row) for next iteration
        for (int j = 0; j < v0.Length; j++)
            v0[j] = v1[j];
    }

    return v1[t.Length];
}
Advertisements

Reblog: R Graphics ggplot2

1. Background:

The ggplot2 package is extremely flexible. The “gg” stands for the Grammar of Graphics.

In the book, The Grammar of Graphics, Wilkinson showed how you could describe plots not as discrete types like bar plot or pie chart, but using a “grammar” that would work not only for plots we commonly use but for almost any conceivable graphic. From this perspective a pie chart is just a bar chart with a circular (polar) coordinate system replacing the rectangular Cartesian coordinate system. 

The ggplot2 is a simplified implementation of grammar of graphics written by Hadley Wickham for R. Wickham’s book, ggplot2: Elegant Graphics for Data Analysis, provides a detailed presentation of the ggplot2 package.

2. Functions and Examples

The ggplot2 package offers two main functions: quickplot() and ggplot().

  1. The quickplot() function – also known as qplot(). It is particularly easy to use for simple plots. Below is an example of the default plots that qplot() makes. The command that created each plot is shown in the title of each graph.

2. The ggplot() function.

What are the fundamental parts of every data graph? They are:

  • Aesthetics – these are the roles that the variables play in each graph. A variable may control where points appear, the color or shape of a point, the height of a bar and so on.
  • Geoms – these are the geometric objects. Do you need bars, points, lines?
  • Statistics – these are the functions like linear regression you might need to draw a line.
  • Scales – these are legends that show things like circular symbols represent females while circles represent males.
  • Facets – these are the groups in your data. Faceting by gender would cause the graph to repeat for the two genders.

Let us start our use of the ggplot() function with a single stacked bar plot. On the x-axis there really is no variable, so I plugged in a call to the factor() function that creates an empty one on the fly. I then fill the single bar in using the fill argument. There is only one type of geometric object on the plot, which I add with geom_bar.

1
2
> ggplot(mydata100,  aes(x = factor(""), fill = workshop) ) +
+   geom_bar()

The x-axis comes out labeled as “factor(“”)” but we can over-write that with a title for the x-axis. What is particularly interesting is that this can become a pie chart simply by changing its coordinate system to polar. The final line of code changes the label on the discrete x-axis to blank with “”.

1
2
3
4
5
> ggplot(mydata100,
+   aes(x = factor(""), fill = workshop) ) +
+   geom_bar() +
+   coord_polar(theta = "y") +
+   scale_x_discrete("")

Bar Plots

Note the unusual use of the plus sign “+” to add the effect of of geom_bar() to ggplot(). Only one variable plays an “aesthetic” role: workshop. The aes() function sets that role. So here is one way to write the code:

1
2
> ggplot(mydata100) +
+   geom_bar( aes(workshop) )

A very useful feature of the ggplot() function is that it can pass aesthetic roles to all the functions that are “added” to it. As graphs become more complex, it can be a big time-saver to set as many aesthetic roles in the ggplot() function call and let it pass them through to various other functions that we will add on to build a more complex plot. For example, we can create exactly the same barplot with this code:

1
2
> ggplot(mydata100, aes(workshop) ) +
+   geom_bar()
Flipping from vertical to horizontal bars is easy with the addition of the coord_flip() function.

1
2
> ggplot(mydata100, aes(workshop) ) +
+   geom_bar() + coord_flip()

If you want to fill the bars with color, you can do that using the “fill” argument.

1
2
> ggplot(mydata100, aes(workshop, fill = workshop ) ) +
+   geom_bar()

Below I use fill to color the bars by workshop and set the “position” to stack.

1
2
> ggplot(mydata100, aes(gender, fill = workshop) ) +
+   geom_bar(position = "stack")

In the plot above, the height of the bars represents the total number of males and females. This is fine if you want to compare counts, but if you want to compare proportions of each gender that took each class, you would have to make the bars equal heights. You can do that by simply changing the position to “fill”.

1
2
> ggplot(mydata100, aes(gender, fill=workshop) ) +
+   geom_bar(position="fill")

Here is the same plot changing only the bar position to be “dodge”.

1
2
> ggplot(mydata100, aes(gender, fill=workshop ) ) +
+   geom_bar(position="dodge")

You can change any of the above colored graphs to shades of grey by simply adding the scale_fill_grey() function. Here is the plot immediately above repeated in greyscale.

1
2
3
> ggplot(mydata100, aes(gender, fill=workshop ) ) +
+   geom_bar(position="dodge")  +
+   scale_fill_grey(start = 0, end = 1)

You can get the same information that is in the above plot by making small separate plots for one of the groups. You can accomplish that with the facet_grid() function. It accepts a formula in the form “rows ~ colums”, so using “gender ~ .” asks for two rows for the genders (three if we had not removed missing values) and no columns.

1
2
> ggplot(mydata100, aes(workshop) ) +
+   geom_bar() + facet_grid(gender ~ .)

Unsummarized Data

1
2
3
4
5
6
7
myTemp <- data.frame(
+   myGroup=factor( c("Before","After") ),
+   myMeasure=c(40, 60)
+ )
> ggplot(data=myTemp, aes(myGroup, myMeasure) ) +
+   geom_bar()

Dot Charts

Dot charts are similar to bar charts, but since they are plotting points on both an x- and y-axis, they require a special variable called “..count..”. It calculates the counts and lets you plot them on the y-axis. The points use the “bin” statistic. Since dot charts are usually shown “sideways” I am adding the coord_flip() funtion.

1
2
3
> ggplot(mydata100, + aes(workshop, ..count.. ) ) +
+ geom_point(stat = "bin", size = 3) + coord_flip() +
    facet_grid(gender ~ .)

Adding Titles and Labels

To add a title, use the opts() function and its title argument. Adding titless to axes is trickier. You use four different functions depending on the axis and whether or not it is discrete: scale_x_discrete scale_y_distrete scale_x_continuous scale_y_continuous For a bar plot, the x-axis is discrete so I will use scale_x_discrete to assign a label to it. The character sequence “\n” tells R to go to a new line in all R packages.

1
2
3
4
> ggplot(mydata100, aes(workshop, ..count..)) +
+   geom_bar() +
+   opts( title="Workshop Attendance" ) +
+   scale_x_discrete("Statistics Package \nWorkshops")

Histograms

The geom_histogram function is all you need. I have set the color of the bar edges to white. Without that, the bars all run together in the same shade of grey.

1
2
> ggplot(mydata100, aes(posttest) ) +
+   geom_histogram(color="white")
You can change the number of bars used using the binwidth argument. Since this many bars do not touch, I did not bother setting the edge color to white.
1
2
> ggplot(mydata100, aes(posttest) ) +
>   geom_histogram(binwidth = 0.5)

If you prefer a density plot, that is easy too.

1
2
> ggplot(mydata100, aes(posttest)) +
>   geom_density()
It is easy to layer many different geometric objects onto your plots. In this case to get the same axis on the histogram as the density uses, I used a special ggplot2 variable named “..density..” on the y-axis. I also added a “rug” of carpet-like tick marks on the x-axis using geom_rug.
1
2
3
4
> ggplot(data=mydata100) +
+   geom_histogram( aes(posttest, ..density..) ) +
+   geom_density( aes(posttest, ..density..) ) +
>   geom_rug( aes(posttest) )
Comparing group histograms is easy when you facet them.
1
2
3
> ggplot(mydata100, aes(posttest) ) +
+   geom_histogram(color = "white") +
+   facet_grid(gender ~ .)

Normal QQ Plots

Normal QQ plots are done in ggplot with the stat_qq() function and the sample aesthetic.

1
2
> ggplot(mydata100, aes(sample = posttest) ) +
+   stat_qq()

Strip Plots
With fairly small data sets, you can do strip plots using the point geom.

1
2
> ggplot(mydata100, aes(workshop, posttest) ) +
+   geom_point()
With large data sets, you can use the jitter geom instead. Our data is so small that the default amount of jitter makes it hard to even notice where each group ends.
1
2
> ggplot(mydata100, aes(workshop, posttest) ) +
+   geom_jitter()

Scatter and Line Plots Various type of scatter and line plots can be done using different geoms as shown below. You can, of course, add multiple geoms to a plot. For example, you might want both points and lines, in which case you would simply add both geoms.

1
2
> ggplot(mydata100, aes(pretest, posttest)) +
>   geom_point()

When you add a line geom, the ggplot sorts the data along the x-axis automatically. If you had time-series data that were not sorted by date, it would do so.

1
2
> ggplot(mydata100, aes(pretest, posttest) ) +
+   geom_line()
The path geom leaves the order of the data as it is; it does not sort it before connecting the points. See the books for more examples.
1
2
> ggplot(mydata100, aes(pretest, posttest) ) +
+   geom_path()

Scatter Plots for Large Data Sets

Large data sets provide a challenge since so many points are obscured by other points. First let us create a data set with 5,000 points.

1
2
3
4
5
> pretest2 <- round( rnorm( n = 5000, mean = 80, sd = 5) )
> posttest2 <- round( pretest2 + rnorm( n = 5000, mean = 3, sd = 3) )
> pretest2 [pretest2 > 100] <- 100
> posttest2[posttest2 > 100] <- 100
> temp=data.frame(pretest2,posttest2)

Now I will plot the data using small-sized points, jittering their positions and coloring them with some transparency (called “alpha” in computer-speak).

1
2
3
> ggplot(temp, aes(pretest2, posttest2),
+   size=2, position = position_jitter(x = 2,y = 2) ) +
+   geom_jitter(colour=alpha("black",0.15) )
Next I will use very small sized points and lay a set of 2D density contours on top of them. To help see the contours more clearly, I will not jitter the points.
1
2
> ggplot(temp, aes( x=pretest2, y=posttest2) ) +
+   geom_point( size=1 ) + geom_density2d()
Finally, I will create a hexbin plot, that replaces bunches of points with a larger hexagonal symbol.
1
2
> ggplot(temp, aes(pretest2, posttest2))  +
+   geom_hex( bins=30 )

Scatter Plots with Fit Lines

The ggplot() function makes it particularly easy to add fit lines to scatter plots. Simply adding the geom_smooth() function does the trick.

1
2
> ggplot(mydata100, aes(pretest, posttest) ) +
+   geom_point() + geom_smooth()

Adding a linear regression fit requires only the addition of “method = lm” argument.

1
2
> ggplot(mydata100, aes(pretest, posttest) ) +
+   geom_point() + geom_smooth(method=lm)
To plot labels instead of point characters, add the label aesthetic. I placed “size = 3″ in the geom_text function to clarify its role.
1
2
3
> ggplot(mydata100,
+   aes(pretest, posttest, label = as.character(gender))) +
+   geom_text(size = 3)

To use point shapes to represent the value of a third variable, simply set the shape aesthetic.

1
2
> ggplot(mydata100, aes(pretest, posttest) ) +
+   geom_point( aes(shape=gender ) )

 

Scatter Plots with Linear Fits by Group

One way to use a different fit for each group is to do them on the same plot. This involves setting aesthetics for both linetype and point shape. I tend to think of lines being added to the scattered points, but in this case I placed the geom_point() call last so that the shading from the gray confidence intervals would not shade the points themselves.

1
2
3
> ggplot(mydata100, aes(pretest, posttest) ) +
+   geom_smooth( aes(linetype = gender), method = "lm") +
+   geom_point( aes(shape = gender) )

Another way to display linear fits per group is to facet the plot.

1
2
3
4
5
> ggplot(mydata100,
+   aes(pretest, posttest ) ) +
+   geom_smooth(method = "lm") +
+   geom_point()  +
+   facet_grid(gender ~ .)

Box Plots

The ggplot package offers considerable control over how you can do box plots. Here I plot the raw points and then the boxes on top of them. This hides the points that are actually in the middle 50% of the data. They are usually dense and of less interest than the points that are further out. If you have a lot of data you might consider using geom_jitter() to spread the points around, preventing over-plotting.

1
2
3
> ggplot(mydata100,
+   aes(workshop, posttest )) +
+   geom_point() + geom_boxplot()

Job Trends in the Analytics Market: New, Improved, now Fortified with C, Java, MATLAB, Python, Julia and Many More!

r4stats.com

I’m expanding the coverage of my article, The Popularity of Data Analysis Software. This is the first installment, which includes a new opening and a greatly expanded analysis of the analytics job market. Here it is, from the abstract onward through the first section…

Abstract: This article presents various ways of measuring the popularity or market share of software for analytics including: Alteryx, Angoss, C / C++ / C#, BMDP, Cognos, Java, JMP, Lavastorm, MATLAB, Minitab, NCSS, Oracle Data Mining, Python, R, SAP Business Objects, SAP HANA, SAS, SAS Enterprise Miner, Salford Predictive Modeler (SPM) etc., TIBCO Spotfire, SPSS, Stata, Statistica, Systat, Tableau, Teradata Miner, WEKA / Pentaho. I don’t attempt to differentiate among variants of languages such as R vs. Revolution…

View original post 2,161 more words

Top 10 SQL Interview Questions with Answers

Tags

1. What is the difference between inner and outer join? Explain with example.

Inner Join

Inner join is the most common type of Join which is used to combine the rows from two tables and create a result set containing only such records that are present in both the tables based on the joining condition (predicate).

Inner join returns rows when there is at least one match in both tables

If none of the record matches between two tables, then INNER JOIN will return a NULL set. Below is an example of INNER JOIN and the resulting set.

SELECT dept.name DEPARTMENT, emp.name EMPLOYEE 
FROM DEPT dept, EMPLOYEE emp
WHERE emp.dept_id = dept.id
Department Employee
HR Inno
HR Privy
Engineering Robo
Engineering Hash
Engineering Anno
Engineering Darl
Marketing Pete
Marketing Meme
Sales Tomiti
Sales Bhuti
Outer Join

Outer Join can be full outer or single outer

Outer Join, on the other hand, will return matching rows from both tables as well as any unmatched rows from one or both the tables (based on whether it is single outer or full outer join respectively).

Notice in our record set that there is no employee in the department 5 (Logistics). Because of this if we perform inner join, then Department 5 does not appear in the above result. However in the below query we perform an outer join (dept left outer join emp), and we can see this department.

SELECT dept.name DEPARTMENT, emp.name EMPLOYEE 
FROM DEPT dept, EMPLOYEE emp
WHERE dept.id = emp.dept_id (+)
Department Employee
HR Inno
HR Privy
Engineering Robo
Engineering Hash
Engineering Anno
Engineering Darl
Marketing Pete
Marketing Meme
Sales Tomiti
Sales Bhuti
Logistics  

The (+) sign on the emp side of the predicate indicates that emp is the outer table here. The above SQL can be alternatively written as below (will yield the same result as above):

SELECT dept.name DEPARTMENT, emp.name EMPLOYEE 
FROM DEPT dept LEFT OUTER JOIN EMPLOYEE emp
ON dept.id = emp.dept_id  

2. What is the difference between JOIN and UNION?

SQL JOIN allows us to “lookup” records on other table based on the given conditions between two tables. For example, if we have the department ID of each employee, then we can use this department ID of the employee table to join with the department ID of department table to lookup department names.

UNION operation allows us to add 2 similar data sets to create resulting data set that contains all the data from the source data sets. Union does not require any condition for joining. For example, if you have 2 employee tables with same structure, you can UNION them to create one result set that will contain all the employees from both of the tables.

SELECT * FROM EMP1
UNION
SELECT * FROM EMP2;

3. What is the difference between UNION and UNION ALL?

UNION and UNION ALL both unify for add two structurally similar data sets, but UNION operation returns only the unique records from the resulting data set whereas UNION ALL will return all the rows, even if one or more rows are duplicated to each other.

In the following example, I am choosing exactly the same employee from the emp table and performing UNION and UNION ALL. Check the difference in the result.

SELECT * FROM EMPLOYEE WHERE ID = 5
UNION ALL
SELECT * FROM EMPLOYEE WHERE ID = 5
ID MGR_ID DEPT_ID NAME SAL DOJ
5.0 2.0 2.0 Anno 80.0 01-Feb-2012
5.0 2.0 2.0 Anno 80.0 01-Feb-2012
SELECT * FROM EMPLOYEE WHERE ID = 5
UNION 
SELECT * FROM EMPLOYEE WHERE ID = 5
ID MGR_ID DEPT_ID NAME SAL DOJ
5.0 2.0 2.0 Anno 80.0 01-Feb-2012

4. What is the difference between WHERE clause and HAVING clause?

WHERE and HAVING both filters out records based on one or more conditions. The difference is, WHERE clause can only be applied on a static non-aggregated column whereas we will need to use HAVING for aggregated columns.

To understand this, consider this example. 
Suppose we want to see only those departments where department ID is greater than 3. There is no aggregation operation and the condition needs to be applied on a static field. We will use WHERE clause here:

SELECT * FROM DEPT WHERE ID > 3
ID NAME
4 Sales
5 Logistics

Next, suppose we want to see only those Departments where Average salary is greater than 80. Here the condition is associated with a non-static aggregated information which is “average of salary”. We will need to use HAVING clause here:

SELECT dept.name DEPARTMENT, avg(emp.sal) AVG_SAL
FROM DEPT dept, EMPLOYEE emp
WHERE dept.id = emp.dept_id (+)
GROUP BY dept.name
HAVING AVG(emp.sal) > 80
DEPARTMENT AVG_SAL
Engineering 90

As you see above, there is only one department (Engineering) where average salary of employees is greater than 80.

5. What is the difference among UNION, MINUS and INTERSECT?

UNION combines the results from 2 tables and eliminates duplicate records from the result set.

MINUS operator when used between 2 tables, gives us all the rows from the first table except the rows which are present in the second table.

INTERSECT operator returns us only the matching or common rows between 2 result sets.

To understand these operators, let’s see some examples. We will use two different queries to extract data from our emp table and then we will perform UNION, MINUS and INTERSECT operations on these two sets of data.

UNION

SELECT * FROM EMPLOYEE WHERE ID = 5
UNION 
SELECT * FROM EMPLOYEE WHERE ID = 6
ID MGR_ID DEPT_ID NAME SAL DOJ
5 2 2.0 Anno 80.0 01-Feb-2012
6 2 2.0 Darl 80.0 11-Feb-2012

MINUS

SELECT * FROM EMPLOYEE
MINUS
SELECT * FROM EMPLOYEE WHERE ID > 2
ID MGR_ID DEPT_ID NAME SAL DOJ
1   2 Hash 100.0 01-Jan-2012
2 1 2 Robo 100.0 01-Jan-2012

INTERSECT

SELECT * FROM EMPLOYEE WHERE ID IN (2, 3, 5)
INTERSECT
SELECT * FROM EMPLOYEE WHERE ID IN (1, 2, 4, 5)
ID MGR_ID DEPT_ID NAME SAL DOJ
5 2 2 Anno 80.0 01-Feb-2012
2 1 2 Robo 100.0 01-Jan-2012

6. What is Self Join and why is it required?

Self Join is the act of joining one table with itself.

Self Join is often very useful to convert a hierarchical structure into a flat structure

In our employee table example above, we have kept the manager ID of each employee in the same row as that of the employee. This is an example of how a hierarchy (in this case employee-manager hierarchy) is stored in the RDBMS table. Now, suppose if we need to print out the names of the manager of each employee right beside the employee, we can use self join. See the example below:

SELECT e.name EMPLOYEE, m.name MANAGER
FROM EMPLOYEE e, EMPLOYEE m
WHERE e.mgr_id = m.id (+)
EMPLOYEE MANAGER
Pete Hash
Darl Hash
Inno Hash
Robo Hash
Tomiti Robo
Anno Robo
Privy Robo
Meme Pete
Bhuti Tomiti
Hash  

The only reason we have performed a left outer join here (instead of INNER JOIN) is we have one employee in this table without a manager (employee ID = 1). If we perform inner join, this employee will not show-up.

7. How can we transpose a table using SQL (changing rows to column or vice-versa) ?

The usual way to do it in SQL is to use CASE statement or DECODE statement.

8. How to generate row number in SQL Without ROWNUM

Generating a row number – that is a running sequence of numbers for each row is not easy using plain SQL. In fact, the method I am going to show below is not very generic either. This method only works if there is at least one unique column in the table. This method will also work if there is no single unique column, but collection of columns that is unique. Anyway, here is the query:

SELECT name, sal, (SELECT COUNT(*)  FROM EMPLOYEE i WHERE o.name >= i.name) row_num
FROM EMPLOYEE o
order by row_num
NAME SAL ROW_NUM
Anno 80 1
Bhuti 60 2
Darl 80 3
Hash 100 4
Inno 50 5
Meme 60 6
Pete 70 7
Privy 50 8
Robo 100 9
Tomiti 70 10

The column that is used in the row number generation logic is called “sort key”. Here sort key is “name” column. For this technique to work, the sort key needs to be unique. We have chosen the column “name” because this column happened to be unique in our Employee table. If it was not unique but some other collection of columns was, then we could have used those columns as our sort key (by concatenating those columns to form a single sort key).

Also notice how the rows are sorted in the result set. We have done an explicit sorting on the row_num column, which gives us all the row numbers in the sorted order. But notice that name column is also sorted (which is probably the reason why this column is referred as sort-key). If you want to change the order of the sorting from ascending to descending, you will need to change “>=” sign to “<=” in the query.

As I said before, this method is not very generic. This is why many databases already implement other methods to achieve this. For example, in Oracle database, every SQL result set contains a hidden column called ROWNUM. We can just explicitly select ROWNUM to get sequence numbers.

9. How to select first 5 records from a table?

This question, often asked in many interviews, does not make any sense to me. The problem here is how do you define which record is first and which is second. Which record is retrieved first from the database is not deterministic. It depends on many uncontrollable factors such as how database works at that moment of execution etc. So the question should really be – “how to select any 5 records from the table?” But whatever it is, here is the solution:

 

In Oracle,

SELECT * 
FROM EMP
WHERE ROWNUM <= 5;

In SQL Server,

SELECT TOP 5 * FROM EMP;

Generic solution,

I believe a generic solution can be devised for this problem if and only if there exists at least one distinct column in the table. For example, in our EMP table ID is distinct. We can use that distinct column in the below way to come up with a generic solution of this question that does not require database specific functions such as ROWNUM, TOP etc.

SELECT  name 
FROM EMPLOYEE o
WHERE (SELECT count(*) FROM EMPLOYEE i WHERE i.name < o.name) < 5
name
Inno
Anno
Darl
Meme
Bhuti

I have taken “name” column in the above example since “name” is happened to be unique in this table. I could very well take ID column as well.

In this example, if the chosen column was not distinct, we would have got more than 5 records returned in our output.

Do you have a better solution to this problem? If yes, post your solution in the comment.

10. What is the difference between ROWNUM pseudo column and ROW_NUMBER() function?

ROWNUM is a pseudo column present in Oracle database returned result set prior to ORDER BY being evaluated. So ORDER BY ROWNUM does not work.

ROW_NUMBER() is an analytical function which is used in conjunction to OVER() clause wherein we can specify ORDER BY and also PARTITION BY columns.

Suppose if you want to generate the row numbers in the order of ascending employee salaries for example, ROWNUM will not work. But you may use ROW_NUMBER() OVER() like shown below:

SELECT name, sal, row_number() over(order by sal desc) rownum_by_sal
FROM EMPLOYEE o
name Sal ROWNUM_BY_SAL
Hash 100 1
Robo 100 2
Anno 80 3
Darl 80 4
Tomiti 70 5
Pete 70 6
Bhuti 60 7
Meme 60 8
Inno 50 9
Privy 50 10

My Knowledge of Online Advertising Industry

It is common to see ads popping up when we are surfing on the Internet. Due to the fact that thousands of millions of people use the major search engines like Google and Yahoo daily, there is a great market to do the advertising directly on the search engine. However, outside of these big search networks, there still exist web visitors, the amount of which is also very enormous.

 

The goal of the online advertising companies is to optimize the advertising campaign performance through some technologies that assess the traffic quality of publishers other than the major search engines and place different ads according to the type of keywords, device and geographic characteristics.

 

1. Some basic terms for the online advertising industry: 

Impression, Click through, Conversion

CPM (cost per mille, Cost Per Thousand Impression)

CPC/PPC (cost per click/pay per click)

CPA/CPS (cost per action/cost per sale)

search portfolio ROI (return on investment)

KPI (Key Performance Indicators)

RPQ (Request price quotation)

bounce rate: the percentage of visitors who enter the site and “bounce” (leave the site) rather than continue viewing other pages within the same site.

 

2. Goals:

In a nutshell, Increase search portfolio ROI.

More detailed:    

  •    Find targeted customers and provide the most relevant and effective ads to your users
  •    Increase user engagement / on-site engagementon-site activity
  •    Extend search presence
  •    Optimize spend toward performing sources

 

3. Possible performance metrics:

Note that performance metrics can be very specific according to different campaigns.For example, for a campaign of car ads, we might be interested in the data amount of visitors that request a quote, locate closest dealer, search inventory, compare vehicles, etc.

 

4. Possible elements for an advertiser campaign report:

Revenue over time

Avg. CPC and CPM

Coverage

Queries

Yield (RPQ, CPM)

Clicks (total, invalid)

top 10 keywords received

invalid reasons pie chart

Ads delivered 

 

5. Possible Traffic Analyses besides the basic elements:

  • Geo-customization: detect the geographical origin of visitors

  • Event tracking: track the paths visitors follow inside the website, including the tracking of events such as downloads, or clicks to outside locations.

  • Referrer tracker

Prediction of a success probability using Beta distributions as prior and posterior

Tags

, ,

Let X be the number of patients out of the first 40 in a clinical trial who have success as their outcome. Let P be the probability that an individual patient is a success. The conditional pmf of X given P = p is the binomial distribution with n = 40, p=p.

As prior distribution for P, we use a uniform p.d.f. on the interval [0,1]. Uniform at [0,1] is a special case of a Beta distribution for α = 1, β = 1.

With a Beta (α, β) prior on P, and a conditional binomial distribution with parameters n, p for X, the posterior of P is again Beta with parameters α˜ = α+x, β˜ = n−x+β.

Therefore, the posterior probablity distribution of P is Beta with α=1+x, β=40-x+1. 

Now let us compute E(P|x) which is the best predictor of P when P has its posterior distribution. Since this distribution is Beta (α˜, β˜) with expectation E(P|x) = α˜/(α˜ +β˜), we obtain
E(P|x) = (α+x)/(α+x+n−x+β) = (1+x)/(1+ x + 40−x + 1)= (x+1)/42 

Note that this predictor is very close to the observed sample proportion of success from X: pˆ= x/40.

SQL Review

SQL Structured Query Language

(1) SORT DATA

select variables

from table

order by variable desc;

(2) FILTER DATA

between

is null

=  <>  <=

in ()   not in ()

(3) WILDCARD FILTER

Wildcard Description
% A substitute for zero or more characters        where var like ‘fish%’
_ A substitute for a single character                where var like ‘_ inch’
[charlist] Sets and ranges of characters to match         where name like ‘[JM]%’  first letter J or M
[^charlist] Matches only a character NOT specified within the brackets

(4) CONCATENATE

Oracle ||:  select city || ‘,’ || state || ‘,’ || zip as address

SQL Server +: select city + ‘,’ + state + ‘,’ + zip as address

trim padded spaces from right side with RTRIM() function : select RTRIM(city) || ‘,’ || RTRIM(state) || ‘,’ || RTRIM(zip)

(5) DATA MANIPULATION FUNCTION

len(), upper(), lower(), trim(), substr() Oracle / substring() SQL Server, sysdate Oracle/getdate() SQL Server, to_number(), to_char() Oracle / convert() SQL Server

soundex(): match all data that sound similar to the one you assign e.g. where soundex(name)=soundex(‘Mike Green’)

date: 1. SQL Server   where datepart (yy, order_date)=2001 and datepart (mm, order_date)=2

2. where order_date between to_date(’01-Jan-2001′) and to_date(’31-Dec-2001′)

(6) AGGREGATE FUNCTION

avg(), count(), max(), min(), sum(), avg(distinct variable)

(7) GROUP DATA

select ID, count(*) as orders

from Orders

group by ID

having count(*) >=2;

(8) JOIN TABLES

1) left join: returns all rows from the left table, with the matching rows in the right table. The result is NULL in the right side when there is no match.

SELECT column_name(s)
FROM table1
LEFT JOIN table2
ON table1.column_name=table2.column_name;

SQL LEFT JOIN

2) inner join is the same as JOIN: selects all rows from both tables as long as there is a match between the columns in both tables.

SELECT column_name(s)
FROM table1
INNER JOIN table2
ON table1.column_name=table2.column_name;

SQL INNER JOIN

3) full outer join: returns all rows from the left table and from the right table

SELECT column_name(s)
FROM table1
FULL OUTER JOIN table2
ON table1.column_name=table2.column_name;

SQL FULL OUTER JOIN

4) Union: combines the result of two or more SELECT statements.each SELECT statement within the UNION must have the same number of columns. The columns must also have similar data types. Also, the columns in each SELECT statement must be in the same order.

(9) INSERT DATA

insert into data1 (name, age) values(‘Peter’, 25);

insert into data1 (name, age) select name, age from data 2 where name=’Peter’;

copy one table to another:

select *                                 create table data1 as

into data1                             select *

from data2;                           from data2;  Oracle

(10) UPDATE AND DELETE DATA

update customers set cust_email=’spring14@yahoo.cn’ where cust_id=’001′;

delete from customers where cust_id=’001′;

delete table:  drop table customers;

create index prod_index on table(product);

(11) OTHERS

SELECT DISTINCT column_name,column_name
FROM table_name;

SELECT SUBSTRING(column_name,start,length) AS sth FROM table_name             e.g. substr(proc, -1, 1) not in (‘T’, ‘F’);

SELECT ROUND(column_name,decimals) FROM table_name                                  e.g. ROUND (PER25 * decode(RLV,0,100, RLV)*(1-c.fct),2);

increase speed: e.g. CREATE TABLE name PARALLEL 8 nologging compress for query high as

use partition instead of group by: e.g. SUM(FREQ) OVER (PARTITION BY PROC, MOD,GEOZIP)

create temporary table: e.g. WITH new_name AS (SELECT * from table where…) select * from new_name where…

case when: e.g.  SELECT CASE WHEN (N = 1 OR CHARGE_SD = 0) THEN NULL
ELSE MD3/((N-1)*POWER(CHARGE_SD,3)) END AS skewness from…

macro: SET DEFINE~  DEFINE VAR=’name’  select ~VAR from table;

nvl(): available in Oracle, and not in MySQL or SQL Server. replace NULL value with another value. ifnull() in MySQL and the isnull() in SQL Server.

e.g. nvl(trim(a.mod),’0′) not in (’20’, ’22’, ‘SG’)

references: http://www.w3schools.com/sql/default.asp

(12) PRACTICE

There’s no better way to improve SQL skills than to practice with some real SQL interview questions. The following SQL practice exercises were actually taken from real interview tests with Google and Amazon.

1. The names of all salespeople that have an order with Samsonic.

2. The names of all salespeople that do not have any order with Samsonic.

3. The names of salespeople that have 2 or more orders.

4. Write a SQL statement to insert rows into a table called highAchiever(Name, Age), where a salesperson must have a salary of 100,000 or greater to be included in the table.

1. select a.name from Salesperson a, Customer b, Orders c

where b.name = “Samsonic”

and b.ID=c.cust_id

and c.salesperson_id = a. ID;

2. select a.name from Salesperson a, Customer b, Orders c

where b.name not in “Samsonic”

and b.ID=c.cust_id

and c.salesperson_id = a. ID;

3. select a. name, count(c.salesperson_id) from Salesperson a, Orders c

where count(c.salesperson_id)>1

and c.salesperson_id = a. ID;

4. insert into highAchiever (Name, Age)

(select Name, Age from Salesperson where salary>100,000);

create table highAchiever as

select Name, Age from Salesperson

where salary>100,000;

Salesperson Customer
ID Name Age Salary
1 Abe 61 140000
2 Bob 34 44000
5 Chris 34 40000
7 Dan 41 52000
8 Ken 57 115000
11 Joe 38 38000
     ID Name City Industry   Type
4 Samsonic pleasant J
6 Panasung oaktown J
7 Samony jackson B
9 Orange Jackson B
Orders
Number order_date cust_id salesperson_id Amount
10 8/2/96 4 2 540
20 1/30/99 4 8 1800
30 7/14/95 9 1 460
40 1/29/98 7 2 2400
50 2/3/98 6 7 600
60 3/2/98 6 7 720
70 5/6/98 9 7 150

Python Glossary from Codecademy

Class 

Python is an Language that supports the Object Oriented Programming paradigm. Like other OOP languages, Python has classes which are defined wireframes of objects. Python supports class inheritance. A class may have many subclasses but many only inherit directly from one superclass.

Syntax
class ClassName(object):
    """This is a class"""
    class_variable
    def __init__(self,*args):
        self.args = args
    def __repr__(self):
        return "Something to represent the object as a string"
    def other_method(self,*args):
        //do something else
Example
class Horse(object):
    """Horse represents a Horse"""
    species = "Equus ferus caballus"
    def __init__(self,color,weight,wild=False):
        self.color = color
        self.weight = weight
        self.wild = wild
    def __repr__(self):
        return "%s horse weighing %f and wild status is %b" (self.color,self.weight,self.wild)
    def make_sound(self):
        print("neighhhh")
    def movement(self):
        return "walk"
Syntax
class ClassName(SuperClass):
    //same as above
    //use 'super' keyword to get from above
Example
class RaceHorse(Horse):
    """A faster horse that inherits from Horse"""
    def movement(self):
        return "run"
    def movement_slow(self):
        super(Horse,self).movement()
    def __repr__(self):
        return "%s race horse weighing %f and wild status is %b" (self.color,self.weight,self.wild)

>> horse3 = RaceHorse("white",200)
>> print horse3.movement_slow()
"walk"
>> print horse3.movement()
"run"

Comments 

Multi-line Comments 

Some comments need to span several lines, use this if you have more than 4 single line comments in a row.

Example
'''
this is
a multi-line
comment, i am handy for commenting out whole
chunks of code very fast
'''

Single-line Comments 

Augmenting code with human readable descriptions can help document design decisions.

Example
# this is a single line comment.

Dictionaries 

Dictionaries are Python’s built-in associative data type. A dictionary is made of key-value pairs where each key corresponds to a value. Like sets, dictionaries are unordered. A few notes about keys and values: * The key must be immutable and hashable while the value can be of any type. Common examples of keys are tuples, strings and numbers. * A single dictionary can contain keys of varying types and values of varying types.

Syntax
dict() #creates new empty dictionary
{} #creates new empty dictionary
Example
>> my_dict = {}
>> content_of_value1 = "abcd"
>> content_of_value2 = "wxyz"
>> my_dict.update({"key_name1":content_of_value1})
>> my_dict.update({"key_name2":content_of_value2})
>> my_dict
{'key_name1':"abcd", 'key_name2':"wxyz"}
>> my_dict.get("key_name2")
"wxyz"
Syntax
{key1:value1,key2:value2}
Example
>> my_dict = {"key1":[1,2,3],"key2":"I am a string",123:456}
>> my_dict["key1"] #[1,2,3]
>> my_dict[123] #456
>> my_dict["new key"] = "New value"
>> print my_dict
{"key2":"I am a string", "new key":"New value", "key1":[1,2,3],123:456}

Functions 

Python functions can be used to abstract pieces of code to use elsewhere.

Syntax
def function_name(parameters):
  # Some code here
Example
def add_two(a, b):
  c = a + b
  return c

# or without the interim assignment to c
def add_two(a, b):
  return a + b
Syntax
def function_name(parameters, named_default_parameter=value):
  # Some code here
Example
def shout(exclamation="Hey!"):
  print exclamation

shout() # Displays "Hey!"

shout("Watch Out!") # Displays "Watch Out!"

Function Objects 

Python functions are first-class objects, which means that they can be stored in variables and lists and can even be returned by other functions.

Example
# Storing function objects in variables:

def say_hello(name):
  return "Hello, " + name

foo = say_hello("Alice")
# Now the value of 'foo' is "Hello, Alice"

fun = say_hello
# Now the value of 'fun' is a function object we can use like the original function:
bar = fun("Bob")
# Now the value of 'bar' is "Hello, Bob"
Example
# Returning functions from functions

# A simple function
def say_hello(greeter, greeted):
  return "Hello, " + greeted + ", I'm " + greeter + "."

# We can use it like this:
print say_hello("Alice", "Bob") # Displays "Hello, Bob, I'm Alice."

# We can also use it in a function:
def produce_greeting_from_alice(greeted):
  return say_hello("Alice", greeted)

print produce_greeting_from_alice("Bob") # Displays "Hello, Bob, I'm Alice."

# We can also return a function from a function by nesting them:
def produce_greeting_from(greeter):
  def greet(greeted):
    return say_hello(greeter, greeted)
  return greet

# Here we create a greeting function for Eve:
produce_greeting_from_eve = produce_greeting_from("Eve")
# 'produce_greeting_from_eve' is now a function:
print produce_greeting_from_eve("Alice") # Displays "Hello, Alice, I'm Eve."

# You can also invoke the function directly if you want:
print produce_greeting_from("Bob")("Eve") # Displays "Hello, Eve, I'm Bob."
Example
# Using functions in a dictionary instead of long if statements:

# Let's say we have a variable called 'current_action' and we want stuff to happen based on its value:

if current_action == 'PAUSE':
  pause()
elif current_action == 'RESTART':
  restart()
elif current_action == 'RESUME':
  resume()

# This can get long and complicated if there are many values.
# Instead, we can use a dictionary:

response_dict = {
  'PAUSE': pause,
  'RESTART': restart,
  'RESUME': resume
}

response_dict[current_action]() # Gets the correct function from response_dict and calls it

len() 

Using len(some_object) returns the number of _top-level_ items contained in the object being queried.

Syntax
len(iterable)
Example
>> my_list = [0,4,5,2,3,4,5]
>> len(my_list)
7

>> my_string = 'abcdef'
>> len(my_string)
6

List Comprehensions 

Convenient ways to generate or extract information from lists.

Syntax
[variable for variable in iterable condition]
[variable for variable in iterable]
Example
>> x_list = [1,2,3,4,5,6,7]
>> even_list = [num for num in x_list if (num % 2 == 0)]
>> even_list
[2,4,6]

>> m_list = ['AB', 'AC', 'DA', 'FG', 'LB']
>> A_list = [duo for duo in m_list if ('A' in duo)]
>> A_list
['AB', 'AC', 'DA']

Lists 

A Python data type that holds an ordered collection of values, which can be of any type. Lists are Python’s ordered mutable data type. Unlike tuples, lists can be modified in-place.

Example
>> x = [1, 2, 3, 4]
>> y = ['spam', 'eggs']
>> x
[1, 2, 3, 4]
>> y
['spam','eggs']

>> y.append('mash')
>> y
['spam', 'eggs', 'mash']

>> y += ['beans']
>> y
['spam', 'eggs', 'mash', 'beans']

Loops 

For Loops 

Python provides a clean iteration syntax. Note the colon and indentation.

Example
>> for i in range(0, 3):
>>     print(i*2)
0
2
4

>> m_list = ["Sir", "Lancelot", "Coconuts"]
>> for item in m_list:
>>     print(item)
Sir
Lancelot
Coconuts

>> w_string = "Swift"
>> for letter in w_string:
>>     print(letter)
S
w
i
f
t

While Loops 

A While loop permits code to execute repeatedly until a certain condition is met. This is useful if the number of iterations required to complete a task is unknown prior to flow entering the loop.

Syntax
while condition:
    //do something
Example
>> looping_needed = True
>>
>> while looping_needed:
>>     # some operation on data
>>     if condition:
>>          looping_needed = False

print 

range() 

The range() function returns a list of integers, the sequence of which is defined by the arguments passed to it.

Syntax
argument variations:
range(terminal)
range(start, terminal)
range(start, terminal, step_size)
Example
>> range(4)
[0, 1, 2, 3]

>> range(2, 8)
[2, 3, 4, 5, 6, 7]

>> range(2, 13, 3)
[2, 5, 8, 11]

Sets 

Sets are collections of unique but unordered items. It is possible to convert certain iterables to a set.

Example
>> new_set = {1, 2, 3, 4, 4, 4,'A', 'B', 'B', 'C'}
>> new_set
{'A', 1, 'C', 3, 4, 2, 'B'}

>> dup_list = [1,1,2,2,2,3,4,55,5,5,6,7,8,8]
>> set_from_list = set(dup_list)
>> set_from_list
{1, 2, 3, 4, 5, 6, 7, 8, 55}

Slice 

A Pythonic way of extracting “slices” of a list using a special bracket notation that specifies the start and end of the section of the list you wish to extract. Leaving the beginning value blank indicates you wish to start at the beginning of the list, leaving the ending value blank indicates you wish to go to the end of the list. Using a negative value references the end of the list (so that in a list of 4 elements, -1 means the 4th element). Slicing always yields another list, even when extracting a single value.

Example
>> # Specifying a beginning and end:
>> x = [1, 2, 3, 4]
>> x[2:3]
[3]

>> # Specifying start at the beginning and end at the second element
>> x[:2]
[1, 2]

>> # Specifying start at the next to last element and go to the end
>> x[-2:]
[3, 4]

>> # Specifying start at the beginning and go to the next to last element
>> x[:-1]
[1, 2, 3]

>> # Specifying a step argument returns every n-th item
>> y = [1, 2, 3, 4, 5, 6, 7, 8]
>> y[::2]
[1, 3, 5, 7]

>> # Return a reversed version of the list ( or string )
>> x[::-1]
[4, 3, 2, 1]

>> # String reverse
>> my_string = "Aloha"
>> my_string[::-1]
"aholA"

str() 

Using the str() function allows you to represent the content of a variable as a string, provided that the data type of the variable provides a neat way to do so. str() does not change the variable in place, it returns a ‘stringified’ version of it. On a more technical note, `str()` calls the special `__str__` method of the object passed to it.

Syntax
str(object)
Example
>> # such features can be useful for concatenating strings
>> my_var = 123
>> my_var
123

>> str(my_var)
'123'

>> my_booking = "DB Airlines Flight " + str(my_var)
>> my_booking
'DB Airlines Flight 123'

Strings 

Strings store characters and have many built-in convenience methods that let you modify their content. Strings are immutable, meaning they cannot be changed in place.

Example
>> my_string1 = "this is a valid string"
>> my_string2 = 'this is also a valid string'
>> my_string3 = 'this is' + ' ' + 'also' + ' ' + 'a string'
>> my_string3
"this is also a string"

Tuples 

A Python data type that holds an ordered collection of values, which can be of any type. Python tuples are “immutable,” meaning that they cannot be changed once created.

Example
>> x = (1, 2, 3, 4)
>> y = ('spam', 'eggs')

>> my_list = [1,2,3,4]
>> my_tuple = tuple(my_list)
>> my_tuple
(1, 2, 3, 4)

Tuple Assignment 

Tuples can be expanded into variables easily.

Example
name, age = ("Alice", 19)
# Now name has the value "Alice" and age has the value 19

# You can also omit the parentheses:
name, age = "Alice", 19

Variables 

Variables are assigned values using the ‘=’ operator, which is not to be confused with the ‘==’ sign used for testing equality. A variable can hold almost any type of value such as lists, dictionaries, functions.

Example
>> x = 12
>> x
12

Python Review More~~

1. Basic

iteration

  • dict.items(): the Python function items()will iterate over a dictionary and return those key/value pairs.
  • dict.keys():  the keys() function returns an array of the dictionary’s keys
  • dict.values(): the values() function returns an array of the dictionary’s values
  • with comma, we can print in one line instead of several
    • for letter in "Eric":
          print letter,  # note the comma!
      # => "E" "r" "i" "c"
  • list comprehension: a powerful way to generate lists using the for/in and if keywords e.g. evens_to_50 = [i for i in range(51) if i % 2 == 0]
  • list slicing: [start:end:stride] Where start describes where the slice starts (inclusive), end is where it ends (exclusive), and stride describes the space between items in the sliced list.
  • lambda: anonymous function  lambda x: x % 3 == 0
  • filter(condition, list)
  • for i in range(1,11): numbers 1 -10, left inclusive, right exclusive

bitwise operators

  • 5 >> 4  # Right Shift      0
    5 << 1  # Left Shift        10
    8 & 5   # Bitwise AND     0
    9 | 4   # Bitwise OR        13
    12 ^ 42 # Bitwise XOR   38
    ~88     # Bitwise NOT     -89
  • print 0b1 + 0b11          4
    print 0b11 * 0b11        9
  • (1) bin()/oct()/hex() function: print number into binary/base 8/base 16
  • e.g. bin(2)=10  bin(10)=1010
  • (2) int() function: return the value of that number in base ten
  • e.g. int(“10”,2)=2  int(bin(5),2)=5  int(“0b100”,2)=4
  • (3) shift: shifting all the 1s and 0s left or right by the specified number of slots
  • e.g. 0b1100>>2 = 0b11    0b1<<2 = 0b100
  • (4) & function: turn on if the corresponding bits of both numbers are 1
  • e.g. 0b111 & 0b1010 = 0b10     0 & 0 = 0 0 & 1 = 0 1 & 0 = 0 1 & 1 = 1
  • (5) | function: turn on if either of the corresponding bits of either number are 1
  • e.g. 0b110 | 0b1010 = 0b1110   0 | 0 = 0 0 | 1 = 1 1 | 0 = 1 1 | 1 = 1
  • (6) ^ function: turn on if either of the bits of the two numbers are 1,but not both
  • e.g. 0b1110 ^ 0b101=0b1011  0 ^ 0 = 0 0 ^ 1 = 1 1 ^ 0 = 1 1 ^ 1 = 0
  • (7) ~ function: equal to adding one to the number and then making it negative
  • e.g. ~1 =-2   ~123=-124

Class

  • class ClassName(object):
  • __init__(self)
  • instance:  newObject= ClassName(“Peter”, “Red”, 20)
  • Member variable: information that belongs to the class object
  • Inheritance:  one class takes on the attributes and methods of another
  • super() : access the attributes or methods of a superclass
class DerivedClass(Base/SuperClass):
   def some_method(self, args):
       super(DerivedClass, self).method(args)

Input/Output

  • open() function: f = open("output.txt", "w")      w–write,  r–read-only,             r+ — read and write,  a–append
  • .write() function: take string argument
  • .close() function
  • my_file.closed: True when the file is closed andFalse otherwise                            if my_file.closed==False:
    my_file.close()
    print my_file.closed
  • __exit__(): automatically close our files    with   as                                                with open(“text.txt”, “w”) as textfile:
    textfile.write(“Success!”)

2. Example

1. simple decoding

garbled = “!XeXgXaXsXsXeXmX XtXeXrXcXeXsX XeXhXtX XmXaX XI”
message=garbled[::-2]
print message

garbled = “IXXX aXXmX aXXXnXoXXXXXtXhXeXXXXrX sXXXXeXcXXXrXeXt mXXeXsXXXsXaXXXXXXgXeX!XX”
message=filter(lambda i: i!=”X”,garbled)
print message

2. turn on or turn off binary

(1) mask & : turn on

def check_bit4(x):
mask=0b1000
desired=x&mask
if desired >0:
return “on”
else:
return “off”

(2)mask | : turn on

a = 0b10111011
mask=0b100
result=a|mask
print bin(result)

(3)mask ^ : flip out

a = 0b11101110
mask =0b11111111
print bin(a^mask)

(4)mask ^ slide: flip out

def flip_bit(number,n):
mask=(0b1<<(n-1))
return bin(number^mask)

3. Class Overview

(1) Fruit example

class Fruit(object):
“””A class that makes various tasty fruits.”””
def __init__(self, name, color, flavor, poisonous):
self.name = name
self.color = color
self.flavor = flavor
self.poisonous = poisonous

def description(self):
print “I’m a %s %s and I taste %s.” % (self.color, self.name, self.flavor)

def is_edible(self):
if not self.poisonous:
print “Yep! I’m edible.”
else:
print “Don’t eat me! I am super poisonous.”

lemon = Fruit(“lemon”, “yellow”, “sour”, False)

lemon.description()
lemon.is_edible()

(2) Shopping Cart example

class ShoppingCart(object):
“””Creates shopping cart objects
for users of our fine website.”””
items_in_cart = {}
def __init__(self, customer_name):
self.customer_name = customer_name

def add_item(self, product, price):
“””Add product to the cart.”””
if not product in self.items_in_cart:
self.items_in_cart[product] = price
print product + ” added.”
else:
print product + ” is already in the cart.”

def remove_item(self, product):
“””Remove product from the cart.”””
if product in self.items_in_cart:
del self.items_in_cart[product]
print product + ” removed.”
else:
print product + ” is not in the cart.”

my_cart=ShoppingCart(“Alice”)
my_cart.add_item(“Electric Toothbrush”, 35)

(3) Employee wage example

class Employee(object):
“””Models real-life employees!”””
def __init__(self, employee_name):
self.employee_name = employee_name

def calculate_wage(self, hours):
self.hours = hours
return hours * 20.00

# Add your code below!
class PartTimeEmployee(Employee):
def calculate_wage(self, hours):
self.hours = hours
return hours * 12.00
def full_time_wage(self, hours):
return super(PartTimeEmployee, self).calculate_wage(hours)

milton = PartTimeEmployee(“Milton”)
print milton.full_time_wage(10)

4. Input/Output

(1) Output

my_list = [i**2 for i in range(1,11)]

f = open(“output.txt”, “w”)

for item in my_list:
f.write(str(item) + “\n”)

f.close()

(2) Input: read

my_file=open(“output.txt”,”r”)

print my_file.read()

my_file.close()

(3) Input: Write

with open(“text.txt”,”w”) as my_file:
my_file.write(“Almost done!”)

Python Review More~

1. Basic

loops

  • pay attention to infinite loop for while
  • enumerate(): counting index for the listchoices = [‘pizza’, ‘pasta’, ‘salad’, ‘nachos’]
    print ‘Your choices are:’
    for index, item in enumerate(choices):
    print index+1, item
  • zip(): multiple lists,  create pairs of elements when passed two lists, and will stop at the end of the shorter list.list_a = [3, 9, 17, 15, 19]
    list_b = [2, 4, 8, 10, 30, 40, 50, 60, 70, 80, 90]for a, b in zip(list_a, list_b):
    # Add your code here!
    if a>b:
    print a
    else:
    print b

functions

  • string.split()
  • ” “.join(list)
  • string.replace(word, anotherword)
  • string.index(i)
  • integer division:an integer divided by an integer will always return an integer. So 1/2 will return 0 instead of 0.5. To fix this, we’ll need to make use of the float() function. float(x)/y will correctly use normal division and so float(1)/2 will correctly return 0.5. Or use 1/2.0

2. Example

1. Board game

from random import randint

board = []

for x in range(5):
board.append([“O”] * 5)

def print_board(board):
for row in board:
print ” “.join(row)

print “Let’s play Battleship!”
print_board(board)

def random_row(board):
return randint(0, len(board) – 1)

def random_col(board):
return randint(0, len(board[0]) – 1)

ship_row = random_row(board)
ship_col = random_col(board)

for turn in range(4):
# Everything from here on should go in your for loop!
# Be sure to indent four spaces!
guess_row = int(raw_input(“Guess Row:”))
guess_col = int(raw_input(“Guess Col:”))

if guess_row == ship_row and guess_col == ship_col:
print “Congratulations! You sunk my battleship!”
break
else:
if (guess_row < 0 or guess_row > 4) or (guess_col < 0 or guess_col > 4):
print “Oops, that’s not even in the ocean.”
elif(board[guess_row][guess_col] == “X”):
print “You guessed that one already.”
else:
if turn==3:
print “Game Over”
else:
print “You missed my battleship!”
board[guess_row][guess_col] = “X”

# Print (turn + 1) here!
print turn+1
print_board(board)

2. Guess a number from 1-9

from random import randrange

random_number = randrange(1, 10)

count = 0
# Start your game!
while count<3:
guess = int(raw_input(“Enter a guess:”))
count+=1
if guess==random_number:
print ‘You win.’
break
else:
print ‘You lose!’
print random_number

3. is_even function

def is_even(x):
if x%2==0:
return True
else:
return False

4. is_int function

def is_int(x):
if x-round(x)==0:
return True
else:
return False

5. digit_sum function

def digit_sum(n):
lst=[]
for i in str(n):
lst.append(int(i))
return sum(lst)

print digit_sum(1234)

6. factorial function

def factorial(x):
if x < 2:
return 1
else:
total = 1
for i in range (2, x+1):
total = total * i
return total

7. is_prime function

def is_prime(x):
if x < 2:
return False

else:
for i in range(2, x):
if x % i == 0:
return False
return True

8. string reverse function

def reverse(text):
word = “”
y = len(text) – 1
for i in range(y, -1, -1):
word += str(text[i])
return word

9. anti-vowel function:

def anti_vowel(text):
store = []
for letter in text:
if letter.lower() not in “aeiou”:
store.append(letter)
return “”.join(store)

text = “Hey look words!”

print anti_vowel(text)

10. add score function

score = {“a”: 1, “c”: 3, “b”: 3, “e”: 1, “d”: 2, “g”: 2, “f”: 4, “i”: 1, “h”: 4, “k”: 5, “j”: 8, “m”: 3, “l”: 1, “o”: 1, “n”: 1, “q”: 10, “p”: 3, “s”: 1, “r”: 1, “u”: 1, “t”: 1, “w”: 4, “v”: 4, “y”: 4, “x”: 8, “z”: 10}

def scrabble_score(word):
word = word.lower()
total=0
for i in word:
total=total+score[i]
return total

11. replace function

def censor(text, word):
return text.replace(word, ‘*’ * len(word))

print censor(“this hack is wack hack”,”hack”)

def censor(text,word):
text = text.split()
for i in range(len(text)):
if text[i] == word:
text[i] = “*” * len(word)
return ” “.join(text)

12. count repetition function

def count(sequence, item):
count=0
for i in sequence:
if i==item:
count+=1
return count

print count([1,2,1,1], 1)

13. remove odd function

def purify(lst):
new_lst=[]
for i in lst:
if i%2==0:
new_lst.append(i)
return new_lst

14. product function

def product(lst):
n=1
for i in range(len(lst)):
n=n*lst[i]
return n

15. remove duplicates function

def remove_duplicates(lst):
total=[]
for i in lst:
if i not in total:
total.append(i)
return total

16. median function

def median(lst):
lst=sorted(lst)
if len(lst)%2==0:
return (lst[len(lst)/2-1]+lst[len(lst)/2])/2.0
else:
return lst[(len(lst)+1)/2-1]