Visualization of Crime in North Carolina

Technologies Used: Python, MongoDB, D3.js, Bootstrap, Flask

This is another project in the crime visualization series. This time we are visualizing the crimes that were committed in North Carolina. Unlike the previous project this one is of smaller magnitude, but it still manages to tell us a lot about how crimes are distributed and how the attributes are correlated. This time we are analyzing the type of people who committed the crimes rather than the type of crimes. Unlike the previous project, which was a group project, this was an individual project. This project was also built entirely using Python and D3.js.

This project involves more backend processing in Python, involving Principal Component Analysis (PCA) and Multidimensional Scaling (MDS). We will still see a lot of plots, though, drawn using D3.js.

The first part of the project was to sample the data. Sampling, as we know, is the process of choosing a subset of the data that represents the entire dataset. The main reason sampling was introduced was to reduce processing time. The sampling technique followed was stratified sampling. For this we first have to form the right number of clusters. In order to identify the number of clusters we use the K-means technique, which tells us the number of clusters that would be ideal for the given dataset. The value of k can be obtained by plotting the average distance against the number of clusters; the graph forms an elbow, and the point where the elbow forms gives us the value of k. The following code computes the average distance for each candidate k; a sketch of the plotting step follows it.


import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

clusters = range(1, 10)   # candidate values of k (illustrative range)
meandist = []
# average distance of each observation to its closest cluster centre, for each k
for k in clusters:
	model = KMeans(n_clusters=k)
	model.fit(clus_train)
	meandist.append(sum(np.min(cdist(clus_train, model.cluster_centers_, 'euclidean'), axis=1)) / clus_train.shape[0])
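
The plotting step itself is not shown above; since the project draws most of its plots with D3.js, the snippet below is only a minimal sketch of the same elbow graph on the Python side using matplotlib, with labels of my own choosing.


import matplotlib.pyplot as plt

# Elbow plot: average distance to the closest cluster centre vs. number of clusters
plt.plot(list(clusters), meandist, marker='o')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Average distance to closest cluster centre')
plt.title('Selecting k with the elbow method')
plt.show()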

It produces the following graph

From the elbow plot we can see that an elbow is formed at 3, which means the ideal number of clusters is 3. We then call the KMeans function once again, this time with the number of clusters set to 3. We then randomly sample the data from each cluster and build a new DataFrame: we iterate over each cluster, sample it randomly, and finally congregate the samples into one DataFrame, as shown in the algorithm below.
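
The sampling loop below relies on a mapping from each cluster to the rows assigned to it, which the original write-up does not show. This is a minimal sketch of that step; the names model3 and cluster_dict are mine, chosen to match the loop that follows.


import random
import pandas as pd
from sklearn.cluster import KMeans

# Re-fit K-means with the chosen number of clusters (3) on the training frame
model3 = KMeans(n_clusters=3)
model3.fit(clus_train)

# cluster_dict maps each cluster label to the row positions assigned to it
cluster_dict = {c: list(np.where(model3.labels_ == c)[0]) for c in range(3)}

The sampling loop itself then looks like this.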


cluster_sample = {}
df = pd.DataFrame()
for i in range(0, 3):
	length = len(cluster_dict[i])
	# draw a third of each cluster at random (stratified sampling)
	cluster_sample[i] = random.sample(cluster_dict[i], length // 3)
	for k in cluster_sample[i]:
		df = df.append(clus_train.iloc[[k]], ignore_index=True)

Once we have finished clustering we perform Principal Component Analysis. Principal Component Analysis is a dimension reduction technique that works by forming linear combinations of features and then identifying the correlated and uncorrelated features among them. This enables us to keep only those features that are highly uncorrelated, as they contain more information about the data. For Principal Component Analysis we first specify the number of columns in our dataset as the parameter; this is then followed by fitting the DataFrame.
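
The PCA fit itself is not shown in the original write-up, so the snippet below is only a minimal sketch of it. It assumes sampled_dataFrame is the sampled DataFrame built above (with 19 numeric columns) and that loadings, used by the function further down, holds pca.components_ (the component-by-feature weight matrix).


from sklearn.decomposition import PCA

# Fit PCA with as many components as there are columns in the sampled data
pca = PCA(n_components=len(sampled_dataFrame.columns))
pca.fit(sampled_dataFrame)

# loadings[j][i] is the weight of feature i on principal component j
loadings = pca.components_

We then identify the squared loadings using the following function.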


def squaredLoadings():
	# For each of the 19 attributes, sum the squared loadings over the top 3 components
	h = 19
	squaredLoadings = [0 for y in range(h)]
	for i in range(h):
		total = 0
		for j in range(3):
			total = total + loadings[j][i] ** 2
		squaredLoadings[i] = total
	return squaredLoadings

We then plot a scree plot, which is basically a plot of the variance explained against the principal components. Such a plot results in an elbow-shaped curve similar to the one obtained in K-means clustering. We can then deduce the number of significant principal components either from the point where the elbow is formed or from the point where the variance (eigenvalue) falls below 1. From the graph below it is clear that the number of significant principal components is 3. This is the value used in the sum of squared loadings computation described above.
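
The original draws its plots with D3.js; as a minimal illustration, here is the same scree plot sketched on the Python side with matplotlib, assuming the fitted pca object from the sketch above and taking the eigenvalues from explained_variance_.


import matplotlib.pyplot as plt

# Scree plot: eigenvalue (explained variance) of each principal component
components = range(1, len(pca.explained_variance_) + 1)
plt.plot(list(components), pca.explained_variance_, marker='o')
plt.axhline(y=1, linestyle='--')   # eigenvalue = 1 cut-off
plt.xlabel('Principal component')
plt.ylabel('Eigenvalue (explained variance)')
plt.title('Scree plot')
plt.show()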

Another task of this project was to project the dataset onto the top two PCA vectors obtained from the loadings and plot the two resulting one-dimensional arrays as a scatter plot. This is basically done by setting the number of components to 2 while initializing the PCA, which is then followed by a fit and transform of the DataFrame. The resultant matrix is then plotted as a scatter plot. Each scatter point is a small circle with a set radius; we draw it by specifying the coordinates of the circle's centre, which are given by the two arrays obtained from the projection.
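
This is a minimal sketch of that projection step, again assuming sampled_dataFrame is the sampled data; xVal and yVal are the two arrays handed to the D3 snippet that follows, and the name pca2 is mine.


from sklearn.decomposition import PCA

# Project the sampled data onto the top two principal components
pca2 = PCA(n_components=2)
projected = pca2.fit_transform(sampled_dataFrame)   # shape: (n_samples, 2)
xVal = projected[:, 0].tolist()
yVal = projected[:, 1].tolist()

A typical scatter plot over these two arrays can then be drawn in D3 using the following code snippet.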


svg.selectAll(".dot")
.data(d3.zip(xVal,yVal))
.enter().append("circle")
.attr("class", "dot")
.attr("r", 3.5)
.attr("cx", function(d,i) { return xScale(d[0]);})
.attr("cy", function(d) { return yScale(d[1]);})

We basically create a circle tag and specify the class name as dot. Then we specify the x and y coordinates of the centre of each circle, which take the values of the x and y arrays obtained from the loadings. A tooltip is placed so as to indicate the value of each point when the mouse is hovered over it. The following is the scatter plot obtained for PCA.

We also perform dimension reduction using MDS (Multidimensional Scaling). We perform the MDS dimensionality reduction using two distance metrics:

  1. Euclidean distance
  2. Correlation distance

The following syntax indicates how to perform MDS scaling on a dataset using Euclidean distance.


mdsData = MDS(n_components=2, dissimilarity='euclidean')   # MDS from sklearn.manifold
return mdsData.fit_transform(sampled_dataFrame.values)

It basically calculates the Euclidean distance between the data points and identifies the top two significant components. For the case of correlation distance we set it up as follows.


mdsData = MDS(n_components=2, dissimilarity='precomputed')
precompute = pairwise_distances(sampled_dataFrame.values, metric='correlation')   # from sklearn.metrics
return mdsData.fit_transform(precompute)

We then visualise the two dimensions to which the features have been reduced via a scatter plot, as shown below. For the purpose of animation we register each dot with a tooltip, enlarge the dot's radius on mouseover and shrink it back on mouseout.

We then create a scatter plot matrix wherein we visualize the top three columns that have the highest squared loadings. The following code snippet is used for this purpose.


def getColumnData():
	# Pick the three columns with the highest sum of squared loadings
	sortedSumSquareLoadings = sorted(sumSquareLoadings, reverse=True)
	columns = [0 for y in range(3)]
	columnsVals = {}
	index = 0
	for i in sortedSumSquareLoadings:
		columns[index] = clus_train.columns.values[sumSquareLoadings.index(i)]
		index = index + 1
		if index == 3:
			break
	# Map each selected column name to its values in the sampled DataFrame
	for i in range(3):
		columnsVals.update({columns[i]: sampled_dataFrame.loc[:, columns[i]].tolist()})
	return render_template("scatterMatrix.html", dataVal=columnsVals, traits=columns)

We first sort the squared loadings in descending order and then form a dictionary with three keys, each representing the name of a column with the highest sum of squared loadings, in descending order. We then pass this data to the HTML page. For visualization we write a cross function which gives the combinations of columns that we can visualize. In the cross function below we basically create 9 objects (3 x 3), one for each combination of a column with another, and we then create 9 different SVGs.


function cross(a, b) {
	var c = [], n = a.length, m = b.length, i, j;
	// Build one {x, y} pair for every combination of a column in a with a column in b
	for (i = -1; ++i < n;) for (j = -1; ++j < m;)
		c.push({x: a[i], i: i, y: b[j], j: j});
	return c;
}

The plot below represents the scatter plot matrix obtained for the top three columns with the highest squared loadings.

And that concludes the project. We have seen numerous ways to get a relation between variables and have seen how effective PCA and MDS can be in identifying them. We also got to see how stratified sampling is done and what benefits it can offer.