An npm Network Analysis

Introduction

How fragile is the modern web development ecosystem? How reliant are popular websites on external libraries? These questions loom over the modern web ecosystem, especially for web developers at large companies, because almost every website relies on external libraries downloaded through package managers such as npm.

Although these package managers provide an overall positive service to the web development community, there have been several cases of package outages causing development issues, and these outages also affect the users of the large sites that depend on them. This is why we collected and analyzed data about the dependencies between npm packages.

For a more detailed description of our goal, see our initial project proposal.

Technical Description

To begin addressing these questions, we first needed to gather a meaningful data set and construct an appropriate network. Namely, we are interested in the JavaScript packages hosted on npm and their dependencies. In the following sections, we will outline how we have done this and why we have chosen to do it in this manner.

Methodology

First, we needed to establish the fundamental model of the network we were creating. It was clear that the entities of our network would represent packages hosted on npm. To create a network model that most directly addresses the original problems posed, we then established that the relationships in our network would denote one package depending on another. Because package dependencies are uni-directional, our network is represented by a directed graph.

Miniature example of our network model
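To make the model concrete, the snippet below builds a toy version of this graph with networkx; the package names and the choice of networkx here are purely illustrative.

import networkx as nx

# A toy version of our network model: nodes are npm packages and a
# directed edge A -> B means that package A depends on package B.
# (The package names below are illustrative.)
G = nx.DiGraph()
G.add_edge("my-app", "lodash")      # my-app depends on lodash
G.add_edge("my-app", "request")     # my-app depends on request
G.add_edge("request", "form-data")  # request depends on form-data

print(G.number_of_nodes(), "packages,", G.number_of_edges(), "dependencies")
print("Packages depending on form-data:", list(G.predecessors("form-data")))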

Next, we began collecting the data. Unfortunately, npm has deprecated the /-/all registry endpoint, which had previously served as a list of all packages currently hosted on npm. Although reasonable, given that npm recently surpassed 600,000 packages, this meant that we had to crawl and collect our own subset of packages. As such, we wrote an npm_crawler.py script that crawls using a modified snowball sampling method.

Our sampling script would begin at a random package and iteratively crawl to dependencies and dependents, storing all associated package info and all found package names. This process would repeat until all connected packages were gathered, and then the search would restart from a random package. This ensured that we were also collecting the packages with no dependencies. More info regarding our data is available below.
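The full crawler is in our repo; the fragment below is only a simplified sketch of the dependency-following half of that process, assuming the public registry endpoint https://registry.npmjs.org/<name>. Collecting dependents requires a separate lookup and is omitted, and the seed package and limit are arbitrary.

import json
from urllib.request import urlopen

REGISTRY = "https://registry.npmjs.org/{}"

def fetch_package(name):
    # Fetch a package's registry document and return the metadata of its latest version.
    with urlopen(REGISTRY.format(name)) as resp:
        doc = json.load(resp)
    latest = doc["dist-tags"]["latest"]
    return doc["versions"][latest]

def crawl(seed, limit=100):
    # Simplified snowball sample: follow dependency edges outward from a seed
    # package until the frontier is exhausted or `limit` packages are visited.
    seen, frontier, edges = set(), [seed], []
    while frontier and len(seen) < limit:
        name = frontier.pop()
        if name in seen:
            continue
        seen.add(name)
        meta = fetch_package(name)
        for dep in meta.get("dependencies", {}):
            edges.append((name, dep))
            frontier.append(dep)
    return edges

print(crawl("express", limit=20)[:5])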

There are several key network characteristics that we focus on in our analysis of the network data. Each of the characteristics calculated in the Analysis section below provides a unique and valuable insight into the inter-reliance of packages and the impact of important nodes.

Although global graph characteristics are relevant to us, we are particularly interested in identifying pivotal nodes because of their relationship to our original problem. A pivotal, or important, package is one whose failure would cause the most damage to the greatest number of other packages. In this way, the existence or non-existence of such nodes directly answers the question originally posed. As such, centrality measures such as betweenness, closeness, and degree centrality feature prominently in our analysis.

Analysis

When it came time to analyze the data we had collected and saved as edgelists (see our data), we decided on 9 metrics to calculate due to their relevance to our question. These metrics were: the number of nodes, betweenness centrality, density, transitivity, average path length, average clustering coefficient, average neighbor degree, average closeness centrality, and average degree centrality.

We calculated and collected these metrics for 6 different sizes of sub-networks. This was important to ensure the validity of the conclusions we were drawing and to see how they would scale. These sub-networks contained 200, 1,000, 2,000, 5,000, 7,000, and 10,000 edges respectively.

We then aggregated all of this data into a table to ease the analysis process.

Network Characteristic       | 200 Edges | 1,000 Edges | 2,000 Edges | 5,000 Edges | 7,000 Edges | 10,000 Edges
# of Nodes                   | 353       | 1,402       | 2,493       | 4,617       | 5,638       | 6,595
Betweenness Centrality       | 2.1x10^-7 | 7.1x10^-8   | 7.7x10^-8   | 5.3x10^-7   | 4.5x10^-6   | 2.1x10^-5
Density                      | 0.0016    | 0.0005      | 0.0003      | 0.0002      | 0.0002      | 0.0002
Transitivity                 | 0         | 0           | 0.0012      | 0.0053      | 0.0074      | 0.0120
Avg. Path Length             | 0.3879    | 0.5298      | 0.7607      | 2.8633      | 6.3920      | 7.7207
Avg. Clustering Coefficient  | 0         | 0           | 0.0001      | 0.0018      | 0.0043      | 0.0114
Avg. Neighbor Degree         | 0.0255    | 0.0879      | 0.2574      | 0.6898      | 1.1954      | 1.8215
Avg. Closeness Centrality    | 0.0016    | 0.0005      | 0.0004      | 0.0004      | 0.0008      | 0.0028
Avg. Degree Centrality       | 0.0032    | 0.0010      | 0.0006      | 0.0005      | 0.0004      | 0.0005
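For reference, the sketch below shows how metrics like these can be computed with networkx from one of our edgelists; the file name and the exact aggregation conventions (averaging per-node scores, handling disconnected components) are illustrative rather than the precise choices in our notebook.

import networkx as nx

# Read one of the edgelist sub-samples as a directed graph.
# The file name here is illustrative.
G = nx.read_edgelist("npm_1000_edges.txt", create_using=nx.DiGraph())
UG = G.to_undirected()

def avg(per_node):
    # Average a per-node metric dictionary over all nodes.
    return sum(per_node.values()) / len(per_node)

metrics = {
    "# of Nodes": G.number_of_nodes(),
    "Betweenness Centrality": avg(nx.betweenness_centrality(G)),
    "Density": nx.density(G),
    "Transitivity": nx.transitivity(UG),
    "Avg. Clustering Coefficient": nx.average_clustering(UG),
    "Avg. Neighbor Degree": avg(nx.average_neighbor_degree(G)),
    "Avg. Closeness Centrality": avg(nx.closeness_centrality(G)),
    "Avg. Degree Centrality": avg(nx.degree_centrality(G)),
}

# Average path length is only defined on connected graphs, so one option is
# to average it over the connected components of the undirected version.
lengths = [nx.average_shortest_path_length(UG.subgraph(c))
           for c in nx.connected_components(UG) if len(c) > 1]
metrics["Avg. Path Length"] = sum(lengths) / len(lengths) if lengths else 0

for name, value in metrics.items():
    print(f"{name}: {value:.4g}")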

Visualizations

We also wanted to visualize the sub-networks we were creating to help understand the data we had collected. To accomplish this, we plotted two of our sub-networks using a spring layout, which arranges the nodes in a roughly circular cloud. We also wanted to visualize some of the metrics calculated in our analysis section, so we created charts.

Graphs

In the following graphs, the red circles denote npm packages and the black lines are the dependencies between those packages. The edges in our graphs are directed because the dependency relationship between packages is uni-directional. Our graphs visualize this by rendering the in-edges with thicker, rectangle-like arrowheads.
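A minimal sketch of how such a plot can be produced with networkx and matplotlib is shown below; the file name, node sizes, and output path are illustrative, not the exact values we used.

import matplotlib.pyplot as plt
import networkx as nx

# Load a sub-network and position its nodes with a spring layout.
G = nx.read_edgelist("npm_500_edges.txt", create_using=nx.DiGraph())  # illustrative file name
pos = nx.spring_layout(G)

# Red circles for packages, black directed edges for dependencies.
nx.draw_networkx_nodes(G, pos, node_size=20, node_color="red")
nx.draw_networkx_edges(G, pos, edge_color="black", arrows=True, width=0.5)

plt.axis("off")
plt.savefig("npm_500_edges.png", dpi=300)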

500 edge sub-network graphed
1,000 edge sub-network graphed

You may also notice that the ring of packages and the inner circle area are more densely packed in the 1,000 edge sub-network. This is because that sample included more dependencies and, as a consequence, more unique packages. Not every new edge introduced a new package, however: you can see in the visualizations that some packages sit closer to the center than others because more packages depend on them.

For example, in the 500 edge sub-network, there is one node sitting apart near the center of the graph that has a noticeably larger number of in-edges than its immediate neighbors. This is an example of a highly depended-upon package.

Charts

As demonstrated in our analysis section, we calculated statistics for many different metrics across many different sub-networks. Although the table layout is relatively easy to parse, we also wanted to understand these characteristics and the trends they represent further. As such, we created several charts to highlight how each metric changed as the sample sizes increased.

Increasing average neighbor degree
Increasing average path length
Decreasing degree centrality
Decreasing network density

Note that two kinds of change are on display in these metrics. The first group contains metrics that increased as the sub-network sample size increased (a positive correlation). The second group contains metrics that decreased as the sample size increased (a negative correlation). Although this may seem strange at first, it makes sense: as the number of dependencies increases, for example, more straggler packages are likely to be included, which decreases the density.
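Charts like these can be produced directly from the table above; the sketch below uses matplotlib with values copied from that table, and the styling and output file name are our own illustrative choices.

import matplotlib.pyplot as plt

# Values taken from the table in the Analysis section.
edges = [200, 1000, 2000, 5000, 7000, 10000]
avg_neighbor_degree = [0.0255, 0.0879, 0.2574, 0.6898, 1.1954, 1.8215]
density = [0.0016, 0.0005, 0.0003, 0.0002, 0.0002, 0.0002]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(edges, avg_neighbor_degree, marker="o")
ax1.set(title="Avg. Neighbor Degree", xlabel="Edges in sub-network")
ax2.plot(edges, density, marker="o")
ax2.set(title="Density", xlabel="Edges in sub-network")
plt.tight_layout()
plt.savefig("metric_trends.png", dpi=300)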

Data & Code

To improve the reproducibility of our conclusions, we have open sourced our crawler, our data, and our Jupyter notebook in our GitHub repo. We also created and ran the notebook inside a Docker container set up with DataQuest's Data Science Environment (the dataquestio/python3-starter image), which ships with Python 3 and the scientific Python libraries our notebook relies on.

Our crawler generated several different JSON data files, which are hosted here. This data is in its raw form, in that it includes every piece of information we thought could possibly be valuable to our metrics. For example, an item in one of our JSON files looks as follows:

{
    "author": {
        "email": "sindresorhus@gmail.com",
        "name": "Sindre Sorhus",
        "url": "sindresorhus.com"
    },
    "dependencies": {
        "time-zone": "^1.0.0"
    },
    "description": "Pretty datetime: `2014-01-09 06:46:01`",
    "devDependencies": {
        "ava": "*",
        "xo": "*"
    },
    "license": "MIT",
    "name": "date-time"
}

We created two primary data sets, npm_data.json and npm_data_new.json. These files contain 16,515 and 10,310 packages respectively, each collected independently. The primary purpose of having two large data sets was to ensure we were collecting a wide array of packages rather than remaining in one region of the overall larger network.

From these two parent data sets, we created 6 simple edgelist sub-samples containing 200, 1,000, 2,000, 5,000, 7,000, and 10,000 edges respectively.

These edgelists were created using our npm_subset.py script that was created specifically for this purpose. These were also the edgelists used in our analysis above.
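A simplified sketch of what this subsetting step might look like is shown below; it assumes the raw file is a list of package objects like the one above, and the file names, function shape, and edge limit are illustrative rather than the actual interface of npm_subset.py.

import json

def write_edgelist(json_path, out_path, max_edges):
    # Convert raw crawled package data into a "package dependency" edgelist,
    # truncated to the requested number of edges.
    with open(json_path) as f:
        packages = json.load(f)

    count = 0
    with open(out_path, "w") as out:
        for pkg in packages:
            for dep in pkg.get("dependencies", {}):
                out.write(f"{pkg['name']} {dep}\n")
                count += 1
                if count >= max_edges:
                    return

write_edgelist("npm_data.json", "npm_1000_edges.txt", max_edges=1000)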

Our Scripts

We created two main scripts to simplify the data collection and analysis process. We have mentioned and linked to them previously, so now we will provide a brief overview of their individual APIs and their capabilities.

npm_crawler.py
npm_subset.py

The Team

Peter Huettl

NAU Computer Science (BS) program.

ph289@nau.edu

Garrison Smith

NAU Computer Science (BS) program.

gts35@nau.edu