Introduction
How fragile is the modern web development ecosystem? How reliant are popular websites on external libraries? These questions loom over the modern web ecosystem, especially for web developers at large companies, because almost all websites rely on external libraries downloaded through package managers such as npm.
Although these package managers provide an overall positive service to the web development community, there have been several cases of package outages causing development issues, and these outages also affect the users of large sites. This is why we collected and analyzed data about the dependencies between npm packages.
For a more detailed description of our goal, see our initial project proposal.
Technical Description #
To begin addressing these questions, we first needed to gather a meaningful data set and construct an appropriate network. Specifically, we are interested in the JavaScript packages hosted on npm and their dependencies. The following sections outline how we did this and why we chose this approach.
Methodology #
First, we needed to establish the fundamental network model. It was clear that the entities of our network would represent packages hosted on npm. To create a model that most directly addresses the original questions, we established that the relationships in our network would denote one package depending on another. Because package dependencies are unidirectional, the network is represented by a directed graph.
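To make the model concrete, here is a minimal sketch of how such a directed dependency graph can be represented with networkx (the package names come from the example record shown later; the edge direction, from a package to the package it depends on, is a modeling choice):

```python
# Minimal sketch of the network model: nodes are npm packages,
# and a directed edge A -> B means "A depends on B".
import networkx as nx

G = nx.DiGraph()
G.add_edge("date-time", "time-zone")  # date-time depends on time-zone
G.add_edge("ava", "time-zone")

print(G.out_degree("date-time"))  # number of dependencies of date-time
print(G.in_degree("time-zone"))   # number of packages depending on time-zone
```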

Next, we began collecting the data. Unfortunately, npm deprecated the /-/all registry endpoint, which had previously served as a list of all packages hosted on npm. This decision is reasonable, as the registry recently surpassed 600,000 packages, but it meant we had to crawl and collect our own subset of packages. To do so, we wrote an npm_crawler.py script that crawls using a modified snowball sampling method.
Our sampling script begins at a random package and iteratively crawls to its dependencies and dependents, storing all associated package info and every package name it finds. This repeats until all connected packages have been gathered, at which point the search restarts from a new random package; this ensures that we also collect packages with no dependencies. More information about our data is available below.
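For illustration, a heavily simplified sketch of the core traversal is shown below. It follows only dependencies through the public npm registry endpoint and omits the dependent lookups, error handling, periodic saving, and random restarts that npm_crawler.py performs, so treat it as a sketch rather than the actual crawler:

```python
# Sketch of the snowball-style traversal (dependencies only).
# Field names follow the public npm registry's package metadata.
import requests

REGISTRY = "https://registry.npmjs.org/{}"

def crawl(seed, limit=100):
    seen, queue, packages = set(), [seed], {}
    while queue and len(seen) < limit:
        name = queue.pop()
        if name in seen:
            continue
        seen.add(name)
        meta = requests.get(REGISTRY.format(name)).json()
        latest = meta.get("dist-tags", {}).get("latest")
        info = meta.get("versions", {}).get(latest, {})
        packages[name] = info
        queue.extend(info.get("dependencies", {}))  # dict keys = dependency names
    return packages

data = crawl("date-time")
```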
There are several key network characteristics that we will focus on in our analysis of the network data. Each of the following characteristics provides a unique and valuable insight into the inter-reliance of packages and the impact of important nodes:
- Connectedness
- Clustering Coefficient
- Pivotal Nodes
Although global graph characteristics are relevant to us, we are particularly interested in identifying pivotal nodes because of their direct relationship to our original problem. A pivotal, or important, package is one whose failure would damage the largest number of other packages. In this way, the existence or non-existence of these nodes directly answers the question originally posed. As such, we will be analyzing the following (see the sketch after this list):
- Dependence
- Exclusion
- Betweenness Centrality
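As a first approximation of pivotality, packages can simply be ranked by betweenness centrality. A minimal sketch (the edgelist file name is illustrative, not one of our actual files) might look like this:

```python
# Rank packages by betweenness centrality as a rough proxy for how
# "pivotal" they are. Assumes a whitespace-separated directed edgelist
# where each line is "<package> <dependency>".
import networkx as nx

G = nx.read_edgelist("5000_dependencies.edgelist", create_using=nx.DiGraph())
bc = nx.betweenness_centrality(G)
top = sorted(bc.items(), key=lambda kv: kv[1], reverse=True)[:10]
for package, score in top:
    print(f"{package}: {score:.6f}")
```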
Analysis #
When it came time to analyze the data we had collected and saved as edgelists (see our data), we chose 9 metrics to calculate based on their relevance to our question. These metrics were (a computation sketch follows the list):
- Number of Nodes
- Betweenness Centrality
- Density
- Transitivity
- Average Path Length
- Average Clustering Coefficient
- Average Neighbor Degree
- Average Closeness Centrality
- Average Degree Centrality
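All nine metrics can be computed with networkx along the following lines; the exact handling in our notebook may differ slightly (for instance, average path length must be handled per component because the sampled graphs are not fully connected, and the edgelist file name here is illustrative):

```python
# Sketch of computing the nine metrics for one edgelist sub-network.
# Averages are taken over the per-node dictionaries networkx returns.
import networkx as nx
from statistics import mean

def summarize(path):
    G = nx.read_edgelist(path, create_using=nx.DiGraph())
    U = G.to_undirected()
    return {
        "nodes": G.number_of_nodes(),
        "betweenness": mean(nx.betweenness_centrality(G).values()),
        "density": nx.density(G),
        "transitivity": nx.transitivity(G),
        "avg_path_length": mean(
            nx.average_shortest_path_length(U.subgraph(c))
            for c in nx.connected_components(U) if len(c) > 1
        ),
        "avg_clustering": nx.average_clustering(U),
        "avg_neighbor_degree": mean(nx.average_neighbor_degree(G).values()),
        "avg_closeness": mean(nx.closeness_centrality(G).values()),
        "avg_degree_centrality": mean(nx.degree_centrality(G).values()),
    }

print(summarize("200_dependencies.edgelist"))
```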
We calculated these metrics on 6 different sizes of sub-networks. This was important for checking the validity of the conclusions we were drawing and how they scale. The sub-networks were of the following sizes:
- 200 dependencies
- 1,000 dependencies
- 2,000 dependencies
- 5,000 dependencies
- 7,000 dependencies
- 10,000 dependencies
We then aggregated all of this data into a table to ease the analysis process.
Network Characteristic | 200 Edges | 1,000 Edges | 2,000 Edges | 5,000 Edges | 7,000 Edges | 10,000 Edges
---|---|---|---|---|---|---
# of Nodes | 353 | 1,402 | 2,493 | 4,617 | 5,638 | 6,595
Betweenness Centrality | 2.1×10⁻⁷ | 7.1×10⁻⁸ | 7.7×10⁻⁸ | 5.3×10⁻⁷ | 4.5×10⁻⁶ | 2.1×10⁻⁵
Density | 0.0016 | 0.0005 | 0.0003 | 0.0002 | 0.0002 | 0.0002
Transitivity | 0 | 0 | 0.0012 | 0.0053 | 0.0074 | 0.0120
Avg. Path Length | 0.3879 | 0.5298 | 0.7607 | 2.8633 | 6.392 | 7.7207
Avg. Clustering Coefficient | 0 | 0 | 0.0001 | 0.0018 | 0.0043 | 0.0114
Avg. Neighbor Degree | 0.0255 | 0.0879 | 0.2574 | 0.6898 | 1.1954 | 1.8215
Avg. Closeness Centrality | 0.0016 | 0.0005 | 0.0004 | 0.0004 | 0.0008 | 0.0028
Avg. Degree Centrality | 0.0032 | 0.001 | 0.0006 | 0.0005 | 0.0004 | 0.0005
Visualizations #
We also wanted to visualize the sub-networks we created to better understand the data we collected. To do so, we plotted two of our sub-networks using a spring layout, which arranges the nodes in a roughly circular shape. We also wanted to visualize some of the metrics from our analysis section, so we created charts.
Graphs #
In the following graphs, the red circles denote npm packages and the black lines are the dependencies between them. The edges are directed because the dependency relationship between packages is unidirectional; the graphs show this by drawing the in-edge end of each dependency as a thicker rectangular mark.
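A sketch of how such a plot can be produced with networkx and matplotlib is shown below (the file name and styling values are approximate, not the exact ones we used):

```python
# Sketch of the visualization approach: spring layout, red nodes,
# black directed edges.
import matplotlib.pyplot as plt
import networkx as nx

G = nx.read_edgelist("1000_dependencies.edgelist", create_using=nx.DiGraph())
pos = nx.spring_layout(G)
nx.draw_networkx_nodes(G, pos, node_size=10, node_color="red")
nx.draw_networkx_edges(G, pos, edge_color="black", arrows=True, width=0.3)
plt.axis("off")
plt.show()
```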


You may also notice that the ring of packages and the inner circle area are more densely packed in the 1,000 edge sub-network. This is because that sample included more dependencies and, as a consequence, more unique packages. Not every new edge introduced a unique package, however: in the visualizations, some packages sit closer to the center than others because more packages depend on them.
For example, in the 500 edge sub-network, there is a single node near the center of the graph with noticeably more in-edges than its immediate neighbors. This is an example of a highly depended-upon package.
Charts #
As described in our analysis section, we calculated many different metrics across many different sub-networks. Although the table is relatively easy to parse, we also wanted to understand these characteristics and the trends they represent, so we created several charts to highlight how the metrics change as the sample size increases.
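Each chart is a simple line plot of one metric against the sub-network size; for example, using the transitivity values from the table above:

```python
# Sketch of one trend chart: a metric plotted against the sub-network
# sample size (values taken from the table above).
import matplotlib.pyplot as plt

sizes = [200, 1000, 2000, 5000, 7000, 10000]
transitivity = [0, 0, 0.0012, 0.0053, 0.0074, 0.0120]

plt.plot(sizes, transitivity, marker="o")
plt.xlabel("Sub-network size (edges)")
plt.ylabel("Transitivity")
plt.title("Transitivity vs. sample size")
plt.show()
```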




Note that the metrics display two kinds of change. The first group increased alongside the sub-network sample size (positive correlation), while the second group decreased as the sample size increased (negative correlation). Although initially surprising, this makes sense: as the number of dependencies increases, for example, more straggler packages are likely to be included, which decreases the density.
Data & Code #
To improve the reproducibility of our conclusions, we have open sourced our crawler, data, and Jupyter notebook in our GitHub repo. We created and ran the notebook in a Docker container set up with DataQuest's Data Science Environment (the dataquestio/python3-starter image). This environment includes:
- python3
- numpy
- scipy
- etc.
Our crawler generated several JSON data files, which are hosted here. The data is in raw form: it includes any information we thought could be valuable for our metrics. For example, an item in one of our JSON files looks as follows:
{
  "author": {
    "email": "sindresorhus@gmail.com",
    "name": "Sindre Sorhus",
    "url": "sindresorhus.com"
  },
  "dependencies": {
    "time-zone": "^1.0.0"
  },
  "description": "Pretty datetime: `2014-01-09 06:46:01`",
  "devDependencies": {
    "ava": "*",
    "xo": "*"
  },
  "license": "MIT",
  "name": "date-time"
}
We created two primary data sets, npm_data.json and npm_data_new.json, containing 16,515 and 10,310 packages respectively, each collected independently. The main purpose of having two large data sets was to ensure we collected a wide array of packages rather than staying in one region of the overall larger network.
From these two parent data sets, we created 6 simple edgelist sub-samples:
- 200 dependencies edgelist
- 1,000 dependencies edgelist
- 2,000 dependencies edgelist
- 5,000 dependencies edgelist
- 7,000 dependencies edgelist
- 10,000 dependencies edgelist
These edgelists were generated with our npm_subset.py script, written specifically for this purpose, and they are the edgelists used in the analysis above.
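A simplified sketch of what this subsetting step might look like is given below; it assumes the raw JSON is stored as a list of records like the example above, which may differ from the actual file layout and from npm_subset.py's exact behavior:

```python
# Sketch: build an edgelist of the first <size> dependency edges from the
# raw crawler JSON (assumed here to be a list of package records).
import json

def write_edgelist(data_file, out_file, size):
    with open(data_file) as f:
        packages = json.load(f)
    edges = []
    for pkg in packages:
        for dep in pkg.get("dependencies", {}):
            edges.append((pkg["name"], dep))
            if len(edges) == size:
                break
        if len(edges) == size:
            break
    with open(out_file, "w") as f:
        for src, dst in edges:
            f.write(f"{src} {dst}\n")

write_edgelist("npm_data.json", "200_dependencies.edgelist", 200)
```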
Our Scripts #
We created two main scripts to simplify the data collection and analysis process. We have mentioned and linked to them previously, so here we provide a brief overview of their command-line arguments and capabilities.
npm_crawler.py
- <data-file> - Where to save the final data
- <avoid-file> - Where to save faulty packages
- <save-interval> - How many packages should be found before saving
- <auto-continue> - Should the script automatically continue

npm_subset.py
- <data-file> - Where the data file to pull from is saved
- <out-file> - Where to save the edgelist
- <size> - How many dependencies to grab
The Team #

