What if you were handed a list of 3500 applications that a customer has deployed in their environment? How would you go about verifying each application’s support status? You could of course sit down and manually search the website for each application in your list, but to be honest, I wouldn’t want to do that, nor would I be happy to pay an employee a salary to do it. The better way would be to fully, or at least partially, automate the process.
There are a few ways to automate such a process. One is to check whether the website provides some kind of API that you can query. If it doesn’t, you can do what we did: extract all of the data from the site into a .CSV or .XLSX file and work from that.
A colleague of mine was asked to do exactly what I described above: verify the support status of some 3000+ applications on VMware vSphere. So we sat down and thought about how we could extract the currently supported applications from the “VMware Business Applications on the VMware Platform” website into a CSV file. A few minutes later, we had come up with the following theoretical method:
- We would download the raw HTML of each of the 287 pages of the website using wget.
- Once the raw HTML is on disk, we would parse the HTML using a Perl script and generate the CSV.
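To give a rough idea of what the parsing step involves, here is a minimal sketch in Python (not the Perl script described in this post). It assumes a hypothetical page layout in which each application is one table row, with the name (optionally wrapped in a link) in the first cell and the support status in the second. The real markup of the website may well differ, and the names RowParser, html_to_rows and rows_to_csv are made up for illustration.

```python
import csv
import io
from html.parser import HTMLParser


class RowParser(HTMLParser):
    """Collect (name, status, link) from each <tr> of an ASSUMED table layout."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._cells = None  # cells of the current row, None when outside a row
        self._text = None   # text pieces of the current cell, None outside a cell
        self._link = ""     # first href seen inside the current row

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._cells, self._text, self._link = [], None, ""
        elif tag == "td" and self._cells is not None:
            self._text = []
        elif tag == "a" and self._text is not None and not self._link:
            self._link = dict(attrs).get("href") or ""

    def handle_data(self, data):
        if self._text is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._text is not None:
            # Collapse runs of whitespace so names compare cleanly later.
            self._cells.append(" ".join("".join(self._text).split()))
            self._text = None
        elif tag == "tr" and self._cells is not None:
            # Rows without at least two <td> cells (e.g. header rows) are skipped.
            if len(self._cells) >= 2:
                self.rows.append((self._cells[0], self._cells[1], self._link))
            self._cells, self._text = None, None


def html_to_rows(html):
    """Parse one page of HTML and return a list of (name, status, link)."""
    parser = RowParser()
    parser.feed(html)
    parser.close()
    return parser.rows


def rows_to_csv(rows):
    """Render the rows as CSV text, with a header line."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["Application Name", "Support Status", "Support Page"])
    writer.writerows(rows)
    return out.getvalue()
```

With the pages already downloaded to disk, you would read each file, pass its contents to html_to_rows, and write the combined rows out once at the end. Using the csv module rather than joining strings by hand matters here, because application names can contain commas.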
So my colleague set off and started writing a script that could download all of the website pages to HTML files on disk. Whilst he was doing this, I started working away on a small Perl script to parse the HTML. To make a long story short, about an hour after getting the idea we had all of the data in a CSV file that we could easily work with.
However, getting the data was still a two-step process. You had to download the HTML to disk using one script and then parse the HTML using another script. So, in order to make this a little more streamlined, I sat down over the weekend and made my Perl script a little better. The script can now download each page from the website and parse the HTML into a CSV file on the fly.
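The single-pass idea can be sketched roughly as follows, again in Python rather than Perl and not taken from the actual script. The page URLs and the parse_page function are assumptions: parse_page stands for any function that turns the HTML of one page into rows of (name, status, link), and the caller supplies the list of URLs, since the real URL pattern of the website is not given here.

```python
import csv
import time
import urllib.request


def fetch_page(url):
    """Download one page and return its HTML as text."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


def scrape_to_csv(page_urls, parse_page, out_file, fetch=fetch_page, delay=0.0):
    """Fetch each page, parse it, and append its rows to the CSV right away.

    parse_page is a caller-supplied (hypothetical) function: HTML -> rows.
    Nothing is written to disk except the CSV itself.
    Returns the number of data rows written.
    """
    writer = csv.writer(out_file, lineterminator="\n")
    writer.writerow(["Application Name", "Support Status", "Support Page"])
    total = 0
    for url in page_urls:
        for row in parse_page(fetch(url)):
            writer.writerow(row)
            total += 1
        if delay:
            time.sleep(delay)  # be polite to the web server between requests
    return total
```

Passing the fetch function in as a parameter is a design choice of this sketch: it lets you try the loop against canned HTML before pointing it at a live site with a few hundred pages.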
Although I would love to help the community, I have decided to not make our scripts available. However, I have decided to make the data available.
The data file contains the following data:
- Application Name
- Support Status
- Support Page (Links to the Application Support Page on the VMware Website)
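Once you have the data file, checking a customer’s application list against it could look something like the sketch below. It assumes the CSV has a header row using exactly the three column names listed above, and that names match apart from letter case and surrounding whitespace. Both are assumptions, and real application lists usually need fuzzier matching than this. The function names are made up for illustration.

```python
import csv


def load_support_data(csv_file):
    """Index the rows of the data file by lower-cased application name."""
    reader = csv.DictReader(csv_file)
    return {row["Application Name"].strip().casefold(): row for row in reader}


def check_applications(names, support_data):
    """Return (name, status) pairs; unknown applications get 'NOT FOUND'."""
    results = []
    for name in names:
        row = support_data.get(name.strip().casefold())
        status = row["Support Status"] if row else "NOT FOUND"
        results.append((name, status))
    return results
```

You would open the data file with open(path, newline="", encoding="utf-8"), pass it to load_support_data, and then feed the customer’s list through check_applications. Anything that comes back as NOT FOUND is a candidate for a manual search, which should be a far shorter list than the original 3500.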