Analytics Is a Journey
SkyFoundry and the Necessity of Distributed Computing
SkyFoundry, based in Richmond, VA, produces SkySpark, a leading platform for collecting and analyzing Internet of Things data. In an episode of Harbor’s Future Perfect Tech podcast, SkyFoundry co-founder John Petze talks about his long experience in IoT and his philosophy of Smart Systems design. This essay is based on parts of that conversation. You can listen to the podcast in its entirety here.
THE NECESSITY OF DISTRIBUTED COMPUTING
The big divide between yesterday and tomorrow is this: Yesterday you had to go back to some central connection point to make a decision. Tomorrow your local application and its compute function can essentially exist in a more or less fully autonomous manner.
A distributed computing environment means multiple computing nodes connected together to share data and communicate with each other, with an application to coordinate their operation running on each of the nodes.
Many people today talk about computing’s future as a battle between “the cloud” and “the edge,” but SkyFoundry greatly prefers to describe it in terms of “distributed computing.” They have found that if they say “edge” or “cloud,” people automatically assume it has to be one or the other. That’s not correct. It’s not an either-or issue. The cloud is really valuable, but the edge is really valuable too.
Much of the IT world still labors under a centralized, batch-oriented model of computing, where technicians batch all the available data together, as if into a giant soup stockpot, and try to intuit next week’s behavior based upon what happened three weeks ago. But you can never process enough data in that batched architecture to do the kinds of things you can do with distributed systems and local data processing.
DATA IS GOOD FOR YOU
Prospective clients often start a conversation with SkyFoundry by asking “What algorithms do you have?”—as if algorithms are secret commodities that data wizards develop in the abstract, completely apart from specific business problems. The idea seems to be that you buy yourself one of these magic beans and—presto!—all your problems are solved.
SkyFoundry’s usual response is to say, “Analytics is a journey. What kinds of insights would help your business? What do you want to get out of your data?” Only after doing such soul-searching is it possible to develop thought processes (and then algorithms) for what the company may potentially want to do.
Just as important, what companies actually do changes over time as they learn the data, the characteristics, the performance, the trends and the opportunities. If you don’t know how your grid or your building is performing, you don’t know what potential opportunities you might have to do something valuable. Analytics is a journey. You can start out with a list of the top ten things you want to do, but that will quickly change.
Bear in mind that data science is almost brand-new to societies at large. Data management and analytics as areas of expertise didn’t exist until roughly a decade ago, when large data libraries became available to the general public. Thus, helping companies understand how data can help them turns out to be a significant piece of what a data analytics firm does.
SkyFoundry in particular advocates a whole-network distributed computing model where cloud-processing and edge-processing both come into play as needed. But they never say to a potential customer, “You must distribute your application to this or that level.” They spend time learning the customer’s actual needs and then they help determine what level of distribution is needed.
The flexibility to design to actual real-world needs is the point. If the client can live comfortably without full distribution, that’s fine. The important thing is that they look at what they’re trying to accomplish. Only then can you decide where the data, the compute, and the user interface need to be.
In any proprietary system, the designers made structural decisions about the data driving it, and those decisions govern how the system delivers its screens of analytics to you. When this is confined to well-understood data from a known source, everything is fine. But when you start to work with data from diverse devices, bringing them together to find patterns and correlations across the entire pool, the structure and format of that data suddenly becomes a big problem—so big that even experienced professionals can be shocked by the complexity that comes with it.
A couple of years ago, Harbor Research did a massive overview of data and analytics projects and found that 65-70% of the cost of a given project was devoted to organizing the data to make it useful. That level of data wrangling is not only a huge financial barrier but also a huge psychological one. It uses up a lot of time and investment by people who are continually frustrated by how much expense they are authorizing without seeing clear results from it.
The minute you have two manufacturers of two different devices producing two bits of data, you have a standards and interoperability issue. How can they agree on the terms and conditions to fuse this data and make a decision to do X, Y or Z? Completely unstructured data is not necessarily worthless data. And highly structured data is not automatically the most valuable.
Sometimes combining two pieces of data can be revelatory. Every company knows what its electricity consumption is by reading its meter, and they can easily find out the electric rate they’re being charged. But it’s surprisingly rare for a company to put the two pieces of data together and question whether they’re being charged the right rate for what they consume, the time of day they consume it, etc. SkyFoundry has seen companies save a significant amount of money with a simple investigation of an issue like this.
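As a minimal sketch of putting those two pieces of data together, the Python below prices the same metered consumption under two tariffs. Every number and name here is an assumption made for illustration: the hourly readings, the flat rate, the time-of-use prices and the peak window are hypothetical and do not come from any SkyFoundry customer.

```python
from datetime import datetime

# Hypothetical hourly meter readings: (timestamp, kWh consumed in that hour).
readings = [
    (datetime(2021, 3, 1, 3), 40.0),    # overnight
    (datetime(2021, 3, 1, 10), 120.0),  # mid-morning
    (datetime(2021, 3, 1, 15), 150.0),  # afternoon
    (datetime(2021, 3, 1, 22), 60.0),   # late evening
]

FLAT_RATE = 0.12  # assumed flat price in $/kWh


def time_of_use_rate(ts):
    """Assumed tariff: peak price from 12:00 to 19:59, off-peak otherwise."""
    return 0.20 if 12 <= ts.hour < 20 else 0.08


def cost(meter_readings, rate_for):
    """Total cost when each reading is priced by rate_for(timestamp)."""
    return sum(kwh * rate_for(ts) for ts, kwh in meter_readings)


flat_cost = cost(readings, lambda ts: FLAT_RATE)
tou_cost = cost(readings, time_of_use_rate)
cheaper = "flat" if flat_cost < tou_cost else "time-of-use"
```

With these assumed figures the flat tariff comes out cheaper because the largest load falls in the peak window. A site that consumes mostly overnight would see the opposite, and that only shows up once the meter data and the rate data are joined.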
Ninety percent of the world of analytics depends upon simple apps that triangulate something or put three or four pieces of data together to confirm or deny some state or capability. We still live in a world where data is not free, and not really free to fuse together. Without standards and interoperability, we’re still in the position of buying Lego sets from three different manufacturers and trying to build a beautiful castle out of them.
THE NEW HAVES AND HAVE-NOTS
Companies always want to plan out use cases, project ROI, and monitor whether that ROI has been delivered. That desire gets to the crux of helping industry and society embrace data and analytics. However, it’s extremely difficult to prove ROI ahead of time because data analytics is about learning where the problems are. You don’t know the problems in your processes until you look for them.
It is possible, however, to showcase case studies of how a customer implemented analytics with specific software and isolated a specific problem whose resolution paid for the whole project, never mind every other piece of value they got and continue to get. Another technique is to launch small, well-chosen pilot projects that can give clients a real-world sense of the value they will get for a modest initial outlay.
In the end, it always comes down to specific corporate culture. There are organizations that have yet to embrace analytics, that keep trying to figure out why they continue to do worse and worse, but are just not ready to step up. That’s the process of capital destruction, and you see that playing out everywhere. Platform vendors like SkyFoundry hope that pressure on organizations to embrace, adopt, pilot, explore and learn about these things will motivate them. But it won’t be universal.
Some will evolve and get it, and some won’t. This seems to foretell a future of haves and have-nots in terms of strategies built on data, and this is a bifurcation that Harbor Research believes will divide the financial performance of all ventures in the future.
SIMPLE, COMPOUND, COMPLEX
Harbor always finds it illuminating to think about the evolution and combinations of technologies in terms of simple, compound and complex. Most people jump straight to how difficult a complex solution is going to be, but in fact many of those solutions are built on simple and compound ones:
- Simple data: The results of a simple alert or alarm are compared to the state of a machine at a certain time, or to the charges imposed by a utility for water or power.
- Compound data: This involves combining data with more explicit time intervals processed at the edge. An example would be preventing an autonomous car from running over a pedestrian at an intersection.
- Complex data: An obvious example is optimizing the flow of traffic through a city, or optimizing emergency response services. In the latter case, an ambulance that has been dispatched to an accident arrives and technicians connect the patient to various data collection devices. The data should be available to people in the ambulance but also transferred in real-time to the hospital to alert appropriate specialists that the patient is coming in and what her state is. Meanwhile, an app in the cab of the ambulance is giving the driver the fastest route to the hospital.
In complex solutions, it doesn’t really matter whether it’s a building, a factory or a city: every relationship involves interrelated data, and that requires interoperability. Whether you are routing a vehicle, predicting a machine failure in a factory, or verifying that a supply chain is able to function, you always need to know what’s going on in all the different material theaters. And that’s predicated on the ability to process locally in a distributed way and interrelate data from all of those processing points to allow for aggregate decision-making.
Eventually you arrive at the ability to relate this to models that represent much longer timeframes and much longer operating histories. This lets you understand how to further optimize and manage any one of these kinds of complex systems.
CLOUD AND EDGE ARE NOT MUTUALLY EXCLUSIVE
Advocates of distributed computing must help their clients understand that there are things you can’t do in the cloud, for a variety of reasons. One of the most obvious is latency. There isn’t time for the data to get to the cloud and be analyzed for a decision. This can instantly be understood when you consider the example of the self-driving car. If you send real-time data about a crosswalk to the cloud for a decision about whether to apply the brakes, the latency will cause the decision to take too long and the car will hit the pedestrian crossing the street.
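The arithmetic behind that example is simple enough to sketch. The Python below estimates how far a vehicle travels while it waits for a decision. The speed and both delay figures are assumptions chosen for illustration, not measurements of any real network or vehicle.

```python
def distance_during_delay(speed_kmh, delay_ms):
    """Metres travelled at speed_kmh while waiting delay_ms for a decision."""
    speed_m_per_s = speed_kmh * 1000 / 3600
    return speed_m_per_s * delay_ms / 1000


SPEED_KMH = 50        # assumed urban driving speed
CLOUD_DELAY_MS = 200  # assumed round trip to a cloud service, including processing
EDGE_DELAY_MS = 10    # assumed on-board processing time

cloud_m = distance_during_delay(SPEED_KMH, CLOUD_DELAY_MS)
edge_m = distance_during_delay(SPEED_KMH, EDGE_DELAY_MS)

print(f"cloud decision: {cloud_m:.2f} m travelled before braking starts")
print(f"edge decision:  {edge_m:.2f} m travelled before braking starts")
```

Under these assumptions the car covers a little under three metres before a cloud decision arrives, against roughly fourteen centimetres for a local one. A dropped connection makes the cloud figure unbounded.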
Everyone can understand that example, but again the edge does not exclude the cloud or vice versa. To continue with our car example, there is data we want to work with in the cloud. We all use map applications on our phones. Anonymized cell phone data in the cloud lets us determine that all the cars in front of us are starting to slow down: Ergo, there’s a traffic jam up ahead. That can’t happen with edge processing alone. To get full value from vehicles or other machines, there’s a role for the cloud and a role for the edge, and the architecture has to support that.
Additional layers of value can be revealed by comparing what is being collected to some existing model. For example, we have a movement toward “road weather” that stems from putting sensors on trucks and other vehicles to get real-time data about road conditions. This data is typically compared to the weather forecast for today.
However, you can also build up a 3-year, 30-year, and eventually a 300-year model of past weather that sits in the cloud. These models can be compared to this moment’s weather to set yet another decision context that can supplement what’s going on in real-time. Going further into the future, you can get very sophisticated in the ways you meld these different worlds and understand legacy analytics by combining and recombining data in interesting ways to fit whatever tasks you’re trying to solve.
Besides the latency problem, there are other equally important but less obvious constraints. One of them is a constrained network that prevents you from sending your data to the cloud whether or not latency is a problem. Because most people have personal experience with slow networks, they usually leap to that example of constraint. But sometimes networks are constrained by something other than performance.
For example, a SkyFoundry customer had thousands of nodes in buildings as part of a grid system. They wanted to use the data coming from those nodes to make intelligent decisions about load management. But all of those devices were connected over cellular modems, and the mobile carrier was billing based upon the volume of data. Even if you could put all that data up in the cloud, the cost would be enormous. If you ran your algorithms at the edge, however, the only thing that had to go up over the cellular network was the decision to run this or that strategy. SkyFoundry built an application to that specification and saw data traffic cut by more than a hundred to one.
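The pattern can be sketched in a few lines of Python. This is not SkyFoundry's application: the node name, the sampling rate, the load limit and the decision rule are all hypothetical, and the sketch only contrasts the size of a raw upload with the size of a locally computed decision.

```python
import json
from statistics import mean


def raw_payload(readings):
    """Cloud-first approach: ship every reading over the metered link."""
    return json.dumps(readings).encode("utf-8")


def edge_payload(readings, limit_kw):
    """Edge approach: analyze locally and send only the chosen strategy."""
    avg_kw = mean(r["kw"] for r in readings)
    strategy = "shed_load" if avg_kw > limit_kw else "normal"
    message = {"node": readings[0]["node"], "strategy": strategy}
    return json.dumps(message).encode("utf-8")


# Hypothetical data: one reading per second for an hour from a single node.
readings = [
    {"node": "n-001", "t": i, "kw": 40.0 + (i % 60) * 0.5}
    for i in range(3600)
]

raw = raw_payload(readings)
edge = edge_payload(readings, limit_kw=50.0)
ratio = len(raw) / len(edge)

print(f"raw upload: {len(raw)} bytes, edge decision: {len(edge)} bytes")
```

The ratio this prints depends entirely on the assumed sampling rate and message format, so it should not be read as a reproduction of the customer's result. The point is only that the decision stays a few dozen bytes however much data feeds it.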
The other big issue is reliability. The pandemic year that we’ve just been through provides the perfect example of this. Has anybody had a video conference interrupted by a technical problem? You bet. Everybody instantly understands that, but there are multiple layers of reliability. If you are collecting really valuable data, you need to collect it locally as well as in the cloud to have a buffer in case you lose your connection.
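One common way to provide that buffer is a store-and-forward queue. The Python sketch below is a hypothetical illustration rather than any product's implementation: the class name, the `send_to_cloud` stand-in and the buffer size are assumptions, and a real device would normally persist the backlog to disk instead of holding it in memory.

```python
from collections import deque


class StoreAndForward:
    """Hold readings locally and forward them whenever the uplink works."""

    def __init__(self, send, max_buffered=10_000):
        self._send = send  # callable that raises ConnectionError on failure
        # When the buffer is full, the oldest readings are dropped first.
        self._pending = deque(maxlen=max_buffered)

    def record(self, reading):
        self._pending.append(reading)
        self.flush()

    def flush(self):
        """Send buffered readings in order; stop at the first failure."""
        while self._pending:
            try:
                self._send(self._pending[0])
            except ConnectionError:
                return  # uplink down: keep the backlog and retry later
            self._pending.popleft()

    @property
    def backlog(self):
        return len(self._pending)


# Simulated uplink (an assumption made so the sketch is runnable).
link = {"up": False}
delivered = []


def send_to_cloud(reading):
    if not link["up"]:
        raise ConnectionError("uplink unavailable")
    delivered.append(reading)


buffer = StoreAndForward(send_to_cloud)
buffer.record({"t": 1, "temp_c": 21.5})
buffer.record({"t": 2, "temp_c": 21.7})
backlog_while_down = buffer.backlog  # both readings are held locally

link["up"] = True
buffer.flush()  # the backlog drains in its original order
```

A reading is removed from the queue only after the send succeeds, so an outage delays delivery without losing data, up to the buffer limit.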
We’ve already established that certain edge processes have to continue operating with or without the cloud, like the brakes on an autonomous car. But there is a much less obvious example: an operator is running a machine that starts making loud groaning noises as if it’s going to explode, but the operator can’t look at the data because the UI is served from the cloud and the network is down. That’s another scenario where the edge and the cloud are both required.
And finally, there is the issue of security. If a company like SkyFoundry offered only a cloud application, they would lose 50% of their potential business because some customers can’t or won’t send their data to a cloud. Sometimes the data has to stay on-premises for regulatory reasons. Sometimes the company’s Network Operations Center is in another country and the data has to stay in the local jurisdiction.
Either way, it brings us back again to the need for a distributed computing solution. You’re trying to secure devices, networks, data and applications all at the same time, and there can be many configurations of these distributed architectures, each serving unique use cases.
OPTIMIZING FOR THE EDGE
Many algorithms can run effectively at the edge once you know what they are. You might not develop them down there, but you can certainly run them down there. The ability to simulate a process suggests that we may be able to design algorithms that will look for much more focused data, and therefore be deployed to far less compute-intensive devices in the future. For example, when we’re trying to predict equipment failure in real-time, we know that we’re looking for a specific anomaly or a specific combination of events.
Machine learning and AI techniques are powerful but very compute-intensive. There are many other advanced math techniques involved in analytics, such as pattern recognition, frequency-domain analysis, calculus and matrix math. All of those techniques can run very effectively at the edge.
There is a perception out there that machine learning can only happen with huge volumes of data. But that’s not true. There are advanced math algorithms that can be run on smaller amounts of data to optimize control loops, understand perturbations and anomalies, and so on. That type of optimization can happen right at the edge.
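As one illustration of how little data and compute such a check can need, the Python sketch below flags readings that deviate sharply from a short rolling window. The technique is a plain rolling z-score, chosen here for brevity and not attributed to SkyFoundry. The window size, the threshold and the sample vibration values are assumptions.

```python
from collections import deque
from statistics import mean, stdev


class RollingAnomalyDetector:
    """Flag values far from the mean of the most recent readings."""

    def __init__(self, window=20, threshold=3.0, min_samples=5):
        self._window = deque(maxlen=window)
        self._threshold = threshold
        self._min_samples = min_samples

    def update(self, value):
        """Return True if value is anomalous, then add it to the window."""
        is_anomaly = False
        if len(self._window) >= self._min_samples:
            mu = mean(self._window)
            sigma = stdev(self._window)
            if sigma > 0 and abs(value - mu) > self._threshold * sigma:
                is_anomaly = True
        self._window.append(value)
        return is_anomaly


# Hypothetical vibration readings with one obvious spike.
samples = [1.0, 1.1, 0.9, 1.0, 1.05, 0.95, 1.0, 1.1, 5.0, 1.0]

detector = RollingAnomalyDetector()
flagged = [i for i, value in enumerate(samples) if detector.update(value)]
```

The detector keeps at most twenty numbers in memory and does a handful of arithmetic operations per reading, which is well within reach of a small gateway or controller.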
HAYSTACK: OPEN STANDARDS FOR INTEROPERABLE DATA
In addition to leading several major commercial ventures in the Smart Systems world, John Petze is also Chairman and executive board member of Project Haystack, an open-source initiative to make data interoperable.
Haystack provides a metadata tagging standard for marking up data that comes out of machines and sensors. It helps everyone in industry solve the big impediment to data fusion: data coming out of equipment systems has limited, non-existent or totally proprietary semantic descriptors. If you tag your data following the Haystack methodology, with the Haystack tags and the Haystack ontology, then no matter what device it came from, it can be correlated and analyzed in the context of any other data.
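To make the idea concrete, the Python sketch below models a few points from two hypothetical vendors as dictionaries of tags and then selects points by meaning instead of by vendor-specific name. The tag names are written in the Haystack style but have not been validated against the official definitions. The identifiers, the dictionary representation and the `find` helper are assumptions for illustration and are not a Haystack API.

```python
# Marker tags are modelled as keys whose value is True (an assumption of
# this sketch); value tags such as "unit" carry a string.
points = [
    {
        "id": "vendorA.ahu1.dat",
        "dis": "AHU-1 Discharge Air Temp",
        "point": True, "sensor": True,
        "discharge": True, "air": True, "temp": True,
        "unit": "°F", "equipRef": "ahu1",
    },
    {
        "id": "vendorB.rtu7.SupplyT",
        "dis": "RTU-7 Supply Temperature",
        "point": True, "sensor": True,
        "discharge": True, "air": True, "temp": True,
        "unit": "°C", "equipRef": "rtu7",
    },
    {
        "id": "vendorB.meter.kw",
        "dis": "Main Meter Demand",
        "point": True, "sensor": True,
        "elec": True, "power": True,
        "unit": "kW", "equipRef": "meter1",
    },
]


def find(tagged_points, *markers):
    """Return the points that carry every one of the given marker tags."""
    return [p for p in tagged_points if all(p.get(m) is True for m in markers)]


discharge_temps = find(points, "discharge", "air", "temp", "sensor")
discharge_ids = [p["id"] for p in discharge_temps]
```

The two temperature points use unrelated vendor names and different units, yet one query on shared tags finds both. The `unit` tag then tells an application what conversion is needed before the values are compared.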
Unfortunately, although Haystack is widely used, its creators have found a hesitancy on the part of the major equipment suppliers to commit to using an open-source tagging methodology. Even those who are using the standard are hesitant to publicize it, and therefore other manufacturers are slow to build it into their equipment.
For this reason, system integrators are often faced with the manual mapping of data if they want higher level applications to be able to work with it. If industry would do it at the control system level and at the network gateway level, then from those base levels up, the ability to work with the data would be far easier, and the costs of interoperable data would be dramatically reduced or eliminated.
Failure to adopt this standard holds back IoT and the creation of data applications that would lead to discovering important insights. If you can’t get the data integrated, then you don’t know what you’re going to find, and therefore you can’t justify the project. That perpetuates a vicious cycle that continues to hold the industry back.
All this is especially discouraging in light of recent events: Between climate change, the pandemic, constrained resources, and infrastructure debacles like the Texas utilities disaster during the state’s recent deep freeze, we have had a rapidly converging awareness of the brittleness and rigidity of outmoded systems. Solving large problems like these will require interoperable data and analytics, as well as broad cooperation among technology players. ◆
This essay is supported by our Technology Insight, “Edge Computing Evolution and Opportunities.”