LLMs and You: How AI Labs Use the Grass Network

Synopsis: Buyers on the Grass network use your bandwidth to scrape data from the internet. By exploring how AI labs train their language models, we can learn a bit about what types of material they use Grass to access, and why your personal data is not part of the equation.

Introduction

By now, you probably know that Grass is a network that sells your unused internet bandwidth to companies that use it to view the public web. As we’ve explained in the past, the highest-profile use case for this service is AI labs that need massive amounts of web data to train their language models. But what kind of data are they downloading, exactly? And why do they need it?

To understand the answers to these questions, we need to learn a bit about how large language models work. So strap in for a minute while we take a quick look at what’s going on behind the scenes. Today, we’ll try to explain LLMs like you’re five — or at least like you’re a sophisticated adult who wants to understand AI a little better. So where to begin?

LLMs and the Word Vectors They Produce

Let’s start simple: LLMs are AI algorithms to which you can pose questions in plain language and get an actual answer. You may ask for a summary of a given topic, a translation of a particular passage, or a detailed solution to a complex problem. In response, they will generate predictive text to satisfy whatever prompt you’ve decided to input. To the untrained eye, it’s a robot that can speak English.

But how do they work? Ultimately, LLMs comb through massive amounts of written language, find patterns in the ways that certain words relate to one another, then translate these words into strings of numbers that reflect these relationships. These numbers are the language that LLMs actually speak, and they are known as “word vectors.” Let’s give an example to see how they work.

Say you’re in the mood to eat something with meatballs, but you can’t remember the name of that pasta that goes with them. If you ask an LLM what to call this mysterious noodle, it will search for a noun that is A) a pasta, and B) likely to appear in the same sentence as “meatball.” Voila: “Spaghetti.”

In a very simple model, trained only to answer meatball-related questions for forgetful diners, each word vector might have only two dimensions.

1: Does this word describe a noodle? (1 for yes and 0 for no.)

2: How strong is the correlation between this word and “meatballs” in written text?

In this case, spaghetti might be represented as [1, 0.95], with the 1 signifying that spaghetti is a noodle and the 0.95 signifying a 95% correlation with the word “meatball.” This is a higher score than any other word the model has encountered, and thus most likely to be the correct answer. There you have it: Spaghetti and Meatballs.
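
The two-dimensional lookup described above can be sketched in a few lines of Python. This is only a toy illustration: apart from spaghetti's [1, 0.95], the words and scores are invented for the example, and real models learn their vectors from data rather than from a hand-written table.

```python
# Toy word vectors: [is_noodle, correlation_with_meatballs].
# Only spaghetti's values come from the example above; the rest are invented.
toy_vectors = {
    "spaghetti": [1, 0.95],
    "penne": [1, 0.40],
    "gravy": [0, 0.60],
    "bicycle": [0, 0.01],
}


def best_noodle_for_meatballs(vectors):
    """Return the noodle word with the highest meatball correlation."""
    # Dimension 1: keep only words that describe a noodle.
    noodles = {word: vec for word, vec in vectors.items() if vec[0] == 1}
    # Dimension 2: pick the noodle most strongly tied to "meatballs".
    return max(noodles, key=lambda word: noodles[word][1])


print(best_noodle_for_meatballs(toy_vectors))  # prints "spaghetti"
```

Note that "gravy" scores higher than "penne" on the second dimension but is never considered, because the first dimension rules it out as a noodle.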

So now we understand how LLMs communicate a word’s relationship with other words, but what happens when the questions become more complicated? Instead of asking what to call “spaghetti,” what if you asked what a seven-year-old would call spaghetti?

To find out, you’d have to read quotes from millions of seven-year-olds and determine which word has the highest correlation with “meatball” in these very specific contexts. As it turns out, seven-year-olds (hardly known for their facility with the Italian language) are liable to mispronounce the word as “sketti” or “basketti.” At least, that’s what ChatGPT reported back a few moments ago.

Now, this raises a few questions. When answering our prompt simply required a two-dimensional assessment of general correlation, it was easy to comb through limited data and see which word appeared in the most sentences with “meatballs.” As soon as we started asking more complex questions, though, the word vectors needed to be far longer, and thus to draw on larger banks of information. Perhaps you can see where this is going. If you want to train an LLM to answer any question a user could possibly ask, you’re going to have to access much larger datasets.

Big Data

While the scientists in our example above may be content to study meatballs alone, major AI labs are working to create incredibly refined LLMs that will someday have access to all recorded human knowledge. This requires them to spit out word vectors with far more than two dimensions, which can capture more subtle relationships between the words they read. To illustrate, let’s use this model, which was trained on the entire English Wikipedia.

Consider the word “Donkey.” In English, it’s spelled D-O-N-K-E-Y. Vectorized, it’s spelled -0.092339 followed by another 5,507 digits: a mouthful to say, and impossible to remember.

The word vectors in this model are so long because the model is trained on 199,430 unique words, and it’s capable of producing a vector for each of them that communicates its relationship with all the others. Because the model was trained this way on the entirety of Wikipedia, it’s able to answer any question whose answer might be contained in the articles within. The 5,000-character vector lengths belie the sheer amount of information that each one relates back to. So it’s not hard to see why labs want more data: the correlations these LLMs draw between words, and the patterns they discover in written content, get more and more accurate as the data sets they’re trained on get larger and larger.
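
One common way to read a relationship out of two word vectors is cosine similarity, which measures how closely two vectors point in the same direction. The sketch below assumes that measure and uses short, invented four-number vectors in place of the real model's much longer ones. It is an illustration, not necessarily the method the Wikipedia model uses.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Invented vectors for illustration only; real ones are learned from text.
vectors = {
    "donkey": [0.9, 0.8, 0.1, 0.0],
    "mule": [0.8, 0.9, 0.2, 0.1],
    "spaghetti": [0.0, 0.1, 0.9, 0.8],
}


def most_similar(word, vectors):
    """Return the other word whose vector points most nearly the same way."""
    others = (w for w in vectors if w != word)
    return max(others, key=lambda w: cosine_similarity(vectors[word], vectors[w]))


print(most_similar("donkey", vectors))  # prints "mule"
```

With more training text, the numbers in each vector get adjusted so that comparisons like this one line up better with how the words are actually used.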

But how could an AI lab possibly access this much data?

The Grass Connect

This is where it all ties back to you, and the bandwidth you sell to these AI labs on Grass. If you look at the list of models on the website we linked earlier, you can see that a variety of them are available. One was trained by reading all of the words on Wikipedia, one by combing through mountains of Google News articles, and one on the British National Corpus. Whatever data a lab wants its model to be trained on, this is the content they need to access in order to train it.

Here’s the thing: this is relatively simple when the data is crystallized and the answers won’t change. If someone asks an LLM when Columbus discovered America, the answer will always be 1492. A lab could train it on the Encyclopedia Britannica.

But what if an LLM needs to answer questions about contemporary information? What if it needs to answer questions about popular sentiment, or how the average person feels about a certain topic? Where could you find billions of people expressing their thoughts and opinions on any topic imaginable, refreshed eternally in a never-ending stream? Modern problems, as they say, require modern solutions. In this case, the solution is social media.

Accessing this information, however, requires a nonstop connection to the internet, viewed from every corner of the Earth, capable of downloading unfathomable volumes of written language. This, my friend, is the origin of Grass, Wynd Network’s marketplace where ordinary users sell internet bandwidth to AI labs for downloading written words off the web.

Conclusion

So now you understand who these labs are, the LLMs they are trying to train, the types of data they use to train them, and where they can access it with the help of your internet bandwidth. This is only the most rudimentary explanation of how LLMs are trained, and we’ve obviously left a lot out in the interest of simplicity. But hopefully it goes some of the way towards explaining what exactly your bandwidth is being used for and how AI labs use the public data submitted on social media websites to train their AI models.

You’ll notice that nowhere in this conversation is your personal data mentioned even once, and that’s because it doesn’t factor into the equation. When we tell people they’re selling bandwidth so AI companies can download data, that’s often their first assumption — that they’re giving up their data, just like they do by using social media in the first place. We just wanted to write this primer so you would know that this doesn’t happen in any way, not even 1 percent — buyers simply access public web data, often from sites like Reddit, and nothing about you is visible whatsoever. So you can rest assured that your privacy is intact — and maybe you learned something along the way.

Grass: Progress Update and Road Ahead

Key Points:

  1. The Grass network has seen 80,000 individual downloads and nearly 1,000,000 unique residential IPs through our referral program alone
  2. The network will go live once we pass key thresholds in certain metrics, defined below
  3. The launch of our Android and iPhone mobile apps stands to significantly increase the size of the network and the uptime for users
  4. Formal compensation will occur when the network launches, but earnings are occurring now

Over the past few months, we’ve focused on educating people about the vision we have for Grass: defining proxy networks, identifying the use cases, and making sure we’re transparent about what we’re planning.  Now it’s time to shift into phase two: we’re working on carrying this thing to market, and today we’ll update you on what’s next.

As you know, Grass is currently building out its network of residential proxies by recruiting users like you to act as individual nodes.  The sooner we reach a threshold of active nodes, the sooner we launch and we all start to see the money roll in.  By signing up, downloading, and referring your friends, you’re playing a very active part in this process.  Thanks, by the way.

So where do things stand now?

Well, progress has been pretty incredible since our first announcement four months ago.  The network itself has seen almost 1,000,000 unique IP addresses since early June.  The entire point of a proxy network is to provide IP addresses for buyers to route web traffic through, so this is pretty massive.  

To date, 80,000 people have downloaded the web extension, and all of them were referrals from the ref links you’ve been sharing on Twitter, YouTube, and TikTok.  This is really extraordinary for a word-of-mouth campaign, and we are more confident than ever that Grass will exceed every expectation we have.

So this is all great news, but now is the time to focus on the future.  We have some very concrete milestones we’re working towards, and a handful of key metrics that will tell us when the time has come.  Over the next few months, you’ll be able to watch the progress with your own eyes and feel the network get closer and closer to the fateful day we go live.  

Here’s what to look out for:

Development Milestones

  • New UI: The first order of business is to roll out our new UI, which will make the dashboard more intuitive and appeal more to a mass audience.  You can expect this sometime in the next few weeks.
  • Open Access to Grass: Currently, Grass is only available through the ref links you post.  Soon we’ll be opening up access to anyone and everyone, and we expect to see a substantial increase in downloads once this barrier is lifted.
  • Launch Android App: Grass is currently only available as a downloadable web extension, but soon we’ll be releasing an app that you can install on your phone.  This has the potential to be a watershed moment for a few reasons.  First, most people’s phones are on all the time, so the number of active nodes will balloon when we start getting users with 24/7 uptime.  Second, the overwhelming majority of time people spend on the internet occurs on mobile devices, so when they see an ad or ref link, they’ll be able to download and install the app within mere seconds.  This means more users, more downloads, and more active nodes on the network.
  • Launch iOS App: This has the potential to be a huge milestone for all of the same reasons, but on an even larger scale.  62% of the mobile phones in America are iPhones, which means we’ll experience the Android effect at almost twice the intensity.  This might just be the event that pushes us over the line - but only time will tell.

Key Network Metrics

As we progress through these development milestones, we’ll also be keeping our eye on the underlying growth of the network itself.  This is what really matters when it comes to how soon we can launch, and we have a handful of metrics for measuring how far we are in the process.  If you familiarize yourself with them, it will be easier to follow along on the road ahead.

  • Downloads: This refers to the number of individual users who have downloaded the web extension. 
  • Referrals:  As you can probably guess, this is the number of people who have been referred by other members.  To us, it’s a measure of how much footwork the people themselves are putting in to spread the word, and how much faith the community has in Grass’s vision.
  • Unique IPs: This refers to the number of IP addresses that have been active on the network.  This metric is a bit more complex, as it also accounts for individual users who have provided bandwidth from multiple locations.  If you’ve checked your dashboard from a friend’s house or coffee shop, you’ve probably seen additional addresses show up at the bottom left.  This shows increased breadth in the network, but the number of active users is most important of all.
  • Concurrent active nodes:  Concurrent active nodes refers to the number of users who are active at any given time.  Essentially, if a buyer logged on to use bandwidth on the Grass network, this refers to the number of different proxies they could route their web traffic through.
  • Concurrent active nodes (US):  The holy grail.  US IP addresses have the highest demand in the world, and more buyers are willing to pay for American proxies than anywhere else.   We’re building a global network of residential proxies, but it’s particularly important to get Americans signed up.
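
To make the difference between these metrics concrete, here is a rough sketch that computes three of them from a small, made-up activity log. The record layout (user, IP address, start and end minute) and the function names are assumptions for illustration, not a description of how Grass actually tracks nodes.

```python
from collections import namedtuple

# Hypothetical record: one stretch of time a user's node was online from one IP.
Session = namedtuple("Session", ["user", "ip", "start", "end"])

sessions = [
    Session("alice", "198.51.100.1", start=0, end=60),
    Session("alice", "198.51.100.2", start=90, end=120),  # same user, coffee shop
    Session("bob", "203.0.113.5", start=30, end=100),
    Session("carol", "192.0.2.9", start=50, end=55),
]


def unique_users(sessions):
    """Number of distinct users who have ever been active."""
    return len({s.user for s in sessions})


def unique_ips(sessions):
    """Number of distinct IP addresses seen on the network."""
    return len({s.ip for s in sessions})


def concurrent_active_nodes(sessions, minute):
    """Number of distinct users online at a given minute."""
    return len({s.user for s in sessions if s.start <= minute < s.end})


print(unique_users(sessions))                 # 3
print(unique_ips(sessions))                   # 4
print(concurrent_active_nodes(sessions, 52))  # 3
print(concurrent_active_nodes(sessions, 95))  # 2
```

In this toy log, unique IPs exceed the number of users because one user appears from two addresses, while the concurrent count depends entirely on the moment you look.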

Over the next few months, we’ll be providing updates not only when we attain the milestones listed above, but also when we reach key numbers on all of these metrics.  

The Road to Compensation

As soon as we launch, the network will start generating independent revenue - 100% of which will go to compensating active nodes.  The sooner that day comes, the sooner your points will be converted.

Don’t ever forget, though: you might have to wait until then to get compensated, but you’re already earning now.  If you’ve been checking your dashboard for four months at this point, that means you have four months of earnings stacked up before we even launch.  We’re currently testing the network with several proxy buyers who are looking to ramp up their usage once our network has sufficiently grown, and if you’re reading this, you are very early.

Hopefully this update gave you an idea of where Grass stands today, how far we’ve come, and a concrete sense of the path to going live.  We’ll continue to update you on all of the events we described above, so keep a close eye on our Twitter, Discord, and blog for more news.  And remember - the more people you refer, the more earnings you can stack up before launch, and the faster we can all get there.  So go out and touch Grass!

Grass 101: How It Works

If you’re reading this blog post, you’ve probably heard about Grass, the flagship product from Wynd Network. Grass is an upcoming browser extension that lets users monetize their internet connection by selling unused network resources — by selling their “view of the internet.” But what exactly are these network resources, and what does “your view of the internet” mean?

Think of it like this: Grass enables you to sell a product you didn’t even know you have. Today we’ll explain exactly what that product is.

First, we’ll discuss why your internet connection is valuable, and why other people are willing to pay for it.

Then we’ll look at how the market for these resources works today, and how centralized proxy providers are already selling your network space without paying you for it at all.

Finally, we’ll introduce Grass: a decentralized residential proxy market that uses token rewards to upend the traditional business model for these networks and compensate its users fairly.

Grass has the potential to revolutionize this industry and create a more equitable, secure, and ethical marketplace for network resources. So let’s take a closer look and see how it all works.

 

1. Defining Residential IP Proxies

It all revolves around data.

Say there’s an airline that wants to know what all of its competitors are charging for plane tickets. This data exists on public websites, but how can the airline gather it all, particularly when it could vary based on the location of the viewer?

Or what if the same company paid for web advertising, and they want to see if their ads are showing up in all of the markets they paid to target?

To capture this information from the public web, they need to access the internet from the public’s point of view — from as many sources as possible, in as many locations as possible.

That’s where you come in.

Every time you access the internet, you do it from a unique IP address, and a lot of what you see is tailored to your location. When you act as a residential IP proxy, it simply means that someone routes their internet traffic through your IP address, so they see the internet from your point of view. Then, they can use this view to scrape the web for whatever public data they may need.
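
From the buyer's side, routing traffic through a proxy is usually a small configuration change in whatever HTTP client they use. The sketch below uses Python's standard urllib to build a client that would send its requests via a proxy. The proxy address is a placeholder from an IP range reserved for documentation, the function name is hypothetical, and nothing here reflects how Grass itself assigns or authenticates proxies.

```python
import urllib.request

# Placeholder address (reserved for documentation); not a real proxy.
PROXY_URL = "http://203.0.113.7:8080"


def build_proxied_opener(proxy_url):
    """Return a urllib opener that sends HTTP and HTTPS requests via proxy_url."""
    proxy_handler = urllib.request.ProxyHandler(
        {"http": proxy_url, "https": proxy_url}
    )
    return urllib.request.build_opener(proxy_handler)


opener = build_proxied_opener(PROXY_URL)
# With a real proxy, opener.open("https://example.com") would fetch the page
# as seen from the proxy's IP address rather than the buyer's own.
```

No request is made in this sketch; it only shows where the proxy address plugs in.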

What does this look like in real terms?

It looks like sharing your internet connection with someone else. Say you pay for a connection with a maximum download speed of 100 Mbps. If you’re only using 30 Mbps to download a file, that leaves 70 Mbps of “idle” bandwidth that isn’t being used at that moment. This is the bandwidth that companies will use to scrape the web from your IP address, and this is the resource you are already giving away, without knowing it.
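
The arithmetic in this example is just the plan's maximum speed minus whatever is in use at the moment. A tiny sketch, with a hypothetical function name and the same figures as above:

```python
def idle_bandwidth(max_speed, in_use):
    """Unused capacity, in the same units as the inputs (never below zero)."""
    return max(max_speed - in_use, 0)


print(idle_bandwidth(100, 30))  # prints 70
```

The idle figure changes from moment to moment: it shrinks while you stream or download and grows back when your connection is quiet.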

As tends to be the case with big data, this might not seem like much at first. How much would someone possibly pay to check a website from your IP address? Yet these numbers add up, as companies scrape ever more massive amounts of data in the name of market research each year. So if the acquisition of public web data is only becoming a larger and larger part of the business world, why are none of us seeing any rewards when our internet connections are the ones facilitating it?

 

2. The Residential Proxy Market Today

Today, the market for residential IP proxies is dominated by a small number of highly centralized service providers. These companies function by creating massive proxy networks using residential IPs from all over the world, then selling their unused bandwidth to buyers like our airline from above. Typically, these networks will have a list of authenticated IP addresses that are whitelisted to be used by purchasers. Unfortunately, this is where the arrangement stops being fair to all parties.

In the best-case scenario, the addresses on this whitelist are added with the full consent of their owners. Permission is granted in exchange for some type of payment, and residential internet users can voluntarily sign up to sell their network resources (the unused bandwidth from the residential internet connection tied to their IP address).

Here’s the thing: even when residential internet users consent to participation, and even when they are compensated for their resources, the network is incentivized to pay them as little as possible to maximize its own profits. There’s very little competition to provide these proxy networks, and buyers and sellers have no possible way of connecting outside of them. Thus, the networks dictate the terms by which buyers and sellers engage with each other, universally deciding to charge buyers as much as possible and pay sellers as little as possible.

In the worst case scenario, everyday internet users like yourself are cut out of the equation entirely. Whether you know it or not, many of the free apps you download have lines in their terms and conditions that sign you up to donate your unused bandwidth to proxy networks. This may help developers to monetize their products, and help proxy companies to recruit unwitting internet users, but you can bet that you’ll never see a penny of the proceeds.

The end result? You are paying for a certain amount of bandwidth, and when you don’t use it all, your ISP doesn’t refund the money. Instead, the unused bandwidth is sold to someone else, and they don’t even cut you in on the deal!

It’s fair to say that the existing landscape of the residential proxy industry falls somewhere between vaguely exploitative and outright unethical. But what can be done to address these problems and put network resources back in the hands of their rightful owners?

 

3. Introducing Grass

Simply put, Grass is a decentralized alternative to the networks described above. It is a network sharing application that allows users to sell their unused bandwidth. Where existing networks are operated by exploitative middlemen who extract value from the parties exchanging resources, Grass is an equitable solution in which both sides have an active stake in the network.

To individuals, it will appear as a web extension that is downloaded, left on, and forgotten about. It will do its work behind the scenes, helping others to acquire public web data in exchange for payment in the protocol’s native token.

Through this token, two things will happen. One, tokenholders will accrue a portion of the fees collected by the network. Two, it will function as a governance token, allowing users to vote on important decisions about the direction of the protocol. Under this system, individuals who were disenfranchised and exploited by centralized proxy networks will be given a stake and a say in the Wynd Network.

Compared to its centralized counterparts, Grass is:

  1. Ethical. These resources are already being sold out from under you, so Grass simply transfers the proceeds to their rightful owners.
  2. Democratic. By paying in tokens, Grass doesn’t just compensate you for your unused network resources — it compensates you with ownership of the network itself.
  3. Secure. There is an inherent danger in having a small number of companies control this infrastructure, which is mitigated by decentralization.

Ultimately, Grass takes a more principled approach to this industry than the bad actors who lead the pack today. Like many use cases of blockchain technology, it creates a more equitable distribution of resources by doing away with the centralized control of networks, making the world more fair in the process.

We’ll be releasing more details about the network over the next few months, and beta will launch in June. So stay tuned for more information and sign up now for early access. Before long, your internet will be back in your hands, and we can all finally touch grass.