The AI company Perplexity is complaining their bots can't bypass Cloudflare's firewall

@[email protected] · edit-2 1 month ago

@[email protected] · 1 month ago

When a firm outright admits to bypassing or trying to bypass measures taken to keep them out, you'd think that would be a slam dunk case of unauthorized access under the CFAA with felony enhancements.

@[email protected] · 1 month ago

Fuck that. I don’t need prosecutors and the courts to rule that accessing publicly available information in a way that the website owner doesn’t want is literally a crime. That logic would extend to ad blockers and editing HTML/js in an “inspect element” tag.

Encrypt-Keeper · 1 month ago

That logic would not extend to ad blockers, as the point of concern is gaining unauthorized access to a computer system or asset. Blocking ads would not be considered gaining unauthorized access to anything. In fact it would be the opposite of that.

@[email protected] · 1 month ago

gaining unauthorized access to a computer system

And my point is that defining “unauthorized” to include visitors using unauthorized tools/methods to access a publicly visible resource would be a policy disaster.

If I put a banner on my site that says “by visiting my site you agree not to modify the scripts or ads displayed on the site,” does that make my visit with an ad blocker “unauthorized” under the CFAA? I think the answer should obviously be “no,” and that the way to define “authorization” is whether the website puts up some kind of login/authentication mechanism to block or allow specific users, not to put a simple request to the visiting public to please respect the rules of the site.

To me, a robots.txt is more like a friendly request to unauthenticated visitors than it is a technical implementation of some kind of authentication mechanism.
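To make that concrete, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content, the bot name `ExampleAIBot`, and the helper `is_allowed` are all made up for illustration. The point it shows is that the check runs on the client side: a crawler consults the file only if it chooses to, and nothing in the file itself blocks a request.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the bot name and paths are illustrative only.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(user_agent: str, url: str) -> bool:
    """Return True if the robots.txt rules above permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines directly, so no network request is needed.
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # A well-behaved crawler would call this before fetching; a rude one just skips it.
    print(is_allowed("ExampleAIBot", "https://example.com/page"))
    print(is_allowed("SomeBrowser", "https://example.com/page"))
```

A client that never calls `is_allowed` still gets the page, which is why this looks like a request rather than an authentication mechanism.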

Scraping isn’t hacking. I agree with the Third Circuit and the EFF: If the website owner makes a resource available to visitors without authentication, then accessing those resources isn’t a crime, even if the website owner didn’t intend for site visitors to use that specific method.

@[email protected] · edit-2 1 month ago

When sites put challenges like Anubis or other measures to authenticate that the viewer isn’t a robot, and scrapers then employ measures to thwart that authentication (via spoofing or other means) I think that’s a reasonable violation of the CFAA in spirit — especially since these mass scraping activities are getting attention for the damage they are causing to site operators (another factor in the CFAA, and one that would promote this to felony activity.)

The fact is these laws are already on the books, we may as well utilize them to shut down this objectively harmful activity AI scrapers are doing.

@[email protected] · 1 month ago

The fact is these laws are already on the books, we may as well utilize them to shut down this objectively harmful activity AI scrapers are doing.

Silly plebe! Those laws are there to target the working class, not to be used against corporations. See: Copyright.

@[email protected] · 1 month ago

Nah, that would also mean using Newpipe, YoutubeDL, Revanced, and Tachiyomi would be a crime, and it would only take the re-introduction of WEI to extend that criminalization to the rest of the web ecosystem. It would be extremely shortsighted and foolish of me to cheer on the criminalization of user spoofing and browser automation because of this.

@[email protected] · edit-2 1 month ago

Do you think DoS/DDoS activities should be criminal?

If you’re a site operator and the mass AI scraping is genuinely causing operational problems (not hard to imagine, I’ve seen what it does to my hosted repositories pages) should there be recourse? Especially if you’re actively trying to prevent that activity (revoking consent in cookies, authorization captchas).

In general I think the idea of “your right to swing your fists ends at my face” applies reasonably well here — these AI scraping companies are giving lots of admins bloody noses and need to be held accountable.

I really am amenable to arguments wrt the right to an open web, but look at how many sites are hiding behind CF and other portals, or outright becoming hostile to any scraping at all; we’re already seeing the rapid death of the ideal because of these malicious scrapers, and we should be using all available recourse to stop this bleeding.

@[email protected] · 1 month ago

DoS attacks are already a crime, so of course the need for some kind of solution is clear. But any proposal that gatekeeps the internet and restricts the freedoms with which the user can interact with it is no solution at all. To me, the openness of the web shouldn’t be something that people just consider, or are amenable to. It should be the foundation that all reasonable proposals treat as a first principle.

@[email protected] · 1 month ago

That same logic is how Aaron Swartz was cornered into suicide for scraping JSTOR. That reading of the law is widely agreed to be a bad idea by a wide range of legal experts, including SCOTUS, whose 2021 decision in Van Buren v. US struck this interpretation off the books.

Encrypt-Keeper · 1 month ago

If I put a banner on my site that says “by visiting my site you agree not to modify the scripts or ads displayed on the site,” does that make my visit with an ad blocker “unauthorized” under the CFAA?

How would you “authorize” a user to access assets served by your systems based on what they do with them after they’ve accessed them? That doesn’t logically follow so no, that would not make an ad blocker unauthorized under the CFAA. Especially because you’re not actually taking any steps to deny these people access either.

AI scrapers, on the other hand, are a type of user that you’re not authorizing to begin with, and if you’re using Cloudflare’s bot protection you’re putting into place a system to deny them access. To purposefully circumvent that system would be considered unauthorized access.

@[email protected] · 1 month ago

That doesn’t logically follow so no, that would not make an ad blocker unauthorized under the CFAA.

The CFAA also criminalizes “exceeding authorized access” in every place it criminalizes accessing without authorization. My position is that mere permission (in a colloquial sense, not necessarily technical IT permissions) isn’t enough to define authorization. Social expectations and even contractual restrictions shouldn’t be enough to define “authorization” in this criminal statute.

To purposefully circumvent that access would be considered unauthorized.

Even as a normal non-bot user who sees the cloudflare landing page because they’re on a VPN or happen to share an IP address with someone who was abusing the network? No, circumventing those gatekeeping functions is no different than circumventing a paywall on a newspaper website by deleting cookies or something. Or using a VPN or relay to get around rate limiting.

The idea of criminalizing scrapers or scripts would be a policy disaster.

@[email protected] · 29 days ago

Site owners currently do and should have the freedom to decide who is and is not allowed to access the data, and to decide for what purpose it gets used for. Idgaf if you think scraping is malicious or not, it is and should be illegal to violate clear and obvious barriers against them at the cost of the owners and unsanctioned profit of the scrapers off of the work of the site owners.

@[email protected] · 28 days ago

to decide for what purpose it gets used for

Yeah, fuck everything about that. If I’m a site visitor I should be able to do what I want with the data you send me. If I bypass your ads, or use your words to write a newspaper article that you don’t like, tough shit. Publishing information is choosing not to control what happens to the information after it leaves your control.

Don’t like it? Make me sign an NDA. And even then, violating an NDA isn’t a crime, much less a felony punishable by years of prison time.

Interpreting the CFAA to cover scraping is absurd and draconian.

@[email protected] · edit-2 28 days ago

If you want anybody and everyone to be able to use everything you post for any purpose, right on, good for you, but don’t try to force your morality on others who rely on their writing, programming, and artworks to make a living to survive.

@[email protected] · 28 days ago

I’m gonna continue to use ad blockers and yt-dlp, and if you think I’m a criminal for doing so, I’m gonna say you don’t understand either technology or criminal law.

cm0002 · 1 month ago

You say, just as news breaks that the top German court has overturned a decision that declared “ad blocking isn’t piracy”

Encrypt-Keeper · 1 month ago

Unauthorized access into a computer system and “Piracy” are two very different things.

cm0002 · 1 month ago

Please instruct me on how I go to the timeline where the legal system always makes decisions based on logic, reasoning, evidence and fairness and not…the opposite…of all those things

You have a lot of trust placed in the courts to actually do the right thing

Encrypt-Keeper · edit-2 1 month ago

I’m not saying courts couldn’t pass a new law saying whatever they want. But the laws we have today would not allow for ad blocking to be considered unauthorized access. Not under the CFAA as mentioned.

I said “The logic would not extend to that” not that a legal system could not act illogically.

cm0002 · 1 month ago

The original comment reply to you was all about how the legal system would act, that’s the primary concern. All it would take is a Trump loyalist judge, a Trump leaning appeals court and the right-wing Supreme Court and boom suddenly the CFAA covers a whole lot more than what was “logical”

@[email protected] · 1 month ago

Ehhhh, you are gaining access to content on the assumption that you are going to interact with ads and thus bring revenue to the person and/or company producing said content. If you block ads, you remove the authorisation brought to you by ads.

Encrypt-Keeper · edit-2 1 month ago

That doesn’t make any logical sense. You can’t tie legal authorization to an unsaid implicit assumption, especially when that is in turn based on what you do with the content you’ve retrieved from a system after you’ve accessed and retrieved it.

When you access a system, are you authorized to do so, or aren’t you? If you are, that authorization can’t be retroactively revoked. If that were the case, you could be arrested for having used a computer at a job, once you’ve quit. Because even though you were authorized to use it and your corporate network while you worked there, now that you’ve quit and are no longer authorized that would apply retroactively back to when you DID work there.

ℍ𝕂-𝟞𝟝 · 1 month ago

There was no header on the request saying I want ads though

gian · 30 days ago

Careful: by that logic, even not looking at an ad positioned at the bottom of the page (or otherwise not visible without scrolling) would remove the authorisation brought to you by ads.

@[email protected] · 1 month ago

They already prosecute people under the unauthorized access provision. They just don’t prosecute rich people under it.

@[email protected] · 1 month ago

They prosecuted and convicted a guy under the CFAA for figuring out the URL schema for an AT&T website designed to be accessed by the iPad when it first launched, and then just visiting that site by trying every URL in a script. And then his lawyer (the foremost expert on the CFAA) got his conviction overturned:

https://www.eff.org/cases/us-v-auernheimer

We have to maintain that fight, to make sure that the legal system doesn’t criminalize normal computer tinkering, like using scripts or even browser settings in ways that site owners don’t approve of.

@[email protected] · edit-2 1 month ago

Right? Isn’t this a textbook DMCA violation, too?

@[email protected] · 1 month ago

for us, not for them. wait until they argue in court that actually it’s us at fault and we need to provide access or else

floquant · 1 month ago

It’s difficult to be a shittier company than OpenAI, but Perplexity seems to be trying hard.

BigFig · 1 month ago

Step 1, SOMEHOW find a more punchable face than Altman

@[email protected] · edit-2 1 month ago

put META android zuckerberg on or mechahitler musk.

@[email protected] · 30 days ago

Altman’s face looks like it’s already been punched

@[email protected] · 1 month ago

This is a nice CloudFlare ad

@[email protected] · 1 month ago

yeah. still not worth dealing with fucking cloudflare. fuck cloudflare.

Int32 · 1 month ago

DEATH TO CLOUDFLARE!

@[email protected] · 1 month ago

That would be terrible for a lot of people as they are the only company providing such services that doesn’t charge for traffic.

Int32 · edit-2 30 days ago

They can use web.archive.org as a CDN (I do that to Cloudflare websites). But honestly, Cloudflare or not, the internet is broken.

@[email protected] · 30 days ago

Can you explain please? How can I use archive.org as a cdn for my website?

Int32 · 29 days ago

just take a snapshot of your website… then make all links to your website link to that snapshot, and turn your server off.

@[email protected] · 29 days ago

Oh, well, it’s okay if that suits you. It’s just not at all an alternative to Cloudflare.

@[email protected] · 30 days ago

I’m out of the loop, what’s wrong with Cloudflare?

@[email protected] · 30 days ago

Centralization, mostly, but also their hands-off approach to most fascist content.

@[email protected] · 30 days ago

They kind of have to be hands off or risk losing safe harbor protections.

@[email protected] · 29 days ago

I get the centralization concerns, but I would think that’s on the consumer since there are other options. As for the fascist content, as another commenter said, they could risk their safe harbor if they started regulating content that they weren’t legally required to regulate.

Just my thoughts.

@[email protected] · 1 month ago

Uh… good?

Frezik · 1 month ago

Traveling snake oil salesman complains he can’t pick people’s locks.

@[email protected] · 30 days ago

Uh, are they admitting they are trying to circumvent technological protections set up to restrict access to a system?

Isn’t that a literal computer crime?

dinckel · 30 days ago

No-no, see. When an AI-first company does it, it’s actually called courageous innovation. Crimes are for poor people

Silicon · 30 days ago

See: Facebook/Meta

@[email protected] · 30 days ago

puts on evil hat CloudFlare should DRM their protection then DMCA Perplexity and other US based “AI” companies to oblivion. Side effect, might break the Internet.

@[email protected] · 30 days ago

Worth it.

@[email protected] · 29 days ago

The Internet was already ruined, cloudflare is just bandaids on top of band aids.

@[email protected] · 1 month ago

You could say they are… Perplexed.

katy ✨ · 1 month ago

rare cloudflare w

@[email protected] · 1 month ago

As far as security is concerned, their w’s are pretty common tbh. It’s just the whole centralization issue.

@[email protected] · 1 month ago

That’s the entire point, dipshit. I wish we got one of the cool techno dystopias rather than this boring corporate idiot one.

Leon · 1 month ago

I’m still holding out for Stephen Hawking to mail out Demon Summoning programs.

sylver_dragon · 1 month ago

You’d think that a competent technology company, with their own AI would be able to figure out a way to spoof Cloudflare’s checks. I’d still think that.

snooggums · edit-2 1 month ago

Or find a more efficient way to manage data, since their current approach is basically DDOSing the internet for training data and also for responding to user interactions.

@[email protected] · 30 days ago

This is not about training data, though.

Perplexity argues that Cloudflare is mischaracterizing AI Assistants as web crawlers, saying that they should not be subject to the same restrictions since they are user-initiated assistants.

Personally I think that claim is a decent one: user-initiated request should not be subject to robot limitations, and are not the source of DDOS attack to web sites.

I think the solution is quite clear, though: either make use of the user’s identity to waltz through the blocks, or even make use of the user’s browser to do it. Once a captcha appears, let the user solve it.

Though technically making all this happen flawlessly is quite a big task.

snooggums · 30 days ago

Personally I think that claim is a decent one: user-initiated request should not be subject to robot limitations, and are not the source of DDOS attack to web sites.

They are one of the sources!

The AI scraping when a user enters a prompt is DDOSing sites in addition to the scraping for training data that is DDOSing sites. These shitty companies are repeatedly slamming the same sites over and over again in the least efficient way because they are not using the scraped data from training when they process a user prompt that does a web search.

Scraping once extensively and scraping a bit less but far more frequently have similar impacts.

@[email protected] · 30 days ago

When a user enters a prompt, the backend may retrieve a handful of pages to serve that prompt. It won’t retrieve all the pages of a site. That is hardly different from a user using a search engine and opening the 5 topmost links in tabs. If that is not a DoS attack, then an agent doing the same isn’t a DDoS attack.

Constructing the training material in the first place is a different matter, but if you’re asking about fresh events or new APIs, the training data just doesn’t cut it. The training, and consequently the retrieval of its material, was done a long time ago.

The Quuuuuill · 1 month ago

see, but they’re not competent. further, they don’t care. most of these ai companies are snake oil. they’re selling you a solution that doesn’t meaningfully solve a problem. their main way of surviving is saying “this is what it can do now, just imagine what it can do if you invest money in my company.”

they’re scammers, the lot of them, running ponzi schemes with our money. if the planet dies for it, that’s no concern of theirs. ponzi schemes require the schemer to have no long term plan, just a line of credit that they can keep drawing from until they skip town before the tax collector comes

lemmyng · 1 month ago

Perplexity: “But that would cost us moneeyyyy!”

@[email protected] · 1 month ago

Good. I went through my CF panel, and blocked some of those “AI Assistants” that by default were open, including Perplexity’s.

@[email protected] · edit-2 1 month ago

CF panel? Your light bulb??

@[email protected] · 1 month ago

CF == Cloudflare :)

@[email protected] · 30 days ago

I don’t like Cloudflare but it’s nice that they allow people to stop AI scraping if they want to

@[email protected] · 30 days ago

CloudFlare has become an Internet protection racket and I’m not happy about it.

@[email protected] · edit-2 29 days ago

they’re good at protecting websites but damn, having a company being MITM feels so wrong

sandwich.make(bathing_in_bismuth) · 29 days ago

The shit they know. Plus their support for non-JS users or Tor is pure shite

@[email protected] · edit-2 29 days ago

Yeah, a few sites outright refuse to work because cloudflare just poops. EDIT: It was supposed to say “loops”, but I’m keeping it.

Avicenna · 1 month ago

ask AI how to do it?

dustycups · 1 month ago

They tried nothing & they’re all out of ideas.

@[email protected] · 1 month ago

Can’t believe I’ve lived to see Cloudflare be the good guys

@[email protected] · 29 days ago

They’re not. They’re using this as an excuse to become paid gatekeepers of the internet as we know it. All that’s happening is that Cloudflare is using this to maneuver into a position where they can say “nice traffic you’ve got there - would be a shame if something happened to it”.

AI companies are crap.

What Cloudflare is doing here is also crap.

And we’re cheering it on.

@[email protected] · 29 days ago

Lesser of two bad guys maybe?

Perplexity Says Cloudflare Is Blocking Legitimate AI Assistants