You probably don’t need me to tell you what happened when companies used scraped public data — often without the permission of those who created it — from news articles, books, and creative projects to teach AI tools how to, say, generate news articles, books, and creative projects.
The New York Times is currently suing OpenAI for allegedly using its expansive archives without permission to train chatbots (in a recent filing, OpenAI accused the Times of hiring “someone to hack” ChatGPT in order to prove that the chatbot was stealing their content). Getty Images sued Stability AI, the company behind Stable Diffusion, for copyright infringement. Other lawsuits from authors and creators, angry to find that their works were used to train AI models, have faced setbacks in court.
The ways AI systems use the work of journalists, musicians, and photographers have pretty consequential implications for our information and cultural ecosystem and for the people who work in the fields that AI companies seem dead-set on developing tools to replace. The need to gather more and more training data with as little fuss as possible means that anyone who’s an online poster (whether it’s a fandom Tumblr account, an active Reddit presence, or a personal blog) could see access to their content being sold by the platforms hosting it to one of these big AI companies.
Below is a quick guide to what we know right now about who might be selling your best posts as training data.
Tumblr and WordPress.com
Earlier this week, 404 Media reported that Automattic, the parent company for Tumblr and WordPress, was preparing to announce deals selling user data to OpenAI and Midjourney. According to 404’s reporting, which describes such a deal as “imminent,” the data seems likely to include user posts on Tumblr and on WordPress.com. On Wednesday, a day after 404’s report, Automattic announced a way for users to opt out of sharing their public content with third parties.
The Tumblr staff announcement on the change framed the whole thing as a sign that the company was working to protect its users. “We already discourage AI crawlers from gathering content from Tumblr and will continue to do so,” the announcement read, “save for those with which we partner.”
Automattic said in a statement that it was “working directly with select AI companies as long as their plans align with what our community cares about: attribution, opt-outs, and control,” but has not provided any further information on the reported deals with OpenAI and Midjourney.
Reddit
Reddit’s enormous archives of posts are driven by the labor of volunteers: Unpaid subreddit moderators oversee communities of unpaid users. Their collective efforts on Reddit make the platform valuable.
So when Reddit announced that it was launching an IPO, the company reached out to a selection of mods and frequent posters to offer them the opportunity to buy stock early. Some of those who received the offer were not super enthusiastic about it. But Reddit does not need buy-in from its users to profit from their work: It has already sold access to their posts to Google.
The reported deals above are just the ones that have become public. That doesn’t mean large AI models aren’t already being trained on your posts elsewhere on the internet.
Last year, the Washington Post examined one of the massive data sets of scraped public internet data used to train generative AI models and found everything from World of Warcraft message boards to Patreon and Kickstarter and several huge repositories of personal blogs. And it should not be a surprise that Meta uses public posts from Facebook and Instagram to train its AI models.