When Technology Takes Over

AI companies depend on large amounts of data to train their models. To create an AI image generation tool like Midjourney or OpenArt, companies need vast repositories of imagery to work from, so they often turn to AI training datasets such as LAION-5B. Organisations like LAION.ai scrape the web to extract data at scale for their collections. Companies have utilised images (unlawfully) from their competitors’ sites for years, but what we’re talking about here is different for two reasons:

Scraping happens at scale: A human being taking, using or scraping content or images is a slow, arduous process. What we’re talking about here is large-scale scraping of entire sites and entire collections, automatically, day after day.
Historically, if someone right-clicked and saved your images, there was a limit to the damage they could do. They could use them on their own sites, or on the sites of suppliers or competitors, or publish them without acknowledging copyright, but there was a limit. What we’re talking about here, with large-scale scraping for AI models, is the large-scale theft of your images to be fed into the machine so that others can create derivative works directly from your IP.

What is web scraping? Here’s an explanation

During another one of my deep dives down a delightful OSINT rabbit hole, I came across the site Have I been trained?, which lets you search for your own work or imagery in the popular AI training datasets and see if your imagery has been used in the training of any AI models.

In a 2020 article, Tom Waterman describes a US court ruling (relating to LinkedIn’s request to prevent an analytics company from scraping its data) as:

“a historic moment in the data privacy and data regulation era. It showed that any data that is publicly available and not copyrighted is fair game for web crawlers.”

The theory goes that as long as the data scraped is publicly available, web crawlers can utilise that data provided they don’t post the content themselves. The example Tom gives is that a crawler could scrape YouTube video metadata, so long as it doesn’t post the videos themselves, which are subject to copyright.

However, a 2022 US lawsuit profiled in The New Yorker, a new class action suit brought by a Tennessee artist, alleges that every image a generative tool produces “is an infringing, derivative work”.

And it gets worse. In 2022 an artist found that her private medical photos had been scraped and used in an AI training dataset. We’re not talking about a small problem here, and it’s not limited to creative IP.

In July of this year, OpenAI openly admitted that its bot (GPTBot) is being used to scrape and collect online content for AI model training. The next version, GPT-5, will likely be trained on the data scraped up by this bot.

Having worked in the architectural product game for a long time at Eco Outdoor, I did a quick search and found that hundreds of their images appear in LAION-5B as training data.

I then did a search for a friend of mine, Pete Stutchbury, and found that he too was subject to crawlers scraping a large amount of his creative project photography, albeit mostly from design publishers’ media platforms rather than his own website.

Now the distinction between utilising metadata and using the actual content gets blurry here. Unlike in the YouTube example, the imagery these crawlers are scraping is itself copyrighted and belongs to Pete (and potentially to the media platforms or the photographers who took the photos).

Don’t get me wrong, I’m not a technophobe; quite the opposite, in fact. I was an early Midjourney user and love the creative freedom it offers. BUT, and there’s a big but, consider people like Pete, or any other architect (replace architect with industrial designer, interior designer and so forth and you get the picture). When images of their signature works, designed, built and crafted over many years, are fed into an AI training dataset, what we’ll see over time is an enormous increase in derivative works and derivative references that are completely disconnected from their creators (and owners). The implications are deep and wide, from the loss of artistic integrity, creative ownership and sovereignty to the dissolution of intellectual property rights and the monopolisation of creativity by big tech.

I’m no digital genius, but presumably there are things web companies can do to make it harder for crawlers to scrape their sites. A quick search reveals a few recent 2023 articles that might provide a starting point:

How to block OpenAI’s new AI-training web crawler from ingesting your data
Major websites block AI crawlers from scraping their content
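
For the curious, here is a rough sketch of what that kind of blocking can look like in practice. Crawlers that behave themselves read a site's robots.txt file before fetching anything, and GPTBot is the user agent name OpenAI has published for its crawler. The robots.txt content, the example.com address and the is_allowed helper below are all made up for illustration, and the check uses Python's built-in robots.txt parser. Bear in mind this only stops crawlers that choose to honour robots.txt; it is a request, not a lock.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration: turn GPTBot away from the
# whole site while leaving every other crawler unaffected.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if these robots.txt rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# example.com and the image path are placeholders, not a real site.
image_url = "https://example.com/projects/house.jpg"
gptbot_allowed = is_allowed(ROBOTS_TXT, "GPTBot", image_url)
other_allowed = is_allowed(ROBOTS_TXT, "SomeOtherCrawler", image_url)

print("GPTBot may fetch the image:", gptbot_allowed)
print("Other crawlers may fetch the image:", other_allowed)
```

The same check can be pointed at the rules from your own site to confirm they say what you think they say before you rely on them.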

We need to do more. Regulation simply cannot keep pace with the evolving technology, and many people aren’t thinking about the ramifications every time they type /imagine into Midjourney. So what does this mean for the future of our creative industries? We need to be talking about this, and big tech (and the platforms that feed off them) needs to be held accountable.

What about the responsibilities of the personal user? Not a conversation we seem ready to have. Yet.
