How we built the model to help you make a sustainable choice

18 December 2019

We built a database of 2000+ clothing brands, each with a sustainability rating. In this article, we walk you through the interesting parts of the process and show how it helps you make a sustainable choice in clothing. The code is available as open source for you to explore and improve.

The fashion industry is impactful; let’s make that impact positive

Pollution, poor working conditions and animal welfare are topics that, unfortunately, often lose out to profit. This holds for the modern ‘fast-fashion’ industry, which produces most of our clothes. Fortunately, some brands and initiatives are making an effort, but how do you know as a consumer? Searching for information is very time-consuming. Therefore, we used scraping, artificial intelligence, natural language processing and explainability to provide more information about sustainable clothing, faster than current approaches.

Our database will not be the answer to all adverse effects of the ‘fast-fashion’ industry, but we hope to provide a piece of the sustainability puzzle. Do you want to know why we do this? Here you can read our introduction blog about this project!

Models need data

Our process consisted of two main steps: gathering the data (crawling and scraping) and predicting the sustainability of each brand. We gathered 2000+ clothing brands and their homepages using the Google Search API. Moreover, to build a supervised model, we needed examples of brands that are sustainable and brands that are not. We collected ratings from a few initiatives that manually rate clothing brands on their level of sustainability, such as Project Cece.

To keep our process efficient, we weighed speed against the amount of useful information. We therefore crawled several levels of each brand’s website and selected the homepage plus 10 relevant pages based on a dictionary. These relevant pages included pages such as about-us, history and blogs, and excluded product pages, shopping carts and store locators. We used BeautifulSoup4 (Python) to obtain all text in the headers and paragraphs. As many brands do not have an English website, we used the Google Translator API to translate all non-English content.
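As an illustration, the page selection and text extraction could look roughly like the sketch below. It is not the project's actual code: the keyword lists and the function names `select_pages` and `extract_text` are hypothetical, and Python's built-in `html.parser` stands in for BeautifulSoup4 so that the example runs without extra dependencies.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hypothetical keyword lists; the article does not publish its dictionary.
RELEVANT = ("about", "history", "blog", "story", "sustainab")
EXCLUDED = ("product", "cart", "checkout", "store-locator")
TEXT_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6", "p"}


def select_pages(homepage, links, max_extra=10):
    """Return the homepage plus up to `max_extra` relevant same-site links."""
    host = urlparse(homepage).netloc
    selected = []
    for link in links:
        parsed = urlparse(link)
        path = parsed.path.lower()
        # Skip other sites, the homepage itself and duplicates.
        if parsed.netloc != host or link == homepage or link in selected:
            continue
        if any(word in path for word in EXCLUDED):
            continue
        if any(word in path for word in RELEVANT):
            selected.append(link)
    return [homepage] + selected[:max_extra]


class HeaderParagraphParser(HTMLParser):
    """Collect text inside header and paragraph tags.

    Assumes every <p> and <h*> tag is explicitly closed.
    """

    def __init__(self):
        super().__init__()
        self.depth = 0  # > 0 while inside a header or paragraph
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in TEXT_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in TEXT_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    """Return the header and paragraph text of one page as a single string."""
    parser = HeaderParagraphParser()
    parser.feed(html)
    return " ".join(parser.chunks)
```

Filtering on the URL path before downloading keeps the crawl small, which is one way to get the speed side of the trade-off described above.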

Wordclouds with information

In the EDA (Exploratory Data Analysis) stage, we already saw some clear differences between the data from ‘sustainable’ and ‘not-sustainable’ brands. These two word-clouds contain the most frequent words per class. We ignored words that occurred often in both classes and applied a little pre-processing, such as lowercasing and removing punctuation and stopwords. Interestingly, we mainly see brand names in the ‘not-sustainable’ word-cloud. Fortunately, the model does not train on the names of specific brands, which we know through explainability.
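The word counts behind such word-clouds can be sketched as follows. This is a minimal example under our own assumptions: the stopword list is a tiny illustrative one, and the rule for "occurs often in both classes" (a word is in the top `shared_top_n` of every class) is a guess at one reasonable definition, not the rule used in the project.

```python
import string
from collections import Counter

# Tiny illustrative stopword list; a real run would use a fuller one.
STOPWORDS = {"the", "and", "a", "of", "to", "in", "we", "our", "is", "for"}


def tokenize(text):
    """Lowercase, strip ASCII punctuation and drop stopwords."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [word for word in cleaned.split() if word not in STOPWORDS]


def distinctive_words(docs_by_class, top_n=50, shared_top_n=100):
    """Most frequent words per class, minus words frequent in every class."""
    counts = {
        label: Counter(word for doc in docs for word in tokenize(doc))
        for label, docs in docs_by_class.items()
    }
    # Words that rank among the most frequent in all classes carry no signal.
    frequent = [{w for w, _ in c.most_common(shared_top_n)} for c in counts.values()]
    shared = set.intersection(*frequent)
    return {
        label: [(w, n) for w, n in c.most_common() if w not in shared][:top_n]
        for label, c in counts.items()
    }
```

The resulting `(word, count)` pairs per class are the kind of input a word-cloud library would take.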

High accuracy achieved using SVM

To assign sustainability labels to clothing brands, we used supervised machine learning algorithms. After dividing our data into a training set and a hold-out set, we kept improving our model using the results of 5-fold cross-validation. We continued with the Linear SVC (Support Vector Machine) classifier, with TF-IDF vectors of our data as its input, as this model performed best during cross-validation. As shown in the table below, the accuracy is sufficiently high. One could argue that a metric that punishes false negatives (‘sustainable’ brands classified as ‘not-sustainable’) more heavily would be more appropriate.
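A set-up along these lines can be sketched with scikit-learn. The toy texts below are made up and only stand in for the scraped brand pages, and the hyperparameters are library defaults rather than the project's settings. The F2 score is shown as one possible metric that weighs false negatives more heavily than accuracy does; the article does not say which alternative metric, if any, was considered.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC


def pairs(phrases):
    """Build toy documents by combining two different phrases."""
    return [f"{a} {b}" for a in phrases for b in phrases if a != b]


# Made-up stand-in data: 12 documents per class.
green = ["organic cotton", "fair wages", "recycled fabric", "renewable energy"]
plain = ["summer sale", "new arrivals", "free shipping", "trending styles"]
texts = pairs(green) + pairs(plain)
labels = ["sustainable"] * 12 + ["not-sustainable"] * 12

# Keep a hold-out set aside; tune only on the training part.
X_train, X_hold, y_train, y_hold = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)

# TF-IDF vectors feed a linear support vector classifier.
model = make_pipeline(TfidfVectorizer(), LinearSVC(random_state=0))

# 5-fold cross-validation on the training set.
accuracy = cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy")

# F2 favours recall, so missing a 'sustainable' brand costs more.
f2 = make_scorer(fbeta_score, beta=2, pos_label="sustainable")
f2_scores = cross_val_score(model, X_train, y_train, cv=5, scoring=f2)

# Final check on data the model has never seen.
model.fit(X_train, y_train)
holdout_accuracy = model.score(X_hold, y_hold)
print(accuracy.mean(), f2_scores.mean(), holdout_accuracy)
```

Putting the vectorizer inside the pipeline means the TF-IDF vocabulary is refitted on each training fold, so no information leaks from the validation fold.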

Improving our model through explainability

We firmly believe in explainability, because you need to know why a certain brand is sustainable, while the other is not. ‘Just trust me’ is nothing compared to ‘trust me because…’.

Looking at the most important features in the image below, obtained with a handy tool called Explain Like I’m 5 (ELI5), we see that the ‘sustainable’ label is predicted correctly, but that the prediction relies heavily on noise such as ‘javascript’ and ‘cart’. We used this insight to refine our pre-processing, which led to more sensible term weights at the cost of a slight decrease in model performance on the training set (0.814), while accuracy on the hold-out set remained the same.

Correct prediction for Patagonia, with ELI5 highlighting the features that mattered most for the decision. The first text snippet contains a lot of noise, such as ‘account’, ‘javascript’ and ‘cart’. After refined pre-processing, the model relies on more logical features.

Strong results by knowing the weakness

Building a 100% correct model is almost impossible. Besides using the output probability of the model, we came up with the idea of using Word2Vec embeddings to find brand neighbours. For each brand, we collect the 300-dimensional embeddings of its 100 most frequent words. Using these vectors, we can make two calculations:

  1. The distance between the vectors of a brand and the average vector of a class, and
  2. The class of the closest brand, i.e. which brands are the neighbours of a given brand.

If the neighbours of an unlabelled brand agree with the prediction, we trust the model. On the other hand, when a new brand is very different from any brand the model has seen before, we assign the ‘unknown’ class.
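The two calculations and the ‘unknown’ fallback can be sketched as below. Several details are assumptions on our part, since the article does not specify them: a brand is represented by the mean of its word embeddings, distance is cosine distance, the `max_distance` threshold is arbitrary, and a disagreement between the neighbours and the prediction also yields ‘unknown’. The word embeddings themselves are taken as given.

```python
import numpy as np


def brand_vector(word_vectors):
    """Average a brand's word embeddings into one vector (assumption)."""
    return np.mean(np.asarray(word_vectors, dtype=float), axis=0)


def cosine_distance(a, b):
    """1 minus cosine similarity: 0 for same direction, 2 for opposite."""
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def check_prediction(vector, predicted, labelled, max_distance=0.5):
    """Return `predicted` if the neighbourhood supports it, else 'unknown'.

    `labelled` maps brand name -> (label, brand vector).
    """
    # 1. Distance to the average vector of each class.
    classes = {label for label, _ in labelled.values()}
    centroids = {
        c: np.mean([v for label, v in labelled.values() if label == c], axis=0)
        for c in classes
    }
    nearest_class = min(centroids, key=lambda c: cosine_distance(vector, centroids[c]))

    # 2. Class of the closest labelled brand.
    neighbour = min(labelled, key=lambda b: cosine_distance(vector, labelled[b][1]))
    neighbour_label, neighbour_vector = labelled[neighbour]

    # A brand far from everything seen before is outside the model's scope.
    if cosine_distance(vector, neighbour_vector) > max_distance:
        return "unknown"
    if neighbour_label == predicted and nearest_class == predicted:
        return predicted
    return "unknown"
```

With real data, `vector` would be `brand_vector` applied to the 100 embeddings of a brand, and `predicted` the label returned by the classifier.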

What you do today can improve all our tomorrows

We would be very happy if you shared your thoughts, ideas and improvements. The open-source code, plus ideas for improvements, can be found here.

