Humans – Stable Diffusion Model

by Ostris - Jaret Burkett | Jul 2, 2023

Logo for Humans v1
photo of a man, Serbian, middle-aged with crow's feet, architect
photo of a woman, Vietnamese, braids in elaborate updo, economist
photo of a man, Haitian, middle-aged with graying beard, doctor
photo of a woman, argentinian, social worker, headshot
photo of a man, Australian, unibrow, depth of field, musician
photo of a man, sagging cheeks, jovial expression, Canadian, chef
photo of a man, Norwegian, deep-set eyes, police officer
photo of a woman, Indonesian, wearing a hijab, teacher
photo of a woman, Kenyan, head tilted with a smile, nurse
photo of a woman, Norwegian, large aquiline nose, economist
photo of a woman, Zimbabwean, braids in elaborate updo, economist
photo of a man, Thai, stubble on a square jaw, software engineer
photo of a man, Vietnamese, middle-aged with crow's feet, architect
photo of a woman, Chilean, face framed by curly hair, doctor
photo of a man, Turkish, deep-set eyes, thin upper lip with fuller lower lip, scientist
photo of a man, Argentinian, squarish face, narrow mouth with thin lips, chef
photo of a man, Indian, round glasses, thin eyebrows, writer
photo of a man, Vietnamese, deep-set eyes, police officer

This model is designed to produce photorealistic images of normal people. Most SD models can only produce beautiful people. This is not that. You will get acne, moles, ratty hair, crooked teeth, wrinkles, and, well, ordinary people.

The short version:

There are thousands of trigger words, which can be found at https://gist.github.com/jaretburkett/cf8c224243834172fc13f72aaf49811d . For a list sorted by frequency, see https://gist.github.com/jaretburkett/41370fdf69b791d2b406f3fa538d4b32 . The big one to know is the word “face”. A significant portion of the dataset is faces, and they were all tagged with “face”. Use it to get faces; without it you will get shots from farther away, usually portraits. The model does well with simple prompts as well as with much more complex prompts than normal SD models can handle. It will generate wide variations of people from seed to seed, even with the same prompt. It was trained on bucket sizes of [328, 512, 640, 768, 896] at various aspect ratios, and should be able to produce images at those sizes without any hi-res fixes.
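As a rough illustration of the points above, here is a minimal Python sketch that assembles a prompt in the same comma-separated style as the example captions, optionally adds the “face” trigger word, and snaps a requested side length to the nearest trained bucket size. The helper names build_prompt and nearest_bucket are hypothetical and not part of any library, and reading the bucket list as per-side pixel sizes is an assumption.

```python
# Bucket sizes quoted in the post; treated here as per-side pixel sizes (an assumption).
BUCKET_SIZES = [328, 512, 640, 768, 896]


def build_prompt(subject, descriptors, close_up=True):
    """Join a subject and descriptors into a comma-separated prompt.

    When close_up is True, the trigger word "face" is appended, which the
    post says steers the model toward close-up faces.
    """
    parts = ["photo of a " + subject] + list(descriptors)
    if close_up:
        parts.append("face")
    return ", ".join(parts)


def nearest_bucket(size):
    """Return the trained bucket size closest to the requested side length."""
    return min(BUCKET_SIZES, key=lambda bucket: abs(bucket - size))


if __name__ == "__main__":
    prompt = build_prompt("woman", ["Kenyan", "head tilted with a smile", "nurse"])
    width, height = nearest_bucket(700), nearest_bucket(1000)
    print(prompt)
    print(width, height)
```

The resulting prompt string and dimensions can be passed to whatever Stable Diffusion front end you already use. Leave close_up off when you want a shot from farther away.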

The long version:

The Dataset: I have been building this dataset for around a decade. It has around 100k (and growing) carefully curated, balanced, and labeled images, with the goal of removing the bias in generative AI models. It was built and added to over the years for various products I have created, and I figured it would be good to throw Stable Diffusion at it. The dataset is designed to have mostly normal people, though there are some beautiful people in it as well. I always tried to keep it as balanced with the general population as I could, which is hopefully apparent from the images this model can generate. There are a lot of faces in the dataset, and they are labeled with the keyword “face” to help trigger or not trigger a close-up of a face. Around half of the dataset is faces only. I am working on balancing this with more portraits, headshots, and full body shots for version 2.

Labeling: Labeling was done partially by hand over the years, but more recently mostly by BLIP2. I created a custom keyword list for photos of people that I use for the tagging library in addition to the standard BLIP2 captions. You can find this keyword list here https://gist.github.com/jaretburkett/cf8c224243834172fc13f72aaf49811d . It was mostly made with the help of GPT-4, and I plan to manually prune and improve it for version 2. I also plan to release my tagging code soon, but those familiar with custom interrogators can probably put the list to use already. The main purpose of the labeling process is to be thorough in describing people. Most SD models handle little more than old, young, man, woman, hair color, and maybe race. I wanted to be able to specify nose shapes, cheekbone depth, complexion, national origin, eye shape, hair styles, and other very nuanced details, and so far I am very pleased with the results. The model now knows subtle details of the human face. This should aid in creating embeddings (textual inversions), as the model already knows how to create these unique features of a face; they just need to be triggered by the embedding.
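For readers who want to experiment with the keyword list before the tagging code is released, here is a minimal sketch of one way to combine a caption with scored keywords. This is not the tagging code described above, which is unreleased. The function merge_caption_and_tags, the threshold, the tag limit, and the idea that an interrogator hands back a score per keyword are all assumptions made for illustration.

```python
def merge_caption_and_tags(caption, scored_keywords, threshold=0.25, max_tags=8):
    """Append the highest-scoring keywords to a caption.

    scored_keywords maps each keyword to a relevance score (hypothetical
    interrogator output). Keywords already contained in the caption and
    keywords scoring below the threshold are skipped.
    """
    ranked = sorted(scored_keywords.items(), key=lambda item: item[1], reverse=True)
    tags = []
    for keyword, score in ranked:
        # Scores are sorted descending, so nothing after this point qualifies.
        if score < threshold or len(tags) >= max_tags:
            break
        if keyword.lower() not in caption.lower():
            tags.append(keyword)
    return ", ".join([caption] + tags)


if __name__ == "__main__":
    scores = {
        "beard": 0.9,
        "graying beard": 0.8,
        "deep-set eyes": 0.6,
        "unibrow": 0.1,
    }
    print(merge_caption_and_tags("photo of a man with a beard", scores))
```

The substring check is a crude way to avoid repeating what the caption already says; a real pipeline would likely need something more careful.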

What is Next: This is version 1, and really an alpha version. I am still working on it and hope that version 2 will be mind-blowing. I am already training it and improving the dataset. Currently, this one is not perfect with some details. Eyes can get wonky, and so can teeth, more than intended at least. It will take some time to train this out, and I plan to do just that, as well as add a wider variety of image types of normal people.

Your Current LoRAs and embeddings: Yeah... Your LoRAs of beautiful people, trained on models that can only create beautiful people, are not going to work the same way here. You will likely get a picture of their backwoods cousin instead of the intended subject, which is fun to play with. Give it a shot.
