Last November, the artificial intelligence startup Stability AI launched an eye-catching new product: Stable Video. Built on top of its earlier Stable Diffusion text-to-image model, it goes beyond generating images from text and can turn existing still images into videos, something of a rarity in the market at the time.
Stability AI did not keep the technology to itself: it released the model's source code and published the required model weights on the popular HuggingFace platform, so that individuals and organizations with the necessary hardware and skills could download and run the model locally.
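For readers with the hardware to try it, the sketch below shows roughly what a local run of the open weights can look like. It assumes the Hugging Face diffusers library and the publicly released stabilityai/stable-video-diffusion-img2vid-xt checkpoint; the input and output file names are just placeholders.

```python
# Minimal sketch of running the open Stable Video Diffusion weights locally.
# Assumptions: diffusers, transformers, accelerate and torch are installed,
# and a GPU with enough VRAM is available; file names are placeholders.
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",
    torch_dtype=torch.float16,
    variant="fp16",
)
pipe.enable_model_cpu_offload()  # trades some speed for lower VRAM usage

# The open model is image-to-video: it animates a single still frame.
image = load_image("input.jpg").resize((1024, 576))

frames = pipe(
    image,
    decode_chunk_size=8,      # decode a few frames at a time to save memory
    motion_bucket_id=127,     # rough knob for how much motion to introduce
    noise_aug_strength=0.02,  # how far the output may drift from the input frame
).frames[0]

export_to_video(frames, "output.mp4", fps=7)
```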
Recently, Stable Video entered a public beta, letting users who lack powerful GPUs or the technical know-how try it for themselves, free of charge during the beta period. The move drew a fresh wave of attention, particularly because OpenAI's Sora, the other heavyweight in this space, is reportedly still months away from opening up beyond internal testing, so Stable Video's timing has turned plenty of heads.
How well does Stable Video actually perform? A useful reference point is the 60-second clip Sora showed at its debut, which impressed audiences with how realistic its content looked. Although Sora's team has not opened testing to the public, it has been sharing its latest generated short clips on social media, and one recently released video reportedly so amazed a Hollywood director that he put a studio project worth hundreds of millions of dollars on hold.
But OpenAI is not alone in this field: other companies, Stability AI among them, are pushing ahead aggressively. To rule out as many confounding factors as possible, we tested Stable Video with the same prompt used for the Sora video. The workflow is simple: users type in a prompt, pick one of the generated images, choose an effect, and shortly afterward receive the finished video.
Take the prompt describing a fashionable woman on the streets of Tokyo as an example. It covers the woman herself, the Tokyo streets, neon lights, and fashion items such as a leather jacket, a long red dress, and boots, along with other elements like sunglasses, lipstick, a walking motion, a reflective wet street, and other pedestrians moving about. Stable Video reproduced most of these elements, but four of them, the red long dress, black boots, black purse, and walking, did not come out well: the red dress and black purse were somewhat confused with each other in the video, and the black boots and the walking motion were never clearly shown. As for video quality, the character looks passable at first glance, a lens-like depth-of-field effect is present, and the background is blurred accordingly. Limited by the output resolution, though, the overall picture is quite blurry, especially toward the edges of the frame, and the matting artifacts around the character's hair are fairly noticeable.
Runway AI, another front-runner in AI video generation, was tested with the same set of prompt elements: a red long dress, black boots, a black purse, sunglasses, walking, and pedestrians moving. It fell well short of rendering all of them: the boot color did not match the prompt, and other key items such as the red long dress, black purse, and sunglasses never appeared in the video. The overall result lacked realism, leaned heavily toward a cartoon style, and was far from the expected "movie effect"; the blurring of the character and her unnatural head-turning motion further dragged down the viewing experience.
Comparing several other products, we found that Pika 1.0, a tool that has been popular since its release last November, also failed to recognize all of the prompts. When asked to work elements such as the red long dress, black purse, sunglasses, lipstick, and walking into the video, it again produced many discrepancies, including a dress of the wrong color and length and a purse of the wrong color, while the sunglasses, lipstick, and walking motion did not appear in the final video at all. Pika did handle the background somewhat better, hewing closer to the cyberpunk look, but the ghosting artifacts on the vehicles in the frame betrayed how immature the generation technology still is.
We also compared the speed and output quality of these video generators. On speed, Stable Video took noticeably longer, close to a minute per generation, while Runway Gen-2 and Pika were relatively quick. Sora, for its part, will likely demand an even longer wait: compared with how quickly OpenAI's Dall-E 3 produces a single image, Sora needs far more time and compute. And while its internal demo videos, which outsiders cannot verify, look very high quality, the footage shown publicly still contains obvious errors.
Testing Stable Video's comprehension of Chinese yielded an unexpected lesson: do not use Chinese prompts. When we entered prompts in Chinese, the resulting videos bore little resemblance to what we asked for; they were vaguely related to the keyword "girl," but the rest of the content was almost completely disconnected, and the portrait that flashed by at the end added an unnecessary touch of horror.
In the name of innovation and convenience, Stable Video also offers an image-to-video feature, pitched as useful across a wide range of fields including video production and web design. In practice, though, users who tried converting personal photos into videos ran into severe distortion of the faces. Even at the officially recommended image sizes, the hoped-for effects of flowing hair and fluttering curtains never materialized; what came out instead were frightening facial distortions.
Switching the "camera" setting to "track" mode did not remedy the situation, which suggests that output quality is determined by the model itself rather than by these effect settings. Based on this experience, we do not recommend using images that contain faces, a caution aimed especially at portrait photography enthusiasts, because the results are unlikely to satisfy.
So how does it fare with other kinds of images, such as animals? We tried a photo of a cute kitten, hoping its original charm would survive intact. Yet even with no special effect settings and the officially recommended size, the kitten's face still came out distorted, leaving us to exclaim: bring back our cute kitten!
As for non-human landscapes, perhaps these are our last glimmer of hope. After testing an image of flowers and vegetation, we found that although there was no distortion, the video overall had an unnatural feel, and the clarity was not ideal, with the whole picture appearing quite blurry.
Overall, even though Stable Video claims it can generate videos from images, we believe it’s not a good choice for handling pictures with humans or animals. While landscape images might be worth a try, caution is advised when considering whether to pay for the service. In other words, the Stable Video app is more suitable for situations where video quality is not a high priority.
Which brings us to cost-effectiveness. Attentive readers may have noticed the recurring mentions of aspect ratio and recommended resolution in the generation settings. The official documentation recommends resolutions of 1024×576, 576×1024, or 768×768, yet the interface offers no size guidance during the actual workflow. As a result, we only stumbled on the recommendation at the end of our testing and ran further tests to check whether different sizes change the outcome. The conclusion: they made no difference, but the extra runs did burn through more points.
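Because the interface never surfaces those sizes, it may be worth cropping images to one of the recommended resolutions before uploading. Below is a minimal sketch using Pillow, assuming the landscape 1024×576 option; the file names are hypothetical.

```python
# Pre-crop an image to one of the officially recommended resolutions
# (1024x576, 576x1024, or 768x768) before uploading. Sketch using Pillow;
# the input and output file names are placeholders.
from PIL import Image, ImageOps

TARGET = (1024, 576)  # use (576, 1024) for portrait or (768, 768) for square

img = Image.open("photo.jpg")
# Center-crop and resize so the target aspect ratio is met without stretching.
fitted = ImageOps.fit(img, TARGET, method=Image.LANCZOS)
fitted.save("photo_1024x576.jpg", quality=95)
```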
Stable Video grants each user an initial quota of 150 points, with each image-to-video generation consuming 10 points and each text-to-video generation 11, so every attempt eats into the balance.
Points for videos that users end up not using are refunded, and each user also automatically receives a small number of free points daily, though that allowance may change in the future. Once the points run out, users move to the paid tier: 10 US dollars (roughly 72 RMB) buys 50 video generations, and 50 US dollars (roughly 360 RMB) buys 300. Since each generated clip runs only a few seconds, assembling more than three minutes of footage effectively means spending that 10 dollars, or over 70 RMB. Fortunately, points spent on unsatisfactory videos can be refunded; otherwise the value proposition would collapse entirely.
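For a rough sense of what those numbers mean in practice, here is a small back-of-the-envelope calculation based purely on the figures quoted above, which may of course change:

```python
# Back-of-the-envelope math for Stable Video's point system, using only the
# figures quoted in this article (treat them as assumptions; pricing may change).
INITIAL_POINTS = 150
IMAGE_TO_VIDEO_COST = 10   # points per image-to-video generation
TEXT_TO_VIDEO_COST = 11    # points per text-to-video generation

print("Free image-to-video runs:", INITIAL_POINTS // IMAGE_TO_VIDEO_COST)  # 15
print("Free text-to-video runs:", INITIAL_POINTS // TEXT_TO_VIDEO_COST)    # 13

# Paid tiers: $10 for 50 videos, $50 for 300 videos.
for price_usd, videos in [(10, 50), (50, 300)]:
    print(f"${price_usd} tier: {videos} videos, "
          f"about ${price_usd / videos:.2f} per video")
```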
To be fair, the examples showcased on Stability AI's official website are very attractive; videos generated from ordinary, casually shot images, however, are still a long way from matching them. Stable Video currently seems to do better, at least in terms of efficiency, when generating videos from text. On the whole, this public beta reads like a trial run for commercialization, a way to gauge whether users are willing to pay for quality output, and judging by Stable Video's current performance, there is still plenty of room for improvement.