ep. 7: The Importance of Real-time Gen AI to Users, Six Types of Gen AI Conversation, New Model for Music Generation
Real-time generative AI was recently unlocked by a new machine learning technique called “Latent Consistency Model - Low-Rank Adaptation” (LCM-LoRA). Applications like KREA AI and fal use LCM-LoRA to let people transform simple sketches into high-fidelity images almost instantly. In other words, these tools reduce the latency between a user’s creative thought and seeing it in digital form, relative to existing solutions like MidJourney or Stable Diffusion. We’ll dive into the user benefits of real-time gen AI, and the tradeoffs when there are multiple competing priorities - such as cost versus speed for generating model output, a tension highlighted at OpenAI’s recent Developer Conference.
KREA AI’s Twitter post claiming “real-time is here”, showing how a user’s movement of the shapes in the left figure can result in near-instantaneous changes to the high-fidelity image on the right.
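To make the LCM-LoRA idea concrete, here is a minimal sketch using the open-source Hugging Face diffusers library: the adapter lets a standard Stable Diffusion model produce usable images in a handful of denoising steps instead of dozens, which is what makes near-interactive latency possible. The base model and LoRA weight IDs below are the publicly available examples, not necessarily what KREA AI or fal run in production, and the prompt and parameter values are illustrative assumptions.

```python
import torch
from diffusers import StableDiffusionPipeline, LCMScheduler

# Load a standard Stable Diffusion base model.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Swap in the LCM scheduler and attach the LCM-LoRA adapter weights.
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
pipe.load_lora_weights("latent-consistency/lcm-lora-sdv1-5")

# With LCM-LoRA, roughly 2-8 denoising steps (instead of 25-50) are usually
# enough, which is what enables near-real-time feedback loops.
image = pipe(
    prompt="a snow-capped mountain at sunrise, warm orange sun",
    num_inference_steps=4,
    guidance_scale=1.0,  # LCM-LoRA works best with little or no guidance
).images[0]
image.save("mountain.png")
```

On consumer GPUs this brings single-image generation down from several seconds to well under a second, which is the latency regime the real-time tools discussed below are chasing.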
Real-time tools enable user flow state
A key user benefit of these real-time tools is enhanced creativity. Clear and immediate feedback is a key condition for achieving flow state. Flow is defined as a “subjective state that people report when they are completely involved in something to the point of forgetting time, fatigue, and everything else but the activity itself” (Csikszentmihalyi et al., 2005).
This clear, immediate feedback is exactly what tools like KREA AI and fal provide - they’re quick and responsive, which supports a more fluid and iterative creative process. People can easily experiment with and tweak their output, further enhancing the overall creative experience. The closer real-time gen AI tools get to the average human visual reaction time of roughly 190 ms, the greater the creative benefit.
While companies building applications on LCM-LoRA are racing towards real-time AI, OpenAI announced at their recent Developer Conference that “we decided to prioritize price first, but we’re going to work on speed next”. That is, they’ve determined that for their use cases, current speed is “good enough”. There is still a large gap between the output latency of ChatGPT or DALL-E 3 and real-time performance. So, is the juice worth the squeeze when it comes to bringing real-time performance to users, even over cost? It depends.
UX tradeoffs and the importance of user goals
Each text-to-image tool has its strengths. DALL-E 3 excels at understanding and interpreting prompts. MidJourney excels at creating images in a distinctive style that look crafted by a human rather than a computer. Stable Diffusion excels at giving users control. However, all of these tools are slow relative to the new applications powered by LCM-LoRA. These tools - including the new real-time ones - also lack spatial and temporal consistency. For instance, when I used fal to move the sun from the left side of my AI-generated mountain to the right, the mountain also kept changing in appearance. I wanted it to stay constant and only move the sun’s position - a shortcoming known as a lack of spatial and temporal consistency.
Using fal, I wrote a text prompt and drew a few rudimentary lines to represent mountains, and an orange circle to represent the sun (left). I instantly saw the corresponding high-fidelity output (right). If I moved the sun, the position of the sunlight in the scene instantly updated in response to my movement. However, the shape of the mountain unexpectedly changed as well. This change is an example of lacking spatial and temporal consistency - a well-known challenge for text-to-video gen AI.
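For readers who want to reproduce something like this interaction without fal, here is a hedged, minimal sketch of the sketch-to-image loop using open components (diffusers with LCM-LoRA in image-to-image mode). This is not fal’s actual API; the file names, prompt, and parameter values are illustrative assumptions.

```python
import torch
from PIL import Image
from diffusers import AutoPipelineForImage2Image, LCMScheduler

# Image-to-image pipeline with the LCM-LoRA adapter for low-latency updates.
pipe = AutoPipelineForImage2Image.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
pipe.load_lora_weights("latent-consistency/lcm-lora-sdv1-5")

# "mountain_sketch.png" is a hypothetical rough drawing: a few lines for
# mountains and an orange circle for the sun.
sketch = Image.open("mountain_sketch.png").convert("RGB").resize((512, 512))

# Each time the user moves the sun in the sketch, this call is re-run.
# strength controls how far the output may drift from the sketch; even so,
# nothing here guarantees the mountain keeps its exact shape between runs,
# which is the spatial and temporal consistency problem described above.
result = pipe(
    prompt="photorealistic mountain landscape, sun low in the sky",
    image=sketch,
    strength=0.6,
    num_inference_steps=6,
    guidance_scale=1.0,
).images[0]
result.save("mountain_render.png")
```

Because each frame is generated independently from the latest sketch, small changes to the input can produce large changes elsewhere in the output - exactly the frustration described next.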
What this hands-on experience taught me is that, while it was rewarding to see these real-time changes, the lack of spatial and temporal consistency in the main part of my image (the mountain) was frustrating. This experience also leads us to consider user goals: my goal in testing the application was to see real-time changes in lighting from the sun, assuming the rest of the AI-generated image was a fixed backdrop. If my goal had been to generate the best single image, with sunlight on either the left or right side of the mountain, the output would have met that brief.
Therefore, given my user goal, spatial and temporal consistency mattered more to my continued use of the product than real-time output generation. Of course, waiting 5 or more seconds and still having the mountain change shape would be even worse, so real-time is important and helps move the holistic user experience in the right direction. It also shows how the relative importance of a gen AI system’s strengths and weaknesses varies with user goals.
Key Takeaways
Real-time gen AI tools help enhance users’ creativity by reducing the time between a thought and seeing that thought actualized on screen. This has the potential to accelerate workflows while deepening creative exploration.
However, real-time also means greater computational and monetary cost. Therefore, we need to consider how important real-time is to the user experience, relative to other capabilities (e.g., text understanding, spatial and temporal consistency).
To do so, we need to identify users’ goals, such as where in the workflow they are using a given gen AI tool. For example, spatial and temporal consistency matters less if I’m trying to quickly express one “good enough” visual concept, rather than create iterations of a fixed visual concept.
Want to better understand your customer’s goals to consider feature tradeoffs for your gen AI product? Sendfull can help - reach out at hello@sendfull.com
Human Computer Interaction News
The 6 Types of Conversations with Generative AI: This Nielsen Norman article details how users engage in six types of conversations with gen AI bots, depending on their skill levels and their information needs. Interfaces for AI bots should support and accommodate this diversity of conversation styles.
AI is about to completely change how you use computers: Here, Bill Gates covers the use cases for AI: a personal assistant for everyone, healthcare, education, productivity, entertainment and shopping. For AI as a personal assistant, he stresses, “Clippy was a bot, not an agent”. That is, Clippy was limited to simple tasks and lacked personalization or the ability to have a nuanced conversation with the user. In contrast, personal assistants of the future won’t have us using different apps for different tasks. We’ll just tell our device, in everyday language, what we want to do. Depending on how much information you choose to share with it, the software will be able to respond personally because it will have a rich understanding of your life.
Google DeepMind announces Lyria, their most advanced music generation model: Lyria can generate high-quality vocals, lyrics, and musical backing tracks that mimic the style of popular artists. This model moves gen AI for music creation forward, democratizing the ability for people to express themselves via music. Of course, this technology can be used unethically. DeepMind appears to use SynthID to watermark and identify synthetically generated content, allowing traceability of any content created by Lyria.
Galileo hallucination index identifies GPT-4 as best-performing LLM for different use cases: A new hallucination index shows that OpenAI’s GPT-4 model works best and hallucinates the least when challenged with multiple tasks. Of course, diagnosing hallucinations is only half the battle. How do you make it easy for a user to immediately understand how good your model is? As we discussed in episode 1, when you tell a customer that an “AI hallucinates 5% of the time”, it will likely be interpreted as playing Russian roulette with the product. Aible AI has explored the solution of helping users know what output is accurate versus a hallucination, through double-checking output and highlighting verified text.
Tangram Vision’s AI-powered 3D sensor could transform computer vision in robotics: Tangram Vision, a startup building software and hardware for robotic perception, announced a new 3D depth sensor, HiFi. By including powerful off-the-shelf computer vision capabilities, the sensor has the potential to help robots better perform “in the wild” (i.e., move from operating in closed-world conditions to open-world conditions) by providing them with a more robust perceptual system.
That’s a wrap 🌯 . More human-computer interaction news from Sendfull next week.