✔️ 50K image captions generated by GPT-4o
✔️ 2K video captions generated by GPT-4o
[TBD] Voice captions generated by GPT-4o
[TBD] More image/video captions and QA pairs generated by GPT-4o
In large multimodal models, efficient modality alignment is a critical challenge, often hindered by the scarcity of high-quality image-text, video-text, and audio-text data. To address this issue, we introduce ShareGPT-4o, a large-scale dataset that we plan to open-source, comprising 200K meticulously annotated images, 10K videos with highly descriptive captions, and 10K audio files with detailed descriptions. The dataset covers extensive world knowledge, detailed object properties, spatial relationships, and aesthetic evaluations, setting a new standard in diversity and informational richness. All annotations are generated with GPT-4o's advanced multimodal capabilities and carefully curated for maximum utility. By releasing this dataset, we aim to provide a pivotal resource for the LMM community, facilitating more effective modality alignment and improving the overall performance of multimodal models.
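For illustration, below is a minimal sketch of how image-caption records from such a release might be consumed. The file name (`sharegpt4o_image_captions.jsonl`) and the `image_path`/`caption` fields are assumptions for this example; the actual released format may differ.

```python
import json
from pathlib import Path

# Hypothetical file name and record schema, for illustration only;
# the released ShareGPT-4o data may use a different layout.
ANNOTATION_FILE = Path("sharegpt4o_image_captions.jsonl")


def load_caption_pairs(path: Path):
    """Yield (image_path, caption) pairs from a JSONL annotation file."""
    with path.open(encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            yield record["image_path"], record["caption"]


if __name__ == "__main__":
    # Print the first few image-caption pairs as a sanity check.
    for i, (image_path, caption) in enumerate(load_caption_pairs(ANNOTATION_FILE)):
        print(f"{image_path}: {caption[:80]}...")
        if i >= 2:
            break
```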