Can Machines Imagine? Critical Thinking and Cultural Reasoning in Multimodal-Multilingual AI

Mohammad Awad AlAfnan, Siti Fatimah MohdZuki, Shefa Mohammad AlAfnan

Abstract


Effective communication across diverse languages and cultures is vital in a globally interconnected world. Multimodal-multilingual language models (MMMLMs) offer a transformative approach by combining text, speech, and visual understanding across multiple languages. This study evaluates four state-of-the-art MMMLMs (GIT, mPLUG, CLIP, and Whisper + GPT-4V) on a series of cross-lingual and cross-modal tasks, including image captioning, visual question answering (VQA), speech-to-image generation, and idiomatic translation. The models were tested in high-resource (English, Arabic), medium-resource (Malay), and low-resource (Macedonian) language contexts. Findings show that while these models excel in structured tasks and high-resource languages, they struggle with cultural nuance, semantic alignment, and figurative language in low-resource settings. Common issues include literal translations of idioms, gender and cultural bias, and Western-centric visual outputs. GIT outperformed the other models in most tasks, while Whisper + GPT-4V demonstrated fluency in narrative generation but lacked cultural grounding. The study highlights the pressing need for inclusive training data, culturally informed evaluation protocols, and ethically guided development practices. It also calls for interdisciplinary collaboration involving linguists, ethicists, and local communities to build AI systems that reflect global diversity. Ultimately, MMMLMs hold immense potential for advancing cross-cultural communication, but realizing this potential requires addressing the deep-rooted challenges of representation, equity, and context in multilingual and multimodal AI systems.

Keywords


Multimodal Language Models; Multilingual AI; Cross-Cultural Communication; Low-Resource Languages; Ethical AI






DOI: http://doi.org/10.11591/ijict.v15i2.pp823-838



Copyright (c) 2026 Mohammad Awad AlAfnan, Siti Fatimah MohdZuki, Shefa Mohammad AlAfnan

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

The International Journal of Informatics and Communication Technology (IJ-ICT)
p-ISSN 2252-8776, e-ISSN 2722-2616
This journal is published by the Intelektual Pustaka Media Utama (IPMU).
